Facilitates extraction of contents from an HTML file
Source: R/cas_extract_html.R
cas_extract_html.Rd
Usage
cas_extract_html(
  html_document,
  container = NULL,
  container_class = NULL,
  container_id = NULL,
  container_name = NULL,
  container_property = NULL,
  container_itemprop = NULL,
  container_instance = NULL,
  attribute = NULL,
  sub_element = NULL,
  no_children = NULL,
  trim = TRUE,
  squish = FALSE,
  no_match = "",
  exclude_css_path = NULL,
  exclude_xpath = NULL,
  custom_xpath = NULL,
  custom_css_path = NULL,
  keep_everything = FALSE,
  extract_text = TRUE,
  as_character = TRUE
)
Arguments
- html_document
An HTML document parsed with xml2::read_html() or rvest::read_html().
- container
Defaults to NULL. Type of HTML container from which content is to be extracted, such as "div", "ul", and others. Either container_class or container_id must also be provided.
- container_class
Defaults to NULL. If provided, container must also be given (and container_id must be NULL). Only text found inside the provided combination of container/class will be extracted.
- container_id
Defaults to NULL. If provided, container must also be given (and container_class must be NULL). Only text found inside the provided combination of container/id will be extracted.
- container_itemprop
Defaults to NULL. If provided, container must also be given (and container_id and container_class must be NULL or will be silently ignored). Only text found inside the provided combination of container/itemprop will be extracted.
- container_instance
Defaults to NULL. If given, it must be an integer. If a given combination is found more than once in the same page, the relevant occurrence is kept. Use with caution, as not all pages always include the same number of elements of the same class/with the same id.
- attribute
Defaults to NULL. If given, type of attribute to extract. Typically used in combination with container, as in cas_extract_html(container = "time", attribute = "datetime").
- sub_element
Defaults to NULL. If provided, container must also be given. Only text within elements of the given type under the chosen combination of container/class will be extracted. When given, it will typically be "p", to extract all p elements inside the selected div.
- no_children
Defaults to FALSE, i.e. by default all subelements of the selected combination (e.g. div with given class) are extracted. If TRUE, only text found under the given combination (but not its subelements) will be extracted. Corresponds to the XPath string /node()[not(self::div)].
- trim
Defaults to TRUE. If TRUE, applies stringr::str_trim() to output, removing whitespace from start and end of string.
- squish
Defaults to FALSE. If TRUE, applies stringr::str_squish() to output, removing whitespace from start and end of string, and replacing any run of internal whitespace (including new lines) with a single space.
- no_match
Defaults to "". A common alternative would be NA. Value to return when the given container, selector or element is not found.
- exclude_css_path
Defaults to NULL. To remove script, for example, use script, which is transformed to :not(script). May cause issues, use with caution.
- exclude_xpath
Defaults to NULL. A common pattern when extracting text would be //script|//iframe|//img|//style, as it is assumed that these containers (javascript contents, iframes, images, and css blocks) are most likely undesirable when extracting text. Customise as needed. For example, if besides the above you also want to remove a div of class related-articles, you may use //script|//iframe|//img|//div[@class='related-articles']. Be careful when using exclude_xpath, as the nodes matching the XPath are removed from the original object passed to cas_extract_html(). To be clear, the input object is changed: for example, if exclude_xpath is used once in one of the extractors, these containers won't be available to other extractors.
- custom_xpath
Defaults to NULL. If given, all other parameters are ignored and the given XPath is used instead.
- custom_css_path
Defaults to NULL. If given, all other parameters are ignored and the given CSS path is used instead.
- keep_everything
Defaults to FALSE. If TRUE, all text included in the page is returned as a single string.
- extract_text
Defaults to TRUE. If TRUE, text is extracted.
- as_character
Defaults to TRUE. If FALSE, and if extract_text is set to FALSE, then an xml_nodeset object is returned.
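The warning under exclude_xpath follows from how xml2 documents behave: a parsed document is a reference to an underlying object, so removing nodes changes it for every later use. The sketch below shows this directly with xml2 on a small inline page. The HTML snippet is made up for illustration, and the use of xml2::xml_remove() is an assumption about the general mechanism, not a description of how cas_extract_html() is implemented.

```r
library(xml2)

# Hypothetical page, inlined so the example runs offline
html_document <- read_html(
  "<html><body><div class='article'><p>Main text.</p><script>var x = 1;</script></div></body></html>"
)

# One script node is present before anything is excluded
n_before <- length(xml_find_all(html_document, "//script"))

# Remove the nodes matched by the pattern suggested for exclude_xpath
unwanted <- xml_find_all(html_document, "//script|//iframe|//img|//style")
xml_remove(unwanted)

# The same object no longer contains the script node:
# no copy was made, so every later query sees the reduced document
n_after <- length(xml_find_all(html_document, "//script"))

# Text of the container, now without the javascript contents
article_node <- xml_find_first(html_document, "//div[@class='article']")
article_text <- trimws(xml_text(article_node))
```

If later extractors need the removed nodes, one option is to parse the page again before calling them, so that each extractor starts from an unmodified document.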
Examples
if (FALSE) { # \dontrun{
if (interactive()) {
  url <- "https://example.com"
  html_document <- rvest::read_html(x = url)

  # example for a tag that looks like:
  # <meta name="twitter:title" content="Example title" />
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "twitter:title",
    attribute = "content"
  )

  # example for a tag that looks like:
  # <meta name="keywords" content="various;keywords;">
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "keywords",
    attribute = "content"
  )

  # example for a tag that looks like:
  # <meta property="article:published_time" content="2016-10-29T13:09+03:00"/>
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_property = "article:published_time",
    attribute = "content"
  )
}
} # }
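The calls above need a live page. To see what kind of result to expect without network access, the sketch below runs roughly equivalent queries with xml2 on an inline document. The XPath expressions and the extract_attribute() helper are hypothetical: they are an assumption about what the container, container_name/container_property, attribute, and no_match arguments correspond to, and the selectors actually built by cas_extract_html() may differ.

```r
library(xml2)

# Hypothetical page head with the three meta tags discussed in the examples
html_document <- read_html(
  '<html><head>
     <meta name="twitter:title" content="Example title" />
     <meta name="keywords" content="various;keywords;">
     <meta property="article:published_time" content="2016-10-29T13:09+03:00"/>
   </head><body></body></html>'
)

# Hypothetical helper: take the first node matching the XPath and read one
# of its attributes; return no_match when the node or attribute is absent
extract_attribute <- function(doc, xpath, attribute, no_match = "") {
  value <- xml_attr(xml_find_first(doc, xpath), attribute)
  if (is.na(value)) no_match else trimws(value)
}

title <- extract_attribute(
  html_document, "//meta[@name='twitter:title']", "content"
)
keywords <- extract_attribute(
  html_document, "//meta[@name='keywords']", "content"
)
published <- extract_attribute(
  html_document, "//meta[@property='article:published_time']", "content"
)

# A tag that is not in the page falls back to the no_match value
absent <- extract_attribute(
  html_document, "//meta[@name='description']", "content",
  no_match = NA_character_
)
```

Returning NA rather than "" for missing tags, as in the last call, makes it easier to tell an absent tag from one with an empty attribute.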