Facilitates extraction of contents from an html file
Source: R/cas_extract_html.R
Usage
cas_extract_html(
  html_document,
  container = NULL,
  container_class = NULL,
  container_id = NULL,
  container_name = NULL,
  container_property = NULL,
  container_itemprop = NULL,
  container_instance = NULL,
  collapse = "\n",
  attribute = NULL,
  sub_element = NULL,
  no_children = NULL,
  trim = TRUE,
  squish = FALSE,
  no_match = "",
  exclude_css_path = NULL,
  exclude_xpath = NULL,
  custom_xpath = NULL,
  custom_css_path = NULL,
  keep_everything = FALSE,
  extract_text = TRUE,
  as_character = TRUE
)
Arguments
- html_document
An HTML document parsed with xml2::read_html() or rvest::read_html().
- container
Defaults to NULL. Type of HTML container from which content is to be extracted, such as "div", "ul", and others. Either container_class or container_id must also be provided.
- container_class
Defaults to NULL. If provided, container must also be given (and container_id must be NULL). Only text found inside the provided combination of container/class will be extracted.
- container_id
Defaults to NULL. If provided, container must also be given (and container_class must be NULL). Only text found inside the provided combination of container/id will be extracted.
- container_itemprop
Defaults to NULL. If provided, container must also be given (and container_id and container_class must be NULL, or they will be silently ignored). Only text found inside the provided combination of container/itemprop will be extracted.
- container_instance
Defaults to NULL. If given, it must be an integer. If a given combination is found more than once in the same page, only the occurrence at that position is kept. Use with caution, as not all pages include the same number of elements of the same class or with the same id.
- collapse
Defaults to "\n". If given, and more than one instance of the given selector is found, they are collapsed into a single string, using this string as separator.
- attribute
Defaults to NULL. If given, type of attribute to extract. Typically used in combination with container, as in cas_extract_html(container = "time", attribute = "datetime").
- sub_element
Defaults to NULL. If provided, container must also be given. Only text within elements of the given type under the chosen combination of container/class will be extracted. When given, it will typically be "p", to extract all p elements inside the selected div.
- no_children
Defaults to FALSE, i.e. by default all subelements of the selected combination (e.g. div with given class) are extracted. If TRUE, only text found directly under the given combination (but not in its subelements) will be extracted. Corresponds to the XPath string /node()[not(self::div)].
- trim
Defaults to TRUE. If TRUE, applies stringr::str_trim() to output, removing whitespace from start and end of string.
- squish
Defaults to FALSE. If TRUE, applies stringr::str_squish() to output, removing whitespace from start and end of string, and replacing any run of internal whitespace (including new lines) with a single space.
- no_match
Defaults to "". A common alternative would be NA. Value to return when the given container, selector or element is not found.
- exclude_css_path
Defaults to NULL. For example, to remove script elements, use script, which is transformed to :not(script). May cause issues, use with caution.
- exclude_xpath
Defaults to NULL. A common pattern when extracting text would be //script|//iframe|//img|//style, as it is assumed that these containers (javascript contents, iframes, images, and css blocks) are most likely undesirable when extracting text. Customise as needed. For example, if besides the above you also want to remove a div of class related-articles, you may use //script|//iframe|//img|//div[@class='related-articles']. Be careful when using exclude_xpath, as the nodes matching the XPath are removed from the original object passed to cas_extract_html(). To be clear, the input object is changed: for example, if exclude_xpath is used once in one of the extractors, these containers won't be available to other extractors.
- custom_xpath
Defaults to NULL. If given, all other parameters are ignored and the given XPath is used instead.
- custom_css_path
Defaults to NULL. If given, all other parameters are ignored and the given CSS path is used instead.
- keep_everything
Defaults to FALSE. If TRUE, all text included in the page is returned as a single string.
- extract_text
Defaults to TRUE. If TRUE, text is extracted.
- as_character
Defaults to TRUE. If FALSE, and if extract_text is also set to FALSE, an xml_nodeset object is returned.
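The warning under exclude_xpath follows from how xml2 documents behave: they are references to an external object, so removing nodes changes the document everywhere it is used. The sketch below illustrates that behaviour with plain xml2 on a small inline document. It is an illustration under that assumption, not the internal code of cas_extract_html(); the variable names and the re-parsing workaround are hypothetical choices made for this example.

```r
library(xml2)

# Inline HTML, so the sketch runs without network access (hypothetical content).
html_text <- "<html><body><div class='article'><p>Body text.</p><script>var x = 1;</script></div></body></html>"

html_document <- read_html(html_text)

# One way to keep an untouched version: re-parse a serialised copy
# before any nodes are removed.
untouched_copy <- read_html(as.character(html_document))

scripts_before <- length(xml_find_all(html_document, "//script"))

# xml_remove() modifies html_document in place; no copy is made.
xml_remove(xml_find_all(html_document, "//script|//iframe|//img|//style"))

scripts_after <- length(xml_find_all(html_document, "//script"))
scripts_in_copy <- length(xml_find_all(untouched_copy, "//script"))
```

After this runs, the script node is gone from html_document but still present in untouched_copy, which is why a document passed through an extractor with exclude_xpath may look different to later extractors.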
Examples
if (FALSE) { # \dontrun{
if (interactive()) {
  url <- "https://example.com"
  html_document <- rvest::read_html(x = url)

  # example for a tag that looks like:
  # <meta name="twitter:title" content="Example title" />
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "twitter:title",
    attribute = "content"
  )

  # example for a tag that looks like:
  # <meta name="keywords" content="various;keywords;">
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "keywords",
    attribute = "content"
  )

  # example for a tag that looks like:
  # <meta property="article:published_time" content="2016-10-29T13:09+03:00"/>
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_property = "article:published_time",
    attribute = "content"
  )
}
} # }
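The examples above need a live page. As an offline sketch of what the container, container_name or container_property, and attribute combination selects, the same values can be read with plain rvest from an inline document. The inline HTML and variable names are made up for illustration, and the XPath expressions are an assumption about the selection logic, not necessarily how cas_extract_html() builds its queries.

```r
library(rvest)

# Hypothetical inline page containing the meta tags used in the examples.
html_document <- read_html(
  '<html><head>
     <meta name="twitter:title" content="Example title" />
     <meta property="article:published_time" content="2016-10-29T13:09+03:00"/>
   </head><body></body></html>'
)

# Roughly: container = "meta", container_name = "twitter:title",
# attribute = "content"
title <- html_attr(
  html_element(html_document, xpath = "//meta[@name='twitter:title']"),
  "content"
)

# Roughly: container = "meta", container_property = "article:published_time",
# attribute = "content"
published <- html_attr(
  html_element(html_document, xpath = "//meta[@property='article:published_time']"),
  "content"
)
```

Testing selectors against a small inline document like this can help confirm that a container/attribute combination matches the intended tag before running it over a full set of pages.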