Facilitates extraction of contents from an html file

Usage

cas_extract_html(
  html_document,
  container = NULL,
  container_class = NULL,
  container_id = NULL,
  container_name = NULL,
  container_property = NULL,
  container_itemprop = NULL,
  container_instance = NULL,
  attribute = NULL,
  sub_element = NULL,
  no_children = NULL,
  trim = TRUE,
  squish = FALSE,
  no_match = "",
  exclude_css_path = NULL,
  exclude_xpath = NULL,
  custom_xpath = NULL,
  custom_css_path = NULL,
  keep_everything = FALSE,
  extract_text = TRUE,
  as_character = TRUE
)

Arguments

html_document: An html document parsed with xml2::read_html() or rvest::read_html().
container: Defaults to NULL. Type of html container from where links are to be extracted, such as "div", "ul", and others. Either container_class or container_id must also be provided.
container_class: Defaults to NULL. If provided, also container must be given (and container_id must be NULL). Only text found inside the provided combination of container/class will be extracted.
container_id: Defaults to NULL. If provided, also container must be given (and container_id must be NULL). Only text found inside the provided combination of container/class will be extracted.
container_itemprop: Defaults to NULL. If provided, also container must be given (and container_id and container_class must be NULL or will be silently ignored). Only text found inside the provided combination of container/itemprop will be extracted.
container_instance: Defaults to NULL. If given, it must be an integer. If a given combination is found more than once in the same page, the relevant occurrence is kept. Use with caution, as not all pages always include the same number of elements of the same class/with the same id.
attribute: Defaults to NULL. If given, type of attribute to extract. Typically used in combination with container, as in cas_extract_html(container = "time", attribute = "datetime").
sub_element: Defaults to NULL. If provided, also container must be given. Only text within elements of given type under the chosen combination of container/containerClass will be extracted. When given, it will tipically be "p", to extract all p elements inside the selected div.
no_children: Defaults to FALSE, i.e. by default all subelements of the selected combination (e.g. div with given class) are extracted. If TRUE, only text found under the given combination (but not its subelements) will be extracted. Corresponds to the xpath string /node()[not(self::div)].
trim: Defaults to TRUE. If TRUE, applies stringr::str_trim() to output, removing whitespace from start and end of string.
squish: Defaults to FALSE. If TRUE, applies stringr::str_squish() to output, removing whitespace from start and end of string, and replacing any whitespace (including new lines) with a single space.
no_match: Defaults to "". A common alternative would be NA. Value to return when the given container, selector or element is not found.
exclude_css_path: Defaults to NULL. To remove script, for example, use script, which is transformed to :not(script). May cause issues, use with caution.
exclude_xpath: Defaults to NULL. A common pattern when extracting text would be //script|//iframe|//img|//style, as it is assumed that these containers (javascript contents, iframes, css blocks, and images) are most likely undesirable when extracting text. Customise as needed. For example, if besides the above you also want to remove a div of class related-articles, you may use //script|//iframe|//img|//div[@class='related-articles']Be careful when using exclude_xpath as the relevant Xpath is removed from the original objext passed to cas_extract_html(). To be clear, the input object is changed, and, for example, if used once in one of the extractors these containers won't be available to other extractors.
custom_xpath: Defaults to NULL. If given, all other parameters are ignored and given Xpath used instead.
custom_css_path: Defaults to NULL. If given, all other parameters are ignored and given CSSpath used instead.
keep_everything: Defaults to FALSE. If TRUE, all text included in the page is returned as a single string.
extract_text: Defaults to TRUE. If TRUE, text is extracted.
as_character: Defaults to TRUE. If FALSE, and if extract_text is set to FALSE, then an xml_nodeset object is returned.

Value

A character vector of length one.

Examples

if (FALSE) { # \dontrun{
if (interactive()) {
  url <- "https://example.com"
  html_document <- rvest::read_html(x = url)

  # example for a tag that looks like:
  # <meta name="twitter:title" content="Example title" />

  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "twitter:title",
    attribute = "content"
  )


  # example for a tag that looks like:
  # <meta name="keywords" content="various;keywords;">
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_name = "keywords",
    attribute = "content"
  )

  # example for a tag that looks like:
  # <meta property="article:published_time" content="2016-10-29T13:09+03:00"/>
  cas_extract_html(
    html_document = html_document,
    container = "meta",
    container_property = "article:published_time",
    attribute = "content"
  )
}
} # }