Facilitates extraction of contents from an html file

Usage

cas_extract_html_custom(
  html_document,
  container,
  container_type,
  container_match,
  attribute = NULL,
  sub_element = NULL
)

Arguments

html_document

An html document parsed with xml2::read_html() or rvest::read_html().

container

Type of html container from which contents are to be extracted, such as "div", "ul", and others. Used in combination with container_type and container_match.

container_type

Type of attribute used to identify the container, such as "class", "id", or "rel". Nodes are filtered by matching this attribute against container_match.

container_match

String to be used for filtering nodes in combination with container_type.

attribute

Defaults to NULL. If given, type of attribute to extract. Typically used in combination with container, as in cas_extract_html(container = "time", attribute = "datetime").

sub_element

Defaults to NULL. If provided, container must also be given. Only text within elements of the given type under the chosen combination of container, container_type, and container_match will be extracted. When given, it will typically be "p", to extract all p elements inside the selected div.

Value

A character vector.

Examples

if (FALSE) { # \dontrun{
## extract a canonical link
cas_extract_html_custom(
  html_document = x,
  container = "link",
  container_type = "rel",
  container_match = "canonical",
  attribute = "href"
)
} # }
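
The example above assumes an already parsed document x. As a rough, self-contained illustration of the kind of selection these arguments describe, the sketch below builds a small hypothetical page and performs comparable lookups with rvest directly. The page contents, the "article-body" class, and the mapping from arguments to selectors are assumptions made for illustration. This is not necessarily how cas_extract_html_custom() is implemented internally.

```r
library(xml2)
library(rvest)

# Hypothetical page, used only for illustration
x <- xml2::read_html(
  '<html><head>
  <link rel="canonical" href="https://example.com/article">
  </head><body>
  <div class="article-body"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </body></html>'
)

# Comparable to: container = "link", container_type = "rel",
# container_match = "canonical", attribute = "href"
canonical_nodes <- rvest::html_elements(x, css = "link[rel='canonical']")
canonical <- rvest::html_attr(canonical_nodes, "href")

# Comparable to: container = "div", container_type = "class",
# container_match = "article-body", sub_element = "p"
paragraph_nodes <- rvest::html_elements(
  x,
  xpath = "//div[@class='article-body']//p"
)
paragraphs <- rvest::html_text(paragraph_nodes)
```

Both results are character vectors, which matches the return type documented under Value.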