Extract direct links to individual content pages from index pages

Usage

cas_extract_links(
  id = NULL,
  batch = "latest",
  domain = NULL,
  index = TRUE,
  index_group = NULL,
  output_index = FALSE,
  output_index_group = NULL,
  include_when = NULL,
  exclude_when = NULL,
  container = NULL,
  container_class = NULL,
  container_id = NULL,
  custom_xpath = NULL,
  custom_css = NULL,
  match = NULL,
  min_length = NULL,
  max_length = NULL,
  attribute_type = "href",
  append_string = NULL,
  remove_string = NULL,
  write_to_db = FALSE,
  file_format = "html",
  keep_only_within_domain = TRUE,
  sample = FALSE,
  check_previous = TRUE,
  check_again = FALSE,
  encoding = "UTF-8",
  reverse_order = FALSE,
  db_connection = NULL,
  disconnect_db = TRUE,
  ...
)

Arguments

id: Defaults to NULL. If provided, it should be a vector of integers. Only HTML files corresponding to the given ids will be processed.
domain: Defaults to NULL. Web domain of the website. It is added at the beginning of each link found. If links in the page already include the full web address, this should be left unset.
output_index: Defaults to FALSE. If FALSE, new links are added to the contents table. If TRUE, the links extracted will be stored again as index, using output_index_group as index_group.
output_index_group: Defaults to NULL. Relevant only when output_index is set to TRUE. Used to store new index urls in the database with reference to the appropriate group.
include_when: Part of the URL found only in links to the individual articles to be downloaded. If more than one string is provided, all links that contain any of the given strings are included.
exclude_when: If a URL includes this string, it is excluded from the output. One or more strings may be provided.
container: Defaults to NULL. Type of html container from where links are to be extracted, such as "div", "ul", and others. Either container_class or container_id must also be provided.
container_class: Defaults to NULL. If provided, container must also be given (and container_id must be NULL). Only links found inside the provided combination of container/class will be extracted.
container_id: Defaults to NULL. If provided, container must also be given (and container_class must be NULL). Only links found inside the provided combination of container/id will be extracted.
custom_xpath: Defaults to NULL. If given, all other parameters are ignored and the given XPath is used instead.
match: Defaults to NULL. Used when extracting links from JSON files. Name of the property from which the URL is to be extracted. N.B. Only partly implemented; please report issues along with a specific example where they emerged.
min_length: If a link is shorter than the number of characters given in min_length, it is excluded from the output.
max_length: If a link is longer than the number of characters given in max_length, it is excluded from the output.
attribute_type: Defaults to "href". Type of attribute to extract from links.
append_string: If provided, appends the given string to the extracted links. Typically used to create links for print or mobile versions of the extracted page.
remove_string: If provided, removes the given string (or strings) from links.
write_to_db: Logical, defaults to FALSE. If TRUE, stores newly extracted links in the database, associates each of them with an id, and records the source for each link.
keep_only_within_domain: Logical, defaults to TRUE. If TRUE, and domain is given, links to external websites are dropped.
check_previous: Defaults to TRUE. If TRUE, checks whether newly found links are already stored in the database and, if they are, discards them. If FALSE, and write_to_db is also set to FALSE, it does not check for previously stored links.
check_again: Defaults to FALSE. If FALSE, files from which at least one link has been extracted are not re-processed. If TRUE, they are processed again. By default, only new links are then actually included in the output or stored in the local database.
reverse_order: Logical, defaults to FALSE. If TRUE, index files are processed in reverse order of id and batch, which may give a more meaningful order to content ids. The difference is ultimately cosmetic, and has no substantive impact either way.
db_connection: Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example).
disconnect_db: Defaults to TRUE. If FALSE, leaves the connection to database open.
...: Passed to cas_get_db_file().
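
The filtering arguments described above (include_when, exclude_when, min_length, max_length, remove_string, append_string) can be illustrated with a minimal base R sketch. The helper filter_links_sketch() is hypothetical and is not part of castarter; it assumes literal (fixed) string matching and one particular order of operations, both of which may differ from what cas_extract_links() actually does.

```r
# Hypothetical helper, NOT castarter code: mimics the documented
# filtering arguments on a plain character vector of links.
# Assumptions: fixed (non-regex) matching; filters applied in this order.
filter_links_sketch <- function(links,
                                include_when = NULL,
                                exclude_when = NULL,
                                min_length = NULL,
                                max_length = NULL,
                                remove_string = NULL,
                                append_string = NULL) {
  # Keep links that contain at least one of the include_when strings.
  if (!is.null(include_when)) {
    keep <- Reduce(`|`, lapply(include_when, grepl, x = links, fixed = TRUE))
    links <- links[keep]
  }
  # Drop links that contain any of the exclude_when strings.
  if (!is.null(exclude_when)) {
    drop <- Reduce(`|`, lapply(exclude_when, grepl, x = links, fixed = TRUE))
    links <- links[!drop]
  }
  # Drop links shorter than min_length or longer than max_length.
  if (!is.null(min_length)) links <- links[nchar(links) >= min_length]
  if (!is.null(max_length)) links <- links[nchar(links) <= max_length]
  # Remove each given string, then append the suffix (if any links remain).
  for (s in remove_string) links <- gsub(s, "", links, fixed = TRUE)
  if (!is.null(append_string) && length(links) > 0) {
    links <- paste0(links, append_string)
  }
  links
}

# Example links are made up for illustration.
example_links <- c(
  "http://www.example.com/news/2024/first-article",
  "http://www.example.com/news/tag/politics",
  "http://www.example.com/about",
  "http://www.example.com/news/2024/second-article?utm=rss"
)

filtered <- filter_links_sketch(
  example_links,
  include_when = "/news/",
  exclude_when = "/tag/",
  remove_string = "?utm=rss",
  append_string = "/print"
)
```

In this sketch, the "about" page is dropped because it does not contain "/news/", the tag page is dropped by exclude_when, and the remaining two links have the tracking suffix removed before "/print" is appended.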

Value

A data frame.

Examples

if (FALSE) { # \dontrun{
links <- cas_extract_links(domain = "http://www.example.com/")
} # }
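
A further example, restricting extraction to links found inside a given container and filtering them by URL pattern. All argument values here (the domain, the "article-list" class, and the URL fragments) are hypothetical and would need to be adapted to the website at hand; only the argument names come from the Usage section above. The call itself is not run, as it requires castarter and previously downloaded index pages.

```r
# Hypothetical argument values; only the argument names are taken
# from the Usage section of this page.
extract_args <- list(
  domain = "http://www.example.com/",
  container = "div",                      # type of HTML container
  container_class = "article-list",       # container_id is left NULL
  include_when = "/news/",
  exclude_when = c("/tag/", "/author/"),
  write_to_db = FALSE                     # return links without storing them
)

if (FALSE) { # not run: requires castarter and downloaded index pages
  links <- do.call(castarter::cas_extract_links, extract_args)
}
```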