Update corpus — cas_update • castarter

Currently, updating is supported only when re-downloading index URLs is expected to surface new articles. The function starts from the first URL of each index group and keeps downloading further index pages as long as each page contains new links. When a page contains no new link, it stops downloading and moves on to the next index group.
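
The stopping rule described above can be sketched in base R. This is only an illustration of the logic, not the package's implementation: `index_pages`, `get_links()`, `known_links` and `update_index_group()` are hypothetical names, and the links are hard-coded so that nothing is downloaded.

```r
# Hypothetical stand-in for the links found on each index page of one group.
# A real run would download and parse each page instead.
index_pages <- list(
  "https://example.com/news/page/1" = c("/a/105", "/a/104"),
  "https://example.com/news/page/2" = c("/a/103", "/a/102"),
  "https://example.com/news/page/3" = c("/a/101", "/a/100")
)
get_links <- function(index_url) index_pages[[index_url]]

# Links assumed to be already stored from earlier runs.
known_links <- c("/a/103", "/a/102", "/a/101", "/a/100")

# Walk the index pages of one group in order, collecting links not seen
# before, and stop at the first page that brings no new link.
update_index_group <- function(index_urls, known_links, get_links) {
  new_links <- character(0)
  for (index_url in index_urls) {
    found <- get_links(index_url)
    fresh <- setdiff(found, c(known_links, new_links))
    if (length(fresh) == 0) {
      break # no new link on this page: stop and move to the next group
    }
    new_links <- c(new_links, fresh)
  }
  new_links
}

new_links <- update_index_group(names(index_pages), known_links, get_links)
print(new_links)
```

In this sketch the second page contains only known links, so the third page is never requested.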

Usage

cas_update(
  extract_links_partial,
  extractors,
  post_processing = NULL,
  wait = 3,
  user_agent = NULL,
  ...
)

Arguments

extract_links_partial: A partial function, typically created with purrr::partial(.f = cas_extract_links), followed by the parameters originally used by cas_extract_links(). See examples.
extractors: A named list of functions. See examples for details.
post_processing: Defaults to NULL. If given, it must be a function that takes a data frame as input (logically, a row of the dataset) and returns it with additional or modified columns.
wait: Defaults to 3. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue.
user_agent: Defaults to NULL. If given, passed to the download method.
...: Passed to cas_get_db_file().
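
As an illustration of the post_processing argument, the following is a minimal sketch of a function that takes a one-row data frame and returns it with a modified column and an additional one. The column names `title` and `text`, and the derived `n_char` column, are assumptions made for this example and may not match the columns produced by a given set of extractors.

```r
# Hypothetical post-processing function: receives a data frame
# (logically, one row of the dataset) and returns it with changes.
post_processing <- function(df) {
  df$title <- trimws(df$title)   # modify an existing column
  df$n_char <- nchar(df$text)    # add a new column
  df
}

# Stand-in row, as it might look after extraction (assumed columns).
row <- data.frame(
  title = "  Example title ",
  text = "Some body text.",
  stringsAsFactors = FALSE
)

processed <- post_processing(row)
print(processed)
```

A function of this shape could then be passed as `post_processing = post_processing`.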

Examples


# Example of extract_links_partial:
extract_links_partial <- purrr::partial(
  .f = cas_extract_links,
  reverse_order = TRUE,
  container = "div",
  container_class = "hentry h-entry hentry_event",
  exclude_when = c("/photos", "/videos"),
  domain = "http://en.kremlin.ru/"
)