Skip to contents

Currently supports only update when re-downloading index urls is expected to bring new articles. It takes the first urls for each index group, and continues downloading new index pages as long as new links are found in each page. If no new link is found, it stops downloading and moves to the next index group.

Usage

cas_update(
  extract_links_partial,
  extractors,
  post_processing = NULL,
  wait = 3,
  user_agent = NULL,
  ...
)

Arguments

A partial function, typically created with purrr::partial(.f = cas_extract_links), followed by the paramters originally used by cas_extract_links(). See examples.

extractors

A named list of functions. See examples for details.

post_processing

Defaults to NULL. If given, it must be a function that takes a data frame as input (logically, a row of the dataset) and returns it with additional or modified columns.

wait

Defaults to 1. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue.

user_agent

Defaults to NULL. If given, passed to download method.

...

Passed to cas_get_db_file().

Examples


# Example of extract_links_partial:
extract_links_partial <- purrr::partial(
  .f = cas_extract_links,
  reverse_order = TRUE,
  container = "div",
  container_class = "hentry h-entry hentry_event",
  exclude_when = c("/photos", "/videos"),
  domain = "http://en.kremlin.ru/"
)