Downloads files systematically and stores details about each download in a local database.

Usage

cas_download(
  download_df = NULL,
  index = FALSE,
  index_group = NULL,
  file_format = "html",
  overwrite_file = FALSE,
  create_folder_if_missing = NULL,
  ignore_id = TRUE,
  wait = 1,
  pause_base = 2,
  pause_cap = 256,
  pause_min = 4,
  sample = FALSE,
  retry_times = 3,
  terminate_on = NULL,
  user_agent = NULL,
  download_again = FALSE,
  download_again_if_status_is_not = NULL,
  ...
)

Arguments

index: Logical, defaults to FALSE. If TRUE, downloaded files will be considered index files. If not, they will be considered contents files. See Readme for a more extensive explanation.
file_format: Defaults to "html". Used for storing files in dedicated folders, but also for determining processing options. For example, if a sitemap is downloaded as an index with file_format set to "xml", it will be processed accordingly. If it is stored as "xml.gz", it will be automatically decompressed for correct processing.
overwrite_file: Logical, defaults to FALSE. If TRUE, files are downloaded again even if already present, overwriting previously downloaded items.
wait: Defaults to 1. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue.
sample: Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded.
retry_times: Defaults to 3. Number of times to retry a download in case of errors.
user_agent: Defaults to NULL. If given, it is passed to the download method.
...: Passed to cas_get_db_file().
urls_df: A data frame with at least two columns named id and url. Typically generated with cas_build_urls() for index files. If a character vector is given instead, identifiers are assigned automatically.
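
The descriptions of urls_df and sample can be read together as rules for building the list of items to download. The following is a minimal base R sketch of those rules, not the package's implementation. The helper build_download_queue() is hypothetical, and the choice of 1..n as the automatic identifiers is an assumption.

```r
# Hypothetical helper, not part of the package: builds the queue of items to
# download from `urls_df` and `sample`, following the argument descriptions.
build_download_queue <- function(urls_df, sample = FALSE) {
  # A character vector gets automatic identifiers (assumed here to be 1..n).
  if (is.character(urls_df)) {
    urls_df <- data.frame(
      id = seq_along(urls_df),
      url = urls_df,
      stringsAsFactors = FALSE
    )
  }
  stopifnot(is.data.frame(urls_df), all(c("id", "url") %in% names(urls_df)))

  # sample = FALSE: keep the original order.
  if (isFALSE(sample)) {
    return(urls_df)
  }

  # sample = TRUE or a number: randomise the order.
  # sample.int() is used because the argument `sample` shadows base::sample().
  shuffled <- urls_df[sample.int(nrow(urls_df)), , drop = FALSE]

  # A number additionally caps how many items are kept.
  if (is.numeric(sample)) {
    shuffled <- utils::head(shuffled, sample)
  }
  shuffled
}

urls <- c(
  "https://example.com/page/1",
  "https://example.com/page/2",
  "https://example.com/page/3"
)

# Randomised order, at most two items.
queue <- build_download_queue(urls, sample = 2)
print(queue)
```

Passing sample = TRUE in the same call would shuffle all three rows without dropping any.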