Downloads files systematically and stores details about each download in a local database.

Usage

cas_download(
  download_df = NULL,
  index = FALSE,
  index_group = NULL,
  file_format = "html",
  overwrite_file = FALSE,
  create_folder_if_missing = NULL,
  wait = 1,
  pause_base = 2,
  pause_cap = 256,
  pause_min = 4,
  sample = FALSE,
  retry_times = 8,
  terminate_on = 404,
  user_agent = NULL,
  download_again_if_status_is_not = NULL,
  ...
)

Arguments

index

Logical, defaults to FALSE. If TRUE, downloaded files will be considered index files. If not, they will be considered contents files. See Readme for a more extensive explanation.

overwrite_file

Logical, defaults to FALSE. If TRUE, files are downloaded again even if already present, overwriting previously downloaded items.

wait

Defaults to 1. Number of seconds to wait between downloading one page and the next. Can be increased to reduce server load, or can be set to 0 when this is not an issue.

sample

Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded.
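The three accepted forms of sample (FALSE, TRUE, or a number) can be illustrated with a small base R sketch. The helper apply_sample() below is hypothetical and not part of the package; it only mimics the behaviour described above on a made-up data frame of example.com URLs, and the actual implementation may differ.

```r
# Hypothetical helper (not part of castarter): mimics the documented
# semantics of the `sample` argument on a data frame of URLs.
apply_sample <- function(urls_df, sample = FALSE) {
  if (isFALSE(sample)) {
    return(urls_df) # keep the original download order
  }
  # sample = TRUE or a number: randomise the row order
  shuffled <- urls_df[sample.int(nrow(urls_df)), , drop = FALSE]
  if (is.numeric(sample)) {
    # a number also caps how many items are kept
    shuffled <- utils::head(shuffled, n = sample)
  }
  shuffled
}

# Made-up input for illustration only
urls_df <- data.frame(
  id = 1:5,
  url = paste0("https://example.com/page/", 1:5),
  stringsAsFactors = FALSE
)

set.seed(1) # only to make the illustration reproducible
subset_df <- apply_sample(urls_df, sample = 3)
nrow(subset_df) # 3
```

A numeric sample is convenient for a quick trial run on a few pages before starting the full download.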

retry_times

Defaults to 8. Number of times to retry the download in case of errors.
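The arguments pause_base, pause_cap and pause_min in the usage above are not described on this page. Their names match the capped exponential backoff arguments of httr::RETRY(), so the sketch below assumes they behave the same way; this is an assumption rather than documented behaviour. The hypothetical helper max_pause() computes only the upper bound of each pause; with jitter, the actual wait would be a random value no lower than pause_min and no higher than that bound.

```r
# Hypothetical helper: upper bound (in seconds) of the pause before retry
# `i` under capped exponential backoff. Assumption: modelled on the
# backoff used by httr::RETRY(), not confirmed for cas_download().
max_pause <- function(i, pause_base = 2, pause_cap = 256, pause_min = 4) {
  max(pause_min, min(pause_cap, pause_base * 2^i))
}

# With 8 attempts there are at most 7 pauses between them
# (again assuming httr::RETRY()-like counting).
pauses <- vapply(1:7, max_pause, numeric(1))
pauses # 4 8 16 32 64 128 256
```

Under these assumptions, the defaults shown in the usage would let a persistently failing URL wait a few minutes in total before the function gives up on it.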

user_agent

Defaults to NULL. If given, passed to the download method.

...

Passed to cas_get_db_file().

urls_df

A data frame with at least two columns named id and url. Typically generated with cas_build_urls() for index files. If a character vector is given instead, identifiers are assigned automatically.
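As a minimal sketch of the expected input shape, a data frame with id and url columns can be built by hand in base R. The URLs are placeholders, and the sequential integer identifiers are an assumption about what automatically assigned identifiers might look like, not a description of what the package does.

```r
# Placeholder URLs for illustration only
urls <- c(
  "https://example.com/news/1",
  "https://example.com/news/2"
)

# Assumption: sequential integers stand in for automatically given ids
urls_df <- data.frame(
  id = seq_along(urls),
  url = urls,
  stringsAsFactors = FALSE
)

names(urls_df) # "id" "url"
```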