Extract fields and contents from downloaded files

Usage

cas_extract(
  extractors,
  post_processing = NULL,
  id = NULL,
  ignore_id = TRUE,
  custom_path = NULL,
  index = FALSE,
  store_as_character = TRUE,
  check_previous = TRUE,
  db_connection = NULL,
  file_format = "html",
  sample = FALSE,
  write_to_db = FALSE,
  keep_if_status = 200,
  encoding = "UTF-8",
  readability = FALSE,
  ...
)

Arguments

extractors: A named list of functions. See examples for details.
post_processing: Defaults to NULL. If given, it must be a function that takes a data frame as input (logically, a row of the dataset) and returns it with additional or modified columns.
id: Defaults to NULL. Identifiers to process when extracting. If given, must be a numeric vector, logically corresponding to the identifiers in the id column, e.g. as returned by cas_read_db_contents_id().
ignore_id: Defaults to TRUE. If TRUE, it checks if identifiers have been added to the local ignore list, typically with cas_ignore_id(), and as retrieved with cas_read_db_ignore_id(). It can also be a numeric vector of identifiers: the given identifiers will not be processed. If FALSE, items will be processed normally.
index: Logical, defaults to FALSE. If TRUE, downloaded files will be considered index files. If not, they will be considered contents files. See Readme for a more extensive explanation.
store_as_character: Logical, defaults to TRUE. If TRUE, it converts to character all extracted contents before writing them to database. This reduces issues of type conversions with the default database backend (for example, SQLite automatically converts dates to numeric) or using different backends. This implies you will need to set data types when you read the database, but it also means that you can consistently expect all columns to be character vectors, which in one form or another are consistently implemented across database backends. Set to FALSE if you want to remain in control of column types.
check_previous: Logical, defaults to TRUE. If FALSE, no check will be conducted to verify if the same content had been previously extracted. If FALSE, write_to_db must be set (or will be set) to FALSE, to prevent duplication of data.
file_format: Defaults to "html". Used for storing files in dedicated folders, but also for determining processing options. For example, if a sitemap is downloaded as an index with file_format set to "xml", it will be processed accordingly. If it is stored as "xml.gz", it will be automatically decompressed for correct processing.
sample: Defaults to FALSE. If TRUE, the order in which items are processed is randomised. If a numeric is given, the order is randomised and at most the given number of items is processed.
keep_if_status: Defaults to 200. Only items whose recorded download status matches the given status are kept.
...: Passed to cas_get_db_file().
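
As noted for store_as_character, with the default setting all columns come back as character vectors, so types need to be set after reading. The following is a minimal sketch of that step, not part of the package: the data frame `extracted_df` and its columns (`id`, `date`, `title`) are hypothetical stand-ins for whatever the extractors produced.

```r
library(dplyr)

# Hypothetical stand-in for extracted contents read back from the database,
# where every column was stored as character (store_as_character = TRUE)
extracted_df <- data.frame(
  id = c("1", "2"),
  date = c("2024-01-31", "2024-02-01"),
  title = c("First item", "Second item"),
  stringsAsFactors = FALSE
)

# Restore the intended column types explicitly after reading
typed_df <- extracted_df |>
  mutate(
    id = as.numeric(id),
    date = as.Date(date)
  )
```

Doing the conversion explicitly at read time keeps the stored data identical across database backends, at the cost of one extra step in each analysis script.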

Examples

if (FALSE) { # \dontrun{
if (interactive()) {
  ### Post-processing example ####
  # For example, in order to add a column called `internal_id`
  # that takes the ending digits of the url (assuming the url ends with digits)
  # a function such as the following would be passed to cas_extract
  pp <- function(df) {
    df |>
      dplyr::mutate(internal_id = stringr::str_extract(url, "[[:digit:]]+$"))
  }
}

cas_extract(
  extractors = extractors_l, # assuming it has already been set
  post_processing = pp
)
} # }
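
The call above assumes that `extractors_l` has already been set. A minimal sketch of what such a named list of functions might look like follows. It assumes that each extractor receives the parsed HTML document of a downloaded file as its single argument, and the selectors (`h1`, `time`) are purely illustrative; the names of the list become the names of the extracted fields.

```r
library(rvest)

# Hypothetical extractors: each function is assumed to take a parsed
# HTML document and return a single value
extractors_l <- list(
  title = function(x) {
    x |>
      html_element("h1") |>
      html_text2()
  },
  date = function(x) {
    x |>
      html_element("time") |>
      html_attr("datetime")
  }
)

# Try the extractors on a small in-memory page before using them at scale
page <- read_html(
  "<html><body><h1>Example title</h1><time datetime='2024-01-31'>31 Jan</time></body></html>"
)
extracted <- lapply(extractors_l, function(f) f(page))
```

Testing each extractor on a single page first makes it easier to spot selectors that match nothing, which return NA rather than raising an error.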