Extract fields and contents from downloaded files
Usage
cas_extract(
extractors,
post_processing = NULL,
id = NULL,
ignore_id = TRUE,
custom_path = NULL,
index = FALSE,
store_as_character = TRUE,
check_previous = TRUE,
db_connection = NULL,
file_format = "html",
sample = FALSE,
write_to_db = FALSE,
keep_if_status = 200,
encoding = "UTF-8",
readability = FALSE,
...
)
Arguments
- extractors
A named list of functions. See examples for details.
- post_processing
Defaults to NULL. If given, it must be a function that takes a data frame as input (logically, a row of the dataset) and returns it with additional or modified columns.
- id
Defaults to NULL, identifiers to process when extracting. If given, must be a numeric vector, logically corresponding to the identifiers in the id column, e.g. as returned by cas_read_db_contents_id().
- ignore_id
Defaults to TRUE. If TRUE, it checks if identifiers have been added to the local ignore list, typically with cas_ignore_id(), and as retrieved with cas_read_db_ignore_id(). It can also be a numeric vector of identifiers: the given identifiers will not be processed. If FALSE, items will be processed normally.
- index
Logical, defaults to FALSE. If TRUE, downloaded files will be considered index files. If not, they will be considered contents files. See Readme for a more extensive explanation.
- store_as_character
Logical, defaults to TRUE. If TRUE, all extracted contents are converted to character before being written to the database. This reduces type-conversion issues with the default database backend (for example, SQLite automatically converts dates to numeric) or when using different backends. It means you will need to set data types when you read the database, but also that you can consistently expect all columns to be character vectors, which in one form or another are consistently implemented across database backends. Set to FALSE if you want to remain in control of column types.
- check_previous
Logical, defaults to TRUE. If FALSE, no check will be conducted to verify if the same content had been previously extracted. If FALSE, write_to_db must be set (or will be set) to FALSE, to prevent duplication of data.
- sample
Defaults to FALSE. If TRUE, the download order is randomised. If a numeric is given, the download order is randomised and at most the given number of items is downloaded.
- keep_if_status
Defaults to 200. Only files whose recorded download status matches the given status are kept.
- ...
Passed to cas_get_db_file().
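As noted for store_as_character, storing every column as character means types have to be restored after reading the data back. A minimal sketch of that step in base R follows; the data frame `stored_df` and its columns are hypothetical stand-ins for whatever is actually read from the database.

```r
# Hypothetical data frame standing in for contents read back from the database,
# where every column is a character vector
stored_df <- data.frame(
  id = c("1", "2"),
  date = c("2024-01-31", "2024-02-01"),
  title = c("First post", "Second post"),
  stringsAsFactors = FALSE
)

# Restore the intended column types explicitly; untouched columns stay character
typed_df <- transform(
  stored_df,
  id = as.numeric(id),
  date = as.Date(date)
)
```

Doing the conversion explicitly at read time keeps the stored data backend-agnostic while still giving typed columns for analysis.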
Examples
if (FALSE) { # \dontrun{
  if (interactive()) {
    ### Post-processing example ####
    # For example, in order to add a column called `internal_id`
    # that takes the ending digits of the url (assuming the url ends with digits)
    # a function such as the following would be passed to cas_extract
    pp <- function(df) {
      df |>
        dplyr::mutate(internal_id = stringr::str_extract(url, "[[:digit:]]+$"))
    }
    cas_extract(
      extractors = extractors_l, # assuming it has already been set
      post_processing = pp
    )
  }
} # }
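The example above assumes that `extractors_l` has already been set. A minimal sketch of what such a named list of functions might look like is given below. It assumes that each extractor receives a parsed HTML document and returns a single value; this page does not confirm that interface, and the names `title` and `date`, the CSS selectors, and the inline test page are all hypothetical.

```r
library(rvest)

# Hypothetical extractors: each function takes a parsed HTML document
# (an assumption about the interface) and returns a single string
extractors_l <- list(
  title = function(x) {
    x |>
      rvest::html_element("h1") |>
      rvest::html_text2()
  },
  date = function(x) {
    x |>
      rvest::html_element("time") |>
      rvest::html_attr("datetime")
  }
)

# Try the extractors on a small inline page before passing them to cas_extract()
page <- rvest::minimal_html(
  '<h1>Example title</h1><time datetime="2024-01-31">31 January 2024</time>'
)
extracted <- lapply(extractors_l, function(f) f(page))
```

The names of the list elements would presumably become the column names of the extracted dataset, so it may be worth choosing them with the final table in mind.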