Checks if given corpus exists and, optionally, updates it
Source: R/cas_check_corpus.R
Usage
cas_check_corpus(
  ...,
  update = FALSE,
  keep_only_latest = FALSE,
  path = NULL,
  file_format = "parquet",
  partition = NULL,
  token = "full_text",
  corpus_folder = "corpus"
)
Arguments
- ...
Passed to cas_get_db_file().
- update
Logical, defaults to FALSE. If set to TRUE, it checks whether the local database has contents with a higher content id than is available in the previously exported corpus, if any. If so, it writes a new, updated corpus.
- keep_only_latest
Logical, defaults to FALSE. If set to TRUE, it deletes previous, older corpora of the same type.
- path
Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.
- file_format
Defaults to "parquet". Currently, other options are not implemented.
- partition
Defaults to NULL. If NULL, the parquet file is not partitioned. "year" is a common alternative: if set to "year", the parquet file is partitioned by year. If a year column does not exist, it is created based on the assumption that a date column exists and it is (or can be coerced to) a vector of class Date.
- token
Defaults to "full_text", which does not tokenise the text column. If different from "full_text", it is passed to tidytext::unnest_tokens(). Accepted values include "words", "sentences", and "paragraphs". See ?tidytext::unnest_tokens for details.
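Examples
The partition and token arguments described above can be illustrated with a minimal sketch. It is not the package's actual implementation: the data frame corpus_df, its column names (id, date, text), the integer type of the derived year column, and the output column name sentence are all assumptions made for illustration. The tokenisation step runs only if tidytext is installed.

```r
# Hypothetical corpus; the column names `id`, `date`, and `text` are
# assumptions for illustration, not taken from the package.
corpus_df <- data.frame(
  id = 1:3,
  date = c("2023-12-31", "2024-01-01", "2024-06-15"),
  text = c(
    "First post. It has two sentences.",
    "Second post.",
    "Third post."
  ),
  stringsAsFactors = FALSE
)

# Fallback described for partition = "year": if no `year` column exists,
# derive it from a `date` column that can be coerced to class Date.
if (!"year" %in% names(corpus_df)) {
  corpus_df$year <- as.integer(format(as.Date(corpus_df$date), "%Y"))
}

# A token value other than "full_text" is described as being passed to
# tidytext::unnest_tokens(); here "sentences" yields one row per sentence.
# The output column name `sentence` is an assumption.
if (requireNamespace("tidytext", quietly = TRUE)) {
  sentences_df <- tidytext::unnest_tokens(
    tbl = corpus_df,
    output = "sentence",
    input = "text",
    token = "sentences"
  )
}
```

With a project already configured, a call such as cas_check_corpus(update = TRUE, partition = "year") would presumably rely on a comparable fallback when the stored contents lack a year column.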