Checks if given corpus exists and, optionally, updates it
Source: R/cas_check_corpus.R
Checks if given corpus exists and, optionally, updates it.
Usage
cas_check_corpus(
...,
update = FALSE,
keep_only_latest = FALSE,
path = NULL,
file_format = "parquet",
partition = NULL,
token = "full_text",
corpus_folder = "corpus"
)
Arguments
- ...
Passed to cas_get_db_file().
- update
Logical, defaults to FALSE. If set to TRUE, it checks whether the local database has contents with a higher content id than the highest one available in the previously exported corpus, if any. If so, it writes a new, updated corpus.
- keep_only_latest
Logical, defaults to FALSE. If set to TRUE, it deletes older corpora of the same type.
- path
Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.
- file_format
Defaults to "parquet". Currently, other options are not implemented.
- partition
Defaults to NULL. If NULL, the parquet file is not partitioned. "year" is a common alternative: if set to "year", the parquet file is partitioned by year. If a year column does not exist, it is created based on the assumption that a date column exists and it is (or can be coerced to) a vector of class Date.
- token
Defaults to "full_text", which does not tokenise the text column. If different from "full_text", it is passed to tidytext::unnest_tokens() (see its help for details). Accepted values include "words", "sentences", and "paragraphs".
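Examples
The behaviour described for update and partition can be illustrated with a minimal base R sketch. It does not call cas_check_corpus() and is not the package's implementation: the data frame db_df, the value latest_exported_id, and all other object names are hypothetical stand-ins, chosen only to show the comparison of content ids and the derivation of a year column from a date column.

```r
# Hypothetical stand-in for the contents of the local database
# (not produced by castarter; for illustration only).
db_df <- data.frame(
  id = 1:4,
  date = c("2022-11-30", "2022-12-31", "2023-01-01", "2023-06-15"),
  text = c("First post.", "Second post.", "Third post.", "Fourth post."),
  stringsAsFactors = FALSE
)

# Hypothetical highest content id found in a previously exported corpus.
latest_exported_id <- 2L

# Logic described for `update = TRUE`: a new corpus is needed only if the
# database holds a content id higher than the one already exported.
needs_update <- max(db_df$id) > latest_exported_id

# Logic described for `partition = "year"`: if no `year` column exists,
# derive it from `date`, coercing `date` to class Date first.
if (!"year" %in% names(db_df)) {
  db_df$year <- as.integer(format(as.Date(db_df$date), "%Y"))
}

# One data frame per year, mirroring the idea of one partition per year.
partitions <- split(db_df, db_df$year)
```

In this sketch, partitions is a named list with one element per year; an actual export would write each group to disk rather than keep it in memory.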