Read datasets created with cas_write_dataset
Usage
cas_read_corpus(
...,
update = FALSE,
path = NULL,
file_format = "parquet",
partition = NULL,
token = "full_text",
corpus_folder = "corpus"
)
Arguments
- ...
Passed to cas_get_db_file().
- update
Logical, defaults to FALSE. If FALSE, it only checks whether the relevant corpus has been previously stored. If TRUE, it checks whether more recent contents are available in the local database.
- path
Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.
- file_format
Defaults to "parquet". Currently, other options are not implemented.
- partition
Defaults to NULL. If NULL, the parquet file is not partitioned. "year" is a common alternative: if set to "year", the parquet file is partitioned by year. If a year column does not exist, it is created based on the assumption that a date column exists and it is (or can be coerced to) a vector of class Date.
- token
Defaults to "full_text", which does not tokenise the text column. If different from "full_text", it is passed to tidytext::unnest_tokens() (see its help for details). Accepted values include "words", "sentences", and "paragraphs".
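
The partition and token arguments described above can be illustrated with a minimal, self-contained sketch. It does not call cas_read_corpus() or reproduce its internals: the data frame, its column names (date, text), and the output column name (sentence) are assumptions made for illustration only. It shows how a year column might be derived from a date column, and what passing token = "sentences" to tidytext::unnest_tokens() does to a text column.

```r
library(tidytext)

# Hypothetical two-document corpus; the column names are assumptions.
corpus_df <- data.frame(
  date = c("2023-05-01", "2024-01-15"),
  text = c(
    "First sentence. Second sentence.",
    "Only one sentence here."
  ),
  stringsAsFactors = FALSE
)

# Derive a `year` column from `date`, coercing the strings to class Date first.
corpus_df$year <- as.integer(format(as.Date(corpus_df$date), "%Y"))

# Split the text column into one row per sentence.
# By default unnest_tokens() lowercases the text and drops the input column.
sentences_df <- unnest_tokens(
  tbl = corpus_df,
  output = sentence,
  input = text,
  token = "sentences"
)

print(sentences_df)
```

With token = "words" or token = "paragraphs" the same call would produce one row per word or per paragraph instead; the date and year columns are repeated on every resulting row.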