Read datasets created with cas_write_dataset
Usage
cas_read_corpus(
...,
update = FALSE,
path = NULL,
file_format = "parquet",
partition = NULL,
token = "full_text",
corpus_folder = "corpus"
)
Arguments
- ...
Passed to cas_get_db_file().
- update
Logical, defaults to FALSE. If FALSE, just checks if relevant corpus has been previously stored. If TRUE, it checks if more recent contents are available in the local database.
- path
Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.
- file_format
Defaults to "parquet". Currently, other options are not implemented.
- partition
Defaults to NULL. If NULL, the parquet file is not partitioned. "year" is a common alternative: if set to "year", the parquet file is partitioned by year. If a year column does not exist, it is created on the assumption that a date column exists and is (or can be coerced to) a vector of class Date.
- token
Defaults to "full_text", which does not tokenise the text column. If different from "full_text", it is passed to tidytext::unnest_tokens(). Accepted values include "words", "sentences", and "paragraphs". See ?tidytext::unnest_tokens for details.
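
The partition and token arguments described above can be illustrated without a stored corpus. The following is a minimal sketch, not the package's internal code: the data frame corpus_df, its contents, and the output column name sentence are made up for illustration, and the year derivation shown is only one way to obtain a year from a column that can be coerced to Date.

```r
# Illustrative data only: a tiny stand-in for a corpus with `date` and `text` columns.
corpus_df <- data.frame(
  date = c("2023-05-01", "2024-01-15"),
  text = c("First article. It has two sentences.", "Second article."),
  stringsAsFactors = FALSE
)

# partition = "year": when no `year` column exists, one can be derived from `date`,
# provided `date` is (or can be coerced to) class Date.
if (!"year" %in% names(corpus_df)) {
  corpus_df$year <- as.integer(format(as.Date(corpus_df$date), "%Y"))
}

# token = "sentences": values other than "full_text" are handed to
# tidytext::unnest_tokens(). This step is skipped if tidytext is not installed.
if (requireNamespace("tidytext", quietly = TRUE)) {
  sentences_df <- tidytext::unnest_tokens(
    tbl = corpus_df,
    output = sentence, # hypothetical output column name
    input = text,
    token = "sentences"
  )
}
```

With token = "sentences", each row of the result holds one sentence rather than one full text, while the other columns (here date and year) are repeated for every sentence drawn from the same document.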