Read datasets created with cas_write_dataset — cas_read_corpus

Read datasets created with cas_write_dataset

Usage

cas_read_corpus(
  ...,
  update = FALSE,
  path = NULL,
  file_format = "parquet",
  partition = NULL,
  token = "full_text",
  corpus_folder = "corpus"
)

Arguments

...: Passed to cas_get_db_file().
update: Logical, defaults to FALSE. If FALSE, only checks whether the relevant corpus has previously been stored. If TRUE, checks whether more recent contents are available in the local database.
path: Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.
file_format: Defaults to "parquet". Currently, other options are not implemented.
partition: Defaults to NULL. If NULL, the parquet file is not partitioned. "year" is a common alternative: if set to "year", the parquet file is partitioned by year. If no year column exists, one is created, on the assumption that a date column exists and is (or can be coerced to) a vector of class Date.
token: Defaults to "full_text", which does not tokenise the text column. If different from "full_text", it is passed to tidytext::unnest_tokens(). Accepted values include "words", "sentences", and "paragraphs". See ?tidytext::unnest_tokens for details.
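
To illustrate what partitioning by year means for a parquet dataset, here is a minimal sketch that uses the arrow package directly. It is not the internal implementation of cas_read_corpus(); the data frame, its column names, and the output folder are hypothetical stand-ins.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for a corpus with a date column
articles <- data.frame(
  id = 1:3,
  date = as.Date(c("2022-05-01", "2023-01-15", "2023-07-30")),
  text = c("First article.", "Second article.", "Third article."),
  stringsAsFactors = FALSE
)

# Derive a year column from the date column, as described for partition = "year"
articles$year <- as.integer(format(articles$date, "%Y"))

# Write one sub-folder per year (hive-style folder names such as "year=2023")
out_dir <- tempfile("corpus_by_year_")
arrow::write_dataset(
  articles,
  path = out_dir,
  format = "parquet",
  partitioning = "year"
)

partition_dirs <- list.files(out_dir)

# Re-open the partitioned dataset and count rows for a single year
ds <- arrow::open_dataset(out_dir)
n_2023 <- ds |>
  dplyr::filter(year == 2023) |>
  dplyr::collect() |>
  nrow()
```

With a partitioned layout, a filter on the partitioning column can skip folders that do not match, which is the usual motivation for partitioning large corpora by year.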

Value

A dataset as an ArrowObject.
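
Arrow objects can typically be queried with dplyr verbs and only loaded into memory with collect(). The sketch below assumes this also holds for the returned object; the in-memory table and its columns are a hypothetical stand-in, not actual output of cas_read_corpus().

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in for the object returned by cas_read_corpus()
corpus <- arrow::arrow_table(
  data.frame(
    id = 1:3,
    date = as.Date(c("2022-05-01", "2023-01-15", "2023-07-30")),
    text = c("First article.", "Second article.", "Third article."),
    stringsAsFactors = FALSE
  )
)

# Filter and select lazily, then pull the result into a regular data frame
recent_df <- corpus |>
  dplyr::filter(date >= as.Date("2023-01-01")) |>
  dplyr::select(id, date, text) |>
  dplyr::collect()
```

Filtering before collect() keeps memory use low, since only the matching rows are materialised in R.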

Examples

if (FALSE) { # \dontrun{
cas_read_corpus()
} # }
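
As a rough illustration of what a token value other than "full_text" implies, the sketch below calls tidytext::unnest_tokens() directly on a small data frame. The data frame and the output column name "sentence" are hypothetical; the column names used by cas_read_corpus() may differ.

```r
library(tidytext)

# Hypothetical stand-in for a corpus with a text column
corpus_df <- data.frame(
  id = 1:2,
  text = c(
    "First sentence. Second sentence.",
    "Only one sentence here."
  ),
  stringsAsFactors = FALSE
)

# Split the text column into one row per sentence
sentences_df <- tidytext::unnest_tokens(
  corpus_df,
  output = sentence,
  input = text,
  token = "sentences"
)
```

Each row of the result holds one sentence together with the id of the document it came from, so token-level results can be linked back to the original items.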