Checks if a given corpus exists and, optionally, updates it

Usage

cas_check_corpus(
  ...,
  update = FALSE,
  keep_only_latest = FALSE,
  path = NULL,
  file_format = "parquet",
  partition = NULL,
  token = "full_text",
  corpus_folder = "corpus"
)

Arguments

...

Passed to cas_get_db_file().

update

Logical, defaults to FALSE. If set to TRUE, it checks whether the local database has contents with a higher content id than the highest one in the previously exported corpus, if any. If so, it writes a new, updated corpus.

keep_only_latest

Logical, defaults to FALSE. If set to TRUE, it deletes older corpora of the same type.

path

Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.

file_format

Defaults to "parquet". Currently, other options are not implemented.

partition

Defaults to NULL. If NULL, the parquet file is not partitioned. "year" is a common alternative: if set to "year", the parquet file is partitioned by year. If a year column does not exist, it is created from the date column, which is assumed to exist and to be (or be coercible to) a vector of class Date.

token

Defaults to "full_text", which does not tokenise the text column. If set to any other value, it is passed to tidytext::unnest_tokens(). Accepted values include "words", "sentences", and "paragraphs"; see ?tidytext::unnest_tokens for details.
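
The partition argument above notes that a missing year column is derived from the date column. The base R sketch below illustrates that idea on a small made-up data frame; corpus_df and its contents are hypothetical, and this is not necessarily how the function implements the step internally.

```r
# Hypothetical data frame standing in for a corpus with a character date column.
corpus_df <- data.frame(
  id = 1:3,
  date = c("2022-12-31", "2023-01-01", "2023-06-15"),
  stringsAsFactors = FALSE
)

# Add a year column only if it is missing, coercing `date` to class Date first.
if (!"year" %in% names(corpus_df)) {
  corpus_df$year <- as.integer(format(as.Date(corpus_df$date), "%Y"))
}

# One row per article, now with the year that partitioning would group by.
print(corpus_df)
```

If date cannot be coerced to Date, as.Date() fails or returns NA, so it may be worth checking that column before requesting partition = "year".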

Value

Path to the corpus, or NULL if no corpus is found and update is set to FALSE.
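
Because the return value can be either a path or NULL, calling code usually needs to branch on it. The following is a minimal sketch of that pattern; corpus_status() is a hypothetical helper introduced only for illustration, and the example path is made up.

```r
# Hypothetical helper: turn the value returned by cas_check_corpus()
# into a short status message.
corpus_status <- function(corpus_path) {
  if (is.null(corpus_path)) {
    # No exported corpus was found and no update was requested.
    "No corpus found; consider calling again with update = TRUE."
  } else {
    paste("Corpus available at:", corpus_path)
  }
}

# In a configured project the input would come from a call such as:
# corpus_path <- cas_check_corpus(update = FALSE)
status_missing <- corpus_status(NULL)
status_found <- corpus_status("example/path/to/corpus") # made-up path

print(status_missing)
print(status_found)
```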