Typically used for file maintenance, especially when datasets are routinely updated.

Usage

cas_delete_corpus(
  keep = 1,
  ask = TRUE,
  file_format = "parquet",
  partition = "year",
  token = "full_text",
  corpus_folder = "corpus",
  path = NULL,
  ...
)

Arguments

keep

Numeric, defaults to 1. Number of corpus files to keep. Only the most recent files are kept.
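
The retention rule described for keep can be sketched in base R. This is an illustration only, not castarter's implementation: the helper files_to_delete() is hypothetical, and judging "most recent" by file modification time is an assumption.

```r
# Hypothetical helper, not part of castarter: given a folder of corpus files,
# return the paths that fall outside the `keep` most recent ones.
# Assumption: "most recent" is judged by file modification time.
files_to_delete <- function(folder, keep = 1) {
  paths <- list.files(folder, full.names = TRUE)
  # Sort newest first, then flag everything after the first `keep` entries.
  newest_first <- paths[order(file.mtime(paths), decreasing = TRUE)]
  newest_first[seq_along(newest_first) > keep]
}

# Demonstration in a fresh temporary folder with three dummy files.
demo_folder <- tempfile("corpus_demo")
dir.create(demo_folder)
demo_files <- file.path(demo_folder, c("a.parquet", "b.parquet", "c.parquet"))
file.create(demo_files)

# Give each file a distinct modification time, one day apart,
# so that the ordering is unambiguous (a oldest, c newest).
base_time <- as.POSIXct("2024-01-01", tz = "UTC")
for (i in seq_along(demo_files)) {
  Sys.setFileTime(demo_files[i], base_time + (i - 1) * 86400)
}

# With keep = 1, only the newest file (c.parquet) is retained.
to_delete <- files_to_delete(demo_folder, keep = 1)
basename(to_delete)
```

The sketch only lists candidates for deletion; it never removes anything, which makes it safe to run before deciding on a value for keep.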

file_format

Defaults to "parquet". Currently, other options are not implemented.

partition

Defaults to "year" in this function (see Usage). If NULL, the parquet file is not partitioned. If set to "year", the parquet file is partitioned by year; if a year column does not exist, it is created on the assumption that a date column exists and is (or can be coerced to) a vector of class Date.
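
How a year column might be derived from a date column can be sketched in base R. This is a hedged illustration with toy data, not the package's own code; in particular, storing the year as an integer is an assumption.

```r
# Toy data frame; the column names `date` and `year` follow the description above.
corpus_df <- data.frame(
  date = c("2023-12-31", "2024-01-01", "2024-06-15"),
  text = c("first", "second", "third"),
  stringsAsFactors = FALSE
)

# Add a `year` column only if it is missing, coercing `date` to class Date first.
# Assumption: the year is stored as an integer.
if (!"year" %in% names(corpus_df)) {
  corpus_df$year <- as.integer(format(as.Date(corpus_df$date), "%Y"))
}

# Splitting by year mirrors how a partitioned dataset groups rows:
# one group per distinct value of the partition column.
rows_by_year <- split(corpus_df, corpus_df$year)
names(rows_by_year)
```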

token

Defaults to "full_text", which does not tokenise the text column. Any other value is passed to tidytext::unnest_tokens(); accepted values include "words", "sentences", and "paragraphs". See ?tidytext::unnest_tokens for details.
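
The effect of different token values can be seen by calling tidytext::unnest_tokens() directly on a toy data frame. This is a minimal sketch: the data and the output column names (word, sentence) are chosen for illustration, and how castarter names the resulting column is not shown here.

```r
# Requires the tidytext package; the example is skipped if it is not installed.
if (requireNamespace("tidytext", quietly = TRUE)) {
  corpus_df <- data.frame(
    id = 1:2,
    text = c("First article. It has two sentences.", "Second article."),
    stringsAsFactors = FALSE
  )

  # token = "words": one row per word; the `text` column is replaced by `word`.
  words_df <- tidytext::unnest_tokens(
    corpus_df, output = word, input = text, token = "words"
  )

  # token = "sentences": one row per sentence.
  sentences_df <- tidytext::unnest_tokens(
    corpus_df, output = sentence, input = text, token = "sentences"
  )

  # Finer tokens produce more rows for the same documents.
  nrow(words_df)
  nrow(sentences_df)
}
```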

path

Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.

...

Passed to cas_get_db_file().