Delete previously stored corpora written with cas_write_corpus().

Typically used for file maintenance, especially when datasets are routinely updated.

Usage

cas_delete_corpus(
  keep = 1,
  ask = TRUE,
  file_format = "parquet",
  partition = "year",
  token = "full_text",
  corpus_folder = "corpus",
  path = NULL,
  ...
)

Arguments

keep: Numeric, defaults to 1. Number of corpus files to keep. Only the most recent files are kept.
file_format: Defaults to "parquet". Currently, other options are not implemented.
partition: Defaults to "year", as shown in the usage above. If set to "year", the parquet file is partitioned by year. If NULL, the parquet file is not partitioned. If a year column does not exist, it is created on the assumption that a date column exists and is (or can be coerced to) a vector of class Date.
token: Defaults to "full_text", which does not tokenise the text column. Any other value is passed to tidytext::unnest_tokens() (see its help for details). Accepted values include "words", "sentences", and "paragraphs".
path: Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.
...: Passed to cas_get_db_file().
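
Example

The effect of keep can be previewed with a minimal base R sketch. It is not the package's implementation: files_to_delete() is a hypothetical helper, the demo folder and file names are made up, and "most recent" is assumed here to mean most recent modification time. The sketch only lists what a "keep the N most recent" policy would remove and deletes nothing itself.

```r
# Hypothetical helper, not part of castarter: lists the files that a
# "keep the N most recent" policy would remove, newest first.
# Assumption: recency is judged by file modification time.
files_to_delete <- function(folder, keep = 1) {
  files <- list.files(folder, full.names = TRUE)
  mtimes <- file.info(files)$mtime
  newest_first <- files[order(mtimes, decreasing = TRUE)]
  # Everything ranked after the first `keep` entries is a deletion candidate
  newest_first[seq_along(newest_first) > keep]
}

# Build a throwaway folder with three placeholder files (made-up names)
demo_folder <- tempfile("corpus_demo")
dir.create(demo_folder)
demo_files <- file.path(
  demo_folder,
  c("corpus_a.parquet", "corpus_b.parquet", "corpus_c.parquet")
)
file.create(demo_files)

# Give each file a distinct modification time, one day apart
base_time <- as.POSIXct("2024-01-01", tz = "UTC")
for (i in seq_along(demo_files)) {
  Sys.setFileTime(demo_files[i], base_time + (i - 1) * 86400)
}

# With keep = 1, only the newest file (corpus_c.parquet) is retained
to_delete <- files_to_delete(demo_folder, keep = 1)
print(basename(to_delete))
```

With a real project, the corresponding call would be along the lines of cas_delete_corpus(keep = 2), leaving ask = TRUE so that nothing is removed without confirmation.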