Export the textual dataset for the current website

Usage

cas_write_corpus(
  corpus = NULL,
  to_lower = FALSE,
  drop_na = TRUE,
  drop_empty = TRUE,
  date = date,
  text = text,
  tif_compliant = FALSE,
  file_format = "parquet",
  partition = NULL,
  token = "full_text",
  corpus_folder = "corpus",
  path = NULL,
  db_connection = NULL,
  db_folder = NULL,
  ...
)

Arguments

corpus: Defaults to NULL. If NULL, retrieves the corpus from the current website with cas_read_db_contents_data(). If given, it is expected to be a data frame with the same structure as the one returned by that function.
to_lower: Defaults to FALSE. Whether to convert tokens to lower case. Passed to tidytext::unnest_tokens() if token is not "full_text".
drop_na: Defaults to TRUE. If TRUE, items that have NA in their text or date columns are dropped. This is often useful, as such items frequently have other issues or cause inconsistencies in further analyses.
drop_empty: Defaults to TRUE. If TRUE, items that have empty elements ("") in their text or date columns are dropped. This is often useful, as such items frequently have other issues or cause inconsistencies in further analyses.
date: Unquoted date column, defaults to date.
text: Unquoted text column, defaults to text. If tif_compliant is set to TRUE, it will be renamed to "text" even if originally it had a different name.
tif_compliant: Defaults to FALSE. If TRUE, it ensures that the first column is a character vector named "doc_id" and that the second column is a character vector named "text". See https://docs.ropensci.org/tif/ for details.
file_format: Defaults to "parquet". Currently, other options are not implemented.
partition: Defaults to NULL. If NULL, the parquet file is not partitioned. "year" is a common alternative: if set to "year", the parquet file is partitioned by year. If a year column does not exist, it is created from the date column, which is assumed to exist and to be (or be coercible to) a vector of class Date.
token: Defaults to "full_text", which does not tokenise the text column. If different from "full_text", it is passed to tidytext::unnest_tokens() (see its help for details). Accepted values include "words", "sentences", and "paragraphs".
path: Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.
db_connection: Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example).
...: Passed to cas_get_db_file().
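
The row filtering described for drop_na and drop_empty, and the year column that partition = "year" relies on, can be pictured with a small base R sketch. This is only an illustration of the documented behaviour on made-up data, not the package's actual implementation; the toy data frame, the id column, and the integer type of year are assumptions.

```r
# Toy corpus with the default column names (date, text); contents are made up.
corpus <- data.frame(
  id = 1:5,
  date = c("2023-01-15", NA, "2024-03-02", "2024-07-30", "2024-11-05"),
  text = c("First article", "Second article", NA, "", "Fifth article"),
  stringsAsFactors = FALSE
)

# drop_na = TRUE: remove items with NA in the text or date columns.
no_na <- corpus[!is.na(corpus$text) & !is.na(corpus$date), ]

# drop_empty = TRUE: remove items with "" in the text or date columns.
kept <- no_na[no_na$text != "" & no_na$date != "", ]

# partition = "year": derive a year column from a date that can be coerced
# to class Date (storing it as integer is an assumption of this sketch).
kept$year <- as.integer(format(as.Date(kept$date), "%Y"))

print(kept)
```

Only the first and last items survive both filters, and each ends up with its own year value, which is what a year-partitioned export would split on.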

Value

Invisibly returns the path to the corpus.

Examples

if (FALSE) { # \dontrun{
cas_write_corpus(cas_read_db_contents_data(), partition = "year")
} # }
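
A few further calls, sketched under the same conditions as the example above (a configured project with extracted contents), may help show how the arguments fit together. They use only arguments listed in Usage; that these particular combinations suit a given project is an assumption, so the block is guarded and not run.

```r
if (FALSE) { # not run: needs a configured project with extracted contents
  # Tokenise into lower-case words before export; token and to_lower are
  # handed to tidytext::unnest_tokens(), as described under Arguments.
  cas_write_corpus(token = "words", to_lower = TRUE)

  # Export a TIF-compliant corpus and keep the invisibly returned path.
  corpus_path <- cas_write_corpus(tif_compliant = TRUE)
  corpus_path
}
```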