Export the textual dataset for the current website

Usage

cas_write_corpus(
  corpus = NULL,
  arrange_by = NULL,
  to_lower = FALSE,
  drop_na = TRUE,
  date = date,
  text = text,
  tif_compliant = TRUE,
  file_format = "parquet",
  partition = "year",
  token = "full_text",
  corpus_folder = "corpus",
  path = NULL,
  db_connection = NULL,
  db_folder = NULL,
  ...
)

Arguments

corpus

Defaults to NULL. If NULL, retrieves the corpus from the current website with cas_read_db_contents_data(). If given, it is expected to be a data frame with the same structure.

arrange_by

Defaults to NULL. If given, expected to be an unquoted column name, such as date or id.

to_lower

Defaults to FALSE. Whether to convert tokens to lowercase. Passed to tidytext if token is not full_text.

drop_na

Defaults to TRUE. If TRUE, items that have NA in their text or date columns are dropped. This is often useful, as in many cases these may have other issues and/or cause inconsistencies in further analyses.
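As a rough illustration of what drop_na = TRUE implies, the base R sketch below filters a small, made-up data frame (the column names id, date, and text are assumptions for the example) so that only rows with both a text and a date remain. It is not the package's own implementation.

```r
# Hypothetical corpus: one row per item, with some missing values.
corpus_df <- data.frame(
  id = 1:4,
  date = as.Date(c("2023-01-05", NA, "2024-03-10", "2024-06-01")),
  text = c("First article.", "Second article.", NA, "Fourth article."),
  stringsAsFactors = FALSE
)

# Keep only rows where both the text and the date are present.
keep <- !is.na(corpus_df$text) & !is.na(corpus_df$date)
complete_df <- corpus_df[keep, ]

print(complete_df)
```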

date

Unquoted date column, defaults to date.

text

Unquoted text column, defaults to text. If tif_compliant is set to TRUE, it will be renamed to "text" even if originally it had a different name.

tif_compliant

Defaults to TRUE. If TRUE, it ensures that the first column is a character vector named "doc_id" and that the second column is a character vector named "text". See https://docs.ropensci.org/tif/ for details.
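The column layout described above can be sketched in base R as follows. The input data frame and its column names (id, body) are hypothetical, and building doc_id from an id column is an assumption made only for this example; the function may derive it differently.

```r
# Hypothetical input where the text lives in a column called "body".
corpus_df <- data.frame(
  id = c(101L, 102L),
  date = as.Date(c("2023-01-05", "2024-06-01")),
  body = c("First article.", "Second article."),
  stringsAsFactors = FALSE
)

# tif-style layout: "doc_id" first, "text" second, both character vectors;
# any remaining columns follow as document-level metadata.
tif_df <- data.frame(
  doc_id = as.character(corpus_df$id), # assumption: doc_id built from id
  text = as.character(corpus_df$body),
  date = corpus_df$date,
  stringsAsFactors = FALSE
)

str(tif_df)
```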

file_format

Defaults to "parquet". Currently, other options are not implemented.

partition

Defaults to "year", in which case the parquet file is partitioned by year. If NULL, the parquet file is not partitioned. If a year column does not exist, it is created on the assumption that a date column exists and is (or can be coerced to) a vector of class Date.
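A minimal sketch of the year-based partitioning idea, assuming a made-up data frame whose date column is stored as character. Deriving year is done in base R; the write step uses arrow::write_dataset() and only runs if the arrow package is installed. The output folder name is hypothetical, and this is not necessarily how the function writes its files internally.

```r
# Hypothetical corpus with dates stored as character strings.
corpus_df <- data.frame(
  doc_id = c("a", "b", "c"),
  text = c("One.", "Two.", "Three."),
  date = c("2023-01-05", "2023-11-20", "2024-06-01"),
  stringsAsFactors = FALSE
)

# Coerce to Date, then derive the year column used for partitioning.
corpus_df$date <- as.Date(corpus_df$date)
corpus_df$year <- as.integer(format(corpus_df$date, "%Y"))

# Optional: write a parquet dataset partitioned by year (requires arrow).
if (requireNamespace("arrow", quietly = TRUE)) {
  out_dir <- file.path(tempdir(), "corpus_parquet") # hypothetical location
  arrow::write_dataset(
    corpus_df,
    path = out_dir,
    format = "parquet",
    partitioning = "year"
  )
  print(list.files(out_dir)) # one subfolder per distinct year
}
```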

token

Defaults to "full_text", which does not tokenise the text column. If different from "full_text", it is passed to tidytext::unnest_tokens(). Accepted values include "words", "sentences", and "paragraphs". See ?tidytext::unnest_tokens for details.
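To give a sense of what a non-default token value does, the sketch below calls tidytext::unnest_tokens() directly on a one-row, made-up data frame. The output column name word is an assumption for this example, and the code only runs if tidytext is installed.

```r
# Hypothetical single-document corpus.
corpus_df <- data.frame(
  doc_id = "a",
  text = "Hello World. Second sentence here.",
  stringsAsFactors = FALSE
)

if (requireNamespace("tidytext", quietly = TRUE)) {
  # One row per word; to_lower = TRUE lowercases each token.
  words_df <- tidytext::unnest_tokens(
    corpus_df,
    output = word, # hypothetical name for the new token column
    input = text,
    token = "words",
    to_lower = TRUE
  )
  print(words_df)
}
```

With token = "sentences" or token = "paragraphs" the same call returns one row per sentence or paragraph instead.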

path

Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.

db_connection

Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example).

...

Passed to cas_get_db_file().