Export the textual dataset for the current website
Usage
cas_write_corpus(
corpus = NULL,
to_lower = FALSE,
drop_na = TRUE,
drop_empty = TRUE,
date = date,
text = text,
tif_compliant = FALSE,
file_format = "parquet",
partition = NULL,
token = "full_text",
corpus_folder = "corpus",
path = NULL,
db_connection = NULL,
db_folder = NULL,
...
)
Arguments
- corpus
Defaults to NULL. If NULL, retrieves the corpus from the current website with cas_read_db_contents_data(). If given, it is expected to be a corresponding data frame.
- to_lower
Defaults to FALSE. Whether to convert tokens to lowercase. Passed to tidytext if token is not full_text.
- drop_na
Defaults to TRUE. If TRUE, items that have NA in their text or date columns are dropped. This is often useful, as in many cases these may have other issues and/or cause inconsistencies in further analyses.
- drop_empty
Defaults to TRUE. If TRUE, items that have empty elements ("") in their text or date columns are dropped. This is often useful, as in many cases these may have other issues and/or cause inconsistencies in further analyses.
- date
Unquoted date column, defaults to date.
- text
Unquoted text column, defaults to text. If tif_compliant is set to TRUE, it will be renamed to "text" even if originally it had a different name.
- tif_compliant
Defaults to FALSE. If TRUE, it ensures that the first column is a character vector named "doc_id" and that the second column is a character vector named "text". See https://docs.ropensci.org/tif/ for details.
- file_format
Defaults to "parquet". Currently, other options are not implemented.
- partition
Defaults to NULL. If NULL, the parquet file is not partitioned. "year" is a common alternative: if set to "year", the parquet file is partitioned by year. If a year column does not exist, it is created based on the assumption that a date column exists and it is (or can be coerced to) a vector of class Date.
- token
Defaults to "full_text", which does not tokenise the text column. If different from full_text, it is passed to tidytext::unnest_tokens(). Accepted values include "words", "sentences", and "paragraphs". See ?tidytext::unnest_tokens for details.
- path
Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.
- db_connection
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example).
- ...
Passed to cas_get_db_file().
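
The effect of the drop_na, drop_empty, and partition arguments described above can be illustrated on a toy data frame. This is a minimal base R sketch of the documented behaviour, not the package's actual implementation: the corpus_df object and its contents are made up for illustration, and the real function writes parquet files rather than returning a list.

```r
# Hypothetical toy corpus; column names follow the documented defaults
# for the date and text arguments.
corpus_df <- data.frame(
  id = 1:5,
  date = c("2023-01-15", NA, "2024-03-02", "2024-07-30", ""),
  text = c("First article.", "No date here.", "", "Fourth article.", "Empty date."),
  stringsAsFactors = FALSE
)

# drop_na = TRUE: discard rows with NA in the text or date columns.
keep_not_na <- !is.na(corpus_df$text) & !is.na(corpus_df$date)
corpus_df <- corpus_df[keep_not_na, ]

# drop_empty = TRUE: discard rows with "" in the text or date columns.
keep_not_empty <- corpus_df$text != "" & corpus_df$date != ""
corpus_df <- corpus_df[keep_not_empty, ]

# partition = "year": when no year column exists, derive one from the
# date column, which must be (or be coercible to) class Date.
corpus_df$year <- as.integer(format(as.Date(corpus_df$date), "%Y"))

# One group of rows per year, analogous to one partition per year.
partitions <- split(corpus_df, corpus_df$year)
```

Only the rows with ids 1 and 4 survive both filters here, so the sketch ends with one group for 2023 and one for 2024. Rows dropped in this way would also be absent from the exported dataset, which is worth keeping in mind when comparing item counts against the original database.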