Export the textual dataset for the current website
Usage
cas_write_corpus(
  corpus = NULL,
  to_lower = FALSE,
  drop_na = TRUE,
  drop_empty = TRUE,
  date = date,
  text = text,
  tif_compliant = FALSE,
  file_format = "parquet",
  partition = NULL,
  token = "full_text",
  corpus_folder = "corpus",
  path = NULL,
  db_connection = NULL,
  db_folder = NULL,
  ...
)
Arguments
- corpus
Defaults to NULL. If NULL, retrieves the corpus from the current website with cas_read_db_contents_data(). If given, it is expected to be a corresponding data frame.
- to_lower
Defaults to FALSE. Whether to convert tokens to lower case. Passed to tidytext if token is not "full_text".
- drop_na
Defaults to TRUE. If TRUE, items that have NA in their text or date columns are dropped. This is often useful, as in many cases these may have other issues and/or cause inconsistencies in further analyses.
- drop_empty
Defaults to TRUE. If TRUE, items that have empty elements ("") in their text or date columns are dropped. This is often useful, as in many cases these may have other issues and/or cause inconsistencies in further analyses.
- date
Unquoted date column, defaults to date.
- text
Unquoted text column, defaults to text. If tif_compliant is set to TRUE, it will be renamed to "text" even if originally it had a different name.
- tif_compliant
Defaults to FALSE. If TRUE, it ensures that the first column is a character vector named "doc_id" and that the second column is a character vector named "text". See https://docs.ropensci.org/tif/ for details.
- file_format
Defaults to "parquet". Currently, other options are not implemented.
- partition
Defaults to NULL. If NULL, the parquet file is not partitioned. "year" is a common alternative: if set to "year", the parquet file is partitioned by year. If a year column does not exist, it is created based on the assumption that a date column exists and it is (or can be coerced to) a vector of class Date.
- token
Defaults to "full_text", which does not tokenise the text column. If different from "full_text", it is passed to tidytext::unnest_tokens() (see its help for details). Accepted values include "words", "sentences", and "paragraphs".
- path
Defaults to NULL. If NULL, path is set to the project/website/export/dataset/file_format folder.
- db_connection
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example).
- ...
Passed to cas_get_db_file().
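The filtering and partitioning behaviour described for drop_na, drop_empty and partition can be pictured with a small base R sketch. This is not the package's implementation, only an illustration of the documented effect. The data frame corpus_df and its doc_id column are hypothetical, and storing the derived year as an integer is an assumption.

```r
# Hypothetical data frame; the date and text column names follow the defaults.
corpus_df <- data.frame(
  doc_id = c("a", "b", "c", "d"),
  date = c("2023-05-01", NA, "2024-01-15", "2024-03-02"),
  text = c("First item.", "Second item.", "", "Fourth item."),
  stringsAsFactors = FALSE
)

# Effect of drop_na = TRUE: remove rows with NA in the text or date columns.
kept <- corpus_df[!is.na(corpus_df$date) & !is.na(corpus_df$text), ]

# Effect of drop_empty = TRUE: remove rows with "" in the text or date columns.
kept <- kept[kept$date != "" & kept$text != "", ]

# Effect of partition = "year" when no year column exists: derive it from
# a date column that can be coerced to class Date (integer type is assumed).
kept$year <- as.integer(format(as.Date(kept$date), "%Y"))

kept
```

Rows "b" (missing date) and "c" (empty text) are removed before the year column is derived, so only complete items end up in a partition.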
Examples
if (FALSE) { # \dontrun{
cas_write_corpus(cas_read_db_contents_data(), partition = "year")
} # }
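When token is set to something other than "full_text", the text column is passed to tidytext::unnest_tokens(). The sketch below calls that function directly on a hypothetical data frame to show roughly what token = "words" with to_lower = TRUE produces. The output column name word is an assumption, and the exact column layout written by cas_write_corpus() may differ.

```r
library(tidytext)

# Hypothetical two-document corpus using the default column names.
corpus_df <- data.frame(
  doc_id = c("a", "b"),
  date = as.Date(c("2023-05-01", "2024-01-15")),
  text = c("First item here.", "Second Item."),
  stringsAsFactors = FALSE
)

# One row per word, converted to lower case; the original text column is
# dropped by default and the other columns are repeated for each token.
words_df <- unnest_tokens(
  tbl = corpus_df,
  output = word,
  input = text,
  token = "words",
  to_lower = TRUE
)

words_df
```

Tokenising before export multiplies the number of rows, so partitioning by year may matter more for tokenised corpora than for full-text ones.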