Common workflows in castarter
Source:vignettes/articles/castarter-workflow.Rmd
One of the first issues that appears when starting a text mining or
web scraping project is managing files and folders.
castarter
defaults to an opinionated folder structure that
should work for most projects. It also facilitates downloading files
(skipping previously downloaded files) and ensures consistent and
unique matching between each downloaded html file, its source url, and the
data extracted from it. Finally, it facilitates archiving and backing up
downloaded files and scripts.
Getting started
In this vignette, I will outline some of the basic steps that can be used to extract and process the contents of press releases from a number of institutions of the European Union. When text mining or scraping, it is common to quickly gather many thousands of files, and keeping them in good order is fundamental, particularly in the long term.
A preliminary suggestion: depending on how you usually work and keep
your files backed up, it may make sense to keep your scripts in a folder
that is live-synced (e.g. with services such as Dropbox, Nextcloud, or
Google Drive). It however rarely makes sense to live-sync tens or
hundreds of thousands of files as you proceed with your scraping. You
may want to keep this in mind as you set the base_folder
with cas_set_options()
. I will keep everything in a temporary folder
here for the sake of simplicity, but there are no drawbacks in
having scripts and folders in different locations.
castarter
stores details about the download process in a
database. By default, this is a local DuckDB database kept in
the same folder as website files, but it can be stored in a different
folder, and alternative database backends such as RSQLite or MySQL can
also be used.
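For example, the database backend and location can be set along with the other options at the top of a script. This is only a sketch: the argument names db_type and db_folder are assumptions, to be checked against ?cas_set_options.

```r
library("castarter")

# A sketch; `db_type` and `db_folder` are assumed argument names,
# see `?cas_set_options` for the actual interface
cas_set_options(
  base_folder = "castarter_data",
  db_type = "SQLite",          # or e.g. "DuckDB", the default
  db_folder = "castarter_db",  # keep databases separate from html files
  project = "European Union",
  website = "EEAS"
)
```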
Assuming that my project on the European Union involves text mining the websites of the European Parliament, the European Council, and the European External Action Service (EEAS), the folder structure may look something like this:
library("castarter")
cas_set_options(
base_folder = fs::path(
fs::path_temp(),
"castarter_data"
),
project = "European Union",
website = "EEAS"
)
#> /tmp/RtmpRbjOUP/castarter_data
#> └── European Union
#> ├── EEAS
#> ├── European Council
#> └── European Parliament
In brief, castarter_data
is the base folder where I can
store all of my text mining projects. European Union
is the
name of the project, while all others are the names of the specific
websites I will source. Folders will be created automatically as needed
when you start downloading files.
Downloading index files
In text mining, a common scenario involves first downloading web pages containing lists of urls to the actual posts we are interested in. In the case of the European Commission, these would probably be the pages in the “news section”. By clicking on the numbers at the bottom of the page, we get to see direct links to the subsequent pages listing all posts.
These URLs look something like this:
Sometimes such urls can be derived from the archive section, as is the case, for example, for EEAS:
index_df <- cas_build_urls(
url = "https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=pm_category%253AStatement/Declaration&f%5B1%5D=press_site%253AEEAS&f%5B2%5D=press_site%253AEEAS&fulltext=&created_from=&created_to=&0=press_site%253AEEAS&1=press_site%253AEEAS&2=press_site%253AEEAS&page=",
start_page = 0,
end_page = 3,
index_group = "Statements"
)
index_df %>%
knitr::kable()
All information about the download process is typically stored in a local database, for consistency and future reference.
cas_write_db_index(urls = index_df)
#> ℹ Folder /tmp/RtmpiZ8sCL/cas_data for storing project and website files
#> created.
#> ✔ Urls added to index_id table: 4
cas_read_db_index()
#> # Source: table<index_id> [4 x 3]
#> # Database: sqlite 3.44.2 [/tmp/RtmpiZ8sCL/cas_data/cas_European Union_EEAS_db.sqlite]
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=… Statements
#> 2 2 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=… Statements
#> 3 3 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=… Statements
#> 4 4 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=… Statements
cas_get_base_path(create_folder_if_missing = TRUE)
#> ℹ Folder for contents files with file format html does not exist:
#> ℹ /tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_contents
#> ℹ The folder '/tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_contents/' has been created.
#> /tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_contents
# cas_get_files_to_download(index = TRUE)
Download files
[#TODO]
cas_download(index = TRUE, create_folder_if_missing = TRUE)
#> ℹ Folder for index files with file format html does not exist:
#> ℹ /tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_index
#> ℹ The folder '/tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_index/' has been created.
#> ℹ The folder '/tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_index/1/' for the current download batch has been created.
Check how the download is going
download_status_df <- cas_read_db_download(index = TRUE)
download_status_df
#> # A tibble: 4 × 5
#> id batch datetime status size
#> <dbl> <dbl> <dttm> <int> <fs::bytes>
#> 1 1 1 2024-01-06 23:50:28 200 806K
#> 2 2 1 2024-01-06 23:50:32 200 807K
#> 3 3 1 2024-01-06 23:50:35 200 808K
#> 4 4 1 2024-01-06 23:50:40 200 807K
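Since cas_read_db_download() returns a regular tibble, standard dplyr verbs can be used to spot problematic responses; a minimal sketch, filtering on the status column:

```r
library("dplyr")

# any row with a status other than 200 deserves a second look;
# here, all four index pages downloaded successfully, so this
# returns an empty tibble
download_status_df %>%
  filter(status != 200)
```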
Extract links from index files
cas_extract_links(
container = "h5",
container_class = "card-title",
domain = "https://www.eeas.europa.eu",
write_to_db = TRUE
)
#> # A tibble: 124 × 5
#> id url link_text source_index_id source_index_batch
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 https://www.eeas.europa.e… Lebanon:… 1 1
#> 2 2 https://www.eeas.europa.e… Cyprus: … 1 1
#> 3 3 https://www.eeas.europa.e… Media Ad… 1 1
#> 4 4 https://www.eeas.europa.e… EU Alloc… 1 1
#> 5 5 https://www.eeas.europa.e… Supporte… 1 1
#> 6 6 https://www.eeas.europa.e… Russian … 1 1
#> 7 7 https://www.eeas.europa.e… Seminári… 1 1
#> 8 8 https://www.eeas.europa.e… Europe b… 1 1
#> 9 9 https://www.eeas.europa.e… Iran: St… 1 1
#> 10 10 https://www.eeas.europa.e… El Comit… 1 1
#> # ℹ 114 more rows
cas_read_db_contents_id()
#> # Source: table<contents_id> [?? x 5]
#> # Database: sqlite 3.44.2 [/tmp/RtmpiZ8sCL/cas_data/cas_European Union_EEAS_db.sqlite]
#> id url link_text source_index_id source_index_batch
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 https://www.eeas.europa.e… Lebanon:… 1 1
#> 2 2 https://www.eeas.europa.e… Cyprus: … 1 1
#> 3 3 https://www.eeas.europa.e… Media Ad… 1 1
#> 4 4 https://www.eeas.europa.e… EU Alloc… 1 1
#> 5 5 https://www.eeas.europa.e… Supporte… 1 1
#> 6 6 https://www.eeas.europa.e… Russian … 1 1
#> 7 7 https://www.eeas.europa.e… Seminári… 1 1
#> 8 8 https://www.eeas.europa.e… Europe b… 1 1
#> 9 9 https://www.eeas.europa.e… Iran: St… 1 1
#> 10 10 https://www.eeas.europa.e… El Comit… 1 1
#> # ℹ more rows
cas_get_files_to_download()
#> # A tibble: 124 × 4
#> id batch url path
#> <dbl> <dbl> <chr> <fs::path>
#> 1 1 1 https://www.eeas.europa.eu/eeas/lebanon-press-remarks… …/1_1.html
#> 2 2 1 https://www.eeas.europa.eu/eeas/cyprus-joint-statemen… …/2_1.html
#> 3 3 1 https://www.eeas.europa.eu/eeas/media-advisory-high-r… …/3_1.html
#> 4 4 1 https://www.eeas.europa.eu/delegations/tanzania/eu-al… …/4_1.html
#> 5 5 1 https://www.eeas.europa.eu/delegations/ukraine/suppor… …/5_1.html
#> 6 6 1 https://www.eeas.europa.eu/delegations/ukraine/russia… …/6_1.html
#> 7 7 1 https://www.eeas.europa.eu/eeas/semin%C3%A1rio-diplom… …/7_1.html
#> 8 8 1 https://www.eeas.europa.eu/eeas/europe-between-two-wa… …/8_1.html
#> 9 9 1 https://www.eeas.europa.eu/eeas/iran-statement-spokes… …/9_1.html
#> 10 10 1 https://www.eeas.europa.eu/delegations/ecuador/el-com… …10_1.html
#> # ℹ 114 more rows
cas_download(sample = 5)
#> ℹ The folder '/tmp/RtmpRbjOUP/castarter_data/European
#> Union/EEAS/html_contents/1/' for the current download batch has been
#> created.
cas_read_db_download()
#> # A tibble: 5 × 5
#> id batch datetime status size
#> <dbl> <dbl> <dttm> <int> <fs::bytes>
#> 1 13 1 2024-01-06 23:50:44 200 58.4K
#> 2 16 1 2024-01-06 23:50:46 200 107.5K
#> 3 26 1 2024-01-06 23:50:47 200 49.3K
#> 4 115 1 2024-01-06 23:50:45 200 108.8K
#> 5 117 1 2024-01-06 23:50:42 200 108.1K
extractors_l <- list(
title = \(x) cas_extract_html(
html_document = x,
container = "h1",
container_class = "node__title"
) %>%
stringr::str_remove_all(pattern = stringr::fixed("\n")) %>%
stringr::str_squish(),
date = \(x) cas_extract_html(
html_document = x,
container = "div",
container_class = "node__meta"
) %>%
stringr::str_extract("[[:digit:]]+\\.[[:digit:]]+\\.[[:digit:]]+") %>%
lubridate::dmy()
)
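Before running cas_extract() on all files, it can be useful to check what an extractor actually does. For instance, the date extractor above combines a regular expression with lubridate::dmy(); here is a minimal sketch on a made-up snippet of the kind of text found in the relevant div (the example string is hypothetical, not taken from an actual EEAS page):

```r
library("stringr")
library("lubridate")

# hypothetical example of the text found in the "node__meta" container
meta_text <- "Press release | 06.01.2024 | Brussels"

# extract the dd.mm.yyyy pattern, then parse it as a date
meta_text %>%
  str_extract("[[:digit:]]+\\.[[:digit:]]+\\.[[:digit:]]+") %>%
  dmy()
#> [1] "2024-01-06"
```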
cas_extract(extractors = extractors_l)
cas_read_db_contents_data()
#> # Source: table<contents_data> [5 x 4]
#> # Database: sqlite 3.44.2 [/tmp/RtmpiZ8sCL/cas_data/cas_European Union_EEAS_db.sqlite]
#> id url title date
#> <chr> <chr> <chr> <chr>
#> 1 13 https://www.eeas.europa.eu/delegations/kosovo/visa-liberali… With… 2024…
#> 2 16 https://www.eeas.europa.eu/eeas/message-de-josep-borrell-co… Mess… 2023…
#> 3 26 https://www.eeas.europa.eu/delegations/nepal/delegation-eur… The … 2023…
#> 4 115 https://www.eeas.europa.eu/eeas/belarus-ep-speech-high-repr… Bela… 2023…
#> 5 117 https://www.eeas.europa.eu/eeas/ep-plenary-speech-high-repr… EP P… 2023…