One of the first issues that appears when starting a text mining or web scraping project is managing files and folders. castarter defaults to an opinionated folder structure that should work for most projects. It also facilitates downloading files (skipping previously downloaded files) and ensures consistent and unique matching between a downloaded html file, its source url, and the data extracted from it. Finally, it facilitates archiving and backing up downloaded files and scripts.

Getting started

In this vignette, I will outline some of the basic steps that can be used to extract and process the contents of press releases from a number of institutions of the European Union. When text mining or scraping, it is common to quickly gather many thousands of files, and keeping them in good order is fundamental, particularly in the long term.

A preliminary suggestion: depending on how you usually work and keep your files backed up, it may make sense to keep your scripts in a folder that is live-synced (e.g. with services such as Dropbox, Nextcloud, or Google Drive). It rarely makes sense, however, to live-sync tens or hundreds of thousands of files as you proceed with your scraping. You may want to keep this in mind as you set the base_folder with cas_set_options(). I will keep everything in a temporary folder here for the sake of simplicity, but there are no drawbacks in having scripts and data folders in different locations.

castarter stores details about the download process in a database. By default, this is stored locally in a DuckDB database kept in the same folder as website files, but it can be stored in a different folder, and alternative database backends such as RSQLite or MySQL can also be used.

Assuming that my project on the European Union involves text mining the websites of the European Parliament, the European Council, and the European External Action Service (EEAS), the folder structure may look something like this:

library("castarter")
cas_set_options(
  base_folder = fs::path(
    fs::path_temp(),
    "castarter_data"
  ),
  project = "European Union",
  website = "EEAS"
)
#> /tmp/RtmpRbjOUP/castarter_data
#> └── European Union
#>     ├── EEAS
#>     ├── European Council
#>     └── European Parliament

In brief, castarter_data is the base folder where I can store all of my text mining projects. European Union is the name of the project, while the folders inside it are named after the specific websites I will source. Folders will be created automatically as needed when you start downloading files.
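
To make the layout concrete, here is a minimal base R sketch of how a path following the base folder / project / website pattern can be composed. It only illustrates the naming scheme: `website_path()` is a hypothetical helper, not a castarter function, and the relative base folder is chosen just for the example.

```r
# Hypothetical helper (not part of castarter): compose the folder for one
# website following the base folder / project / website layout.
website_path <- function(base_folder, project, website) {
  file.path(base_folder, project, website)
}

websites <- c("EEAS", "European Council", "European Parliament")

# One path per website, all under the same project folder
paths <- vapply(
  websites,
  function(w) website_path("castarter_data", "European Union", w),
  character(1),
  USE.NAMES = FALSE
)

paths
```

Keeping the project and website as separate path components is what allows several websites to share one project folder without their files mixing.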

Downloading index files

In text mining, a common scenario involves first downloading web pages containing lists of urls to the actual posts we are interested in. In the case of the European Commission, these would probably be the pages in the “news section”. By clicking on the numbers at the bottom of the page, we get to see direct links to the subsequent pages listing all posts.

These URLs look something like this:

index_urls
https://ec.europa.eu/commission/presscorner/home/en?pagenumber=1
https://ec.europa.eu/commission/presscorner/home/en?pagenumber=2
https://ec.europa.eu/commission/presscorner/home/en?pagenumber=3
https://ec.europa.eu/commission/presscorner/home/en?pagenumber=4
https://ec.europa.eu/commission/presscorner/home/en?pagenumber=5
https://ec.europa.eu/commission/presscorner/home/en?pagenumber=6
https://ec.europa.eu/commission/presscorner/home/en?pagenumber=7
https://ec.europa.eu/commission/presscorner/home/en?pagenumber=8
https://ec.europa.eu/commission/presscorner/home/en?pagenumber=9
https://ec.europa.eu/commission/presscorner/home/en?pagenumber=10
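
When the page number is the only part of the url that changes, a vector like this can be generated rather than typed by hand. A minimal base R sketch, using the url pattern listed above (the object names are chosen for illustration only):

```r
# Common part of the index pages; only the trailing page number changes
base_url <- "https://ec.europa.eu/commission/presscorner/home/en?pagenumber="

# Append page numbers 1 to 10 to obtain the full set of index urls
index_urls <- paste0(base_url, 1:10)

head(index_urls, 3)
```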

Sometimes such urls can be derived from the archive section, as is the case, for example, for the EEAS:

index_df <- cas_build_urls(
  url = "https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=pm_category%253AStatement/Declaration&f%5B1%5D=press_site%253AEEAS&f%5B2%5D=press_site%253AEEAS&fulltext=&created_from=&created_to=&0=press_site%253AEEAS&1=press_site%253AEEAS&2=press_site%253AEEAS&page=",
  start_page = 0,
  end_page = 3,
  index_group = "Statements"
)

index_df %>%
  knitr::kable()
id url index_group
1 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=pm_category%253AStatement/Declaration&f%5B1%5D=press_site%253AEEAS&f%5B2%5D=press_site%253AEEAS&fulltext=&created_from=&created_to=&0=press_site%253AEEAS&1=press_site%253AEEAS&2=press_site%253AEEAS&page=0 Statements
2 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=pm_category%253AStatement/Declaration&f%5B1%5D=press_site%253AEEAS&f%5B2%5D=press_site%253AEEAS&fulltext=&created_from=&created_to=&0=press_site%253AEEAS&1=press_site%253AEEAS&2=press_site%253AEEAS&page=1 Statements
3 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=pm_category%253AStatement/Declaration&f%5B1%5D=press_site%253AEEAS&f%5B2%5D=press_site%253AEEAS&fulltext=&created_from=&created_to=&0=press_site%253AEEAS&1=press_site%253AEEAS&2=press_site%253AEEAS&page=2 Statements
4 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=pm_category%253AStatement/Declaration&f%5B1%5D=press_site%253AEEAS&f%5B2%5D=press_site%253AEEAS&fulltext=&created_from=&created_to=&0=press_site%253AEEAS&1=press_site%253AEEAS&2=press_site%253AEEAS&page=3 Statements

All information about the download process is typically stored in a local database, for consistency and future reference.

cas_write_db_index(urls = index_df)
#>  Folder /tmp/RtmpiZ8sCL/cas_data for storing project and website files
#>   created.
#>  Urls added to index_id table: 4

cas_read_db_index()
#> # Source:   table<index_id> [4 x 3]
#> # Database: sqlite 3.44.2 [/tmp/RtmpiZ8sCL/cas_data/cas_European Union_EEAS_db.sqlite]
#>      id url                                                          index_group
#>   <dbl> <chr>                                                        <chr>      
#> 1     1 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=… Statements 
#> 2     2 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=… Statements 
#> 3     3 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=… Statements 
#> 4     4 https://www.eeas.europa.eu/eeas/press-material_en?f%5B0%5D=… Statements
cas_get_base_path(create_folder_if_missing = TRUE)
#>  Folder for contents files with file format html does not exist:
#>  /tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_contents
#>  The folder '/tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_contents/' has been created.
#> /tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_contents
# cas_get_files_to_download(index = TRUE)

Download files

[#TODO]

cas_download(index = TRUE, create_folder_if_missing = TRUE)
#>  Folder for index files with file format html does not exist:
#>  /tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_index
#>  The folder '/tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_index/' has been created.
#>  The folder '/tmp/RtmpRbjOUP/castarter_data/European Union/EEAS/html_index/1/' for the current download batch has been created.

Check how the download is going

download_status_df <- cas_read_db_download(index = TRUE)

download_status_df
#> # A tibble: 4 × 5
#>      id batch datetime            status        size
#>   <dbl> <dbl> <dttm>               <int> <fs::bytes>
#> 1     1     1 2024-01-06 23:50:28    200        806K
#> 2     2     1 2024-01-06 23:50:32    200        807K
#> 3     3     1 2024-01-06 23:50:35    200        808K
#> 4     4     1 2024-01-06 23:50:40    200        807K
cas_extract_links(
  container = "h5",
  container_class = "card-title",
  domain = "https://www.eeas.europa.eu",
  write_to_db = TRUE
)
#> # A tibble: 124 × 5
#>       id url                        link_text source_index_id source_index_batch
#>    <dbl> <chr>                      <chr>               <dbl>              <dbl>
#>  1     1 https://www.eeas.europa.e… Lebanon:…               1                  1
#>  2     2 https://www.eeas.europa.e… Cyprus: …               1                  1
#>  3     3 https://www.eeas.europa.e… Media Ad…               1                  1
#>  4     4 https://www.eeas.europa.e… EU Alloc…               1                  1
#>  5     5 https://www.eeas.europa.e… Supporte…               1                  1
#>  6     6 https://www.eeas.europa.e… Russian …               1                  1
#>  7     7 https://www.eeas.europa.e… Seminári…               1                  1
#>  8     8 https://www.eeas.europa.e… Europe b…               1                  1
#>  9     9 https://www.eeas.europa.e… Iran: St…               1                  1
#> 10    10 https://www.eeas.europa.e… El Comit…               1                  1
#> # ℹ 114 more rows
cas_read_db_contents_id()
#> # Source:   table<contents_id> [?? x 5]
#> # Database: sqlite 3.44.2 [/tmp/RtmpiZ8sCL/cas_data/cas_European Union_EEAS_db.sqlite]
#>       id url                        link_text source_index_id source_index_batch
#>    <dbl> <chr>                      <chr>               <dbl>              <dbl>
#>  1     1 https://www.eeas.europa.e… Lebanon:…               1                  1
#>  2     2 https://www.eeas.europa.e… Cyprus: …               1                  1
#>  3     3 https://www.eeas.europa.e… Media Ad…               1                  1
#>  4     4 https://www.eeas.europa.e… EU Alloc…               1                  1
#>  5     5 https://www.eeas.europa.e… Supporte…               1                  1
#>  6     6 https://www.eeas.europa.e… Russian …               1                  1
#>  7     7 https://www.eeas.europa.e… Seminári…               1                  1
#>  8     8 https://www.eeas.europa.e… Europe b…               1                  1
#>  9     9 https://www.eeas.europa.e… Iran: St…               1                  1
#> 10    10 https://www.eeas.europa.e… El Comit…               1                  1
#> # ℹ more rows
cas_get_files_to_download()
#> # A tibble: 124 × 4
#>       id batch url                                                    path      
#>    <dbl> <dbl> <chr>                                                  <fs::path>
#>  1     1     1 https://www.eeas.europa.eu/eeas/lebanon-press-remarks… …/1_1.html
#>  2     2     1 https://www.eeas.europa.eu/eeas/cyprus-joint-statemen… …/2_1.html
#>  3     3     1 https://www.eeas.europa.eu/eeas/media-advisory-high-r… …/3_1.html
#>  4     4     1 https://www.eeas.europa.eu/delegations/tanzania/eu-al… …/4_1.html
#>  5     5     1 https://www.eeas.europa.eu/delegations/ukraine/suppor… …/5_1.html
#>  6     6     1 https://www.eeas.europa.eu/delegations/ukraine/russia… …/6_1.html
#>  7     7     1 https://www.eeas.europa.eu/eeas/semin%C3%A1rio-diplom… …/7_1.html
#>  8     8     1 https://www.eeas.europa.eu/eeas/europe-between-two-wa… …/8_1.html
#>  9     9     1 https://www.eeas.europa.eu/eeas/iran-statement-spokes… …/9_1.html
#> 10    10     1 https://www.eeas.europa.eu/delegations/ecuador/el-com… …10_1.html
#> # ℹ 114 more rows
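
The general idea of skipping previously downloaded files can be illustrated with a few lines of base R. This is only a sketch under simplifying assumptions: it checks whether a file exists on disk, whereas castarter keeps track of downloads in its database, and the table below is invented to loosely mirror the id, batch, and path columns shown above.

```r
# Temporary folder standing in for the folder where html files are stored
tmp_dir <- tempfile("html_contents_")
dir.create(tmp_dir)

# Hypothetical table of expected files, named as <id>_<batch>.html
expected_df <- data.frame(id = 1:3, batch = 1)
expected_df$path <- file.path(
  tmp_dir,
  paste0(expected_df$id, "_", expected_df$batch, ".html")
)

# Pretend that the first file has already been downloaded
writeLines("<html></html>", expected_df$path[1])

# Keep only the rows whose file is not yet available locally
to_download_df <- expected_df[!file.exists(expected_df$path), ]

to_download_df$id
```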
cas_download(sample = 5)
#>  The folder '/tmp/RtmpRbjOUP/castarter_data/European
#> Union/EEAS/html_contents/1/' for the current download batch has been
#> created.
cas_read_db_download()
#> # A tibble: 5 × 5
#>      id batch datetime            status        size
#>   <dbl> <dbl> <dttm>               <int> <fs::bytes>
#> 1    13     1 2024-01-06 23:50:44    200       58.4K
#> 2    16     1 2024-01-06 23:50:46    200      107.5K
#> 3    26     1 2024-01-06 23:50:47    200       49.3K
#> 4   115     1 2024-01-06 23:50:45    200      108.8K
#> 5   117     1 2024-01-06 23:50:42    200      108.1K
extractors_l <- list(
  title = \(x) cas_extract_html(
    html_document = x,
    container = "h1",
    container_class = "node__title"
  ) %>%
    stringr::str_remove_all(pattern = stringr::fixed("\n")) %>%
    stringr::str_squish(),
  date = \(x) cas_extract_html(
    html_document = x,
    container = "div",
    container_class = "node__meta"
  ) %>%
    stringr::str_extract("[[:digit:]]+\\.[[:digit:]]+\\.[[:digit:]]+") %>%
    lubridate::dmy()
)
cas_extract(extractors = extractors_l)
cas_read_db_contents_data()
#> # Source:   table<contents_data> [5 x 4]
#> # Database: sqlite 3.44.2 [/tmp/RtmpiZ8sCL/cas_data/cas_European Union_EEAS_db.sqlite]
#>   id    url                                                          title date 
#>   <chr> <chr>                                                        <chr> <chr>
#> 1 13    https://www.eeas.europa.eu/delegations/kosovo/visa-liberali… With… 2024…
#> 2 16    https://www.eeas.europa.eu/eeas/message-de-josep-borrell-co… Mess… 2023…
#> 3 26    https://www.eeas.europa.eu/delegations/nepal/delegation-eur… The … 2023…
#> 4 115   https://www.eeas.europa.eu/eeas/belarus-ep-speech-high-repr… Bela… 2023…
#> 5 117   https://www.eeas.europa.eu/eeas/ep-plenary-speech-high-repr… EP P… 2023…
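
The date extractor above relies on a regular expression that matches three groups of digits separated by dots, and then parses the match as day, month, year. The same logic can be checked in isolation with base R, which can help when adapting the pattern to another website. The input string below is invented for illustration and is not taken from the EEAS website.

```r
# Hypothetical text, standing in for the contents of the "node__meta" container
meta_text <- "Brussels 06.01.2024 EEAS Press Team"

# Same pattern used in the date extractor: digits, dot, digits, dot, digits
date_pattern <- "[[:digit:]]+\\.[[:digit:]]+\\.[[:digit:]]+"

# Keep only the first match, then parse it as day.month.year
date_string <- regmatches(meta_text, regexpr(date_pattern, meta_text))
parsed_date <- as.Date(date_string, format = "%d.%m.%Y")

format(parsed_date)
```

Testing each extractor on a handful of known strings before running it on all downloaded files makes it easier to spot pages where the date is formatted differently.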