What are index pages?
As outlined in the article on key concepts, index pages are pages that usually include some form of list of the pages with the actual contents we are interested in (or, possibly, a second layer of index pages). They can be immutable, but they are often expected to change.
Index pages can take many forms, but these are the most common:
- incremental index pages, where a progressive page number appears in the url
- dated archive pages, where the url includes a date
As an alternative, a website’s sitemap (specifically, its sitemap.xml file) can be used as an index page.
We’ll consider these options one by one.
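Just to give a flavour of the first two patterns before the detailed examples below, here is a minimal sketch of how such urls can be built with cas_build_urls(); the domain and all parameter values are placeholders, and it assumes cas_set_options() has already been run, as shown later.
# incremental index pages: example.com/news/page/1, /2, /3, ...
cas_build_urls(
  url = "https://example.com/news/page/",
  start_page = 1,
  end_page = 10,
  write_to_db = FALSE
)
# dated archive pages: {here} is replaced by each date in the range
cas_build_urls(
  glue = TRUE,
  url = "https://example.com/archive/{here}/",
  start_date = "2024-06-01",
  end_date = "2024-06-30",
  date_format = "dmY",
  date_separator = "-",
  write_to_db = FALSE
)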
Interactive url-builder helper
Before continuing with a more detailed exploration of parameters, you may want to check out an interactive interface that helps in finding the right parameters for your case and shows the relevant function call in castarter.
Sitemaps
Using sitemaps makes it very easy to get urls to all pages of a website, but the apparent ease may be misleading, as they may include too many or too few urls for the task at hand.
For example, sitemaps may include links to all pages of a website, while you may be interested only in the “news” section, or some other subset of articles.
On the other hand, sitemap files are often not complete; for example, they may include only recent publications.
In brief, sitemaps may be an easy way to get access to all urls of a website, but you should make sure they are fit for your purpose.
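A quick way to check this, before involving castarter at all, is to read the sitemap directly and look at how many urls it lists and, when available, the period covered by its last-modification dates. The following is only a sketch: “example.com” is a placeholder, and real sitemaps may be structured differently.
library("xml2")
sitemap <- read_xml("https://example.com/sitemap.xml")
# urls listed in the sitemap (local-name() sidesteps the xml namespace)
urls <- xml_text(xml_find_all(sitemap, ".//*[local-name()='loc']"))
length(urls)
# if last-modification dates are included, check the period they cover
lastmod <- xml_text(xml_find_all(sitemap, ".//*[local-name()='lastmod']"))
if (length(lastmod) > 0) range(as.Date(substr(lastmod, 1, 10)))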
Finding a sitemap file
Sitemap files are machine-readable files in xml format. They can mostly be found at one of the following locations:
- https://example.com/sitemap.xml
- at a location defined in the robots.txt file, which is commonly found at the root of the website, e.g. https://example.com/robots.txt
This may all sound exceedingly technical if you are not familiar with some component parts of how the internet works. You may dig deeper (Wikipedia has a page on robots.txt as well as on sitemaps), or you may simply try to add “sitemap.xml” or “robots.txt” to the domain of your interest and see if something relevant pops up.
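If you prefer to check from within R, a minimal sketch for finding sitemap references in a robots.txt file may look as follows (again, “example.com” is a placeholder):
# read robots.txt and keep the lines that declare a sitemap
robots_txt <- readLines("https://example.com/robots.txt", warn = FALSE)
sitemap_lines <- grep("^sitemap:", robots_txt, value = TRUE, ignore.case = TRUE)
sub("^sitemap:\\s*", "", sitemap_lines, ignore.case = TRUE)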
If a sitemap is there, the following option may be of help:
library("castarter")
cas_set_options(
base_folder = fs::path(
fs::path_home_r(),
"R",
"castarter"
),
project = "example_project",
website = "example_website"
)
cas_build_urls(
url = "https://example.com/sitemap.xml",
write_to_db = TRUE
)
cas_download_index(file_format = "xml") # as html is default, you must explicitly set xml as file_format
cas_extract_links(
index = TRUE,
file_format = "xml",
custom_css = "loc",
output_index = TRUE,
output_index_group = "sitemap",
write_to_db = TRUE
)
# cas_read_db_contents_id() |> dplyr::collect() |> View()
In some instances, sitemaps may have multiple levels: for example, a base sitemap may contain only links to other sitemaps, such as one sitemap for each month of archived articles. In such cases you would want to extract links from the base sitemap, add the monthly sitemaps to the index pages, and then extract the direct urls of articles from these monthly sitemaps.
A relevant script may look as follows:
library("castarter")
cas_set_options(
base_folder = fs::path(
fs::path_home_r(),
"R",
"castarter"
),
project = "example_project",
website = "example_website"
)
cas_build_urls(
url = "https://example.com/sitemap.xml",
index_group = "base_sitemap",
write_to_db = TRUE
)
cas_download_index(file_format = "xml") # as html is default, you must explicitly set xml as file_format
cas_extract_links(
index = TRUE,
file_format = "xml",
custom_css = "loc",
output_index = TRUE,
output_index_group = "monthly_sitemap",
write_to_db = TRUE
)
# cas_read_db_index() |> dplyr::collect() |> View()
cas_download_index(file_format = "xml", wait = 3)
cas_extract_links(
file_format = "xml",
custom_css = "loc",
index_group = "monthly_sitemap", # exclude the base sitemap
write_to_db = TRUE
)
# cas_read_db_contents_id() |> dplyr::collect() |> View()
Notice that in some cases the first-level sitemap may link to second-level sitemaps in the compressed “xml.gz” format, rather than plain “xml”. In that case, just set “xml.gz” as “file_format” for proper processing.
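For example, the second stage of the script above would then look like this (a sketch, assuming the remaining arguments stay unchanged):
# download the compressed monthly sitemaps and extract article urls from them
cas_download_index(file_format = "xml.gz", wait = 3)
cas_extract_links(
  file_format = "xml.gz",
  custom_css = "loc",
  index_group = "monthly_sitemap",
  write_to_db = TRUE
)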
Practical examples with additional difficulties
Press releases of the European Parliament
Let’s say that we are interested in analysing the press releases issued by the European Parliament. After checking that no relevant sitemap.xml is easily accessible and that there aren’t relevant pointers in the robots.txt file (see the section above for details), we decide to look at the website itself. From the home page, we can see there’s a link to “News”, which leads to a home page for news where all relevant articles are listed in reverse chronological order.
So far, everything is as expected. Older articles, however, are not self-evidently paginated with progressive index urls: there is instead a “Load more” button that loads more articles while leaving the address of the page unchanged.
We also see there is a filter button at the top of the list, which includes an advanced filter option.
Both the “Load more” and the “Advanced filter” buttons offer a meaningful approach for extracting relevant links. Let’s explore both solutions.
The “Load more” button
First, even if the “Load more” button does not change the url in the browser’s address bar, it does ask for older articles in the background. Using our browser, we may be able to see what it requests.
We open our browser’s developer tools (e.g. pressing F12 on Firefox), and then select the “Network” tab from the panel that appears. If we click on “Load more” now, we see the request as it is processed by the browser: copying the url value of that request, we see a nice “https://www.europarl.europa.eu/news/en/page/2”.
If we paste it in the browser, we see that it opens an unstyled page with just a list of articles: in many ways, an ideal format. It is also immediately apparent that just by changing the last digit in the url we can paginate and get older articles.
library("castarter")
cas_set_options(
base_folder = fs::path(
fs::path_home_r(),
"R",
"castarter_vignettes"
),
project = "european_union",
website = "european_parliament_paginated"
)
Let’s put this into practice and get the 10 most recent index pages. Let’s also set the index_group parameter to “news”; this isn’t mandatory, but it will make things easier if we then want to add index pages from other sections of the website.
cas_build_urls(
  url = "https://www.europarl.europa.eu/news/en/page/",
  start_page = 1,
  end_page = 10,
  index_group = "news",
  write_to_db = TRUE
)
#> ℹ No new url added to index_id table.
#> # A tibble: 10 × 3
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://www.europarl.europa.eu/news/en/page/1 news
#> 2 2 https://www.europarl.europa.eu/news/en/page/2 news
#> 3 3 https://www.europarl.europa.eu/news/en/page/3 news
#> 4 4 https://www.europarl.europa.eu/news/en/page/4 news
#> 5 5 https://www.europarl.europa.eu/news/en/page/5 news
#> 6 6 https://www.europarl.europa.eu/news/en/page/6 news
#> 7 7 https://www.europarl.europa.eu/news/en/page/7 news
#> 8 8 https://www.europarl.europa.eu/news/en/page/8 news
#> 9 9 https://www.europarl.europa.eu/news/en/page/9 news
#> 10 10 https://www.europarl.europa.eu/news/en/page/10 news
cas_read_db_index() |>
  dplyr::collect()
#> # A tibble: 10 × 3
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://www.europarl.europa.eu/news/en/page/1 news
#> 2 2 https://www.europarl.europa.eu/news/en/page/2 news
#> 3 3 https://www.europarl.europa.eu/news/en/page/3 news
#> 4 4 https://www.europarl.europa.eu/news/en/page/4 news
#> 5 5 https://www.europarl.europa.eu/news/en/page/5 news
#> 6 6 https://www.europarl.europa.eu/news/en/page/6 news
#> 7 7 https://www.europarl.europa.eu/news/en/page/7 news
#> 8 8 https://www.europarl.europa.eu/news/en/page/8 news
#> 9 9 https://www.europarl.europa.eu/news/en/page/9 news
#> 10 10 https://www.europarl.europa.eu/news/en/page/10 news
Looks good: ready for download.
cas_download_index(create_folder_if_missing = TRUE)
#> ℹ No new files or pages to download.
Now, as these index pages have only links to the articles we are interested in and nothing else, we’re happy with extracting all links included there:
cas_extract_links(write_to_db = TRUE)
Let’s look at the result:
cas_read_db_contents_id() |>
  head() |>
  dplyr::select(url, link_text) |>
  dplyr::collect()
#> # A tibble: 6 × 2
#> url link_text
#> <chr> <chr>
#> 1 https://www.europarl.europa.eu/news/en/press-room/20240711IPR22848/… EP TODAY
#> 2 https://www.europarl.europa.eu/news/en/press-room/20240710IPR22808/… Last-min…
#> 3 https://www.europarl.europa.eu/news/en/press-room/20240710IPR22809/… Press br…
#> 4 https://www.europarl.europa.eu/news/en/press-room/20240624IPR22302/… Metsola …
#> 5 https://www.europarl.europa.eu/news/en/press-room/20240624IPR22301/… European…
#> 6 https://www.europarl.europa.eu/news/en/press-room/20240617IPR22103/… European…
Looks good. Let’s do a quick check: how many links did we extract from each page?
cas_read_db_contents_id() |>
  dplyr::group_by(source_index_id) |>
  dplyr::tally() |>
  dplyr::collect()
#> # A tibble: 10 × 2
#> source_index_id n
#> <dbl> <int>
#> 1 1 15
#> 2 2 15
#> 3 3 15
#> 4 4 15
#> 5 5 15
#> 6 6 15
#> 7 7 15
#> 8 8 15
#> 9 9 15
#> 10 10 15
A consistent 15 links from each index page. Perfect, all as expected. We can now proceed with cas_download() and move ahead with the following steps.
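A minimal sketch of that next step, relying on the options already set above (the call without further arguments is an assumption; in practice you may want to adjust arguments such as the waiting time between downloads):
# download the article pages whose urls we have just extracted and stored
cas_download()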
The dated archive option
Alright, let’s assume that we didn’t figure out the “load more” button thing (which may not always work out so nicely).
Using the “Advanced filter” option, we can see that the url in the address bar of our browser shows a nice human-readable pattern. For example, if I search for all the posts released on the 25th of January 2024, this is the url I see:
It’s easy to guess that we can iterate over that date to find older articles as needed. If we assume that for each day all articles will always fit on a single page (an assumption we can check later, as sketched at the end of this section), we can take this as a good starting point.
library("castarter")
cas_set_options(
base_folder = fs::path(
fs::path_home_r(),
"R",
"castarter_vignettes"
),
project = "european_union",
website = "european_parliament_dates"
)
As the date is effectively repeated twice in the url, we need to fall back on a slightly more complex syntax, enabling the glue parameter and inserting {here} where we want the date to appear.
cas_build_urls(
  glue = TRUE,
  url = "https://www.europarl.europa.eu/news/en?minDate={here}&maxDate={here}&contentType=all",
  start_date = "2024-06-01",
  end_date = "2024-06-30",
  date_format = "dmY",
  date_separator = "-",
  index_group = "news",
  write_to_db = FALSE
) |>
  head() |>
  dplyr::pull(url)
#> [1] "https://www.europarl.europa.eu/news/en?minDate=01-06-2024&maxDate=01-06-2024&contentType=all"
#> [2] "https://www.europarl.europa.eu/news/en?minDate=02-06-2024&maxDate=02-06-2024&contentType=all"
#> [3] "https://www.europarl.europa.eu/news/en?minDate=03-06-2024&maxDate=03-06-2024&contentType=all"
#> [4] "https://www.europarl.europa.eu/news/en?minDate=04-06-2024&maxDate=04-06-2024&contentType=all"
#> [5] "https://www.europarl.europa.eu/news/en?minDate=05-06-2024&maxDate=05-06-2024&contentType=all"
#> [6] "https://www.europarl.europa.eu/news/en?minDate=06-06-2024&maxDate=06-06-2024&contentType=all"
The urls look good, and it seems they should work. Indeed, on most websites they would, but, as it happens, it is not uncommon to come across some sloppy backend development. In this case, one would expect that setting both minDate and maxDate to the same date, e.g. “17-01-2024”, would return all posts published on that date. Unfortunately, not on the European Parliament’s website: such queries always return an empty page, whatever the date.
This is already becoming a rather unusual case that can’t easily be addressed with the built-in options of cas_build_urls(), so unfortunately we’ll have to proceed with a custom solution, e.g.:
# date range covered and date format used by the website in its urls
start_date <- as.Date("2024-06-01")
end_date <- as.Date("2024-06-30")
date_format <- "%d-%m-%Y"
# one date per day in the range, formatted as expected by the website (minDate)
date_sequence <- format(
  x = seq.Date(
    from = start_date,
    to = end_date,
    by = "day"
  ),
  date_format
)
# the same sequence shifted forward by one day, to be used as maxDate
date_sequence_plus_1 <- format(
  x = seq.Date(
    from = start_date,
    to = end_date,
    by = "day"
  ) + 1,
  date_format
)
# one index url per day, combining the two date sequences
index_urls_v <- glue::glue("https://www.europarl.europa.eu/news/en/press-room?minDate={date_sequence}&maxDate={date_sequence_plus_1}&contentType=all")
index_urls_v
#> https://www.europarl.europa.eu/news/en/press-room?minDate=01-06-2024&maxDate=02-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=02-06-2024&maxDate=03-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=03-06-2024&maxDate=04-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=04-06-2024&maxDate=05-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=05-06-2024&maxDate=06-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=06-06-2024&maxDate=07-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=07-06-2024&maxDate=08-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=08-06-2024&maxDate=09-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=09-06-2024&maxDate=10-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=10-06-2024&maxDate=11-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=11-06-2024&maxDate=12-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=12-06-2024&maxDate=13-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=13-06-2024&maxDate=14-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=14-06-2024&maxDate=15-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=15-06-2024&maxDate=16-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=16-06-2024&maxDate=17-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=17-06-2024&maxDate=18-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=18-06-2024&maxDate=19-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=19-06-2024&maxDate=20-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=20-06-2024&maxDate=21-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=21-06-2024&maxDate=22-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=22-06-2024&maxDate=23-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=23-06-2024&maxDate=24-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=24-06-2024&maxDate=25-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=25-06-2024&maxDate=26-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=26-06-2024&maxDate=27-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=27-06-2024&maxDate=28-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=28-06-2024&maxDate=29-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=29-06-2024&maxDate=30-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=30-06-2024&maxDate=01-07-2024&contentType=all
After checking that this new custom format does indeed work, we can then add these urls to the stored index:
cas_build_urls(
  url = index_urls_v,
  index_group = "press-room",
  write_to_db = TRUE
)
#> ℹ No new url added to index_id table.
#> # A tibble: 30 × 3
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 2 2 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 3 3 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 4 4 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 5 5 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 6 6 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 7 7 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 8 8 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 9 9 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 10 10 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> # ℹ 20 more rows
And proceed with the download as usual:
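As in the previous example, this presumably means downloading the newly added index pages (a sketch; the create_folder_if_missing argument simply mirrors the earlier call and may not be needed):
# download the daily index pages added above
cas_download_index(create_folder_if_missing = TRUE)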
#> ℹ No new files or pages to download.
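Once the daily index pages have been downloaded, links can be extracted with cas_extract_links() just as in the previous example. At that point we can also check the assumption made earlier, namely that all articles of a given day fit on a single page. A possible sketch of such a check counts how many links were extracted from each daily index page: days with counts at or near the maximum number of items shown per page would suggest that pagination is needed after all.
# count links extracted from each daily index page and sort from highest
cas_read_db_contents_id() |>
  dplyr::group_by(source_index_id) |>
  dplyr::tally() |>
  dplyr::collect() |>
  dplyr::arrange(dplyr::desc(n))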