What are index pages?
As outlined in the article on key concepts, index pages are pages that usually include some form of list of the pages with the actual contents we are interested in (or, possibly, a second layer of index pages). They can be immutable, but they are often expected to change.
Index pages can take many forms, but these are the most common:
- incremental index pages, where a progressive page number appears in the url
- dated archive pages, where the url includes a date
As an alternative, a website’s sitemap (specifically, its sitemap.xml file) can be used as an index page.
We’ll consider these options one by one.
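Just to give a flavour of the first two patterns before the detailed examples below, here is a minimal sketch of how such urls can be built with cas_build_urls(); the domain and all parameter values are placeholders, and it assumes cas_set_options() has already been run, as shown later.
# incremental index pages: example.com/news/page/1, /2, /3, ...
cas_build_urls(
  url = "https://example.com/news/page/",
  start_page = 1,
  end_page = 10,
  write_to_db = FALSE
)
# dated archive pages: {here} is replaced by each date in the range
cas_build_urls(
  glue = TRUE,
  url = "https://example.com/archive/{here}/",
  start_date = "2024-06-01",
  end_date = "2024-06-30",
  date_format = "dmY",
  date_separator = "-",
  write_to_db = FALSE
)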
Interactive url-builder helper
Before continuing with a more detailed exploration of parameters, you may want to check out an interactive interface that helps in finding the right parameters for your case and shows the relevant function call in castarter.
Sitemaps
Using sitemaps makes it very easy to get urls to all pages of a website, but the apparent ease may be misleading, as they may include too many or too few urls for the task at hand.
For example, sitemaps may include links to all pages of a website, while you may be interested only in the “news” section, or some other subset of articles.
On the other hand, sitemap files are often not complete; for example, they may include only recent publications.
In brief, sitemaps may be an easy way to get access to all urls of a website, but you should make sure they are fit for your purpose.
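A quick way to check this, before involving castarter at all, is to read the sitemap directly and look at how many urls it lists and, when available, the period covered by its last-modification dates. The following is only a sketch: “example.com” is a placeholder, and real sitemaps may be structured differently.
library("xml2")
sitemap <- read_xml("https://example.com/sitemap.xml")
# urls listed in the sitemap (local-name() sidesteps the xml namespace)
urls <- xml_text(xml_find_all(sitemap, ".//*[local-name()='loc']"))
length(urls)
# if last-modification dates are included, check the period they cover
lastmod <- xml_text(xml_find_all(sitemap, ".//*[local-name()='lastmod']"))
if (length(lastmod) > 0) range(as.Date(substr(lastmod, 1, 10)))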
Finding a sitemap file
Sitemap files are machine-readable files in xml format. They can mostly be found at one of the following locations:
- https://example.com/sitemap.xml
- at a location defined in the robots.txt file, which is commonly found at the root of the website, e.g. https://example.com/robots.txt
This may all sound exceedingly technical if you are not familiar with some component parts of how the internet works. You may dig deeper (Wikipedia has a page on robots.txt as well as on sitemaps), or you may simply try to add “sitemap.xml” or “robots.txt” to the domain of your interest and see if something relevant pops up.
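If you prefer to check from within R, a minimal sketch for finding sitemap references in a robots.txt file may look as follows (again, “example.com” is a placeholder):
# read robots.txt and keep the lines that declare a sitemap
robots_txt <- readLines("https://example.com/robots.txt", warn = FALSE)
sitemap_lines <- grep("^sitemap:", robots_txt, value = TRUE, ignore.case = TRUE)
sub("^sitemap:\\s*", "", sitemap_lines, ignore.case = TRUE)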
If a sitemap is there, the following option may be of help:
library("castarter")
cas_set_options(
base_folder = fs::path(
fs::path_home_r(),
"R",
"castarter"
),
project = "example_project",
website = "example_website"
)
cas_build_urls(
url = "https://example.com/sitemap.xml",
write_to_db = TRUE
)
cas_download_index(file_format = "xml") # as html is default, you must explicitly set xml as file_format
cas_extract_links(
index = TRUE,
file_format = "xml",
custom_css = "loc",
output_index = TRUE,
output_index_group = "sitemap",
write_to_db = TRUE
)
# cas_read_db_contents_id() |> dplyr::collect() |> View()
In some instances, sitemaps may have multiple levels: for example, a base sitemap may contain only links to other sitemaps, such as one sitemap for each month of archived articles. In such cases you would want to extract links from the base sitemap, add the monthly sitemaps to the index pages, and then extract the direct urls of articles from these monthly sitemaps.
A relevant script may look as follows:
library("castarter")
cas_set_options(
base_folder = fs::path(
fs::path_home_r(),
"R",
"castarter"
),
project = "example_project",
website = "example_website"
)
cas_build_urls(
url = "https://example.com/sitemap.xml",
index_group = "base_sitemap",
write_to_db = TRUE
)
cas_download_index(file_format = "xml") # as html is default, you must explicitly set xml as file_format
cas_extract_links(
index = TRUE,
file_format = "xml",
custom_css = "loc",
output_index = TRUE,
output_index_group = "monthly_sitemap",
write_to_db = TRUE
)
# cas_read_db_index() |> dplyr::collect() |> View()
cas_download_index(file_format = "xml", wait = 3)
cas_extract_links(
file_format = "xml",
custom_css = "loc",
index_group = "monthly_sitemap", # exclude the base sitemap
write_to_db = TRUE
)
# cas_read_db_contents_id() |> dplyr::collect() |> View()
Notice that in some cases the first-level sitemap may link to second-level sitemaps in the compressed “xml.gz” format, rather than plain “xml”. In that case, just set “xml.gz” as “file_format” for proper processing.
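For example, the second stage of the script above would then look like this (a sketch, assuming the remaining arguments stay unchanged):
# download the compressed monthly sitemaps and extract article urls from them
cas_download_index(file_format = "xml.gz", wait = 3)
cas_extract_links(
  file_format = "xml.gz",
  custom_css = "loc",
  index_group = "monthly_sitemap",
  write_to_db = TRUE
)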
Practical examples with additional difficulties
Press releases of the European Parliament
Let’s say that we are interested in analysing the press releases issued by the European Parliament. After checking that no relevant sitemap.xml is easily accessible and that there aren’t relevant pointers in the robots.txt file (see the section above for details), we decide to look at the website itself. From the home page, we can see there’s a link to “News”, which leads to a home page for news where all relevant articles are listed in reverse chronological order.
So far, everything is as expected. Older articles, however, are not self-evidently paginated with progressive index urls: there is instead a “Load more” button that loads more articles while leaving the address of the page unchanged.
We also see there is a filter button at the top of the list, which includes an advanced filter option.
Both the “Load more” and the “Advanced filter” buttons offer a meaningful approach for extracting relevant links. Let’s explore both solutions.
The “Load more” button
First, even if the “Load more” button does not change the url in the browser’s address bar, it does ask for older articles in the background. Using our browser, we may be able to see what it requests.
We open our browser’s developer tools (e.g. pressing F12 on Firefox), and then select the “Network” tab from the panel that appears. If we click on “Load more” now, we see the request as it is processed by the browser: copying the url value of that request, we see a nice “https://www.europarl.europa.eu/news/en/page/2”.
If we paste it in the browser, we see that it opens an unstyled page with just a list of articles: in many ways, an ideal format. It is also immediately apparent that just by changing the last digit in the url we can paginate and get older articles.
library("castarter")
cas_set_options(
base_folder = fs::path(
fs::path_home_r(),
"R",
"castarter_vignettes"
),
project = "european_union",
website = "european_parliament_paginated"
)
Let’s put this into practice and get the 10 most recent index pages. Let’s also set the index_group parameter to “news”; this isn’t mandatory, but it will make things easier if we then want to add index pages from other sections of the website.
cas_build_urls(
  url = "https://www.europarl.europa.eu/news/en/page/",
  start_page = 1,
  end_page = 10,
  index_group = "news",
  write_to_db = TRUE
)
#> ℹ No new url added to index_id table.
#> # A tibble: 10 × 3
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://www.europarl.europa.eu/news/en/page/1 news
#> 2 2 https://www.europarl.europa.eu/news/en/page/2 news
#> 3 3 https://www.europarl.europa.eu/news/en/page/3 news
#> 4 4 https://www.europarl.europa.eu/news/en/page/4 news
#> 5 5 https://www.europarl.europa.eu/news/en/page/5 news
#> 6 6 https://www.europarl.europa.eu/news/en/page/6 news
#> 7 7 https://www.europarl.europa.eu/news/en/page/7 news
#> 8 8 https://www.europarl.europa.eu/news/en/page/8 news
#> 9 9 https://www.europarl.europa.eu/news/en/page/9 news
#> 10 10 https://www.europarl.europa.eu/news/en/page/10 news
cas_read_db_index() |>
  dplyr::collect()
#> # A tibble: 10 × 3
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://www.europarl.europa.eu/news/en/page/1 news
#> 2 2 https://www.europarl.europa.eu/news/en/page/2 news
#> 3 3 https://www.europarl.europa.eu/news/en/page/3 news
#> 4 4 https://www.europarl.europa.eu/news/en/page/4 news
#> 5 5 https://www.europarl.europa.eu/news/en/page/5 news
#> 6 6 https://www.europarl.europa.eu/news/en/page/6 news
#> 7 7 https://www.europarl.europa.eu/news/en/page/7 news
#> 8 8 https://www.europarl.europa.eu/news/en/page/8 news
#> 9 9 https://www.europarl.europa.eu/news/en/page/9 news
#> 10 10 https://www.europarl.europa.eu/news/en/page/10 news
Looks good: ready for download.
cas_download_index(create_folder_if_missing = TRUE)
#> ℹ No new files or pages to download.
Now, as these index pages have only links to the articles we are interested in and nothing else, we’re happy with extracting all links included there:
cas_extract_links(write_to_db = TRUE)
Let’s look at the result:
cas_read_db_contents_id() |>
  head() |>
  dplyr::select(url, link_text) |>
  dplyr::collect()
#> # A tibble: 6 × 2
#> url link_text
#> <chr> <chr>
#> 1 https://www.europarl.europa.eu/news/en/press-room/20240711IPR22848/… EP TODAY
#> 2 https://www.europarl.europa.eu/news/en/press-room/20240710IPR22808/… Last-min…
#> 3 https://www.europarl.europa.eu/news/en/press-room/20240710IPR22809/… Press br…
#> 4 https://www.europarl.europa.eu/news/en/press-room/20240624IPR22302/… Metsola …
#> 5 https://www.europarl.europa.eu/news/en/press-room/20240624IPR22301/… European…
#> 6 https://www.europarl.europa.eu/news/en/press-room/20240617IPR22103/… European…
Looks good. Let’s do a quick check: how many links did we extract from each page?
cas_read_db_contents_id() |>
  dplyr::group_by(source_index_id) |>
  dplyr::tally() |>
  dplyr::collect()
#> # A tibble: 10 × 2
#> source_index_id n
#> <dbl> <int>
#> 1 1 15
#> 2 2 15
#> 3 3 15
#> 4 4 15
#> 5 5 15
#> 6 6 15
#> 7 7 15
#> 8 8 15
#> 9 9 15
#> 10 10 15
A consistent 15 links from each index page. Perfect, all as expected. We can now proceed with cas_download() and move ahead with the following steps.
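A minimal sketch of that next step, relying on the options already set above (the call without further arguments is an assumption; in practice you may want to adjust arguments such as the waiting time between downloads):
# download the article pages whose urls we have just extracted and stored
cas_download()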
The dated archive option
Alright, let’s assume that we didn’t figure out the “load more” button thing (which may not always work out so nicely).
Using the “Advanced filter” option, we can see that the url in the address bar of our browser shows a nice human-readable pattern. For example, if I search for all the posts released on the 25th of January 2024, this is the url I see:
It’s easy to guess that we can iterate over that date to find older articles as needed. If we assume that for each day all articles will always fit on a single page (an assumption we can check later, as sketched at the end of this section), we can take this as a good starting point.
library("castarter")
cas_set_options(
base_folder = fs::path(
fs::path_home_r(),
"R",
"castarter_vignettes"
),
project = "european_union",
website = "european_parliament_dates"
)
As the date is effectively repeated twice in the url, we need to fall back on a slightly more complex syntax, enabling the glue parameter and inserting {here} where we want the date to appear.
cas_build_urls(
  glue = TRUE,
  url = "https://www.europarl.europa.eu/news/en?minDate={here}&maxDate={here}&contentType=all",
  start_date = "2024-06-01",
  end_date = "2024-06-30",
  date_format = "dmY",
  date_separator = "-",
  index_group = "news",
  write_to_db = FALSE
) |>
  head() |>
  dplyr::pull(url)
#> [1] "https://www.europarl.europa.eu/news/en?minDate=01-06-2024&maxDate=01-06-2024&contentType=all"
#> [2] "https://www.europarl.europa.eu/news/en?minDate=02-06-2024&maxDate=02-06-2024&contentType=all"
#> [3] "https://www.europarl.europa.eu/news/en?minDate=03-06-2024&maxDate=03-06-2024&contentType=all"
#> [4] "https://www.europarl.europa.eu/news/en?minDate=04-06-2024&maxDate=04-06-2024&contentType=all"
#> [5] "https://www.europarl.europa.eu/news/en?minDate=05-06-2024&maxDate=05-06-2024&contentType=all"
#> [6] "https://www.europarl.europa.eu/news/en?minDate=06-06-2024&maxDate=06-06-2024&contentType=all"
The urls look good, and it seems they should work. Indeed, on most websites they would, but, as it happens, it is not uncommon to come across some sloppy backend development. In this case, one would expect that setting both minDate and maxDate to the same date, e.g. “17-01-2024”, would return all posts published on that date. Unfortunately, not on the European Parliament’s website: such queries always return an empty page, whatever the date.
This is already becoming a rather unusual case that can’t easily be addressed with the built-in options of cas_build_urls(), so unfortunately we’ll have to proceed with a custom solution, e.g.:
# date range covered and date format used by the website in its urls
start_date <- as.Date("2024-06-01")
end_date <- as.Date("2024-06-30")
date_format <- "%d-%m-%Y"
# one date per day in the range, formatted as expected by the website (minDate)
date_sequence <- format(
  x = seq.Date(
    from = start_date,
    to = end_date,
    by = "day"
  ),
  date_format
)
# the same sequence shifted forward by one day, to be used as maxDate
date_sequence_plus_1 <- format(
  x = seq.Date(
    from = start_date,
    to = end_date,
    by = "day"
  ) + 1,
  date_format
)
# one index url per day, combining the two date sequences
index_urls_v <- glue::glue("https://www.europarl.europa.eu/news/en/press-room?minDate={date_sequence}&maxDate={date_sequence_plus_1}&contentType=all")
index_urls_v
#> https://www.europarl.europa.eu/news/en/press-room?minDate=01-06-2024&maxDate=02-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=02-06-2024&maxDate=03-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=03-06-2024&maxDate=04-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=04-06-2024&maxDate=05-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=05-06-2024&maxDate=06-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=06-06-2024&maxDate=07-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=07-06-2024&maxDate=08-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=08-06-2024&maxDate=09-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=09-06-2024&maxDate=10-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=10-06-2024&maxDate=11-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=11-06-2024&maxDate=12-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=12-06-2024&maxDate=13-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=13-06-2024&maxDate=14-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=14-06-2024&maxDate=15-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=15-06-2024&maxDate=16-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=16-06-2024&maxDate=17-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=17-06-2024&maxDate=18-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=18-06-2024&maxDate=19-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=19-06-2024&maxDate=20-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=20-06-2024&maxDate=21-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=21-06-2024&maxDate=22-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=22-06-2024&maxDate=23-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=23-06-2024&maxDate=24-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=24-06-2024&maxDate=25-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=25-06-2024&maxDate=26-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=26-06-2024&maxDate=27-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=27-06-2024&maxDate=28-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=28-06-2024&maxDate=29-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=29-06-2024&maxDate=30-06-2024&contentType=all
#> https://www.europarl.europa.eu/news/en/press-room?minDate=30-06-2024&maxDate=01-07-2024&contentType=all
After checking that this new custom format does indeed work, we can then add these urls to the stored index:
cas_build_urls(
  url = index_urls_v,
  index_group = "press-room",
  write_to_db = TRUE
)
#> ℹ No new url added to index_id table.
#> # A tibble: 30 × 3
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 2 2 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 3 3 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 4 4 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 5 5 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 6 6 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 7 7 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 8 8 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 9 9 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> 10 10 https://www.europarl.europa.eu/news/en/press-room?minDate=… press-room
#> # ℹ 20 more rows
And proceed with the download as usual:
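As in the previous example, this presumably means downloading the newly added index pages (a sketch; the create_folder_if_missing argument simply mirrors the earlier call and may not be needed):
# download the daily index pages added above
cas_download_index(create_folder_if_missing = TRUE)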
#> ℹ No new files or pages to download.
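Once the daily index pages have been downloaded, links can be extracted with cas_extract_links() just as in the previous example. At that point we can also check the assumption made earlier, namely that all articles of a given day fit on a single page. A possible sketch of such a check counts how many links were extracted from each daily index page: days with counts at or near the maximum number of items shown per page would suggest that pagination is needed after all.
# count links extracted from each daily index page and sort from highest
cas_read_db_contents_id() |>
  dplyr::group_by(source_index_id) |>
  dplyr::tally() |>
  dplyr::collect() |>
  dplyr::arrange(dplyr::desc(n))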