Project and website
One of the first issues that appears when starting a text mining or
web scraping project is how to manage files and folders.
castarter defaults to an opinionated folder structure that
should work for most projects. It also facilitates downloading files
(skipping files that have already been downloaded) and ensures a consistent and
unique match between a downloaded html file, its source url, and the data
extracted from it. Finally, it facilitates archiving and backing up
downloaded files and scripts.
The folder structure is based on two levels:
- project
- website
A project may include one or more websites. It is an intermediate level added to keep files in order as the number of processed websites increases.
Let’s clarify with an example. Suppose I want to do some text
mining of websites related to the European Union. The name of the
project will be european_union, and within that project I
may be gathering contents from different websites,
e.g. “european_commission”, “european_parliament”, “european_council”,
etc.
library("castarter")
cas_set_options(
base_folder = fs::path(fs::path_temp(), "castarter_data"),
project = "european_union",
website = "european_commission"
)
Assuming that my project on the European Union involves text mining the websites of the European Council, the European Commission, and the European Parliament, the folder structure may look something like this:
#> /tmp/RtmpQOCKM8/castarter_data
#> └── european_union
#> ├── european_commission
#> ├── european_council
#> └── european_parliament
In brief, castarter_data is the base folder where I can
store all of my text mining projects. european_union is the
name of the project, while the others are the names of the specific
websites I will source. Folders will be created automatically as needed
when you start downloading files.
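As an illustration (using the fs package already called above with fs::path(), rather than any castarter-specific function), the folder structure created so far can be inspected at any time; the tree shown above is the kind of output this produces:

# Inspect the folders created under the base folder
fs::dir_tree(path = fs::path(fs::path_temp(), "castarter_data"))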
When text mining or scraping, it is common to quickly gather many
thousands of files, and keeping them in good order is fundamental,
particularly in the long term. Hence, a preliminary suggestion:
depending on how you usually work and keep your files backed up, it may
make sense to keep your scripts in a folder that is live-synced
(e.g. with services such as Dropbox, Nextcloud, or Google Drive). It
however rarely makes sense to live-sync tens or hundreds of thousands of
files as you proceed with your scraping. You may want to keep this in
mind as you set the base_folder with cas_set_options().
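For example, and purely as a sketch (the home-folder path below is an arbitrary choice, not a castarter default), scripts could live in a live-synced folder while base_folder points to a location that is not synced:

# Arbitrary, non-synced location for the many downloaded files;
# scripts can be kept elsewhere, e.g. in a live-synced folder.
cas_set_options(
  base_folder = fs::path(fs::path_home(), "castarter_data"),
  project = "european_union",
  website = "european_commission"
)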
castarter stores details about the download process in a
database. By default, this is stored locally in an RSQLite database kept in
the same folder as the website files, but it can be stored in a different
folder, and alternative database backends such as MySQL can also be
used.
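To give a sense of what this means in practice, an RSQLite database is a single local file that can be opened with standard tools; the sketch below uses DBI and RSQLite directly, and the file name is a placeholder rather than the actual name used by castarter:

# Placeholder path and file name: the actual location and name of the
# database created by castarter may differ.
db_path <- fs::path(
  fs::path_temp(), "castarter_data",
  "european_union", "european_commission",
  "cas_database.sqlite"
)

con <- DBI::dbConnect(RSQLite::SQLite(), dbname = db_path)
DBI::dbListTables(con) # tables keeping track of the download process
DBI::dbDisconnect(con)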
Index pages and content pages
castarter starts with the idea that there are basically
two types of pages that are commonly found when text mining:
- index pages. These are pages that usually include some form of list of the pages with the actual contents we are interested in (or, possibly, a second layer of index pages). They can be immutable, but they are often expected to change. For example, the news archive of the official website of Russia’s president is reachable via urls such as the following:
- http://en.kremlin.ru/events/president/news/page/1 (the latest posts published)
- http://en.kremlin.ru/events/president/news/page/2 (previous posts)
- http://en.kremlin.ru/events/president/news/page/3
- …
- http://en.kremlin.ru/events/president/news/page/1000 (posts published more than 15 years ago)
- …
This structure is common to many websites. In such cases, if we intend to keep our text mining efforts up to date, we would usually want to download the first of these index pages again and again, as long as we find new links that are not in our previous dataset.
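To make the pattern concrete, a vector of index page urls of this kind can be built with plain R (this is just an illustration, not a castarter helper):

# Plain R illustration: build the urls of the first ten index pages of
# the news archive.
index_urls <- paste0("http://en.kremlin.ru/events/president/news/page/", 1:10)
head(index_urls, 3)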
- content pages. These are pages that include the actual content we are interested in. These have urls such as:
Some sections of the page may change, but our default expectation is that the part of the page we are interested in does not change. Unless we have some specific reason to do otherwise, we usually need to download such pages only once.
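The principle of downloading such pages only once can be sketched in plain R as follows (an illustration of the idea, not how castarter implements it internally):

# Plain R sketch: download a content page only if it has not been
# stored locally yet.
download_once <- function(url, dest_file) {
  if (!fs::file_exists(dest_file)) {
    utils::download.file(url = url, destfile = dest_file, quiet = TRUE)
  }
  invisible(dest_file)
}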