Convenience function typically used to generate urls to index pages listing articles.
Usage
cas_build_urls(
url,
url_ending = "",
glue = FALSE,
start_page = NULL,
end_page = NULL,
increase_by = 1,
date_format = "Ymd",
start_date = NULL,
end_date = Sys.Date() - 1,
date_separator = NULL,
increase_date_by = "day",
reversed_order = FALSE,
index_group = "index",
index = TRUE,
write_to_db = FALSE,
...
)
Arguments
- url
First part of index link that does not change in other index pages.
- url_ending
Part of index link appended after the part of the link that varies. If not relevant, may be left empty.
- glue
Logical, defaults to FALSE. If TRUE, the url is parsed with glue, enabling custom or repeated location for the variable part of the url. If glue is set to TRUE, it is expected that the url will include the string {here} within curly brackets, e.g. https://example.com/archive/?from_date={here}&to_date={here}.
- start_page
If the urls include a numerical component, define first number of the sequence. Defaults to NULL. If given, coerced to numeric, expected to be an integer.
- end_page
If the urls include a numerical component, define last number of the sequence. Defaults to NULL. If given, coerced to numeric, expected to be an integer.
- increase_by
Defines by how much the number in the link should be increased in the numerical sequence. Defaults to 1.
- date_format
A character string, defaults to "Ymd". Check strptime for valid values used to define the format of the date that is part of the URL. Simplified formats such as the following are also accepted: "Y" (e.g. 2022), "Ym" (e.g. 2022-10), "Ymd" (e.g. 2022-10-24). See details.
- start_date
Defaults to NULL. If given, a date, or a character vector of length one coercible to date with as.Date. When given, urls are built based on dates, and parameters start_page, end_page, and increase_by are ignored.
- end_date
Defaults to Sys.Date() - 1. If given, a date, or a character vector of length one coercible to date with as.Date.
- increase_date_by
Defaults to "day". See seq.Date for valid values.
- reversed_order
Logical, defaults to FALSE. If TRUE, the order of urls in the output is reversed.
- index_group
A character vector, defaults to "index". Used for differentiating among different types of index or links in local databases.
- index
Defaults to TRUE. Relevant only if write_to_db is also set to TRUE. If TRUE, urls are stored in the local database in the index table, otherwise they are stored in the contents table.
- write_to_db
Defaults to FALSE. If set to TRUE, stores the newly created URLs to the local database.
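As a rough illustration of how url, url_ending, and the {here} placeholder combine, the following is a minimal base R sketch. The helper build_urls_sketch is hypothetical and not part of the package; it swaps in gsub for glue and only mimics the behaviour described in the arguments above, so the actual implementation may well differ.

```r
# Hypothetical helper, NOT part of the package: mimics the documented
# behaviour of `url`, `url_ending`, and `glue` using base R only.
build_urls_sketch <- function(url, variable_part, url_ending = "", glue = FALSE) {
  if (glue) {
    # Replace every literal "{here}" in the url with the varying part.
    # (Assumption: plain string substitution stands in for glue here.)
    vapply(
      X = as.character(variable_part),
      FUN = function(x) gsub("{here}", x, url, fixed = TRUE),
      FUN.VALUE = character(1),
      USE.NAMES = FALSE
    )
  } else {
    # Default case: fixed start + varying part + optional ending.
    paste0(url, variable_part, url_ending)
  }
}

# Numeric sequence, as defined by start_page, end_page, and increase_by
pages <- seq(from = 0, to = 100, by = 10)
page_urls <- build_urls_sketch("https://example.com/news/?skip=", pages)

# Repeated placeholder, as enabled by glue = TRUE
glued_urls <- build_urls_sketch(
  "https://example.com/archive/?from={here}&to={here}",
  c("01.01.2011", "02.01.2011"),
  glue = TRUE
)
```

The placeholder form is mostly useful when the varying part appears more than once in the url, or somewhere other than the end.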
Value
A data frame with three columns, id, url, and index_group.
Typically, url corresponds to a vector of unique urls.
Date formats
It is not uncommon, in particular for index pages, to include dates in the URL, along the lines of example.com/archive/2022-01-01, example.com/archive/2022-01-02, etc. To build such urls, cas_build_urls needs a start_date and end_date.
The formatting of the date can be defined either by providing to the parameter date_format a string that strptime is able to interpret directly, or a simplified string (such as "Ymd", without the "%"), adding a date_separator such as "-" as needed.
Examples
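Before the package examples, here is a minimal base R sketch of how a simplified date format and a date separator could map to a strptime-style format string. The helper expand_date_format is hypothetical and not part of the package; it only illustrates the idea described under Date formats and is not necessarily how cas_build_urls handles it internally.

```r
# Hypothetical helper, NOT part of the package: expands a simplified format
# such as "Ymd" into a strptime-style string such as "%Y-%m-%d".
expand_date_format <- function(date_format, date_separator = "") {
  if (grepl("%", date_format, fixed = TRUE)) {
    # Already a strptime-style format: use it as is.
    return(date_format)
  }
  # Split "Ymd" into "Y", "m", "d", prefix each with "%",
  # and join the pieces with the separator.
  parts <- strsplit(date_format, split = "", fixed = TRUE)[[1]]
  paste0("%", parts, collapse = date_separator)
}

# One date per day between the two ends, as with increase_date_by = "day"
dates <- seq.Date(
  from = as.Date("2022-01-01"),
  to = as.Date("2022-01-03"),
  by = "day"
)

archive_urls <- paste0(
  "https://example.com/archive/",
  format(dates, expand_date_format("Ymd", "-"))
)
```

Under the same assumptions, expand_date_format("dmY", ".") would give "%d.%m.%Y", matching the day-first urls in the last example below.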
cas_build_urls(
url = "https://www.example.com/news/",
start_page = 1,
end_page = 10
)
#> # A tibble: 10 × 3
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://www.example.com/news/1 index
#> 2 2 https://www.example.com/news/2 index
#> 3 3 https://www.example.com/news/3 index
#> 4 4 https://www.example.com/news/4 index
#> 5 5 https://www.example.com/news/5 index
#> 6 6 https://www.example.com/news/6 index
#> 7 7 https://www.example.com/news/7 index
#> 8 8 https://www.example.com/news/8 index
#> 9 9 https://www.example.com/news/9 index
#> 10 10 https://www.example.com/news/10 index
cas_build_urls(
url = "https://example.com/news/?skip=",
start_page = 0,
end_page = 100,
increase_by = 10
)
#> # A tibble: 11 × 3
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://example.com/news/?skip=0 index
#> 2 2 https://example.com/news/?skip=10 index
#> 3 3 https://example.com/news/?skip=20 index
#> 4 4 https://example.com/news/?skip=30 index
#> 5 5 https://example.com/news/?skip=40 index
#> 6 6 https://example.com/news/?skip=50 index
#> 7 7 https://example.com/news/?skip=60 index
#> 8 8 https://example.com/news/?skip=70 index
#> 9 9 https://example.com/news/?skip=80 index
#> 10 10 https://example.com/news/?skip=90 index
#> 11 11 https://example.com/news/?skip=100 index
cas_build_urls(
url = "https://example.com/archive/",
start_date = "2022-01-01",
end_date = "2022-12-31",
date_separator = "-"
) %>%
head()
#> # A tibble: 6 × 3
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://example.com/archive/2022-01-01 index
#> 2 2 https://example.com/archive/2022-01-02 index
#> 3 3 https://example.com/archive/2022-01-03 index
#> 4 4 https://example.com/archive/2022-01-04 index
#> 5 5 https://example.com/archive/2022-01-05 index
#> 6 6 https://example.com/archive/2022-01-06 index
cas_build_urls(
url = "https://example.com/archive/?from={here}&to={here}",
glue = TRUE,
start_date = "2011-01-01",
end_date = "2022-12-31",
date_separator = ".",
date_format = "dmY",
index_group = "news"
)
#> # A tibble: 4,383 × 3
#> id url index_group
#> <dbl> <chr> <chr>
#> 1 1 https://example.com/archive/?from=01.01.2011&to=01.01.2011 news
#> 2 2 https://example.com/archive/?from=02.01.2011&to=02.01.2011 news
#> 3 3 https://example.com/archive/?from=03.01.2011&to=03.01.2011 news
#> 4 4 https://example.com/archive/?from=04.01.2011&to=04.01.2011 news
#> 5 5 https://example.com/archive/?from=05.01.2011&to=05.01.2011 news
#> 6 6 https://example.com/archive/?from=06.01.2011&to=06.01.2011 news
#> 7 7 https://example.com/archive/?from=07.01.2011&to=07.01.2011 news
#> 8 8 https://example.com/archive/?from=08.01.2011&to=08.01.2011 news
#> 9 9 https://example.com/archive/?from=09.01.2011&to=09.01.2011 news
#> 10 10 https://example.com/archive/?from=10.01.2011&to=10.01.2011 news
#> # ℹ 4,373 more rows