URL builder — cas_build_urls • castarter

Convenience function, typically used to generate the urls of index pages that list articles.

Usage

cas_build_urls(
  url,
  url_ending = "",
  glue = FALSE,
  start_page = NULL,
  end_page = NULL,
  increase_by = 1,
  date_format = "Ymd",
  start_date = NULL,
  end_date = Sys.Date() - 1,
  date_separator = NULL,
  increase_date_by = "day",
  reversed_order = FALSE,
  index_group = "index",
  index = TRUE,
  write_to_db = FALSE,
  ...
)

Arguments

url: First part of index link that does not change in other index pages.
url_ending: Part of index link appended after the part of the link that varies. If not relevant, may be left empty.
glue: Logical, defaults to FALSE. If TRUE, the url is parsed with glue, enabling custom or repeated location for the variable part of the url. If glue is set to TRUE, it is expected that the url will include the string {here} within curly brackets, e.g. https://example.com/archive/?from_date={here}&to_date={here}.
start_page: If the urls include a numerical component, define first number of the sequence. Defaults to NULL. If given, coerced to numeric, expected to be an integer.
end_page: If the urls include a numerical component, define last number of the sequence. Defaults to NULL. If given, coerced to numeric, expected to be an integer.
increase_by: Defines by how much the number in the link should be increased in the numerical sequence. Defaults to 1.
date_format: A character string, defaults to "Ymd". Defines the format of the date that is part of the URL; check strptime for valid values. Simplified formats such as the following are also accepted: "Y" (e.g. 2022), "Ym" (e.g. 2022-10), "Ymd" (e.g. 2022-10-24). See the Date formats section.
start_date: Defaults to NULL. If given, a date, or a character vector of length one coercible to date with as.Date. When given, urls are built based on dates, and parameters start_page, end_page, and increase_by, are ignored.
end_date: Defaults to Sys.Date() - 1, i.e. yesterday. If given, a date, or a character vector of length one coercible to date with as.Date.
date_separator: Defaults to NULL. A character string, such as "-", placed between the components of the date when date_format is given in simplified form. See the Date formats section.
increase_date_by: Defaults to "day". See seq.Date for valid values.
reversed_order: Logical, defaults to FALSE. If TRUE, the order of urls in the output is reversed.
index_group: A character vector, defaults to "index". Used for differentiating among different types of index or links in local databases.
index: Defaults to TRUE. Relevant only if write_to_db is also set to TRUE. If TRUE, urls are stored in the local database in the index table, otherwise they are stored in the contents table.
write_to_db: Defaults to FALSE. If set to TRUE, stores the newly created URLs to the local database.
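
To make the interplay of url, url_ending, glue, start_page, end_page, and increase_by more concrete, here is a minimal base R sketch of the general idea. The helper build_index_urls() is hypothetical and written only for illustration; it is not part of castarter and does not claim to reproduce how cas_build_urls works internally. In particular, ignoring url_ending when glue is TRUE is an assumption of this sketch.

```r
# Hypothetical helper for illustration only; not part of castarter.
build_index_urls <- function(url,
                             url_ending = "",
                             glue = FALSE,
                             start_page,
                             end_page,
                             increase_by = 1) {
  # The varying part of the link: a regular numeric sequence.
  pages <- seq(from = start_page, to = end_page, by = increase_by)

  if (glue) {
    # Replace every literal "{here}" in the url with the current value.
    # Assumption of this sketch: url_ending is not used in this branch.
    vapply(
      pages,
      function(page) gsub("{here}", as.character(page), url, fixed = TRUE),
      character(1)
    )
  } else {
    # Fixed start, then the varying part, then the optional fixed ending.
    paste0(url, pages, url_ending)
  }
}

# Numeric sequence with a step of 10
build_index_urls(
  url = "https://example.com/news/?skip=",
  start_page = 0,
  end_page = 30,
  increase_by = 10
)

# Repeated placeholder
build_index_urls(
  url = "https://example.com/archive/?page={here}&ref={here}",
  glue = TRUE,
  start_page = 1,
  end_page = 3
)
```

The placeholder approach is useful when the varying part appears more than once in the link, or is followed by other query parameters, which simple concatenation of url and url_ending cannot express.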

Value

A data frame with three columns, id, url, and index_group. Typically, url corresponds to a vector of unique urls.

Date formats

Index pages often include a date in the URL, along the lines of example.com/archive/2022-01-01, example.com/archive/2022-01-02, etc. To build such urls, cas_build_urls needs a start_date and an end_date. The format of the date can be defined by passing to date_format either a string that strptime can interpret directly, or a simplified string (such as "Ymd", without the "%"), adding a date_separator such as "-" as needed.
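
As a rough illustration of how a simplified format and a separator relate to a strptime-style format, consider the following base R sketch. The helper expand_date_format() is hypothetical and not part of castarter; it assumes each letter of the simplified string stands for one conversion specification, and that a string already containing "%" should be used as is.

```r
# Hypothetical helper for illustration only; not part of castarter.
# Turns a simplified format such as "Ymd" into a strptime-style
# format such as "%Y-%m-%d".
expand_date_format <- function(date_format, date_separator = "") {
  # Assumption: a string that already contains "%" is a full format.
  if (grepl("%", date_format, fixed = TRUE)) {
    return(date_format)
  }
  # Split into single letters, prefix each with "%", join with the separator.
  parts <- strsplit(date_format, split = "", fixed = TRUE)[[1]]
  paste0("%", parts, collapse = date_separator)
}

expand_date_format("Ymd", "-") # "%Y-%m-%d"
expand_date_format("dmY", ".") # "%d.%m.%Y"

# Daily sequence, formatted and appended to a base url
days <- seq.Date(
  from = as.Date("2022-01-01"),
  to = as.Date("2022-01-03"),
  by = "day"
)
paste0("https://example.com/archive/", format(days, expand_date_format("Ymd", "-")))

# Monthly sequence, using one of the values accepted by seq.Date
months <- seq.Date(
  from = as.Date("2022-01-01"),
  to = as.Date("2022-03-31"),
  by = "month"
)
paste0("https://example.com/archive/", format(months, expand_date_format("Ym", "-")))
```

The monthly case shows why date_format and increase_date_by are best chosen together: a "Ym" format combined with daily steps would produce many repeated urls.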

Examples

cas_build_urls(
  url = "https://www.example.com/news/",
  start_page = 1,
  end_page = 10
)
#> # A tibble: 10 × 3
#>       id url                             index_group
#>    <dbl> <chr>                           <chr>      
#>  1     1 https://www.example.com/news/1  index      
#>  2     2 https://www.example.com/news/2  index      
#>  3     3 https://www.example.com/news/3  index      
#>  4     4 https://www.example.com/news/4  index      
#>  5     5 https://www.example.com/news/5  index      
#>  6     6 https://www.example.com/news/6  index      
#>  7     7 https://www.example.com/news/7  index      
#>  8     8 https://www.example.com/news/8  index      
#>  9     9 https://www.example.com/news/9  index      
#> 10    10 https://www.example.com/news/10 index      

cas_build_urls(
  url = "https://example.com/news/?skip=",
  start_page = 0,
  end_page = 100,
  increase_by = 10
)
#> # A tibble: 11 × 3
#>       id url                                index_group
#>    <dbl> <chr>                              <chr>      
#>  1     1 https://example.com/news/?skip=0   index      
#>  2     2 https://example.com/news/?skip=10  index      
#>  3     3 https://example.com/news/?skip=20  index      
#>  4     4 https://example.com/news/?skip=30  index      
#>  5     5 https://example.com/news/?skip=40  index      
#>  6     6 https://example.com/news/?skip=50  index      
#>  7     7 https://example.com/news/?skip=60  index      
#>  8     8 https://example.com/news/?skip=70  index      
#>  9     9 https://example.com/news/?skip=80  index      
#> 10    10 https://example.com/news/?skip=90  index      
#> 11    11 https://example.com/news/?skip=100 index      


cas_build_urls(
  url = "https://example.com/archive/",
  start_date = "2022-01-01",
  end_date = "2022-12-31",
  date_separator = "-"
) %>%
  head()
#> # A tibble: 6 × 3
#>      id url                                    index_group
#>   <dbl> <chr>                                  <chr>      
#> 1     1 https://example.com/archive/2022-01-01 index      
#> 2     2 https://example.com/archive/2022-01-02 index      
#> 3     3 https://example.com/archive/2022-01-03 index      
#> 4     4 https://example.com/archive/2022-01-04 index      
#> 5     5 https://example.com/archive/2022-01-05 index      
#> 6     6 https://example.com/archive/2022-01-06 index      

cas_build_urls(
  url = "https://example.com/archive/?from={here}&to={here}",
  glue = TRUE,
  start_date = "2011-01-01",
  end_date = "2022-12-31",
  date_separator = ".",
  date_format = "dmY",
  index_group = "news"
)
#> # A tibble: 4,383 × 3
#>       id url                                                        index_group
#>    <dbl> <chr>                                                      <chr>      
#>  1     1 https://example.com/archive/?from=01.01.2011&to=01.01.2011 news       
#>  2     2 https://example.com/archive/?from=02.01.2011&to=02.01.2011 news       
#>  3     3 https://example.com/archive/?from=03.01.2011&to=03.01.2011 news       
#>  4     4 https://example.com/archive/?from=04.01.2011&to=04.01.2011 news       
#>  5     5 https://example.com/archive/?from=05.01.2011&to=05.01.2011 news       
#>  6     6 https://example.com/archive/?from=06.01.2011&to=06.01.2011 news       
#>  7     7 https://example.com/archive/?from=07.01.2011&to=07.01.2011 news       
#>  8     8 https://example.com/archive/?from=08.01.2011&to=08.01.2011 news       
#>  9     9 https://example.com/archive/?from=09.01.2011&to=09.01.2011 news       
#> 10    10 https://example.com/archive/?from=10.01.2011&to=10.01.2011 news       
#> # ℹ 4,373 more rows