
Convenience function typically used to generate urls to index pages listing articles.

Usage

cas_build_urls(
  url,
  url_ending = "",
  glue = FALSE,
  start_page = NULL,
  end_page = NULL,
  increase_by = 1,
  date_format = "Ymd",
  start_date = NULL,
  end_date = Sys.Date() - 1,
  date_separator = NULL,
  increase_date_by = "day",
  reversed_order = FALSE,
  index_group = "index",
  index = TRUE,
  write_to_db = FALSE,
  ...
)

Arguments

url

First part of the index link, i.e. the part that stays the same across all index pages.

url_ending

Part of the index link appended after the part of the link that varies. If not relevant, may be left empty.

glue

Logical, defaults to FALSE. If TRUE, the url is parsed with glue, enabling custom or repeated location for the variable part of the url. If glue is set to TRUE, it is expected that the url will include the string {here} within curly brackets, e.g. https://example.com/archive/?from_date={here}&to_date={here}.

start_page

If the urls include a numerical component, define first number of the sequence. start_page defaults to 1.

end_page

If the urls include a numerical component, define last number of the sequence. end_page defaults to 10.

increase_by

Defines by how much the number in the link should be increased in the numerical sequence. Defaults to 1.

date_format

A character string, defaults to "Ymd". Check strptime for valid values used to define the format of the date that is part of the URL. Simplified formats such as the following are also accepted: "Y" (e.g. 2022), "Ym" (e.g. 2022-10), "Ymd" (e.g. 2022-10-24). See the "Date formats" section.

start_date

Defaults to NULL. If given, a date, or a character vector of length one coercible to date with as.Date. When given, urls are built based on dates, and the parameters start_page, end_page, and increase_by are ignored.

end_date

Defaults to Sys.Date() - 1, i.e. yesterday. If given, a date, or a character vector of length one coercible to date with as.Date.

increase_date_by

Defaults to "day". See seq.Date for valid values.

reversed_order

Logical, defaults to FALSE. If TRUE, the order of urls in the output is reversed.

index_group

A character vector, defaults to "index". Used for differentiating among different types of index or links in local databases.

index

Defaults to TRUE. Relevant only if write_to_db is also set to TRUE. If TRUE, urls are stored in the local database in the index table, otherwise they are stored in the contents table.

write_to_db

Defaults to FALSE. If set to TRUE, stores the newly created URLs to the local database.
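
The page-based arguments combine in a simple way. As a rough sketch in base R (not the package's actual implementation), the sequence defined by start_page, end_page, and increase_by is pasted between url and url_ending. The helper build_page_urls and the "&lang=en" ending are hypothetical and used only for illustration.

```r
# Hypothetical helper, not part of castarter: mimics page-based url building.
build_page_urls <- function(url,
                            url_ending = "",
                            start_page = 1,
                            end_page = 10,
                            increase_by = 1) {
  # Numerical component of each link
  pages <- seq(from = start_page, to = end_page, by = increase_by)
  # Fixed first part + varying number + optional fixed ending
  paste0(url, pages, url_ending)
}

page_urls <- build_page_urls(
  url = "https://example.com/news/?skip=",
  url_ending = "&lang=en", # assumed ending, for illustration only
  start_page = 0,
  end_page = 100,
  increase_by = 10
)

head(page_urls, 3)
```

With these inputs the sketch yields 11 urls, from ?skip=0&lang=en to ?skip=100&lang=en.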

Value

A data frame with three columns, id, url, and index_group. Typically, url corresponds to a vector of unique urls.

Date formats

Index pages often include dates in the URL, along the lines of example.com/archive/2022-01-01, example.com/archive/2022-01-02, etc. To build such urls, cas_build_urls needs a start_date and an end_date. The date format can be defined by passing to date_format either a string that strptime can interpret directly, or a simplified string (such as "Ymd", without the "%"), adding a date_separator such as "-" as needed.
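
To make the relation between the simplified and the strptime-style formats concrete, here is a minimal base R sketch. It assumes that each letter of the simplified string maps to the corresponding "%" conversion and that the letters are joined with date_separator; the helper expand_date_format is hypothetical and not part of the package, whose internals may differ.

```r
# Hypothetical helper, not part of castarter: expands a simplified format
# such as "Ymd" into a strptime-style format such as "%Y-%m-%d".
expand_date_format <- function(date_format = "Ymd", date_separator = "") {
  if (grepl("%", date_format, fixed = TRUE)) {
    # Already a strptime-style format: use it as is
    return(date_format)
  }
  # Split into single letters, prefix each with "%", join with the separator
  parts <- strsplit(date_format, split = "")[[1]]
  paste0("%", parts, collapse = date_separator)
}

# One date per day, as with increase_date_by = "day"
dates <- seq.Date(
  from = as.Date("2022-01-01"),
  to = as.Date("2022-01-03"),
  by = "day"
)

date_urls <- paste0(
  "https://example.com/archive/",
  format(dates, expand_date_format("Ymd", date_separator = "-"))
)

date_urls
```

In this sketch, "Ymd" with separator "-" expands to "%Y-%m-%d", and "dmY" with separator "." expands to "%d.%m.%Y".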

Examples

cas_build_urls(
  url = "https://www.example.com/news/",
  start_page = 1,
  end_page = 10
)
#> # A tibble: 10 × 3
#>       id url                             index_group
#>    <dbl> <chr>                           <chr>      
#>  1     1 https://www.example.com/news/1  index      
#>  2     2 https://www.example.com/news/2  index      
#>  3     3 https://www.example.com/news/3  index      
#>  4     4 https://www.example.com/news/4  index      
#>  5     5 https://www.example.com/news/5  index      
#>  6     6 https://www.example.com/news/6  index      
#>  7     7 https://www.example.com/news/7  index      
#>  8     8 https://www.example.com/news/8  index      
#>  9     9 https://www.example.com/news/9  index      
#> 10    10 https://www.example.com/news/10 index      

cas_build_urls(
  url = "https://example.com/news/?skip=",
  start_page = 0,
  end_page = 100,
  increase_by = 10
)
#> # A tibble: 11 × 3
#>       id url                                index_group
#>    <dbl> <chr>                              <chr>      
#>  1     1 https://example.com/news/?skip=0   index      
#>  2     2 https://example.com/news/?skip=10  index      
#>  3     3 https://example.com/news/?skip=20  index      
#>  4     4 https://example.com/news/?skip=30  index      
#>  5     5 https://example.com/news/?skip=40  index      
#>  6     6 https://example.com/news/?skip=50  index      
#>  7     7 https://example.com/news/?skip=60  index      
#>  8     8 https://example.com/news/?skip=70  index      
#>  9     9 https://example.com/news/?skip=80  index      
#> 10    10 https://example.com/news/?skip=90  index      
#> 11    11 https://example.com/news/?skip=100 index      


cas_build_urls(
  url = "https://example.com/archive/",
  start_date = "2022-01-01",
  end_date = "2022-12-31",
  date_separator = "-"
) %>%
  head()
#> # A tibble: 6 × 3
#>      id url                                    index_group
#>   <dbl> <chr>                                  <chr>      
#> 1     1 https://example.com/archive/2022-01-01 index      
#> 2     2 https://example.com/archive/2022-01-02 index      
#> 3     3 https://example.com/archive/2022-01-03 index      
#> 4     4 https://example.com/archive/2022-01-04 index      
#> 5     5 https://example.com/archive/2022-01-05 index      
#> 6     6 https://example.com/archive/2022-01-06 index      

cas_build_urls(
  url = "https://example.com/archive/?from={here}&to={here}",
  glue = TRUE,
  start_date = "2011-01-01",
  end_date = "2022-12-31",
  date_separator = ".",
  date_format = "dmY",
  index_group = "news"
)
#> # A tibble: 4,383 × 3
#>       id url                                                        index_group
#>    <dbl> <chr>                                                      <chr>      
#>  1     1 https://example.com/archive/?from=01.01.2011&to=01.01.2011 news       
#>  2     2 https://example.com/archive/?from=02.01.2011&to=02.01.2011 news       
#>  3     3 https://example.com/archive/?from=03.01.2011&to=03.01.2011 news       
#>  4     4 https://example.com/archive/?from=04.01.2011&to=04.01.2011 news       
#>  5     5 https://example.com/archive/?from=05.01.2011&to=05.01.2011 news       
#>  6     6 https://example.com/archive/?from=06.01.2011&to=06.01.2011 news       
#>  7     7 https://example.com/archive/?from=07.01.2011&to=07.01.2011 news       
#>  8     8 https://example.com/archive/?from=08.01.2011&to=08.01.2011 news       
#>  9     9 https://example.com/archive/?from=09.01.2011&to=09.01.2011 news       
#> 10    10 https://example.com/archive/?from=10.01.2011&to=10.01.2011 news       
#> # ℹ 4,373 more rows
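
The effect of glue = TRUE can be approximated by hand: every occurrence of {here} in the url is replaced with the same varying value. The following sketch uses plain base R substitution rather than the glue package, so it only illustrates the idea and is not how the function is implemented; the names template, values, and glued_urls are made up for this example.

```r
# Url template with a repeated placeholder, as expected when glue = TRUE
template <- "https://example.com/archive/?from={here}&to={here}"

# Three consecutive dates formatted as day.month.year
values <- format(
  seq.Date(from = as.Date("2011-01-01"), by = "day", length.out = 3),
  "%d.%m.%Y"
)

# Replace every literal "{here}" in the template with each value in turn
glued_urls <- vapply(
  values,
  function(x) gsub("{here}", x, template, fixed = TRUE),
  character(1),
  USE.NAMES = FALSE
)

glued_urls
```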