Searches common locations (namely example.com/sitemap.xml and example.com/sitemap_index.xml), then robots.txt, and returns the URL of the sitemap, along with the contents of the sitemap itself, if found.
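
The search order described above can be illustrated with a minimal base R sketch. Both helpers below (`candidate_sitemap_urls()` and `sitemaps_from_robots()`) are hypothetical names introduced for illustration only; they are not part of castarter and do not necessarily reflect how cas_get_sitemap() is implemented.

```r
# Hypothetical helper (not part of castarter): build the two common
# sitemap locations for a given domain.
candidate_sitemap_urls <- function(domain) {
  base <- sub("/+$", "", domain) # drop any trailing slashes
  paste0(base, c("/sitemap.xml", "/sitemap_index.xml"))
}

# Hypothetical helper (not part of castarter): extract the URLs declared
# on "Sitemap:" lines of a robots.txt file, given as a character vector.
sitemaps_from_robots <- function(robots_lines) {
  pattern <- "^[[:space:]]*sitemap[[:space:]]*:[[:space:]]*"
  hits <- grep(pattern, robots_lines, ignore.case = TRUE, value = TRUE)
  trimws(sub(pattern, "", hits, ignore.case = TRUE))
}

candidate_sitemap_urls("https://example.com/")

sitemaps_from_robots(c(
  "User-agent: *",
  "Sitemap: https://example.com/sitemap_index.xml"
))
```

Neither helper makes a network request; they only show which URLs would be worth checking and how a sitemap entry in robots.txt can be recognised.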

Usage

cas_get_sitemap(
  domain = NULL,
  sitemap_url = NULL,
  check_robots = TRUE,
  check_common = TRUE,
  read_from_db = TRUE,
  write_to_db = FALSE,
  db_connection = NULL,
  disconnect_db = FALSE,
  ...
)

Arguments

domain

Defaults to NULL, but required unless sitemap_url given. Expected to be a full domain name. If input does not start with http, then https:// is prepended automatically.

sitemap_url

Defaults to NULL. If given, domain is ignored.

db_connection

Defaults to NULL. If NULL, a local SQLite database is used. If given, must be a connection object or a list with relevant connection settings (see example).

...

Passed to cas_get_db_file().
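
The handling of the domain argument described above (prepending https:// when the input does not start with http) can be sketched in base R as follows. `normalise_domain()` is a hypothetical helper written for illustration, not a castarter function, and the actual internal logic may differ.

```r
# Hypothetical helper (not part of castarter): prepend "https://" when the
# input does not already start with "http".
normalise_domain <- function(domain) {
  if (!startsWith(domain, "http")) {
    domain <- paste0("https://", domain)
  }
  domain
}

normalise_domain("www.europeandatajournalism.eu")
normalise_domain("http://example.com")
```

Under this assumption, an input that already starts with http, including plain http://, is left unchanged.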

Value

A data frame including a sitemap_url column, the response as an httr2 object, and the body of the XML.

Examples

if (interactive()) {
  cas_get_sitemap(domain = "https://www.europeandatajournalism.eu/")
}
#>  Folder /tmp/RtmpSOhy4K/R/castarter_data for storing project and website files
#>   created.
#> # A tibble: 1 × 1
#>   sitemap_url                                            
#>   <chr>                                                  
#> 1 https://www.europeandatajournalism.eu/sitemap_index.xml