
Extract direct links to individual content pages from index pages

Usage

cas_extract_links(
  id = NULL,
  batch = "latest",
  domain = NULL,
  index = TRUE,
  index_group = NULL,
  output_index = FALSE,
  output_index_group = NULL,
  include_when = NULL,
  exclude_when = NULL,
  container = NULL,
  container_class = NULL,
  container_id = NULL,
  custom_xpath = NULL,
  custom_css = NULL,
  match = NULL,
  min_length = NULL,
  max_length = NULL,
  attribute_type = "href",
  append_string = NULL,
  remove_string = NULL,
  write_to_db = FALSE,
  file_format = "html",
  keep_only_within_domain = TRUE,
  sample = FALSE,
  check_previous = TRUE,
  check_again = FALSE,
  encoding = "UTF-8",
  reverse_order = FALSE,
  db_connection = NULL,
  disconnect_db = TRUE,
  ...
)

Arguments

id

Defaults to NULL. If provided, it should be a vector of integers. Only the html files corresponding to the given ids are processed.

domain

Defaults to NULL. Web domain of the website. It is added at the beginning of each link found. If links in the page already include the full web address, leave this argument unset.

output_index

Defaults to FALSE. If FALSE, new links are added to the contents table. If TRUE, the links extracted will be stored again as index, using output_index_group as index_group.

output_index_group

Defaults to NULL. Relevant only when output_index is set to TRUE. Used to store new index urls in the database with reference to the appropriate group.

include_when

Part of the URL found only in links to the individual articles to be downloaded. If more than one string is provided, all links that contain any of the given strings are included.

exclude_when

If a URL includes this string, it is excluded from the output. One or more strings may be provided.
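
The combined effect of include_when and exclude_when can be sketched in base R. This is only an illustration of the behaviour described above, not the package's implementation: the links are made up, contains_any() is a hypothetical helper, and plain fixed-string matching is assumed (the function itself may treat these strings differently, for example as regular expressions).

```r
# Hypothetical links, for illustration only
links <- c(
  "https://www.example.com/news/2023/article-1",
  "https://www.example.com/news/2023/article-2",
  "https://www.example.com/tag/politics",
  "https://www.example.com/news/2023/article-3?print=1"
)

include_when <- c("/news/", "/stories/")
exclude_when <- c("print=1")

# Hypothetical helper: TRUE where a link contains at least one of the
# given strings (assumption: fixed-string matching, not regex)
contains_any <- function(x, patterns) {
  Reduce(`|`, lapply(patterns, function(p) grepl(p, x, fixed = TRUE)))
}

# Keep links matching any include_when string and no exclude_when string
kept <- links[contains_any(links, include_when) & !contains_any(links, exclude_when)]
kept
```

Here the tag page is dropped because it matches none of the include_when strings, and the print version is dropped because it matches exclude_when.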

container

Defaults to NULL. Type of html container from which links are to be extracted, such as "div", "ul", and others. Either container_class or container_id must also be provided.

container_class

Defaults to NULL. If provided, container must also be given (and container_id must be NULL). Only links found inside the provided combination of container/class will be extracted.

container_id

Defaults to NULL. If provided, container must also be given (and container_class must be NULL). Only links found inside the provided combination of container/id will be extracted.
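
To make the container/class and container/id combinations more concrete, the sketch below builds the kind of XPath expression such a combination roughly corresponds to. The helper links_xpath() and the class and id values are hypothetical, and the selector the package actually builds internally may differ (for instance in how it handles elements with multiple classes).

```r
# Hypothetical helper: builds an XPath selecting <a> elements inside a
# container identified either by class or by id (never both).
links_xpath <- function(container, container_class = NULL, container_id = NULL) {
  if (!is.null(container_class) && !is.null(container_id)) {
    stop("Provide either container_class or container_id, not both.")
  }
  if (!is.null(container_class)) {
    # Assumption: exact match on the class attribute
    sprintf("//%s[@class='%s']//a", container, container_class)
  } else if (!is.null(container_id)) {
    sprintf("//%s[@id='%s']//a", container, container_id)
  } else {
    stop("Either container_class or container_id must be provided.")
  }
}

xpath_by_class <- links_xpath("div", container_class = "article-list")
xpath_by_id <- links_xpath("ul", container_id = "archive")
```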

custom_xpath

Defaults to NULL. If given, all other parameters are ignored and the given XPath is used instead.

match

Defaults to NULL. Used when extracting links from json files. Name of the property from which the url is to be extracted. N.B. Only partly implemented; please report issues along with the specific example where they emerged.

min_length

If a link is shorter than the number of characters given in min_length, it is excluded from the output.

max_length

If a link is longer than the number of characters given in max_length, it is excluded from the output.
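
The length filters can be thought of as a simple check on the number of characters in each link. A minimal base R sketch, using made-up links and thresholds (not the package's own code):

```r
# Hypothetical links: a bare homepage, a regular article, and an article
# link carrying long tracking parameters
links <- c(
  "https://www.example.com/",
  "https://www.example.com/news/article-1",
  "https://www.example.com/news/article-1?utm_source=newsletter&utm_medium=email"
)

min_length <- 30
max_length <- 60

# Drop links shorter than min_length or longer than max_length
n_char <- nchar(links)
kept <- links[n_char >= min_length & n_char <= max_length]
kept
```

Only the regular article link falls within both thresholds, which is why these arguments are handy for dropping homepage links or overly long tracking URLs.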

attribute_type

Defaults to "href". Type of attribute to extract from links.

append_string

If provided, appends the given string to the extracted links. Typically used to create links for print or mobile versions of the extracted page.

remove_string

If provided, removes the given string (or strings) from links.
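
The effect of remove_string and append_string on extracted links can be sketched as follows. The links and strings are made up, and two things are assumptions rather than documented behaviour: that strings are matched literally, and that removal happens before appending.

```r
# Hypothetical links, for illustration only
links <- c(
  "https://www.example.com/news/article-1?ref=home",
  "https://www.example.com/news/article-2"
)

remove_string <- c("?ref=home")
append_string <- "?print=1"

# Remove each given string (assumption: literal, not regex, matching)
cleaned <- links
for (s in remove_string) {
  cleaned <- gsub(s, "", cleaned, fixed = TRUE)
}

# Append the given string, e.g. to point to a print version of each page
final_links <- paste0(cleaned, append_string)
final_links
```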

write_to_db

Logical, defaults to FALSE. If TRUE, stores newly extracted links in the database, associates each of them with an id, and records the source for each link.

keep_only_within_domain

Logical, defaults to TRUE. If TRUE, and domain given, links to external websites are dropped.

check_previous

Defaults to TRUE. If TRUE, checks whether newly found links are already stored in the database and, if they are, discards them. If FALSE, and write_to_db is also set to FALSE, it does not check for previously stored links.
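
Conceptually, checking for previously stored links amounts to a set difference between the links just found and those already in the database. A minimal sketch with made-up vectors standing in for the database contents (the package's actual lookup is done against its database and may differ):

```r
# Hypothetical links already stored in the database
stored_links <- c(
  "https://www.example.com/news/article-1",
  "https://www.example.com/news/article-2"
)

# Hypothetical links found in the index pages just processed
found_links <- c(
  "https://www.example.com/news/article-2",
  "https://www.example.com/news/article-3",
  "https://www.example.com/news/article-3"
)

# Keep only links not stored before; setdiff() also drops duplicates
new_links <- setdiff(found_links, stored_links)
new_links
```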

check_again

Defaults to FALSE. If FALSE, files from which at least one link has already been extracted are not re-processed. If TRUE, they are processed again. By default, only new links are then actually included in the output or stored in the local database.

reverse_order

Logical, defaults to FALSE. If TRUE, index files are processed in reverse order of id and batch, which may give more meaningful order to content id. The difference is ultimately cosmetic, and has no substantive impact either way.

db_connection

Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example).

disconnect_db

Defaults to TRUE. If FALSE, leaves the connection to database open.

...

Passed to cas_get_db_file().

Value

A data frame.

Examples

if (FALSE) {
links <- cas_extract_links(domain = "http://www.example.com/")
}