Extract direct links to individual content pages from index pages
Source: R/cas_extract_links.R
Extract direct links to individual content pages from index pages
Usage
cas_extract_links(
id = NULL,
batch = "latest",
domain = NULL,
index = TRUE,
index_group = NULL,
output_index = FALSE,
output_index_group = NULL,
include_when = NULL,
exclude_when = NULL,
container = NULL,
container_class = NULL,
container_id = NULL,
custom_xpath = NULL,
custom_css = NULL,
match = NULL,
min_length = NULL,
max_length = NULL,
attribute_type = "href",
append_string = NULL,
remove_string = NULL,
write_to_db = FALSE,
file_format = "html",
keep_only_within_domain = TRUE,
sample = FALSE,
check_previous = TRUE,
check_again = FALSE,
encoding = "UTF-8",
reverse_order = FALSE,
db_connection = NULL,
disconnect_db = TRUE,
...
)
Arguments
- id
Defaults to NULL. If provided, it should be a vector of integers. Only HTML files corresponding to the given ids will be processed.
- domain
Defaults to NULL. Web domain of the website. It is added at the beginning of each link found. If links in the page already include the full web address, this argument can be ignored.
- output_index
Defaults to FALSE. If FALSE, new links are added to the contents table. If TRUE, the links extracted will be stored again as index, using output_index_group as index_group.
- output_index_group
Defaults to NULL. Relevant only when output_index is set to TRUE. Used to store new index URLs in the database with reference to the appropriate group.
- include_when
Part of the URL found only in links to individual articles to be downloaded. If more than one string is provided, all links that contain any of them are included.
- exclude_when
If a URL includes this string, it is excluded from the output. One or more strings may be provided.
- container
Defaults to NULL. Type of HTML container from which links are to be extracted, such as "div", "ul", and others. Either container_class or container_id must also be provided.
- container_class
Defaults to NULL. If provided, container must also be given (and container_id must be NULL). Only links found inside the provided combination of container/class will be extracted.
- container_id
Defaults to NULL. If provided, container must also be given (and container_class must be NULL). Only links found inside the provided combination of container/id will be extracted.
- custom_xpath
Defaults to NULL. If given, all other parameters are ignored and the given XPath is used instead.
- match
Defaults to NULL. Used when extracting links from JSON files. Name of the property from which the URL is to be extracted. N.B. Only partly implemented; please report issues along with a specific example of where they emerged.
- min_length
If a link is shorter than the number of characters given in min_length, it is excluded from the output.
- max_length
If a link is longer than the number of characters given in max_length, it is excluded from the output.
- attribute_type
Defaults to "href". Type of attribute to extract from links.
- append_string
If provided, appends the given string to the extracted links. Typically used to create links for print or mobile versions of the extracted page.
- remove_string
If provided, removes the given string (or strings) from links.
- write_to_db
Logical, defaults to FALSE. If TRUE, stores newly extracted links in the database, associates each of them with an id, and records the source for each link.
- keep_only_within_domain
Logical, defaults to TRUE. If TRUE and domain is given, links to external websites are dropped.
- check_previous
Defaults to TRUE. If TRUE, checks whether newly found links are already stored in the database and, if they are, discards them. If FALSE, and write_to_db is also set to FALSE, it does not check for previously stored links.
- check_again
Defaults to FALSE. If FALSE, files from which at least one link has already been extracted are not re-processed. If TRUE, they are processed again. By default, only new links are then actually included in the output or stored in the local database.
- reverse_order
Logical, defaults to FALSE. If TRUE, index files are processed in reverse order of id and batch, which may give a more meaningful order to content ids. The difference is ultimately cosmetic, and has no substantive impact either way.
- db_connection
Defaults to NULL. If NULL, uses local SQLite database. If given, must be a connection object or a list with relevant connection settings (see example).
- disconnect_db
Defaults to TRUE. If FALSE, leaves the connection to database open.
- ...
Passed to cas_get_db_file().
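
The link-filtering arguments described above (domain, keep_only_within_domain, include_when, exclude_when, min_length, max_length, remove_string, append_string) can be easier to reason about when applied to a plain character vector. What follows is a minimal base R sketch, not the package's implementation: the helper names filter_links() and contains_any(), the example URLs, and the order in which the steps are applied are all assumptions made for illustration.

```r
# Hypothetical helpers, NOT part of the package: they only mimic the
# documented behaviour of the filtering arguments on a character vector.

# TRUE for each element of x that contains at least one of the patterns
contains_any <- function(x, patterns) {
  Reduce(`|`, lapply(patterns, function(p) grepl(p, x, fixed = TRUE)))
}

filter_links <- function(links,
                         domain = NULL,
                         include_when = NULL,
                         exclude_when = NULL,
                         min_length = NULL,
                         max_length = NULL,
                         remove_string = NULL,
                         append_string = NULL,
                         keep_only_within_domain = TRUE) {
  if (!is.null(domain)) {
    # Prepend the domain to relative links only (assumption: absolute
    # links start with http:// or https://)
    relative <- !grepl("^https?://", links)
    links[relative] <- paste0(domain, links[relative])
    # Drop links that point outside the given domain (simplified check)
    if (keep_only_within_domain) {
      links <- links[startsWith(links, domain)]
    }
  }
  # Keep links containing any of the include_when strings
  if (!is.null(include_when)) {
    links <- links[contains_any(links, include_when)]
  }
  # Drop links containing any of the exclude_when strings
  if (!is.null(exclude_when)) {
    links <- links[!contains_any(links, exclude_when)]
  }
  # Drop links shorter than min_length or longer than max_length
  if (!is.null(min_length)) links <- links[nchar(links) >= min_length]
  if (!is.null(max_length)) links <- links[nchar(links) <= max_length]
  # Strip unwanted substrings, then append the optional suffix
  for (s in remove_string) {
    links <- gsub(s, "", links, fixed = TRUE)
  }
  if (!is.null(append_string)) links <- paste0(links, append_string)
  unique(links)
}

# Made-up links, as they might appear on an index page
links <- c(
  "/news/2023/first-article",
  "/news/2023/second-article?utm=feed",
  "/tag/politics",
  "https://other.example.org/news/2023/x",
  "/news/"
)

result <- filter_links(
  links,
  domain = "https://example.com",
  include_when = "/news/",
  exclude_when = "/tag/",
  min_length = 30,
  remove_string = "?utm=feed",
  append_string = "?print=1"
)
print(result)
```

In this sketch the external link and the bare "/news/" section link are dropped, and the two article links come back with the tracking parameter removed and the print suffix added. The real function may apply these steps in a different order, so results on edge cases could differ.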
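
The container, container_class, container_id and custom_xpath arguments together select the part of the page from which links are read. One way to picture this is as the construction of a single XPath expression. The sketch below is an assumption about how such an expression could be built, not the package's actual code: build_container_xpath() is a hypothetical helper, it matches the class attribute exactly, and it treats container_class and container_id as mutually exclusive.

```r
# Hypothetical helper, NOT part of the package: turns the container
# arguments into an XPath expression that selects <a> elements.
build_container_xpath <- function(container = NULL,
                                  container_class = NULL,
                                  container_id = NULL,
                                  custom_xpath = NULL) {
  # A custom XPath takes precedence over the container arguments
  if (!is.null(custom_xpath)) {
    return(custom_xpath)
  }
  # Without a container, consider every link in the page
  if (is.null(container)) {
    return("//a")
  }
  if (!is.null(container_class) && !is.null(container_id)) {
    stop("Provide either container_class or container_id, not both.")
  }
  if (!is.null(container_class)) {
    # Assumption: exact match on the full class attribute
    return(sprintf("//%s[@class='%s']//a", container, container_class))
  }
  if (!is.null(container_id)) {
    return(sprintf("//%s[@id='%s']//a", container, container_id))
  }
  stop("Either container_class or container_id must also be provided.")
}

xpath_class <- build_container_xpath("div", container_class = "article-list")
xpath_id <- build_container_xpath("ul", container_id = "archive")
xpath_custom <- build_container_xpath("div", custom_xpath = "//main//h2/a")
print(c(xpath_class, xpath_id, xpath_custom))
```

An expression of this kind could then be passed to an XPath-aware parser such as xml2::xml_find_all(), with the attribute named in attribute_type read from each matched node.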