Title: | Load WARC Files into Apache Spark |
---|---|
Description: | Load WARC (Web ARChive) files into Apache Spark using 'sparklyr'. This allows reading files from the Common Crawl project <http://commoncrawl.org/>. |
Authors: | Javier Luraschi [aut], Yitao Li [aut], Edgar Ruiz [aut, cre] |
Maintainer: | Edgar Ruiz <[email protected]> |
License: | Apache License 2.0 |
Version: | 0.1.6 |
Built: | 2024-11-12 04:25:48 UTC |
Source: | https://github.com/r-spark/sparkwarc |
Provides WARC paths for commoncrawl.org. To be used with spark_read_warc.
cc_warc(start, end = start)
start |
The first path to retrieve. |
end |
The last path to retrieve. |
cc_warc(1)
cc_warc(2, 3)
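Since the description says these paths are meant to be used with spark_read_warc, the two calls can be combined roughly as follows. This is a minimal sketch, not the package's documented workflow: the helper name read_first_cc_warc and the table name "cc_first" are hypothetical, and it assumes the value returned by cc_warc(1) can be passed directly as path and that the cluster is configured to reach that location.

```r
# Hypothetical helper: read the first Common Crawl WARC path into Spark.
# `sc` is an active Spark connection, e.g. from sparklyr::spark_connect().
read_first_cc_warc <- function(sc, table_name = "cc_first") {
  # Assumption: the value returned by cc_warc() is usable as `path` as-is.
  warc_path <- sparkwarc::cc_warc(1)

  sparkwarc::spark_read_warc(
    sc,
    name = table_name,
    path = warc_path,
    memory = FALSE  # avoid caching a potentially large table eagerly
  )
}
```

Calling read_first_cc_warc(sc) requires both 'sparklyr' and 'sparkwarc' to be installed; defining the function does not.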
Loads the sample WARC file using Rcpp.
rcpp_read_warc_sample(filter = "", include = "")
filter |
A regular expression applied to each WARC entry as a filter, evaluated efficiently in native code. |
include |
A regular expression used to keep only matching lines, evaluated efficiently in native code. |
Reads a WARC (Web ARChive) file using Rcpp.
spark_rcpp_read_warc(path, match_warc, match_line)
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3n://"’ and ‘"file://"’ protocols. |
match_warc |
Include only WARC files matching this character string. |
match_line |
Include only lines matching this character string. |
Reads a WARC (Web ARChive) file into Apache Spark using sparklyr.
spark_read_warc( sc, name, path, repartition = 0L, memory = TRUE, overwrite = TRUE, match_warc = "", match_line = "", parser = c("r", "scala"), ... )
sc |
An active Spark connection (for example, one created with spark_connect()). |
name |
The name to assign to the newly generated table. |
path |
The path to the file. Needs to be accessible from the cluster. Supports the ‘"hdfs://"’, ‘"s3n://"’ and ‘"file://"’ protocols. |
repartition |
The number of partitions used to distribute the generated table. Use 0 (the default) to avoid partitioning. |
memory |
Boolean; should the data be loaded eagerly into memory? (That is, should the table be cached?) |
overwrite |
Boolean; overwrite the table with the given name if it already exists? |
match_warc |
Include only WARC files matching this character string. |
match_line |
Include only lines matching this character string. |
parser |
Which parser implementation to use. Options are "scala" or "r" (default). |
... |
Additional arguments reserved for future use. |
## Not run:
library(sparklyr)
library(sparkwarc)
sc <- spark_connect(master = "local")
sdf <- spark_read_warc(
  sc,
  name = "sample_warc",
  path = system.file(file.path("samples", "sample.warc"), package = "sparkwarc"),
  memory = FALSE,
  overwrite = FALSE
)
spark_disconnect(sc)
## End(Not run)
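The match_line argument described above can narrow what is loaded. The following is a minimal sketch under stated assumptions: the helper name count_matching_rows, the table name "sample_filtered", and the pattern "<html" are all hypothetical, and it assumes spark_read_warc returns a Spark table reference that sparklyr::sdf_nrow() can count, as the name sdf in the example suggests.

```r
# Hypothetical helper: load only lines matching `pattern` from the bundled
# sample file and return the number of rows in the resulting Spark table.
# `sc` is an active Spark connection, e.g. from sparklyr::spark_connect().
count_matching_rows <- function(sc, pattern = "<html") {
  sample_path <- system.file(
    file.path("samples", "sample.warc"),
    package = "sparkwarc"
  )

  sdf <- sparkwarc::spark_read_warc(
    sc,
    name = "sample_filtered",
    path = sample_path,
    match_line = pattern,  # keep only lines matching this string
    memory = FALSE,
    overwrite = TRUE       # replace the table if it already exists
  )

  # Assumption: `sdf` is a Spark table reference accepted by sdf_nrow().
  sparklyr::sdf_nrow(sdf)
}
```

Comparing count_matching_rows(sc, "") with a narrower pattern is one way to see how much the filter reduces the loaded data.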
Loads the sample WARC file into Spark.
spark_read_warc_sample(sc, filter = "", include = "")
sc |
An active Spark connection (for example, one created with spark_connect()). |
filter |
A regular expression applied to each WARC entry as a filter, evaluated efficiently in native code. |
include |
A regular expression used to keep only matching lines, evaluated efficiently in native code. |
Retrieves the path to the sample WARC file.
spark_warc_sample_path()