Importing time series data from R into sparklyr.flint is straightforward and is best illustrated with a few small examples.
First, establish a Spark connection by calling sparklyr::spark_connect, either to a Spark cluster in YARN client mode or to Spark in local mode.
For those unfamiliar with Spark connections, chapter 7 of “Mastering Spark with R” by Javier Luraschi, Kevin Kuo, and Edgar Ruiz contains some very helpful explanations of several modes of connecting to Spark from sparklyr.
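The two connection modes mentioned above can be sketched as follows. This is a minimal illustration rather than a recommended configuration: the YARN variant is left commented out because it assumes the environment (for example SPARK_HOME and the Hadoop configuration directory) is already set up for the cluster, which may not hold on a given machine.

```r
library(sparklyr)

# Option 1: connect to a Spark cluster in YARN client mode.
# Assumption: SPARK_HOME and the Hadoop/YARN configuration are already
# available in the environment, so no extra arguments are passed here.
# sc <- spark_connect(master = "yarn-client")

# Option 2: connect to Spark in local mode (requires a local Spark
# installation, e.g. one installed via sparklyr::spark_install()).
sc <- spark_connect(master = "local")
```

The resulting connection object sc is what the later import steps take as their first argument.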
Next, the time series data needs to be imported into a Spark dataframe. If the data source is a file on disk, this can be accomplished with functions such as sparklyr::spark_read_csv or sparklyr::spark_read_json. Alternatively, if the data is already in an R dataframe, sparklyr::copy_to can be used, e.g.,
example_time_series_data <- data.frame(
  t = c(1, 3, 4, 6, 7, 10, 15, 16, 18, 19),
  v = c(4, -2, NA, 5, NA, 1, -4, 5, NA, 3)
)
sdf <- copy_to(sc, example_time_series_data, overwrite = TRUE)
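If the data lives in a file on disk instead, a reader such as sparklyr::spark_read_csv can produce an equivalent Spark dataframe. The sketch below is only illustrative: the temporary file, the table name example_time_series_csv, and the variable sdf_from_csv are hypothetical and exist solely to keep the example self-contained.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Write a small CSV to a temporary file so the example is self-contained.
# In practice `csv_path` would point at an existing data file.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(t = c(1, 3, 4, 6, 7), v = c(4, -2, 0, 5, 1)),
  csv_path,
  row.names = FALSE
)

# Read the file into a Spark dataframe. The table name is arbitrary
# (hypothetical); column types are inferred from the file contents.
sdf_from_csv <- spark_read_csv(
  sc,
  name = "example_time_series_csv",
  path = csv_path,
  header = TRUE,
  infer_schema = TRUE
)
```

On a real cluster the path would normally refer to storage that every Spark worker can reach, rather than a local temporary file.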
Finally, in order to unambiguously interpret the time series data we have provided in sdf so far, the Flint time series library has to be informed about the name and the unit of the time column, and also whether all rows in the Spark dataframe from above are already sorted by time. All of this information is encapsulated in a TimeSeriesRDD object derived from sdf, referred to below as ts_rdd.
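A minimal sketch of this step follows. It assumes that sparklyr.flint exposes a from_sdf function taking is_sorted, time_unit, and time_column arguments, which matches my reading of the package documentation but should be checked against the installed version. It also assumes the values in column t are seconds, which the text does not state.

```r
library(sparklyr)
# Attach sparklyr.flint before connecting so that the Flint extension
# is available to the Spark connection.
library(sparklyr.flint)

sc <- spark_connect(master = "local")

example_time_series_data <- data.frame(
  t = c(1, 3, 4, 6, 7, 10, 15, 16, 18, 19),
  v = c(4, -2, NA, 5, NA, 1, -4, 5, NA, 3)
)
sdf <- copy_to(sc, example_time_series_data, overwrite = TRUE)

# Tell Flint which column holds the time values, what unit they are in
# (assumed here to be seconds), and that rows are already sorted by time.
ts_rdd <- from_sdf(
  sdf,
  is_sorted = TRUE,
  time_unit = "SECONDS",
  time_column = "t"
)
```

Setting is_sorted = TRUE is appropriate here only because the example rows are listed in increasing order of t; for unsorted input it would presumably need to be FALSE.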
At this point, ts_rdd contains all data and metadata necessary for Flint to perform various analyses on example_time_series_data, and the results of those analyses will also be returned to us in separate TimeSeriesRDD objects.