Currently
sparklyr.flint
supports a number of commonly used
summarizers (e.g., count, sum, average, etc) that are implemented in the
Flint time series library. Each summarizer can be either applied to a
moving time window (e.g., in_past(5s)
) or groups of rows
within a TimeSeriesRDD
having the same timestamps (which is
known as a “cycle” in Flint nomenclature).
The following is a quick example of applying the sum summarizer to a moving time window:
library(sparklyr)
library(sparklyr.flint)
# Step 0: decide which Spark version to use, how to connect to Spark, etc
spark_version <- "3.0.0"
sc <- spark_connect(master = "local", version = spark_version)
example_time_series <- data.frame(
t = c(1, 3, 4, 6, 7, 10, 15, 16, 18, 19),
v = c(4, -2, NA, 5, NA, 1, -4, 5, NA, 3)
)
# Step 1: import example time series data into a Spark dataframe
sdf <- copy_to(sc, example_time_series, overwrite = TRUE)
# Step 2: specify how the Spark dataframe should be interpreted as a time series by Flint
ts_rdd <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
# Step 3: apply a Flint summarizer to the time series above
sum <- summarize_sum(ts_rdd, column = "v", window = in_past("3s"))
# Step 4: collect summarized result from Spark to R
res <- ts_sum %>% collect()
print(res)
## # A tibble: 10 x 3
## time v v_sum
## <dttm> <dbl> <dbl>
## 1 1970-01-01 00:00:01 4 4
## 2 1970-01-01 00:00:03 -2 2
## 3 1970-01-01 00:00:04 NaN 2
## 4 1970-01-01 00:00:06 5 3
## 5 1970-01-01 00:00:07 NaN 5
## 6 1970-01-01 00:00:10 1 1
## 7 1970-01-01 00:00:15 -4 -4
## 8 1970-01-01 00:00:16 5 1
## 9 1970-01-01 00:00:18 NaN 1
## 10 1970-01-01 00:00:19 3 8
From the result above, one can see as a result of specifying
window = in_past("3s")
, for each time point t
from example_time_series
(i.e., t = 1
,
t = 3
, t = 4
, t = 6
, and so on),
Flint has created a row containing t
and the summation of
all v
value(s) occurring within the time window of
[t - 3, t]
, and the sums are stored in a new column named
v_sum
.
Given a timestamp t
, the subset of rows in a
TimeSeriesRDD
having that timestamp is known as a “cycle”
in Flint.
If the window = "<time window specification>"
argument is omitted, then the summarizer function will look at all
cycles in the TimeSeriesRDD
. In other words, it will group
all rows by their timestamps and perform aggregation within each
group.
For example:
ts_sum <- summarize_sum(ts_rdd, column = "v")
will return a TimeSeriesRDD
with a timestamp column
named time
and a summation column named v_sum
.
For each timestamp t
present in ts_rdd
,
ts_sum
will contain a row with timestamp t
and
v_sum
value equal to summation of all v
values
occurring at t
.
Because all rows from ts_rdd
are already ordered
internally by timestamps, aggregations on cycles can be performed
efficiently in Flint without re-shuffling rows in the input
TimeSeriesRDD
.