---
title: "Importing time series data"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{importing-time-series-data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Importing time series data from R to sparklyr.flint is straightforward, and is probably best illustrated through some small examples.

Firstly, one needs to establish a Spark connection by calling `sparklyr::spark_connect`, e.g.,

```r
library(sparklyr)

# Connect to a Spark cluster through YARN in client mode
sc <- spark_connect(master = "yarn-client", spark_home = "/usr/lib/spark")
```

to connect to a Spark cluster in YARN client mode, or

```r
library(sparklyr)

# Connect to a Spark instance running locally
sc <- spark_connect(master = "local")
```

to connect to Spark in local mode.

For those unfamiliar with Spark connections, [Chapter 7](https://therinspark.com/connections.html#connections) of “Mastering Spark with R” by Javier Luraschi, Kevin Kuo, and Edgar Ruiz explains the several modes of connecting to Spark from `sparklyr`.

Next, the time series data needs to be imported into a Spark dataframe. This can be accomplished with methods such as `sparklyr::spark_read_csv`, `sparklyr::spark_read_json`, etc., if the data source is a file on disk, e.g.,

```r
# Read a CSV file with a header row into a Spark dataframe
sdf <- spark_read_csv(sc, "/tmp/data.csv", header = TRUE)
```

or alternatively, with `sparklyr::copy_to` if the data is in an R data frame, e.g.,

```r
# `t` holds the timestamps and `v` the observed values (some are missing)
example_time_series_data <- data.frame(
  t = c(1, 3, 4, 6, 7, 10, 15, 16, 18, 19),
  v = c(4, -2, NA, 5, NA, 1, -4, 5, NA, 3)
)

sdf <- copy_to(sc, example_time_series_data, overwrite = TRUE)
```

Finally, in order to interpret the time series data in `sdf` unambiguously, the Flint time series library needs to be told the name and the unit of the time column, and also whether the rows of the Spark dataframe above are already sorted by time.
All of this information is encapsulated in a `TimeSeriesRDD` object derived from `sdf`, as shown below:

```r
# Time column `t` is measured in seconds and rows are already in time order
ts_rdd <- fromSDF(sdf, is_sorted = TRUE, time_unit = "SECONDS", time_column = "t")
```

At this point, `ts_rdd` contains all the data and metadata Flint needs to perform various analyses on `example_time_series_data`. Results from those analyses are also returned in separate `TimeSeriesRDD` objects.
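
Because `is_sorted = TRUE` is an assertion made by the caller, it can be worth checking the time column on the R side before copying the data to Spark. The following is a minimal sketch in base R, not part of sparklyr or sparklyr.flint: the helper name `is_time_sorted` is hypothetical, and the sketch assumes that `copy_to` keeps the rows in the same order as the R data frame, as the example above does.

```r
# Hypothetical helper (not provided by sparklyr or sparklyr.flint):
# returns TRUE only if the time column is in non-decreasing order.
# base::is.unsorted() returns NA when the column contains NA values,
# and isFALSE(NA) is FALSE, so such a column is treated as not sorted.
is_time_sorted <- function(df, time_column) {
  isFALSE(is.unsorted(df[[time_column]]))
}

example_time_series_data <- data.frame(
  t = c(1, 3, 4, 6, 7, 10, 15, 16, 18, 19),
  v = c(4, -2, NA, 5, NA, 1, -4, 5, NA, 3)
)

rows_sorted <- is_time_sorted(example_time_series_data, "t")

# With a live Spark connection `sc`, the flag could then be passed along:
# sdf <- copy_to(sc, example_time_series_data, overwrite = TRUE)
# ts_rdd <- fromSDF(sdf, is_sorted = rows_sorted,
#                   time_unit = "SECONDS", time_column = "t")
```

Computing the flag from the data, rather than hard-coding it, avoids describing unsorted input as sorted when the source data changes.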