Spark-ROOT is a pure-Java (and Scala) implementation of ROOT file reading that presents a collection of ROOT files as a Spark DataFrame. This lets high-energy physicists use Spark without abandoning or converting their existing data.
Motivation
Spark is a useful platform for data analysis. It provides easy access to parallelization and is a focal point for data science tools from industry, particularly machine learning.
However, none of this is useful to high-energy physicists if they cannot get their data into the platform. Spark-ROOT handles that step transparently.
Technique
Spark-ROOT is not a JNI adaptation of ROOT or even a socket or pipe-based bridge. It is a reimplementation of ROOT file reading based on Tony Johnson's FreeHEP project. As a pure-Java project, it is trivial to install (see below) and doesn't suffer from segmentation faults in the JNI or the flimsiness of piping data to an external process.
ROOT's ability to selectively read columns is matched to Spark's "pruned scan", so that a query reads only the columns it actually needs.
ROOT files may be accessed locally or from HDFS.
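To make the pruned-scan idea concrete, here is a toy model in plain Scala. It is not Spark-ROOT's code: `ToyColumnarFile`, `columnsRead`, and the column names `pt`, `eta`, and `phi` are all hypothetical. Only the shape of `buildScan(requiredColumns)` is borrowed from Spark's `PrunedScan` data-source interface, and this page does not say whether Spark-ROOT implements exactly that trait.

```scala
// Toy model of a pruned scan; NOT Spark-ROOT's actual implementation.
// A hypothetical columnar file in which each column is stored and read separately.
final class ToyColumnarFile(columns: Map[String, Vector[Double]]) {
  // Records which columns were actually read, so the pruning is observable.
  val columnsRead = scala.collection.mutable.ListBuffer.empty[String]

  private def readColumn(name: String): Vector[Double] = {
    columnsRead += name
    columns(name)
  }

  // Same shape as Spark's PrunedScan.buildScan(requiredColumns):
  // read only the requested columns, then zip them into rows.
  def buildScan(requiredColumns: Array[String]): Vector[Vector[Double]] = {
    val data = requiredColumns.toVector.map(readColumn)
    if (data.isEmpty) Vector.empty else data.transpose
  }
}

object PrunedScanDemo {
  // Hypothetical column names and values, for illustration only.
  val file = new ToyColumnarFile(Map(
    "pt"  -> Vector(10.0, 20.0, 30.0),
    "eta" -> Vector(0.1, 0.2, 0.3),
    "phi" -> Vector(1.0, 2.0, 3.0)
  ))

  // A query that needs only two of the three columns.
  val rows = file.buildScan(Array("pt", "phi"))
}
```

After the scan, `PrunedScanDemo.file.columnsRead` contains only `pt` and `phi`; the `eta` column is never touched. A columnar format such as ROOT can offer the same saving on real files.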
How to use
When you launch Spark, pass Spark-ROOT's Maven coordinates to load the package on startup.
spark-shell --packages org.diana-hep:spark-root_2.11:0.1.7
Now you can
import org.dianahep.sparkroot._
val df = spark.sqlContext.read.root("path/to/file.root")
and use df as you would any other Spark DataFrame.
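As a rough sketch of what "any other Spark DataFrame" means in practice, the following could be typed into a spark-shell session started with the `--packages` option above. It uses only standard DataFrame methods. The file path is a placeholder, and no column names are assumed because they depend on the contents of the ROOT file.

```scala
// Run inside spark-shell launched with the spark-root package loaded.
import org.dianahep.sparkroot._

// "path/to/file.root" is a placeholder; substitute a real ROOT file.
val df = spark.sqlContext.read.root("path/to/file.root")

// Print the column names and types derived from the file.
df.printSchema()

// Select a single column by name. Here we simply take the first one,
// since the available names depend on the file.
val firstColumn = df.columns.head
df.select(firstColumn).show(5)

// Count the rows in the DataFrame.
println(df.count())
```

Selecting a subset of columns before further processing is the access pattern that benefits from the pruned scan described under Technique.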
For more information
See the GitHub site: http://github.com/diana-hep/spark-root.