The "Data Frame" data structure and compute model has become the go-to data representation for most data scientists. It is found in R, Python Pandas, and Apache Spark SQL, and has the flexibility to perform data manipulation, processing, and interpretation in everything from ad-hoc descriptive analysis to sophisticated machine learning.
LocationTech GeoTrellis provides a powerful set of tools for processing large-scale geospatial imagery data. It has extensive support for Apache Spark through its RDD API. While this API provides sufficient and flexible primitives for engineering software systems around imagery, it is not in a form intuitive to data scientists. The aim of LocationTech RasterFrames™ is to allow data scientists to use the powerful constructs in GeoTrellis but in the context of a Spark DataFrame.
LocationTech RasterFrames is a Scala library (with planned Python bindings) built on top of Apache Spark SQL and GeoTrellis. It also depends on LocationTech's SFCurve and on a subset of GeoMesa. Its focus is to read and write against multiple imagery formats and stores through the Spark SQL DataSource API, and to make their contents available as columns of tiles with associated spatial metadata (such as indexes, keys, extents, projections, and reference systems). Conceptually, the contents of a RasterFrame are akin to the layers in a GIS application (e.g. QGIS), where each layer (or column in the data frame) represents a spectral band or other georectified, rasterized data product. Multiple RasterFrames may be spatially joined with each other, and their rows filtered, columns combined, statistically summarized, or have any number of additional map algebra-like operations applied to them. RasterFrames also interoperates with SparkML, allowing classical machine learning algorithms to be applied to features derived from RasterFrame columns.
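The column-of-tiles model described above can be illustrated with a small, library-free sketch. It is written in plain Python purely for illustration and does not use or reproduce the RasterFrames, GeoTrellis, or Spark APIs. The names `spatial_key`, `spatial_join`, `normalized_difference`, and `tile_mean`, as well as the sample values, are hypothetical. Each row pairs a spatial key with one tile per band, two frames are joined on that key, a cell-wise (local) map algebra operation combines the band columns, and each resulting tile is summarized statistically.

```python
from statistics import mean

# A "tile" here is a small 2-D grid of cell values (a list of rows).
# Each frame row pairs a hypothetical spatial key (col, row) with one tile.
red_frame = [
    {"spatial_key": (0, 0), "red": [[0.1, 0.2], [0.3, 0.4]]},
    {"spatial_key": (0, 1), "red": [[0.2, 0.2], [0.2, 0.2]]},
]
nir_frame = [
    {"spatial_key": (0, 0), "nir": [[0.3, 0.6], [0.3, 0.4]]},
    {"spatial_key": (0, 1), "nir": [[0.6, 0.6], [0.6, 0.6]]},
    {"spatial_key": (1, 1), "nir": [[0.5, 0.5], [0.5, 0.5]]},
]


def spatial_join(left, right):
    """Inner join of two frames on their shared spatial key."""
    right_by_key = {row["spatial_key"]: row for row in right}
    return [
        {**row, **right_by_key[row["spatial_key"]]}
        for row in left
        if row["spatial_key"] in right_by_key
    ]


def normalized_difference(a, b):
    """Cell-wise (a - b) / (a + b); assumes no cell pair sums to zero."""
    return [
        [(x - y) / (x + y) for x, y in zip(row_a, row_b)]
        for row_a, row_b in zip(a, b)
    ]


def tile_mean(tile):
    """Mean of all cells in a tile."""
    return mean(value for row in tile for value in row)


# Join the band columns, derive a new tile column, then summarize per row.
joined = spatial_join(red_frame, nir_frame)
ndvi_frame = [
    {
        "spatial_key": row["spatial_key"],
        "ndvi": normalized_difference(row["nir"], row["red"]),
    }
    for row in joined
]
summary = {row["spatial_key"]: tile_mean(row["ndvi"]) for row in ndvi_frame}
```

In this sketch the row keyed `(1, 1)` drops out of the join because it has no counterpart in the red frame, which mirrors how an inner join on a spatial key would behave in a data frame.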
LocationTech RasterFrames brings the power of Spark DataFrames to geospatial raster data, empowered by the map algebra and tile layer operations of GeoTrellis. The underlying purpose of RasterFrames is to allow data scientists and software developers to process and analyze geospatial-temporal raster data with the same flexibility and ease as any other Spark Catalyst data type. At its core is a user-defined type (UDT) called TileUDT, which encodes a GeoTrellis Tile in a form the Spark Catalyst engine can process. Furthermore, we extend the definition of a DataFrame to encompass some additional invariants, allowing for geospatial operations within and between RasterFrames to occur, while still maintaining necessary geo-referencing constructs.
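The idea behind a user-defined type is that an object is converted to and from a representation the query engine already understands. The following is a minimal sketch of such a round trip, written in plain Python for illustration only. The byte layout (two little-endian 32-bit integers for dimensions followed by 64-bit float cells) and the function names `encode_tile` and `decode_tile` are assumptions made for this example and are not the actual TileUDT schema or the GeoTrellis Tile encoding.

```python
import struct


def encode_tile(cells, cols, rows):
    """Pack a tile into bytes: a (cols, rows) header, then float64 cells."""
    if len(cells) != cols * rows:
        raise ValueError("cell count does not match tile dimensions")
    header = struct.pack("<ii", cols, rows)
    body = struct.pack(f"<{cols * rows}d", *cells)
    return header + body


def decode_tile(blob):
    """Inverse of encode_tile: recover (cells, cols, rows) from bytes."""
    cols, rows = struct.unpack_from("<ii", blob, 0)
    # The header occupies 8 bytes (two 4-byte integers).
    cells = list(struct.unpack_from(f"<{cols * rows}d", blob, 8))
    return cells, cols, rows


# Hypothetical 3x2 tile stored in row-major order.
tile = ([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], 3, 2)
blob = encode_tile(*tile)
restored = decode_tile(blob)
```

Carrying the dimensions alongside the cell data is what lets the decoded value be treated as a grid again rather than as an opaque byte string.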
Additional information can be found at the RasterFrames website: http://rasterframes.io/
LocationTech RasterFrames is built upon three other LocationTech projects: GeoTrellis, SFCurve, and GeoMesa. It can be seen as an extension or repackaging of the tile layer and map algebra functionality in GeoTrellis. The GeoTrellis and GeoMesa communities have already provided invaluable technical and social support, and the RasterFrames team already feels at home with the LocationTech vision and perspective on developing open source software for commercial use. This project would greatly benefit from LocationTech's legal support, market visibility, and support infrastructure.
We are not aware of any issues. We own the (pending) trademark. The only dependencies are those which have already been through the legal review process for other LocationTech projects. All contributions have been made under the Apache 2.0 license.
An initial contribution can be made immediately. We are already releasing alpha-level binary builds and are scheduled to release 0.6.0 in the next month.
While LocationTech RasterFrames is already usable and productive, a 1.0 release is likely many months away in the interest of API stability. Some of the abstractions are still being refined, but the core concepts and constructs have been stable for the last six months.
Future work will focus on a number of fronts. First, we plan to implement Python bindings to make the capability even more accessible to data scientists. We would like to see more efficient and flexible DataSources implemented, including support for cloud-optimized GeoTIFFs. Further work is needed to support spatial joins across LocationTech RasterFrames with different projections, reference systems, gridding, and resolutions. Higher-level support for zonal map algebra operations is under consideration, as are alternative tile memory layouts that would be amenable to GPU-enabled cluster computing. A list of finer-grained tasks and improvements can be found on our issues page: https://github.com/s22s/raster-frames/issues