Scientific computing involves manipulating and processing numerical data in many forms. This data is organized in dedicated data structures to make processing as efficient as possible. This project aims to provide a set of standardized Java-based numerical data structures for scientific computing, as well as general-purpose data structures that map to a wide range of scientific problems. These data structures will be key to the easy integration of Science Working Group projects and tools such as visualization, workflow and scripting frameworks. A common set of structures lowers the barriers to adopting tools within the Science Working Group and the wider community, and will speed up the adoption of new technologies.
The January project provides Java implementations of numerical data structures such as multi-dimensional arrays and matrices, including a Java equivalent to the popular Python NumPy library for n-dimensional array objects. The project also includes general-purpose data classes, structures and pattern realizations (for example, CSG trees) that can be mapped to a wide range of scientific problems while also maintaining metadata about that information.
Implementations scale to large structures that do not fit entirely in memory at once. For example, data structures of up to hundreds of MB generally fit in memory without additional design consideration. Larger structures of many GB or more, however, must be designed so that they can be processed efficiently without loading the entire structure into memory at once. Therefore, features such as metadata, references to data and slicing of data are first-class citizens of this project. The required outcome is for data structures to scale to run on various distributed computing architectures.
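The idea of holding a reference to data and loading only a slice of it can be sketched in plain Java. This is a minimal illustration and not the January API: `RangeLoader`, `LazyVector` and `LazySliceDemo` are hypothetical names, and the loader computes its values on demand as a stand-in for reading a range from a file.

```java
import java.util.Arrays;

/** Hypothetical source of values that can be read range by range, e.g. a file on disk. */
interface RangeLoader {
    /** Total number of elements available from the source. */
    long size();

    /** Reads count elements starting at index start. */
    double[] load(long start, int count);
}

/** Hypothetical lazy 1-D dataset: holds metadata and a reference to the data, not the values. */
final class LazyVector {
    private final String name;        // metadata kept alongside the data reference
    private final RangeLoader loader; // reference to the data

    LazyVector(String name, RangeLoader loader) {
        this.name = name;
        this.loader = loader;
    }

    String name() { return name; }

    long size() { return loader.size(); }

    /** Loads only the elements in [start, stop) into memory. */
    double[] slice(long start, long stop) {
        if (start < 0 || stop > loader.size() || start > stop) {
            throw new IndexOutOfBoundsException("slice [" + start + ", " + stop + ") out of range");
        }
        return loader.load(start, Math.toIntExact(stop - start));
    }
}

final class LazySliceDemo {
    /** Stand-in loader (an assumption for this sketch): element i has the value i * i. */
    static RangeLoader squares(long size) {
        return new RangeLoader() {
            @Override public long size() { return size; }

            @Override public double[] load(long start, int count) {
                double[] out = new double[count];
                for (int i = 0; i < count; i++) {
                    double x = start + i;
                    out[i] = x * x;
                }
                return out;
            }
        };
    }

    public static void main(String[] args) {
        // Ten billion doubles would need about 80 GB if loaded eagerly.
        LazyVector big = new LazyVector("squares", squares(10_000_000_000L));
        // Only the four requested elements are ever materialized.
        double[] window = big.slice(1_000_000L, 1_000_004L);
        System.out.println(big.name() + " " + Arrays.toString(window));
    }
}
```

The memory cost of `slice` depends on the size of the requested window, not on the size of the underlying data, which is the property that lets such a structure describe data far larger than the heap.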
This project will also encapsulate methods for loading, storing and manipulating data. This project is designed to work in headless (non-UI) operation for automated data processing.
January is a set of libraries for handling numerical data in Java. It is inspired in part by NumPy and aims to provide similar functionality.
Why use it?
- Familiar. Provides familiar functionality, especially to NumPy users.
- Robust. Has a test suite and is used heavily in production at Diamond Light Source.
- No more passing double[]. IDataset provides a consistent object on which to base APIs, with significantly improved clarity over using double arrays or similar.
- Optimized. Optimized for speed and getting better all the time.
- Scalable. Allows handling of data sets larger than available memory with "Lazy Datasets".
- Focus on your algorithms. Reusing this library lets you focus on your own code.
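The point about not passing double[] can be illustrated with a small sketch. `Grid` and `GridDemo` below are hypothetical names invented for this example; `Grid` is a deliberately minimal stand-in and not January's IDataset. It shows only how an object that carries its own shape simplifies a method signature.

```java
/** Hypothetical, minimal stand-in for a dataset type; this is not January's IDataset. */
final class Grid {
    private final double[] data; // flat, row-major storage
    private final int rows;
    private final int cols;

    Grid(double[] data, int rows, int cols) {
        if (rows < 0 || cols < 0 || data.length != rows * cols) {
            throw new IllegalArgumentException(
                    "shape " + rows + "x" + cols + " does not match " + data.length + " values");
        }
        this.data = data.clone(); // copy so later changes to the caller's array do not leak in
        this.rows = rows;
        this.cols = cols;
    }

    int rows() { return rows; }

    int cols() { return cols; }

    double get(int row, int col) {
        if (row < 0 || row >= rows || col < 0 || col >= cols) {
            throw new IndexOutOfBoundsException("(" + row + ", " + col + ") outside " + rows + "x" + cols);
        }
        return data[row * cols + col];
    }
}

final class GridDemo {
    /** Raw-array API: the caller must pass the shape separately and keep it in sync. */
    static double rowSum(double[] flat, int cols, int row) {
        double sum = 0;
        for (int c = 0; c < cols; c++) {
            sum += flat[row * cols + c];
        }
        return sum;
    }

    /** Object-based API: the argument carries its own shape and checks its own bounds. */
    static double rowSum(Grid grid, int row) {
        double sum = 0;
        for (int c = 0; c < grid.cols(); c++) {
            sum += grid.get(row, c);
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] flat = {1, 2, 3, 4, 5, 6};
        Grid grid = new Grid(flat, 2, 3);
        System.out.println(rowSum(flat, 3, 1)); // shape passed by hand
        System.out.println(rowSum(grid, 1));    // shape travels with the data
    }
}
```

In the raw-array version a wrong `cols` argument silently reads the wrong elements, while the object-based version rejects a mismatched shape at construction time.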
For a basic example, have a look at the example project: BasicExample.java
Browse through the more advanced examples:
- NumPy Examples show how common NumPy constructs map to Eclipse Datasets.
- Slicing Examples demonstrate slicing, including how to slice a small amount of data out of a dataset too large to fit in memory all at once.
- Error Examples demonstrate applying an error to datasets.
- Iteration Examples demonstrate a few ways to iterate through your datasets.
- Lazy Examples demonstrate how to use datasets which are not entirely loaded in memory.
Common data structures were identified by members of the Eclipse Science Working Group as a fundamental building block for development and integration of scientific tools and technologies.
This project will aim to join the Eclipse Release Train from the Oxygen release.
After the initial contribution, the project will focus on standardisation across other Science Working Group projects, including (but not limited to):
- Integration with the data structures of the Eclipse Advanced Visualisation Project
- Integration with the Triquetrum project
Future work items under consideration include (but are not limited to):
- Loading and storing of datasets
- Processing large data sets in an architecture-aware manner, e.g. on multiple cores or a GPU
- Physical and mathematical units