Scientific computing involves the manipulation and processing of various forms of numerical data. This data is organized in specific data structures in order to make the processing as efficient as possible. This project aims to provide a set of standardized Java-based numerical data structures for scientific computing as well as general purpose data structures that map to a wide range of scientific problems. These data structures will be key to the easy integration of Science Working Group projects and tools such as visualization, workflows and scripting frameworks. A common set of structures reduces barriers to adoption of tools within the Science Working Group and wider community and will speed up the adoption of new technologies.
The January project provides Java implementations of numerical data structures such as multi-dimensional arrays and matrices, including a Java equivalent to the popular Python NumPy library for n-dimensional array objects. The project also includes general purpose data classes, structures and pattern realizations that can be mapped to a wide range of scientific problems while also maintaining metadata about that information, for example CSG trees.
Implementations are scalable to large structures that do not fit entirely in memory at once. For example, data structures of up to hundreds of MB generally fit in memory without additional design consideration. Structures of many GB or more, however, must be designed so that they can be processed efficiently without loading the entire structure into memory at once. Features such as metadata, references to data and slicing of data are therefore first-class citizens of this project. The required outcome is for data structures to scale to run on various distributed computing architectures.
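To make the idea of slicing a structure that is too large for memory concrete, here is a minimal sketch in plain Java. It does not use the January API: BlockSource, LazyGrid and LazySliceDemo are hypothetical names invented for this illustration, and the "file" is simulated by computing values on demand. The point is only that the object holds metadata (the shape) and reads the requested block when asked.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a large on-disk array; not part of the January API. */
interface BlockSource {
    int[] shape();                  // full shape, here {rows, cols}
    double read(int row, int col);  // reads one element from the backing store
}

/** Keeps only a reference to the source until a slice is requested. */
final class LazyGrid {
    private final BlockSource source;

    LazyGrid(BlockSource source) {
        this.source = source;
    }

    int[] shape() {
        return source.shape().clone();
    }

    /** Loads rows [rowStart, rowStop) and columns [colStart, colStop) into memory. */
    double[][] getSlice(int rowStart, int rowStop, int colStart, int colStop) {
        int[] shape = source.shape();
        if (rowStart < 0 || rowStop > shape[0] || rowStart > rowStop
                || colStart < 0 || colStop > shape[1] || colStart > colStop) {
            throw new IndexOutOfBoundsException(
                    "slice outside shape " + Arrays.toString(shape));
        }
        double[][] out = new double[rowStop - rowStart][colStop - colStart];
        for (int r = rowStart; r < rowStop; r++) {
            for (int c = colStart; c < colStop; c++) {
                out[r - rowStart][c - colStart] = source.read(r, c);
            }
        }
        return out;
    }
}

final class LazySliceDemo {
    /** A virtual 100000 x 100000 grid (about 80 GB as doubles), computed on demand. */
    static BlockSource virtualGrid() {
        return new BlockSource() {
            @Override public int[] shape() { return new int[] {100_000, 100_000}; }
            @Override public double read(int row, int col) { return row * 100_000.0 + col; }
        };
    }

    public static void main(String[] args) {
        LazyGrid grid = new LazyGrid(virtualGrid());
        // Only a 2 x 3 block is ever materialised in memory.
        double[][] block = grid.getSlice(50_000, 50_002, 10, 13);
        System.out.println(Arrays.deepToString(block));
    }
}
```

In a real implementation the read step would presumably go to a file or remote store in larger chunks rather than element by element; the sketch keeps it simple to show the separation between metadata and data.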
This project will also encapsulate methods for loading, storing and manipulating data. This project is designed to work in headless (non-UI) operation for automated data processing.
January is a set of libraries for handling numerical data in Java. It is inspired in part by NumPy and aims to provide similar functionality.
Why use it?
- Familiar. Provides familiar functionality, especially to NumPy users.
- Robust. Has a test suite and is used heavily in production at Diamond Light Source.
- No more passing double[]. IDataset provides a consistent object on which to base APIs, with significantly improved clarity over using double arrays or similar.
- Optimized. Optimized for speed and getting better all the time.
- Scalable. Allows handling of data sets larger than available memory with "Lazy Datasets".
- Focus on your algorithms. Reusing this library lets you focus on your own code.
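The "no more passing double[]" point can be illustrated with a small sketch. The ShapedArray class below is a hypothetical, deliberately tiny stand-in written for this example; it is not IDataset and does not reflect January's actual API. It only shows why carrying the shape with the data gives a clearer API than a bare double[] plus separately passed dimensions.

```java
import java.util.Arrays;

/** Illustrative only: a tiny shape-aware array, far simpler than January's IDataset. */
final class ShapedArray {
    private final double[] data;  // row-major flat buffer
    private final int[] shape;

    ShapedArray(double[] data, int... shape) {
        int size = 1;
        for (int dim : shape) {
            size *= dim;
        }
        if (size != data.length) {
            throw new IllegalArgumentException(
                    "shape " + Arrays.toString(shape) + " needs " + size
                    + " elements, got " + data.length);
        }
        this.data = data.clone();
        this.shape = shape.clone();
    }

    int[] shape() {
        return shape.clone();
    }

    /** Returns the element at an n-dimensional position, using row-major order. */
    double get(int... pos) {
        if (pos.length != shape.length) {
            throw new IllegalArgumentException("expected " + shape.length + " indices");
        }
        int offset = 0;
        for (int i = 0; i < shape.length; i++) {
            if (pos[i] < 0 || pos[i] >= shape[i]) {
                throw new IndexOutOfBoundsException("index " + pos[i] + " on axis " + i);
            }
            offset = offset * shape[i] + pos[i];
        }
        return data[offset];
    }
}

final class ShapedArrayDemo {
    /** With a raw double[] the caller must pass the width separately and keep it in sync. */
    static double rawGet(double[] flat, int width, int row, int col) {
        return flat[row * width + col];
    }

    public static void main(String[] args) {
        double[] flat = {1, 2, 3, 4, 5, 6};
        ShapedArray a = new ShapedArray(flat, 2, 3);
        System.out.println(rawGet(flat, 3, 1, 2)); // caller supplies the width
        System.out.println(a.get(1, 2));           // the object knows its own shape
    }
}
```

A mismatched shape is rejected at construction time here, whereas the raw version would silently read the wrong element if the wrong width were passed.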
For a basic example, have a look at the example project: BasicExample.java
Browse through the more advanced examples.
- NumPy Examples show how common NumPy constructs map to Eclipse Datasets.
- Slicing Examples demonstrate slicing, including how to slice a small amount of data out of a dataset too large to fit in memory all at once.
- Error Examples demonstrate applying an error to datasets.
- Iteration Examples demonstrate a few ways to iterate through your datasets.
- Lazy Examples demonstrate how to use datasets which are not entirely loaded in memory.
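As a rough idea of what iterating through an n-dimensional dataset involves, the sketch below walks every index position of a given shape in row-major order (last axis varying fastest), which is also NumPy's default order. PositionWalk is a hypothetical helper written for this illustration and is not taken from the project's Iteration Examples or the January API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PositionWalk {
    /** Lists every index tuple of the given shape in row-major order (last axis fastest). */
    static List<int[]> positions(int... shape) {
        List<int[]> out = new ArrayList<>();
        for (int dim : shape) {
            if (dim <= 0) {
                return out; // an empty axis means there are no positions to visit
            }
        }
        int[] pos = new int[shape.length];
        while (true) {
            out.add(pos.clone());
            int axis = shape.length - 1;
            // Advance like an odometer: bump the last axis, carry into earlier ones.
            while (axis >= 0 && ++pos[axis] == shape[axis]) {
                pos[axis] = 0;
                axis--;
            }
            if (axis < 0) {
                return out; // carried past the first axis, so every position was visited
            }
        }
    }

    public static void main(String[] args) {
        for (int[] p : positions(2, 3)) {
            System.out.println(Arrays.toString(p));
        }
    }
}
```

Collecting all positions into a list is only practical for small shapes; a library would more likely expose an iterator that advances a single position in place.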
Common data structures were identified by members of the Eclipse Science Working Group as a fundamental building block for development and integration of scientific tools and technologies.
Future items of work under consideration include (but are not limited to):
- Loading and storing of datasets
- Processing large data sets in an architecturally aware manner e.g. on multiple cores or a GPU
- Physical and mathematical units
This project will aim to join the Eclipse Release Train from the Oxygen release.
After the initial contribution, the project will focus on standardisation across other Science Working Group projects, including (but not limited to):
- Integration of the data structures of the Eclipse Advanced Visualisation Project
- Integration with Triquetrum Project
This project is of interest to members of the Science Working Group and the following projects:
The initial contribution is expected to be made in the first half of 2016.
The initial contribution consists of two parts:
Part 1:
The numeric data structures are a fork of the Eclipse DAWNSci project that extracts Datasets and their associated mathematical libraries. As per the DAWNSci project proposal:
"The copyright of the initial contribution is held ~100% by Diamond Light Source Ltd. There may be some sections where copyright is held jointly between the European Synchrotron Radiation Facility and Diamond Light Source Ltd. No individual people or other companies own copyright of the initial contribution. Expected future contributions like the implementation of various interfaces will have to be dealt with as they arrive. Currently none are planned where the copyright is not European Synchrotron Radiation Facility and/or Diamond Light Source Ltd."
This part of the initial contribution is made up of three plug-ins:
- org.eclipse.dataset - main code of the project, including the numerical n-dimensional arrays and the mathematics that operate on them.
- org.eclipse.dataset.test - test code for the project
- org.eclipse.dataset.examples - example code and getting started with datasets
All of the dependencies of the initial contribution are libraries that are already part of the Eclipse ecosystem in Orbit:
- org.apache.commons.math3
- org.apache.commons.lang
- org.slf4j.api
- org.junit
The initial contribution is currently actively developed by Diamond Light Source and collaborated on by Diamond Light Source, Kichwa Coders, and European Synchrotron Radiation Facility, among others. In the URLs below the https://github.com/jonahkichwacoders repositories are a fork of the eclipse/dawnsci repository. The fork was created to allow easy refactoring and demonstrate code structure.
Part 2:
The non-numeric data structures are a fork of the Eclipse ICE project.
This part of the initial contribution is made up of these plug-ins:
- org.eclipse.ice.datastructures - main code of the project, including the various components such as ResourceComponent and DataComponent
- org.eclipse.ice.datastructures.test - test code for the project
All of the dependencies of the initial contribution are libraries that are already part of the Eclipse ecosystem in Orbit:
- ca.odell.glazedlists
- org.slf4j.api
- org.junit
The initial contribution is currently actively developed by Oak Ridge National Laboratory.