January

Thursday, January 7, 2016 - 05:21 by Jonah Graham

Basics

This proposal is in the Project Proposal Phase (as defined in the Eclipse Development Process) and is written to declare its intent and scope. We solicit additional participation and input from the community. Please log in and add your feedback in the comments section.

Project

Eclipse January

Parent Project

Eclipse Technology

Proposal State

Created

Background

Scientific computing involves the manipulation and processing of various forms of numerical data. This data is organized in specific data structures in order to make the processing as efficient as possible. This project aims to provide a set of standardized Java-based numerical data structures for scientific computing as well as general purpose data structures that map to a wide range of scientific problems. These data structures will be key to the easy integration of Science Working Group projects and tools such as visualization, workflows and scripting frameworks. A common set of structures reduces barriers to adoption of tools within the Science Working Group and wider community and will speed up the adoption of new technologies.

Scope

The January project provides Java implementations of numerical data structures such as multi-dimensional arrays and matrices, including a Java equivalent to the popular Python NumPy library for n-dimensional array objects. The project also includes general purpose data classes, structures and pattern realizations that can be mapped to a wide range of scientific problems while also maintaining metadata about that information, for example CSG trees.
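As a rough illustration of what an n-dimensional array object involves, the sketch below keeps the values in a flat array and uses a shape plus row-major strides to locate an element. The class name NdArraySketch and its methods are hypothetical and chosen only for this example; they are not the January API.

```java
import java.util.Arrays;

// Hypothetical illustration only: this is NOT the January API.
final class NdArraySketch {
    private final double[] data;  // flat storage in row-major order
    private final int[] shape;    // extent of each dimension
    private final int[] strides;  // elements to skip per unit step in each dimension

    NdArraySketch(double[] data, int... shape) {
        int size = 1;
        for (int extent : shape) {
            size *= extent;
        }
        if (size != data.length) {
            throw new IllegalArgumentException(
                    "shape " + Arrays.toString(shape) + " does not match " + data.length + " elements");
        }
        this.data = data;
        this.shape = shape.clone();
        this.strides = new int[shape.length];
        // The last dimension varies fastest, so its stride is 1.
        int stride = 1;
        for (int d = shape.length - 1; d >= 0; d--) {
            strides[d] = stride;
            stride *= shape[d];
        }
    }

    int[] shape() {
        return shape.clone();
    }

    double get(int... index) {
        if (index.length != shape.length) {
            throw new IllegalArgumentException("expected " + shape.length + " indices");
        }
        int offset = 0;
        for (int d = 0; d < index.length; d++) {
            if (index[d] < 0 || index[d] >= shape[d]) {
                throw new IndexOutOfBoundsException("index " + index[d] + " in dimension " + d);
            }
            offset += index[d] * strides[d];
        }
        return data[offset];
    }

    public static void main(String[] args) {
        // A 2 x 3 array holding 0..5 in row-major order.
        NdArraySketch a = new NdArraySketch(new double[] {0, 1, 2, 3, 4, 5}, 2, 3);
        System.out.println(Arrays.toString(a.shape()) + " -> " + a.get(1, 2));
    }
}
```

Because the object carries its own shape, callers do not need to pass dimensions alongside a bare double[].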

Implementations are scalable to large structures that do not fit entirely in memory at once. Data structures of up to hundreds of MB generally fit in memory without additional design consideration. Larger data structures of many GB or more, however, must be designed so that they can be processed efficiently without loading the entire structure into memory at once. Therefore features such as meta information on data, references to data and slicing of data are first-class citizens of this project. The required outcome is to allow data structures to scale to run on various distributed computing architectures.
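The idea of slicing data without loading the whole structure can be sketched as follows. This is a minimal example under stated assumptions: the class LazyRowsSketch and its row loader callback are hypothetical, the loader stands in for reading from a file, and none of this is the January API.

```java
import java.util.function.IntFunction;

// Hypothetical illustration only: this is NOT the January API.
final class LazyRowsSketch {
    private final int rows;
    private final int columns;
    private final IntFunction<double[]> rowLoader; // loads a single row on demand
    private int rowsLoaded = 0;

    LazyRowsSketch(int rows, int columns, IntFunction<double[]> rowLoader) {
        this.rows = rows;
        this.columns = columns;
        this.rowLoader = rowLoader;
    }

    /** Loads only rows in [start, stop); all other rows are never touched. */
    double[][] sliceRows(int start, int stop) {
        if (start < 0 || stop > rows || start > stop) {
            throw new IndexOutOfBoundsException("bad row range " + start + ":" + stop);
        }
        double[][] out = new double[stop - start][];
        for (int r = start; r < stop; r++) {
            double[] row = rowLoader.apply(r);
            if (row.length != columns) {
                throw new IllegalStateException("row " + r + " has " + row.length + " columns");
            }
            out[r - start] = row;
            rowsLoaded++;
        }
        return out;
    }

    int rowsLoaded() {
        return rowsLoaded;
    }

    public static void main(String[] args) {
        // One million rows are described, but a row is only produced when requested.
        // A real loader would read from storage; this stand-in generates values.
        LazyRowsSketch lazy = new LazyRowsSketch(
                1_000_000, 4, r -> new double[] {r, r + 0.25, r + 0.5, r + 0.75});
        double[][] slice = lazy.sliceRows(500_000, 500_003);
        System.out.println(slice.length + " rows in slice, " + lazy.rowsLoaded() + " rows loaded");
    }
}
```

Here the shape (rows and columns) acts as the meta information, the loader acts as the reference to the data, and only the requested slice is ever held in memory.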

This project will also encapsulate methods for loading, storing and manipulating data. This project is designed to work in headless (non-UI) operation for automated data processing.

Description

January is a set of libraries for handling numerical data in Java. It is inspired in part by NumPy and aims to provide similar functionality.

Why use it?

Familiar. Provides familiar functionality, especially to NumPy users.
Robust. Has a test suite and is used heavily in production at Diamond Light Source.
No more passing double[]. IDataset provides a consistent object on which to base APIs, with significantly improved clarity over using double arrays or similar.
Optimized. Optimized for speed and getting better all the time.
Scalable. Allows handling of data sets larger than available memory with "Lazy Datasets".
Focus on your algorithms. Reusing this library allows you to focus on your own code.

For a basic example, have a look at the example project: BasicExample.java

Browse through the more advanced examples.

NumPy Examples show how common NumPy constructs map to Eclipse Datasets.
Slicing Examples demonstrate slicing, including how to slice a small amount of data out of a dataset too large to fit in memory all at once.
Error Examples demonstrate applying an error to datasets.
Iteration Examples demonstrate a few ways to iterate through your datasets.
Lazy Examples demonstrate how to use datasets which are not entirely loaded in memory.

Licenses

Eclipse Public License 1.0

Why Here?

Common data structures were identified by members of the Eclipse Science Working Group as a fundamental building block for development and integration of scientific tools and technologies.

Future Work

Future items of work under consideration include (but are not limited to):

Loading and storing of datasets
Processing large data sets in an architecturally aware manner e.g. on multiple cores or a GPU
Physical and mathematical units
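To give a flavour of what processing on multiple cores could look like, the sketch below splits an array into fixed-size chunks and sums them with a parallel stream from the Java standard library. The class ChunkedSumSketch and the chunk size are assumptions made for this example and do not describe a planned design.

```java
import java.util.stream.IntStream;

// Hypothetical illustration only: not part of the proposed project.
final class ChunkedSumSketch {

    /** Sums the data by splitting it into fixed-size chunks that may run on several cores. */
    static double parallelSum(double[] data, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        int chunks = (data.length + chunkSize - 1) / chunkSize; // ceiling division
        return IntStream.range(0, chunks)
                .parallel()
                .mapToDouble(c -> {
                    int start = c * chunkSize;
                    int end = Math.min(start + chunkSize, data.length);
                    double partial = 0;
                    for (int i = start; i < end; i++) {
                        partial += data[i];
                    }
                    return partial;
                })
                .sum();
    }

    public static void main(String[] args) {
        double[] data = new double[1_000];
        for (int i = 0; i < data.length; i++) {
            data[i] = i + 1; // values 1..1000
        }
        System.out.println(parallelSum(data, 128));
    }
}
```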

Project Scheduling

This project will aim to join the Eclipse Release Train from the Oxygen release.

After the initial contribution, the project will focus on standardisation across other Science Working Group projects, including (but not limited to):

Integration of data structures of Eclipse Advanced Visualisation Project
Integration with Triquetrum Project

People

Project Leads

Committers

Tracy Miranda (This committer does not have an Eclipse Account)

Alex McCaskey

Jay Jay Billings

Matthew Gerring (This committer does not have an Eclipse Account)

Interested Parties

This project is of interest to members of the Science Working Group and the following projects:

Source Code

Initial Contribution

The initial contribution is expected to be made in the first half of 2016.

The initial contribution consists of two parts:

Part 1:

The numeric data structures are a fork of the Eclipse DAWNSci project that extracts Datasets and their associated mathematical libraries. As per the DAWNSci project proposal:

"The copyright of the initial contribution is held ~100% by Diamond Light Source Ltd. There may be some sections where copyright is held jointly between the European Synchrotron Radiation Facility and Diamond Light Source Ltd. No individual people or other companies own copyright of the initial contribution. Expected future contributions like the implementation of various interfaces will have to be dealt with as they arrive. Currently none are planned where the copyright is not European Synchrotron Radiation Facility and/or Diamond Light Source Ltd."

This part of the initial contribution is made up of three plug-ins:

org.eclipse.dataset - main code of the project, including the numerical n-dimensional arrays and the mathematics that operates on them
org.eclipse.dataset.test - test code for the project
org.eclipse.dataset.examples - example code and getting started with datasets

All of the dependencies of the initial contribution are libraries that are already part of the Eclipse ecosystem in Orbit:

org.apache.commons.math3
org.apache.commons.lang
org.slf4j.api
org.junit

The initial contribution is actively developed by Diamond Light Source, with collaboration from Diamond Light Source, Kichwa Coders, and the European Synchrotron Radiation Facility, among others. In the URLs below, the https://github.com/jonahkichwacoders repositories are a fork of the eclipse/dawnsci repository. The fork was created to allow easy refactoring and to demonstrate the code structure.

Part 2:

The non-numeric data structures are a fork of the Eclipse ICE project.

This part of the initial contribution is made up of these plug-ins:

org.eclipse.ice.datastructures - main code of the project, including the various components such as ResourceComponent and DataComponent
org.eclipse.ice.datastructures.test - test code for the project

All of the dependencies of the initial contribution are libraries that are already part of the Eclipse ecosystem in Orbit:

ca.odell.glazedlists
org.slf4j.api
org.junit

The initial contribution is actively developed by Oak Ridge National Laboratory.

Source Repository Type

GitHub

Source Repositories

https://github.com/jonahkichwacoders/org.eclipse.dataset

https://github.com/jonahkichwacoders/org.eclipse.dataset.examples

https://github.com/eclipse/ice/tree/master/org.eclipse.ice.datastructures

https://github.com/eclipse/dawnsci/tree/master/org.eclipse.dawnsci.analysis.dat…