Reviews run for a minimum of one week. The outcome of the review is decided on this date. This is the last day to make comments or ask questions about this review.
With this review, we will create the January project and move some existing code from the DawnSci and ICE projects.
Scientific computing involves the manipulation and processing of various forms of numerical data. This data is organized in specific data structures in order to make the processing as efficient as possible. This project aims to provide a set of standardized Java-based numerical data structures for scientific computing as well as general purpose data structures that map to a wide range of scientific problems. These data structures will be key to the easy integration of Science Working Group projects and tools such as visualization, workflows and scripting frameworks. A common set of structures reduces barriers to adoption of tools within the Science Working Group and wider community and will speed up the adoption of new technologies.
The January project provides Java implementations of numerical data structures such as multi-dimensional arrays and matrices, including a Java equivalent to the popular Python NumPy library for n-dimensional array objects. The project also includes general purpose data classes, structures and pattern realizations that can be mapped to a wide range of scientific problems while also maintaining metadata about that information, for example CSG trees.
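The core idea behind such n-dimensional arrays, in NumPy and in Java alike, is a flat backing array indexed through a shape and per-dimension strides. The following is a minimal sketch of that idea in plain Java; the class and method names (`NDArray`, `flatIndex`) are illustrative only and are not January's actual API:

```java
// Sketch: an n-dimensional dataset backed by a flat, row-major Java array.
// Illustrative names only - not the project's actual Dataset classes.
public class NDArray {
    final double[] data;   // flat row-major storage
    final int[] shape;     // e.g. {2, 3} for a 2x3 matrix
    final int[] strides;   // elements to skip per step along each dimension

    NDArray(int[] shape) {
        this.shape = shape;
        this.strides = new int[shape.length];
        int size = 1;
        // Row-major layout: the last dimension varies fastest.
        for (int i = shape.length - 1; i >= 0; i--) {
            strides[i] = size;
            size *= shape[i];
        }
        this.data = new double[size];
    }

    // Map an n-dimensional position to an offset in the flat array.
    int flatIndex(int... pos) {
        int idx = 0;
        for (int i = 0; i < pos.length; i++) idx += pos[i] * strides[i];
        return idx;
    }

    double get(int... pos) { return data[flatIndex(pos)]; }

    void set(double v, int... pos) { data[flatIndex(pos)] = v; }

    public static void main(String[] args) {
        NDArray a = new NDArray(new int[]{2, 3});
        a.set(7.0, 1, 2);                      // element at row 1, column 2
        System.out.println(a.get(1, 2));       // 7.0
        System.out.println(a.flatIndex(1, 2)); // 1*3 + 2*1 = 5
    }
}
```

The strides make operations such as transposition and slicing cheap: both can be expressed by adjusting shape and strides without copying the backing array.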
Implementations are scalable to large structures that do not fit entirely in memory at once. For example, data structures up to hundreds of megabytes generally fit in memory without additional design consideration; however, data structures of many gigabytes or larger need design consideration to allow efficient processing without loading the entire structure into memory at once. Therefore features such as meta information on data, references to data and slicing of data are first-class citizens of this project. The required outcome is to allow data structures to scale to run on various distributed computing architectures.
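The lazy-access idea can be illustrated with a small sketch in plain Java: a dataset object that knows only its logical size and a loader, and materialises just the elements a requested slice actually touches. Everything here (the `Loader` interface, `LazySlice`, the element-at-a-time loading) is an illustrative simplification under assumed names, not the project's actual lazy dataset API:

```java
import java.util.function.LongToDoubleFunction;

// Sketch: a 1-D "lazy" dataset that materialises only the slice requested,
// so the full structure never has to fit in memory. Illustrative names only.
public class LazySlice {
    interface Loader {
        // Reads `count` elements starting at `start` from backing storage
        // (a file, a remote service, ...).
        double[] load(long start, int count);
    }

    // Simulate backing storage with a function from index to value.
    static Loader simulated(LongToDoubleFunction f) {
        return (start, count) -> {
            double[] out = new double[count];
            for (int i = 0; i < count; i++) out[i] = f.applyAsDouble(start + i);
            return out;
        };
    }

    final Loader loader;
    final long size;   // logical size; may be far larger than available memory

    LazySlice(Loader loader, long size) { this.loader = loader; this.size = size; }

    // Materialise elements [start, stop) with the given step.
    // A real implementation would batch reads; one element at a time keeps
    // the sketch short.
    double[] getSlice(long start, long stop, int step) {
        int n = (int) ((stop - start + step - 1) / step);
        double[] out = new double[n];
        for (int i = 0; i < n; i++)
            out[i] = loader.load(start + (long) i * step, 1)[0];
        return out;
    }

    public static void main(String[] args) {
        // Pretend the backing store holds ten billion squares: value(i) = i*i.
        LazySlice ds = new LazySlice(simulated(i -> (double) (i * i)), 10_000_000_000L);
        double[] s = ds.getSlice(4, 10, 2);   // touches only indices 4, 6, 8
        System.out.println(java.util.Arrays.toString(s)); // [16.0, 36.0, 64.0]
    }
}
```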
This project will also encapsulate methods for loading, storing and manipulating data. This project is designed to work in headless (non-UI) operation for automated data processing.
January is a set of libraries for handling numerical data in Java. It is inspired in part by NumPy and aims to provide similar functionality.
Why use it?
- Familiar. Provide familiar functionality, especially to NumPy users.
- Robust. Has a test suite and is used heavily in production at Diamond Light Source.
- No more passing double arrays. IDataset provides a consistent object on which to base APIs, with significantly improved clarity over using double arrays or similar.
- Optimized. Optimized for speed and getting better all the time.
- Scalable. Allows handling of data sets larger than available memory with "Lazy Datasets".
- Focus on your algorithms. Reusing this library lets you focus on your own code.
For a basic example, have a look at the example project: BasicExample.java
Browse through the more advanced examples.
- NumPy Examples show how common NumPy constructs map to Eclipse Datasets.
- Slicing Examples demonstrate slicing, including how to slice a small amount of data out of a dataset too large to fit in memory all at once.
- Error Examples demonstrate applying an error to datasets.
- Iteration Examples demonstrate a few ways to iterate through your datasets.
- Lazy Examples demonstrate how to use datasets which are not entirely loaded in memory.
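As a flavour of what dataset iteration involves, here is a minimal "odometer" position iterator over an n-dimensional shape, written in plain Java. It is an illustrative sketch only; the library ships its own iterator classes, and the names below are not its API:

```java
import java.util.Arrays;

// Sketch: an "odometer" iterator that visits every position of an n-D shape,
// the usual pattern for walking a dataset without hard-coding nested loops.
public class OdometerIterator {
    final int[] shape;
    final int[] pos;      // current position, updated in place
    boolean first = true;

    OdometerIterator(int... shape) {
        this.shape = shape;
        this.pos = new int[shape.length];
    }

    // Advances pos to the next position; returns false when exhausted.
    boolean next() {
        if (first) { first = false; return true; }  // start at all zeros
        for (int i = shape.length - 1; i >= 0; i--) {
            if (++pos[i] < shape[i]) return true;   // carry succeeded
            pos[i] = 0;                             // roll over, carry left
        }
        return false;                               // wrapped past the end
    }

    public static void main(String[] args) {
        OdometerIterator it = new OdometerIterator(2, 3);
        int count = 0;
        while (it.next()) {
            System.out.println(Arrays.toString(it.pos));
            count++;
        }
        System.out.println(count + " positions"); // 6 positions
    }
}
```

For a shape of {2, 3} this visits [0,0], [0,1], [0,2], [1,0], [1,1], [1,2], the same order in which a row-major flat array stores its elements.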
Common data structures were identified by members of the Eclipse Science Working Group as a fundamental building block for development and integration of scientific tools and technologies.
The initial contribution is expected to be made in the first half of 2016.
The initial contribution consists of two parts:
The numeric data structures are a fork of the Eclipse DAWNSci project that extracts Datasets and their associated mathematical libraries. As per the DAWNSci project proposal:
"The copyright of the initial contribution is held ~100% by Diamond Light Source Ltd. There may be some sections where copyright is held jointly between the European Synchrotron Radiation Facility and Diamond Light Source Ltd. No individual people or other companies own copyright of the initial contribution. Expected future contributions like the implementation of various interfaces will have to be dealt with as they arrive. Currently none are planned where the copyright is not European Synchrotron Radiation Facility and/or Diamond Light Source Ltd."
This part of the initial contribution is made up of three plug-ins:
- org.eclipse.dataset - main code of the project, including the numerical n-dimensional arrays and the mathematics that operates on them.
- org.eclipse.dataset.test - test code for the project
- org.eclipse.dataset.examples - example code for getting started with datasets
All of the dependencies of the initial contribution are libraries that are already part of the Eclipse ecosystem in Orbit:
The initial contribution is currently actively developed by Diamond Light Source and collaborated on by Diamond Light Source, Kichwa Coders, and European Synchrotron Radiation Facility, among others. In the URLs below the https://github.com/jonahkichwacoders repositories are a fork of the eclipse/dawnsci repository. The fork was created to allow easy refactoring and demonstrate code structure.
The non-numeric data structures are a fork of the Eclipse ICE project.
This part of the initial contribution is made up of these plug-ins:
- org.eclipse.ice.datastructures - main code of the project, including the various components such as ResourceComponent and DataComponent
- org.eclipse.ice.datastructures.test - test code for the project
All of the dependencies of the initial contribution are libraries that are already part of the Eclipse ecosystem in Orbit:
The initial contribution is currently actively developed by Oak Ridge National Laboratory.
This project will aim to join the Eclipse Release Train from the Oxygen release.
After the initial contribution, the project will focus on standardisation across other Science Working Group projects, including (but not limited to):
- Integration of data structures of Eclipse Advanced Visualisation Project
- Integration with Triquetrum Project
Future items of work under consideration include (but are not limited to):
- Loading and storing of datasets
- Processing large data sets in an architecturally aware manner e.g. on multiple cores or a GPU
- Physical and mathematical units
Scientific software on the Eclipse Platform is common and widespread. Because the Eclipse Rich Client Platform (RCP) is a feature-rich and productive environment, many science organizations have adopted it, and today many functional applications exist in this sector. The architecture of RCP means that much of the user interface can be interchanged and is inter-operable to some extent. For instance, a science project wanting to support Python can add the PyDev feature to its product and extend functionality quickly at little cost. When it comes to inter-operable algorithms and inter-operable plotting, however, this is not the case: each science project has its own definitions. This means that projects cannot profit from others' work or make serendipitous discoveries with an unexpectedly useful tool. The DAWNSci project is one option for solving these issues.
The DAWNSci project defines Java interfaces for data description, plotting and plot tools, data slicing and file loading. It defines an architecture oriented around OSGi services to do this. It provides a reference implementation and examples for the interfaces.
This project provides features to allow scientific software to be inter-operable. Algorithms exist today which could be shared between existing Eclipse-based applications however in practice they describe data and do plotting using specific APIs which are not inter-operable or interchangeable. This project defines a structure to make this a thing of the past. It defines an inter-operable layer upon which to build RCP applications such that your user interface, data or plotting can be reused by others.
The Eclipse Foundation is the right place to collaborate for DAWNSci because of the Science Working Group. This group has attracted several universities and software companies. The Eclipse Foundation gives the most opportunity for discovering new projects that DAWNSci can make use of and add value to the scientists working at or visiting Diamond and the ESRF.
The copyright of the initial contribution is held ~100% by Diamond Light Source Ltd. There may be some sections where copyright is held jointly between the European Synchrotron Radiation Facility and Diamond Light Source Ltd. No individual people or other companies own copyright of the initial contribution. Expected future contributions like the implementation of various interfaces will have to be dealt with as they arrive. Currently none are planned where the copyright is not European Synchrotron Radiation Facility and/or Diamond Light Source Ltd.
Third party plugin which is a dependency:
We have skimmed over the initial contribution with Sharon. The DAWN collaboration will be adding a collaboration agreement for committers, http://www.dawnsci.org/collaborate (still in progress). Previous committers will have to fill out a question set which will be sent around our mailing list, DAWN-DEV@JISCMAIL.AC.UK.
The DAWN software releases are scheduled around the shutdown phases of the Diamond synchrotron. The sub-components which are part of the DAWN Eclipse project, DAWNSci, will go through the same cycle. The releases are maintained by Diamond Light Source and are not hosted on the Eclipse web site.
1. Data description (numpy like layer)
2. Plotting (based on nebula and in-house)
3. Plot tools
4. Data slicing
5. File loading including HDF5
The science and engineering community relies heavily on modeling and simulation to bridge the gap between theory and experiment. Almost every type of physical phenomena has a corresponding community of codes that simulate it for various physical conditions and parameters. A wide array of modeling tools exist to create inputs for these simulators. In fact, modeling and simulation tools are so widely available that the challenge is not finding one that will simulate the system, but actually figuring out how to use it.
There are four primary activities performed by all computational scientists and engineers: setting up the model, launching the job, analysing the results and managing the input and output data. There are certainly other activities, such as building or authoring code, but these “Big Four” are the only ones that are shared by users and developers. These activities may be trivial for a super user of any one code or set of codes, but new users can be severely challenged and even put off by these tasks. In some cases the challenges are scientific in nature or related to the engineering, but, arguably, in the majority of cases the challenges come from details like file formats, exotic host platforms and poor or completely missing usability.
The details of the way the Big Four activities are performed for any subset of the community are certainly different, but conceptually there are remarkable similarities because of shared packages, ideas and data. While it would be impossible for any one tool to service the entire community, ideally, this commonality could be exploited to create a suite of integrated tools to support users. This integrated tool suite would be a platform to which developers could contribute their own tools to help users and one that users could leverage to accelerate their modeling and simulation studies with multiple tool sets and techniques. The details of exactly how files are manipulated and stored and jobs are launched and monitored would be encapsulated, automated - where possible - and otherwise unexposed to the user.
We propose a new Eclipse project - the Eclipse Integrated Computational Environment - that leverages the vast collection of Eclipse tools in the Integrated Development Environments to develop this integrated tool suite. The existing Eclipse tools - advanced editors, user interface components, parallel tools, etc. - will be used to develop science and engineering tools focused on modeling and simulation. New tools will also be developed and contributed that add capability to the Eclipse Platform for science and engineering tasks that have not, to date, been added to the Platform.
The scope of this project will cover everything necessary for performing the Big Four activities (setting up the model, launching the job, analysing the results and managing the input and output data) in addition to streamlining some of the activities necessary for obtaining modeling and simulation codes. This includes:
- Input generation using existing or new I/O tools for modeling and simulation projects.
- 3D geometry construction for physical models.
- 1D, 2D and 3D meshing tools or connectors to existing third party tools.
- Database connectors for scientific and engineering databases for physical information.
- Tools for launching scientific and engineering codes on local or remote machines and managing the input and output of those jobs.
- Tools for monitoring remote jobs with domain-specific views.
- Scientific and engineering tools for viewing data, including but not limited to plotting routines, 3D visualization, scientific data mining and “low-fidelity, high-access” views for derived quantities for modeling and simulation.
- Integration with engines for quantifying uncertainty in simulators.
In all cases, an emphasis will be placed on creating reusable tools that can be easily deployed for multiple modeling and simulation projects instead of highly specialized tools for a single modeling and simulation project. The first approach maximizes the benefit of the tool suite to customers and, with proper design, accomplishes the second as well; the second approach alone does not accomplish the goals of the project.
The capability to launch jobs in this project will be based on those already available in the Eclipse Parallel Tools Platform (PTP), but tailored specifically for modeling and simulation projects. All of the code under the hood for uploading, downloading, executing and streaming files will call the PTP. Custom views and editors for modeling and simulation will be created at the UI level, as well as for organizing the results of the launch.
The Eclipse Integrated Computational Environment (ICE) will address the usability needs of the scientific and engineering community for the Big Four modeling and simulation activities. The focus of the ICE will be to develop an easily extended and reusable set of tools that can be used by developers to create rich user interfaces for their modeling and simulation products. Custom widgets and data structures with well-defined interfaces and high-coverage unit tests will be provided for plugin developers. Plugins based on the ICE tools will also be developed and released with the ICE for codes that the developers use and the community contributes to the extent that resources allow. The idea is that the tools should be available for developers to do what they do and the deployment mechanism should be ready and waiting when they are finished.
Each of the big four tasks presents unique challenges, but much of the capability required can be handled by extensions to the existing Eclipse tools.
Creating models for scientific and engineering projects is a very difficult task that is, presently, very time consuming and hampered by the lack of an integrated tool chain. For example, consider the screenshot in Figure 1 of an input for a battery simulation configured using the code in the Initial Contribution.
The activity depicted in Figure 1 appears to be straightforward. Some information is required to pick a case and some other information is required in a table. After that information is provided, some buttons can be clicked and an input file will be generated. This seems relatively simple when described as such and presented with an Eclipse Form, but on the backend it requires detailed knowledge about the input format and four simulation codes. It also requires dependency tracking and a little bit of math. Both the input format and the four simulation codes are relatively well known in the battery simulation community, but no tools exist for using them easily.
Figure 1 - Setting up the input for a battery simulation.
A user who was not using the tool would have to learn all of those things themselves, much like a Java programmer not using Eclipse for Java Developers or a similar tool would have to learn the details of using the “javac” and “java” command line tools. With the tool, however, the user can perform the task far more quickly.
The Eclipse Form in the example takes advantage of several data structures and widgets that are automatically created and rendered based on very little information from the developer. The instructions used to draw the widgets on the screen are separate from the domain-specific information (business logic) provided by the plugin developers. While the developer is responsible for saying that they want information from users about time integration or component configurations, they never have to be responsible for drawing that information. An extension API exists, but they can (and tend to) leave that task up to the computer scientists.
The second of the Big Four activities, launching jobs, is in some ways the defining activity of modeling and simulation. It can also be extremely difficult if the simulation in question requires an exotic machine, a special operating system or really any system with which the user is not intimately familiar. In interviews, users tend to say that they would rather push a button, enter a password if the job is remote and watch the information stream back than “work the shell” themselves. In some cases a user may not even know how to use a shell on the operating system in question and cannot easily run the job.
Similar to the proposed modeling tools, the ICE will provide a simple developer interface for configuring job information - by template - for users. These tools will handle matching inputs with jobs, the actual execution, and streaming information back to the user in an organized, recoverable and easily reproducible way. The burden on users and developers is reduced to saying what needs to be launched: the user does this in the user interface, while the developer writes, for example, a reusable, templated shell script.
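The templated-script idea can be sketched in a few lines: the developer supplies a template with placeholders once, and the launcher substitutes per-job values before execution. The placeholder syntax and names below are hypothetical illustrations, not ICE's actual template format:

```java
import java.util.Map;

// Sketch: filling a reusable, templated launch script with per-job values.
// The ${name} placeholder convention here is an assumption for illustration.
public class LaunchTemplate {
    // Replace each ${key} occurrence in the template with its value.
    static String fill(String template, Map<String, String> values) {
        String result = template;
        for (Map.Entry<String, String> e : values.entrySet())
            result = result.replace("${" + e.getKey() + "}", e.getValue());
        return result;
    }

    public static void main(String[] args) {
        // The developer writes this template once, per code, not per job.
        String template = "mpirun -n ${nProcs} ${executable} ${inputFile}";
        // The user's choices in the UI supply the per-job values.
        String cmd = fill(template, Map.of(
                "nProcs", "64",
                "executable", "./battery_sim",
                "inputFile", "cell.inp"));
        System.out.println(cmd); // mpirun -n 64 ./battery_sim cell.inp
    }
}
```

A real launcher would then hand the filled script to the remote-execution layer (PTP, in this project's case) rather than printing it.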
Analyzing and managing data, the last two of the Big Four activities, go hand in hand. There are some cases where data cannot be managed or analyzed on a local machine because the data files are too big. The exact ways of doing that with a suite like the proposed ICE are open areas of research. However, for smaller data sets there are many things that can be done to improve usability.
The most obvious thing that can be done is to provide users a way to read their data. It is increasingly common in the science and engineering community to use a binary data format, so providing some way to read that data is key. The second most obvious thing to do is provide simple information on the provenance and present location of the data. (That may seem obvious, but it is commonly skipped in scientific codes.)
It is much more useful to provide users with visualization capabilities, pre-programmed by developers, that help them extract knowledge from the data, as shown in Figure 2. It is important to make a distinction between different types of data since there are different types of users. Some users will prefer full 3D views of their simulations on the mesh as shown in the left pane of Figure 2. Other users will prefer parameterized views of derived macroscopic quantities as shown in the right pane of Figure 2.
In both cases it is important to make these tools easily accessible to users and plugin developers. The view on the left of Figure 2 uses an extension that enables viewing 3D meshes in Eclipse with VisIt, https://wci.llnl.gov/codes/visit/, which some years ago was successful in rendering images with over one trillion elements of data. The parameterized view in the right pane of Figure 2 is also a good example. It is a Cartesian grid drawn with the Graphical Editing Framework and is fully interactive, resizable and can be exported to an image for users with a few clicks. It has a simple interface for user interface developers and can handle hexagonal and circular grids in addition to Cartesian grids.
Figure 2 - A 3D image of the mesh in a battery simulation on the left and a parameterized view of a nuclear fuel assembly and the power on a few rods.
This project is a natural fit for the Eclipse Foundation for several reasons. The most important reason is that the scientific and engineering communities are starting to use Eclipse more. The formation of the Eclipse Scientific Industry Working Group is a sign of things to come, namely that more scientific and engineering entities are relying on Eclipse for their tooling infrastructure or at least showing interest in it. A corollary to this is that these capabilities do not already exist in the Eclipse Platform even though they are well represented in the broader Eclipse ecosystem.
The Eclipse Platform is also, in the opinion of the authors, the only environment that already has the vast majority of the capabilities needed to produce the integrated tool suite. Tools that are used for authoring can be reused very easily for modeling and simulation. The platform has over ten years of reusable capability that could not be easily or cost-effectively reproduced by any single entity.
The Eclipse Foundation also offers an avenue by which contributions to the integrated environment can be tracked, vetted and released more broadly and easily than alternatives. This minimizes risk and encourages adoption by corporate customers, which is something that helps transfer technology from Academic and National Laboratory circles to industry.
The Initial Contribution (IC) will be made by UT-Battelle, LLC, which manages the Oak Ridge National Laboratory (ORNL). The IC will be from an existing project, the NEAMS Integrated Computational Environment, presently released by ORNL at Sourceforge.net.
The initial contribution is an Eclipse RCP-based product and relies heavily on other Eclipse projects. It relies especially on the Parallel Tools Platform, Plugin Development Environment, the Graphical Editing Framework and Tycho.
There are no obvious legal issues with the code in the opinion of the authors. It includes and depends on a few open-source codes outside of the Eclipse ecosystem. It contains no code licensed under the GPL or LGPL.
The following dependencies outside the Eclipse ecosystem are used:
- Hierarchical Data Format version 5 (HDF5): I/O for reactors
- 3D visualization for geometry and meshes (Apache License 2.0)
- 3D visualization for results
- Common battery specification format (license unknown*, but presumably open-source)
- Units in XML schema and source code for BatML
* We are currently working with the authors of these schemas to determine the license. They are publicly available on US Government websites.
The initial contribution will be made in quarter two of 2014 and the first release will happen by the end of quarter one of 2015.
Future work for this project in the short term will include improvement of the core tool set, data structures, widgets and adapters for third parties and plugin developers. It will also include the development of plugins for modeling and simulation codes that are a “high priority” to both the project members and the community.
Long term work on this project will focus on data sharing and mining and team collaboration activities. This includes tools for automatically mining modeling and simulation data as well as tools for easily sharing that data, or its metadata, in an IP-friendly way between collaborators. Improvements in the visualization capabilities will also be critical to long term success.