This data-oriented project originates from the EU-funded Crossminer project. As Eclipse DataEggs grew in size and maturity, and as specific audiences and needs arose across the community, we decided to create a new project dedicated solely to maintaining this resource and making it available, so that the service continues for the Eclipse and research communities. The website presenting the datasets is already live, is updated continuously, and is available on the Scava download page.
Eclipse DataEggs provides open, anonymised, up-to-date and ready-to-use datasets related to the development of Eclipse projects. It includes the following types of data:
- Mailing lists (full mboxes and csv extracts) hosted at the Eclipse forge.
- AERI exception stacktraces (not updated anymore, historical data only).
- Development data from Eclipse projects.
Currently, 21 projects have been analysed using this tool. More can be added at the projects' request.
The datasets provided by this project can already be explored at https://download.eclipse.org/scava/.
- Mailing lists (full mboxes and csv extracts) hosted at the Eclipse forge with their documentation and examples.
- AERI exception stacktraces (not updated anymore, historical data only), comprising 2 datasets: problems (see documentation) and incidents (see documentation).
- Development data from Eclipse projects. Depending on data sources, the following information is provided:
  - SCM (git).
  - ITS (Bugzilla, GitHub issues, GitLab issues).
  - CI (Jenkins).
  - PMI checks.
  - Stack Overflow statistics.
  - Scancode analysis (executed on our server).
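As an illustration of how the full mboxes listed above might be consumed, here is a minimal sketch that counts messages per month using only the Python standard library. The function name `count_messages_per_month` is hypothetical, and the sketch assumes the files are in standard mbox format with a `Date` header on each message; the actual layout of the published files is described in the project documentation.

```python
import mailbox
from collections import Counter
from email.utils import parsedate_to_datetime


def count_messages_per_month(mbox_path):
    """Return a Counter mapping 'YYYY-MM' to the number of messages sent that month.

    Hypothetical helper: assumes a standard mbox file with Date headers.
    """
    counts = Counter()
    # create=False makes a missing file raise an error instead of creating it.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            raw_date = message.get("Date")
            if raw_date is None:
                continue  # skip messages that carry no Date header
            try:
                sent = parsedate_to_datetime(str(raw_date))
            except (TypeError, ValueError):
                continue  # skip messages whose Date header cannot be parsed
            counts[sent.strftime("%Y-%m")] += 1
    finally:
        box.close()
    return counts
```

Calling `count_messages_per_month("some-list.mbox")` on a downloaded file (the file name here is a placeholder) returns a mapping that can be sorted by key to give a simple activity timeline for the list.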
Privacy has been a major concern from the beginning; see our documentation for more details.
All code in the GitLab repository has been written by me and is released under the EPL v2. Project data is fetched from an Alambic instance (hosted on our server) and as such is not impacted by license constraints, although Alambic itself is also licensed under the EPL.
Although the analysis engine itself is (almost) forge-agnostic, the datasets provided in this project are exclusively related to the Eclipse forge.
Code is ready and builds are already running weekly. Everything is deployed to https://download.eclipse.org/scava/projects/ on Sundays, around 4 am.
It should be noted that the builds are run on our own server (http://ci4.castalia.camp:8080) since they are quite resource-intensive.
Eclipse Foundation.
Project developers and end-users.
Research Labs (see previous requests to access Eclipse forge datasets).
All code is already stored at the Eclipse Foundation since it was written for Eclipse Scava. It was recently moved from the Eclipse git repositories to the new GitLab infrastructure. It can be found at https://gitlab.eclipse.org/bbaldassari2kd/scava-datasets.
All code has been written by me (Boris Baldassari) under the usual ECA, and is licensed under the EPL v2.