Sciclomatic: A Peer-to-Peer System for Sharing Scientific Datasets

by Martin Monperrus
This post sketches a peer-to-peer system for sharing scientific datasets. Permalink: http://www.monperrus.net/martin/sciclomatic-sharing-scientific-datasets

Introduction


Academics, students and researchers obtain, create or use data in their experiments. Papers are published to present results that analyze, predict, transform or simply use this data. And sometimes, some of them try to share this data with other researchers.

In my opinion, this process of sharing some data is essential for the overall process of science. It allows:
* replicating previous experiments and results to consolidate knowledge;
* setting up new predictive or comparative experiments as well as refutation;
* in general improving over previous work, i.e. standing on the shoulders of giants.

Let me tell the short story of Alice, who published an excellent paper in a top journal and shared her research data. At publication time, she first wanted to put the data file (called dataset.zip) on her web page, but the file was 1GB and exceeded her storage quota. Fortunately, the administrator of her group's website agreed to host it. The file stayed there for two years, publicly available, until the group website moved to a new content management system. During the migration, the file was temporarily unavailable for two months, but finally came back online at a new URL (different from the URL in the paper). One year later, the disk of the server crashed and, unfortunately, the backup system failed to restore Alice's dataset. Neither Alice nor her colleagues or peers had kept a copy, and consequently Alice's dataset was lost forever.

What matters for scientific data?



The short story of Alice illustrates different issues related to sharing scientific data.

Size Scientific datasets may be 1kB, but they are often large: counting in gigabytes is common, and in certain fields (e.g. genomics) the petabyte order of magnitude is a reality.

Availability Scientific data should always be easily available, so that one can get it just before a paper deadline and, in general, count on it in a timely manner.

Findability Scientific data should always be easy to find. In an ideal world, URLs in papers would remain valid for centuries, but this is not the case in reality. Also, a dataset that exists somewhere on a server but cannot be located by other researchers is virtually lost.

Durability Scientific data should survive all kinds of problems, especially hardware and software failures as well as corruption by malicious people.

Alice's story shows that the HTTP client-server model in particular is not appropriate for sharing scientific data. It may handle the size and availability issues, but it is poor with respect to the other two: it outsources findability to search engines and it does not provide any kind of replication to address durability. Borgman recently discussed those points as well [10]. But what are the alternatives?

Alternatives to the HTTP client-server model


At the time of writing this post (February 2012), there are several alternatives:

* peer-to-peer technologies (http://en.wikipedia.org/wiki/Peer-to-peer), e.g. BitTorrent (see [1]).
* cloud-based storage (e.g. Amazon S3).
* distributed file systems (e.g. GlusterFS).

Those three families of technologies are different and have different dynamics: peer-to-peer is advanced by small innovative companies and creative hackers; cloud solutions are driven by talented engineers from big companies; distributed file systems are a large academic field with large open-source software products. They all provide partial solutions to the four issues mentioned above. Also, the diversity inside those technical niches is extremely high: there are dozens of peer-to-peer protocols, cloud storage providers and distributed file system implementations. Can we identify one system/platform that provides scalable, available, findable and durable storage of scientific data?

Let me now briefly show the weaknesses of certain approaches. BitTorrent is designed so that each peer has a full copy of a given file, and the replication is not controlled. As a result, a user (individual, group or university) cannot contribute, say, 1GB of storage space for sharing whatever dataset (full copies or only chunks). The uncontrolled replication may lead to files disappearing in the long term, unless somebody takes the responsibility of keeping a copy of all torrents (à la BioTorrents), which does not scale and represents a single point of failure.

Cloud storage guarantees high resilience against hardware or software failures. However, from the perspective of sharing scientific data, what would happen if the owner of the data file (individual, group) closes his/her account, if the hosting company (e.g. Amazon) goes bankrupt, or if a malicious attacker manages to break part of their infrastructure? In all cases, there is a significant risk that the scientific data is lost.

Requirements for sharing scientific data


According to what I've just discussed, the requirements for a system to share scientific data are as follows:

* It shall be able to support files of any size.
* It shall be replicated in order to address both availability and durability.
* It shall include built-in indexing and searching capabilities on the metadata (e.g. per type, author, domain, etc.).
* It shall include a mechanism to ensure the integrity of data.
* It shall span different organizations (e.g. different universities) and hosting systems (e.g. different OSes), and have different implementations to avoid single points of failure.

Let's call this system "Sciclomatic" (for an automatic system providing storage of scientific data). Sciclomatic is not meant to exist as a single, global instance. One can imagine the genomics research community having its own Sciclomatic instance, the high-energy physics community another one, the mining software repositories community a third one, etc. One can call each of those instances a Sciclomatic swarm.


A command-line interface for Sciclomatic


The interaction with Sciclomatic seems fairly simple.

Creating a Sciclomatic swarm:
$ sc-create --name Genomics
This command would output the name of the bootstrap server, e.g. http://sciclomatic.univ-lille1.fr/sciclomatic

Contributing to a Sciclomatic swarm with space:
$ sc-contribute --server http://sciclomatic.univ-lille1.fr/sciclomatic --size 10G
This command contributes 10GB of storage space to a Sciclomatic instance. There may be an authentication and authorization phase before joining a Sciclomatic swarm.

Searching for files:
$ sc-search --server http://sciclomatic.univ-lille1.fr/sciclomatic author:alice
This command would output a list of files with their ids.

Uploading a file:
$ sc-submit --server http://sciclomatic.univ-lille1.fr/sciclomatic --metadata alice.metadata --file alice.dataset.2001.zip
There may be a validation step before the contributed file is replicated, indexed and searchable (see Anonymity and free contributions below).
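
As a purely hypothetical illustration (the metadata format is not defined in this post), alice.metadata could be a small JSON file describing the fields that are later indexed and searched, e.g.:

{
  "title": "Alice's dataset",
  "author": "alice",
  "year": 2001,
  "domain": "genomics",
  "license": "CC-BY"
}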

Downloading a file:
$ sc-download --server http://sciclomatic.univ-lille1.fr/sciclomatic alice.dataset.2001.zip.s45sdfs6e5s3d2fszeaw3s
(alice.dataset.2001.zip.s45sdfs6e5s3d2fszeaw3s is an ID generated at upload time, for instance a concatenation of the filename and a hash value)
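
As a minimal sketch (an assumption on my side, since the post does not fix a hash function), such an ID could be the filename concatenated with a SHA-256 digest of the file contents, for instance in Python:

import hashlib
import os

def dataset_id(path):
    # Hash the file contents chunk by chunk, so that multi-gigabyte
    # datasets do not have to fit in memory.
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    # Hypothetical ID scheme: filename + "." + hex digest of the contents.
    return os.path.basename(path) + "." + h.hexdigest()

print(dataset_id("alice.dataset.2001.zip"))

Such an ID is self-certifying: any peer that downloads the file can recompute the digest and detect corruption or tampering.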

I imagine a system that is totally decentralized and where all peers are equal. This means that there are no master or control nodes.
When one contributes storage space to a Sciclomatic swarm, the newly connected server automatically becomes a door of the system, an access point to the swarm.
In other terms, sc-contribute may output:
Contributing 10 GB to http://sciclomatic.univ-lille1.fr/sciclomatic, the created door is http://145.5.78.64/sciclomatic

Sciclomatic's distribution, replication and voting algorithms would ensure that no single node or group of nodes is able to corrupt or delete files, to introduce fake files, or to trigger incorrect search results.

Implementations of Sciclomatic


Whether from peer-to-peer, cloud or distributed file systems, I believe that the algorithms and building blocks of Sciclomatic already exist. For instance, powerful hash functions are capable of ensuring data integrity, distributed hash tables (DHTs) provide us with peer-to-peer searching capabilities (see http://btdigg.org), etc. The goal of Sciclomatic is not at all to reimplement a new system; it would be great to reuse as much existing software as possible. Also, as pointed out above, having different implementations of the same protocol would minimize the single-point-of-failure syndrome at the implementation level.
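
To make the integrity point concrete, here is a minimal sketch (my own illustration, not part of any existing Sciclomatic code) of how a receiving peer could check that a chunk matches the digest advertised by the swarm, again in Python:

import hashlib

def chunk_is_intact(chunk_bytes, expected_digest):
    # Recompute the SHA-256 digest of the received chunk and compare it
    # with the digest announced for that chunk by the swarm.
    return hashlib.sha256(chunk_bytes).hexdigest() == expected_digest

The same idea, applied to whole files via the hash embedded in the dataset ID, gives end-to-end integrity without having to trust any individual node.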

Anonymity and free contributions



Also, I would like to point out that, for sharing scientific data, anonymity is not crucial, since the papers introducing some data are rarely anonymous (contrary to free-speech peer-to-peer systems such as Freenet). Free contribution is not essential either: one can imagine a validation step before a dataset is uploaded to Sciclomatic. This is actually important for encouraging organizations to provide a Sciclomatic swarm with "blind" storage space, where they would not control which data or data chunks are hosted on their machines. This is rather important in order to keep out porn and illegal downloads.

Related work



BioTorrents [2], Bionimbus [3], dCache [4] and Globus [5] provide interesting solutions, but they don't seem exactly aligned with Sciclomatic's vision. "Data Conservancy" [11] is a project funded by the US National Science Foundation. Rob Allan [6] mentions many large-scale projects in the direction of Sciclomatic. There are actually many, many academic papers (e.g. [7] or [8]) on this subject, but Sciclomatic is more about a mature and reliable implementation than about concepts. The Wikipedia page "Scientific Data Archiving" [9] also discusses this topic.

Conclusion


The main goal of this post is to ask you whether you know a system that is close to what I am dreaming of (GlusterFS? Sector/Sphere?) and, if not, to try to build a community to make this dream come true. Don't hesitate to comment on this page :-) or drop me an email.

News Oct 2014: See https://zenodo.org/

Martin Monperrus, February 2012

Bibliography

[1] Sharing scientific datasets with BitTorrent (Martin Monperrus)
[2] BioTorrents: a file sharing service for scientific data (MGI Langille, JA Eisen)
[3] Bionimbus Cloud (http://www.bionimbus.org/)
[4] dCache, the commodity cache (P Fuhrmann) http://www.dcache.org/
[5] Globus (Globus Alliance) http://www.globus.org/
[6] Management and Analysis of Large Research Data Sets (Rob Allan)
[7] Beyond Music Sharing: An Evaluation of Peer-to-Peer Data Dissemination Techniques in Large Scientific Collaborations (Al-Kiswany et al.), Journal of Grid Computing, 2009
[8] Public sharing of research datasets: a pilot study of associations (Heather A. Piwowar and Wendy W. Chapman) Journal of Informetrics, 2010
[9] Scientific Data Archiving (Wikipedia)
[10] The conundrum of sharing research data (Borgman, C. L.), Journal of the American Society for Information Science and Technology, 2012
[11] Data Conservancy, a project funded by the US National Science Foundation (NSF)
