OR08 Publications

Repositories for Scientific Data

Murray-Rust, P. (2008) Repositories for Scientific Data. In: Third International Conference on Open Repositories 2008, 1-4 April 2008, Southampton, United Kingdom.

Full text not available from this repository.


Scientists are producing data at an ever-increasing rate (the data deluge) due to automated instruments, image capture and simulation tools. This holds the promise of “data-driven science”, where scientific discovery can be made by linking or mining existing data. The reality is, unfortunately, that almost all of this data is lost. Although some publishers welcome data as an adjunct to “fulltext”, many do not, and most do not have the domain expertise to store and curate the data. And although “big science” (such as high energy physics, geospatial imaging, genomics and structural biology) can often provide domain repositories (e.g. in bioinformatics), most science (the long tail) cannot. There is an urgent need to address this problem. Current Institutional Repositories (IRs) are geared to storing and disseminating scholarly manuscripts, and while some are prepared to accept other digital artefacts, the practice is fragmented and does not scale. We need to define “Data Repositories” (DRs) which serve the interests of the scientists directly. This is highly domain-dependent and there is no one-size-fits-all solution. However, there are some general principles.
• The DRs must be intimately embedded in the current practice of the scientists; ideally they should be invisible to them.
• They must directly support the scientific effort and be seen as doing so, rather than being confused with metrics, business processes, etc.
• The people running them should be physically present in the scientific laboratories (“wearing lab coats”).
It is important not to overcomplicate with unnecessary middleware and metadata. The typical informatics toolset of a scientist includes Word/LaTeX, Excel, and the good old filing system, which with huge storage comes back into its own. Free text indexing tools will do as good a job of creating domain metadata as humans.
Many departments are starting to introduce backup systems such as Active Directory, Samba or SVN that satisfy the most important users of the repository: the scientists themselves. HTTP/REST is good enough for many departments. These tools are an excellent starting point to engage the scientists and show there is real benefit. This is a new field and I shall review some of the current approaches, including work from our own group (in chemistry and crystallography). It is critical that prototypes are developed with sustainability in mind. Indeed, if good principles of data management are brought into the teaching and learning process (e.g. in final year projects) then the students themselves will provide much of the innovation and tools.

Item Type: Conference or Workshop Item (Speech)
Creators: Peter Murray-Rust
Subjects: Main Conference > Keynote
ID Code: 82
Deposited By: Leslie Carr
Deposited On: 29 Mar 2008 07:17
Last Modified: 14 Apr 2008 15:54


