CrystalEye: From Desktop To Data Repository

Nick Day, Jim Downing, Peter Murray-Rust, Unilever Cambridge Centre for Molecular Science Informatics, University of Cambridge.

Summary

CrystalEye began life as the backbone of Nick Day's PhD research, which sought the chemical reasons for limitations in computational molecular structure prediction by comparing real structures (from published crystallography) with those predicted by a computational code. The crystallographic data was scraped from publishers' websites and processed to add chemical information (a heuristic process that was itself part of the research).

The project evolved: HTML pages of derived data were created as a way of visualising results, these pages were published as a website, and functionality was added both to serve Nick's research and in response to the community. At the same time, the scale of the data increased, bringing other challenges and concerns.

We believe CrystalEye is important to the repositories community because it could not have been achieved using the prevalent centralised repository approach, because the context and first steps are typical of "small science" research, and because we believe there is a key role for initiatives such as CrystalEye in the repository ecology.

This presentation traces the evolution of CrystalEye, and discusses: -

Early Evolution

When Nick started his thesis he had no clear idea of the scale of the data or its internal structure, and like many scientists he found that a large filesystem and web tools gave him all he needed to manage his data. The work consisted of calculations on a large number of disparate datasets, and for each he constructed a directory into which all of the results and ancillary files were put. This is a very natural approach for many experimental and computational scientists in "small science" (as opposed to "big science", e.g. astronomy, high energy physics, geoscience etc.). Like many of the projects in our laboratory, Nick's started with a few thousand data items, and allocating a container (directory) for each is an obvious first approach. Moreover, it is usually possible to create a semi-semantic naming system such that the investigator can normally find information without needing a formal index.
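The directory-per-dataset approach with semi-semantic names can be sketched as follows. This is a minimal illustration only: the layout (publisher / journal / year / issue / structure identifier) and the function names are hypothetical, chosen to show how a naming scheme can stand in for a formal index; they are not taken from the CrystalEye code.

```python
from pathlib import Path


def dataset_dir(root, publisher, journal, year, issue, structure_id):
    """Build the directory path for one dataset from its (assumed) provenance."""
    return Path(root) / publisher / journal / str(year) / str(issue) / structure_id


def create_dataset_dir(root, publisher, journal, year, issue, structure_id):
    """Create the container directory for a dataset and return its path."""
    path = dataset_dir(root, publisher, journal, year, issue, structure_id)
    path.mkdir(parents=True, exist_ok=True)
    return path


def find_datasets(root, publisher="*", journal="*", year="*", issue="*"):
    """Locate dataset directories by name alone, using glob wildcards.

    Because each path component carries meaning, a wildcard pattern
    plays the role that an index lookup would play in a database.
    """
    pattern = f"{publisher}/{journal}/{year}/{issue}/*"
    return sorted(p for p in Path(root).glob(pattern) if p.is_dir())
```

For example, find_datasets(root, year=2007) would list every dataset directory filed under 2007, whatever its publisher or journal. The scheme works well at a few thousand entries; it is the growth in scale described later that puts it under strain.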

Using HTML, CSS and the Jmol applet is a natural way to provide data visualisation for crystallographic data, so when it became clear that Nick's data was of wider value, the easiest way to share it was to replicate the existing data structure on a public-facing web server, and so CrystalEye was born.

Repository-like Features

Metadata

Metadata from the original data is carried through to the data files in CML, and enhanced with metadata about the processing. There are plans to expose this metadata as RDF in the near future, available both as RDF/XML resources, and also through a SPARQL endpoint.

Browse, View

CrystalEye offers a browse structure based on the source of the original data. There are no current plans to provide alternative browse trees, in the expectation that others can use the RDF metadata to provide alternative views if they find them useful.

Data Harvesting Using Atom

Because we found it so difficult to harvest data reliably from "web 1.0" websites, we knew we had to make it much easier for others to harvest data from CrystalEye. The extensibility of the Atom syndication format made it possible to construct an Atom feed that renders meaningfully in "normal" aggregators using raster images, allows data harvesting, and also gives richer visualisation in chemistry-aware feed readers. The advantages of using a resource-oriented protocol like Atom rather than a service-oriented protocol like OAI-PMH were considerable: deployment was easier (no dynamic components), and we gained scalability features out of the box from the web server (Apache HTTPD).
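A harvester for such a feed can be very small. The sketch below uses only the Python standard library and the element names defined by the Atom format (entry, id, updated, link with rel and type attributes). The sample feed, its example.org URLs, the use of rel="enclosure" links to point at data files, and the chemical/x-cml media type are assumptions made for illustration; they do not describe the actual layout of the CrystalEye feeds.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

ATOM = "{http://www.w3.org/2005/Atom}"

# Hypothetical feed: one older entry and one newer entry with several links.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example structure feed</title>
  <id>urn:example:feed</id>
  <updated>2007-12-02T09:30:00Z</updated>
  <entry>
    <id>urn:example:structure-1</id>
    <title>Structure 1</title>
    <updated>2007-11-30T12:00:00Z</updated>
    <link rel="enclosure" type="chemical/x-cml"
          href="http://example.org/data/structure-1.cml"/>
  </entry>
  <entry>
    <id>urn:example:structure-2</id>
    <title>Structure 2</title>
    <updated>2007-12-02T09:30:00Z</updated>
    <link rel="alternate" type="text/html"
          href="http://example.org/summary/structure-2.html"/>
    <link rel="enclosure" type="image/png"
          href="http://example.org/data/structure-2.png"/>
    <link rel="enclosure" type="chemical/x-cml"
          href="http://example.org/data/structure-2.cml"/>
  </entry>
</feed>"""


def parse_time(text):
    """Parse an Atom timestamp; 'Z' is rewritten for older Python versions."""
    return datetime.fromisoformat(text.strip().replace("Z", "+00:00"))


def new_enclosures(feed_xml, since, mime_type="chemical/x-cml"):
    """Return (entry id, href) pairs for data files in entries updated after `since`."""
    root = ET.fromstring(feed_xml)
    found = []
    for entry in root.iter(ATOM + "entry"):
        if parse_time(entry.findtext(ATOM + "updated")) <= since:
            continue  # already seen on a previous harvest
        for link in entry.findall(ATOM + "link"):
            if link.get("rel") == "enclosure" and link.get("type") == mime_type:
                found.append((entry.findtext(ATOM + "id"), link.get("href")))
    return found


last_harvest = datetime(2007, 12, 1, tzinfo=timezone.utc)
to_fetch = new_enclosures(SAMPLE_FEED, last_harvest)
```

Here to_fetch holds only the CML link of the second entry, since the first entry predates the last harvest. A general aggregator can ignore the enclosure links and show the HTML or image links instead, which is the sense in which one static feed file can serve several kinds of client.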

The chemistry community are not yet comfortable with incremental harvesting methods, and so we are planning to offer bulk data downloads through Amazon's S3 service in early 2008.

Data Submission With APP / SWORD

Although the data in CrystalEye is entirely from the published literature, the software could easily be used to provide visualisation and repository features for crystallographic data from any source. The SPECTRa project delivered a submission tool for crystallographic data and work is in progress (Dec 2007) to integrate SPECTRa and CrystalEye using the JISC SWORD extensions to the Atom Publishing Protocol.

Search

Crystallography is not particularly well served by conventional, text-based index and search approaches. CrystalEye provides a small number of the most popular approaches to searching indexed crystallographic data: substructure search and unit cell parameter search.
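To illustrate why such searches differ from text search, a unit cell parameter search is essentially a numeric match within tolerances. The sketch below is a deliberately simplified linear scan over an in-memory index; the class, function names and default tolerances are hypothetical, and it compares cells exactly as given, without the cell reduction a production search would normally apply first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnitCell:
    """Unit cell lengths in angstroms and angles in degrees."""
    a: float
    b: float
    c: float
    alpha: float
    beta: float
    gamma: float


def cell_matches(cell, query, length_tol=0.05, angle_tol=0.5):
    """True if every parameter of `cell` lies within tolerance of `query`."""
    lengths_ok = all(
        abs(x - y) <= length_tol
        for x, y in zip((cell.a, cell.b, cell.c), (query.a, query.b, query.c))
    )
    angles_ok = all(
        abs(x - y) <= angle_tol
        for x, y in zip(
            (cell.alpha, cell.beta, cell.gamma),
            (query.alpha, query.beta, query.gamma),
        )
    )
    return lengths_ok and angles_ok


def search_cells(index, query, length_tol=0.05, angle_tol=0.5):
    """Return the identifiers of all indexed cells matching the query cell."""
    return [
        key
        for key, cell in index.items()
        if cell_matches(cell, query, length_tol, angle_tol)
    ]
```

A query cell measured slightly differently from the published one still finds its match, which a text index over the same numbers would not do.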

Lessons learnt

The evolution of CrystalEye was not painless: we made mistakes that can be learned from, and got some things right that are worth emulating. Some of the key points: -

CrystalEye and the Institutional Repository

Initiatives like CrystalEye provide a distributed alternative to the centralised approach to Institutional Repositories. Clifford Lynch's 2003 definition suggested that an IR was "a set of services" and "most essentially an organizational commitment to the stewardship of ... digital materials". The organic, evolutionary approach exemplified by CrystalEye is compatible with this vision; we suggest that by offering a small number of services (such as web redirection and archiving) the institution can meet this commitment to stewardship without centralising data.