CrystalEye began its life as the backbone of Nick Day's PhD research - looking to find the chemical reasons for limitations in computational molecular structure prediction by comparing real structures (from published crystallography) with those predicted by a computational code. The crystallographic data was scraped from publishers' websites and processed to add chemical information (a heuristic process, also part of the research).
The project evolved: HTML pages of derived data were created to visualise results and then published as a website, and functionality was added both to serve Nick's research and in response to the community. At the same time, the scale of the data increased, bringing other challenges and concerns.
We believe CrystalEye is important to the repositories community because it could not have been achieved using the prevalent centralised repository approach, because the context and first steps are typical of "small science" research, and because we believe there is a key role for initiatives such as CrystalEye in the repository ecology.
This presentation traces the evolution of CrystalEye and discusses:
- The repository-like features in CrystalEye
- Lessons learnt - the technologies, standards and practices that allow for organic growth, and those that prevent it.
- The implications for other "small science" projects.
- The implications for institutional repositories - how efforts like CrystalEye can fit into an IR approach.
When Nick started his thesis he had no clear idea of the scale of the data or its internal structure, and like many scientists found that a large filesystem and web tools gave him all he needed to manage his data. The work consisted of calculations on a large number of disparate datasets, and for each he constructed a directory into which all of the results and ancillary files were put. This is a very natural approach for a large number of experimental and computational scientists in "small science" (as opposed to "big science", e.g. astronomy, high energy physics and geoscience). Like many of the projects in our laboratory, Nick's started with a few thousand data items, and allocating a container (directory) for each is an obvious first approach. Moreover, it is usually possible to create a semi-semantic naming system such that the investigator can normally find information without need for a formal index.
Using HTML, CSS and the Jmol applet is a natural way to provide data visualisation for crystallographic data, so when it became clear that Nick's data was of wider value, the easiest way to share it was to replicate the existing data structure on a public-facing web server, and CrystalEye was born.
Metadata from the original data is carried through to the data files in CML, and enhanced with metadata about the processing. There are plans to expose this metadata as RDF in the near future, available both as RDF/XML resources, and also through a SPARQL endpoint.
CrystalEye offers a browse structure based on the source of the original data. There are no current plans to provide alternative browse trees, in the expectation that others can use the RDF metadata to provide alternative views if they find them useful.
Data Harvesting Using Atom
Because we found it so difficult to reliably harvest data from "web 1.0" websites, we knew we had to make it much easier for others to harvest data from CrystalEye. The extensibility of the Atom Syndication Format made it possible to construct an Atom feed that would render meaningfully in "normal" aggregators using raster images, allow data harvesting and also give richer visualisation in chemistry-aware feed readers. The advantages of using a resource oriented protocol like Atom rather than a service oriented protocol like OAI-PMH were considerable: deployment was easier (no dynamic components), and we gained scalability features out of the box from the web server (Apache HTTPD).
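As a rough illustration of what harvesting from such a feed might look like, the sketch below parses an Atom document and collects the data links from each entry. The sample feed, its URLs, the use of rel="enclosure" links and the "chemical/x-cml" media type are all assumptions made for the example; they are not taken from the actual CrystalEye feeds.

```python
import xml.etree.ElementTree as ET

# Namespace of the Atom Syndication Format (RFC 4287).
ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed: one entry carrying an HTML page, a CML data file
# and a raster image. The layout is illustrative only.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example crystal structure feed</title>
  <id>urn:example:feed</id>
  <updated>2007-12-01T00:00:00Z</updated>
  <entry>
    <title>Structure 1</title>
    <id>urn:example:structure-1</id>
    <updated>2007-12-01T00:00:00Z</updated>
    <link rel="alternate" type="text/html"
          href="http://example.org/structure-1/index.html"/>
    <link rel="enclosure" type="chemical/x-cml"
          href="http://example.org/structure-1/data.cml"/>
    <link rel="enclosure" type="image/png"
          href="http://example.org/structure-1/image.png"/>
  </entry>
</feed>"""


def harvest_links(feed_xml, media_type="chemical/x-cml"):
    """Return (entry id, href) pairs for enclosure links of one media type."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall("atom:entry", ATOM_NS):
        entry_id = entry.findtext("atom:id", default="", namespaces=ATOM_NS)
        for link in entry.findall("atom:link", ATOM_NS):
            if link.get("rel") == "enclosure" and link.get("type") == media_type:
                results.append((entry_id, link.get("href")))
    return results


if __name__ == "__main__":
    for entry_id, href in harvest_links(SAMPLE_FEED):
        print(entry_id, href)
```

Because the feed is a static document, a harvester of this kind needs nothing from the server beyond plain HTTP GET, which is the deployment advantage described above. A general-purpose aggregator would ignore the data enclosure and show only the title, page link and image.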
The chemistry community are not yet comfortable with incremental harvesting methods, and so we are planning to offer bulk data downloads through Amazon's S3 service in early 2008.
Data Submission With APP / SWORD
Although the data in CrystalEye is entirely from the published literature, the software could easily be used to provide visualisation and repository features for crystallographic data from any source. The SPECTRa project delivered a submission tool for crystallographic data, and work is in progress (Dec 2007) to integrate SPECTRa and CrystalEye using the JISC SWORD extensions to the Atom Publishing Protocol.
Crystallography is not particularly well served by conventional, text-based index and search approaches. CrystalEye provides a small number of the most popular approaches to searching indexed crystallographic data: substructure search and unit cell parameter search.
The evolution of CrystalEye was not painless: we made mistakes that can be learned from, and got some things right that are worth emulating. Some of the key points:
- Care about naming and identification from day 1
- Unique identifiers are essential, but take care with data cleaning if reusing identifier schemes.
- Separate the logic for naming things and managing data storage from the processing logic.
- Don't use directory names as a source of metadata. Doing so makes extending the scope of the data difficult later.
- Avoid implementing data processing as a monolithic procedure.
- Focus on the inputs and routines needed for each artifact.
- This is crucial for minimising the work the system has to do, extending functionality and enabling alternative deployment models, all of which become critically important as the data grows in scale.
- Don't throw away intermediate data - it might not seem expensive to recreate it for 1000 data points, but may well be for 100,000.
- Data normalisation is still important: use references to data rather than duplicating data wherever practical.
- Use the same organisational structure for data processing and data publication.
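Several of the points above (keeping naming and storage separate from processing, building each artifact from its own inputs, and keeping intermediate data) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CrystalEye implementation: the Store class, the build_artifact helper, the file names and the toy normalise/summarise steps are all hypothetical.

```python
import tempfile
from pathlib import Path


class Store:
    """Single place that decides where artifacts live (hypothetical layout).

    Processing code asks the store for a location and never builds or
    parses paths itself, so the layout can change without touching it.
    """

    def __init__(self, root):
        self.root = Path(root)

    def path_for(self, identifier, artifact):
        return self.root / identifier / artifact


def build_artifact(store, identifier, artifact, make):
    """Return the stored artifact, calling make() only if it is missing."""
    target = store.path_for(identifier, artifact)
    if target.exists():
        return target.read_text()
    content = make()
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return content


def process(store, identifier, raw, log):
    """Two small steps instead of one monolithic procedure (toy example)."""

    def make_normalised():
        log.append("normalise")
        return raw.strip().upper()

    normalised = build_artifact(store, identifier, "normalised.txt", make_normalised)

    def make_summary():
        log.append("summarise")
        return f"{len(normalised)} characters"

    return build_artifact(store, identifier, "summary.txt", make_summary)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        steps = []
        store = Store(tmp)
        print(process(store, "entry-0001", "  abc ", steps))
        # The second run reuses the stored intermediate and final artifacts.
        print(process(store, "entry-0001", "  abc ", steps))
        print(steps)
```

In this sketch a rerun performs no processing because both artifacts already exist. With a few thousand entries the saving is negligible, but it grows in proportion to the size of the collection.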
CrystalEye and the Institutional Repository
Initiatives like CrystalEye provide a distributed alternative to the centralised approach to Institutional Repositories. Clifford Lynch's 2003 definition suggested that an IR was "a set of services" and "most essentially an organizational commitment to the stewardship of ... digital materials". The organic, evolutionary approach exemplified by CrystalEye is compatible with this vision; we suggest that by offering a small number of services (such as web redirection and archiving) the institution can meet this commitment to stewardship without centralising data.