UniProt RDF

Introduction

Background
Requirements
Technology
Implementation
Challenges

Background

UniProt is a comprehensive repository of protein sequence and annotation data. The number of known protein sequences has been increasing ever since the creation of the database. As a consequence, we have become more dependent on software. Software is used for tasks such as data management and automatic annotation, and for assisting curators during manual annotation. Well-structured data facilitates the task of writing software. Even though some data could be extracted from unstructured text, there is usually no reliable way to modify such data without human intervention. A side effect of introducing more structure is that the complexity of the data increases. Paradoxically, this complicates the task of writing software. What we needed, therefore, was a technology that allowed both us and our users to handle complex data without having to reallocate the entire budget to software development. Further requirements are discussed in the next section.

Requirements

Must be standard

Integration of data from different sources is complicated because every data provider has its own set of conventions. This is true not only for the syntax of a data format, but also for other equally important aspects such as how identifiers and cross references are represented, and how the concepts used to describe the data are themselves defined. Standards that have been around for a while are likely to have several implementations that might be suitable for reuse in a custom application. The usefulness of any standard is of course limited by how much ground it covers. Since a standard technology must by definition be well documented, the documentation we need to provide can focus on more important issues than low-level technical details. Finally, users who are not familiar with a technology are likely to be more motivated to acquire skills that may be of use in different contexts.

Must be explicit

When formatting data for human readers we can make a lot of convenient assumptions. For example, we can assume that most human readers can easily recognize that the string in a certain field is meant to be the name of a database which, when put together with a string in a second field representing an identifier, can be used to retrieve further data. Writing a generic program that is able to accurately recognize such cross references in different data sources is not possible. Instead we have to write specialized programs that have the knowledge about what certain fields mean embedded in them. This kind of work adds up, especially when working with many different data sources. Therefore we would like to have data in a form that allows us to determine automatically what a cross reference is. Another example: A field may contain the string "Mitochondrion". In order to write a program that allows users to issue queries such as "all information mentioning cellular components", code must be written to first extract any terms users may want to query on, and then those terms must be organized into a hierarchy. This step is more often than not too time consuming for users. On the other hand, this task would pose much less of a problem if the data already included explicit definitions of terms and their relationships. Another thing that should be explicit is identifiers. This may seem too basic to deserve being mentioned. Unfortunately not all technologies provide a way to explicitly assign identifiers to resources, or if they do, they leave open the question of how to refer to them from outside of the current document. The examples above show that more explicit data allows us to write more sophisticated software with less effort.

Must fit well

Just about every technology could in principle be applied to any problem. But different types of data call for different solutions. Adoption of inappropriate solutions is a frequent cause of wasted time and energy. The data found in the UniProt database has the following properties:

Must be flexible

Hardly a month passes without our data model being extended in some way. A suitable technology allows us to introduce changes without having to rewrite every application that depends on the data. Currently even relatively minor changes may require weeks of preparation in order to verify and possibly adapt any tools that may be affected. Note that every technology may claim to be easily extensible - relative to some preceding technology. Most technologies that allow extension are relatively limited as to what can be extended and where. Another kind of extensibility concerns the data itself. Since UniProt can be seen as a kind of reference database, it is conceivable that outside parties might want to add their own information either to specific database entries, or even at a more fine-grained level. This again highlights the importance of a well-defined mechanism for cross-referencing data. In order to allow queries to be run over both our data and third-party data, an automatic mechanism for mixing and merging data from different sources is required. Note that this requirement more or less precludes any approaches that rely on rigidly defined data structures.

Must be affordable

Maintaining a biological database should be, above all, a scientific effort, not a technical challenge. Any technology that requires expensive hardware and proprietary software is therefore suspect. A suitable technology must be able to handle several gigabytes of data on common, off-the-shelf hardware. Affordable, of course, must not mean just affordable for us. Not all projects have the resources to invest a lot of effort into understanding and extracting data. Extracting data from files that are both large and have a complex structure is still not a trivial task. Users are often dealing with many different databases. A suitable technology must therefore allow uncomplicated access, comparable in difficulty to extracting data from flat text files by string matching. There is no point in investing a lot of effort into providing more structured data that is too hard to use for most users.

Technology

The Resource Description Framework (RDF) is a standard mechanism for representing arbitrary information in a directed graph structure. Part of the motivation for creating RDF stems from its background in artificial intelligence research: to put data in a form that is more machine understandable. RDF is a core technology for the World Wide Web Consortium's Semantic Web activities. RDF is therefore well suited to work in a distributed and decentralized environment such as the World Wide Web.

The principle behind the RDF data model is very simple. All information is represented as a set of statements consisting of a subject, a predicate and an object. The subject and the predicate are both Resources. A resource is identified with a URI. Note that several URIs may refer to the same resource. The object in a statement may either be a resource or a literal value such as a simple string or number. A mechanism called reification allows statements about statements to be made.
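
The statement model described above can be sketched in a few lines of Python. This is only an illustration: the class names Resource, Literal and Statement and the helper objects_of are assumptions made for this example and are not part of any RDF library or of the UniProt tools.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Resource:
    """A resource, identified by a URI."""
    uri: str


@dataclass(frozen=True)
class Literal:
    """A literal value such as a simple string or number."""
    value: Union[str, int, float]


@dataclass(frozen=True)
class Statement:
    """One statement: the subject and predicate are resources,
    the object is either a resource or a literal."""
    subject: Resource
    predicate: Resource
    object: Union[Resource, Literal]


CORE = "http://purl.uniprot.org/core/"
RDF_TYPE = Resource("http://www.w3.org/1999/02/22-rdf-syntax-ns#type")
protein = Resource("http://purl.uniprot.org/uniprot/P30089")

statements = [
    Statement(protein, RDF_TYPE, Resource(CORE + "Protein")),
    Statement(protein, Resource(CORE + "mnemonic"), Literal("UPA3_HUMAN")),
]


def objects_of(statements: List[Statement], subject: Resource,
               predicate: Resource) -> list:
    """Return the objects of all statements with the given subject and predicate."""
    return [s.object for s in statements
            if s.subject == subject and s.predicate == predicate]
```

All information about a resource is then simply the subset of statements in which it appears as the subject.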

The Web Ontology Language (OWL) allows classes, properties and individuals to be defined. All information in an OWL schema is itself expressed in RDF. Therefore all schema elements are treated as resources and may have URIs, labels for display and textual descriptions attached. Classes are either partial or complete. Complete classes are classes with automatic membership: any individual that fulfills certain conditions is a member of the class. For example, we may define a complete class Obsolete_Protein, which has as members any Protein with a property isReplacedBy something. Classes, properties and individuals can be stated to be equivalent to another resource. We may therefore declare that what we consider an anti-oncogene is what the Gene Ontology Consortium would refer to as a negative regulation of the cell cycle. Properties may be used by several classes. We can also state that a property is symmetric. A symmetric property similarTo allows us to infer that if protein A is similar to another protein B, the inverse must also be true.
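
To make the idea of automatic membership concrete, the following sketch computes the members of the complete class Obsolete_Protein from a set of statements. It does not use an OWL reasoner; statements are plain (subject, predicate, object) tuples with abbreviated names, and the entry identifiers are invented for the example.

```python
from typing import Iterable, Set, Tuple

Statement = Tuple[str, str, str]  # (subject, predicate, object)


def obsolete_proteins(statements: Iterable[Statement]) -> Set[str]:
    """Members of the complete class Obsolete_Protein:
    every Protein that has an isReplacedBy property."""
    statements = list(statements)
    proteins = {s for s, p, o in statements
                if p == "rdf:type" and o == "Protein"}
    replaced = {s for s, p, o in statements if p == "isReplacedBy"}
    return proteins & replaced


# Hypothetical identifiers, for illustration only.
example = [
    ("entry:A", "rdf:type", "Protein"),
    ("entry:A", "isReplacedBy", "entry:B"),
    ("entry:B", "rdf:type", "Protein"),
    ("entry:C", "isReplacedBy", "entry:B"),  # not stated to be a Protein
]
```

Here only entry:A qualifies: entry:B is a Protein without a replacement, and entry:C has a replacement but is not stated to be a Protein.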

OWL allows us to focus on the logical structure of our data rather than a particular physical representation such as XML files or tables in a relational database. Experienced developers know that objects in object-oriented applications or tables in relational databases do not represent real world logical concepts. We would usually avoid creating more classes or tables than there are elements with different structures and behavior. The Annotation class in UniProt for example has dozens of logical subclasses such as Disease_Annotation or Allergen_Annotation. But only few of these subclasses differ in structure.

Since the basic concepts behind RDF are simple, it is not surprising that similar solutions have appeared independently. Most of these solutions are only ever used in the context of a single project or within a limited domain. One noteworthy exception is topic maps. A detailed comparison can be found here. Topic maps have a background in indexing and cataloguing. This makes them less suitable for use in a distributed environment and for reasoning applications. Also, topic maps lack an official schema language. And finally, there seems to be little political momentum behind this technology when compared to RDF.

Implementation

This section shows which data sets have already been made available in RDF format and what tools are used or have been implemented. Note that the ability to use or create such data set independent tools was one of the main motives for this project.

Data sets

The table below provides an overview of the data sets that are available for download. Note that most of these data sets are still maintained in their original formats. Therefore a simple procedure for converting the data must be run regularly. Each data set is distributed in a single, compressed file. The migration guide shows how specific elements in the original formats are mapped to RDF statements. All classes and properties are defined in a single OWL file that can be downloaded along with the data files.

Name           Description                                         Source
UniProt        Protein annotation data                             UniProt Consortium
UniRef         Clusters of proteins with similar sequences
Keywords       Definition of keywords used for protein annotation
Taxonomy       Classification of organisms
Enzymes        Classification of enzymes                           Swiss Institute of Bioinformatics
Gene Ontology  Definition of various biological concepts           Gene Ontology Consortium
IntAct         Protein interaction data, simplified                European Bioinformatics Institute

Each data set is distributed in a single file. Duplication of data within and between different files is avoided. The file containing the UniProt data, for example, contains statements connecting proteins to taxa. Further data on the referenced taxa is stored only in the file containing the taxonomy data set. This approach was chosen for the following reasons:

This is how the different files are linked:

[Figure: Data Sets]

Read like this: uniprot.rdf.gz (describes proteins) references components.rdf.gz (describes cellular components) through an "encoded in" property.

We could consider providing an option to retrieve all data including any referenced data for a specific protein, for example, from a web server. Note that OWL only allows us to describe what relationships may exist, but does not restrict how data is physically packaged and distributed.

Conventions

Data is managed at the level of data sets and of objects corresponding to individual database entries. Statements are therefore always initially generated from such objects and may be used to recreate objects. To ensure that this can be done in a streaming manner, without first having to load all data into memory or a database, all statements that describe a particular object are grouped together. But how do we recognize which statements have been grouped together when reading data? There is no direct language-level or syntax-level support for grouping statements. Different data sets can be separated by storing them in separate files. Unfortunately, the same approach is not practical for individual objects, which may number in the millions. We therefore store all data in such a way that if, when reading, we expect a series of objects of a type T, any statement with the predicate rdf:type and the object T can be considered the first statement of a new object.

x rdf:type T # Start of first top-level object
x rdfs:label 'Foo'
x property y
y rdf:type U
...
z rdf:type T # Start of second top-level object
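
A reader that relies on this convention can be sketched as a small Python generator. This is a minimal illustration, not the actual UniProt code: statements are assumed to arrive as (subject, predicate, object) tuples, and the function name group_objects is made up for the example.

```python
from typing import Iterable, Iterator, List, Tuple

Statement = Tuple[str, str, str]  # (subject, predicate, object)


def group_objects(statements: Iterable[Statement],
                  top_type: str) -> Iterator[List[Statement]]:
    """Yield one list of statements per top-level object.

    A statement with predicate rdf:type and object `top_type` marks the
    start of a new object. Only the current group is kept in memory.
    """
    current: List[Statement] = []
    for statement in statements:
        _, predicate, obj = statement
        if predicate == "rdf:type" and obj == top_type and current:
            yield current
            current = []
        current.append(statement)
    if current:
        yield current


stream = [
    ("x", "rdf:type", "T"),      # start of first top-level object
    ("x", "rdfs:label", "Foo"),
    ("x", "property", "y"),
    ("y", "rdf:type", "U"),      # nested object, stays in the same group
    ("z", "rdf:type", "T"),      # start of second top-level object
]
groups = list(group_objects(stream, "T"))
```

Note that the sketch assumes the input really starts with a marker statement; anything that precedes the first marker would end up in a group of its own.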

Statements that can be efficiently inferred with help of information found in the OWL file are generally not written to disk. Bidirectional and symmetric properties, for example, are only stated in one direction. This also helps avoid cycles in the graph which might otherwise complicate parsing.
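
As an illustration of how such omitted statements can be recovered on reading, the sketch below adds the reverse direction for every property that is known to be symmetric. In practice the set of symmetric properties would come from the OWL file; here it is passed in directly, and the function name expand_symmetric is an assumption of this example.

```python
from typing import Iterable, Set, Tuple

Statement = Tuple[str, str, str]  # (subject, predicate, object)


def expand_symmetric(statements: Iterable[Statement],
                     symmetric_properties: Set[str]) -> Set[Statement]:
    """Return the statements plus the reverse of every statement
    whose predicate is a symmetric property."""
    expanded = set(statements)
    for subject, predicate, obj in list(expanded):
        if predicate in symmetric_properties:
            expanded.add((obj, predicate, subject))
    return expanded


stored = {("A", "similarTo", "B"), ("A", "name", "Protein A")}
inferred = expand_symmetric(stored, {"similarTo"})
```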

Statements consist of resources and literals. Each resource is assigned a unique identifier. We can distinguish several kinds of resources in our data:

To date, only a few databases provide stable URLs for their data. We therefore assign PURLs ourselves, in the following form:

http://purl.uniprot.org/{database}/{identifier}

Note that the resolving authority is always set to uniprot.org. This does not in any way imply that we are responsible for the resource, but states that we know how to resolve the identifier. Identifiers are resolved using a simple system based on URL templates. Setting up a standard-compliant resolving service seems inefficient and overkill, at least for the time being. We do not currently support version numbers. According to the specification, version numbers may be appended to the identifier after another colon.
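
A resolver based on URL templates could look roughly like the following Python sketch. The template table and its example.org target URLs are hypothetical placeholders, not the real UniProt configuration; only the http://purl.uniprot.org/{database}/{identifier} form is taken from the text above.

```python
from urllib.parse import quote, urlsplit

# Hypothetical templates, one per database name (placeholder hosts).
TEMPLATES = {
    "uniprot": "http://example.org/entry/{identifier}",
    "taxonomy": "http://example.org/taxon?id={identifier}",
}


def resolve(purl: str) -> str:
    """Map http://purl.uniprot.org/{database}/{identifier} to a target URL."""
    parts = urlsplit(purl)
    if parts.netloc != "purl.uniprot.org":
        raise ValueError("not a purl.uniprot.org identifier: " + purl)
    database, _, identifier = parts.path.lstrip("/").partition("/")
    if database not in TEMPLATES or not identifier:
        raise LookupError("no template for: " + purl)
    # Escape the identifier so it cannot alter the target URL structure.
    return TEMPLATES[database].format(identifier=quote(identifier, safe=""))
```

Supporting a new database then only requires adding one entry to the table.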

All classes and properties defined by us are located in the http://purl.uniprot.org/core/ namespace. The full name of a resource such as the class Gene is therefore:

http://purl.uniprot.org/core/Gene

Several alternative naming schemes exist - the only requirement is that a resource must be identified by a valid URI. The W3C traditionally uses URIs such as http://purl.uniprot.org/core#Gene. It is common practice (though not required) to set up a web server that will return some relevant documentation when the corresponding URL is requested. The problem with the approach used by the W3C is that the complete documentation for the whole ontology has to be returned every time, as HTTP clients do not transmit the fragment identifier (the part after the '#') to the server. Another option is to use LSIDs, e.g. urn:lsid:uniprot.org:core:Gene. However, the resolution mechanism for LSIDs is a bit heavyweight and not widely supported (and while being able to resolve a URI is not a requirement, a lot of applications can benefit from this, including e.g. Semantic Web crawlers and browsers).
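
The difference between the two naming schemes can be illustrated with Python's standard urllib.parse module, which separates a URI into the part that is requested from the server and the fragment that stays on the client. The helper name requested_url is made up for this example.

```python
from urllib.parse import urldefrag

hash_style = "http://purl.uniprot.org/core#Gene"
slash_style = "http://purl.uniprot.org/core/Gene"


def requested_url(uri: str) -> str:
    """Return the part of the URI that identifies the document to fetch,
    i.e. the URI with any fragment identifier removed."""
    return urldefrag(uri).url


# With the '#' scheme every class maps to the same document ...
same_document = requested_url(hash_style) == requested_url(
    "http://purl.uniprot.org/core#Protein")
# ... whereas with the '/' scheme each class has a URL of its own.
distinct_documents = requested_url(slash_style) != requested_url(
    "http://purl.uniprot.org/core/Protein")
```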

XML syntax

The standard RDF XML syntax allows the same set of statements to be serialized in quite a few different ways. The document structure we chose is intended to be simple to understand, even for users who have no knowledge of RDF.

This is the basic outline of a document (reformatted for readability):

<?xml version="1.0" encoding="UTF-8"?>

<rdf:RDF 
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:owl="http://www.w3.org/2002/07/owl#"
  xmlns="http://purl.uniprot.org/core/"
>
  ...
</rdf:RDF>

The set of namespace prefixes is always the same. Prefixes for the RDFS and OWL namespaces are defined so that we may use standard properties such as rdfs:label, rdfs:comment and owl:sameAs. The default namespace is the namespace in which all classes and properties are defined.

The body consists of a series of non-nested rdf:Description elements:

<rdf:Description rdf:about="http://purl.uniprot.org/uniprot/P30089">
  <rdf:type rdf:resource="http://purl.uniprot.org/core/Protein"/>
  <created>1978-04-24</created> <!-- Simple property with literal value. -->
  <annotation rdf:resource="#_1"/> <!-- Property pointing to another resource. -->
  <annotation rdf:ID="_2" rdf:resource="#_3"/> <!-- Property directly attached to a statement. -->
  ...
</rdf:Description>

<rdf:Description rdf:about="#_1">
  <rdf:type rdf:resource="http://purl.uniprot.org/core/Annotation"/>
  <rdfs:comment>...</rdfs:comment>
  ...
</rdf:Description>

<rdf:Description rdf:about="#_2">
  <note>...</note>
</rdf:Description>

In place of rdf:Description, owl:Thing could have been used. However this does not add any information, and may lead to confusion when dealing with hierarchical data where resources may be both individuals and classes.

As a side note, there are currently several proposals [TriX, RXR] for improving the official syntax, which is often considered verbose and ugly. However, the solutions suggested by most of these proposals would nearly double file sizes. This may not be an issue for small data sets, but seems unrealistic for the data we have. On the other hand, a mechanism for grouping together a set of statements in a file would be welcome.

Finally, a comparison of distribution file sizes for UniProt data in various formats (in GB):

              Flat text  XML  RDF/XML
Uncompressed  2.7        6.0  7.7
Compressed    0.5        0.6  0.7

Parsers

RDF data that is stored in RDF/XML documents can be processed with the help of common SAX or DOM parsers. Alternatively, specialized RDF parsers can be used. RDF parsers support either event-based or graph-based parsing. Event-based parsing is comparable to SAX, except that instead of low-level events, complete statements are passed to the callback handler. Graph-based parsing is comparable to DOM, except that data is loaded into a generic graph rather than a tree structure. Open source RDF parsers are available for most programming languages. Unfortunately none of them met our performance requirements. We were able to improve performance by an order of magnitude by using a custom parser that is limited to the syntax subset described in the previous section. Note that our restricted syntax can be processed by any standard-compliant parser, and we in turn may use a standard-compliant parser as a fallback solution for processing RDF data from external sources.
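
To give an idea of what an event-based parser for the restricted syntax involves, here is a minimal Python sketch built on the standard xml.etree.ElementTree module. It is not the custom parser mentioned above. It assumes flat rdf:Description elements whose children carry either an rdf:resource attribute or literal text, it does not resolve relative identifiers such as "#_1", and it ignores rdf:ID attributes on properties. The names parse_statements and expand are made up for the example.

```python
import io
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DESCRIPTION = "{" + RDF_NS + "}Description"
ABOUT = "{" + RDF_NS + "}about"
RESOURCE = "{" + RDF_NS + "}resource"


def expand(tag):
    """Turn ElementTree's '{namespace}local' form into a full URI."""
    namespace, _, local = tag[1:].partition("}")
    return namespace + local


def parse_statements(source, handler):
    """Call handler(subject, predicate, object, is_literal) for each statement."""
    for _, element in ET.iterparse(source, events=("end",)):
        if element.tag != DESCRIPTION:
            continue
        subject = element.get(ABOUT)
        for prop in element:
            resource = prop.get(RESOURCE)
            if resource is not None:
                handler(subject, expand(prop.tag), resource, False)
            else:
                handler(subject, expand(prop.tag), prop.text or "", True)
        element.clear()  # release the children of the processed element


SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns="http://purl.uniprot.org/core/">
  <rdf:Description rdf:about="http://purl.uniprot.org/uniprot/P30089">
    <rdf:type rdf:resource="http://purl.uniprot.org/core/Protein"/>
    <created>1978-04-24</created>
  </rdf:Description>
</rdf:RDF>
"""

collected = []
parse_statements(io.BytesIO(SAMPLE), lambda *statement: collected.append(statement))
```

Because the handler receives complete statements, the calling code never has to deal with XML elements or attributes directly.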

In the requirements section we stressed the importance of providing low-tech access to our data. For this purpose a simple Perl module was created. The module has no external dependencies. The graph can be navigated through dynamically generated methods that correspond to the predicates attached to a resource. Depending on the method invocation context single values or arrays are returned. Currently this module only allows data to be read.

use strict;
use warnings;
use Expasy::RDF;

my $parser = Expasy::RDF::Parser->new('uniprot.rdf');

# Process the entries one at a time.
while (my $entry = $parser->next)
{
  print 'Fields: ', join(', ', keys %$entry), "\n";

  # Scalar context: a single value is returned.
  my $id = $entry->id;
  my $mass = $entry->sequence->mass;

  # List context: an array of values is returned.
  foreach my $annotation ($entry->annotation)
  {
    print $annotation->type, ': ', $annotation, "\n";
    print 'Authors: ', join(', ', $_->author), "\n"
      foreach ($annotation->citation);
  }
}

$parser->close;

Shorthand syntax

Since the RDF XML syntax is too verbose for viewing and editing data by hand, a shorthand syntax is required. While several proposals exist already, we ended up creating our own flavor as we needed a simple way to express statements about statements, which is something none of the existing solutions provide.

@prefix <http://purl.uniprot.org/core/>
@prefix rdf <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
@prefix rdfs <http://www.w3.org/2000/01/rdf-schema#>
@prefix owl <http://www.w3.org/2002/07/owl#>
@prefix uniprot <http://purl.uniprot.org/uniprot/>
@prefix doi <http://dx.doi.org/>

uniprot:P30089 
  rdf:type Protein
  curated true
  mnemonic 'UPA3_HUMAN'
  name 'Corticotropin-lipotropin precursor'
  citation :1
  rdfs:seeAlso <http://www.expasy.org/cd40lbase/> [:2]

:1
  rdf:type Journal_Citation
  title 'Structure of cDNA clones coding for...'
  author 'Matsubara H.', 'Yamanaka T.'
  name 'Biochem. J.'
  volume '260'
  pages '509-516'
  date 1989
  owl:sameAs doi:10.1002/jcc.10386

:2
  rdf:type Related_Statement
  rdfs:comment 'European CD40L defect database'

Diff and patch

Computing differences between small sets of statements is straightforward if no blank nodes occur in the data. This, incidentally, is a side effect of the syntax restrictions outlined in a previous section. After sorting the statements by subject, the standard diff algorithm can be applied. The only difference is that instead of lines we are dealing with statements. If resource identifiers are stable, we can go one step further and allow patches to be applied in almost any order. The same approach is supported by some version control systems [arch].

Diff diff = new Diff();
Patch patch = diff.createPatch(statements, modifiedStatements);
...
assertEquals(modifiedStatements, patch.apply(statements));
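
The same round trip can be sketched in Python. Note that this sketch swaps in plain set differences for the sort-then-diff procedure described above, which is sufficient as long as there are no blank nodes and the order of statements does not matter. The names Patch, create_patch and apply_patch are assumptions of this example and do not mirror the Java API shown above.

```python
from typing import FrozenSet, NamedTuple, Set, Tuple

Statement = Tuple[str, str, str]  # (subject, predicate, object)


class Patch(NamedTuple):
    removed: FrozenSet[Statement]
    added: FrozenSet[Statement]


def create_patch(old: Set[Statement], new: Set[Statement]) -> Patch:
    """Record which statements disappeared and which were introduced."""
    return Patch(frozenset(old - new), frozenset(new - old))


def apply_patch(statements: Set[Statement], patch: Patch) -> Set[Statement]:
    """Remove the deleted statements, then add the new ones."""
    return (set(statements) - patch.removed) | patch.added


old = {("x", "rdfs:label", "Foo"), ("x", "property", "y")}
new = {("x", "rdfs:label", "Bar"), ("x", "property", "y")}
patch = create_patch(old, new)
```

Two patches that touch different statements can be applied in either order with the same result, which is what makes stable resource identifiers valuable here.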

Visualization

A popular way of visualizing data is to use graph drawing tools [IsaViz]. Unfortunately the output of such tools is rarely useful when the input is complex. Another approach is to generate simple web pages with lists of properties for each resource [Joseki]. Ideally, schema information such as rdfs:labels and rdfs:comments is integrated into such views. If the syntax of the input follows a strict set of rules, XSL style sheets can be used to generate web pages, see screenshot below.

Ontology

The ontology describing all classes and properties was created and is maintained with Protege. Protege does an excellent job of providing a simple user interface for viewing and editing OWL files. One minor issue with Protege is that it does not allow arbitrary URNs as identifiers for owl:Things but instead requires the identifiers to be splittable into a namespace part and an NCName.

[Figure: Protege screenshot]

Wherever possible we use standard RDFS properties rather than introducing our own terminology. In particular:

Unfortunately OWL does not support restrictions on these properties. Another issue is that OWL does not allow us to describe reified statements. In consequence it is not possible to define a complete class Statement_With_Evidence that has as its members all statements to which at least one evidence property is attached.

Triple store

In order to be able to efficiently store and retrieve individual sets of statements and run queries some kind of database is required. In particular such a system must be able to:

Several open source solutions were evaluated but did not meet our performance requirements. To be fair, few projects claim to support such large data sets, and most focus on providing advanced features such as inference capabilities instead. Kowari was the most promising solution, but did not at the time fulfill the last two requirements. We therefore implemented a simple solution based on a relational database.
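
The general idea of keeping statements in a relational database can be illustrated with Python's built-in sqlite3 module. The single-table layout, the index choices and the function names below are assumptions made for this sketch; the actual schema used for UniProt is not described here.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create (if necessary) a single-table statement store."""
    connection = sqlite3.connect(path)
    connection.execute(
        "CREATE TABLE IF NOT EXISTS statements ("
        " subject TEXT NOT NULL,"
        " predicate TEXT NOT NULL,"
        " object TEXT NOT NULL,"
        " is_literal INTEGER NOT NULL)")
    # Indexes for the two most common access paths.
    connection.execute(
        "CREATE INDEX IF NOT EXISTS by_subject ON statements (subject)")
    connection.execute(
        "CREATE INDEX IF NOT EXISTS by_predicate_object"
        " ON statements (predicate, object)")
    return connection


def add(connection, subject, predicate, obj, is_literal=False):
    connection.execute("INSERT INTO statements VALUES (?, ?, ?, ?)",
                       (subject, predicate, obj, int(is_literal)))


def subjects_with(connection, predicate, obj):
    """All subjects that have the given predicate/object pair."""
    rows = connection.execute(
        "SELECT subject FROM statements"
        " WHERE predicate = ? AND object = ? ORDER BY subject",
        (predicate, obj))
    return [row[0] for row in rows]


store = open_store()
add(store, "uniprot:P30089", "rdf:type", "Protein")
add(store, "uniprot:P30089", "mnemonic", "UPA3_HUMAN", is_literal=True)
add(store, ":1", "rdf:type", "Journal_Citation")
```

Because every statement has the same shape, new classes and properties can be stored without any change to the table definition.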

Benefits over a conventional database schema:

Drawbacks:

RDF seems well suited not only for integrated but also for federated approaches to storing data. However it is difficult to achieve reasonable performance when working with large distributed data sets.

Rule engine

In order to detect conflicts between the OWL schema and instance data, or to infer statements that are not stated explicitly, a rule engine was set up. This approach also allows rules that cannot be expressed in OWL (e.g. the value of a property begin must not be greater than the value of another property end of the same resource). A prototype was implemented with Jess. Data from the schema is translated into facts, as is instance data. The rule engine then uses a few general rules to detect certain conflicts.
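
The begin/end rule mentioned above can be written down directly as a check over statements. The following Python sketch is independent of Jess and only illustrates the logic of this one rule; it assumes statements are (subject, predicate, object) tuples with integer positions, and the resource names are invented for the example.

```python
from typing import Iterable, List, Tuple

Statement = Tuple[str, str, int]  # (subject, predicate, integer value)


def find_range_conflicts(statements: Iterable[Statement]) -> List[str]:
    """Return the resources whose 'begin' value is greater than their 'end' value."""
    begins, ends = {}, {}
    for subject, predicate, value in statements:
        if predicate == "begin":
            begins[subject] = value
        elif predicate == "end":
            ends[subject] = value
    return sorted(subject for subject in begins
                  if subject in ends and begins[subject] > ends[subject])


# Hypothetical resources, for illustration only.
facts = [
    ("range:1", "begin", 10), ("range:1", "end", 25),
    ("range:2", "begin", 40), ("range:2", "end", 12),  # violates the rule
    ("range:3", "begin", 7),                           # no end: not checked
]
```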

Challenges

Similar concepts in different databases may be mapped to each other or unified. But users must still have detailed knowledge of precisely what classes and properties are available. Given that even a single database may make use of dozens of concepts, acquiring this knowledge is not a trivial task. Some kind of graphical tool that allows users to browse the schema and construct queries might help. An alternative, though not a complete replacement, is a smart full text search function that is capable of suggesting specific restrictions based on the query text. If the term "human" occurs in a query for proteins, for example, the user may be asked whether they want to restrict the query with something like ?protein organism taxonomy:9606.
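
A very small version of such a suggestion function might look like the Python sketch below. The lookup table is an assumption: only the "human" entry is taken from the example above, and a real implementation would derive the table from the taxonomy and keyword data sets rather than hard-code it.

```python
from typing import Dict, List, Tuple

# Term -> (predicate, object). Only the "human" entry comes from the text.
SUGGESTIONS: Dict[str, Tuple[str, str]] = {
    "human": ("organism", "taxonomy:9606"),
}


def suggest_restrictions(query_text: str,
                         variable: str = "?protein") -> List[str]:
    """Propose query restrictions for known terms in a full text query."""
    suggestions = []
    for word in query_text.lower().split():
        if word in SUGGESTIONS:
            predicate, obj = SUGGESTIONS[word]
            suggestions.append(f"{variable} {predicate} {obj}")
    return suggestions
```

The function only proposes restrictions; whether to apply them is left to the user, as described above.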