expasy4j WebNG

Triple Store

This document describes a simple triple store. After setting up and loading data into the triple store you can run queries and retrieve data through a simple web interface.

Contents

Requirements
Configuration
Database Schema
Data Import
Inference

Requirements

Configuration

Recommended parameters for MySQL:

[mysqld]
key_buffer_size=64M
myisam_sort_buffer_size=512M
myisam_max_sort_file_size=100000000M
myisam_max_extra_sort_file_size=100000000M
bulk_insert_buffer_size=64M

The web application can either be started directly from the project directory, or be deployed to an existing application server. If you choose to deploy the application on an existing server, remember to edit and copy the application.properties configuration file to a directory that is in the classpath of the application server.

Database Schema

Below is a description of the database schema used for storing all data. It is mainly relevant if you want to port the application to another database management system or extend the query engine.

A list of all namespaces is kept in a small table. This table will usually be loaded into memory by applications.

Name    Type                    Description
id      SMALLINT PRIMARY KEY
name    VARCHAR(64)             Full namespace URI
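
To illustrate how an application might use the in-memory namespace table, here is a minimal Python sketch. It is not the project's own code: the names split_uri and NamespaceTable are hypothetical, and splitting a URI at its last '#' or '/' is an assumption about how namespaces are separated from local names.

```python
def split_uri(uri):
    """Split a URI into (namespace, local name) at the last '#' or '/'."""
    cut = max(uri.rfind("#"), uri.rfind("/")) + 1
    return uri[:cut], uri[cut:]


class NamespaceTable:
    """In-memory copy of the namespace table: namespace URI <-> small integer id."""

    def __init__(self):
        self.ids = {}    # namespace URI -> id
        self.names = []  # id -> namespace URI

    def id_for(self, namespace):
        # Assign the next free id the first time a namespace is seen.
        if namespace not in self.ids:
            self.ids[namespace] = len(self.names)
            self.names.append(namespace)
        return self.ids[namespace]

    def encode(self, uri):
        """Turn a full URI into the (namespace id, local name) pair stored per column pair."""
        namespace, local = split_uri(uri)
        return self.id_for(namespace), local

    def decode(self, ns_id, local):
        """Rebuild the full URI from a stored (namespace id, local name) pair."""
        return self.names[ns_id] + local


table = NamespaceTable()
ns_id, local = table.encode("http://example.org/core/Protein")
full_uri = table.decode(ns_id, local)
```

Storing a small integer instead of the full namespace URI keeps the rows of the statements table short, which is presumably why each resource column comes with a separate _ns column.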

All statements are stored in a single (huge) table.

Name            Type                                 Description
id              BIGINT AUTO_INCREMENT PRIMARY KEY    Allows statements to be retrieved in insertion order.
model_ns        SMALLINT NOT NULL                    Model, groups several statements. Allows efficient retrieval of related statements.
model           VARCHAR(16)
statement       VARCHAR(32)                          Statement identifier, required for reified statements.
subject_ns      SMALLINT                             Subject of the statement.
subject         VARCHAR(32) NOT NULL
predicate_ns    SMALLINT NOT NULL                    Predicate of the statement.
predicate       VARCHAR(32) NOT NULL
object_ns       SMALLINT                             Object of the statement, if it is a resource.
object          VARCHAR(32)
object_string   LONGTEXT                             Object of the statement, if it is a literal.
object_number   DOUBLE                               Additional representations of the literal value, for queries.
object_boolean  BOOLEAN
inferred        BOOLEAN NOT NULL DEFAULT 0           Set to true for inferred statements.
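
The three literal columns could be filled along the following lines. This Python sketch only illustrates the idea; the function name literal_columns and the parsing rules (plain "true"/"false" for booleans, anything Python's float accepts as a finite number) are assumptions, not the loader's documented behavior.

```python
import math


def literal_columns(text):
    """Return (object_string, object_number, object_boolean) for a literal value."""
    number = None
    boolean = None
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        boolean = lowered == "true"
    else:
        try:
            candidate = float(text)
            # Ignore "nan" and "inf", which float() accepts but a query cannot compare usefully.
            if math.isfinite(candidate):
                number = candidate
        except ValueError:
            pass  # Not numeric: only the string representation is stored.
    return text, number, boolean
```

Keeping a typed copy next to the string lets a query compare numbers numerically rather than as text.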

Data Import

The procedure for loading data is controlled through an Ant build file. The steps are:

  1. Download and unzip the project file.
  2. Add all libraries in $project/lib to the classpath.
  3. Change to $project/etc/load/.
  4. Make sure a database server is running and an account is set up.
  5. Set connection parameters and paths in the build.properties file.
  6. Copy any *.rdf.gz files you want to load to ${source.path}.
  7. Optionally disable generation of inferred statements by setting ${source.inferred} to no.
  8. If inference is not disabled, you can set ${source.ontology} to a comma-separated list of OWL files.
  9. Run ant clean if there is any previous data.
  10. Run ant load to load the new data.

The last step actually consists of several stages. First, two files suitable for bulk loading into the database are generated (namespaces and statements). These files can be reused if the procedure fails at a later stage. Next, the database tables are created and the data is loaded. Finally, indexes are created. The files containing the SQL commands can be found in the $project/etc/load/sql/ directory.

Note that the database will not be available for the duration of the update. If this is not acceptable and enough disk space is available, data could be loaded into a new database that is swapped with the original database when done.

Both loading time and disk space can be reduced considerably by creating an index only on the model columns. With only this index most queries are no longer practical. Nevertheless it is still possible to retrieve all statements belonging to a specific model within milliseconds.

The procedure shown here does not support incremental updates. Incremental updates could in principle be done, but would only be practical for updates of no more than a few thousand statements at a time. For any larger update, indexes would have to be dropped and recreated. Since index creation is by far the slowest step, a full reload might as well be performed. This is not only simpler but also avoids problems such as table fragmentation.

Database       Time (h)   Disk Space (GB)
MySQL 4.1.4    8          17

Query performance may be improved somewhat by compressing the statements table. Note that when a table is compressed, existing indexes are deleted and must be recreated afterwards. Therefore, comment out the ENABLE KEYS statement in the load.sql file; there is no point in creating all indexes twice.

# Compress table...
myisampack -v -b statements.MYI

# Recreate indexes...
myisamchk -rq -O sort_buffer=512M -O key_buffer=128M --sort-index --analyze statements.MYI

Inference

Support for basic inference is important for queries. Without inference, a query for all resources of a certain type, for example, may not return resources that are instances of a subclass, because superclasses are usually not stated explicitly. A query for all proteins occurring in bacteria may not return any results, because individual proteins only reference specific bacteria, not the taxonomic superkingdom bacteria.

Such additional statements can be generated during the load procedure, provided they can be trivially inferred from the statements that are being processed. This static approach ensures that query performance remains reasonable, but complicates incremental updates. Also, statements must be topologically sorted in advance, so that parents always appear before children. Consequently, the order in which files are processed matters as well. Processing smaller files first happens to work in our case, but in other cases it may be necessary to specify an explicit order.
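
The "parents before children" ordering is a topological sort of the class hierarchy. As an illustration only (the loader's actual sorting code is not shown in this document), the following Python sketch uses the standard library's graphlib module, available from Python 3.9. The function name parents_first and the example class names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def parents_first(parents_of):
    """Order nodes so that every parent precedes all of its children.

    parents_of maps each child to an iterable of its direct parents.
    Raises graphlib.CycleError if the hierarchy contains a cycle.
    """
    # TopologicalSorter expects node -> predecessors, which is exactly child -> parents.
    return list(TopologicalSorter(parents_of).static_order())


order = parents_first({
    "Escherichia coli": ["Bacteria"],
    "Bacteria": ["cellular organisms"],
})
```

In this example "cellular organisms" comes before "Bacteria", which in turn comes before "Escherichia coli".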

The main impact on import performance is the larger data volume (approximately one third, in our case). Because a map of all superclass relations is kept in memory during the import procedure, more memory is used.
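
A single-pass load that keeps the superclass map in memory might look like the following Python sketch. It assumes that statements arrive parents first, as described above, and it writes predicates as plain prefixed strings. The function name load_with_type_inference is hypothetical and the code is not taken from the loader.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def load_with_type_inference(statements):
    """Yield (subject, predicate, object, inferred) tuples in a single pass.

    statements is an iterable of (subject, predicate, object) triples in which
    every subclass statement for a class precedes any use of that class.
    """
    ancestors = defaultdict(set)  # class -> all superclasses seen so far
    for s, p, o in statements:
        yield s, p, o, False
        if p == SUBCLASS_OF:
            # Parents come first, so ancestors[o] is already complete here.
            ancestors[s] |= {o} | ancestors[o]
        elif p == RDF_TYPE:
            # Emit one inferred type statement per superclass of the stated type.
            for superclass in sorted(ancestors[o]):
                yield s, p, superclass, True
```

The fourth element of each tuple corresponds to the inferred column of the statements table. Memory use grows with the size of the class hierarchy rather than with the number of statements.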

Inferred from instance data:

x foo y
y rdfs:subClassOf z
-------------------
x foo z

Inferred from OWL data:

x rdf:type y
y rdfs:subClassOf z
-------------------
x rdf:type z

x p y
p rdfs:subPropertyOf q
----------------------
x q y

x p y
p owl:inverseOf q
-----------------
y q x

x p y
p rdf:type owl:SymmetricProperty
--------------------------------
y p x
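
The three property rules above can be sketched in Python as follows. This is an illustration under stated assumptions, not the loader's implementation: the function name infer_from_properties is hypothetical, the ontology is assumed to have been read into two dictionaries and a set beforehand, and each rule is applied once rather than repeated until no new statements appear.

```python
def infer_from_properties(statements, sub_property_of, inverse_of, symmetric):
    """Return the statements inferred by one application of the property rules.

    statements      -- iterable of (subject, predicate, object) triples
    sub_property_of -- dict mapping a property p to its superproperty q
    inverse_of      -- dict mapping a property p to its inverse q
    symmetric       -- set of properties declared owl:SymmetricProperty
    """
    stated = set(statements)
    inferred = set()
    for s, p, o in stated:
        if p in sub_property_of:   # x p y, p rdfs:subPropertyOf q  =>  x q y
            inferred.add((s, sub_property_of[p], o))
        if p in inverse_of:        # x p y, p owl:inverseOf q       =>  y q x
            inferred.add((o, inverse_of[p], s))
        if p in symmetric:         # x p y, p is symmetric          =>  y p x
            inferred.add((o, p, s))
    # Statements that were already stated explicitly are not reported as inferred.
    return inferred - stated
```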

Information that is not used, for the time being: