This document describes a simple triple store. After setting up and loading data into the triple store you can run queries and retrieve data through a simple web interface.
Recommended parameters for MySQL:
    [mysqld]
    key_buffer_size=64M
    myisam_sort_buffer_size=512MB
    myisam_max_sort_file_size=100000000M
    myisam_max_extra_sort_file_size=100000000M
    bulk_insert_buffer_size=64MB
The web application can either be started directly from the project directory,
or be deployed to an existing application server. If you choose to deploy the application
on an existing server, remember to edit and copy the application.properties
configuration file to a directory that is in
the classpath of the application server.
Below is a description of the database schema used for storing all data. It is mainly relevant to anyone who wants to port the application to another database management system or extend the query engine.
A list of all namespaces is kept in a small table. This table will usually be loaded into memory by applications.
| Name | Type | Description |
|---|---|---|
| id | SMALLINT PRIMARY KEY | |
| name | VARCHAR(64) | Full namespace URI |
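
Because the namespaces table is small enough to hold in memory, an application can translate between full URIs and the (namespace id, local name) pairs stored in the database without touching the database. The following is a minimal sketch of that idea, not the application's actual code. The namespace URIs, the ids, and the function names `split_uri` and `join_uri` are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# Hypothetical in-memory copy of the namespaces table (URI -> id).
NAMESPACES: Dict[str, int] = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": 1,
    "http://www.w3.org/2000/01/rdf-schema#": 2,
    "http://example.org/protein/": 3,
}


def split_uri(uri: str, namespaces: Dict[str, int]) -> Tuple[int, str]:
    """Split a URI into (namespace id, local name).

    The longest matching namespace wins, so nested namespaces resolve
    to the most specific entry.
    """
    best = max((ns for ns in namespaces if uri.startswith(ns)),
               key=len, default=None)
    if best is None:
        raise KeyError(f"no known namespace for {uri!r}")
    return namespaces[best], uri[len(best):]


def join_uri(ns_id: int, local: str, namespaces: Dict[str, int]) -> str:
    """Rebuild the full URI from a (namespace id, local name) pair."""
    by_id = {ns_id_: uri for uri, ns_id_ in namespaces.items()}
    return by_id[ns_id] + local
```

Storing a small integer plus a short local name instead of the full URI keeps the rows of the statements table compact, which is presumably why the schema separates the two.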
All statements are stored in a single (huge) table.
| Name | Type | Description |
|---|---|---|
| id | BIGINT AUTO_INCREMENT PRIMARY KEY | Allows statements to be retrieved in insertion order. |
| model_ns | SMALLINT NOT NULL | Model, groups several statements. Allows efficient retrieval of related statements. |
| model | VARCHAR(16) | |
| statement | VARCHAR(32) | Statement identifier, required for reified statements. |
| subject_ns | SMALLINT | Subject of the statement. |
| subject | VARCHAR(32) NOT NULL | |
| predicate_ns | SMALLINT NOT NULL | Predicate of the statement. |
| predicate | VARCHAR(32) NOT NULL | |
| object_ns | SMALLINT | Object of the statement, if it is a resource. |
| object | VARCHAR(32) | |
| object_string | LONGTEXT | Object of the statement, if it is a literal. |
| object_number | DOUBLE | Additional representations of the literal value, for queries. |
| object_boolean | BOOLEAN | |
| inferred | BOOLEAN NOT NULL DEFAULT 0 | Set to true for inferred statements. |
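
To make the schema more concrete, the sketch below recreates both tables and retrieves all statements of one model with their full URIs. It is an illustration under several assumptions: it uses SQLite (through Python's standard `sqlite3` module) instead of MySQL so that it runs without a server, the column types are adapted accordingly, and the sample namespaces, rows, and the helper names `open_store` and `model_statements` are invented for this example.

```python
import sqlite3

# Schema adapted from the tables above; SQLite stands in for MySQL here.
SCHEMA = """
CREATE TABLE namespaces (id SMALLINT PRIMARY KEY, name VARCHAR(64));
CREATE TABLE statements (
    id INTEGER PRIMARY KEY AUTOINCREMENT,  -- BIGINT AUTO_INCREMENT in MySQL
    model_ns SMALLINT NOT NULL,
    model VARCHAR(16),
    statement VARCHAR(32),
    subject_ns SMALLINT,
    subject VARCHAR(32) NOT NULL,
    predicate_ns SMALLINT NOT NULL,
    predicate VARCHAR(32) NOT NULL,
    object_ns SMALLINT,
    object VARCHAR(32),
    object_string TEXT,                    -- LONGTEXT in MySQL
    object_number DOUBLE,
    object_boolean BOOLEAN,
    inferred BOOLEAN NOT NULL DEFAULT 0
);
"""


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with a few hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO namespaces (id, name) VALUES (?, ?)",
        [(1, "http://example.org/protein/"),
         (2, "http://example.org/core/"),
         (3, "http://example.org/taxon/")],
    )
    conn.executemany(
        "INSERT INTO statements (model_ns, model, subject_ns, subject,"
        " predicate_ns, predicate, object_ns, object,"
        " object_string, object_number)"
        " VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
        [
            # Resource object: namespace id plus local name.
            (1, "P1", 1, "P1", 2, "organism", 3, "562", None, None),
            # Literal object: stored as a string only.
            (1, "P1", 1, "P1", 2, "name", None, None, "Example protein", None),
            # Numeric literal: string plus a numeric copy for queries.
            (1, "P1", 1, "P1", 2, "mass", None, None, "12345", 12345.0),
        ],
    )
    return conn


def model_statements(conn, model_ns, model):
    """Return (subject, predicate, object) for one model, in insertion order."""
    return conn.execute(
        """
        SELECT COALESCE(sn.name, '') || s.subject,
               pn.name || s.predicate,
               COALESCE(obn.name || s.object, s.object_string)
        FROM statements s
        LEFT JOIN namespaces sn ON sn.id = s.subject_ns
        JOIN namespaces pn ON pn.id = s.predicate_ns
        LEFT JOIN namespaces obn ON obn.id = s.object_ns
        WHERE s.model_ns = ? AND s.model = ?
        ORDER BY s.id
        """,
        (model_ns, model),
    ).fetchall()
```

The `object_number` column is what allows a numeric comparison such as `WHERE object_number > 10000`, which would not work reliably against the string form in `object_string`.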
The procedure for loading data is controlled through an Ant build file. The steps are:
1. Add $project/lib to the classpath.
2. Change to $project/etc/load/.
3. Edit the build.properties file.
4. Copy the *.rdf.gz files you want to load to ${source.path}.
5. Set ${source.inferred} to no.
6. Set ${source.ontology} to a comma separated list of OWL files.
7. Run ant clean if there is any previous data.
8. Run ant load to load the new data.

The last step actually involves several steps. First, two files suitable for
bulk loading into the database are generated (namespaces and statements).
These files can be reused if the procedure fails at a later stage.
Next, the database tables are created, and the data is
loaded. Finally, indexes are created. The files containing the SQL commands
can be found in the $project/etc/load/sql/ directory.
Note that the database will not be available for the duration of the update. If this is not acceptable and enough disk space is available, data could be loaded into a new database that is swapped with the original database when done.
Both loading time and disk space can be reduced considerably by creating an
index only on the model columns. With only this index
most queries are no longer practical. Nevertheless it is still possible to
retrieve all statements belonging to a specific model within milliseconds.
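
The effect of indexing only the model columns can be illustrated with a query planner. The sketch below is an assumption-laden illustration, not a measurement: it uses SQLite rather than MySQL, a reduced version of the statements table, and a hypothetical index name `statements_model`. It asks the planner how it would execute a lookup by model and a lookup by predicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE statements (
    id INTEGER PRIMARY KEY,
    model_ns SMALLINT NOT NULL,
    model VARCHAR(16),
    predicate_ns SMALLINT NOT NULL,
    predicate VARCHAR(32) NOT NULL
);
-- The only index: the two model columns.
CREATE INDEX statements_model ON statements (model_ns, model);
""")


def plan(sql, params):
    """Return SQLite's query plan for a statement as a single string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The human-readable detail is the fourth column of each plan row.
    return " / ".join(row[3] for row in rows)


# Lookup by model: can be answered through the index.
by_model = plan(
    "SELECT * FROM statements WHERE model_ns = ? AND model = ?", (1, "P1"))

# Lookup by predicate: no usable index, so the whole table is scanned.
by_predicate = plan(
    "SELECT * FROM statements WHERE predicate_ns = ? AND predicate = ?",
    (2, "name"))

print(by_model)
print(by_predicate)
```

The exact wording of the plan differs between SQLite versions, and MySQL's `EXPLAIN` output looks different again, but the distinction between an index search and a full table scan is the same.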
The procedure shown here does not support incremental updates. Incremental updates could in principle be done, but would only be practical for updates of no more than a few thousand statements at a time. For any larger updates, indexes would have to be dropped and recreated. Since index creation is by far the slowest step, a full reload might as well be performed. This is not only simpler but also avoids problems such as table fragmentation.
| Database | Time (h) | Disk Space (GB) |
|---|---|---|
| MySQL 4.1.4 | 8 | 17 |
Query performance may be improved a bit by compressing
the statements table. Note that when a table is compressed, existing indexes are
deleted and must be recreated afterwards. Therefore comment out the ENABLE KEYS
statement in the load.sql file - no point in creating all indexes twice.
    # Compress table...
    myisampack -v -b statements.MYI
    # Recreate indexes...
    myisamchk -rq -O sort_buffer=512M -O key_buffer=128M --sort-index --analyze statements.MYI
Support for basic inference is important for queries. Without inference, a query for all resources of a certain type, for example, may not return resources that are instances of a subclass, because superclasses are usually not stated explicitly. A query for all proteins occurring in bacteria may not return any results, because individual proteins only reference specific bacteria, not the taxonomic superkingdom bacteria.
Such additional statements can be generated during the load procedure, provided they can be trivially inferred from the statements that are being processed. This static approach ensures that query performance remains reasonable, but complicates incremental updates. Also, statements must be topologically sorted in advance, so that parents always appear before children. In consequence the order in which files are processed is important as well. Processing smaller files first happens to work in our case, but in other cases it may be necessary to specify an explicit order.
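
The requirement that parents appear before children amounts to a topological sort of the subclass hierarchy. As a rough sketch of what that ordering means (the loader itself may achieve it differently, for example through file order), the following uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). The function name `parents_first` and the class names in the usage example are hypothetical.

```python
from graphlib import TopologicalSorter


def parents_first(subclass_pairs):
    """Order classes so that every parent precedes all of its children.

    subclass_pairs is an iterable of (child, parent) tuples.
    Raises graphlib.CycleError if the hierarchy contains a cycle.
    """
    graph = {}
    for child, parent in subclass_pairs:
        # TopologicalSorter expects node -> set of predecessors,
        # so each child lists its parents as predecessors.
        graph.setdefault(child, set()).add(parent)
        graph.setdefault(parent, set())
    return list(TopologicalSorter(graph).static_order())


order = parents_first([
    ("Ecoli", "Enterobacteria"),
    ("Enterobacteria", "Bacteria"),
    ("Bacillus", "Bacteria"),
])
```

With this input, `Bacteria` is guaranteed to come before `Enterobacteria` and `Bacillus`, and `Enterobacteria` before `Ecoli`. The relative order of unrelated classes is not specified.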
The main impact on import performance is the larger data volume (approximately one third, in our case). Because a map of all superclass relations is kept in memory during the import procedure, more memory is used.
Inferred from instance data:
    x foo y
    y rdfs:subClassOf z
    -----------------
    x foo z
Inferred from OWL data:
    x rdf:type y
    y rdfs:subClassOf z
    -----------------
    x rdf:type z

    x p y
    p rdfs:subPropertyOf q
    ----------------------
    x q y

    x p y
    p owl:inverseOf q
    -----------------
    y q x

    x p y
    p rdf:type owl:SymmetricProperty
    --------------------------------
    y p x
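
The rules above can be applied in a single pass over the instance statements once the superclass and superproperty relations are held in memory. The sketch below is one possible way to do this, not the loader's actual implementation. Statements are modelled as plain (subject, predicate, object) tuples with prefixed names, and the function names `transitive_closure` and `infer` are assumptions made for this example. Because it makes only one pass, statements that would follow from other inferred statements are not generated.

```python
SUBCLASS = "rdfs:subClassOf"
SUBPROPERTY = "rdfs:subPropertyOf"
INVERSE = "owl:inverseOf"
RDF_TYPE = "rdf:type"
SYMMETRIC = "owl:SymmetricProperty"


def transitive_closure(pairs):
    """Map each child to the set of all its direct and indirect parents."""
    direct = {}
    for child, parent in pairs:
        direct.setdefault(child, set()).add(parent)
    closed = {}
    for start in direct:
        seen, stack = set(), list(direct[start])
        while stack:
            node = stack.pop()
            if node not in seen:  # also guards against cycles
                seen.add(node)
                stack.extend(direct.get(node, ()))
        closed[start] = seen
    return closed


def infer(statements, ontology):
    """Return the statements that follow from one pass over the rules."""
    statements, ontology = set(statements), set(ontology)
    known = statements | ontology
    # Subclass relations may come from instance data or from OWL data.
    supers = transitive_closure(
        (s, o) for s, p, o in known if p == SUBCLASS)
    superprops = transitive_closure(
        (s, o) for s, p, o in ontology if p == SUBPROPERTY)
    inverses = {}
    for s, p, o in ontology:
        if p == INVERSE:
            inverses.setdefault(s, set()).add(o)
    symmetric = {s for s, p, o in ontology
                 if p == RDF_TYPE and o == SYMMETRIC}

    inferred = set()
    for s, p, o in statements:
        for z in supers.get(o, ()):       # x foo y, y subClassOf z
            inferred.add((s, p, z))       # (covers the rdf:type rule too)
        for q in superprops.get(p, ()):   # p subPropertyOf q
            inferred.add((s, q, o))
        for q in inverses.get(p, ()):     # p inverseOf q
            inferred.add((o, q, s))
        if p in symmetric:                # p is a symmetric property
            inferred.add((o, p, s))
    return inferred - statements
```

For example, given the instance statement `("P1", "organism", "Ecoli")` together with `("Ecoli", "rdfs:subClassOf", "Bacteria")`, the function yields `("P1", "organism", "Bacteria")`, which is the kind of statement that lets a query for all proteins occurring in bacteria succeed.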
Information that is not used, for the time being:
- owl:equivalentClass and owl:equivalentProperty
- owl:sameAs