After being frustrated with the poor performance of JSON data stores, it seemed like a different approach might be an improvement. Triple stores seemed interesting. After some exploring, it turned out that triple stores and graph databases were pretty poor in the performance area as well. Import and startup times were measured in minutes or even hours for just 10M or 100M triples.
As a start this page is intended as a place to offer people a chance to comment and get involved as the project progresses. Feel free to send comments, feedback, or initiate general discussion by sending email to email@example.com.
A bit of encouragement came from looking at graph databases. Neo4J seems to be one of the top graph databases and it uses 100K triples for samples and testing. I've been using 100M. Opo is pretty memory and CPU friendly by that comparison.
The N-Quads/N-Triples parser is working! It all fell into place with the rest of the parts already having been developed. Opo was ready for some benchmarks and a comparison to the JSON import.
N-Triples is a rather verbose format, mostly due to the need to specify the full IRI for each subject, predicate, and object. An N-Triples file for 210 million triples is 17GB in size while a JSON file with the same data is 3.4GB. Both generate the same number of quads, but of course the N-Triples generate RDF-conformant quads while the JSON generates quads using literals for subject, predicate, and object. Those are still stored as quads but are not RDF.
It takes a bit longer to process the N-Triples file, which is about 5 times larger than the JSON. Importing the 210M N-Triples took 105 seconds while importing the JSON with the same number of quads took 39 seconds. That is roughly 2.7 times slower for the N-Triples, which is still better than would be expected if file reading were the bottleneck, and not bad when compared to other RDF stores.
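The figures above can be turned into rates as a quick sanity check. The sketch below uses only the numbers quoted in this entry (210M quads, 17GB and 3.4GB files, 105 and 39 seconds). Treating 1GB as 10^9 bytes is an assumption, and the function and variable names are made up for illustration.

```python
# Figures quoted in the entry above; 1 GB is assumed to mean 10**9 bytes.
QUADS = 210_000_000
IMPORTS = {
    "N-Triples": {"size_gb": 17.0, "seconds": 105.0},
    "JSON": {"size_gb": 3.4, "seconds": 39.0},
}


def rates(size_gb, seconds, quads=QUADS):
    """Return (quads imported per second, megabytes read per second)."""
    return quads / seconds, size_gb * 1000.0 / seconds


for name, run in IMPORTS.items():
    qps, mbps = rates(run["size_gb"], run["seconds"])
    print(f"{name}: {qps / 1e6:.1f}M quads/sec, {mbps:.0f} MB/sec")

# How much larger the N-Triples file is, and how much longer it took.
size_ratio = IMPORTS["N-Triples"]["size_gb"] / IMPORTS["JSON"]["size_gb"]
time_ratio = IMPORTS["N-Triples"]["seconds"] / IMPORTS["JSON"]["seconds"]
print(f"size ratio {size_ratio:.1f}x, time ratio {time_ratio:.1f}x")
```

The time ratio comes out well below the size ratio, so the N-Triples import consumes bytes faster than the JSON import does. That is consistent with file reading not being the limiting factor.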
The memory use for both the N-Triples and JSON imports was about the same, most likely because most of the memory goes to storing the quads. Strings or IRIs were a fraction of the quads' use of memory. Each import generates the same number of blanks and literals. There was some difference in memory use due to the IRIs with N-Triples but that was small when compared to the quads' footprint. The negligible difference in size between IRI and string storage is due to the efficient manner in which opo stores IRIs.
All in all, N-Triples and N-Quads imports perform reasonably well, so the next step is writing a Ruby gem for the Ruby RDF gems that allows storing in opo, similar to the way RDF records are stored in MongoDB.
This past week was spent updating the HTTP REST APIs to support quads. While in that part of the code, the APIs were cleaned up and documented with HTML pages. Those pages are built into the opo daemon.
The built-in pages will include data access pages and forms to change settings. For now just the documentation is being added as features are implemented. That means the pages are fairly sparse but do provide an idea of what is to come.
The built-in pages are included here as well and a 'Build In Pages' link has been added in the upper right of this page.
After discovering RDF.rb and its parent, Ruby RDF, and sending an email to Gregg Kellogg, it was clear some rework was needed on opo. Opo had to support not just named trees but a quad store instead of a triple store if it was to support RDF and SPARQL. Most of that realization was due to a bad assumption on my part. I had assumed a "graph" was the dictionary definition of a directed graph. Unfortunately, in the RDF world a graph is a collection of triples that may or may not have any relation to each other outside of being in the same graph. So an RDF graph is really just a collection of triples, not a traditional graph.
With the changes to opo the store is now a quad store, but it still supports directed graphs, which are now referred to as trees. JSON is still parsed into triples, or now quads, so opo is still a fast JSON store as well as an RDF store.
Thanks to Gregg's comments the current direction of opo development is to provide a Ruby gem similar to the rdf-mongo gem that will allow opo to be used as a quad store while relying on the Ruby-RDF collection to deal with the non-store features. Backend opo features will be added incrementally after that.
After generating the N-Triples for the same 7 million JSON records, the difference in file size is huge. The JSON was generated from a CSV file which is 658M. The JSON file with minimal spacing is 3.4G. That is quite an expansion but expected, as every entry now includes the header information that was only entered once in the CSV. The N-Triples file is even larger at 20G. In all three cases the number of triples is around 210 million. This is the large data set used for testing opo. Larger sets will be tested later. The expected maximum for the desktop machine I'm using is about 1 billion triples.
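The expansion from CSV to JSON is easy to reproduce on a small scale. The sketch below uses made-up sample rows and column names, not the actual test data. It converts each CSV row to a compact JSON object, so every record repeats the column names that the CSV states only once.

```python
import csv
import io
import json

# Hypothetical sample data; the real CSV used for testing is not shown here.
CSV_TEXT = (
    "id,first_name,last_name,city\n"
    "1,Ann,Lee,Oslo\n"
    "2,Bob,Ray,Rome\n"
    "3,Cy,Fox,Lima\n"
)


def csv_to_json_lines(text):
    """Convert CSV text to one compact JSON object per data row."""
    reader = csv.DictReader(io.StringIO(text))
    # separators=(",", ":") drops the optional spaces for minimal output.
    return [json.dumps(row, separators=(",", ":")) for row in reader]


records = csv_to_json_lines(CSV_TEXT)
json_text = "\n".join(records) + "\n"
print(records[0])
print(len(CSV_TEXT), "bytes of CSV ->", len(json_text), "bytes of JSON")
```

The growth factor depends on how long the column names are relative to the values, so the ratio from this toy input says nothing about the ratio seen with the real data set.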
Jumping into RDF and the most basic representation, N-Triples, meant identifying the mapping between a generic triple store and RDF, which is considerably more restrictive. Mapping was pretty straightforward with the exception of the strictness of RDF matching. "1"^^<http://www.w3.org/2001/XMLSchema#integer> is different than "01"^^<http://www.w3.org/2001/XMLSchema#integer> which is not the same as "1"^^<https://www.w3.org/TR/xmlschema11-2/#integer>. The opo core stores data items according to the fundamental type so all three examples resolve to the same value, 1. The design assumes that a user intends such values to be the same when using SPARQL or other queries to extract data. Fortunately there is a way to revert to the stricter mode by providing an option to not convert strings to native types.
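To make the difference between the two modes concrete, here is a minimal sketch of that kind of normalization. It is not opo's code. The function name, the `convert` flag, and the rule of looking only at the fragment after `#` in the datatype IRI are all assumptions made for illustration.

```python
import re

# Matches an N-Triples style typed literal such as "1"^^<http://...#integer>.
# Escaped quotes inside the lexical form are not handled in this sketch.
TYPED_LITERAL = re.compile(r'^"(?P<lexical>[^"]*)"\^\^<(?P<datatype>[^>]+)>$')


def to_native(term, convert=True):
    """Return a native value for a typed literal, or the term unchanged.

    With convert=False the term is kept verbatim, which mimics the strict
    behaviour where differing lexical forms or datatype IRIs stay distinct.
    """
    match = TYPED_LITERAL.match(term)
    if not convert or match is None:
        return term
    # Assumption: only the fragment after '#' decides the fundamental type.
    kind = match.group("datatype").rsplit("#", 1)[-1]
    lexical = match.group("lexical")
    try:
        if kind == "integer":
            return int(lexical)
        if kind in ("decimal", "double"):
            return float(lexical)
    except ValueError:
        pass  # malformed lexical form; fall through and keep it as written
    return term


examples = [
    '"1"^^<http://www.w3.org/2001/XMLSchema#integer>',
    '"01"^^<http://www.w3.org/2001/XMLSchema#integer>',
    '"1"^^<https://www.w3.org/TR/xmlschema11-2/#integer>',
]
print([to_native(e) for e in examples])                     # converted values
print(len({to_native(e, convert=False) for e in examples}))  # distinct terms
```

With conversion on, all three examples collapse to the integer 1. With it off, they remain three distinct terms.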
Next was coming up with a better format for dumping and importing. JSON works well as long as all triples are part of one or more graphs and there are no named graphs, but to support named graphs and triples that are not part of a graph a new format was needed. For lack of a better name the dump format is referred to as the Opo Graph or 'og' format. A comparison of N-Triples, JSON, and the og format for representing data is shown here.
The og format is intentionally different than N-Triples and JSON but it does share some similarities. The format had to be different enough that no one would confuse one with the other. There are several options possible for including the internal identifier for triples and for blank nodes so that the REST API can be used to map to the output.
Minimal details needed to be able to import and recreate the data store. Indented only for clarity in this case.
Verbose mode includes the identifiers for blank nodes and in the comment the identifier of the triple associated with the line.
JSON does not allow for named graphs nor separate triples. Note this is not JSON-LD but just plain old JSON.
The N-Triples format is fairly verbose and, like plain old JSON, it does not support named graphs directly, although blank nodes can have a name, providing similar functionality.
As a JSON store the project is usable with some limitations. Create, Read, Update, and Delete are supported. Search has not yet been implemented and journalling is also on the queue. There is enough there to be able to exercise the API though.
With the introduction of multiple query evaluation threads the performance is a bit better, with 150,000 GETs per second and a latency of 80 microseconds. There is a gain from using additional threads, but then there is the overhead of making sure modifications are atomic and search results are consistent for a specific snapshot.
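Throughput and latency together hint at how many requests were being handled at once. The sketch below applies Little's law (average concurrency equals throughput times latency) to the two figures quoted above. It assumes the latency is the mean time per request and that the load was steady, neither of which is stated in this entry.

```python
def requests_in_flight(throughput_per_sec, latency_sec):
    """Little's law: average concurrency = throughput x latency."""
    return throughput_per_sec * latency_sec


# 150,000 GETs per second at 80 microseconds each.
print(round(requests_in_flight(150_000, 80e-6), 1))
```

Under those assumptions the numbers correspond to roughly a dozen requests in flight on average.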
Next up is support for RDF followed closely by TURTLE. An efficient data structure for URIs or IRIs is needed as RDF is all about using URIs for everything. Okay, not everything, there are literals and blanks but everything else is a URI/IRI.
Another challenge is how to map JSON to RDF nicely. Maybe providing a default namespace is enough to represent JSON as valid RDF triples. If you want to share your opinion or help set the direction for opo development send me an email at firstname.lastname@example.org.
At first glance JSON-LD didn't seem very useful as a way to describe the triples in a store. When compared to TURTLE, TURTLE seemed like the better choice. JSON-LD is gaining a lot of traction though, in large part due to Google promoting it. JSON-LD will be on the list of supported imports for the opo triple store.
Any old JSON is what I've been using for imports and graph representation so far. That works well with a few simple rules on how to convert JSON to and from triple graphs. Even JSON arrays with order preserved are working with the current code.
The REST API is coming along nicely. POST, PUT, and GET are all working for JSON graphs. Named graphs were added to support the PUT operation. Next up is DELETE. SPARQL, TURTLE, and JSON-LD will come later.
The approach taken was to build a very generic triple store as the base. That decision is proving to be the correct choice, as analysis indicates the triple store will be able to support a variety of representations from JSON to RDF, TURTLE, and JSON-LD.
Storing JSON as triples maps vertices in a JSON document to blank nodes and the names of JSON object members to predicates. Leaf values are objects of the correct type. Arrays are not unlike objects except that order must be preserved and the members have null predicates.
Opo maintains additional information to be able to recreate the original JSON in the same order. That information is not visible as part of the RDF style representation.
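The mapping described above can be sketched in a few lines. This is only an illustration of the idea, not how opo is implemented. The blank node labels (`_:b0`, `_:b1`, ...) and the function name are hypothetical, and list order stands in for whatever extra ordering information opo keeps internally.

```python
import itertools


def json_to_triples(value):
    """Flatten a parsed JSON value into (subject, predicate, object) triples.

    Objects and arrays become blank nodes labelled _:b0, _:b1, ... in the
    order they are first visited. Object members use the member name as the
    predicate. Array members use None as the predicate and rely on their
    position in the returned list to preserve order.
    """
    counter = itertools.count()
    triples = []

    def visit(node):
        if isinstance(node, dict):
            subject = f"_:b{next(counter)}"
            for name, child in node.items():
                triples.append((subject, name, visit(child)))
            return subject
        if isinstance(node, list):
            subject = f"_:b{next(counter)}"
            for child in node:
                triples.append((subject, None, visit(child)))
            return subject
        # Leaf value (str, int, float, bool, or None) keeps its native type.
        # A real store would tag blank nodes so they cannot be confused
        # with string leaves that happen to look like "_:b1".
        return node

    root = visit(value)
    return root, triples


doc = {"name": "opo", "tags": ["fast", "small"], "stats": {"quads": 210}}
root, triples = json_to_triples(doc)
for triple in triples:
    print(triple)
```

Because a child is visited before the triple linking it to its parent is appended, the triples of a nested object or array appear ahead of the triple that points to it.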
I worked on getting the core engine wrapped with an HTTP REST API. By reusing the wush (websockets and HTTP server) library from Piper Push Cache it wasn't too difficult. That also gives the same scalability as Piper Push Cache.
The REST API supports JSON and triples. Only GETs of single entities and lists with paging are supported so far, but that was enough to run some benchmarks. Results are rounded to 2 significant digits. The HTTP requests were a simple GET for a 30 element JSON document given the identifier of the document.
I'm pretty pleased with the results so far given the effort is still in the proof of concept phase.
|Connections||Latency (msecs)||Throughput (GETS/sec)|
The first set of benchmarks looks great! The current state is a minimal set of functionality to test the engine. Importing over 7 million JSON records, or more than 200 million triples, and persisting them to disk took only 27 seconds. The records are fully indexed so fast queries should be possible. Restarting took 6 seconds and dumping to a JSON file took 27 seconds to write a 3.2GB file.
There is no restriction on the JSON file structure. Any JSON is fine. The JSON does not need to be JSON-LD or one of the other RDF/JSON alternatives. Just simple JSON. I didn't see the need to over-engineer, as that can always be added later if necessary.