It all began when exploring triple stores and graph databases and noting that those available were rather slow. As experimentation continued it seemed that JSON stores were also slow by comparison. The blog entries here provide a bit of the history and the latest musings over the development of OpO.
As a start this page is intended not only as a blog but a place to encourage comments and get involved as the project progresses. Feel free to send comments, feedback, or initiate general discussion by sending email to email@example.com.
The Agoo release with WebSocket and SSE support is ready. I put up a PR for an addition to the Rack spec. It took a little while to get the spec addition finalized but it is ready now. It was nice working with Bo to get both server gems using the same API.
Next up is blending in the Agoo handling into OpO but with support for more connections.
While implementing support for Rack it became clear that using a common request handler for Agoo and OpO has some advantages. That effort is underway and includes support for Rack, WebSockets, and SSE. Of course OpO will include a few extra options.
I've been collaborating with Bo, the author of Iodine which is another Ruby gem web server. We are refining a Rack push spec that we can both implement.
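The push API we converged on can be sketched as a plain Rack app. The `rack.upgrade?` and `rack.upgrade` env keys and the callback names below follow my understanding of the proposed spec and may differ from the final PR text:

```ruby
# Sketch of a Rack app using the proposed rack.upgrade extension.
# The env keys and callback names follow the spec proposal as I
# understand it; check the final PR for the authoritative version.
class ChatHandler
  def on_open(client)
    client.write('welcome')           # push a frame to this client
  end

  def on_message(client, data)
    client.write("echo: #{data}")     # echo WebSocket messages back
  end

  def on_close(client)
    # connection is gone; clean up any per-client state here
  end
end

class ChatApp
  def call(env)
    if env['rack.upgrade?'] == :websocket
      env['rack.upgrade'] = ChatHandler.new   # ask the server to upgrade
      return [101, {}, []]
    end
    [200, { 'Content-Type' => 'text/plain' }, ['no upgrade requested']]
  end
end
```

The appeal of this shape is that the app never touches the socket; the server owns the connection and just calls back into the handler.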
After releasing Agoo it only made sense to open source the 'hoe' application used for benchmarking. After a rename, 'hoe' was released as perfer.
Next up has not changed. It is adding Ruby rack support to WAB and OpO. Then on to some tweaks to OpO for faster heavy load performance.
Agoo is Japanese for a type of flying fish and is also the name of a new Ruby gem. Agoo, the gem, is a derivative of the OpO code base with a few changes that will be worked into OpO as well. Agoo is a Ruby Rack-compatible web server that, like OpO, is fast. The benchmarks will be updated soon with values from the desktop but for now they are from a laptop. The first release should be out before the end of this month.
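For anyone wanting to try it, a minimal Rack app is all Agoo needs. The rackup invocation in the comment is from memory of the gem's README and may differ by version:

```ruby
# config.ru — a minimal Rack app to try under Agoo.
# Run (assuming the agoo gem is installed) with something like:
#   rackup -r agoo -s agoo
# The exact rackup flags are from memory and may vary by version.
app = lambda do |env|
  [200, { 'Content-Type' => 'text/plain' }, ["hello from #{env['PATH_INFO']}"]]
end
run app if respond_to?(:run)   # guard so the file also loads outside rackup
```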
Next up is adding Ruby rack support to WAB and OpO. Then on to some tweaks to OpO for faster heavy load performance.
The opo-c client has been made public on GitHub as OpO-C. With my desktop rig the benchmarks come in at over 400K queries/second with latency of around 100 microseconds. Satisfying results. I would love to try benchmarking on the new MacPro with 20 cores running Ubuntu but that is a bit expensive for just a benchmark.
Ok, I don't like the web site. That will have to change over the next week or two.
Nothing much in the last 5 days except a redesign of the home page.
More coming next year.
Early results show the client to be faster than the HTTP API. That's before one more significant optimization. With the C client API OpO may hit over 1 million fetches per second given a fast enough network connection. The bottleneck appears to be the network at this point. More benchmarking will tell. For a performance geek like myself that is exciting.
Added another lesson to the WABuR tutorial as well to make sure it keeps going.
Next up is a client API for OpO. In preparation for that a new binary wire format is underway. It looks promising: simple to parse, preallocation possible when receiving, and built directly with no intermediate structures. That's being done in a new opo-c repo. The repo is private for now but will become public and OSS once the server side is completed. Looking to exceed one million requests per second.
WABuR is still being worked on. Flip flopping back and forth. The tutorial is progressing and additional tests are being added.
Finally finished journalling. It works as expected with not much overhead. That was nice.
The web site was updated. I think it is more interesting. The benchmarks turned out very well.
Lots of work on WABuR in the last couple of weeks. It's looking good. OpO-Rub is being tested while implementing WABuR and a few new features and bug fixes are going out today.
Next week is preparation for a Tokyo presentation. Should be interesting.
Not finished yet but ran the first benchmarks on a partially completed opo-rub and the results were 6 times faster than opo using stdio. That is on a laptop so if it scales to the desktop that is about 100K requests a second. Latency was a quarter of the stdio approach as well.
I was not able to get multiple Ruby threads processing requests. For some reason the primary Ruby thread does not play well with created Ruby threads. Something to dig into in the future as it might give a little more performance. At first it seemed like hooking up the embedded Ruby would be a simple task. After three design changes that thought went out the window. In the end it is one pthread to start and run Ruby. All requests go on a queue and the Ruby loop pulls them off and processes them.
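In plain Ruby the queue design looks roughly like this. The real code is C using the embedded Ruby API, so everything here is illustrative:

```ruby
# Illustrative sketch of the one-Ruby-thread design: server threads
# push requests onto a queue, and a single loop (the only place Ruby
# code runs) pops and processes them. In OpO this loop lives on the
# pthread that initialized the embedded Ruby VM.
requests = Queue.new
results  = Queue.new

ruby_loop = Thread.new do
  while (req = requests.pop)          # nil is the shutdown sentinel
    # "business logic" happens here, safely on one thread
    results << { id: req[:id], body: req[:body].upcase }
  end
end

# Other (non-Ruby) threads would enqueue work like this:
requests << { id: 1, body: 'hello' }
requests << { id: 2, body: 'world' }
requests << nil                       # stop the loop
ruby_loop.join
```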
What's next is completing the opo-rub app, cleaning up, and then on to more WABuR work. opo-rub should be released sometime next week. In case it isn't clear, opo-rub is OpO with embedded Ruby.
Since the last blog entry some connection close issues were also addressed. I had made the bad assumption that all browsers would notice a closed socket. Not a chance. An explicit 'Connection: close' has to be added to the response header. With keep-alive and no activity for a while, an empty response with a close must be sent for the browser to realize the connection is closed.
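The fix amounts to one extra header line; a minimal sketch of building such a response by hand:

```ruby
# Sketch: building an HTTP response that tells the browser the
# connection is done. Without the explicit header some browsers
# keep waiting on the dead socket.
def close_response(body)
  "HTTP/1.1 200 OK\r\n" \
  "Content-Length: #{body.bytesize}\r\n" \
  "Connection: close\r\n" \
  "\r\n" + body
end
```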
Not much on OpO was completed this last week and a half but the first working version of WABuR is ready for play. To support WABuR the first developer release of OpO is being made. Nice to see the parts coming together and working.
A neat new feature was added to support WABuR. Next up will be some work on WABuR to take advantage of the new feature. OpO can now be configured to handle HTTP request by spawning an app and exchanging data with that app using pipes. Some documentation can be found at handler docs.
The app can be in any language. For WABuR it would be Ruby but with the use of pipes there is no language restriction. The app listens for JSON messages on STDIN and writes JSON responses on STDOUT. This make app testing easy and app writing simple as well.
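A minimal piped app in Ruby might look like the following. The message fields ('rid', 'body') are placeholders; the real format is defined in the handler docs:

```ruby
require 'json'

# Minimal sketch of a piped handler app: one JSON message per line on
# STDIN, one JSON response per line on STDOUT. The 'rid' echo and the
# result shape are illustrative, not the documented OpO format.
def handle(msg)
  { 'rid' => msg['rid'], 'body' => { 'echo' => msg['body'] } }
end

if __FILE__ == $PROGRAM_NAME
  STDOUT.sync = true                   # don't buffer; the server is waiting
  while (line = STDIN.gets)
    msg = JSON.parse(line)
    STDOUT.puts(JSON.generate(handle(msg)))
  end
end
```

Because the logic lives in a plain function, it can be unit tested without any server or pipe in place, which is exactly what makes this handler style pleasant to work with.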
The spawning feature is the first step to integrating code used as the controller in a model/view/controller set up. As OpO development continues there will be a OpO-Rub with an embedded Ruby interpreter. Depending on demand OpO-Py might be implemented as well to support Python.
WHERE and FILTER expressions were completed with the initial set of conditional operations as described in the Documentation.
Sorting and paging is on the to-do list after getting HTTP handlers set up to support WABuR.
Some work was also done on WABuR to define and implement the Data class. The next step is to implement and test a STDIO shell that will work with OpO. That is going to involve defining the API between the two as well as some design changes to simplify the WABuR classes.
HTTP TQL using JSON has been implemented and tested. It is still missing a few features such as SELECT sorting, unique, and paging as well as all operations other than EQ, but end to end queries of INSERT, UPDATE, DELETE, or SELECT are working.
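To make the flow concrete, here is a sketch of posting a TQL query as JSON from Ruby. The JSON shape of the query below is a guess for illustration only, and the path and port are assumptions; the TQL documentation defines the real syntax:

```ruby
require 'json'
require 'net/http'

# Illustrative only: the JSON form of this TQL SELECT is guessed to
# show the end-to-end flow; consult the TQL documentation for the
# actual clause names and structure.
tql = {
  'select' => '$',
  'where'  => ['EQ', 'kind', 'Article']   # only EQ is implemented so far
}

def post_tql(tql, host: 'localhost', port: 6363)
  # POST the query to a running OpO; the /tql path and port are assumptions.
  Net::HTTP.post(URI("http://#{host}:#{port}/tql"),
                 JSON.generate(tql),
                 'Content-Type' => 'application/json')
end

# puts post_tql(tql).body   # requires a running OpO instance
```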
Moving along with the TQL implementation, TQL SELECT has been tested although some of the options have been set aside for later. The plan is to get the other clauses, INSERT, UPDATE, and DELETE, tested and then spend some time on WABuR before finishing the finer points of the TQL implementation.
The TQL implementation is not fully optimized but should be reasonably fast as it uses OjC heavily. Later some of the conversion steps can be bypassed for some performance improvement.
Web Application Builder using Ruby has been started on GitHub. OpO will be one of the high performance options as it progresses. Looking forward to having others join in.
TQL implementation is progressing with JSON to TQL and back implemented and tested. Basic WHERE clause expressions with AND, OR, and EQ have been tested with all types. Much more to go but steady progress.
Tree Query Language or TQL has been defined. The definition is here. TQL has a friendly and a JSON format. The plan is to use that as part of the RoR alternative.
Next up is implementing TQL over HTTP in both formats. Maybe JSON first. Of course the initial implementation will not included all the comparison operators.
This month was spent on two efforts. The first was defining what a fast Ruby on Rails alternative would look like. The second was implementing query functionality. Both are taking shape.
I'm excited about the fast RoR alternative. After dealing with making Rails fast with Oj it seemed like an alternative that is orders of magnitude faster might be welcomed. That effort has been started and OpO will play a part in it while still calling Ruby for business logic. I should have that opened up as open source next month. Hooking up OpO will take a bit longer.
The query functionality is working with AND, OR, and EQ but only from coded structures. The next step is a parser followed by support over HTTP.
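The coded structures can be pictured as nested expressions; a toy evaluator, with the operator names from above and everything else invented for illustration:

```ruby
# Toy evaluator for the kind of coded query structures described
# above: nested arrays with AND, OR, and EQ operators. The array
# encoding is invented for illustration; OpO's internal structures
# surely differ.
def match?(expr, record)
  op, *args = expr
  case op
  when :EQ  then record[args[0]] == args[1]
  when :AND then args.all? { |e| match?(e, record) }
  when :OR  then args.any? { |e| match?(e, record) }
  else raise "unknown operator #{op}"
  end
end

# kind == 'Article' AND (lang == 'en' OR lang == 'ja')
query = [:AND,
         [:EQ, :kind, 'Article'],
         [:OR, [:EQ, :lang, 'en'], [:EQ, :lang, 'ja']]]
```

A parser then only has to produce this nested form from text, and HTTP support only has to carry it across the wire.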
With the direction more toward supporting JSON, a refactor is due to move quads back to triples and support namespaces with separate OpO instances. This also leaves the option open for a distributed solution down the road.
I spent the last month reorganizing Oj to improve json gem and Rails compatibility. Finally back to OpO. During that time I had time to reflect on what direction OpO should go.
Looking at the user community and what might be most interesting, the focus for OpO is to follow the JSON track. While updating Oj I realized how difficult it is to get performance out of Rails and how constraining it is. There is a very large number of Ruby users that need a Model View Controller enabler. Maybe OpO can provide an alternative to Rails.
With the MVC pattern in mind the new plans for OpO are to focus on first a JSON query language, then journalling, followed by an embedded Ruby interpreter with APIs for JSON data and HTTP interactions.
I didn't feel like dealing with rdf-opo this week so I did some refactoring of the code, added optimized UUID support, and started on the time type. I wonder if it is strange to write code to relax. Oh well, it is fun exploring RDF and triple/quad stores by implementing one.
Just in case anyone is interested, a plans page was added to track implementation and see what is on the horizon. Input from those interested will be used to add features and bump up the priority of feature implementation.
The RDF-opo repo has been created. Still not much there but it has been created. It is mostly a shameless copy of the RDF-mongo repo but since I hope to hand it over to Ruby RDF that seems reasonable.
I couldn't let the C side of opo go unchanged so unrestricted strings were implemented. While strings were restricted to 31 characters before the change, they can now be any length as long as they fit into memory. The approach was to use a chain of blocks to store the long strings. No more string length restrictions. It turns out to be a pretty clean approach that slides in easily with the rest of the system.
The IRI implementation needs to be updated with the same approach but first some work on the RDF-opo Ruby repo.
The N-Quads/N-Triples parser is working! It all fell into place with the rest of the parts already having been developed. OpO was ready for some benchmarks and a comparison to the JSON import.
N-Triples is a rather verbose format, mostly due to the need to specify the full IRI for each subject, predicate, and object. An N-Triples file for 210 million triples is 17GB in size while a JSON file with the same data is 3.4GB. Both generate the same number of quads but of course the N-Triples generate RDF conformant quads while the JSON generates quads using literals for subject, predicate, and object. Still stored as quads but not RDF.
It takes a bit longer to process the N-Triples file, which is about 5 times larger than the JSON. Importing the 210M N-Triples took 105 seconds while importing the JSON with the same number of quads took 39 seconds, making the N-Triples roughly 2.7 times slower. Still better than would be expected if the file reading were the bottleneck. Not bad when compared to other RDF stores.
The memory use for both the N-Triples and JSON imports was the same, most likely because most of the memory goes to storing the quads. Strings and IRIs were a fraction of the quads' memory use. Each import generates the same number of blanks and literals. There was some difference in memory use due to IRIs with N-Triples but that was small when compared to the quads' footprint. The negligible difference in size between IRI and string storage is due to the efficient manner in which OpO stores IRIs.
All in all N-Triple and N-Quad imports perform reasonably well, so the next step is writing a Ruby gem for the Ruby RDF gems that allows storing in OpO similar to the way RDF records are stored in MongoDB.
This past week was spent on updating the HTTP REST APIs to support quads. While in that part of the code the APIs were cleaned up and documented with HTML pages. Those pages are built into the OpO daemon.
The built-in pages will include data access pages and forms to change settings. For now just the documentation is being added as features are implemented. That means the pages are fairly sparse but they do provide an idea of what is to come.
The built-in pages are included here as well and a 'Built In Pages' link has been added in the upper right of this page.
After discovering RDF.rb and its parent, Ruby RDF, and sending an email to Gregg Kellogg, it was clear some rework was needed on OpO. OpO had to support not just named trees but a quad store instead of a triple store if it was to support RDF and SPARQL. Most of that realization was due to a bad assumption on my part. I had assumed a "graph" was the dictionary definition of a directed graph. Unfortunately in the RDF world a graph is a collection of triples that may or may not have any relation to each other outside being in the same graph. So an RDF graph is really just a collection of triples, not a traditional graph.
With the changes to OpO the store is now a quad store but it still supports directed graphs, which are now referred to as trees. JSON is still parsed into triples, or now quads, so OpO is still a fast JSON store as well as an RDF store.
Thanks to Gregg's comments the current direction of OpO development is to provide a Ruby gem similar to the rdf-mongo gem that will allow OpO to be used as a quad store while relying on the Ruby-RDF collection to deal with the non-store features. Backend OpO features will be added incrementally after that.
After generating the N-Triples for the same 7 million JSONs the difference in file size is huge. The JSON was generated from a CSV file which is 658M. The JSON file with minimal spacing is 3.4G. Quite an expansion, but expected, as every entry now includes the header information that was only entered once in the CSV. The N-Triples file is even larger at 20G. In all three cases the number of triples is around 210 million. This is the large data set used for testing OpO. Larger sets will be tested later. The expected maximum for the desktop machine I'm using is about 1 billion triples.
Jumping into RDF and the most basic representation of N-Triples meant identifying the mapping between a generic triple store and RDF, which is considerably more restrictive. Mapping was pretty straightforward with the exception of the strictness of RDF matching. "1"^^<http://www.w3.org/2001/XMLSchema#integer> is different than "01"^^<http://www.w3.org/2001/XMLSchema#integer> which is not the same as "1"^^<https://www.w3.org/TR/xmlschema11-2/#integer>. The OpO core stores data items according to their fundamental type so all three examples resolve to the same value, 1. The design assumes the user intent is that they should be treated as the same when using SPARQL or other queries to extract data. Fortunately there is a way to revert to the stricter mode by providing an option to not convert strings to native types.
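The idea of resolving typed literals to their fundamental type can be sketched in a few lines; this is an illustration of the concept, not OpO's parser:

```ruby
# Sketch of normalizing typed RDF literals to native values so that
# "1", "01", and the same lexical form under a different integer
# datatype IRI all compare equal. Not OpO's actual parser.
INTEGER_TYPES = [
  'http://www.w3.org/2001/XMLSchema#integer',
  'https://www.w3.org/TR/xmlschema11-2/#integer'
].freeze

def native_value(lexical, datatype)
  return Integer(lexical, 10) if INTEGER_TYPES.include?(datatype)
  lexical   # fall through: keep other literals as strings
end

a = native_value('1',  'http://www.w3.org/2001/XMLSchema#integer')
b = native_value('01', 'http://www.w3.org/2001/XMLSchema#integer')
c = native_value('1',  'https://www.w3.org/TR/xmlschema11-2/#integer')
```

With the strict option the conversion step is simply skipped and the lexical form plus datatype IRI are compared as-is.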
Next was coming up with a better format for dumping and importing. JSON works well as long as all triples are part of one or more graphs and there are no named graphs, but to support named graphs and triples not part of a graph a new format was needed. For lack of a better name the dump format is referred to as the OpO Graph or 'og' format. A comparison of N-Triple, JSON, and the og format for representing data are shown here.
The og format is intentionally different from N-Triple and JSON but it does share some similarities. The format had to be different enough that no one would confuse one with the other. There are several options possible for including the internal identifier for triples and for blank nodes so that the REST API can be used to map to the output.
The minimal form has just the details needed to be able to import and recreate the data store. It is indented only for clarity in this case.
Verbose mode includes the identifiers for blank nodes and in the comment the identifier of the triple associated with the line.
JSON does not allow for named graphs or separate triples. Note this is not JSON-LD but just plain old JSON.
N-Triple format is fairly verbose and like plain old JSON it does not support named graphs directly although blank nodes can have a name providing similar functionality.
As a JSON store the project is usable with some limitations. Create, Read, Update, and Delete are supported. Search has not yet been implemented and journalling is also in the queue. There is enough there to be able to exercise the API though.
With the introductions of multiple query evaluation threads the performance is a bit better with 150,000 GETs per second and a latency of 80 microseconds. There is a gain with using additional threads but then there is overhead of making sure modifications are atomic and search results are consistent for a specific snapshot.
Next up is support for RDF followed closely by TURTLE. An efficient data structure for URIs or IRIs is needed as RDF is all about using URIs for everything. Okay, not everything, there are literals and blanks but everything else is a URI/IRI.
Another challenge is how to map JSON to RDF nicely. Maybe providing a default namespace is enough to represent JSON as valid RDF triples. If you want to share your opinion or help set the direction for OpO development send me an email at firstname.lastname@example.org.
At first glance JSON-LD didn't seem very useful as a way to describe the triples in a store. When compared to TURTLE, TURTLE seemed like the better choice. JSON-LD is gaining a lot of traction though, in large part due to Google promoting it, so JSON-LD will be on the list of supported imports for the OpO triple store.
Any old JSON is what I've been using for imports and graph representation so far. That works well with a few simple rules on how to convert JSON to and from triple graphs. Even JSON arrays with order preserved are working with the current code.
The REST API is coming along nicely. POST, PUT, and GET are all working for JSON graphs. Named graphs were added to support the PUT operation. Next up is DELETE. SPARQL, TURTLE, and JSON-LD will come later.
The approach taken was to build a very generic triple store as the base. That decision is proving to be the correct choice as analysis indicates the triple store will be able to support a variety of representations from JSON to RDF, TURTLE, and JSON-LD.
Storing JSON as triples maps the vertices in a JSON document to blank nodes and the names of JSON Object members to predicates. Leaf values are objects of the correct type. Arrays are not unlike objects except that order must be preserved and the members have null predicates.
OpO maintains additional information to be able to recreate the original JSON in the same order. That information is not visible as part of the RDF style representation.
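That mapping can be sketched in Ruby; the blank node labels and triple arrays here are illustrative, not OpO's internal representation:

```ruby
require 'json'

# Sketch of the JSON-to-triples mapping described above: each JSON
# object becomes a blank node, member names become predicates, leaf
# values become objects of their native type, and array members keep
# their order under a nil predicate. Blank node labels (_:b0, ...)
# are invented for illustration.
def json_to_triples(node, triples, counter = [0])
  subject = "_:b#{counter[0]}"
  counter[0] += 1
  case node
  when Hash
    node.each do |name, value|
      if value.is_a?(Hash) || value.is_a?(Array)
        triples << [subject, name, json_to_triples(value, triples, counter)]
      else
        triples << [subject, name, value]   # leaf keeps its native type
      end
    end
  when Array
    node.each do |value|                    # nil predicate, order preserved
      if value.is_a?(Hash) || value.is_a?(Array)
        triples << [subject, nil, json_to_triples(value, triples, counter)]
      else
        triples << [subject, nil, value]
      end
    end
  end
  subject
end

triples = []
json_to_triples(JSON.parse('{"name":"OpO","tags":["fast","graph"]}'), triples)
```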
I worked on getting the core engine wrapped with an HTTP REST API. By reusing the wush (WebSocket and HTTP server) library from Piper Push Cache it wasn't too difficult. That also gives it the same scalability as Piper Push Cache.
The REST API supports JSON and triples. Only GETs of single entities and lists with paging are supported so far but that was enough to run some benchmarks. Results are rounded to 2 significant digits. The HTTP requests were simple GETs of a 30 element JSON document given the identifier of the document.
I'm pretty pleased with the results so far given the effort is still in the proof of concept phase.
|Connections|Latency (msecs)|Throughput (GETs/sec)|
The first set of benchmarks look great! The current state is a minimal set of functionality to test the engine. Importing over 7 million JSON records, more than 200 million triples, and persisting to disk took only 27 seconds. The records are fully indexed so fast queries should be possible. Restarting took 6 seconds and dumping to a JSON file took 27 seconds to write a 3.2GB file.
There is no restriction on the JSON file structure. Any JSON is fine. The JSON does not need to be JSON-LD or one of the other RDF/JSON alternatives. Just simple JSON. I didn't see the need to over engineer as that can always be added later if necessary.