Why RDF Is Struggling - the Case of R2RML
In 2012 I started my .NET implementation of R2RML and RDB to RDF Direct Mapping, which I called r2rml4net. It never reached the maturity it should have, but now, 8 years later, I have little choice but to polish it and use it for converting my database to triples, a task I had originally intended but never really completed.
Why is it significant? Because all those years later the environment around R2RML as a standard is almost as broken, incomplete and sad as it was when I started. Let’s explore that as an example of what is wrong with RDF in general.
Update July 31st, 2020
It has been brought to my attention that Morph is in fact actively maintained. I’ve updated its details and evaluation.
Intro. What is R2RML?
R2RML and Direct Mapping are two complementary W3C recommendations (specifications) which define, respectively, a language and an algorithm used to transform relational databases into RDF graphs. The first is a full-blown, but not overly complicated, RDF vocabulary which lets designers hand-craft the way in which relational tables are converted into RDF. Individual columns are either directly converted into values (taking their respective database types into consideration) or used within simple templates to produce compound values: IRIs, blank nodes and literals alike.
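To give a flavour of the vocabulary, here is a minimal hand-written mapping sketch. The table and column names (EMPLOYEE, EMPNO, ENAME) and the example.com namespaces are hypothetical, but the rr: terms come straight from the R2RML specification:

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/ns#> .

# Hypothetical mapping of an EMPLOYEE table
<#EmployeeMapping>
    rr:logicalTable [ rr:tableName "EMPLOYEE" ] ;
    # A template combines column values into a compound subject IRI
    rr:subjectMap [
        rr:template "http://example.com/employee/{EMPNO}" ;
        rr:class ex:Employee
    ] ;
    # A column mapped directly to a literal value
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "ENAME" ]
    ] .
```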
Direct Mapping is a simpler approach, often using R2RML internally as the mapping model, which creates an automatic mapping from any given relational database into triples. The specification defines the way in which tables, rows and values are meant to map into triples. It can either be executed standalone, with the resulting RDF refined afterwards, or an R2RML document can be produced so that it can be fine-tuned before the actual transformation happens.
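As a sketch of what the algorithm produces (reusing the hypothetical EMPLOYEE table with primary key EMPNO), a single row would roughly come out as:

```turtle
@base <http://example.com/base/> .

# Subject IRI derived from the table name and primary key;
# predicates derived from the table and column names
<EMPLOYEE/EMPNO=7369> a <EMPLOYEE> ;
    <EMPLOYEE#EMPNO> 7369 ;
    <EMPLOYEE#ENAME> "SMITH" .
```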
Complementary to these two specs, there are two sets of test cases which can be exercised by implementors claiming compatibility and advertised at a central RDB2RDF implementation report page hosted by W3C.
Related to R2RML, there is also a newer specification, RML.io, which extends it to also support other sources like XML and CSV.
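A minimal RML sketch, assuming a hypothetical people.json file, shows how the rml: terms generalize R2RML’s logical table to other kinds of logical sources:

```turtle
@prefix rr:  <http://www.w3.org/ns/r2rml#> .
@prefix rml: <http://semweb.mmlab.be/ns/rml#> .
@prefix ql:  <http://semweb.mmlab.be/ns/ql#> .

# Hypothetical mapping iterating over records of a JSON file
<#PersonMapping>
    rml:logicalSource [
        rml:source "people.json" ;
        rml:referenceFormulation ql:JSONPath ;
        rml:iterator "$.people[*]"
    ] ;
    rr:subjectMap [ rr:template "http://example.com/person/{id}" ] .
```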
Why is it important?
I had an interesting Twitter exchange recently where I tried to present arguments for why applying RDF selectively, without really using it in every layer of the application architecture, is problematic.
You need to look at the big picture, entire stack of a single or multiple applications— Tomasz Pluskiewicz (@tpluscode) July 16, 2020
Polyglot persistence becomes a burden if you convert JSONs and relational data into RDF all the time
If RDF is not your programming model then you're in for pain
And no, JSON-LD is snake oil
In that case JSON-LD got the bashing, but the bottom line here is that when building an application using RDF technologies it is worth using it in all software components, from the user interface all the way to the database. This is the only way to prevent constant tension between graph and non-graph models, such as the mentioned issue where JSON-LD hides the graphy nature of data. It is a problem similar to the one which has haunted software where the relational data model is mapped onto complex object models. On that topic I recommend the classic blog post by Jeff Atwood titled Object-Relational Mapping is the Vietnam of Computer Science.
R2RML should be an important tool in the toolkit of any Semantic Web development team as it aims to provide an effective way for migrating existing datasets stored in SQL silos into RDF. This can be done by performing a one-time conversion as mentioned above but an alternative approach some take is running the mapping on-demand, for example by translating SPARQL queries into SQL without ever persisting the converted triples.
You would think that surely, over the years, we would have grown a vibrant ecosystem around this cornerstone piece of technology. Well, think again…
My humble requirements
For my use case I have simple requirements. I need to perform a fairly simple mapping of a handful of tables into quads. That is, I want to partition the dataset into named graphs, mostly in a graph-per-entity fashion. Pretty standard as R2RML goes.
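In R2RML terms, graph-per-entity partitioning boils down to attaching a graph map to the subject map. A sketch, again with hypothetical table and column names:

```turtle
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix ex: <http://example.com/ns#> .

<#EmployeeMapping>
    rr:logicalTable [ rr:tableName "EMPLOYEE" ] ;
    rr:subjectMap [
        rr:template "http://example.com/employee/{EMPNO}" ;
        # graph-per-entity: each employee's triples land in their own named graph
        rr:graphMap [ rr:template "http://example.com/graph/employee/{EMPNO}" ]
    ] ;
    rr:predicateObjectMap [
        rr:predicate ex:name ;
        rr:objectMap [ rr:column "ENAME" ]
    ] .
```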
My database is Azure SQL so MS SQL has to be supported.
I also expect ease of use. Preferably a standalone CLI, easily installed and usable on CI.
R2RML implementations in the wild
The first logical place to look for R2RML software should be the Implementation Report. It lists 8 implementations, 4 of which implement both R2RML and Direct Mapping:
- RDF-RDB2RDF (both)
- XSPARQL (both)
- ultrawrap (both)
- db2triples (both)
- D2RQ (Direct Mapping)
- SWObjects dm-materialize (Direct Mapping)
- OpenLink Virtuoso (R2RML)
- morph (R2RML)
The listing is clearly not actively maintained (last updated in August 2012) so one would also try searching for the latest and greatest. Here’s what I found:
- CARML (RML)
- RML.io RMLMapper (RML)
- SDM-RDFizer (RML)
- RocketRML (RML)
Let’s take a closer look to check if they present a viable option. I’m only interested in R2RML so that eliminates D2RQ and SWObjects dm-materialize but let’s check them out either way.
Of the RML implementations, CARML and RocketRML do not support SQL data sources and SDM-RDFizer does not support SQL Server. That leaves RMLMapper.
Finally, there are a bunch of commercial products which incorporate R2RML and other kinds of mappings and migrations from other data sources to semantic graphs. Names like Stardog or Anzo which are aimed at big corporate settings. They often don’t have free versions, require adopting their entire integrated environment and cost big bucks.
**Installation:** Perl package manager 👎
The project page is rather developer-centric. An INSTALL file linked in an Other files section says
Installing RDF-RDB2RDF should be straightforward. If you have cpanm, you only need one line:
Looks simple, but I have no idea about Perl and cpanm. There is also a README file, but the usage instructions are rather uninformative. I think this is only a library. Even if this gets the job done, there is no way I’m learning Perl for this 🙄
While the address linked from the implementation report is now dead, a quick Google search reveals its new home on GitHub.
**Developed by:** Company (?)
The R2RML feature is not well advertised but found in the wiki under Working with RDBMS SQL
Configuration is provided using a .properties file. Awkward, but doable. Unfortunately the project does not show an example of how to set it up.
The linked company Capsenta redirects to https://data.world and appears to be a commercial product. There is also a Community tier of what seems to be a SaaS offering.
Not sure about this one.
**Installation:** Build with Maven 👎
This one looks promising. Sadly, it appears that the sources have to be built manually. No thank you. On the other hand, the format parameter can be one of 'RDFXML', 'N3', 'NTRIPLES' or 'TURTLE', so I guess no named graphs? 😢
**Installation:** Download from d2rq.org 🙄
Anyway, it only does Direct Mapping and is unmaintained, but if it works, it works…
❌ It’s dead Jim
**Installation:** Dedicated installers + a plugin 😕
Virtuoso is a well-known name in the RDF space. It is a commercial product and a triple store. Support for R2RML comes as an add-on and the overall setup looks super complicated and not at all standalone 👎. Sorry.
While its entry in the original 2012 implementation report is much outdated, it turns out that Morph has seen a lot of activity since and is being developed by a commercial company. Java-style setup using a JAR download and the awkward .properties file, but definitely something to try out.
Ontop is mainly a Virtual Graph endpoint, like D2RQ, but comes with a CLI command materialize which takes an R2RML mapping graph and serializes the resulting triples to a file.
Unfortunately, at the time of writing named graphs are not supported. The project is very actively maintained and that might change very soon.
Another super active but also quite complex tool. An installation page shows how to install a GUI tool. The README gives examples of commands running Maven within a clone of the original repository. Maybe I’m missing something, but it does not look like it meets my “ease of use” requirement.
To do it justice, this definitely looks super useful as an “information integration tool that enables users to quickly and easily integrate data from a variety of data sources”, as advertised in the repo. Not what I’m looking for though.
**Installation:** scripts in repository 😕
**Developed by:** Individual (?)
r2rml-kit is an offshoot of D2RQ, based on its abandoned develop branch. It is currently in a pre-alpha stage.
Not only is it pre-alpha, it is also not really maintained. Too bad…
Another Java project which fails to even provide a pre-built JAR. This one has at least seen some development in recent times and claims to support quad output formats. Maybe worth a go.
The last RML implementation looks promising too. Actively maintained, supports SQL Server, outputs quads, uses modern tooling. A definite candidate for success.
For such a crucial piece of software it’s quite disappointing to see what state the ecosystem is in and how little it has changed since 2012, when I first had a look at R2RML.
The old implementations died off or became commercial products. C’est la vie.
And why is there no tool which installs from a modern package manager? Speaking of the ecosystems I know best, that would be npm (npm i -g hypothetical-r2rml) or the latest .NET (dotnet tool install -g hypothetical-r2rml). Once installed it should simply create a global executable to run the transformation.
And why are so many poorly documented? Again, I can mostly speak of the JS and .NET ecosystems, and there are plenty of examples of beautiful, detailed documentation pages and guides. How is it possible that most of those above fail on that front?
Maybe I’m being unfair about that last point. Much software is poorly documented and I have been guilty of that myself in the past but for the RDF community at large it should be critical to provide working, well documented software in order for semantic technologies to achieve any wider recognition.
Finally, I would have said in the past that universities are part of the problem, and the Semantic Web has long been viewed as academic and impractical. It pleases me to see that, of the above, the more recent uni-managed packages actually stand out as being more modern and better maintained overall. 👍
And I have not even looked at test coverage; I do not dare.
In the end, it’s still a little disappointing how limited the choice seems for someone looking for an unimposing but functional R2RML solution. In the two lists above I gathered 16 potential candidates out of which only a handful remain:
- XSPARQL (config is going to be a trial & error thing)
- db2triples (only if the docs are inaccurate and named graphs are supported)
- Morph (definitely something to try out)
- Ontop (no named graphs but deserves a closer look)
- RMLMapper (a definite candidate for success)
I initially intended to give more details about each of the promising implementations in this post, but I decided that I should instead actually try running and comparing them to see if they can deliver. In a subsequent post I will take my mappings and try processing them with the 5 tools I selected.