A Road To Axiom

MidPoint is a fully schema-aware system. MidPoint eats and breathes schema from the very bottom to the very top. Therefore we need a language to express the schema. MidPoint was built on XML Schema Definition (XSD) and we have lived in that uneasy relationship for years. But now is the right time to make a big step forward.

The concept of schema completely permeates midPoint. You cannot really do anything with midPoint without dealing with schema, directly or indirectly. Connectors represent attribute names and types using schema. That schema is used by midPoint mappings to correctly convert data types. The schema is used by the user interface to automatically create correct input fields for data. Schema is used to customize and extend the midPoint data model. Schema is everywhere. This is one of the fundamental principles of midPoint. It lowers deployment effort, it makes customization easier and it provides some guarantees about the correctness of the configuration.

The midPoint project started in 2011, but some parts of midPoint design go back even further. XML Schema Definition (XSD) was an obvious choice for a schema definition language at that time. We were not happy with XSD and the XML ecosystem even at that early time, but there was nothing better we could use. MidPoint has evolved during all these years. XML is no longer the only data language we support; we also support JSON and YAML. But XSD has remained our schema language to this day. We considered using JSON Schema instead, but it does not provide any significant advantage over XSD. In fact, we considered several schema languages at several points in the midPoint development process. But the result was always the same: there is no schema language that really suits our needs. Switching from XSD to any other existing language would mean doing a lot of work just to get to the same place where we already are.

The problem with XML Schema is that it describes XML data structures. The problem with JSON Schema is that it describes JSON data structures. These languages are designed to describe data represented in a very specific format. We need something else. We need a way to describe data structures that can be used in a wide variety of ways: data in a JSON file, data in relational database tables, data provided by a RESTful interface, data displayed in a user interface and so on. This may seem easy, but the devil is in the details. For example, XML has namespaces, JSON does not (unless it is JSON-LD, which kind of has namespaces). XML has attributes, JSON does not. JSON and XML assume ordering in multivalue data, but such an assumption is a problem when data are stored in a relational database or LDAP. XML has XPath, which is overkill, and JSONPath is pretty much the same. It is all one big mess. One can survive in this world by making a lot of compromises and violating a couple of standards. That is what we have done with XSD and it kind of worked. We have been (ab)using XSD for the purpose of data modeling for many years. But we got to know all the problems quite intimately. Nobody can say that we have not tried hard enough. What is even worse, JSON Schema, YANG and SCIM schema are built on the same principles as XSD and therefore they are not going to solve the fundamental issues either.

What we need to do is to go one level of abstraction up. We do not want to model XML or JSON data. We want to model data, regardless of their actual representation or storage mechanism. That was quite clear as early as 2012, when we designed Prism as an abstraction layer in midPoint code. Prism was used to model the data, not just their XML representation. That decision allowed us to implement JSON and YAML support in midPoint in quite an elegant way. Prism has evolved during all these years, but it was always limited in its capabilities. And XSD played a significant part in these limitations. We knew for years that we had to do something about it. But solving this problem properly is not an easy task. And we always managed to push XSD a bit further, to make it play one more dirty trick. This worked for more than six years.
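To make the idea of modeling data independently of their representation a bit more concrete, here is a minimal sketch in Python. It is not Prism code and it does not use any midPoint API. The names `ItemDefinition`, `USER_SCHEMA`, `parse_values`, `to_json` and `to_xml` are hypothetical and exist only for this illustration. The point is that a single schema definition drives both type conversion and serialization into two different formats.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemDefinition:
    """Hypothetical, format-neutral definition of one data item."""
    name: str
    type: type
    multivalue: bool = False


# Hypothetical schema for illustration only; not a real midPoint schema.
USER_SCHEMA = [
    ItemDefinition("name", str),
    ItemDefinition("age", int),
    ItemDefinition("email", str, multivalue=True),
]


def parse_values(schema, raw):
    """Convert raw string values to typed values as the schema dictates."""
    result = {}
    for item in schema:
        values = [item.type(v) for v in raw.get(item.name, [])]
        if not item.multivalue and len(values) > 1:
            raise ValueError(f"{item.name} is single-valued")
        result[item.name] = values
    return result


def to_json(schema, data):
    """Serialize to JSON: single values as scalars, multivalues as arrays."""
    out = {}
    for item in schema:
        values = data.get(item.name, [])
        if values:
            out[item.name] = values if item.multivalue else values[0]
    return json.dumps(out, sort_keys=True)


def to_xml(schema, data, root_name="user"):
    """Serialize to XML: every value becomes one child element."""
    root = ET.Element(root_name)
    for item in schema:
        for value in data.get(item.name, []):
            ET.SubElement(root, item.name).text = str(value)
    return ET.tostring(root, encoding="unicode")


data = parse_values(
    USER_SCHEMA,
    {"name": ["jack"], "age": ["42"], "email": ["a@x.org", "b@x.org"]},
)
print(to_json(USER_SCHEMA, data))
print(to_xml(USER_SCHEMA, data))
```

Neither serializer knows anything about users. Both only follow the item definitions, so adding a third representation means writing one more function against the same schema.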

Enter midPrivacy. We have been working on data protection features for quite some time. But it was 2019 when we got our chance to take it to the next level. NGI has an NGI_TRUST project that looked like a perfect opportunity for us. We were more than aware that data protection is as much about meta-data as it is about data. You can make proper use of the data only if you know how reliable the data are, where they come from and whether you are entitled to use them at all. Meta-data capability is a basic building block for pretty much any data protection platform. It provides visibility and accountability. Obviously, we needed that in midPoint as well. Therefore we put together a proposal for the NGI_TRUST open call. And we were very lucky to get the funding.

However, everything gets quite complex when it comes to meta-data. We need to keep such meta-data for every value of every data item. And the meta-data are going to be slightly different for every midPoint deployment. This adds an entirely new dimension of data modeling, a new dimension of complexity. This is very hard to do with conventional data modeling languages. We might try to make XSD play one more dirty trick, and after all these years of XSD hacking we might actually succeed. But we have decided that this is the point where we finally say good-bye to XSD and do it properly. We started by double-checking that we were not missing any obvious solution. But there was no solution that could satisfy our needs.

That is how Axiom was born. Axiom is a new data modeling language we are working on right now. It is still a baby, still wildly evolving. But it is starting to take shape. The first ambition of Axiom is to replace XSD in midPoint. But that would not be enough to justify the existence of a new language. We need Axiom to do more than that. Our goal is to use Axiom to define a meta-data schema. We want to maintain complex meta-data structures for every data value. The data will be modeled by an Axiom schema, and the meta-data will be modeled by an independent Axiom schema. These schemas will be orthogonal, independently developed, independently maintained, independently extended and customized for every deployment. We want to join the schemas inside midPoint at run-time. This is a way to create a two-dimensional schema from two simple schemas without ending up with code of insane complexity. This is the right way to implement data provenance capabilities.
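The idea of joining two orthogonal schemas at run-time can be sketched roughly as follows. This is plain Python, not Axiom syntax and not midPoint code. The schemas, the item names (`fullName`, `source`, `assurance`, `legalBasis`) and the helpers `Value`, `validate` and `make_value` are all assumptions made up for this illustration.

```python
from dataclasses import dataclass, field

# Two independent, hypothetical schemas: item name -> expected Python type.
DATA_SCHEMA = {"fullName": str, "email": str}
METADATA_SCHEMA = {"source": str, "assurance": int}


@dataclass
class Value:
    """One data value together with the meta-data attached to it."""
    real_value: object
    metadata: dict = field(default_factory=dict)


def validate(definitions, items, what):
    """Check that every item is defined in the schema and has the right type."""
    for name, value in items.items():
        if name not in definitions:
            raise KeyError(f"unknown {what} item: {name}")
        if not isinstance(value, definitions[name]):
            raise TypeError(f"{what} item {name} must be {definitions[name].__name__}")


def make_value(data_schema, metadata_schema, item_name, real_value, **metadata):
    """Join both schemas at run-time: data and meta-data are validated separately."""
    validate(data_schema, {item_name: real_value}, "data")
    validate(metadata_schema, metadata, "meta-data")
    return Value(real_value, dict(metadata))


email = make_value(
    DATA_SCHEMA, METADATA_SCHEMA,
    "email", "jack@example.com",
    source="HR", assurance=3,
)

# A deployment extends only the meta-data schema; the data schema is untouched.
EXTENDED_METADATA_SCHEMA = {**METADATA_SCHEMA, "legalBasis": str}
name = make_value(
    DATA_SCHEMA, EXTENDED_METADATA_SCHEMA,
    "fullName", "Jack Sparrow",
    source="HR", legalBasis="contract",
)
```

The data schema never mentions meta-data and the meta-data schema never mentions data items, yet every value can carry validated meta-data. That is the two-dimensional effect described above, in a very simplified form.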

We are now working on a prototype implementation of the processing code for Axiom and adjusting the Axiom language specification at the same time. We believe that something like Axiom cannot be designed on a drawing board or in a standards committee. This needs experimentation, prototyping and evolution. We are proceeding in iterations, using the midPoint code as a test bed. Therefore we expect that Axiom will keep evolving for quite some time before it is completely ready. But we believe that this is a step in the right direction. This is more than likely to bring a lot of long-term benefits.

Finally, we are more than grateful for this opportunity and we would like to thank everyone in NGI for our chance to make another step towards a robust and professional data protection platform that can be used by everybody. We appreciate that the European Union is not just imposing data protection regulations, but that it is also contributing to open source technologies that can be used to implement practical data protection mechanisms. We are more than happy for this opportunity to push the technology one small step forward.

This project has received funding from the European Union’s Horizon 2020 research and innovation programme under the NGI_TRUST grant agreement no 825618.
