Plans for Data Provenance

Today is Data Protection Day, which is a very symbolic day for midPoint. We take data protection and privacy very seriously. We believe that privacy in cyberspace is necessary for a free society to flourish. Despite that belief, we acknowledge that implementing privacy and data protection may not be easy. But we are not afraid of challenges. We are fully committed to implementing privacy and data protection features in midPoint.

MidPoint was still quite young when we realized that data protection and identity management are in a very intimate relationship. Identity management and governance systems are in a perfect position to control the flow of identity data. And the essence of data protection is controlling the flow and especially the use of data. In fact, we believe that any practical data protection solution must be supported by identity management infrastructure. Many people see data protection as a liability. But we believe that data protection can be turned into a substantial advantage when it is implemented properly.

This belief led us to several experiments with data protection functionality. We started several years ago. We presented some of the results at FOSDEM’18. We implemented several experimental features for data protection, such as consent management and even more general management of lawful bases for data processing. Unfortunately, there was almost no interest in those features in the industry and we were not able to secure sufficient funding to finish all of them. Some smaller pieces are implemented, but there is still a long way to go to get a complete set of data protection functionality.

However, we are not giving up. Now we plan to implement a very important feature that has many facets and many practical uses: Data Provenance. There is one big problem that is common to data protection and identity management. It is the problem of data origin, or provenance. The problem can be described by something that every identity engineer knows only too well: in a sufficiently large system nobody has any idea where the data came from and how it ended up here. There are so many source systems, mappings, data transformations and information flows that the resulting system resembles the proverbial labyrinth.

The provenance problem causes a lot of troubleshooting nightmares. It slows down IDM deployments and complicates maintenance. But it is a complete disaster for data protection. Accountability is one of the basic pillars of data protection. And how good is your accountability if you have no idea where your data came from?

We have had the provenance problem in our sights for a really long time. In fact, one of the earliest data structures we use to manage identity data contains a notion of origin. But we realized quite early that this is much more difficult than it seems. The ideas were brewing in our minds for quite a long time. But now we hope it is finally time to do this, and to do it properly. Therefore, we plan to implement data provenance features in the next couple of midPoint versions. This is still not completely certain. There are still some variables, including the most important enabler: funding. But our hopes are high. Because some things are certain. Such as the importance of data protection. For all of us.

4 thoughts on “Plans for Data Provenance”

  • Great article on an important topic.
    MidPoint definitely needs this to stay ahead and to keep pace with the changing cyber world.
    And that’s why I believe this is not something that can afford to wait for funding. It has to be done regardless of funding. And if there is not enough funding, the resources have to be found another way, e.g. from other sponsored features as a contribution to midPoint core. This is what I believe has to be done. And there are more areas to which this point of view applies.

    • There are many great things that just “have to be done”. But how can we do them without the funding? We have to pay salaries, we have to pay taxes, we have to eat and we need a place to live. Those things are not free. There needs to be funding for everything that we do.

  • Hi Rado, as I understand it, you cover the world in which data is only added. What are your plans for the other half of the problem – when data is deleted? How do you track the ‘origin’ of the decision to remove something? Do you plan to propagate empty slots with metadata through the whole system?

  • We are not that far yet. You are raising a valid concern. But that is one of the later steps. The first milestone for data protection is to know the provenance of the data that we actually have. Even that can be a great improvement over having no provenance information at all.
