Plans for Data Provenance - Evolveum | Open Source Identity Management & Governance

Today is Data Protection Day, which is a very symbolic day for midPoint. We take data protection and privacy very seriously. We believe that privacy in cyberspace is necessary for a free society to flourish. Despite that belief, we acknowledge that implementing privacy and data protection may not be easy. But we are not afraid of challenges. We are fully committed to implementing privacy and data protection features in midPoint.

MidPoint was still quite young when we realized that data protection and identity management are in a very intimate relationship. Identity management and governance systems are in a perfect position to control the flow of identity data. And the essence of data protection is controlling the flow and especially the use of data. In fact, we believe that any practical data protection solution must be supported by identity management infrastructure. Many people see data protection as a liability. But we believe that data protection can be turned into a substantial advantage when it is implemented properly.

This belief led us to several experiments with data protection functionality. We started several years ago and presented some of the results at FOSDEM’18. We implemented several experimental features for data protection, such as consent management and the more general management of lawful bases for data processing. Unfortunately, there was almost no interest in those features in the industry and we were not able to secure sufficient funding to finish all of them. Some smaller pieces are implemented, but there is still a long way to go to get a complete set of data protection functionality.

However, we are not giving up. Now we plan to implement a very important feature that has many facets and many practical uses: Data Provenance. There is one big problem that is common to data protection and identity management. It is the problem of data origin, or provenance. The problem can be described by something that every identity engineer knows only too well: in a sufficiently large system nobody has any idea where the data came from and how it ended up here. There are so many source systems, mappings, data transformations and information flows that the resulting system resembles the proverbial Labyrinth.
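To make the problem concrete, here is a minimal sketch of a value that carries its own origin trail through a chain of mappings. It is an illustration only and not midPoint's design: the `ProvenancedValue` class, the `hr-system` source name and the mapping names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class ProvenancedValue:
    """A value plus a record of where it came from (hypothetical structure)."""
    value: str
    source: str                  # name of the originating system (assumed)
    trail: Tuple[str, ...] = ()  # mappings applied so far, oldest first

    def apply(self, mapping_name: str, fn: Callable[[str], str]) -> "ProvenancedValue":
        """Transform the value and append the mapping name to the trail."""
        return ProvenancedValue(fn(self.value), self.source, self.trail + (mapping_name,))

    def explain(self) -> str:
        """Describe the origin and every mapping the value passed through."""
        steps = " -> ".join((self.source,) + self.trail)
        return f"{self.value!r} came from {steps}"


# A raw value from a hypothetical HR system passes through two mappings.
raw = ProvenancedValue("  Jack SPARROW ", source="hr-system")
clean = raw.apply("trim", str.strip).apply("title-case", str.title)
print(clean.explain())
```

With the trail kept next to the value, the question "where did this come from?" can be answered from the data itself rather than by reverse-engineering the mappings.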

The provenance problem causes a lot of troubleshooting nightmares. It slows down IDM deployments and complicates maintenance. But it is a complete disaster for data protection. Accountability is one of the basic pillars of data protection. And how good is your accountability if you have no idea where your data came from?

We have had the provenance problem in our sights for a really long time. In fact, one of the earliest data structures we use to manage identity data contains a notion of origin. But we realized quite early that this is much more difficult than it seems. The ideas have been brewing in our minds for quite a long time. Now we hope it is finally time to do this, and to do it properly. Therefore, we plan to implement data provenance features in the next couple of midPoint versions. This is still not completely certain. There are still some variables, including the most important enabler: funding. But our hopes are high, because some things are certain, such as the importance of data protection. For all of us.

4 thoughts on “Plans for Data Provenance”

Great article on an important topic.
MidPoint definitely needs this to stay ahead and to keep pace with the changing cyber world.
And that’s why I believe this is not something that can afford to wait for funding. It has to be done regardless of funding. And if there is not enough funding, the resources have to be found a different way, e.g. from other sponsored features as a contribution to midPoint core. This is what I believe has to be done. And there are more areas to which this point of view applies.

Radovan Semancik says:
6th February 2020 at 11:27 AM
There are many great things that just “have to be done”. But how can we do them without the funding? We have to pay salaries, we have to pay taxes, we have to eat and we need a place to live. Those things are not free. There needs to be funding for everything that we do.

Hi Rado, as I understand it, you cover the world in which data is only added. What are your plans for the other half of the problem: when data is deleted? How do you track the ‘origin’ of the decision to remove something? Do you plan to propagate empty slots with metadata through the whole system?

We are not that far yet. You are raising a valid concern, but that is one of the later steps. The first milestone for data protection is to know the provenance of the data that we actually have. Even that can be a great improvement over having no provenance information at all.

The Apache NiFi project, https://nifi.apache.org, has good support for data provenance management. I can imagine that if NiFi were ‘embedded’ in midPoint, it could take on tasks beyond data provenance.
