Transitive closure and matrix multiplication in identity management - Evolveum | Open Source Identity Management & Governance

MidPoint development of is full of interesting software problems – be it management of long-running tasks, integration of third-party workflow engine, devising a flexible authorization mechanism, creating a GUI that adapts to the customizable data model, or many others. However, the following one in particular reminded me of my happy student years at the faculty of mathematics and physics: computing the transitive closure of the organizational structure graph.

What’s it about?

MidPoint cares about the organizational structure, or, better said – structures. A company can have a number of divisions, each of which could be split into departments. In public governance scenario, a country can be divided into regions, regions into counties, and in each county there can be cities and villages. Or, a university can have faculties; faculties can have departments, and within departments there can be any smaller organizational units, as dictated by local habits. Any person can then belong to one or more such units. Moreover, there can be structures laying over the above-mentioned ones. For example, there can be a set of projects, forming again a tree (or more trees). The structure of study programs at the university can also form such an overlaying structure.

We want to be able to point at any organization (it’s a midPoint term for anything that is part of the organizational structure) O in such a set of hierarchies and ask questions like “does user X belong to O or any of its suborganizations?” or “give me a list of users of age under 35, belonging to O or any of its suborganizations.”

It is easy to see that what we have here is a directed acyclic graph, also known as DAG. Let’s call it G. G consists of two sets: V and E. V is the set of vertices of this graph; these are organizations and persons. E is the set of edges. An edge e from vertex v₁ to vertex v₂ is in E if organization or user v₁ “belongs to” organization v₂(we would say that v₂ is a parent of v₁). What we need is the transitive closure of this graph, i.e. a graph G^* = (V, E^*), which has the same set of vertices as V and contains an edge e from vertex v₁to vertex v₂ if and only if v₂ is an ancestor (i.e. parent or grand-parent or grand-grand-…-parent) of v₁.

If we would have G^* available, then it would be very easy to answer questions posed above:

Answering the question “does user X belong to O or any of its suborganizations?” would become a simple query to see if there is an edge from X to O in G^*.
Answering the question “give me a list of users of age under 35, belonging to O or any of its suborganizations” would consist of getting all elements U such that there is an edge from U to O in G^*, U is a user, and U is of age under 35.

But… how to compute the closure?

There are many nice algorithms for computing the transitive closure of a graph, for example the Floyd-Warshall algorithm. However, this algorithm (and many other ones) expects that the graph is fully stored in main memory. And, what is worse, the time needed for the computation is just too large for large graphs.

The more practical approach is to store a transitive closure alongside the original graph. When changing the graph, we would make a corresponding change in the closure. Our repository is implemented as a SQL database, so both original graph and its closure would be represented as database tables.

At first, we implemented an algorithm proposed by Dong et al [1]. Its main idea can be explained like this: when adding an edge v₁ → v₂ into G, add to G^* all edges x → y such that x → v₁ and v₂ → y are already in G^*. It’s obvious: if there is a path from x to v₁ and a path from v₂ to y, certainly there will exist a path from x to y, because v₁ is now connected to v₂. And the other way around: any “new” path from x to y would comprise one “old” path from x to v₁, then “new” edge v₁→ v₂ and then some “old” path from v₂ to y. This can be implemented as an SQL join, followed by some commands aimed to insert those rows to G^* that aren’t already there.

The removal of edge from G is a bit more complex: the algorithm computes a table SUSPECTS that contains all edges x → y such that there was a path from x to y going through edge being deleted (v₁→ v₂). Then it computes a TRUSTY table containing all edges that are for certain untouched by the removal of the edge v₁→ v₂. And finally, as authors have proven, new transitive closure contains all paths that are created by concatenation of up to three subpaths from the TRUSTY table.

Unfortunately, this “removal” side of the algorithm takes just too long time to execute. Much longer than is acceptable in midPoint.

Is that the best we can do?

Certainly not. After slight googling I’ve found a very interesting article, referring to a chapter in the SQL patterns book by Vadim Tropashko [2]. Here comes the idea:

Each graph can be represented by an adjacency matrix A = (a_ij) where a_ij = 1 or 0, depending on whether there is an edge v_i→ v_j or not (i, j range from 1 to N, where N is the number of vertices). If we compute the following:

T = A + A² + A³ + A⁴ + …

we will get a matrix T = (t_ij) containing information about the number of paths from any vertex v_i to any other vertex v_j in G! It is not so hard to see that:

A = (a_ij) contains the number of paths of length 1 (i.e. edges) from v_i to v_j.
A² = A*A = (a⁽²⁾_ij) contains the number of paths of length 2 from v_i to v_j. It is so because a’_ij = a_i1*a_1j + a_i2*a_2j + a_i3*a_3j + … + a_iN*a_Nj. So we take the number of paths of length 1 from v_i to v₁ (it is 0 or 1 in this case, but that’s not important), multiply it by the number of paths of length 1 from v₁ to v_j (again 0 or 1) and this is the number of paths of length 2 from v_i to v_j going through v₁. We do the same for v₂, …, v_N and then by adding these numbers together we get the number of all paths of length 2 from v_i to v_j i.e. a⁽²⁾_ij.
In a similar way, A^k = (a^(k)_ij) contains the number of paths of length k from v_i to v_j for any k.

It is clear that T is very close to the transitive closure, isn’t it? If we replace all non-zero numbers in it by 1, we will get the adjacency matrix of the transitive closure graph.

This is interesting, but not directly helpful.

As Tropashko shows using simple algebraic operations, changing adjacency matrix A of graph G by adding an edge e, represented by matrix S, i. e.

A → A + S

changes the transitive closure matrix T to a new value of T + T*S*T, i. e.

T → T + T*S*T

and this is something that can be computed using SQL without much problems!

What is even more delighting is that the reverse operation, i.e. edge removal, is of about the same complexity:

A → A – S;
T → T – T*S*T.

SQL implementation of this computation is really simple. T*S*T can be computed using one join. T + T*S*T is then one upsert (update+insert), and T – T*S*T is done as update+delete. Details are more than understandably described in Tropashko’s book.

We improved the Tropashko’s algorithm a little bit by allowing adding/removal of more edges at once. The only condition is that they are “independent” in such a way that no path would go through two or more edges added/removed in one step. Also, we added special treatment for some situations, namely adding a node with one parent and no children and removing a node without children.

The implementation

The implementation was quite straightforward. Each of 5 supported databases (H2, MySQL, PostgreSQL, Oracle, Microsoft SQL Server) has its own specifics concerning how to deal with temporary tables, how to write upsert/merge command, how exactly to write update and delete commands to achieve the best performance, and how to deal with concurrent access to the closure table.

The final step was realization that by moving users out of the organizational graph we could make closure table updates much more efficient (by reducing its size substantially), while making queries slightly slower (by introducing a join between the closure and user-org relation table).

Production-ready code can be seen in OrgClosureManager class.

Preliminary performance evaluation

We have done a preliminary performance evaluation of our implementation on MySQL and PostgreSQL databases. It was done by creating a sequence of graphs of the following sizes:

Level 1	Level 2	Level 3	Level 4	Level 5	Orgs	Closure size
1	2	3	4	5	153	1K
2	2	3	4	5	306	3K
4	2	3	4	5	612	9K
4	4	3	4	5	1220	21K
4	8	3	4	5	2436	49K
8	8	3	4	5	4872	116K
8	16	3	4	5	9736	252K
4	8	6	8	10	17124	379K

“Level 1″ column indicates how many root nodes are there. “Level 2..5″ colums say how many children at a particular level (2..5) were created for each parent node residing at the upper level. “Orgs” is the total number of vertices in the graph, and “Closure size” gives an approximate number of records in the closure table. It is approximate, because the graph was generated by a randomized algorithm, and while the number of parents for each org was given (for levels 1 to 5 these are 0, 1, 2, 3, and 3, respectively), by creating the graph in a depth-first way, some of the child-parents links were not created, as there were not enough parents existing yet.

Here are the results. Because operation duration strongly depends on the number of affected rows in the closure table, we have divided operations into two categories: (1) operations in the “upper part” of the graph (levels 1, 2, 3 in our case), (2) operations in the “lower part” of the graph (levels 4 and 5). These two categories are distinguished in the graphs below (click to enlarge):

Note that the average time required to add/delete an edge in the lower parts of the graph (where majority of operations can be expected to occur) does not exceed 50 milliseconds in all cases.

Calc sheet with the results is here.

Tests were executed by running (appropriately configured) OrgClosurePerformanceTest2 class. The configuration of database servers had to be tuned a bit. Basically, the memory available to the servers had to be increased. The numbers related to MySQL and PostgreSQL are absolutely not meant as a comparison of these databases – for example, the engines are not tuned in the same way. They are only shown here as an indication that the algorithm works on more than one specific database engine. By tuning the engines appropriately (e.g. by allowing them to use more memory or tweaking other parameters) we could perhaps get to even better results. However, we consider the results presented here to be are good enough for our purposes.

Conclusion

We have shown here a basic idea of two existing transitive closure maintenance algorithms and some notes on our implementation of one of them, along with a preliminary performance evaluation. You can freely inspire yourself by looking at the source code (albeit some of the code is really midPoint-specific). Any suggestions or improvements are more than welcome!

Life of a software developer often brings surprising and much pleasuring moments. Meeting years-forgotten pieces of graph theory and even linear algebra during development of an identity management tool is definitely one of them.

References

[1] Guozhu Dong, Leonid Libkin, Jianwen Su and Limsoon Wong: Maintaining the transitive closure of graphs in SQL. In Int. J. Information Technology, vol. 5, 1999.
[2] Vadim Tropashko: SQL Design Patterns: Expert Guide to SQL Programming. Rampant Techpress, 2007. ISBN 978-0977671540.

1 thought on “Transitive closure and matrix multiplication in identity management”

Igor Farinič says:
25th October 2014 at 8:05 PM
Volunteers, students interested in academic research in identity management could find more information at:
https://wiki.evolveum.com/display/midPoint/Academia

Name	Provider	Purpose	Expiry	More information
PHPSESSID	Evolveum s.r.o.	Cookie generated by applications based on the PHP language. This is a general purpose identifier used to maintain user session variables. It is normally a random generated number, how it is used can be specific to the site, but a good example is maintaining a logged-in status for a user between pages.	When you close your browser
moove_gdpr_popup	Evolveum s.r.o.	When this Cookie is enabled, these Cookies are used to save your Cookie Setting Preferences. You can enable or disable your Cookie Settings on our website at anytime via Cookie Settings.	1 year	Read more
wordpress_test_cookie	Evolveum s.r.o.	Tests that the browser accepts cookies.	Session	Read more
wp-settings-[user_id]	Evolveum s.r.o.	Used to preserve user’s wp-admin settings	1 year	Read more
wordpress_[hash]	Evolveum s.r.o.	On login, WordPress uses the wordpress_[hash] cookie to store your authentication details. Its use is limited to the Administration Screen area, /wp-admin/	Session	Read more
wordpress_logged_in_[hash]	Evolveum s.r.o.	Remember User session	Session	Read more
wordpress_sec_[hash]	Evolveum s.r.o.	This cookie is used to store your authentication details. Its use is limited to the admin console area, /wp-admin/	Session	Read more
wp-postpass_[hash]	Evolveum s.r.o.	This cookie is used to grant access to password protected areas of the site.	10 days
wp-settings-time-[user_id]	Evolveum s.r.o.	The number on the end is your individual user ID from the user’s database table. This is used to customize your view of admin interface, and possibly also the main site interface.	1 year	Read more

Name	Provider	Purpose	Expiry	More information
woocommerce_cart_hash	Evolveum s.r.o.	Helps WooCommerce determine when cart contents/data changes.	When you close your browser	Read more
woocommerce_items_in_cart	Evolveum s.r.o.	Helps WooCommerce determine when cart contents/data changes.	When you close your browser	Read more
wp_woocommerce_session_[hash]	Evolveum s.r.o.	Contains a unique code for each customer so that it knows where to find the cart data in the database for each customer.	2 days	Read more
tk_ai	Evolveum s.r.o.	Stores a randomly-generated anonymous ID. This is only used within the dashboard (/wp-admin) area and is used for usage tracking, if enabled.	Session	Read more
comment_author_{HASH} comment_author_email_{HASH} comment_author_url_{HASH}	Evolveum s.r.o.	When visitors comment on your blog, they too get cookies stored on their computer. This is purely a convenience, so that the visitor won’t need to re-type all their information again when they want to leave another comment.	These are set to expire a little under one year from the time they’re set.	Read more

What’s it about?

But… how to compute the closure?

Is that the best we can do?

The implementation

Preliminary performance evaluation

Conclusion

References

1 thought on “Transitive closure and matrix multiplication in identity management”

Leave a Reply Cancel reply

Cookie settings

Strictly necessary

Third party

Technical