robsite

Nested Sets suck: On how (not) to map trees into a relational database

A guest article by Ludwig Gramberg

A few years back I read some articles on how to map tree-nodes into a relational database. Now having implemented such a tree-mapper and using it in a fairly large and complex application I want to share my experience with you.

The simple structure you might imagine is node(id, name, parent) where parent can be null, thus identifying a node as a trees root.

So you might have something like this:

table node:
id name parent
--------------
1  a    NULL
2  b    1
3  c    1
4  d    2```

or, if you want to support multiple trees in one table, you do this: (you can also put multiple trees in the structure above but this makes it simpler to query)

table node:
id name parent root
-------------------
1  a    NULL   1
2  b    1      1
3  c    1      1
4  d    2      1
5  x    NULL   5
6  y    5      5
7  z    6      5```

This structure will suffice until you want to perform one of the following queries and can not afford to use recursive algorithms, which will need many queries:

  • select/update (sub)trees (a node and all of its children and descendants)
  • select all ancestors of one node (the path to the root)

More generally speaking you want to know that the number of queries does not depend on the number of nodes you have in your table.

One approach which you can find everywhere is called 'nested sets', you can read all about them here or here and I suggest you do if you don't know them yet, otherwise don't even bother to continue reading.

Basically, the idea of Nested Sets is to give each node a left and a right border value. I won't go into the details, but I can tell you that it enables you to do the queries mentioned above pretty efficiently. The only problem is that some other queries are fucking insanely expensive.

Nested sets are completely useless once you begin to make changes often in your tree-structure. So, if you have a read-only tree, or the tree is much more often read than updated, you could use it. But actually, if you ever plan to change your tree, don't use Nested Sets. They suck. Period.

Why Nested Sets suck

  1. It's not easy to implement them

    This would be tolerable, if the benefits would outweigh the costs. But they don't. And the harder something is to implement, the harder it is for you as a programmer to understand it and guarantee its security.

  2. Changes are insanely expensive

    Because of the way nested sets work, everything is tied together very closely and only if this relation is not broken do they work. You can ignore or change some of the rules, but that will change what features of nested sets you can use.

    For example: If you delete a node in the middle of the tree – that being the node with left = max(left)/2 – you will have to update the left and right value of each node, which is on the right hand side of that node. So the worst case scenario is that you delete a node which is far left, so in return you have to update almost each node.

    The same goes for inserting nodes. You have to make room first, which means pushing all nodes on the right of the new node a bit to the right by, again, updating all of them. You might think this is not a problem because its just one query. Say that again if you have a few thousand, hundred thousand or millions of rows in your tree table. The situation gets worst if you want the left value to be your natural sorting order, because then sorting your nodes becomes fucking expensive as well.

  3. Nested Sets kill parallel processing

    You will need transactions to make sure all your left and right values are in order. But since you're also almost always changing at least half the table, you block at least half the table for the duration of the operation. The situation becomes clusterfuckingly delicious should you forget to make sure you block the rows properly using SELECT FOR UPDATE when selecting values from nodes you're going to use to calculate left/right value shifts. So there you have it, Nested Sets sucks when it comes to working in parallel on the tree.

So these are the main reasons Nested Sets suck. You're welcome to prove me wrong ;-)

So whats better then?

Besides your first simple node table, have a second table called 'ancestor': ancestor(node, ancestor). Put a primary key comprised of the two fields on it and you're done. You can manage to fill this table using Triggers or by application programming. If you like, you can also use foreign keys, but you don't have to.

table node:
id name parent
--------------
1  a    NULL
2  b    1
3  c    1
4  d    2```

table ancestor:
node ancestor
--------------
2    1
3    1
4    1
4    2```

Read the table like so: node has the ancestor ancestor.

So let's see how this performs:

  • Select/update (sub)trees (a node and all of its children and descendants)

    No problem there, first select from ancestor and then join onto it like so: SELECT node.* FROM ancestor INNER JOIN node ON ancestor.node=node.id WHERE ancestor.ancestor=YOURNODEIDHERE

  • Select all ancestors of one node (the path to the root)

    Just as easy as the first one, just the other way around: SELECT node.* FROM ancestor INNER JOIN node ON ancestor.ancestor=node.id WHERE ancestor.node=YOURNODEIDHERE

  • Move nodes within the tree

    Ok, so this is not as easy, but almost. When moving a node, the ancestor relation between the node that is being moved and all of its descendants stays untouched. What changes though is the relation between all nodes in the (sub)tree thats being moved and the ancestors of the (sub)trees root. So what you do is this:

    1. Change the parent of the node you want to move
    2. Now delete all entries of the former ancestors of the moved node and all nodes in the subtree beneath the node you are moving
    3. Now insert a new pair of node-ancestor in your ancestor tree, for each node in the subtree you are moving

    For example, lets say d and e are under c. If you move c like so /a/b/c to /x/y/z/c, then you have to delete all pairs of c,d and e with a and b and then insert new pairs for c,d,e with x,y,z.

That was the hardest part. Everything else is easier than this.

The pros

  1. This is super transaction and parallel processing friendly. If you insert a node in one part of the tree and another one somewhere else, there is no conflict of interest between the two nodes. Just complete their ancestor relation and you are done. You will also have to make sure that at the end of every transaction the relations are in sync, but you don't have to block half the tree to do so. You are just inserting and deleting into ancestor and node.

  2. It's easy to check if you did everything right. Just look at the two, simple tables. Good luck trying that with Nested Sets. Using standard database features like foreign keys you can easily secure almost all restrictions of this structure.

  3. It's fast. The most expensive operation is moving a node. Inserting is super cheap.

  4. It's easy to extend.

  5. It's easy to implement.

  6. The ancestor relation is explicit while the left/right relation is implicit. Make a slight mistake in your left and right values and you're fucked.

The slight downside

The costs are of course the second table. And here the data will grow by the factor of the average depth of your tree multiplied with the number of nodes you have. So the worst case scenario is just a list of nodes. Then the costs are: (n^2 + n)/2 where n is the number of nodes you have. But the thing is, you can calculate these costs easily in advance, look at your tree and determine the average depth.

Glad I got that off my chest

Fuck Nested Sets... seriously.

Don't Insert

· performance, sql, webdev ·
Mastodon