Thursday 2 June 2011

Binary Trees in SQL: The Adjacency List


When learning to program, a key skill is to know you data structures. For every task there is the ideal data structure that makes the task easy and straightforward to implement. But then we are thrown into the real world. In this real world we are told to finish a piece of code with the restriction that the data has to be stored in a SQL database. Most of the time we think of databases essentially being arrays in which the data is stored in a tabular way. In the back of our head we know that a more sophisticated data structure would be more fit for the task, but we don't have the time or can't be bothered to think about these luxuries in more detail. One prime example is the tree data structure. Wouldn't it be great if there was a simple, straightforward way to implement trees in SQL?

I am going to show how this can be done. As it turns out there is more than one solution to the problem. In this post I will be writing about the first and maybe the most intuitive approach, the adjacency list. In the end this approach will turn out to be OK but not ideal for all applications. In two following posts I will then present more stable approaches to the problem, the nested set and the binary heap.

Adjancency List


The adjacency list is probably the implementation of a binary tree that is the easiest to understand and the one that most people would come up with if asked to implement a tree structure. Every entry in the tree connects two elements by a parent child relationship. We want to keep things neat and tidy, so we make sure that we separate data from structure. What this means is that we have a table with all the data, somewhat like this

CREATE TABLE people (id INTEGER PRIMARY KEY NOT NULL, name VARCHAR(30));

Now we define the table that will hold the structure of the tree. Every entry in this tree will link a child node to a parent node and therefore has two references to our data table

CREATE TABLE bintree (parent INTEGER, child INTEGER NOT NULL, rel CHAR(1));

You will notice an extra column 'rel' in this table which can hold the two values 'L' or 'R'. Because we are storing a binary tree, each node is either the left or the right node of it's parent. By specifying the relation to it's parent we can perform proper binary searches in the table. Keeping the two tables separate allows us to create the tree structure for the data relatively independently of the data itself. We could also sort the same data set in two or more trees for different sorting criteria.

Now let's start populating our tables with data. When we insert a piece of data into our data table, we must also insert the relationship to the other data sets into the tree table. The next two insert statements do the job.

INSERT INTO people VALUES (1,'John'),(2,'Ben'),(3,'Sue');

INSERT INTO bintree VALUES (-1,1,''),(1,2,'L'),(1,3,'R');

But note how we have to take care that we retain the tree structure. The parent must exist and it can have a maximum of one left and one right child node. Only the root node does not have a parent which is denoted by a -1 in the parent column. This constraint can be tested using check statements, but I will not go into the details of that in this post. Deleting a leaf node from the tree is also done using two simple drop statements, for example to remove Ben from the tree we execute the following

DELETE FROM people WHERE id=2;

DELETE FROM bintree WHERE child=2;

We have to ensure manually that the node deleted in this way doesn't have any children of its own. If it does, the children will end up orphaned and unreachable. Checking this can again be done using check statements. The real problem arises when you try to delete a node and all it's children. The only way to achieve this is by going through all the elements which are reachable from the node and delete them individually. This can be quite time consuming if you have a larger database.

Another common task is to find out if one node A is the ancestor of another node C. Again, a number of SQL statements have to be issued. Starting from the child node C, we find it's parent

SELECT parent FROM bintree WHERE child=[C]

If this returns the ancestor A then we know that the two nodes are related by an ancestor-descendant relationship. If the statement returns nothing, then we have reached the root of the tree telling us that the two nodes are not related. If we get any other node then we issue the same command again, but with the new node as the child node. We continue this way until we have either found the ancestor or reached the root node. The PHP code for this algorithm is as follows

// a function that returns true if $parent is an ancestor of child and false otherwise
function isAncestor($parent,$child) {
  $node = $child;
  while ($node != -1) {
    $query="SELECT parent FROM bintree where child=".$node;
    $result=mysql_query($query); // should return exactly one result
    $node=mysql_result($result,0,"parent");
    if ($node==$parent) return true;
  }
  return false;
}

As you can see, the adjacency list layout for storing trees is not the ideal method. Some simple operations take multiple SQL queries and the consistency of the tree is not automatically ensured. Apart from orphaned nodes we have to avoid cyclic dependencies, where one node ends up to be it's own ancestor. In another post I will write about alternative methods that will alleviate some of these problems.

No comments:

Post a Comment