Skip to main content

Efficient SQL for Hierarchical Data

·7 mins

This article explores efficient methods for retrieving hierarchical data in PostgreSQL using PL/pgSQL and Recursive CTEs.

Introduction #

While working on CardHero, one of the entities in my data model was a folder, which stored notes. Here’s the corresponding type in Go:

type folder struct {
	id        int
	name      string
	parent_id *int
}

The only notable thing here is that a folder references its parent since folders are part of a hierarchy. There’s a root folder at the top, whose parent is nil, and each folder can have folders within itself.

PostgreSQL is my go-to database and here’s the straightforward schema I use for the folders table:

CREATE TABLE folders (
    id int,
    name text not null,
    parent_id int,
    primary key (id),
    foreign key (parent_id) references folders (id)
);

The parent_id is nullable, so that we can represent the root folder easily. We also establish parent_id as a foreign key, but on the same table. This table is thus, self-referential.

Problem Statement #

While working on CardHero, a need arose to display breadcrumbs i.e. the folder path that led to the current folder that the user is interacting with. We had access to the id of the current folder, and had to trace the path up to the root.

A straightforward approach to address this is through a PL/pgSQL function, which can be either iterative or recursive.

To compare the performance of these options, I set up a test hierarchy 100 levels deep:

folder_0
└─ folder_1
   └─ folder_2
      └─ folder_3
         └─ folder_4
              ...
              └─ folder_99

Iterative PL/pgSQL #

Following is a simple iterative PL/pgSQL function that lets us grab a heirarchy, given the id of the current folder.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
CREATE OR REPLACE FUNCTION get_hierarchy(folder_id int)
RETURNS SETOF folders AS $$
DECLARE
    f folders;
BEGIN
    -- retrieve the current folder
    SELECT INTO f * FROM folders WHERE id = folder_id;
    RETURN NEXT f;

    -- loop until we reach the root
    WHILE f.parent_id IS NOT NULL LOOP
        SELECT INTO f *
        FROM folders
        WHERE id = f.parent_id;

        RETURN NEXT f;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

First, the function declares a result set to accumulate into:

3
4
DECLARE
    f folders;

Next, we add the current folder to the result set:

5
6
7
8
BEGIN
    -- retrieve the current folder
    SELECT INTO f * FROM folders WHERE id = folder_id;
    RETURN NEXT f;

Next, the function loops until we reach the root. The loop condition checks f.parent_id, which references the row that was last added to the result set f. If it’s not null, the loop fetches the next parent and adds it to the result set.

10
11
12
13
14
15
16
17
18
    -- loop until we reach the root
    WHILE f.parent_id IS NOT NULL LOOP
        SELECT INTO f *
        FROM folders
        WHERE id = f.parent_id;

        RETURN NEXT f;
    END LOOP;
END;

As a benchmark, I queried for the entire hierarchy starting at the last folder, with id = 99.

SELECT name FROM get_hierarchy(99);

The iterative approach took ~20ms. Since the exact latency figure may vary depending on the environment, we will compare it with other options to gauge its relative performance.

Recursive PL/pgSQL #

We can take a recursive approach to the PL/pgSQL function as well:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
CREATE OR REPLACE FUNCTION get_hierarchy_recursive(folder_id int)
RETURNS SETOF folders AS $$
DECLARE
    f folders;
BEGIN
    -- retrieve the current folder
    SELECT INTO f * FROM folders WHERE id = folder_id;
    RETURN NEXT f;

    -- if the current folder has a parent folder,
    -- recursively fetch its parent
    IF f.parent_id IS NOT NULL THEN
        RETURN QUERY
        SELECT * FROM get_hierarchy_recursive(f.parent_id);
    END IF;
END;
$$ LANGUAGE plpgsql;

This approach replaces the loop with a recursive call. The recursion will terminate on reaching the root, when f.parent_id = NULL. Recursion can be a bit slower, with average latency observed at ~30ms.

Common Table Expressions #

The PL/pgSQL approaches work just fine and perform well, but I was curious to see if this problem could be solved with vanilla SQL. A little bit of research led me to Recursive Common Table Expressions (CTEs).

However, before jumping into recursive CTEs, let’s first learn about regular CTEs. Common table expressions are queries that result in the creation of virtual tables, which we can then query from. These virtual tables are temporary and only exist for the lifetime of the query. Here’s an example:

WITH department_salary AS (
    SELECT department, AVG(salary) AS average_salary
    FROM employees
    WHERE department = 'Sales'
    GROUP BY department
)
SELECT department, average_salary
FROM department_salary;

This results in a virtual table named department_salary being created, which contains the names of departments and their average salary. We can then query department_salary like a regular table.

Can you achieve the same result without CTEs? Yes. In a way, CTEs are syntax sugar, but some databases also have special optimizations for them, such as using materialized views to represent the virtual table, leading to better performance.

Recursive CTEs #

Recursive CTEs are more than just syntax sugar, though. They are designed to run recursive queries. Following is the recursive CTE that I wrote for CardHero. Using the id of the current folder, it retrieves its hierarchy up to the root:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
WITH RECURSIVE folder_hierarchy AS (
    -- retrieves current folder
    SELECT f.name, f.id, f.parent_id
    FROM folders f
    WHERE id = 99

    UNION ALL

    -- recursive case / retrieves ancestors of current folder
    SELECT f.name, f.id, f.parent_id
    FROM folders f
        INNER JOIN folder_hierarchy fh ON fh.parent_id = f.id
)
SELECT name FROM folder_hierarchy;

Let’s break down this query, starting with the first line. You’ll notice we have a special keyword RECURSIVE which indicates that this is a recursive CTE:

1
WITH RECURSIVE folder_hierarchy AS (

We start by retrieving the current folder, the one we have the id for:

2
3
4
5
    -- retrieves current folder
    SELECT f.name, f.id, f.parent_id
    FROM folders f
    WHERE id = 99

The recursive case forms the backbone of the recursive CTE, facilitating the traversal of the folder hierarchy until it reaches the root. During each recursive call, the next parent in the hierarchy is appended to the recursive CTE, which essentially extends the virtual table folder_hierarchy.

The inner join acts as both the recursive case, as well as base case of this recursion. It joins folders with folder_hierarchy. The ON clause matches the following:

  • fh.parent_id which is the parent of the folder added to the CTE in the previous iteration
  • f.id which is its parent folder from the folders table
 9
10
11
12
    -- recursive case / retrieves ancestors of current folder
    SELECT f.name, f.id, f.parent_id
    FROM folders f
        INNER JOIN folder_hierarchy fh ON fh.parent_id = f.id

When the inner join reaches the root, fh.parent_id would be NULL, leading to the end of the recursion. Finally, the results from these cases are merged with a simple UNION ALL. Thus, folder_hierarchy contains all 100 folders in the hierarchy starting from folder_99.

We can now query this recursive CTE at the very end to retrieve whatever we need.

13
14
)
SELECT name FROM folder_hierarchy;

Similar to the recursive PL/pgSQL function, our recursive CTE also takes ~30ms to execute.

Conclusion #

Despite its slightly inferior performance, I still chose the recursive CTE for the job, owing to its readability. It serves the purpose fast enough, at least for now, and is very elegant to read.

The iterative PL/pgSQL function is undoubtedly the fastest. It should likely scale better with deeper/wider hierarchies, more complex retrievals, and a higher system load. It will be the way to go, if and whenever the recursive CTE becomes a bottleneck.