SELECT in WITH RECURSIVE
We saw the use of WITH in Common Table Expressions in this post: Common Table Expressions (CTE) In PostgreSQL
A CTE behaves like a temporary, named result set that exists only for the duration of a single query.
The optional RECURSIVE modifier changes WITH from a mere syntactic convenience into a feature that accomplishes things not otherwise possible in standard SQL.
Using RECURSIVE, a WITH query can refer to its own output. A very simple example is this query to sum the integers from 1 through 100:
WITH RECURSIVE t(n) AS (
    VALUES (1)
  UNION ALL
    SELECT n+1 FROM t WHERE n < 100
)
SELECT sum(n) FROM t;
-- Output: 5050
In the example above, the working table has just a single row in each step, and it takes on the values from 1 through 100 in successive steps. In the 100th step, there is no output because of the WHERE clause, and so the query terminates.
From the query above, we can derive the general syntax:
WITH RECURSIVE CTE_name AS (
    CTE_query    -- non-recursive term
  UNION [ALL]
    CTE_query    -- recursive term
)
SELECT * FROM CTE_name;
The general form of a recursive WITH query is always a non-recursive term, then UNION (or UNION ALL), then a recursive term, where only the recursive term can contain a reference to the query's own output.
Such a query is executed as follows:
Recursive Query Evaluation
1. Evaluate the non-recursive term. For UNION (but not UNION ALL), discard duplicate rows. Include all remaining rows in the result of the recursive query, and also place them in a temporary working table.
2. So long as the working table is not empty, repeat these steps:
   a. Evaluate the recursive term, substituting the current contents of the working table for the recursive self-reference. For UNION (but not UNION ALL), discard duplicate rows and rows that duplicate any previous result row. Include all remaining rows in the result of the recursive query, and also place them in a temporary intermediate table.
   b. Replace the contents of the working table with the contents of the intermediate table, then empty the intermediate table.
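The evaluation steps above can be modelled in a few lines of Python. This is only a sketch of the working-table loop, not PostgreSQL's actual implementation; the function name evaluate_recursive_cte and its parameters are made up for illustration.

```python
def evaluate_recursive_cte(non_recursive_rows, recursive_term, union_all=True):
    """Model the working-table loop of a recursive WITH query.

    recursive_term is a function that receives the current working table
    (a list of row tuples) and returns the rows the recursive term produces.
    """
    seen = set()

    def keep(rows):
        # UNION ALL keeps every row; UNION drops rows that duplicate
        # each other or any row already in the result.
        if union_all:
            return list(rows)
        kept = []
        for row in rows:
            if row not in seen:
                seen.add(row)
                kept.append(row)
        return kept

    # Evaluate the non-recursive term and seed the working table with it.
    working = keep(non_recursive_rows)
    result = list(working)

    # Repeat for as long as the working table is not empty.
    while working:
        intermediate = keep(recursive_term(working))
        result.extend(intermediate)
        working = intermediate  # the intermediate table becomes the working table
    return result


# Model of: VALUES (1) UNION ALL SELECT n+1 FROM t WHERE n < 100
rows = evaluate_recursive_cte(
    [(1,)],
    lambda working: [(n + 1,) for (n,) in working if n < 100],
)
total = sum(n for (n,) in rows)
print(total)
```

With union_all=False the same loop also stops on data that revisits earlier rows, because rows already in the result never re-enter the working table.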
Recursive queries are typically used to deal with hierarchical or tree-structured data.
The next example uses a sample database that is available in my public GitHub repo.
The following query returns all employees who report, directly or indirectly, to the employee with id 5.
WITH RECURSIVE employee_list AS (
    SELECT employee_id, reports_to, first_name, title
    FROM employees
    WHERE reports_to = 5
  UNION
    SELECT e.employee_id, e.reports_to, e.first_name, e.title
    FROM employees e
    INNER JOIN employee_list s ON s.employee_id = e.reports_to
)
SELECT * FROM employee_list;

-- Output
employee_id | reports_to | first_name | title
------------+------------+------------+----------------------
          6 |          5 | Michael    | Sales Representative
          7 |          5 | Robert     | Sales Representative
          9 |          5 | Anne       | Sales Representative
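In the sample data nobody reports to employees 6, 7 or 9, so the recursive term never adds a row. To see it pick up an indirect report, here is a small sketch that runs the same query shape against an in-memory SQLite database from Python. SQLite stands in for PostgreSQL only because it ships with Python. Rows 6, 7 and 9 mirror the output above; employee 10 is a hypothetical row added for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees ("
    "employee_id INTEGER PRIMARY KEY, reports_to INTEGER, first_name TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        (6, 5, "Michael", "Sales Representative"),
        (7, 5, "Robert", "Sales Representative"),
        (9, 5, "Anne", "Sales Representative"),
        # Hypothetical row, not in the sample database: reports to employee 6.
        (10, 6, "Hypothetical Trainee", "Trainee"),
    ],
)

query = """
WITH RECURSIVE employee_list AS (
    SELECT employee_id, reports_to, first_name, title
    FROM employees
    WHERE reports_to = 5
  UNION
    SELECT e.employee_id, e.reports_to, e.first_name, e.title
    FROM employees e
    INNER JOIN employee_list s ON s.employee_id = e.reports_to
)
SELECT employee_id, reports_to FROM employee_list ORDER BY employee_id
"""
reports = conn.execute(query).fetchall()
for employee_id, reports_to in reports:
    print(employee_id, reports_to)
```

Employee 10 shows up even though it reports to 6 rather than 5, because the recursive term joins each newly found employee back to the employees table.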
When working with recursive queries it is important to be sure that the recursive part of the query will eventually return no tuples, or else the query will loop indefinitely.
Sometimes, using UNION instead of UNION ALL can accomplish this by discarding rows that duplicate previous output rows.
However, often a cycle does not involve output rows that are completely duplicate: it may be necessary to check just one or a few fields to see if the same point has been reached before.
The standard method for handling such situations is to compute an array of the already-visited values.
For example, consider the following query that searches a table graph using a link field:
WITH RECURSIVE search_graph(id, link, data, depth) AS (
    SELECT g.id, g.link, g.data, 1
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1
    FROM graph g, search_graph sg
    WHERE g.id = sg.link
)
SELECT * FROM search_graph;
This query will loop if the link relationships contain cycles.
Because we require a "depth" output, just changing UNION ALL to UNION would not eliminate the looping.
Instead we need to recognize whether we have reached the same row again while following a particular path of links.
We add two columns, path and cycle, to the loop-prone query:
WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
    SELECT g.id, g.link, g.data, 1,
           ARRAY[g.id],
           false
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1,
           path || g.id,
           g.id = ANY(path)
    FROM graph g, search_graph sg
    WHERE g.id = sg.link AND NOT cycle
)
SELECT * FROM search_graph;
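To see why the path and cycle columns make the query terminate, the following sketch replays the query in plain Python on a hypothetical three-row graph whose links form a loop (1 -> 2 -> 3 -> 1). The data column is left out for brevity. This models the logic only; it is not how PostgreSQL executes the query.

```python
# Hypothetical (id, link) rows forming a cycle: 1 -> 2 -> 3 -> 1
graph = [(1, 2), (2, 3), (3, 1)]

# Non-recursive term: every row starts its own path at depth 1.
working = [(gid, link, 1, [gid], False) for gid, link in graph]
result = list(working)

while working:
    intermediate = []
    for gid, link in graph:
        for _sg_id, sg_link, depth, path, cycle in working:
            # WHERE g.id = sg.link AND NOT cycle
            if gid == sg_link and not cycle:
                intermediate.append(
                    # path || g.id, and g.id = ANY(path) against the old path
                    (gid, link, depth + 1, path + [gid], gid in path)
                )
    result.extend(intermediate)
    working = intermediate

for row in result:
    print(row)
```

Each path is extended until it revisits an id. That row is still emitted, flagged with cycle set to true, and the NOT cycle condition then keeps it from being extended further, so the working table eventually becomes empty.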
Aside from preventing cycles, the array value is often useful in its own right as representing the "path" taken to reach any particular row.
In the general case where more than one field needs to be checked to recognize a cycle, use an array of rows.
For example, if we needed to compare fields f1 and f2:
WITH RECURSIVE search_graph(id, link, data, depth, path, cycle) AS (
    SELECT g.id, g.link, g.data, 1,
           ARRAY[ROW(g.f1, g.f2)],
           false
    FROM graph g
  UNION ALL
    SELECT g.id, g.link, g.data, sg.depth + 1,
           path || ROW(g.f1, g.f2),
           ROW(g.f1, g.f2) = ANY(path)
    FROM graph g, search_graph sg
    WHERE g.id = sg.link AND NOT cycle
)
SELECT * FROM search_graph;
Tip: Omit the ROW() syntax in the common case where only one field needs to be checked to recognize a cycle. This allows a simple array rather than a composite-type array to be used, gaining efficiency.
Tip: The recursive query evaluation algorithm produces its output in breadth-first search order. You can display the results in depth-first search order by making the outer query ORDER BY a "path" column constructed in this way.
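The effect of ordering by the path column can be illustrated with a small Python model. The tree below is hypothetical (id, parent_id) data, and Python's list comparison stands in for PostgreSQL's array ordering, which also compares element by element.

```python
# Hypothetical (id, parent_id) rows:
#       1
#      / \
#     2   3
#     |   |
#     4   5
tree = [(1, None), (2, 1), (3, 1), (4, 2), (5, 3)]

# Non-recursive term: start at the root with a one-element path.
working = [(nid, [nid]) for nid, parent in tree if parent is None]
result = list(working)

# Recursive term: attach each child to the path of its parent.
while working:
    intermediate = [
        (nid, path + [nid])
        for nid, parent in tree
        for pid, path in working
        if parent == pid
    ]
    result.extend(intermediate)
    working = intermediate

breadth_first = [nid for nid, _path in result]  # order as produced
depth_first = [nid for nid, _path in sorted(result, key=lambda row: row[1])]  # ORDER BY path

print(breadth_first)
print(depth_first)
```

Rows come out level by level as they are produced. Sorting by path places every node directly after its ancestors, which gives a depth-first listing.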
A helpful trick for testing queries when you are not certain if they might loop is to place a LIMIT in the parent query.
For example, this query would loop forever without the LIMIT:
WITH RECURSIVE t(n) AS (
    SELECT 1
  UNION ALL
    SELECT n+1 FROM t
)
SELECT n FROM t LIMIT 100;
This works because PostgreSQL's implementation evaluates only as many rows of a WITH query as are actually fetched by the parent query.
Using this trick in production is not recommended, because other systems might work differently.
Also, it usually won't work if you make the outer query sort the recursive query's results or join them to some other table, because in such cases the outer query will usually try to fetch all of the WITH query's output anyway.
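A rough analogy in Python is a generator that is consumed lazily. The sketch below is not how PostgreSQL is implemented; it only illustrates why fetching a limited number of rows ends an otherwise endless recursion, and why sorting does not.

```python
from itertools import islice


def t():
    """Model of: SELECT 1 UNION ALL SELECT n+1 FROM t (no stop condition)."""
    n = 1
    while True:
        yield n
        n += 1


# Parent query with LIMIT 100: only the first 100 rows are ever produced.
first_100 = list(islice(t(), 100))
print(first_100[0], first_100[-1], len(first_100))

# A sort needs every row before it can return the first one, so
# sorted(t()) would never finish. It is deliberately not executed here.
```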
A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries.
Thus, expensive calculations that are needed in multiple places can be placed within a WITH query to avoid redundant work.
Another possible application is to prevent unwanted multiple evaluations of functions with side-effects.
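However, the other side of this coin is that the optimizer cannot push restrictions from the parent query down into a materialized WITH query the way it can with an ordinary sub-query. Recursive WITH queries are always evaluated this way. Since PostgreSQL 12, a non-recursive, side-effect-free WITH query that the parent query references only once can instead be folded into the parent query and optimized together with it.
A materialized WITH query is evaluated as written, without suppression of rows that the parent query might discard afterwards. (But, as mentioned above, evaluation might stop early if the reference(s) to the query demand only a limited number of rows.)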
The examples above only show WITH being used with SELECT, but it can be attached in the same way to INSERT, UPDATE, or DELETE.
In each case it effectively provides temporary table(s) that can be referred to in the main command.
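As a minimal sketch of WITH attached to INSERT and DELETE, the snippet below runs two such statements against an in-memory SQLite database from Python. SQLite is used only because it ships with Python; the numbers table and the evens query are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (n INTEGER)")

# WITH attached to INSERT: fill the table from a recursive query.
conn.execute("""
    WITH RECURSIVE t(n) AS (
        VALUES (1)
      UNION ALL
        SELECT n + 1 FROM t WHERE n < 10
    )
    INSERT INTO numbers SELECT n FROM t
""")

# WITH attached to DELETE: the WITH query names the rows to remove.
conn.execute("""
    WITH evens(n) AS (
        SELECT n FROM numbers WHERE n % 2 = 0
    )
    DELETE FROM numbers WHERE n IN (SELECT n FROM evens)
""")

remaining = [row[0] for row in conn.execute("SELECT n FROM numbers ORDER BY n")]
print(remaining)
```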