A lot of solutions in t-sql forums are criticized as being procedural rather than set-based, although the solutions look perfectly logical and deliver the correct results. What is the difference between a set-based solution and a procedural solution and why should I care?
I'm putting a bounty on this question for two reasons:
First, bounties are new (to me) and I think it's a cool idea.
Second, while I see some merit in the various answers posted, I would like to see a more thorough treatment of the subject. Differences should be illustrated with code, not analogies, and the performance implications should be spelled out clearly.
I believe this is a huge conceptual issue that newcomers to SQL need to understand and deserve to have explained well.
A procedural approach follows the paradigm of traditional languages used for applications. You perform a function then call another function and so on. When this method is used to retrieve data from a data source, it means requesting data from a set of information, then requesting another set and so on. Notice I used the term "set" within my description. Each retrieval task is requesting data from a separate set of data (even if from the same data source), and each request could return no rows or potentially thousands of rows or more.
The most obvious example of a procedural approach is the use of cursors. Let's take a look at the general syntax to see how it works.
We know SELECT is the basic data retrieval command in the SQL language. Here we see it used in the definition of the cursor. Each time we execute the statement "FETCH NEXT", we are essentially re-executing that SELECT statement to get the next top result row. Here is a very similar approach without defining a cursor:
Let's take a look at a simple example of storing the number of books currently checked out at the library into the local variable @BookCount.
WHILE loop solution:
As you can see, not only is the set-based solution easier to read, it only issues a single statement to the engine as opposed to one statement per row. Most likely, the library doesn't have thousands of books checked out, so the difference here might not be too bad. But the difference can be very, very large if the data source contains millions of rows of data.
It is important to note that set-based design will not automatically result in faster performance, especially if that particular query includes subqueries in the SELECT clause. For example, a query like this:
looks like a set-based approach but really isn't because each of those subqueries will have to run for each row returned by the main query. For more information on ways that some attempted set-based solutions really aren't truly set-based after all, read Jeff Moden's article here: http://www.sqlservercentral.com/articles/T-SQL/61539/ entitled "Hidden RBAR: Triangular Joins".
The set-based solution time was less than 3% of the while-loop time and less than 2% of the cursor time.
The best analogy I have seen comes from Jeff Moden's signature on the main site. And that is that with procedural code you are probably thinking about what you want to do to a row. With set-based code you are usually thinking about what you want to do to a column.
Why should you care?
The efficiencies inherent within a set-based solution can never hope to be matched by a procedural one, as a procedural one can only solve one entity at a time, wheras a set-based one aims to solve all entities in one pass.
answered Oct 25, 2009 at 11:11 AM
Matt Whitfield ♦♦
My preference for explaining this question comes down to the paradigm that database systems employ.
When you really get into it, a database system runs on a computer, and eventually works things out iteratively. When you look at an execution plan, it shows you what it's doing, and it's incredibly set-based. Sure, there are things such as merge-joins rather than loop-joins which feel set-oriented, but it's still something that is programmed in C at the end of the day.
The power of 'set-based' comes into play because of the Query Optimizer. When you write a query rather than a procedure, there is a translation between that and what actually runs. That translation understands ways of simplifying the work involved, so that it can take advantage of indexes, spools, and so on.
When you use an iterative solution, you effectively tie the Query Optimizer's hands behind its back, forcing it into a particular path. The way you design the solution may not be entirely different to the way the plan works (consider writing an SSIS package - it's very easy to make one that works in the same way as an execution plan), but you're removing options that could cause it to be solved many times faster.
Combining this with the inherent slowness of explicit cursors (which require the data to be queried repeatedly, or storing temporary resultsets to satisfy the logic), and you find that 'set-based' solutions generally (but certainly not always) work faster.
answered Oct 25, 2009 at 07:36 PM
One of the motivations of E.F.Codd when he first proposed the Relational Model of data was to devise a system that didn't require the looping, branching and step-at-a-time processing used in imperative programming languages. His idea was a system that would achieve all its results using only expressions based on a set of operators (relational algebra) returning or assigning relation values in a database.
SQL DBMSs unfortunately are not truly relational systems because SQL doesn't properly implement Codd's model. For example SQL uses tables and query expressions that are not true relations - they may contain duplicate rows and column names for example. For that reason SQL has to support lots of the old-style "procedural" language as well as the "set-based" type of expressions of the relational model.
answered Oct 25, 2009 at 11:02 AM
I landed here while trying to find something that described when it is not appropriate to use a set based language to solve a problem. Or in other words what types of algorithms don't lend themselves nicely to set based implementations? I think I have an example:
Write bubble sort using SQL. Check this out and see what you think:
I don't know that I can elegantly and simply articulate the principles involved. If someone can that would be great. That is really what I am looking for; simple language that a semi technical stakeholder might be able to understand. My attempt at it would be something like set based is great unless you need to compare the rows against each other while collecting the result set.
Another thing that should be mentioned is the difficulty of debugging an algorithm when it happens all at once like the set based approach. I may want to see which comparison created an unexpected result and for me this is more difficult with SQL.
The other thing I have run into with this is the maintainability of the code in this situation. In my case I inherited a set of SPs that fall into this category. The psuedo code algorithm that describes the solution is iterative and very straight forward. However the actual SQL code that implements it is very difficult to discern. The programmer did not want to use cursors which helped with performance but it is hard to read.
On a similar note I wonder if the CS discussion on this would involve recurrence relations in relationship to the time complexity of set based algorithms.
Looking forward to hearing this discussion if it gets bumped.