A lot of solutions in T-SQL forums are criticized as being procedural rather than set-based, even though the solutions look perfectly logical and deliver the correct results. What is the difference between a set-based solution and a procedural solution, and why should I care?
I'm putting a bounty on this question for two reasons:
First, bounties are new (to me) and I think it's a cool idea.
Second, while I see some merit in the various answers posted, I would like to see a more thorough treatment of the subject. Differences should be illustrated with code, not analogies, and the performance implications should be spelled out clearly.
I believe this is a huge conceptual issue that newcomers to SQL need to understand, and they deserve to have it explained well.
A procedural approach follows the paradigm of traditional languages used for applications. You perform a function then call another function and so on. When this method is used to retrieve data from a data source, it means requesting data from a set of information, then requesting another set and so on. Notice I used the term "set" within my description. Each retrieval task is requesting data from a separate set of data (even if from the same data source), and each request could return no rows or potentially thousands of rows or more.
The most obvious example of a procedural approach is the use of cursors. Let's take a look at the general syntax to see how it works.
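A minimal sketch of that general syntax follows (the `dbo.Books` table and its columns are hypothetical, used only for illustration):

```sql
-- Declare a variable to receive one column per fetched row
DECLARE @Title varchar(100);

-- The cursor is defined over an ordinary SELECT statement
DECLARE BookCursor CURSOR FOR
    SELECT Title
    FROM dbo.Books
    WHERE IsCheckedOut = 1;

OPEN BookCursor;
FETCH NEXT FROM BookCursor INTO @Title;

-- @@FETCH_STATUS = 0 means the last FETCH returned a row
WHILE @@FETCH_STATUS = 0
BEGIN
    -- Process exactly one row per iteration
    PRINT @Title;
    FETCH NEXT FROM BookCursor INTO @Title;
END;

CLOSE BookCursor;
DEALLOCATE BookCursor;
```

The loop body touches one row at a time; the engine must be asked for the next row over and over until `@@FETCH_STATUS` reports that the result set is exhausted.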
We know SELECT is the basic data retrieval command in the SQL language. Here we see it used in the definition of the cursor. Each time we execute the statement "FETCH NEXT", we are essentially re-executing that SELECT statement to get the next top result row. Here is a very similar approach without defining a cursor:
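A sketch of that cursor-free equivalent, again using the hypothetical `dbo.Books` table (and assuming `BookID` is a positive integer key):

```sql
DECLARE @BookID int = 0,
        @Title  varchar(100);

WHILE 1 = 1
BEGIN
    -- Each iteration issues a fresh single-row query,
    -- walking the key from where the last iteration stopped
    SELECT TOP (1) @BookID = BookID,
                   @Title  = Title
    FROM dbo.Books
    WHERE BookID > @BookID
      AND IsCheckedOut = 1
    ORDER BY BookID;

    -- No row found means we have processed everything
    IF @@ROWCOUNT = 0 BREAK;

    PRINT @Title;
END;
```

There is no cursor object here, but the shape of the work is identical: one round trip to the engine per row.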
Let's take a look at a simple example of storing the number of books currently checked out at the library into the local variable @BookCount.
WHILE loop solution:
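A sketch of the WHILE-loop version, counting by visiting one row at a time (same hypothetical `dbo.Books` schema as above):

```sql
DECLARE @BookCount int = 0,
        @BookID    int = 0;

WHILE 1 = 1
BEGIN
    -- Fetch the next checked-out book, one row per iteration
    SELECT TOP (1) @BookID = BookID
    FROM dbo.Books
    WHERE BookID > @BookID
      AND IsCheckedOut = 1
    ORDER BY BookID;

    IF @@ROWCOUNT = 0 BREAK;

    SET @BookCount = @BookCount + 1;
END;
```

And the set-based equivalent, which asks the engine one question about the whole set:

```sql
DECLARE @BookCount int;

SELECT @BookCount = COUNT(*)
FROM dbo.Books
WHERE IsCheckedOut = 1;
```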
As you can see, not only is the set-based solution easier to read, it only issues a single statement to the engine as opposed to one statement per row. Most likely, the library doesn't have thousands of books checked out, so the difference here might not be too bad. But the difference can be very, very large if the data source contains millions of rows of data.
It is important to note that set-based design will not automatically result in faster performance, especially if that particular query includes subqueries in the SELECT clause. For example, a query like this:
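A sketch of such a query (the `dbo.Members` and `dbo.Loans` tables are hypothetical):

```sql
-- Each correlated subquery must be evaluated once per member row
SELECT m.MemberName,
       (SELECT COUNT(*)
        FROM dbo.Loans l
        WHERE l.MemberID = m.MemberID) AS LoanCount,
       (SELECT MAX(l.DueDate)
        FROM dbo.Loans l
        WHERE l.MemberID = m.MemberID) AS LatestDueDate
FROM dbo.Members m;
```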
looks like a set-based approach but really isn't, because each of those subqueries has to run for each row returned by the main query. For more on ways that apparently set-based solutions turn out not to be set-based at all, read Jeff Moden's article "Hidden RBAR: Triangular Joins" here: http://www.sqlservercentral.com/articles/T-SQL/61539/
The set-based solution time was less than 3% of the while-loop time and less than 2% of the cursor time.
The best analogy I have seen comes from Jeff Moden's signature on the main site. And that is that with procedural code you are probably thinking about what you want to do to a row. With set-based code you are usually thinking about what you want to do to a column.
Why should you care?
The efficiencies inherent in a set-based solution can never be matched by a procedural one, because a procedural solution processes one entity at a time, whereas a set-based one aims to solve all entities in a single pass.
answered Oct 25 '09 at 11:11 AM
Matt Whitfield ♦♦
My preference for explaining this question comes down to the paradigm that database systems employ.
When you really get into it, a database system runs on a computer, and eventually works things out iteratively. When you look at an execution plan, it shows you what the system is doing, and it's remarkably procedural. Sure, there are things such as merge joins rather than loop joins which feel set-oriented, but it's still something that is programmed in C at the end of the day.
The power of 'set-based' comes into play because of the Query Optimizer. When you write a query rather than a procedure, there is a translation between that and what actually runs. That translation understands ways of simplifying the work involved, so that it can take advantage of indexes, spools, and so on.
When you use an iterative solution, you effectively tie the Query Optimizer's hands behind its back, forcing it down a particular path. The way you design the solution may not be all that different from the way a plan works (consider writing an SSIS package: it's very easy to make one that works the same way as an execution plan), but you're removing options that could let the problem be solved many times faster.
Combine this with the inherent slowness of explicit cursors (which require the data to be queried repeatedly, or temporary result sets to be stored to satisfy the logic), and you find that set-based solutions generally (but certainly not always) run faster.
answered Oct 25 '09 at 07:36 PM
One of the motivations of E. F. Codd when he first proposed the relational model of data was to devise a system that didn't require the looping, branching and step-at-a-time processing used in imperative programming languages. His idea was a system that would achieve all its results using only expressions built from a set of operators (the relational algebra), returning or assigning relation values in a database.
Unfortunately, SQL DBMSs are not truly relational systems because SQL doesn't properly implement Codd's model. For example, SQL tables and query expressions are not true relations: they may contain duplicate rows, and duplicate column names, among other things. For that reason SQL has to support a lot of old-style "procedural" constructs alongside the "set-based" expressions of the relational model.
answered Oct 25 '09 at 11:02 AM
Although the discussion here has focused on cursors to a large extent, my pet hate is scalar functions in SELECT statements, where the scalar functions contain (and very effectively hide) SELECT statements of their own.
I recently saw a stored proc improve from 7 hours to 2 minutes by replacing the functions with table joins. Some of the functions were SELECTing other scalar functions which themselves had SELECT statements. I can understand the rationale for scalar functions in SQL. It reduces the time it takes to write the code and (possibly) improves readability. It isolates the business logic into reusable code. It... just... doesn't... always translate that well into practical efficient SQL. I think the biggest problem is that they don't LOOK like RBAR. People who can develop set-based solutions to avoid loops will still fall for this trick.
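A minimal sketch of the trap, using hypothetical `dbo.Members` and `dbo.Loans` tables (this is an illustration of the pattern, not the actual stored procedure described above):

```sql
-- A scalar UDF that quietly hides a SELECT
CREATE FUNCTION dbo.GetMemberName (@MemberID int)
RETURNS varchar(100)
AS
BEGIN
    RETURN (SELECT MemberName
            FROM dbo.Members
            WHERE MemberID = @MemberID);
END;
GO

-- Looks set-based, but the hidden SELECT executes once per loan row
SELECT l.LoanID,
       dbo.GetMemberName(l.MemberID) AS MemberName
FROM dbo.Loans l;

-- Rewritten as a join, the optimizer sees the whole problem at once
SELECT l.LoanID,
       m.MemberName
FROM dbo.Loans l
JOIN dbo.Members m
  ON m.MemberID = l.MemberID;
```

The first query reads like a single statement, which is exactly why it doesn't LOOK like RBAR; the join version gives the optimizer the chance to pick a set-oriented plan.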
answered Dec 29 '11 at 10:54 PM