x

Set-Based vs. Procedural

Seeder question.

A lot of solutions in t-sql forums are criticized as being procedural rather than set-based, although the solutions look perfectly logical and deliver the correct results. What is the difference between a set-based solution and a procedural solution and why should I care?

I'm putting a bounty on this question for two reasons:

First, bounties are new (to me) and I think it's a cool idea.

Second, while I see some merit in the various answers posted, I would like to see a more thorough treatment of the subject. Differences should be illustrated with code, not analogies, and the performance implications should be spelled out clearly.

I believe this is a huge conceptual issue that newcomers to SQL need to understand and deserve to have explained well.

more ▼

asked Oct 25 '09 at 09:50 AM in Default

Bob Hovious gravatar image

Bob Hovious
1.6k 5 6 9

I'll leave it 6 days before expanding my answer :)
Oct 28 '09 at 02:41 PM Matt Whitfield ♦♦
(comments are locked)
10|1200 characters needed characters left

9 answers: sort voted first

A procedural approach follows the paradigm of traditional languages used for applications. You perform a function then call another function and so on. When this method is used to retrieve data from a data source, it means requesting data from a set of information, then requesting another set and so on. Notice I used the term "set" within my description. Each retrieval task is requesting data from a separate set of data (even if from the same data source), and each request could return no rows or potentially thousands of rows or more.

The most obvious example of a procedural approach is the use of cursors. Let's take a look at the general syntax to see how it works.

DECLARE MyCursor CURSOR
    FOR SELECT ...
OPEN MyCursor

FETCH NEXT FROM MyCursor 
INTO @LocalVariable

WHILE @@FETCH_STATUS = 0
BEGIN
    -- do something

    FETCH NEXT FROM MyCursor 
    INTO @LocalVariable
END
CLOSE MyCursor
DEALLOCATE MyCursor

We know SELECT is the basic data retrieval command in the SQL language. Here we see it used in the definition of the cursor. Each time we execute the statement "FETCH NEXT", we are essentially re-executing that SELECT statement to get the next top result row. Here is a very similar approach without defining a cursor:

DECLARE @var varchar(50); SET @var = 'x';
WHILE @var is not null
BEGIN
    SET @var = 
    (
    	SELECT TOP(1) column1 FROM table1
    	WHERE column1 > @var
    	ORDER BY column1
    )
    IF @var is null
    	BREAK;
    -- do something
END



The idea behind a set-based approach is to attempt to retrieve the same final result but with as few set requests as possible. In fact, the ideal scenario is to design a solution that allows the entire final result to be retrieved from a single dataset.

Let's take a look at a simple example of storing the number of books currently checked out at the library into the local variable @BookCount.

Cursor solution:

DECLARE @BookCount int; SET @BookCount = 0;
DECLARE MyCursor CURSOR
    FOR SELECT book_id FROM books_checked_out
OPEN MyCursor

FETCH NEXT FROM MyCursor
INTO @BookCount

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @BookCount = @BookCount + 1;
    FETCH NEXT FROM MyCursor 
    INTO @BookCount
END
CLOSE MyCursor
DEALLOCATE MyCursor

WHILE loop solution:

DECLARE @BookCount int; SET @BookCount = 0;
DECLARE @BookID int; SET @BookID = 0;
WHILE @BookID is not null
BEGIN
SET @BookID =
(
	SELECT TOP(1) book_id
	FROM books_checked_out
	WHERE book_id > @BookID
);
    IF @BookID is not null
    	SET @BookCount = @BookCount + 1;
END

Set-based solution:

DECLARE @BookCount int;
SET @BookCount = (SELECT COUNT(*) FROM books_checked_out);

As you can see, not only is the set-based solution easier to read, it only issues a single statement to the engine as opposed to one statement per row. Most likely, the library doesn't have thousands of books checked out, so the difference here might not be too bad. But the difference can be very, very large if the data source contains millions of rows of data.

It is important to note that set-based design will not automatically result in faster performance, especially if that particular query includes subqueries in the SELECT clause. For example, a query like this:

SELECT a
    , (SELECT x FROM table2) AS field1
    , (SELECT y FROM table3) AS field2
    , (SELECT z FROM table4) AS field3
FROM table1

looks like a set-based approach but really isn't because each of those subqueries will have to run for each row returned by the main query. For more information on ways that some attempted set-based solutions really aren't truly set-based after all, read Jeff Moden's article here: http://www.sqlservercentral.com/articles/T-SQL/61539/ entitled "Hidden RBAR: Triangular Joins".

Edit:
In response to Matt's request, I decided to use the data from the FIFO challenge. In this simple example, though, I am just determining the current items. For each IN or RETurn transaction, inventory increases by the number of items in the transaction, and it decreases for each OUT transaction. Edit2:
corrected script and provided results

-- cursor
DBCC FREEPROCCACHE;
GO

DECLARE @StartTime datetime; SET @StartTime = GETDATE();
DECLARE @CurrentItems bigint; SET @CurrentItems = 0;
DECLARE @TranCode char(3), @Items int;
DECLARE MyCursor CURSOR
    FOR SELECT TranCode, Items FROM Stock

OPEN MyCursor
FETCH NEXT FROM MyCursor INTO @TranCode, @Items
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @TranCode IN ('IN', 'RET')
    	SET @CurrentItems = @CurrentItems + @Items;
    ELSE
    	SET @CurrentItems = @CurrentItems - @Items;
    FETCH NEXT FROM MyCursor INTO @TranCode, @Items
END
CLOSE MyCursor
DEALLOCATE MyCursor

SELECT 'cursor', DATEDIFF(ms, @StartTime, GETDATE()), @CurrentItems

GO

-- WHILE loop
DBCC FREEPROCCACHE;
GO

DECLARE @StartTime datetime; SET @StartTime = GETDATE();
DECLARE @StockID int; SET @StockID = 0;
DECLARE @CurrentItems bigint; SET @CurrentItems = 0;
DECLARE @TranCode char(3), @Items int;

WHILE @StockID is not null
BEGIN
    SET @StockID =
    (
    	SELECT TOP(1) StockID
    	FROM Stock
    	WHERE StockID > @StockID
    	ORDER BY StockID
    )

    IF @StockID is not null
    BEGIN
    	SELECT @TranCode=TranCode, @Items=Items
    	FROM Stock
    	WHERE StockID = @StockID

    	IF @TranCode IN ('IN', 'RET')
    		SET @CurrentItems = @CurrentItems + @Items;
    	ELSE
    		SET @CurrentItems = @CurrentItems - @Items;
    END
END

SELECT 'while', DATEDIFF(ms, @StartTime, GETDATE()), @CurrentItems

GO

-- set-based
DBCC FREEPROCCACHE;
GO

DECLARE @StartTime datetime; SET @StartTime = GETDATE();
DECLARE @CurrentItems bigint; SET @CurrentItems = 0;

SELECT @CurrentItems = @CurrentItems +
    CASE TranCode
    	WHEN 'IN' THEN Items
    	WHEN 'RET' THEN Items
    	WHEN 'OUT' THEN -Items
    END
FROM Stock

SELECT 'set', DATEDIFF(ms, @StartTime, GETDATE()), @CurrentItems

Results:

  • Cursor = 48046 ms
  • While loop = 30343 ms
  • set-based = 843 ms

The set-based solution time was less than 3% of the while-loop time and less than 2% of the cursor time.

more ▼

answered Oct 28 '09 at 08:55 PM

Tom Staab gravatar image

Tom Staab
5.8k 6 8 10

Now we're talking. I'm not going to pick yours yet, Tom, because I'm hoping the competition for a bounty will provoke a few more efforts like this. I'm giving you a vote for raising the bar.
Oct 28 '09 at 09:08 PM Bob Hovious
Thanks, Bob. I'm glad you liked my answer. I hadn't planned to answer since I liked some of the answers that were there already, but once you asked for more details with examples (and offered a nice bounty), I reconsidered. I use examples a lot when explaining concepts to other developers, anyway, so it was natural for me to do so here. Besides, if you're willing to be the first to give a bounty, I suppose I'd be willing to be the first to get one if my answer is chosen. ;)
Oct 28 '09 at 11:06 PM Tom Staab
+1 - excellent answer - just wondering if it would be worth putting in some timing differences of the while / cursor / set based statements over, say, 1M rows?
Oct 29 '09 at 08:07 AM Matt Whitfield ♦♦
About a week ago, I edited my response to include examples per Matt's suggestion, but I forgot to add a comment until now. The code and results are at the bottom of the answer.
Nov 04 '09 at 01:12 PM Tom Staab
A late comment: Nice example. But I have to disagree on one point: Each time we execute the statement "FETCH NEXT", we are essentially re-executing that SELECT statement to get the next top result row. Are we really? Isn't a cursor in SQL Server represented by an actual server object that is kept in memory? At least that was my assumption.
Nov 18 '10 at 03:44 AM Magnus Ahlkvist
(comments are locked)
10|1200 characters needed characters left

The best analogy I have seen comes from Jeff Moden's signature on the main site. And that is that with procedural code you are probably thinking about what you want to do to a row. With set-based code you are usually thinking about what you want to do to a column.

Why should you care?

The efficiencies inherent within a set-based solution can never hope to be matched by a procedural one, as a procedural one can only solve one entity at a time, wheras a set-based one aims to solve all entities in one pass.

more ▼

answered Oct 25 '09 at 11:11 AM

Matt Whitfield gravatar image

Matt Whitfield ♦♦
29.4k 61 65 87

(comments are locked)
10|1200 characters needed characters left

My preference for explaining this question comes down to the paradigm that database systems employ.

When you really get into it, a database system runs on a computer, and eventually works things out iteratively. When you look at an execution plan, it shows you what it's doing, and it's incredibly set-based. Sure, there are things such as merge-joins rather than loop-joins which feel set-oriented, but it's still something that is programmed in C at the end of the day.

The power of 'set-based' comes into play because of the Query Optimizer. When you write a query rather than a procedure, there is a translation between that and what actually runs. That translation understands ways of simplifying the work involved, so that it can take advantage of indexes, spools, and so on.

When you use an iterative solution, you effectively tie the Query Optimizer's hands behind its back, forcing it into a particular path. The way you design the solution may not be entirely different to the way the plan works (consider writing an SSIS package - it's very easy to make one that works in the same way as an execution plan), but you're removing options that could cause it to be solved many times faster.

Combining this with the inherent slowness of explicit cursors (which require the data to be queried repeatedly, or storing temporary resultsets to satisfy the logic), and you find that 'set-based' solutions generally (but certainly not always) work faster.

more ▼

answered Oct 25 '09 at 07:36 PM

Rob Farley gravatar image

Rob Farley
5.7k 15 18 20

(comments are locked)
10|1200 characters needed characters left

One of the motivations of E.F.Codd when he first proposed the Relational Model of data was to devise a system that didn't require the looping, branching and step-at-a-time processing used in imperative programming languages. His idea was a system that would achieve all its results using only expressions based on a set of operators (relational algebra) returning or assigning relation values in a database.

SQL DBMSs unfortunately are not truly relational systems because SQL doesn't properly implement Codd's model. For example SQL uses tables and query expressions that are not true relations - they may contain duplicate rows and column names for example. For that reason SQL has to support lots of the old-style "procedural" language as well as the "set-based" type of expressions of the relational model.

more ▼

answered Oct 25 '09 at 11:02 AM

David 1 gravatar image

David 1
1.8k 1 3

(comments are locked)
10|1200 characters needed characters left

I landed here while trying to find something that described when it is not appropriate to use a set based language to solve a problem. Or in other words what types of algorithms don't lend themselves nicely to set based implementations? I think I have an example:

Write bubble sort using SQL. Check this out and see what you think:

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=124131

I don't know that I can elegantly and simply articulate the principles involved. If someone can that would be great. That is really what I am looking for; simple language that a semi technical stakeholder might be able to understand. My attempt at it would be something like set based is great unless you need to compare the rows against each other while collecting the result set.

Another thing that should be mentioned is the difficulty of debugging an algorithm when it happens all at once like the set based approach. I may want to see which comparison created an unexpected result and for me this is more difficult with SQL.

The other thing I have run into with this is the maintainability of the code in this situation. In my case I inherited a set of SPs that fall into this category. The psuedo code algorithm that describes the solution is iterative and very straight forward. However the actual SQL code that implements it is very difficult to discern. The programmer did not want to use cursors which helped with performance but it is hard to read.

On a similar note I wonder if the CS discussion on this would involve recurrence relations in relationship to the time complexity of set based algorithms.

Looking forward to hearing this discussion if it gets bumped.

Blake
more ▼

answered Nov 17 '10 at 12:20 PM

sbhollis gravatar image

sbhollis
1 1

I'm not really sure what you mean by:

On a similar note I wonder if the CS discussion on this would involve recurrence relations in relationship to the time complexity of set based algorithms.

But with the other stuff I can really identify with what you are saying - and felt exactly the same way maybe 6 or 7 years ago. Yes, set based is different. Yes, it can be hard to understand. But it can also yield very positive results.

I would be tempted to change your explanation from

set based is great unless you need to compare the rows against each other while collecting the result set.

To

>set based is great unless you need to compare the rows against each other while collecting the result set - which is when you use CLR.
Nov 18 '10 at 12:31 AM Matt Whitfield ♦♦

"I don't know that I can elegantly and simply articulate the principles involved. If someone can that would be great"

Bubble sort: SELECT [your coulmn] ORDER BY [your column]

"Another thing that should be mentioned is the difficulty of debugging an algorithm when it happens all at once like the set based approach"

Execute portions of the SP by highlighting only what you want in SSMS and hit F5. Work from the inner queries outward. Add a WHERE clause to only return a few rows until you understand what is happening

SQL is like chess in that it may not take long to learn the language but it takes years to master

SQL is just a tool and not all tools are right for every job.
Nov 18 '10 at 05:08 AM Scot Hauder
My comments about Computer Science were a poor attempt at me trying to dust off my unused knowledge from school. I just figure there must be a "formal" description of what we are discussing - when is a set based language the wrong tool. If you couldn't tell I am new to database programming =P I spared you guys the pain of the real algorithm I am dealing with and chose the task of implementing bubble sort because it is simple but has the same attributes as my problem. I will most likely do as you suggested and use CLR to solve this issue. I think it is too dificult to make a general statement about when SQL is the wrong tool for the job but it would be great if there were such a rule of thumb. Thanks for the comments.
Nov 18 '10 at 11:24 AM sbhollis
(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

New code box

There's a new way to format code on the site - the red speech bubble logo will automatically format T-SQL for you. The original code box is still there for XML, etc. More details here.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments

SQL Server Central

Need long-form SQL discussion? SQLserverCentral.com is the place.

Topics:

x977
x242
x61
x11
x2

asked: Oct 25 '09 at 09:50 AM

Seen: 10717 times

Last Updated: Jul 31 '12 at 05:17 PM