Bob Hovious asked

Set-Based vs. Procedural

Seeder question.

A lot of solutions in t-sql forums are criticized as being procedural rather than set-based, although the solutions look perfectly logical and deliver the correct results. What is the difference between a set-based solution and a procedural solution and why should I care?

I'm putting a bounty on this question for two reasons:

First, bounties are new (to me) and I think it's a cool idea.

Second, while I see some merit in the various answers posted, I would like to see a more thorough treatment of the subject. Differences should be illustrated with code, not analogies, and the performance implications should be spelled out clearly.

I believe this is a huge conceptual issue that newcomers to SQL need to understand and deserve to have explained well.

Tags: t-sql, performance, best-practice, set-based, procedural

Matt Whitfield commented:
I'll leave it 6 days before expanding my answer :)
David 1 answered

One of E. F. Codd's motivations when he first proposed the Relational Model of data was to devise a system that didn't require the looping, branching and step-at-a-time processing of imperative programming languages. His idea was a system that would achieve all its results using only expressions built from a set of operators (the relational algebra), returning or assigning relation values in a database.

Unfortunately, SQL DBMSs are not truly relational systems, because SQL doesn't properly implement Codd's model. For example, SQL tables and query expressions are not true relations: they may contain duplicate rows and duplicate column names. For that reason SQL has to support a lot of the old-style "procedural" constructs as well as the "set-based" expressions of the relational model.
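To make that concrete, here is a small, hypothetical sketch (the temporary table and column are invented for illustration): a SQL table with no key happily accepts duplicate rows, so its contents form a bag rather than a true relation, and you need DISTINCT to recover set semantics.

-- Hypothetical illustration: duplicates are allowed, so this is a bag, not a relation
CREATE TABLE #NotARelation (Colour varchar(10));

INSERT INTO #NotARelation (Colour) VALUES ('red');
INSERT INTO #NotARelation (Colour) VALUES ('red');   -- duplicate row accepted without complaint

SELECT Colour FROM #NotARelation;            -- returns two identical rows
SELECT DISTINCT Colour FROM #NotARelation;   -- DISTINCT needed to get back to set semantics

DROP TABLE #NotARelation;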

Matt Whitfield answered

The best analogy I have seen comes from Jeff Moden's signature on the main site: with procedural code you are probably thinking about what you want to do to a row; with set-based code you are usually thinking about what you want to do to a column.

Why should you care?

The efficiencies inherent in a set-based solution can never hope to be matched by a procedural one, because a procedural solution processes one entity at a time, whereas a set-based one aims to handle all entities in a single pass.
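As a rough illustration of row-thinking versus column-thinking (the Employees table and its columns are hypothetical), both snippets below give every employee a 10% raise; the first reasons about one row at a time, the second describes a single operation on the whole column.

-- Procedural thinking: visit each row and update it individually
DECLARE @EmployeeID int;
DECLARE EmpCursor CURSOR LOCAL FAST_FORWARD
    FOR SELECT EmployeeID FROM Employees;
OPEN EmpCursor;
FETCH NEXT FROM EmpCursor INTO @EmployeeID;
WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE Employees SET Salary = Salary * 1.10 WHERE EmployeeID = @EmployeeID;
    FETCH NEXT FROM EmpCursor INTO @EmployeeID;
END
CLOSE EmpCursor;
DEALLOCATE EmpCursor;

-- Set-based thinking: describe what should happen to the Salary column as a whole
UPDATE Employees SET Salary = Salary * 1.10;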

Rob Farley answered

My preference for explaining this question comes down to the paradigm that database systems employ.

When you really get into it, a database system runs on a computer and eventually works things out iteratively. When you look at an execution plan, it shows you what it's doing, and it's remarkably iterative. Sure, there are things such as merge joins rather than loop joins which feel more set-oriented, but it's still something that is programmed in C at the end of the day.

The power of 'set-based' comes into play because of the Query Optimizer. When you write a query rather than a procedure, there is a translation between that and what actually runs. That translation understands ways of simplifying the work involved, so that it can take advantage of indexes, spools, and so on.

When you use an iterative solution, you effectively tie the Query Optimizer's hands behind its back, forcing it into a particular path. The way you design the solution may not be entirely different to the way the plan works (consider writing an SSIS package - it's very easy to make one that works in the same way as an execution plan), but you're removing options that could cause it to be solved many times faster.

Combine this with the inherent slowness of explicit cursors (which require the data to be queried repeatedly, or temporary result sets to be stored to satisfy the logic), and you find that 'set-based' solutions generally (but certainly not always) work faster.
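To put that in concrete terms (Orders and OrderDetails are hypothetical tables used only for illustration), the cursor below dictates one lookup per order, whereas the equivalent join leaves the optimizer free to choose a hash join, a merge join, or index seeks, whichever is cheapest:

-- Iterative version: the access path is fixed by the code, one lookup per order
DECLARE @OrderID int, @Total money;
DECLARE OrderCursor CURSOR LOCAL FAST_FORWARD
    FOR SELECT OrderID FROM Orders WHERE OrderDate >= '20100101';
OPEN OrderCursor;
FETCH NEXT FROM OrderCursor INTO @OrderID;
WHILE @@FETCH_STATUS = 0
BEGIN
    SELECT @Total = SUM(Quantity * UnitPrice)
    FROM OrderDetails
    WHERE OrderID = @OrderID;
    -- ... do something with @Total ...
    FETCH NEXT FROM OrderCursor INTO @OrderID;
END
CLOSE OrderCursor;
DEALLOCATE OrderCursor;

-- Set-based version: the optimizer decides how to join, in a single statement
SELECT o.OrderID, SUM(d.Quantity * d.UnitPrice) AS Total
FROM Orders AS o
JOIN OrderDetails AS d ON d.OrderID = o.OrderID
WHERE o.OrderDate >= '20100101'
GROUP BY o.OrderID;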

Tom Staab answered

A procedural approach follows the paradigm of traditional languages used for applications. You perform a function then call another function and so on. When this method is used to retrieve data from a data source, it means requesting data from a set of information, then requesting another set and so on. Notice I used the term "set" within my description. Each retrieval task is requesting data from a separate set of data (even if from the same data source), and each request could return no rows or potentially thousands of rows or more.

The most obvious example of a procedural approach is the use of cursors. Let's take a look at the general syntax to see how it works.

DECLARE MyCursor CURSOR
    FOR SELECT ...
OPEN MyCursor

FETCH NEXT FROM MyCursor 
INTO @LocalVariable

WHILE @@FETCH_STATUS = 0
BEGIN
    -- do something

    FETCH NEXT FROM MyCursor 
    INTO @LocalVariable
END
CLOSE MyCursor
DEALLOCATE MyCursor

We know SELECT is the basic data retrieval command in the SQL language. Here we see it used in the definition of the cursor. Each time we execute the statement "FETCH NEXT", we are essentially re-executing that SELECT statement to get the next top result row. Here is a very similar approach without defining a cursor:

DECLARE @var varchar(50); SET @var = 'x';
WHILE @var is not null
BEGIN
    SET @var = 
    (
    	SELECT TOP(1) column1 FROM table1
    	WHERE column1 > @var
    	ORDER BY column1
    )
    IF @var is null
    	BREAK;
    -- do something
END



The idea behind a set-based approach is to attempt to retrieve the same final result but with as few set requests as possible. In fact, the ideal scenario is to design a solution that allows the entire final result to be retrieved from a single dataset.

Let's take a look at a simple example of storing the number of books currently checked out at the library into the local variable @BookCount.

Cursor solution:

DECLARE @BookCount int; SET @BookCount = 0;
DECLARE @BookID int;
DECLARE MyCursor CURSOR
    FOR SELECT book_id FROM books_checked_out
OPEN MyCursor

FETCH NEXT FROM MyCursor
INTO @BookID

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @BookCount = @BookCount + 1;
    FETCH NEXT FROM MyCursor
    INTO @BookID
END
CLOSE MyCursor
DEALLOCATE MyCursor

WHILE loop solution:

DECLARE @BookCount int; SET @BookCount = 0;
DECLARE @BookID int; SET @BookID = 0;
WHILE @BookID is not null
BEGIN
    SET @BookID =
    (
        SELECT TOP(1) book_id
        FROM books_checked_out
        WHERE book_id > @BookID
        ORDER BY book_id
    );
    IF @BookID is not null
        SET @BookCount = @BookCount + 1;
END

Set-based solution:

DECLARE @BookCount int;
SET @BookCount = (SELECT COUNT(*) FROM books_checked_out);

As you can see, not only is the set-based solution easier to read, it only issues a single statement to the engine as opposed to one statement per row. Most likely, the library doesn't have thousands of books checked out, so the difference here might not be too bad. But the difference can be very, very large if the data source contains millions of rows of data.

It is important to note that a set-based design will not automatically result in faster performance, especially if the query includes correlated subqueries in the SELECT clause. For example, a query like this:

SELECT a
    , (SELECT x FROM table2 WHERE table2.a = table1.a) AS field1
    , (SELECT y FROM table3 WHERE table3.a = table1.a) AS field2
    , (SELECT z FROM table4 WHERE table4.a = table1.a) AS field3
FROM table1

looks like a set-based approach but really isn't, because each of those correlated subqueries has to run for every row returned by the main query. For more on the ways that some apparently set-based solutions really aren't set-based after all, read Jeff Moden's article "Hidden RBAR: Triangular Joins": http://www.sqlservercentral.com/articles/T-SQL/61539/
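To show what the fix usually looks like (keeping the invented table and column names from the fragment above, and assuming each lookup returns at most one row per value of a), the per-row lookups can be rewritten as joins so each lookup table is touched once for the whole set:

-- Join rewrite: one pass over each lookup table instead of one subquery per row
SELECT t1.a
    , t2.x AS field1
    , t3.y AS field2
    , t4.z AS field3
FROM table1 AS t1
LEFT JOIN table2 AS t2 ON t2.a = t1.a
LEFT JOIN table3 AS t3 ON t3.a = t1.a
LEFT JOIN table4 AS t4 ON t4.a = t1.a;

The optimizer can then satisfy each lookup with a single scan or seek rather than once per row of table1.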

Edit:
In response to Matt's request, I decided to use the data from the FIFO challenge. In this simpler example, though, I am just determining the current number of items. For each IN or RET (return) transaction, inventory increases by the number of items in the transaction, and it decreases for each OUT transaction.

Edit 2:
Corrected the script and provided results.

-- cursor
DBCC FREEPROCCACHE;
GO

DECLARE @StartTime datetime; SET @StartTime = GETDATE();
DECLARE @CurrentItems bigint; SET @CurrentItems = 0;
DECLARE @TranCode char(3), @Items int;
DECLARE MyCursor CURSOR
    FOR SELECT TranCode, Items FROM Stock

OPEN MyCursor
FETCH NEXT FROM MyCursor INTO @TranCode, @Items
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @TranCode IN ('IN', 'RET')
    	SET @CurrentItems = @CurrentItems + @Items;
    ELSE
    	SET @CurrentItems = @CurrentItems - @Items;
    FETCH NEXT FROM MyCursor INTO @TranCode, @Items
END
CLOSE MyCursor
DEALLOCATE MyCursor

SELECT 'cursor', DATEDIFF(ms, @StartTime, GETDATE()), @CurrentItems

GO

-- WHILE loop
DBCC FREEPROCCACHE;
GO

DECLARE @StartTime datetime; SET @StartTime = GETDATE();
DECLARE @StockID int; SET @StockID = 0;
DECLARE @CurrentItems bigint; SET @CurrentItems = 0;
DECLARE @TranCode char(3), @Items int;

WHILE @StockID is not null
BEGIN
    SET @StockID =
    (
    	SELECT TOP(1) StockID
    	FROM Stock
    	WHERE StockID > @StockID
    	ORDER BY StockID
    )

    IF @StockID is not null
    BEGIN
    	SELECT @TranCode=TranCode, @Items=Items
    	FROM Stock
    	WHERE StockID = @StockID

    	IF @TranCode IN ('IN', 'RET')
    		SET @CurrentItems = @CurrentItems + @Items;
    	ELSE
    		SET @CurrentItems = @CurrentItems - @Items;
    END
END

SELECT 'while', DATEDIFF(ms, @StartTime, GETDATE()), @CurrentItems

GO

-- set-based
DBCC FREEPROCCACHE;
GO

DECLARE @StartTime datetime; SET @StartTime = GETDATE();
DECLARE @CurrentItems bigint; SET @CurrentItems = 0;

SELECT @CurrentItems = @CurrentItems +
    CASE TranCode
    	WHEN 'IN' THEN Items
    	WHEN 'RET' THEN Items
    	WHEN 'OUT' THEN -Items
    END
FROM Stock

SELECT 'set', DATEDIFF(ms, @StartTime, GETDATE()), @CurrentItems

Results:

  • Cursor = 48046 ms
  • While loop = 30343 ms
  • set-based = 843 ms

The set-based solution time was less than 3% of the while-loop time and less than 2% of the cursor time.
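As a side note on the set-based script above, the running-variable assignment works here, but the same calculation can also be expressed as a plain aggregate, which is arguably clearer; a sketch against the same Stock table:

-- Alternative set-based form using an aggregate instead of variable accumulation
SELECT SUM(CASE WHEN TranCode IN ('IN', 'RET') THEN Items ELSE -Items END) AS CurrentItems
FROM Stock;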


Magnus Ahlkvist commented:
A late comment: Nice example. But I have to disagree on one point: **Each time we execute the statement "FETCH NEXT", we are essentially re-executing that SELECT statement to get the next top result row.** Are we really? Isn't a cursor in SQL Server represented by an actual server object that is kept in memory? At least that was my assumption.
sbhollis answered
I landed here while trying to find something that describes when it is not appropriate to use a set-based language to solve a problem. In other words, what types of algorithms don't lend themselves nicely to set-based implementations? I think I have an example: write bubble sort using SQL. Check this out and see what you think: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=124131

I don't know that I can elegantly and simply articulate the principles involved. If someone can, that would be great. That is really what I am looking for: simple language that a semi-technical stakeholder might be able to understand. My attempt at it would be something like "set-based is great unless you need to compare the rows against each other while collecting the result set."

Another thing that should be mentioned is the difficulty of debugging an algorithm when it happens all at once, as in the set-based approach. I may want to see which comparison created an unexpected result, and for me that is more difficult with SQL.

The other thing I have run into is the maintainability of the code in this situation. In my case I inherited a set of stored procedures that fall into this category. The pseudocode algorithm that describes the solution is iterative and very straightforward, but the actual SQL code that implements it is very difficult to discern. The programmer did not want to use cursors, which helped with performance, but it is hard to read.

On a similar note, I wonder whether the computer-science discussion of this would involve recurrence relations in relation to the time complexity of set-based algorithms. Looking forward to hearing this discussion if it gets bumped.

Blake
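For the specific case of comparing each row against its neighbour, a ROW_NUMBER self-join is one set-based option; a hypothetical sketch, with the table and column names invented for illustration:

-- Compare each row with the previous one without an explicit loop
WITH Ordered AS
(
    SELECT ReadingID
        , ReadingValue
        , ROW_NUMBER() OVER (ORDER BY ReadingID) AS rn
    FROM SensorReadings
)
SELECT cur.ReadingID
    , cur.ReadingValue
    , cur.ReadingValue - prev.ReadingValue AS ChangeFromPrevious
FROM Ordered AS cur
LEFT JOIN Ordered AS prev ON prev.rn = cur.rn - 1;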

Matt Whitfield commented:
I'm not really sure what you mean by "On a similar note I wonder if the CS discussion on this would involve recurrence relations in relationship to the time complexity of set based algorithms", but with the other stuff I can really identify with what you are saying - and felt exactly the same way maybe 6 or 7 years ago. Yes, set-based is different. Yes, it can be hard to understand. But it can also yield very positive results. I would be tempted to change your explanation from "set based is great unless you need to compare the rows against each other while collecting the result set" to "set based is great unless you need to compare the rows against each other while collecting the result set - which is when you use CLR".
Scot Hauder commented:
"I don't know that I can elegantly and simply articulate the principles involved. If someone can that would be great" Bubble sort: SELECT [your coulmn] ORDER BY [your column] "Another thing that should be mentioned is the difficulty of debugging an algorithm when it happens all at once like the set based approach" Execute portions of the SP by highlighting only what you want in SSMS and hit F5. Work from the inner queries outward. Add a WHERE clause to only return a few rows until you understand what is happening SQL is like chess in that it may not take long to learn the language but it takes years to master SQL is just a tool and not all tools are right for every job.
sbhollis commented:
My comments about Computer Science were a poor attempt at dusting off my unused knowledge from school. I just figure there must be a "formal" description of what we are discussing - when is a set-based language the wrong tool? If you couldn't tell, I am new to database programming =P I spared you the pain of the real algorithm I am dealing with and chose the task of implementing bubble sort because it is simple but has the same attributes as my problem. I will most likely do as you suggested and use CLR to solve this issue. I think it is too difficult to make a general statement about when SQL is the wrong tool for the job, but it would be great if there were such a rule of thumb. Thanks for the comments.
Magnus Ahlkvist answered
One little addition from me: we have many, many stored procedures which use cursors. Sure - some of them could probably be replaced by a set-based approach, and some of them definitely by XML transformations, as they were written for SQL Server 6.5 to convert catalogue data to XML data.

BUT. I feel no need to replace most of the cursor-based solutions we have. They work, and they are not run interactively (most of them run scheduled outside of office hours). The reason I want to keep most of them is that they will be looked at, and sometimes changed, by people who are C# or VB programmers, not database programmers. In a small organisation it's just not possible to segregate duties. When the database programmer is not around, someone else needs to dig in, and then I'd rather have a C# programmer change a cursor-based solution than a set-based solution.

Having said that, it's true what others have said about a set-based solution mostly being more efficient than a cursor-based one.

Finally (and rather importantly): a cursor is not just simply a cursor. One cursor can be more efficient than another. The typical usage example is "open a cursor, fetch each row, query some other table(s) based on the fetched row, repeat until all rows have been read". In that case, it just doesn't do it for me to write

DECLARE cur CURSOR FOR SELECT SOMETHING FROM SOMETHINGELSE

That's an inefficient solution. DECLARE cur CURSOR FOR ... without any other options means the same thing as

DECLARE cur CURSOR GLOBAL FORWARD_ONLY OPTIMISTIC FOR ...

A more efficient way to declare a cursor, if you just want to loop through it, you're OK with the cursor not being updateable, and you don't mind if a row or two changes after you open the cursor, is

DECLARE cur CURSOR LOCAL FAST_FORWARD FOR ...

Have a look at http://msdn.microsoft.com/en-us/library/ms180169.aspx for a description of the options.

So I guess the advice from me would be:

1. Use cursors only when it makes sense.
2. Use the type of cursor that makes sense when you use one (a minimal sketch follows).
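As a minimal sketch of that second point (the table and column names below are placeholders), a read-only forward pass declared with LOCAL FAST_FORWARD looks like this:

-- Lean cursor for a simple read-only, forward-only loop
DECLARE @Value int;
DECLARE cur CURSOR LOCAL FAST_FORWARD
    FOR SELECT SomeColumn FROM dbo.SomeTable;
OPEN cur;
FETCH NEXT FROM cur INTO @Value;
WHILE @@FETCH_STATUS = 0
BEGIN
    -- do the per-row work here
    FETCH NEXT FROM cur INTO @Value;
END
CLOSE cur;
DEALLOCATE cur;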

GPO answered
Although the discussion here has focussed to a large extent on cursors, my pet hate is scalar functions in SELECT statements, where the scalar functions contain (and very effectively hide) SELECT statements of their own.

SELECT fn_something(tbl1.col2) AS col2_description
    , fn_something(tbl1.col3) AS col3_description
    , fn_something(tbl1.col4) AS col4_description
    , fn_something(tbl1.col5) AS col5_description
    , fn_something(tbl2.col4) AS t2_col4_description
    , fn_something(tbl2.col5) AS t2_col5_description
    , fn_something(tbl2.col6) AS t2_col6_description
    , fn_something(tbl2.col7) AS t2_col7_description

I recently saw a stored proc improve from 7 hours to 2 minutes by replacing the functions with table joins. Some of the functions were SELECTing other scalar functions which themselves had SELECT statements.

I can understand the rationale for scalar functions in SQL. It reduces the time it takes to write the code and (possibly) improves readability. It isolates the business logic into reusable code. It... just... doesn't... always translate that well into practical, efficient SQL.

I think the biggest problem is that they don't LOOK like RBAR. People who can develop set-based solutions to avoid loops will still fall for this trick.
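A schematic before-and-after of that kind of rewrite (the function, table and column names below are invented, not from the original proc): the scalar function hides a SELECT that runs once per row, while the join version lets the lookup be resolved once for the whole result set.

-- Hidden RBAR: the function body runs once for every row of tbl1
CREATE FUNCTION dbo.fn_something (@code int)
RETURNS varchar(50)
AS
BEGIN
    DECLARE @description varchar(50);
    SELECT @description = description FROM dbo.code_lookup WHERE code = @code;
    RETURN @description;
END
GO

-- Join rewrite: one pass over the lookup table for the whole set
SELECT t1.col2, l.description AS col2_description
FROM tbl1 AS t1
LEFT JOIN dbo.code_lookup AS l ON l.code = t1.col2;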

danlatimer answered
This isn't a complicated question and does not require a complicated answer.

**Short answer:** Databases are most efficient when you tell them what you want and let them figure out the most efficient way of doing it.

**Longer answer:** Database engines have been designed to perform certain operations very efficiently. That's why traditional, non-procedural SQL doesn't tell the database what to do, just what it wants done. The database then has the freedom to do it in the most efficient manner.

**Example:** When you tell the database you want to iterate through a set of rows so you can count them, you are telling the database how to do it. In this example it is easy to see that iterating isn't the most efficient way to determine the number of rows, because the database stores those rows in special data structures that can perform operations like getting the total count very quickly - perhaps in one operation instead of N operations, where N is the number of rows.