
The 'Subscription List' SQL Problem

Phil Factor SQL Speed Phreak Competition: No 1

This competition is now over, but the winner, Peso, got an Amazon voucher for $60, and the privilege of being able to display the 'Phil Factor SQL Speed Phreak' award on their own site.

It was quite a struggle, with close competition from many of the participants. However, Peso came up with a blindingly fast winner that produced an aggregation from a million rows in a third of a second. Now we're all scratching our heads trying to come up with the fastest way of solving....

The ‘FIFO Stock Inventory’ SQL Problem

...on this site


(Here is the original preamble.)

I really genuinely don't know the answer to this question: what is the fastest way in SQL Server (any version) to provide the following subscription report?

It is a reasonable request. We have a subscription list with 10,000 subscribers, and we need a management report giving a monthly breakdown: the date, the number of current subscribers at the end of the month, the number of resignations during the month ('unsubscribes'), and the number of new subscribers. The list should be in date order, and the date should be just the date of the first day of the month.

The table is in this form (simplified, of course, from the way we'd do it in a real system):

CREATE TABLE [dbo].[Registrations]
    (
     [Registration_ID] [int] IDENTITY(1, 1) NOT NULL,
     [FirstName] [varchar](80) NOT NULL,
     [LastName] [varchar](80) NOT NULL,
     [DateJoined] [datetime] NOT NULL,
     [DateLeft] [datetime] NULL,
     CONSTRAINT [PK_Registrations] PRIMARY KEY CLUSTERED
        ([DateJoined], [LastName], [FirstName])
    )

CREATE INDEX idxDateJoined
    ON Registrations (DateJoined, DateLeft, Registration_ID)
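
To pin down the shape of the required output, here is a minimal baseline sketch; it is deliberately naive and not a competition entry. It assumes, per the note below, that 'now' is 1 October 2009, so a DateLeft on or after that date belongs to a prepaid subscription that hasn't lapsed yet. The windowed running total needs SQL Server 2012 or later, so it's anachronistic for the contest; the entries below achieve the same with quirky updates or self-joins.

-- A naive baseline sketch: correctness first, speed not the point
DECLARE @now datetime = '20091001'  -- assumption: 'now' is the end of September 2009

;WITH Monthly AS (
    SELECT  DATEDIFF(MONTH, 0, DateJoined) AS MonthNo,
            COUNT(*) AS Joined, 0 AS [Left]
    FROM    dbo.Registrations
    GROUP BY DATEDIFF(MONTH, 0, DateJoined)
    UNION ALL
    SELECT  DATEDIFF(MONTH, 0, DateLeft), 0, COUNT(*)
    FROM    dbo.Registrations
    WHERE   DateLeft < @now            -- future leave dates haven't happened yet
    GROUP BY DATEDIFF(MONTH, 0, DateLeft)
)
SELECT  DATEADD(MONTH, MonthNo, 0) AS [Date],
        SUM(SUM(Joined) - SUM([Left])) OVER (ORDER BY MonthNo) AS Subscribers,
        SUM([Left])  AS Unsubscribes,
        SUM(Joined)  AS NewSubscribers
FROM    Monthly
GROUP BY MonthNo
ORDER BY MonthNo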

You are welcome to change the two indexes to suit your solution. I'll give you a reasonable amount of data to try stuff out on. 10,000 faked data entries of subscribers is the most that I can reasonably ask you to download (the list is here), but I shall be taking each solution, putting it into a test harness consisting of a million subscribers (I may even make it 1,169,187 to celebrate SQLServerCentral's subscription list), and finding the fastest way of doing it. I have ideas of my own about the way to do this, but I suspect they're wrong.

Note that in the sample data, the subscriptions that extend to Sept 2010 belong to people who've paid for a year's subscription only, rather than those with ongoing renewals (e.g. Direct Debit). 'Now' is the end of September 2009.

I will allow you to use a number table, but you can assume that every month has new subscribers in it (so you won't need one to fill gaps). You can use views or temporary tables, but the time taken for their creation will be included in the timings. You can use a cursor if you don't mind a sharp intake of breath from Jeff. You can use any version of SQL Server that you like.

The winner will be chosen from amongst the tied fastest entrants (generally there is a group of these), and it will be the one with the highest number of votes. We'll announce the winner in a week's time, on 19th October.

Who knows, if you are the only person who ever discovers this site and this competition, then the winner will be you!


OK. End of day 1, and we've had some extremely good results (in milliseconds) from the original 10,000-row data table:

Matt               16
Kev Riley          93
Graham             60
Peso               33
Gianluca           16
AndyM             170
Joe Harris      15533
William brewer    406

I've had to temporarily disqualify Kev, AndyM and Joe Harris, as I couldn't get their results to match the correct values. All the other results agreed. The results for Matt, Peso and Gianluca are all very close indeed, too close to get a result from 10,000 rows, so I'll have to up the test table to a million rows. From experience, I know that a lot could change between now and next Monday. I feel sure that Graham's SQL can be tweaked, and I expect to see the slow ones back in the running with some fine-tuning. Anyone else?


OK. End of Day 2, and Peso has streaked ahead with his third version. It is now too fast to measure on 10,000 rows, so I shall have to move to a larger database for the performance tests. Now that I've put result-checking into the test harness, Gianluca has had to be temporarily disqualified until he can cure his result, which is slightly out.
Matt              16 ms
Kev Riley         76 ms *
Graham            60 ms
Peso              30 ms
Gianluca          33 ms *
AndyM            170 ms *
William brewer   406 ms
Peso 2            16 ms
Peso 3             0 ms

(* result of SQL is incorrect)

Day 3. Due to a couple of glitches I haven't been able to complete all the timings on the million-row table. Here is what I've done so far!
Graham 1      6986 ms
Peso 1         890 ms
Peso 2        1173 ms
Peso 3        1170 ms
Matt 1         596 ms
Matt 2         873 ms
Gianluca 1    4546 ms
Peso 4B        940 ms
Andriy Z      1200 ms *
Graham 2      1656 ms


Day 5

Currently re-creating the test harness! Timings on the one-million-row table are as follows. Remember that these are preliminary timings, and Peso and I are scratching our heads over why we are getting such different results. In the test harness, the results are inserted into a table variable and subsequently checked for validity (a sketch of that shape follows the table below).

Entry       Elapsed time in milliseconds
Graham 1    6983 ms
Peso 1       890 ms
Peso 2      1186 ms
Peso 3      1173 ms
Matt 1       576 ms
Matt 2       860 ms
Gianluca 1  4550 ms
Peso 4B      936 ms
Peso 4d      830 ms
Peso 4e      313 ms
Andriy Z    1203 ms
Graham 2    1640 ms
Brewer 2     406 ms
Peso 1d     1076 ms
Gustavo 1    580 ms
Gianluca 4  2390 ms
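
For anyone wanting to reproduce this at home, here is a minimal sketch of the harness shape described above, assuming each entry is wrapped as a stored procedure. The names dbo.CandidateEntry and dbo.ReferenceResults are hypothetical, not Phil's actual harness.

DECLARE @results TABLE ([Date] datetime, Subscribers int, Unsubscribes int, NewSubscribers int)
DECLARE @t0 datetime2 = SYSDATETIME()

-- Hypothetical: the candidate entry, wrapped up as a stored procedure
INSERT @results ([Date], Subscribers, Unsubscribes, NewSubscribers)
EXEC dbo.CandidateEntry

SELECT DATEDIFF(MILLISECOND, @t0, SYSDATETIME()) AS ElapsedMs

-- Validity check against a trusted reference result (dbo.ReferenceResults is hypothetical)
SELECT CASE WHEN EXISTS (SELECT * FROM @results EXCEPT SELECT * FROM dbo.ReferenceResults)
              OR EXISTS (SELECT * FROM dbo.ReferenceResults EXCEPT SELECT * FROM @results)
            THEN 'MISMATCH' ELSE 'OK' END AS Verdict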

At the moment I write this, William Brewer's entry (incorporating ideas from Graham, Peso and Matt) seems to be in the lead! (Saturday: it is now Peso 4e.) However, I'll be holding the competition open until Monday evening GMT, to allow for people who didn't hear that the competition was on last Monday.

Just when you thought that some potential winners were emerging, look at the results with the original 10,000-row table. Yes, quite different. (We'd have to use a C# or VB test harness to sort out the speed differences!) One would never be able to pick the fastest from such close competition!

Entry        Elapsed time in milliseconds
Graham 1     80
Peso 1       13
Peso 2       16
Peso 3       16
Matt 1       13
Matt 2       33
Gianluca 1   30
Peso 4B      16
Andriy Z     13
Graham 2     16
Brewer 2     30
Peso 1d      33
Gustavo 1    16
Gianluca 4   13
Peso 4d      16

The complete million-row test table is now at http://www.simple-talk.com/blogbits/philf/registrations.zip if you would like to fine-tune your winning entry.


Monday. Here are the final rankings. These are the simple 'elapsed time' measurements; I'll follow this up as soon as possible with the details. Matt has very kindly written a special C# test harness for accurate timings that I'm dusting off at the moment. I'm planning to do a full write-up on my SSC blog, as there are some important lessons for anyone doing this sort of reporting task. After all, there is a world of difference between the time and CPU loading of the best entrants and the ho-hums. Even the ho-hums are a lot better than some of the production code I've seen.

These timings are very tight, so in all fairness I have to also award Gustavo, as runner-up, the right to proudly display the 'Phil Factor SQL Speed Phreak' award. After all, what is 30 milliseconds in processing a million rows? Peso has to get the title, not only for the best entry (Peso 4E), but for his energy (he was on paternity leave!) and for the way he helped and advised the other entrants. A true champion. Special thanks to Matt for all the advice, and the test harness.

Peso 4E was such a good entry that it even struck terror into Celko-Prizewinner Barry, and persuaded him not to enter the competition. However, William and Matt deserve special commendation for their entries, which remain brilliantly fast over a million rows.

Entry           Elapsed time in milliseconds
Peso 4E          313
Gustavo 3        343
Gustavo 4        346
Brewer 2         423
Peso 1e          470
Peso 1           500
Gustavo 1        563
Matt 1           580
Matt 1B          593
Peso 1d          856
Gianluca Final   856
Peso 4d          856
Peso 1D          860
Matt 2           873
Matt 3           923
Peso 4B          940
Graham 2        1106
Peso 3          1156
Peso 2          1170
Andriy Z        1233 *
Gustavo/Peso    2800
Gianluca 1      4500
Graham 1        4656

Remember that the winning routines were calculating aggregate reports on a million-row table in between a third and half a second. These included calculations that you will sometimes read have to be done using a cursor!

(* in the results means the result needs minor tweaking to make it conform.)


asked Oct 11, 2009 at 06:51 PM by Phil Factor

Phil: Do we have to account for those expirations in the future or not? As you said that we could assume that every month has new subscribers, it seems to me that we do not (because those future months have no new subscribers yet)?
Oct 13, 2009 at 07:58 PM RBarryYoung
Hmm, something odd here, I'm getting results of ~30ms for Peso #3.
Oct 13, 2009 at 09:58 PM RBarryYoung
I can't really compare my results - because I didn't even run it with 10K rows - I went straight to a million... Are you doing the same, Phil?
Oct 14, 2009 at 05:58 AM Matt Whitfield ♦♦

I especially liked this competition because there was no template for a solution! And I personally would like to keep these competitions this way: real-world problems that everyone might encounter one day.

The first competition (Celko's Prime Number) already had existing algorithms, so that competition was just about adapting them to T-SQL. Not so much fun, even if it was educational. Congrats again, Barry!

The second competition (Celko's Data Warehouse) was more fun, because there was no given solution and it was a real-world problem. Congrats again, Gianluca!

What do you think?
Oct 21, 2009 at 08:59 AM Peso
Yes. I've just spoken to Richard of Red-Gate Marketing, who are our sponsors, and he's agreed to keep the prize money going. I'm tied up for the next fortnight, but I'm happy to advise on tests. (I'm taking a week off next week to stare at seal colonies through telescopes), then I'll be at PASS staring at Brent, and various MVPs, through binoculars. The competition will be in very safe hands with Peso!
Oct 21, 2009 at 10:11 AM Phil Factor

35 answers

Let's see if we can establish some kind of baseline metrics. This is the sample data I have in my test lab on my laptop:

  • AMD Turion X2 Ultra DualCore Mobile ZM-82 (2200 MHz, 2 cores)
  • 4 GB RAM
  • SQL Server 2008 R2 (version 10.50.1092)

    -- Statistics
    SELECT  MIN(DateJoined),  -- 20010101
            MAX(DateJoined),  -- 20081231
            MIN(DateLeft),    -- 20040414
            MAX(DateLeft)     -- 20090101
    FROM    dbo.Registrations -- 1,058,428 records

    -- Directly (RowCount 900, CPU 968, Duration 623, Reads 11958)
    SELECT  DATEDIFF(MONTH, 0, DateJoined),
            DATEDIFF(MONTH, 0, DateLeft),
            COUNT(*)
    FROM    dbo.Registrations
    GROUP BY DATEDIFF(MONTH, 0, DateJoined),
            DATEDIFF(MONTH, 0, DateLeft)

    -- Indirectly (RowCount 900, CPU 343, Duration 272, Reads 11958)
    SELECT  DATEDIFF(MONTH, 0, DateJoined),
            DATEDIFF(MONTH, 0, DateLeft),
            SUM(Registrations)
    FROM    (
            SELECT  DateJoined,
                    DateLeft,
                    COUNT(*) AS Registrations
            FROM    dbo.Registrations
            GROUP BY DateJoined,
                    DateLeft
            ) AS d
    GROUP BY DATEDIFF(MONTH, 0, DateJoined),
            DATEDIFF(MONTH, 0, DateLeft)

answered Oct 15, 2009 at 08:55 AM by Peso

Can you give us some more detailed guidance on how you'd like us to use this, please?
Oct 16, 2009 at 03:30 PM Phil Factor
I thought getting a number of metrics (rowcount, CPU, duration and reads) could indicate how differently each person's test lab is performing. My primary performance concern is that on my laptop the indirect approach is twice as fast as the direct approach, while on Phil's test harness the direct approach is almost three times faster than the indirect approach.
Oct 18, 2009 at 10:59 AM Peso

Couldn't resist submitting a quirky update, just to compare results.

declare @RESULT table ([month] int primary key, joined int, [left] int, total int)
declare @quirky int = 0
declare @now date = '20091001'

insert @RESULT ([month], joined, [left], total)
select  [MONTH] as datejoined
      , SUM(joined) as joined
      , SUM([left]) as [left]
      , SUM(joined - [left]) as total
from (
        select datediff(m, 0, datejoined) as [month], COUNT(*) as joined, 0 as [left]
        from   registrations WITH (TABLOCKX)
        group by datediff(m, 0, datejoined)
        union all
        select datediff(m, 0, dateleft), 0, COUNT(*)
        from   registrations WITH (TABLOCKX)
        where  dateleft is not null and dateleft < @now -- last
        group by datediff(m, 0, dateleft)
     ) aux
group by [month]

update @RESULT
set    @quirky = total = @quirky + total
output dateadd(m, inserted.[month], 0) as [date], inserted.joined, inserted.[left], inserted.total
OPTION (MAXDOP 1)
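
For readers unfamiliar with the idiom: "SET @quirky = total = @quirky + total" assigns the running sum to both the variable and the column in a single pass. It's an undocumented trick that relies on rows being processed serially in clustered-index order, hence the MAXDOP 1 hint. A minimal toy illustration (the table and values here are hypothetical, not part of the entry):

DECLARE @t TABLE (id int PRIMARY KEY, amount int, running int)
INSERT @t (id, amount, running) VALUES (1, 10, 0), (2, 5, 0), (3, 7, 0)

DECLARE @run int = 0
-- For each row, in clustered-index order: bump @run, then write it into the row
UPDATE @t
SET    @run = running = @run + amount
OPTION (MAXDOP 1)  -- the trick assumes single-threaded, in-order processing

SELECT * FROM @t   -- running = 10, 15, 22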

answered Oct 16, 2009 at 01:07 PM by Gustavo

Gustavo. Great to hear from you again. This seems to work well!
Oct 16, 2009 at 02:06 PM Phil Factor
Using Phil's million-record sample, I get a wrong result on the last record, #106.
Oct 21, 2009 at 10:01 AM Peso

Another try, since pre-aggregation has proved to be a good idea:

declare @now int = datediff(m, 0, '20091001')

create table #pre (datejoined int, dateleft int, total int)
insert #pre (datejoined, dateleft, total)
select  DATEDIFF(m, 0, DateJoined) as datejoined
      , DATEDIFF(m, 0, DateLeft) as dateleft
      , COUNT(*) as total
from    registrations
group by DATEDIFF(m, 0, DateJoined)
      , DATEDIFF(m, 0, DateLeft)

create table #registration_JOINED ([month] int primary key, joined int, [left] int, total int)
insert #registration_JOINED
select  datejoined as datejoined
      , sum(total) as joined
      , 0 as [left]
      , 0 as total
from    #pre --with (index (IDX_join))
group by datejoined

create table #registration_LEFT ([month] int primary key, [left] int)
insert #registration_LEFT
select  dateleft as dateleft
      , SUM(total) as [left]
from    #pre --with (index (IDX_left))
where   isnull(dateleft, @now) < @now
group by dateleft

declare @quirky int = 0
update #registration_JOINED
set    [left] = isnull(l.[left], 0)
     , @quirky = total = @quirky + j.joined - isnull(l.[left], 0)
output dateadd(m, inserted.[month], 0) as [date], inserted.joined, inserted.[left], inserted.total
from   #registration_JOINED j
left join #registration_LEFT l on (j.[month] = l.[month])

answered Oct 16, 2009 at 08:05 PM by Gustavo

This is very quick, and at the time it was submitted, it was the quickest, but I've now had one very slightly quicker from Peso. If you can shave a bit more off......
Oct 17, 2009 at 02:56 PM Phil Factor
Using Phil's million-record sample, I get a wrong result on the last record, #106.
Oct 21, 2009 at 10:02 AM Peso

Well, I may be out of time, but it's a last try with just a small improvement... But in my desktop tests I couldn't beat Peso's last one.

Edit: included a nonclustered index on dateleft, according to Peso's suggestion, and it helped a little bit. Thanks.

Edit2: included an indirect pre-aggregation approach, which could be faster depending on the machine it's executed on.

declare @quirky int =0
declare @now int = datediff(m,0,'20091001')

create table #pre ( datejoined int , dateleft int, total int)
insert #pre
    select	DATEDIFF(m,0,DateJoined) as datejoined
    		, DATEDIFF(m,0,DateLeft) as dateleft
    		, COUNT(*) as total
    from  registrations
    group by DATEDIFF(m,0,DateJoined)
    		, DATEDIFF(m,0,DateLeft)

-- contribution by PESO...
create index IDX_PESO on #pre( dateleft ) include(total) where dateleft is not null

create table #result( [month] int primary key, joined int, [left] int, total int )  		
insert #result
    select	[MONTH] as [month]
    		, SUM( joined ) as joined
    		, SUM( [left] ) as [left]
    		, SUM( joined - [left] ) as total
    from ( 	select	datejoined as [month], sum(total) as joined, 0 as [left]
    		from	#pre 
    		group by datejoined
    		union	all
    		select  dateleft, 0, SUM(total)
    		from	#pre WITH ( index(IDX_PESO) )
    		where dateleft < @now
    		group by dateleft
    	) aux
    group by [month]

update #RESULT set @quirky = total = @quirky + total
output dateadd(m,inserted.[month],0) as [date], inserted.joined, inserted.[left], inserted.total
OPTION (MAXDOP 1)

Using Peso's indirect approach on the pre-aggregation, I got faster results on my notebook but, as he noticed, I might get slower results on some computers:

declare @quirky int = 0
declare @now int = datediff(m, 0, '20091001')

create table #pre (datejoined int, dateleft int, total int)
insert #pre
SELECT  DATEDIFF(MONTH, 0, DateJoined),
        DATEDIFF(MONTH, 0, DateLeft),
        SUM(Registrations)
FROM (
        SELECT  DateJoined, DateLeft, COUNT(*) AS Registrations
        FROM    dbo.Registrations
        GROUP BY DateJoined, DateLeft
     ) AS d
GROUP BY DATEDIFF(MONTH, 0, DateJoined), DATEDIFF(MONTH, 0, DateLeft)

-- contribution by PESO...
create index IDX_PESO on #pre (dateleft) include (total) where dateleft is not null

create table #result ([month] int primary key, joined int, [left] int, total int)
insert #result
select  [MONTH] as [month]
      , SUM(joined) as joined
      , SUM([left]) as [left]
      , SUM(joined - [left]) as total
from (
        select datejoined as [month], sum(total) as joined, 0 as [left]
        from   #pre
        group by datejoined
        union all
        select dateleft, 0, SUM(total)
        from   #pre WITH (index (IDX_PESO))
        where  dateleft < @now
        group by dateleft
     ) aux
group by [month]

update #RESULT
set    @quirky = total = @quirky + total
output dateadd(m, inserted.[month], 0) as [date], inserted.joined, inserted.[left], inserted.total
OPTION (MAXDOP 1)

answered Oct 19, 2009 at 10:15 AM by Gustavo

Have you created a filtered index over dateleft?
create nonclustered index ix_peso on #pre (dateleft) include (total) where dateleft is not null
Oct 19, 2009 at 10:27 AM Peso
Wanna try that now, thanks...
Oct 19, 2009 at 11:14 AM Gustavo
It might not help much for such a small table as #pre, but for your other suggestions...
Oct 19, 2009 at 11:30 AM Peso
Used another of Peso's ideas and changed the pre-aggregate. I believe I can still make it a little better, but I am just tired of trying right now...
Oct 19, 2009 at 02:57 PM Gustavo
I get a wrong result on the last record using Phil's million-record table. Both #3 and #4.
Oct 21, 2009 at 09:36 AM Peso
(comments are locked)
10|1200 characters needed characters left

This is my last entry (I promise!). I can't measure whether it runs fast or slow, because on my laptop ALL the top fast queries take 3.2 seconds to run on the one-million-row table. It's strange, but it's so. It's a blind shot... I'm using a string to stuff the peopleLeft counts into, avoiding a join. Not very elegant, but it works. It runs on the assumption that nobody can unsubscribe before joining.

DECLARE @monthly_joins int
DECLARE @monthly_leaves int
DECLARE @running_users int
DECLARE @current_month char(6)
DECLARE @strLeft char(2000)
DECLARE @xchar nvarchar(max)
DECLARE @xml XML
DECLARE @now date
DECLARE @y2k date

SELECT  @xchar = '',
        @monthly_joins = 0,
        @monthly_leaves = 0,
        @current_month = NULL,
        @running_users = 0,
        @now = '20091001',
        @y2k = '2000-01-01',
        @strLeft = REPLICATE('0', 2000)

;WITH QRY (MonthJoined, MonthLeft, PeopleCount) AS (
    SELECT TOP 2147483647
           DATEDIFF(MONTH, @y2k, DateJoined) AS MonthJoined,
           DATEDIFF(MONTH, @y2k, DateLeft) AS MonthLeft,
           COUNT(*) AS PeopleCount
    FROM   dbo.Registrations
    WHERE  DateJoined < @now
    GROUP BY DATEDIFF(MONTH, @y2k, DateJoined), DATEDIFF(MONTH, @y2k, DateLeft)
    ORDER BY 1
)
SELECT @running_users = @running_users
           - CASE WHEN MonthJoined = ISNULL(@current_month, MonthJoined) THEN 0 ELSE @monthly_leaves END,
       @xchar = CASE WHEN MonthJoined = ISNULL(@current_month, MonthJoined)
                     THEN @xchar
                     ELSE @xchar + '<A MONTH="' + ISNULL(@current_month, MonthJoined)
                            + '" JOINS="' + CAST(@monthly_joins AS varchar(10))
                            + '" LEAVES="' + CAST(@monthly_leaves AS varchar(10))
                            + '" TOT="' + CAST(@running_users AS varchar(10)) + '"/>'
                END,
       @monthly_joins = CASE WHEN MonthJoined = ISNULL(@current_month, MonthJoined)
                             THEN @monthly_joins + PeopleCount
                             ELSE PeopleCount END,
       @strLeft = CASE WHEN MonthLeft IS NOT NULL
                       THEN STUFF(@strLeft, MonthLeft * 8, 8,
                                  CAST(CAST(SUBSTRING(@strLeft, MonthLeft * 8, 8) AS int) + PeopleCount AS char(8)))
                       ELSE @strLeft END,
       @monthly_leaves = ISNULL(CAST(SUBSTRING(@strLeft, MonthJoined * 8, 8) AS int), 0),
       @running_users = @running_users + PeopleCount,
       @current_month = MonthJoined
FROM   QRY AS A

SELECT @xchar = @xchar + '<A MONTH="' + @current_month
           + '" JOINS="' + CAST(@monthly_joins AS varchar(10))
           + '" LEAVES="' + CAST(@monthly_leaves AS varchar(10))
           + '" TOT="' + CAST(@running_users - @monthly_leaves AS varchar(10)) + '"/>',
       @xml = CAST('<R>' + @xchar + '</R>' AS XML)

SELECT DATEADD(MONTH, T.results.value('@MONTH[1]', 'int'), @y2k) AS [month],
       T.results.value('@JOINS[1]', 'int') AS joins,
       T.results.value('@LEAVES[1]', 'int') AS leaves,
       T.results.value('@TOT[1]', 'int') AS tot
FROM   @xml.nodes('/R/A') AS T(results)
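
The @strLeft variable above is, in effect, a poor man's integer array: slot n lives in the eight characters starting at position n * 8, written with STUFF and read back with SUBSTRING. A tiny illustration of the trick in isolation (the slot number 3 and the value 42 are arbitrary examples, not part of the entry):

DECLARE @arr char(2000) = REPLICATE('0', 2000)

-- Add 42 to "slot" 3 (the eight characters starting at position 24)
SET @arr = STUFF(@arr, 3 * 8, 8,
                 CAST(CAST(SUBSTRING(@arr, 3 * 8, 8) AS int) + 42 AS char(8)))

-- Read slot 3 back
SELECT CAST(SUBSTRING(@arr, 3 * 8, 8) AS int)  -- 42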

answered Oct 19, 2009 at 02:41 PM by Gianluca Sartori

I get a wrong result using Phil's million-record sample. Changing the code "@now = '20091001'" to "@now = '20091101'" makes it work.
Oct 21, 2009 at 10:06 AM Peso