login about faq
11
11

Phil Factor SQL Speed Phreak Competition: No 1

This competition is now over, but the winner, Peso, got an Amazon Voucher for $60, and the privilege of being able to display the 'Phil Factor SQL Speed Phreak' award on their own site

It was quite a struggle with some close competition from many of those who participated in this competition. However, Peso came up with a blindingly fast winner that produced an aggregation from a million rows in a third of a second. Now, we're all scratching our heads trying to come up with the fastest way of solving....

The ‘FIFO Stock Inventory’ SQL Problem

...on this site


(here is the original preamble.)

I really genuinely don't know the answer to this question: What is the fastest way in SQL Server (any version) to provide the following subscription report.

It is a reasonable request. We have a subscription list with 10,000 subscribers and we need to do a management report that gives the monthly breakdown of the Date, the number of current subscribers at the end of the month, the number of resignations in the month ('Unsubscribes'), and the number of new subscribers. The list should be in date order, and the date should be just the date of the first day of the month.

The table is in this form (simplified from the way we'd do it in a real system of course)

CREATE TABLE [dbo].[Registrations]
    (
     [Registration_ID] [int] IDENTITY(1, 1)
                             NOT NULL,
     [FirstName] [varchar](80) NOT NULL,
     [LastName] [varchar](80) NOT NULL,
     [DateJoined] [datetime] NOT NULL,
     [DateLeft] [datetime] NULL,
     CONSTRAINT [PK_Registrations] PRIMARY KEY CLUSTERED 
        ([DateJoined], [LastName], [FirstName])
    )
CREATE INDEX idxDateJoined 
    ON Registrations (DateJoined, DateLeft, Registration_ID)

You are welcome to change the two indexes to suit your solution. I'll give you a reasonable amount of data to try stuff out on. 10,000 faked data entries of subscribers is the most that I can reasonably ask you to download, (The list is here) but I shall be taking each solution, putting it in in a test harness consisting of a million subscribers. (I may even make it 1,169,187 to celebrate SQLServerCentral's subscription list) , and find the fastest way of doing it. I have ideas of my own of the way to do this but I suspect they're wrong.

Note that in the sample data the subscriptions that extend to sept 2010 are those people who’ve paid for a year’s subscription only rather than those which have ongoing renewals (e.g. direct Debit). ‘Now’ is the end of September 2009.

I will allow you to use a number table, but you can assume that every month has new subscribers in it. You can use views, or temporary table, but the time taken for their creation will be included in the timings. You can use a cursor if you don't mind a sharp intake of breath from Jeff. You can use any version of SQL Server that you like.

The winner will be amongst the tied fastest entrants (generally there is a group of these) and it will be the one with the highest number of votes. We'll announce the winner in a week's time on 19th October.

Who knows, if you are the only person who ever discovers this site and this competition, then the winner will be you!


OK End of day 1 and we've had some extremely good results (in milliseconds) from the original 10,000 data table
Matt                 16
Kev Riley            93
Graham               60
Peso                 33
Gianluca             16
AndyM                170
Joe Harris           15533
William brewer       406

I've had to temporarily disqualify Kev, AndyM and Joe Harris as I couldn't get their results to match the correct values. All the other results agreed. The results for Matt, Peso and Gianluca are all very close indeed, too close to get a result from 10,000 rows, so I'll have to up the test table to a million rows. From experience, I know that a lot could change between now and next Monday. I feel sure that Graham's SQL can be tweaked. and I expect to see the slow ones to be back in the running with some fine-tuning. Anyone else?


OK End of Day 2, and Peso has streaked ahead with his third version. it is now too fast to measure on 10000 rows so I shall have to move to a larger database for the performance tests. Now I've put result-checking into the test harness Gianluca has had to be temporarily disqualified until he can cure his result, which is slightly out.

Matt                 16 ms          
Kev Riley            76 ms *         
Graham               60 ms         
Peso                 30 ms         
Gianluca             33 ms *         
AndyM                170 ms *       
William brewer       406 ms        
Peso 2               16 ms         
Peso 3               0 ms           
  • Result of SQL is incorrect
    Day 3. Due to a couple of glitches I haven't been able to complete all the timings on the million-row table. Here is what I've done so far!
graham 1        6986ms  
Peso 1           890ms  
Peso 2          1173ms  
Peso 3          1170ms  
Matt 1           596ms  
Matt 2           873ms  
Gianluca 1      4546ms  
Peso 4B          940ms  
Andriy Z        1200ms *
graham 2        1656ms  


Day 5

Currently re-creating the test harness! timings on the 1 Million row tables are as follows. remember that these are preliminary timings and Peso and I are trying to scratch our heads to see why we are getting such different results. In the test harness, the results are inserted into a table variable and subsequently checked for validity.

Entry       Elapsed time in milliseconds
graham 1    6983 ms
Peso 1       890 ms
Peso 2      1186 ms
Peso 3      1173 ms
Matt 1       576 ms
Matt 2       860 ms
Gianluca 1  4550 ms
Peso 4B      936 ms
Peso 4d      830 ms
Peso 4e      313 ms
Andriy Z    1203 ms
Graham 2    1640 ms
Brewer 2     406 ms
Peso 1d     1076 ms
Gustavo 1    580 ms
Gianluca 4  2390 ms

..at the moment I write this, William Brewer's entry (incorporating ideas from Graham, Peso and Matt) seems to be in the lead! (Saturday, it is now Peso 4e) However, I'll be holding the competition open until Monday evening GMT to allow for people who didn't hear that the competition was on last Monday.

Just if you thought that some potential winners were emerging, look at the results with the original 10,000 row table. Yes, quite different. (We'd have to use a C# or VB test-harness to sort out the speed differences!) One would never be able to sort out the fastest from such close competition!

Entry       Elapsed time in milliseconds
graham 1    80 ms
Peso 1      13
Peso 2      16
Peso 3      16
Matt 1      13
Matt 2      33
Gianluca 1  30
Peso 4B     16
Andriy Z    13
Graham 2    16
Brewer 2    30
Peso 1d     33
Gustavo 1   16
Gianluca 4  13
Peso 4d     16

The complete million long test table is now on http://www.simple-talk.com/blogbits/philf/registrations.zip if you would like to fine-tune your winning entry


Monday. Here are the final rankings These are the simple 'elapsed time' measurements. I'll follow this up as soon as possible with the details. Matt has very kindly written a special C# test harness for accurate timings that I'm dusting out at the moment. I'm planning to do a full write-up on my SSC blog as there are some important lessons for anyone doing this sort of reporting task. After all, there is a world of difference between the time and CPU loading of the best entrants and the Ho-hums. Even the Ho-hums are a lot better than some of the production code I've seen.

These timings are very tight so in all fairness, I have to also award Gustavo, as runner-up, the right to proudly display the 'Phil Factor SQL Speed Phreak' award. After all, what is 30 milliseconds in processing a million rows. Peso has to get the title, not only for the best entry (Peso 4E), but for his energy (he was on paternity leave!) and for the way he helped and advised the other entrants. A true champion. Special Thanks to Matt for all the advice, and the test harness.

Peso4E was such a good entry that it even struck terror into Celko-Prizewinner Barry, and decided him not to enter the competition. However William and Matt deserve special commendation for their entries which remain brilliantly fast over a million rows.

Peso 4E       313
Gustavo 3     343
Gustavo 4     346
Brewer 2      423
Peso 1e       470
Peso 1        500
Gustavo 1     563
Matt 1        580
Matt 1B       593
Peso 1d       856
GianLuca Final 856
Peso 4d       856
Peso 1D       860
Matt 2        873
Matt3         923
Peso 4B       940
Graham 2      1106
Peso 3        1156
Peso 2        1170
Andriy Z      1233 *
Gustavo/peso  2800
Gianluca 1    4500
graham 1      4656

Remember that the winning routines were calculating aggregate reports on a million-row tables in between a third and half a second. These included calculations that you will sometimes read has to be done using a cursor!

(* in results means result needs minor tweaking to make it conform.

This question is marked "community wiki".

asked Oct 11 '09 at 21:51

Phil%20Factor's gravatar image

Phil Factor
72219

edited Nov 06 '09 at 11:21

Peso's gravatar image

Peso
887212

Phil: Do we have to account for those expirations in the future or not? As you said that we could assume that every month has new subscribers, it seems to me that we do not (because those future months have no new subscribers yet)?

(Oct 13 '09 at 22:58) RBarryYoung

Hmm, something odd here, I'm getting results of ~30ms for Peso #3.

(Oct 14 '09 at 00:58) RBarryYoung

I can't really compare my results - because I didn't even run it with 10K rows - i went straight to a million... Are you doing the same Phil?

(Oct 14 '09 at 08:58) Matt Whitfield ♦

Phil - one more thing - how are you doing timings? If it's from within a .NET app - consider using the Stopwatch class which will give much more fine-grained timings on most systems...

(Oct 14 '09 at 11:12) Matt Whitfield ♦

Barry, sorry, I didn't see your question. Yes, you can ignore the expirations in the future. these are annual subs which in real life have to have an expiration, so i put them in to make the data more realistic.

(Oct 14 '09 at 14:12) Phil Factor

Matt, I'd be happy to use a Net App for the timings as this has a better granularity for the timings. At the moment I'm finding that a million rows gives a clear enough distinction in SQL but by the end of the week...who knows? You seem to be in the lead as I write this but not my a huge amount. wait 'til Barry get's stuck in.

(Oct 14 '09 at 14:14) Phil Factor

Hmm, I doubt that I'll even be able to equat what Matt and Peso are doing, Phil. :-)

(Oct 14 '09 at 17:30) RBarryYoung

Phil - Could you give us some info on the disk and cpu that the tests are running on? Becuase on my machine #2 runs reliably 20-25% quicker than #1 - but my HDD is a raptor, and my cpu is only a 2GHz Athlon... have you some insight there?

(Oct 15 '09 at 00:28) Matt Whitfield ♦

Phil - sorry - another thought - could you post your million rows so we can test with the accurate data?

(Oct 15 '09 at 10:06) Matt Whitfield ♦

Phil - have added v1b - would it be possible to add my previous v3 to the above list (i just want to see where it is in terms of perf)

(Oct 16 '09 at 19:26) Matt Whitfield ♦

Phil, I do miss your original entry... And Matt 1b.

(Oct 19 '09 at 19:14) Peso

Good competiton guys - and Peso - CONGRATULATIONS - double whammy - winning the compo and a new addition to the family! Really happy for you!

(Oct 19 '09 at 19:27) Matt Whitfield ♦

Thank you Matt! Where can I read about the test harness you wrote? Do you write on a blog?

(Oct 19 '09 at 20:08) Peso

Peso - no I don't have a blog... yet... But I did email it to you! :)

(Oct 19 '09 at 21:56) Matt Whitfield ♦

Phil - Can you make sure you mark Peso's answer as such.

Matt - If you want a blog on SQL Server Central, let me know and I should be able to get one set up. Steve may also be interested in getting you to write an article about the test harness.

(Oct 20 '09 at 10:51) Melvyn Harbour 1 ♦♦

Phil, how often are you planning to have these fun competitions?

(Oct 20 '09 at 20:48) Peso

Re: Phil, how often are you planning to have these fun competitions?

I can't manage one for a fortnight or so due to PASS. It really needs quite a bit of attention, and running the test harness. There will DEFINITELY be one from me after then. However, it might be interesting if the winner of the previous PFSSP competition hosts the next one....

(Oct 21 '09 at 10:50) Phil Factor

I quite like the idea of a 'pass the torch' idea. Person who wins the last round contributes the next challenge. Peso?

(Oct 21 '09 at 10:56) Melvyn Harbour 1 ♦♦

I can do that! The challenge as I see it is the difference of performance labs each time. For example, I have a box with 8 CPU's and 32 GB of RAM. Another winner might have access to a dual core 4 GB machine only. And another task is to keep same level of abstraction so that Red Gate still sponsor with the $60.

(Oct 21 '09 at 11:46) Peso

Ot did you mean that the previous winner just comes up with an idea , sample data and DDL? Tests and everything else are still managed by Phil?

(Oct 21 '09 at 11:48) Peso

I'll speak to the chaps in publishing/marketing about how we keep the sponsorship going. Shouldn't be a great problem I'd have thought. I'm not sure what the best approach is regarding platform. Perhaps we should start a question for that.

(Oct 21 '09 at 11:48) Melvyn Harbour 1 ♦♦

Actually - yes, like your suggestion of keeping tests etc still controlled by Phil. I'll have a word and see if we can get that confirmed. Also means that if we're going to sponsor it regularly, there's a small degree of control over it!

(Oct 21 '09 at 11:49) Melvyn Harbour 1 ♦♦
1

I especially liked this competition due to the fact there was no template for a solution! And I personally would like to keep these competitions this way. Real world problems that everyone might encounter one day.

The first competition (Celko's Prime number) already had existing algorithms so that competition was just about adapting them to T-SQL. Not so much fun even if it was educational. Congrats again Barry!

The second competition (Celko's Data Warehouse) was more fun because there was no given solution, and was a real world problem. Congrats again Gianluca!

What do you think?

(Oct 21 '09 at 11:59) Peso
2

Yes. I've just spoken to Richard of Red-Gate Marketing, who are our sponsors, and he's agreed to keep the prize money going. I'm tied up for the next fortnight, but I'm happy to advise on tests. (I'm taking a week off next week to stare at seal colonies through telescopes), then I'll be at PASS staring at Brent, and various MVPs, through binoculars. The competition will be in very safe hands with Peso!

(Oct 21 '09 at 13:11) Phil Factor

Great news this kind of competition will happen on a frequent basis, they are really interesting and its possible that every week more and more new guys ( like myself ) would come up and participate.

BTW, i really enjoyed the Award Winner Logo...

(Oct 21 '09 at 18:46) Gustavo - Bart

I don't know a weekly competition is that good. It takes time to come up with an interesting idea, create proper sample data and test the suggestions. I believe a monthly competition is ok.

(Oct 21 '09 at 21:15) Peso

Yes, It all takes a bit of time to do properly, and I still have not finished the write-up and Matt's C# tests. Once a fortnight sounds about right.

(Oct 22 '09 at 16:25) Phil Factor

I agree, once a fortnight or once monthly is good. And I think there should be a good defined measure of how a winner is decided (i.e. 'fastest average over 10 runs' or something similar).

M

(Oct 23 '09 at 16:06) Matt Whitfield ♦

Phil: I also suggest that you add a special tag just for these challenges. Something like "Speed-Phreak".

(Oct 23 '09 at 17:50) RBarryYoung
showing 5 of 29 show all

/*******************************************************************************
    Peso 4d - 20091015
*******************************************************************************/

-- Step 1 - Create an intermediate staging table
-- Table variable will not do (even with no logging) because table variables
-- cannot benefit from parallelism. I am using a temp table for the 103 months
-- or records between Mar 2001 and Sep 2009, with each record is 14 bytes
CREATE TABLE    #Stage
        (
            theMonth SMALLINT NOT NULL,
            PeopleJoined INT NOT NULL,
            PeopleLeft INT NOT NULL,
            Subscribers INT NOT NULL
        )

-- Step 2 - Populate the staging table
INSERT      #Stage
        (
            theMonth,
            PeopleJoined,
            PeopleLeft,
            Subscribers
        )
SELECT      u.theMonth,
            -- Old school pivoting is slightly more efficient than PIVOT
            -- It also gives us the ability to return 0 instead of NULL
        SUM(CASE WHEN u.theCol = 'DateJoined' THEN u.Registrations ELSE 0 END) AS PeopleJoined,
        SUM(CASE WHEN u.theCol = 'DateLeft' THEN u.Registrations ELSE 0 END) AS PeopleLeft,
        0 AS Subscribers
FROM        (
            -- Do the full aggregation with final key before PIVOT
            SELECT      DATEDIFF(MONTH, 0, DateJoined) AS DateJoined,
                    DATEDIFF(MONTH, 0, DateLeft) AS DateLeft,
                    SUM(Registrations) AS Registrations
            FROM        (
                            -- Do some heavy-lifting pre-aggregation
                            -- It is better to UNPIVOT about 3,450 records (number of days since March 2001, plus 10% drop-offs)
                            -- than 1,169,187 records directly (average of 373 registrations per day)
                        SELECT      DateJoined,
                                DateLeft,
                                COUNT(*) AS Registrations
                        FROM        dbo.Registrations
                        GROUP BY    DateJoined,
                                DateLeft
                    ) AS d
            GROUP BY    DATEDIFF(MONTH, 0, DateJoined),
                    DATEDIFF(MONTH, 0, DateLeft)
        ) AS d
UNPIVOT     (
            theMonth
            FOR theCol IN (d.DateJoined, d.DateLeft)
        ) AS u
GROUP BY    u.theMonth
        -- Exclude those records for months not having any new subscribers
HAVING      SUM(CASE WHEN u.theCol = 'DateJoined' THEN u.Registrations ELSE 0 END) > 0

-- Prepare running total
DECLARE @Subscribers INT = 0

-- Set up and prepare an ordered update CTE
;WITH Yak (theMonth, PeopleJoined, PeopleLeft, Subscribers)
AS (
    SELECT      TOP 2147483647
            DATEADD(MONTH, theMonth, 0) AS theMonth,
            PeopleJoined,
            PeopleLeft,
            Subscribers
    FROM        #Stage
    ORDER BY    theMonth
)

-- Step 3 - Do both the running total and the result output
UPDATE  Yak
SET @Subscribers = Subscribers = @Subscribers + PeopleJoined - PeopleLeft
OUTPUT  inserted.theMonth,
    inserted.PeopleJoined,
    inserted.PeopleLeft,
    inserted.Subscribers


/*******************************************************************************
    Peso 4e - 20091017
*******************************************************************************/
CREATE TABLE    #Stage
        (
            theMonth SMALLINT NOT NULL,
            PeopleJoined INT NOT NULL,
            PeopleLeft INT NOT NULL,
            Subscribers INT NOT NULL
        )

INSERT      #Stage
        (
            theMonth,
            PeopleJoined,
            PeopleLeft,
            Subscribers
        )
SELECT      u.theMonth,
        SUM(CASE WHEN u.theCol = 'DateJoined' THEN u.Registrations ELSE 0 END) AS PeopleJoined,
        SUM(CASE WHEN u.theCol = 'DateLeft' THEN u.Registrations ELSE 0 END) AS PeopleLeft,
        0 AS Subscribers
FROM        (
            SELECT      DATEDIFF(MONTH, 0, DateJoined) AS DateJoined,
                    DATEDIFF(MONTH, 0, DateLeft) AS DateLeft,
                    COUNT(*) AS Registrations
            FROM        dbo.Registrations2
            GROUP BY    DATEDIFF(MONTH, 0, DateJoined),
                    DATEDIFF(MONTH, 0, DateLeft)
        ) AS d
UNPIVOT     (
            theMonth
            FOR theCol IN (d.DateJoined, d.DateLeft)
        ) AS u
GROUP BY    u.theMonth
HAVING      SUM(CASE WHEN u.theCol = 'DateJoined' THEN u.Registrations ELSE 0 END) > 0

DECLARE @Subscribers INT = 0

;WITH Yak (theMonth, PeopleJoined, PeopleLeft, Subscribers)
AS (
    SELECT      TOP 2147483647
            DATEADD(MONTH, theMonth, 0) AS theMonth,
            PeopleJoined,
            PeopleLeft,
            Subscribers
    FROM        #Stage
    ORDER BY    theMonth
)

UPDATE  Yak
SET @Subscribers = Subscribers = @Subscribers + PeopleJoined - PeopleLeft
OUTPUT  inserted.theMonth,
    inserted.PeopleJoined,
    inserted.PeopleLeft,
    inserted.Subscribers

answered Oct 13 '09 at 20:52

Peso's gravatar image

Peso
887212

edited Oct 17 '09 at 20:44

Edit the CTE and add this WHERE clause

WHERE PeopleJoined > 0

To only get those months where people have joined.

(Oct 14 '09 at 13:46) Peso

Peso, it is great to have the explanation in the comments. I loved the nice example of a use for the OUTPUT Clause

(Oct 14 '09 at 13:49) Phil Factor

It is giving months in the future with me

(Oct 14 '09 at 13:50) Phil Factor

Actually, with the peopleJoined>0 in place it would work because I've already said that there is no month without anyone joining.

(Oct 14 '09 at 13:53) Phil Factor

Un-freakin-believable Peso! This proc is incredibly fast.

(Oct 14 '09 at 14:24) RBarryYoung

FYI: you should be able to add comments now.

(Oct 14 '09 at 14:25) RBarryYoung

Yes! I have more than 50 points now... Wohoo!

(Oct 14 '09 at 15:09) Peso

The "WHERE PeopleJoined > 0" is there to filter out the "future leaving people".

(Oct 14 '09 at 15:10) Peso

Peso 4d is now 30% faster than original Peso IV because of parallelism enabled with temp table instead of table variable. Peso 4d also has 10% less reads than original Peso IV.

(Oct 15 '09 at 22:09) Peso

Thank you Barry!

(Feb 01 at 09:46) Peso
showing 5 of 10 show all

Here's my answer....

WITH MonthlySummaries([Date], joined, [left])--a table collected from two aggregations, giving us
      AS (SELECT    joined.[date],--those who joined and those who left in each month
                    joined.joined,
                    COALESCE([left].[left], 0)
          FROM      (SELECT CAST('01 '+LEFT(DATENAME(month, DateJoined), 3)
                            +' '+DATENAME(year, DateJoined) AS DATETIME) AS [Date],
                            COUNT(*) AS [joined]
                     FROM   Registrations
                     GROUP BY CAST('01 '+LEFT(DATENAME(month, DateJoined),
                                              3)+' '+DATENAME(year, DateJoined) AS DATETIME)
                    ) joined
                    LEFT OUTER JOIN (SELECT COUNT(*) AS [left],
                                            [date] = CAST('01 '
                                            +LEFT(DATENAME(month, Dateleft),
                                                  3)+' '+DATENAME(year, Dateleft) AS DATETIME)
                                     FROM   Registrations
                                     GROUP BY CAST('01 '
                                            +LEFT(DATENAME(month, Dateleft),
                                                  3)+' '+DATENAME(year, Dateleft) AS DATETIME)
                                    ) [left] ON joined.date=[left].date
         )
SELECT  MonthlySummaries.date,
        [joined],
        [left],
        c.subscribers
FROM    MonthlySummaries
        INNER JOIN (SELECT  a.date,
                            SUM(b.joined)-SUM(b.[left]) AS subscribers
                    FROM    MonthlySummaries a
                            INNER JOIN Monthlysummaries b ON b.date<=a.date
                    GROUP BY a.date
                   ) c ON c.date=MonthlySummaries.date
ORDER BY monthlySummaries.date

answered Oct 12 '09 at 13:24

Graham's gravatar image

Graham
411

Graham, thanks a lot for that. Very neat this one, Very neat....

(Oct 12 '09 at 13:54) Phil Factor

Graham has done a second one which, on a million rows, is currently in the fastest group!

(Oct 14 '09 at 14:02) Phil Factor

William has now done a new version incorporating some ideas from Matt and Peso. William's version seem the quickest on my harness but we'll have to wait to see!

(Oct 16 '09 at 18:26) Phil Factor

Here is my attempt... Bearing in mind that I have only had 1 coffee so far today, this may yet get revised:

-- Matt 1
DECLARE @TempData TABLE (Joined [smallint], Unsubscribed [smallint], Num int)
INSERT INTO @TempData (Joined, Unsubscribed, Num)
SELECT Joined, Unsubscribed, COUNT(*) FROM
 (SELECT DATEDIFF(m, 0, [DateJoined]) AS Joined,
         DATEDIFF(m, 0, [DateLeft]) AS Unsubscribed
    FROM registrations) idat
GROUP BY [Joined], [Unsubscribed]

CREATE TABLE #TempMonths (MonthNumber [smallint], Joined [int], Unsubscribed [int], JoinedToDate [int], UnsubscribedToDate [int])
CREATE CLUSTERED INDEX [IX_TempMonths_1] ON #TempMonths (MonthNumber)
INSERT INTO #TempMonths (MonthNumber, Joined)
SELECT Joined, SUM(Num) FROM @TempData
GROUP BY Joined

UPDATE #TempMonths SET Unsubscribed = idat.UnsubscribedCount
FROM #TempMonths tm INNER JOIN
     (SELECT Unsubscribed AS UnsubscribedMonthNumber, SUM(Num) AS UnsubscribedCount FROM @TempData GROUP BY Unsubscribed) idat
  ON [idat].[UnsubscribedMonthNumber] = tm.MonthNumber

DECLARE @PrevJoined [int], @PrevUnsubscribed [int]
SET @PrevJoined = 0
SET @PrevUnsubscribed = 0

UPDATE #TempMonths
   SET @PrevJoined = JoinedToDate = @PrevJoined + Joined,
       @PrevUnsubscribed = UnsubscribedToDate = @PrevUnsubscribed + ISNULL(Unsubscribed, 0)
  FROM #TempMonths WITH(INDEX(0),TABLOCKX)
OPTION (MAXDOP 1)

SELECT DATEADD(m, MonthNumber, 0) AS MonthDate, 
       Joined AS JoinedThisMonth, 
       ISNULL(Unsubscribed, 0) AS UnsubscribedThisMonth, 
       JoinedToDate - UnsubscribedToDate AS EndOfMonthSubscribers 
  FROM #TempMonths 
 ORDER BY MonthDate

Here's version 2 (+2 cups of coffee)

-- Matt 2
CREATE TABLE #TempData (
  Joined       SMALLINT,
  Unsubscribed SMALLINT,
  Num          INT     
);

INSERT INTO #TempData (Joined, Unsubscribed, Num)
SELECT   Joined, Unsubscribed, SUM(Num)
FROM     (SELECT DATEDIFF(m, 0, [DateJoined]) AS Joined, DATEDIFF(m, 0, [DateLeft]) AS Unsubscribed, [Num]
          FROM   (SELECT   [DateJoined], [DateLeft], COUNT(*) AS Num
                  FROM     registrations WITH (TABLOCKX)
                  GROUP BY DateJoined, DateLeft) AS innerdat) AS idat
GROUP BY [Joined], [Unsubscribed];

CREATE TABLE #TempMonths (
  MonthNumber        SMALLINT,
  Joined             INT     ,
  Unsubscribed       INT     ,
  JoinedToDate       INT     ,
  UnsubscribedToDate INT     
);

CREATE CLUSTERED INDEX [IX_TempMonths_1]
  ON #TempMonths(MonthNumber);

INSERT INTO #TempMonths (MonthNumber, Joined, Unsubscribed)
SELECT td1.Joined, td1.Num, td2.Num
FROM   (SELECT   Joined, SUM(Num) AS Num
        FROM     #TempData
        GROUP BY Joined) AS td1
       LEFT OUTER JOIN
       (SELECT   Unsubscribed, SUM(Num) AS Num
        FROM     #TempData
        GROUP BY Unsubscribed) AS td2
       ON td2.Unsubscribed = td1.Joined;

DECLARE @PrevJoined AS INT, @PrevUnsubscribed AS INT;

SET @PrevJoined = 0;

SET @PrevUnsubscribed = 0;

UPDATE #TempMonths
SET    @PrevJoined       = JoinedToDate       = @PrevJoined + Joined,
       @PrevUnsubscribed = UnsubscribedToDate = @PrevUnsubscribed + ISNULL(Unsubscribed, 0)
FROM   #TempMonths WITH (INDEX (0), TABLOCKX)
OPTION (MAXDOP 1);

SELECT DATEADD(m, MonthNumber, 0) AS MonthDate, Joined AS JoinedThisMonth, ISNULL(Unsubscribed, 0) AS UnsubscribedThisMonth, JoinedToDate - UnsubscribedToDate AS EndOfMonthSubscribers
FROM   #TempMonths;

Here is version 3 - which is faster still on my machine, although only a slight tweak. Interesting to see the results Phil puts out given that 2 is slower on his rig than 1 is.

-- Matt 3
CREATE TABLE #TempData (
  Joined       SMALLINT,
  Unsubscribed SMALLINT,
  Num          INT     
);
CREATE CLUSTERED INDEX [IX_TempData_1]
  ON #TempData(Joined);
CREATE INDEX [IX_TempData_2]
  ON #TempData(Unsubscribed);

INSERT INTO #TempData (Joined, Unsubscribed, Num)
SELECT   Joined, Unsubscribed, SUM(Num)
FROM     (SELECT DATEDIFF(m, 0, [DateJoined]) AS Joined, DATEDIFF(m, 0, [DateLeft]) AS Unsubscribed, [Num]
          FROM   (SELECT   [DateJoined], [DateLeft], COUNT(*) AS Num
                  FROM     registrations WITH (TABLOCKX)
                  GROUP BY DateJoined, DateLeft) AS innerdat) AS idat
GROUP BY [Joined], [Unsubscribed];

CREATE TABLE #TempMonths (
  MonthNumber        SMALLINT,
  Joined             INT     ,
  Unsubscribed       INT     ,
  JoinedToDate       INT     ,
  UnsubscribedToDate INT     
);

CREATE CLUSTERED INDEX [IX_TempMonths_1]
  ON #TempMonths(MonthNumber);

INSERT INTO #TempMonths (MonthNumber, Joined, Unsubscribed)
SELECT td1.Joined, td1.Num, td2.Num
FROM   (SELECT   Joined, SUM(Num) AS Num
        FROM     #TempData
        GROUP BY Joined) AS td1
       LEFT OUTER JOIN
       (SELECT   Unsubscribed, SUM(Num) AS Num
        FROM     #TempData
        GROUP BY Unsubscribed) AS td2
       ON td2.Unsubscribed = td1.Joined;

DECLARE @PrevJoined AS INT, @PrevUnsubscribed AS INT;

SET @PrevJoined = 0;

SET @PrevUnsubscribed = 0;

UPDATE #TempMonths
SET    @PrevJoined       = JoinedToDate       = @PrevJoined + Joined,
       @PrevUnsubscribed = UnsubscribedToDate = @PrevUnsubscribed + ISNULL(Unsubscribed, 0)
FROM   #TempMonths WITH (INDEX (0), TABLOCKX)
OPTION (MAXDOP 1);

SELECT DATEADD(m, MonthNumber, 0) AS MonthDate, Joined AS JoinedThisMonth, ISNULL(Unsubscribed, 0) AS UnsubscribedThisMonth, JoinedToDate - UnsubscribedToDate AS EndOfMonthSubscribers
FROM   #TempMonths;

Ok this is my final one:

-- Matt 1b
DECLARE @TempData TABLE (Joined [smallint], Unsubscribed [smallint], Num int, UNIQUE (Joined, Unsubscribed))
INSERT INTO @TempData (Joined, Unsubscribed, Num)
SELECT Joined, Unsubscribed, COUNT(*) FROM
 (SELECT DATEDIFF(m, 0, [DateJoined]) AS Joined,
         DATEDIFF(m, 0, [DateLeft]) AS Unsubscribed
    FROM registrations WITH (TABLOCKX)) idat
GROUP BY [Joined], [Unsubscribed]

CREATE TABLE #TempMonths (MonthNumber [smallint], Joined [int], Unsubscribed [int], JoinedToDate [int], UnsubscribedToDate [int])
CREATE CLUSTERED INDEX [IX_TempMonths_1] ON #TempMonths (MonthNumber)

INSERT INTO #TempMonths (MonthNumber, Joined, Unsubscribed)
SELECT td1.Joined, td1.Num, td2.Num
FROM   (SELECT Joined, SUM(Num) AS Num FROM @TempData
        GROUP BY Joined) AS td1
       LEFT OUTER JOIN
       (SELECT Unsubscribed, SUM(Num) AS Num FROM @TempData
        GROUP BY Unsubscribed) AS td2
       ON td2.Unsubscribed = td1.Joined;

DECLARE @PrevJoined [int], @PrevUnsubscribed [int]
SET @PrevJoined = 0
SET @PrevUnsubscribed = 0

UPDATE #TempMonths
   SET @PrevJoined = JoinedToDate = @PrevJoined + Joined,
       @PrevUnsubscribed = UnsubscribedToDate = @PrevUnsubscribed + ISNULL(Unsubscribed, 0)
  FROM #TempMonths WITH(INDEX(0),TABLOCKX)
OPTION (MAXDOP 1)

SELECT DATEADD(m, MonthNumber, 0) AS MonthDate, 
       Joined AS JoinedThisMonth, 
       ISNULL(Unsubscribed, 0) AS UnsubscribedThisMonth, 
       JoinedToDate - UnsubscribedToDate AS EndOfMonthSubscribers 
  FROM #TempMonths

answered Oct 12 '09 at 09:27

Matt%20Whitfield's gravatar image

Matt Whitfield ♦
161311515

edited Oct 16 '09 at 19:25

Please drink another one and try to shave a few milliseconds off it and you could have a winner, Matt. C'mon!

(Oct 13 '09 at 18:16) Phil Factor

I drank another two! New version below in new answer :)

(Oct 13 '09 at 21:00) Matt Whitfield ♦

Dan,g this thing is good, Matt. Similar to what I put together, only much better. (which is why I didn't post mine... ;-) ).

(Oct 13 '09 at 22:46) RBarryYoung

Barry - Thanks - that means a lot to me coming from you. I posted another version below which (at least on my machine) runs about 20% faster... Thanks again

(Oct 13 '09 at 23:19) Matt Whitfield ♦

Yes, this looks to be very fast. I think Barry will come in with a good solution soon

(Oct 14 '09 at 14:06) Phil Factor

I doubt it, Matt and Peso's solution are incredibly fast, much faster than I thought was possible!

(Oct 14 '09 at 17:33) RBarryYoung

I have a similar machine that Phil has (dual core, 4gb, sql server 2008 r2) on a laptop, so the disks aren't that good. My IV(b) is there twice as fast as Matt2. Go figure :-)

I have a duration of 95 ms for my IV(b), and 200 ms for Matt 2.

(Oct 15 '09 at 05:25) Peso

That's always the way when it gets down to it, isn't it?? :)

(Oct 15 '09 at 10:07) Matt Whitfield ♦

Maybe the difference in timings are server settings? I have enabled "Date correlation" option.

(Oct 15 '09 at 19:30) Peso

Actually, I think that I have "Date correlation" on too!

(Oct 15 '09 at 23:48) RBarryYoung

Also, I am concerned about Phil's "table variable unit testing" approach. It seems when inserting into a table variable parallelism isn't an option. See http://sqlblog.com/blogs/peter_larsson/archive/2009/10/15/performance-consideration-when-using-a-table-variable.aspx

(Oct 16 '09 at 21:11) Peso
showing 5 of 11 show all

I'm using the "Quirky XML Select" for this:

DECLARE @monthly_joins int
DECLARE @monthly_leaves int
DECLARE @running_users int
DECLARE @current_month date
DECLARE @xchar nvarchar(max)
DECLARE @xml XML
DECLARE @now date

SELECT @xchar = '', @monthly_joins = 0, @monthly_leaves = 0, @current_month = NULL, @running_users = 0, @now = '20091001'

SELECT 
    @xchar = 
        CASE 
            WHEN eventDate = ISNULL(@current_month, eventDate) 
                THEN @xchar 
            ELSE 
                CASE
                    WHEN @current_month < @now 
                        THEN @xchar + '<A MONTH="'+ CONVERT(char(8),ISNULL(@current_month,eventDate),112) +'" JOINS="'+ CAST(@monthly_joins AS varchar(10)) +'" LEAVES="'+ CAST(@monthly_leaves AS varchar(10)) +'" TOT="'+ CAST(@running_users AS varchar(10)) +'"/>' 
                        ELSE @xchar
                    END
        END,
    @monthly_joins = 
        CASE 
            WHEN eventDate = ISNULL(@current_month,eventDate) 
                THEN @monthly_joins + JOINS 
            ELSE JOINS 
        END,
    @monthly_leaves = 
        CASE 
            WHEN eventDate = ISNULL(@current_month,eventDate) 
                THEN @monthly_leaves + LEAVES
            ELSE LEAVES 
        END,
    @running_users = @running_users + JOINS - LEAVES,
    @current_month = eventDate
FROM (
        SELECT CAST(CONVERT(char(6), DateJoined, 112) + '01' AS DATE) AS eventDate, JOINS = 1, LEAVES = 0
        FROM registrations
        UNION ALL
        SELECT CAST(CONVERT(char(6), DateLeft, 112) + '01' AS DATE) AS eventDate, JOINS = 0, LEAVES = CASE WHEN DateLeft IS NOT NULL THEN 1 ELSE 0 END
        FROM registrations
) AS Data
ORDER BY eventDate

IF @current_month < @now
    SELECT @xchar = @xchar + '<A MONTH="'+ CONVERT(char(8), @current_month, 112) +'" JOINS="'+ CAST(@monthly_joins AS varchar(10)) +'" LEAVES="'+ CAST(@monthly_leaves AS varchar(10)) +'" TOT="'+ CAST(@running_users AS varchar(10)) +'"/>'
SELECT @xml = CAST('<R>' + @xchar + '</R>' AS XML)

SELECT DISTINCT CONVERT(date, T.results.value('@MONTH[1]','char(6)') + '01', 112) as [month],
    T.results.value('@JOINS[1]','int') as joins,
    T.results.value('@LEAVES[1]','int') as leaves,
    T.results.value('@TOT[1]','int') as tot
FROM @xml.nodes('/R/A') AS T(results)

There's still some index needed, but I'm going with this for the moment. I'll be back with indexes maybe.

answered Oct 12 '09 at 14:03

Gianluca%20Sartori's gravatar image

Gianluca Sartori
1366

edited Oct 14 '09 at 08:08

Gianluca, It is an extraordinarily clever solution but unfortunately the result is not quite right. It is strange but it is out by only a small amount, but I'm afraid it is enough to disqualify you until you can do a fix.... Also, while you are about it, could you shave a few milliseconds off it as Peso is a nose ahead now!

(Oct 13 '09 at 18:14) Phil Factor

I would be glad to take acklowledgement for this technique, but it's Barry Young that came up with this. I tried to do the same in the past with the SELECT ... UNION ALL, but it doesn't perform the same! I'll take the time to fix my SQL, but I'm not sure that Peso can be beaten this time. He's an incredibly clever SQL developer and he can find unconventional solutions nobody else would have thought of!

(Oct 14 '09 at 07:37) Gianluca Sartori

Ok, I found what was wrong and I fixed it. I'm not very happy with this solution, so I'll try to find something faster.

(Oct 14 '09 at 08:04) Gianluca Sartori

I get correct result after changing "@now = '20091001'" to "@now = '20091101'"

(Oct 21 '09 at 12:48) Peso

Here is my second attempt without a "quirky" update

-- Peso 2
DECLARE @Stage TABLE
    (
        theMonth DATE PRIMARY KEY CLUSTERED,
        PeopleJoined INT NOT NULL,
        PeopleLeft INT NOT NULL,
        Subscribers INT NOT NULL
    )

INSERT      @Stage
        (
            theMonth,
            PeopleJoined,
            PeopleLeft,
            Subscribers
        )
SELECT      p.theMonth,
        p.DateJoined AS PeopleJoined,
        COALESCE(p.DateLeft, 0) AS PeopleLeft,
        0 AS Subscribers
FROM        (
            SELECT      DATEADD(MONTH, DATEDIFF(MONTH, 0, u.theDate), 0) AS theMonth,
                    u.theCol,
                    COUNT(*) AS Items
            FROM        dbo.Registrations AS r
            UNPIVOT     (
                        theDate
                        FOR theCol IN (r.DateJoined, r.DateLeft)
                    ) AS u
            GROUP BY    DATEADD(MONTH, DATEDIFF(MONTH, 0, u.theDate), 0),
                    u.theCol
        ) AS d
PIVOT       (
            MAX(d.Items)
            FOR theCol IN ([DateJoined], [DateLeft])
        ) AS p
WHERE       p.DateJoined IS NOT NULL
ORDER BY    p.theMonth

UPDATE  x
SET x.Subscribers = (SELECT SUM(w.PeopleJoined) - SUM(w.PeopleLeft) FROM @Stage AS w WHERE w.theMonth <= x.theMonth)
FROM    @Stage AS x

SELECT      theMonth,
        PeopleJoined,
        PeopleLeft,
        Subscribers
FROM        @Stage
ORDER BY    theMonth

Here is my third attempt, with a "quirky" update

-- Peso 3 "Quirky update"
DECLARE @Stage TABLE
    (
        theMonth DATE PRIMARY KEY CLUSTERED,
        PeopleJoined INT NOT NULL,
        PeopleLeft INT NOT NULL,
        Subscribers INT NOT NULL
    )

INSERT      @Stage
        (
            theMonth,
            PeopleJoined,
            PeopleLeft,
            Subscribers
        )
SELECT      p.theMonth,
        p.DateJoined AS PeopleJoined,
        COALESCE(p.DateLeft, 0) AS PeopleLeft,
        0 AS Subscribers
FROM        (
            SELECT      DATEADD(MONTH, DATEDIFF(MONTH, 0, u.theDate), 0) AS theMonth,
                    u.theCol,
                    COUNT(*) AS Items
            FROM        dbo.Registrations AS r
            UNPIVOT     (
                        theDate
                        FOR theCol IN (r.DateJoined, r.DateLeft)
                    ) AS u
            GROUP BY    DATEADD(MONTH, DATEDIFF(MONTH, 0, u.theDate), 0),
                    u.theCol
        ) AS d
PIVOT       (
            MAX(d.Items)
            FOR theCol IN ([DateJoined], [DateLeft])
        ) AS p
WHERE       p.DateJoined IS NOT NULL
ORDER BY    p.theMonth

DECLARE @Subscribers INT = 0

UPDATE  @Stage
SET @Subscribers = Subscribers = @Subscribers + PeopleJoined - PeopleLeft

SELECT      theMonth,
        PeopleJoined,
        PeopleLeft,
        Subscribers
FROM        @Stage
ORDER BY    theMonth

I don't think a quirky update is necessary here. Every month (every record) is 15 bytes (Date 3 bytes, Join 4 bytes, Leave 4 bytes and Subscribers 4 bytes; all non null).

Since a page is the minimum amount of data read by SQL Server anyway, you can have about 44 years (and 9 months) in a single page. The "triangular join" update will be sufficient for those records (537 records for 44.78 years of data).

SQLServerCentral started in March 2001, so there can be only 103 records (102 months since March 2001 to September 2009). A triangular join update for 103 records is really nothing.

answered Oct 12 '09 at 20:59

Peso's gravatar image

Peso
887212

edited Oct 13 '09 at 07:21

Peso, this goes so fast I can't measure it with 10,000 rows. It just says it took no time at all. I'm going to have to upgrade to a million row table.

(Oct 13 '09 at 16:26) Phil Factor

You so need to send me your specs of your test lab so I can replicate it here...

(Oct 13 '09 at 18:18) Peso

You have plans to reveal the ideas you wrote about initially?

(Oct 13 '09 at 18:58) Peso

I've got my own code in the harness too but it is slower than yours at the moment. It isn't a 'quirky': I didn't think the Quirky Update would work so fast as it has.

(Oct 13 '09 at 20:55) Phil Factor

I was able to solve it without the Quirky Update also, Peso (and linear time too), it's just not as fast. I had to spend too much effort handling those stupid months with no Unsubscribed count, grrrr...

(Oct 13 '09 at 22:48) RBarryYoung

The requirements says you can assume every month has at least one new subscriber.

(Oct 13 '09 at 23:09) Peso
showing 5 of 6 show all

Oh well, I'll throw in a bone too. Phil, are you continously going to update us of the timings on your machine? This code allows you to enter any month to see what the status was then. The code also includes months were there are no new subscribers, just in case. And no quirky update of any sort.

DECLARE @Viewpoint INT

SET     @ViewPoint = DATEDIFF(MONTH, 0, GETDATE())

SELECT      DATEADD(MONTH, v.Number + d. Joined, 0) AS theMonth,
        SUM(CASE WHEN v.Number = 0 THEN d.NewSubscribers ELSE 0 END) AS NewSubscribes,
        SUM(CASE v.Number + d. Joined WHEN d.Leave THEN d.NewSubscribers ELSE 0 END) AS Unsubscribes,
        SUM(d.NewSubscribers) - SUM(CASE v.Number + d.Joined WHEN d.Leave THEN d.NewSubscribers ELSE 0 END) AS CurrentSubscribers
FROM        (
            SELECT      DATEDIFF(MONTH, 0, DateJoined) AS Joined,
                        CASE
                            WHEN DATEDIFF(MONTH, 0, DateLeft) < @ViewPoint THEN DATEDIFF(MONTH, 0, DateLeft)
                            ELSE @ViewPoint
                        END AS Leave,
                        COUNT(*) AS NewSubscribers
            FROM        dbo.Registrations
            WHERE       DateJoined < DATEADD(MONTH, @ViewPoint, 0)
            GROUP BY    DATEDIFF(MONTH, 0, DateJoined),
                        CASE
                            WHEN DATEDIFF(MONTH, 0, DateLeft) < @ViewPoint THEN DATEDIFF(MONTH, 0, DateLeft)
                            ELSE @ViewPoint
                        END
        ) AS d
INNER JOIN  master..spt_values AS v ON v.Type = 'P'
            AND v.Number <= d.Leave - d.Joined
WHERE       v.Number + d.Joined < @Viewpoint
GROUP BY    v.Number + d.Joined
ORDER BY    v.Number + d.Joined

Thank you for fixing the formatting!

answered Oct 12 '09 at 13:29

Peso's gravatar image

Peso
887212

edited Oct 12 '09 at 19:48

Yes, I'll pop it all in the test-harness tonight and see what gives.

(Oct 12 '09 at 13:53) Phil Factor

I get correct result on Phil's million record samples after changing "DATEDIFF(MONTH, 0, GETDATE())" to "DATEDIFF(MONTH, -1, GETDATE())"

(Oct 21 '09 at 12:45) Peso

I almost didn't post this when I saw how similar it is to some of the other answers. It's pretty fast and really simple (I thought) so here it goes...

DECLARE @Count INT
DECLARE @data TABLE
(
    RptDate DATETIME PRIMARY KEY,
    JoinCount INT,
    LeaveCount INT,
    SubscriberCount INT
)

SET @Count = 0

INSERT @data
(
    RptDate,
    JoinCount,
    LeaveCount
)
SELECT
    joins.JoinMonth AS RptDate,
    joins.JoinCount,
    ISNULL(leaves.LeaveCount, 0)
FROM
    (
    SELECT
        r.DateJoined + 1 - DATEPART(dd, r.DateJoined) AS JoinMonth,
        COUNT(*) AS JoinCount
    FROM
        dbo.Registrations r
    GROUP BY
        r.DateJoined + 1 - DATEPART(dd, r.DateJoined)
    ) AS joins
LEFT OUTER JOIN
    (
    SELECT
        r.DateLeft + 1 - DATEPART(dd, r.DateLeft) AS LeaveMonth,
        COUNT(*) AS LeaveCount
    FROM
        dbo.Registrations r
    WHERE
        r.DateLeft IS NOT NULL
    GROUP BY
        r.DateLeft + 1 - DATEPART(dd, r.DateLeft)
    ) AS leaves
ON
    joins.JoinMonth = leaves.LeaveMonth

UPDATE
    d
SET
    @Count = d.SubscriberCount = d.JoinCount - d.LeaveCount + @Count
FROM
    @data d

SELECT
    d.RptDate,
    d.JoinCount,
    d.LeaveCount,
    d.SubscriberCount
FROM
    @data d

answered Oct 12 '09 at 15:45

AndyM's gravatar image

AndyM
213

I get wrong results with this one. The code "r.DateJoined + 1 - DATEPART(dd, r.DateJoined)" assumes DateJoined is set to 00:00:00.000 for time part.

(Oct 21 '09 at 12:40) Peso

Ideally solution would use window aggregates with a RANGE partition to calculate all of the aggregates in a single pass over the data. Unfortunately this is not currently supported in SQL Server…

In the following solution temp tables could be subqueries but subqueries cause SQL Server to choke. The speed of this solution is dependent on the performance of the TEMP database. A nearly full or badly fragmented TEMP db file will slow things down.

--Create temp table of months, assuming data contains all possible months...
SELECT DISTINCT ReportMonth
  INTO #months
  FROM (
SELECT DISTINCT DATEADD(d, -(DatePart(DAY,DateJoined)-1), DateJoined) ReportMonth FROM  Registrations --Adds
 UNION
SELECT DISTINCT DATEADD(d, -(DatePart(DAY,DateLeft)-1), DateLeft) ReportMonth FROM  Registrations --Drops
 ) a
 WHERE ReportMonth IS NOT NULL

--Create a temp table for Adds, summarised by month
SELECT DATEADD(d, -(DatePart(DAY,DateJoined)-1), DateJoined) ReportMonth
      ,COUNT([Registration_ID]) Adds
  INTO  #adds
  FROM  Registrations
 GROUP BY DATEADD(d, -(DatePart(DAY,DateJoined)-1), DateJoined)

--Create a temp table for Drops, summarised by month
SELECT DATEADD(d, -(DatePart(DAY,DateLeft)-1), DateLeft)   DropMonth
       ,COUNT([Registration_ID]) Drops
  INTO  #drops
  FROM  Registrations
 WHERE  DateLeft IS NOT NULL
 GROUP BY DATEADD(d, -(DatePart(DAY,DateLeft)-1),   DateLeft)

--Join all 3 temp tables together and
SELECT m.ReportMonth
      ,ISNULL(a.Adds,0) SubsAdded
      ,ISNULL(d.Drops,0) SubsDropped
--Use 2 correlated subqueries to calculate the running total.
      ,(SELECT SUM(Adds) FROM #adds WHERE #adds.ReportMonth <= a.ReportMonth)
      - ISNULL((SELECT SUM(Drops) FROM #drops WHERE #drops.DropMonth <= a.ReportMonth),0) TotalSubs
  FROM #months m
  LEFT JOIN #adds a
         ON m.ReportMonth = a.ReportMonth
  LEFT JOIN #drops d
         ON a.ReportMonth = d.DropMonth
ORDER BY ReportMonth
This answer is marked "community wiki".

answered Oct 12 '09 at 16:06

Joe%20Harris's gravatar image

Joe Harris
261

I get wrong result with this. The code "DATEADD(d, -(DatePart(DAY,DateJoined)-1), DateJoined)" assumes time part of DateJoined is set to 00:00:00.000

(Oct 21 '09 at 12:41) Peso
  SELECT Dates.date AS [date],
    (SELECT COUNT(*) FROM registrations r WHERE DateAdd(second, -1,DateAdd(month,1,Dates.date)) 
        BETWEEN r.dateJoined AND COALESCE(r.DateLeft,GETDATE() )) AS subscribers,
    (SELECT COUNT(*) FROM registrations r WHERE datepart (month,r.dateJoined) =datepart (month,Dates.date)
                    and datepart (year,r.dateJoined) =datepart (year,Dates.date)) AS joined,
    (SELECT COUNT(*) FROM registrations r WHERE datepart (month,r.dateleft) =datepart (month,Dates.date)
                    and datepart (year,r.dateLeft) =datepart (year,Dates.date)) AS [left]

FROM
    (SELECT 
      [date]=CAST('01 '+LEFT(DATENAME(month,DateJoined),3)+' '+DATENAME(year,DateJoined) AS DATETIME)
    FROM Registrations
    GROUP BY CAST('01 '+LEFT(DATENAME(month,DateJoined),3)+' '+DATENAME(year,DateJoined) AS DATETIME)) dates
ORDER BY date   

answered Oct 12 '09 at 17:39

William%20Brewer's gravatar image

William Brewer
614

I get wrong result on last record (#106) using Phil's million record sample. Only for "Subscribers" column.

(Oct 21 '09 at 12:43) Peso

Phil told my that my answer was in big trouble, so I felt I ought to write another, only to find that Graham had pretty well got there already. I also looked carefully at Matt an Peso's entries. Think of this as a joint entry

CREATE TABLE #Preaggregation
(
 MonthJoined INT,
 MonthLeft INT,
 HowMany INT
)
INSERT  INTO #Preaggregation
    (
     MonthJoined,
     MonthLeft,
     HowMany
    )
    SELECT  [monthJoined],
            [MonthLeft],
            COUNT(*) AS [HowMany]
    FROM    (SELECT DATEDIFF(MONTH, 0, DateJoined) AS [MonthJoined],
                    DATEDIFF(MONTH, 0, COALESCE(dateleft, GETDATE())) AS [Monthleft]
             FROM   Registrations
            ) f
    GROUP BY MonthJoined,
            MonthLeft
            WITH CUBE 
CREATE INDEX idxMonthJoined ON #preaggregation (monthjoined, MonthLeft, Howmany)
CREATE INDEX idxMonthLeft ON #preaggregation (monthLeft, MonthJoined, Howmany) ;
WITH    JoinedAndLeft
      AS (SELECT    P1.MonthJoined,
                    p1.howmany AS PeopleJoined,
                    pj.howmany AS [Peopleleft]--,
          FROM      #preaggregation p1
                    INNER JOIN #preaggregation pj ON pj.monthleft=p1.monthjoined
          WHERE     pj.monthjoined IS NULL
                    AND p1.monthJoined IS NOT NULL
                    AND p1.monthleft IS NULL
         )
SELECT  DATEADD(MONTH, thisMonth.Monthjoined, 0) AS [Date],
        MIN(thisMonth.PeopleJoined)as [joined],
        MIN(thisMonth.PeopleLeft) as [left],
        SUM(previous.PeopleJoined-previous.PeopleLeft) as [subscribers]
FROM    JoinedAndleft thisMonth
        INNER JOIN JoinedAndLeft previous ON previous.Monthjoined<=ThisMonth.Monthjoined
GROUP BY thisMonth.Monthjoined
ORDER BY thisMonth.monthjoined

answered Oct 16 '09 at 17:00

William%20Brewer's gravatar image

William Brewer
614

edited Oct 16 '09 at 21:49

This seems the fastest yet on a million rows. The significant factor seems to be the use of the CUBE to do the pre-aggregation, along with Matt's trick of doing the initial aggregation on a derived table, and some of Peso's methods of speeding up the calculation (use of an integer for intermediary date calculations).

(Oct 16 '09 at 18:29) Phil Factor

Phil - does this produce the correct result on your data?

(Oct 16 '09 at 19:00) Matt Whitfield ♦

It does, if you change "DATEDIFF(MONTH, 0, COALESCE(dateleft, GETDATE())) AS [Monthleft]" to "DATEDIFF(MONTH, 0, COALESCE(dateleft, dateadd(month, 1,GETDATE()))) AS [Monthleft]"

(Oct 21 '09 at 12:32) Peso
Your answer
toggle preview

Copyright © 2002-2009 Simple Talk Publishing. All Rights Reserved.
Ask SQL Server Central is a community service provided by Red Gate.
Powered By OSQA