question

Scot Hauder avatar image
Scot Hauder asked

Phil Factor Challenge #4 - Sub Challenge

Here is one technique I was exploring, which I could not figure out, with the ip-group-country table. Given that several countries are contiguous in the ip range I wanted to group by these and get the MAX, MIN ip-start and then I would only have to check the "between" ranges of about half of the 180,000 rows. I've tried every combination of ranking functions to no avail. Here is what I want: how to calculate the second column below so I can group on it!

US  1
US  1
CA  2
IN  3
US  4
CA  5
CA  5
DE  6
DE  6
US  7
challenge
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Scot Hauder avatar image
Scot Hauder answered
@Jeff Moden Well I think I finally figured it out. This is why it isn't so straight forward: I basically have to look at each individual row one at a time while SQL and the windowing functions want to work on sets. CREATE TABLE #ip_group_country ( [id] int ,[ip_start] bigint ,[ip_cidr] varchar(50) ,[country_code] char(2) ,[country_name] varchar(50)) INSERT #ip_group_country([id],[ip_start],[ip_cidr],[country_code],[country_name]) VALUES (397, 203658304, '12.35.148.64/26', 'PR', 'Puerto Rico'), (398, 203658368, '12.35.148.128/28', 'PR', 'Puerto Rico'), (399, 203658384, '12.35.148.144/28', 'US', 'United States'), (400, 203658400, '12.35.148.160/28', 'PR', 'Puerto Rico'), (401, 203658416, '12.35.148.176/29', 'VI', 'Virgin Islands U.S.'), (402, 203658424, '12.35.148.184/27', 'US', 'United States'), (403, 203658456, '12.35.148.216/28', 'US', 'United States'), (404, 203658472, '12.35.148.232/29', 'US', 'United States'), (405, 203658480, '12.35.148.240/24', 'PR', 'Puerto Rico'), (406, 203658736, '12.35.149.240/28', 'PR', 'Puerto Rico'), (407, 203658752, '12.35.150.0/26', 'VI', 'Virgin Islands U.S.'), (408, 203658816, '12.35.150.64/28', 'VI', 'Virgin Islands U.S.'), (409, 203658832, '12.35.150.80/28', 'US', 'United States'), (410, 203658848, '12.35.150.96/26', 'PR', 'Puerto Rico'), (411, 203658912, '12.35.150.160/28', 'VI', 'Virgin Islands U.S.'), (412, 203658928, '12.35.150.176/28', 'PR', 'Puerto Rico'), (413, 203658944, '12.35.150.192/29', 'PR', 'Puerto Rico') ;WITH BaseTable AS ( SELECT id, ip_start, ip_cidr, country_code, country_name, [r] = ROW_NUMBER() OVER (ORDER BY ip_start) FROM #ip_group_country ) ,a AS ( SELECT id, ip_start, ip_cidr, country_code, country_name, r, IPGroup = 1 FROM BaseTable WHERE r = 1 UNION ALL SELECT bt.id, bt.ip_start, bt.ip_cidr, bt.country_code, bt.country_name, bt.r, CASE WHEN bt.country_code <> a.country_code THEN IPGroup + 1 ELSE IPGroup END FROM a INNER JOIN BaseTable [bt] ON bt.r = a.r + 1 ) SELECT MIN(ip_start) [IPStart], MAX(ip_start) [IPEnd] ,country_code FROM a GROUP BY IPGroup,country_code ORDER BY MIN(ip_start) DROP TABLE #ip_group_country
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Matt Whitfield avatar image
Matt Whitfield answered

Well I don't know how to give you what you've asked for, but I'm pretty sure I can give you what you want:

CREATE TABLE [#numbered_groups] (
  [ip_start] [bigint] NOT NULL,
  [country_code] [char] (2) COLLATE Latin1_General_CI_AS NULL,
  [row] [int] PRIMARY KEY CLUSTERED
) 

INSERT INTO [#numbered_groups] ([ip_start], [country_code], [row])
SELECT [ip_start],
       [country_code],
       ROW_NUMBER() OVER (ORDER BY [ip_start]) AS [row] 
FROM [dbo].[ip_group_country]

SELECT [gc].[ip_start],
       [gc].[country_code] FROM [#numbered_groups] [gc] LEFT OUTER JOIN [#numbered_groups] [gc1]
ON gc1.[row] = [gc].[row] - 1
WHERE [gc].[country_code] != ISNULL([gc1].[country_code], '')
ORDER BY [gc].[ip_start]

DROP TABLE [#numbered_groups]
4 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Thanks for the answer Matt, I had already looked at doing that but decided the join was too expensive. I "almost" have what I need with this:
SELECT CAST(ip_start AS bigint) ip_start
      ,[country_code]
      ,ROW_NUMBER() OVER (ORDER BY country_code)-ROW_NUMBER() OVER (ORDER BY ip_start) MyDenseRank
FROM [NFL].[dbo].[ip_group_country]
but the numbers eventually collide. NB. I wouldn't have asked this if I thought it would be faster than what I am doing now, it just bothers me when something that should be simple is deceptively ore difficult
0 Likes 0 ·
Well. It depends really on what you want - because I can't see the reduction in rows causing any significant boost in speed overall given that the computation you want will always involve a scan of one sort or another...
0 Likes 0 ·
Right, that's why I've banished the idea for this competition. Just brainstorming and thought a scan over half the number of rows would save more time than it does with this table, perhaps a larger table would see a greater gain
0 Likes 0 ·
Yeah, I mean I've just posted a CLR solution which just leverages a clustered index on ip_group_country and the whole thing runs in between 300 and 350ms on my box... I can't really see any way that the pre-aggregation would yield enough of a speed bump to make it worth while...
0 Likes 0 ·
Phil Factor avatar image
Phil Factor answered

One really bad approach is to use a SELECT INTO query with the IDENTITY function and an ORDER BY clause. Because of a quirk of the optimiser, The identity values are not guaranteed to have the same order as the order that is provided by the ORDER BY clause. It may work, but it may not. Even if you use TOP or SET ROWCOUNT, it doesn't guarantee the order of the rows.

If you either use Matt's approach, or create a table that contains a column with the IDENTITY property and then run an INSERT .. SELECT … GROUP BY ORDER BY query to populate this table, you'll get the correct sequence. As far as I know, there is no performance advantage either way. However, I'm not sure that every country enjoys an unbroken sequence in the table. Have you checked that?

8 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Yes. There are still a lot of tricks you can do with computed columns and identity fields. People who are still using SQL Server 2000 like the Identity property because they haven't got ROW_NUMBER. A lot of people are still forced to use SQL Server 2000
1 Like 1 ·
Does INSERT INTO have the same issues?
0 Likes 0 ·
Can you think of a good use for the IDENTITY function now that we have ROW_NUMBER()?
0 Likes 0 ·
I mean the IDENTITY function, not the IDENTITY property...
0 Likes 0 ·
Show more comments
Jeff Moden avatar image
Jeff Moden answered

What a dilema for me... Scot has been 100% truthful about what this is for but what this is for is a contest with a high value reward for winning. I usually believe in rewarding absolute honesty but, in this case, I could be giving someone a unfair advantage in a contest.

Scot... my apologies. If you're still interested in the answer after the challenge, post back and I'll take a crack at it (I have a super fast "gap" finder that we can warp). I will give you a "Google Hint" though... search for "SQL ISLANDS AND GAPS". Just be careful... some of the methods are comparatively slow.

--Jeff Moden

6 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Thanks Jeff, I will still be interested after the competition. I know how to find gaps/islands but I think you will find my simple (to describe) problem subtly more difficult.
0 Likes 0 ·
Jeff - if you pull out of your hat a way to achieve this that actually makes the overall solution quicker (given it only yields a 33% reduction in rows over a very easily indexable table) I will officially award you 1000 magic points!!!! :)
0 Likes 0 ·
Scot... how so?
0 Likes 0 ·
Matt... I guess it'll depend on how Scot replies.
0 Likes 0 ·
@Jeff If you look at my sample the underlying problem is that SQL works on sets with the windowing function and wants to group the country codes before applying the rownumber but I basically want to increment the value if the country code is different than the row before it, otherwise keep it the same. Yes, I can use a quirky update but I was looking for a SELECT only solution. @Matt, you are right, not much of a boost in speed and I had already figured out a faster workaround but was thrown by this problem, so I posted it, per the chance that someone had done something similar in the past
0 Likes 0 ·
Show more comments
Jeff Moden avatar image
Jeff Moden answered

Matt,

On the subject of what to use the IDENTITY function for...

First, here's the link that talks about the order of SELECT/INTO with an IDENTITY function no being guaranteed...

link text

The problem with that fine article is... they don't show any code that actually fails. They only say that they could fail "in one of three ways". If you use OPTION(MAXDOP 1) to prevent parallelism, then none of the demonstration code posted on that article will ever fail.

Further, it looks like they fixed some major problems that occurred in SQL Server 2000 when they came out with SQL Server 2005. The following code proves that an ordered SELECT/INTO with an IDENTITY function can fail in SQL Server 2000 but works correctly in SQL Server 2005.

--===== Conditionally drop any temp tables to simplify doing reruns of the code
     IF OBJECT_ID('TempDB..#Insert') IS NOT NULL
        DROP TABLE #Insert  

     IF OBJECT_ID('TempDB..#SelectInto') IS NOT NULL
        DROP TABLE #SelectInto  

     IF OBJECT_ID('TempDB..#Data') IS NOT NULL
        DROP TABLE #Data  

     IF OBJECT_ID('TempDB..#Tally') IS NOT NULL
        DROP TABLE #Tally  

--===== Create and populate a temporary Tally table on the fly
 SELECT TOP 11001
        IDENTITY(INT,0,1) AS N
   INTO #Tally
   FROM Master.dbo.SysColumns sc1
  CROSS JOIN Master.dbo.SysColumns sc2  

--===== Add a Primary Key to maximize performance
  ALTER TABLE #Tally
    ADD PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100  

--===== Create some known, stable test data
 SELECT CAST(t1.N AS TINYINT)  AS SomeValue,
        CAST(t2.N AS INT)      AS SomeID,
        CAST(t3.N AS SMALLINT) AS SomeOtherID
   INTO #Data
   FROM #Tally t1,
        #Tally t2,
        #Tally t3
  WHERE t1.N BETWEEN 1 AND 10
    AND t2.N BETWEEN 1 AND 10
    AND t3.N BETWEEN 1 AND 10  

--===== Precreate the table that will use only INSERT
 CREATE TABLE #Insert
        (
        RowNum      INT IDENTITY(1,1),
        SomeValue   TINYINT,
        SomeID      INT,
        SomeOtherID SMALLINT,
        RefDate     DATETIME
        )  

--===== Do the ordered SELECT/INTO using the IDENTITY function
 SELECT IDENTITY(INT,1,1) AS RowNum,
        SomeValue,
        SomeID,
        SomeOtherID,
        GETDATE() AS RefDate
   INTO #SelectInto
   FROM #Data
  ORDER BY SomeValue, SomeID, SomeOtherID, GETDATE()
 OPTION (MAXDOP 1)  

--===== Do the ordered INSERT into an established table
 INSERT #Insert (SomeValue,SomeID,SomeOtherID,refdate)
 SELECT SomeValue,
        SomeID,
        SomeOtherID,
        GETDATE()
   FROM #Data 
  ORDER BY SomeValue, SomeID, SomeOtherID, GETDATE()
 OPTION (MAXDOP 1)  

--===== Show any differences between the two tables
 SELECT i.*, si.*
   FROM #Insert i
  INNER JOIN #SelectInto si 
     ON si.SomeValue   = i.SomeValue 
    AND si.SomeID      = i.SomeID
    AND si.SomeOtherID = i.SomeOtherID
  WHERE i.RowNum      <> si.RowNum

If you comment out the GETDATE() in the SELECT/INTO in ORDER BY, then it works correctly in SQL Server 2000, as well. The key here is that including GETDATE() in the ORDER BY is a computational mistake... since it will give the same value for all rows, there's just no sense in including it. GETDATE() is also "indeterminent" and unless you're trying to get a random order with something like NEWID(), I would never include an indeterminent function in an ORDER BY because that's a programming mistake the could screw up the order.

SELECT/INTO with an IDENTITY can also go out of order across a linked server and when using the OUTPUT clause and they both have work arounds. For the linked server problem, the work around is, in fact, to precreate the target table (which also solves another problem I won't get into here). For the problem that occurs with the OUTPUT clause, you must simply reference one of the columns actually being inserted and not just the IDENTITY column.

Just like the "quirky update", people will claim that SELECT/INTO with an IDENTITY could break with the next CU or SP so why use it? The answer is incredible speed. SELECT/INTO is twice as fast as an INSERT into an existing table and if you follow "the rules", it won't break once established. Still, a CU or SP could break it. So what do you do? The answer is either give up the speed and use an INSERT into an existing table or do the thing that everyone should do... before you install a CU, Hot Fix, or SP in production, you absolutely should do full regression testing on a test server. Even well documented and well established features can break or change without warning "in the interest of security".

As a side bar, I've been using SELECT/INTO with the IDENTITY function since SQL Server 2000 came out and have never had a problem with it. Of course, I don't do anything with it that doesn't make sense like doing an ORDER BY that includes an indeterminent function and I always use OPTION(MAXDOP 1). Use of "Black Arts" techniques is NOT for casual users... if someone doesn't want to do the necessary research and extreme testing or they're going to do willy-nilly upgrades of SQL Server with CUs, Hot Fixes, and SPs, then they should follow the generally accepted "safe thing to do". Me? I'll never forsake accuracy for speed but I'll definitely go for the speed if I've found no problems with accuracy. I've also found that CUs, Hot Fixes, and SPs don't affect "features" like this except in a good way... the optimizer has been getting better and better.

--Jeff Moden

6 comments
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Generally, any operation that does a Quirky update such as a running total can be tested if you put an IDENTITY in the table you're updating in the same order you want the update to work. A simple check with a tally will then set a variable that then triggers a RAISERROR! Even theSQLGuru will have to bite his lip then. I must put this into my code for the challenge just to show off a bit. Remind me Jeff!
0 Likes 0 ·
I'd forgotten the extra speed of the SELECT INTO, Jeff. If you can think of a quick check in the code that the optimizer hasn't gone doolally, then I'm a convert to the technique. I sometimes suspect that the Dome-Heads at Microsoft come out with these edicts to cover up a bug that they didn't spot.
0 Likes 0 ·
Jeff - do you see a big difference in speed between select into with identity over select into with row_number? Because ithat's what I was asking about really - the identity function .v. row_number...
0 Likes 0 ·
Phil... Heh... I don't know if you read my latest rendition of the "Running Total" article but I included code that did the very test you suggest and still beat a cursor, to boot... but it still didn't satisfy the relational purists. Man! I took a lot of heat but I won't let it deter me. ;-)
0 Likes 0 ·
Phil... on the subject of SELECT/INTO with the IDENTITY... the quick check would be to ensure that each row contained greater data in whatever the sort order of the insert was according to a self join on the IDENTITY column. Haven't tested to see if it's worth it CPU and Duration wise, though. Personally, I trust it to work correctly when I follow the self imposed rules I follow when I use it.
0 Likes 0 ·
Show more comments
Jeff Moden avatar image
Jeff Moden answered

Sorry about the formatting in the code... like so many forums, this one eats blank lines. It's a real aggravation and I wish the makers of technical forums like this one would figure out that people do take pride in the formatting of their code and accomodate them for once. Who ever picked out the colorization in the code needs to be taken out for a nice pork chop dinner.

--Jeff Moden

10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.