It must be that time again!Phil Factor SQL Speed Phreak Challenge #4 is here, and our friends at Red Gate have offered some truly excellent prizes this time: First Prize: $100 and a license for SQL Data Generator There will also be three categorised runner-up prizes:
The winners of these prizes will get a licensed copy of either SQL Prompt or SQL Refactor, whichever they prefer. Of course, the first place entry will also have privilege and honour of displaying this rather nice trophy jpg on your blog / website / bedroom wall in perpetuity.
So, the task...It turns out that Robert Bar has a younger brother. Reggie Bar was asked by his company to so some analytics work on their web logs. Reggie started by using the information in this article to use LogParser to import the data from his IIS logs into the database. So far, so good. Reggie then got hold of a geolocation IP database from here. Again, so far, so good. That’s where the good times ended, however. Reggie decided that the best way to approach this task was to use cursors, and even on a small amount of data the solution is taking longer and longer to execute. Management aren’t happy, they need a solution that is fast and won’t get hugely slower over time. The dataThe main data sources are two standard sources of information – the output of LogParser (in the table The data (update)There is now available another table IISAudit2, which is exactly the same as the old table, except for the fact that it adds a sequence number to the data, to aid sorting the rows in a consistent and repeatable manner (something you would probably do if loading this data in from your own IIS server anyway). Feel free to use this in your solutions if it makes it easier for you - the file is available (again kindly hosted by Peso) here. The data (update 2)Peso found a bug in Reggie's code, where the lookup on the ip_group_country got it wrong for one value because it used < rather than <=. New code, both data files, DDL & bcp commands, again kindly hosted by Peso, are here. The outputThe output required is three result sets:
Note that in this challenge, the order of the data in the result sets does matter. The visitor summaries include the following calculations:
The page summary includes the following calculations:
Note that for the country summary by day report, the day to which a visit is added is the day on which the visit started. To lookup the geolocation data for an IP
The rulesThe rules for the challenge are here. The finishing date: Midnight GMT 28th February 2010 Submissions must be in by 28th Feb 2010 BootnoteReggie Bar's solution will only return the correct result from a table loaded directly from the source data. He really should have put an order by on the cursor select, shouldn't he? *cough* Leaderboard
ResultCongratulations to Peso - his 2c entry is the winner!
(comments are locked)
|
|
Peso 2c 2010-02-23 Same setup and teardown as Peso 2a (single index). The main code It's getting close now!!
Feb 23 '10 at 06:34 PM
Matt Whitfield ♦♦
I did compare with COMP and they are exact matches, using IISAudit2 table and the new code from Reggie with ORDER BY seq_no added to the CURSOR definition.
Feb 23 '10 at 07:31 PM
Peso
+1 Got this to work. Very fast good effort Peso. I can only dream of writing SQL like you! :-D
Feb 23 '10 at 07:33 PM
Ian Roke
(comments are locked)
|
This runs in about 5 seconds on my machine which is hugely impressive. The Goal URL score could be put in the aggregation stage so it wouldn't be too much of an additional overhead.
Feb 17 '10 at 10:18 AM
Ian Roke
Thanks for that. I haven't done any work in optimizing it yet so there is a potential for shaving quite a bit off the timings. Last time I checked, the main report was exactly the same as Reggie's but the country reports were quite different.
Feb 17 '10 at 01:41 PM
Phil Factor
Phil - do you want me to split this into setup, main & teardown and add deletes from the temporary table to the beginning of main? Also with the results - could well be the ordering issue, because there are multiple page requests from the same IP in the same second, which idiot me didn't notice previously... I thought of adding an IDENTITY to the set after bulk loading - what do you think?
Feb 17 '10 at 02:19 PM
Matt Whitfield ♦♦
As this is the first cut, I'd say do the timings on the code as it is if is less trouble. We only want rough timings at this stage. I'd say leave the import table as is now. I reckon it is good practice to take data in as near as dammit to what it was like when it was in the import file. It is then part of the task to cope with things like this. This is why I put in the surrogate key into the session table. The problem is finding a primary key that will actually add value. It doesn't help to use the surrogate, of course.
Feb 17 '10 at 03:45 PM
Phil Factor
I did the algorithm to find the source of the IP address using excel and even a pocket calculator, and I still get different results to Reggie, so I must have got my understanding of the process entirely wrong.
Feb 17 '10 at 03:45 PM
Phil Factor
(comments are locked)
|
|
Dave Ver 1.b - I think the results are now spot on ,bold claim i know The Setup
The Teardown
And the execution code
(comments are locked)
|
|
Peso 3a, almost there. In this solution there are no "quirkyness" at all. Resultset 1 and 2 are perfect match, but resultset #3 still need some work. Do you want me to add this to the harness or wait until the match is there? I'll do another timing run tomorrow (just off to bed)
Feb 25 '10 at 07:52 PM
Matt Whitfield ♦♦
Add it, so I can see if I am going the right direction.
Feb 26 '10 at 08:01 AM
Peso
It's now up there...
Feb 26 '10 at 05:47 PM
Matt Whitfield ♦♦
And it's almost valid, or just plain invalid? ;-)
Feb 26 '10 at 06:36 PM
Peso
I'm not sure actually - I was so tired yesterday I just ran it quickly and went to sleep! But it looks like 2c is doing to clinch it... unless someone submits an amazingly fast and correct solution in the next 4 minutes
Feb 27 '10 at 08:57 PM
Matt Whitfield ♦♦
(comments are locked)
|
This is definitely closer - There's a few foibles now (for example, country 'BA' in the country summary shows 9 visits in your results, and it should be 8), but definitely much closer! Also the ordering in your country by day summary should be day, country and not country, day... But this is definitely the closest :)
Feb 21 '10 at 12:50 PM
Matt Whitfield ♦♦
But is Phil allowed to enter his own competition? ;-p
Feb 21 '10 at 03:25 PM
Ian Roke
Of course! :) At the rate he's going, he's going to be the winner!
Feb 21 '10 at 04:23 PM
Matt Whitfield ♦♦
What is the correct ORDER BY clause for Reggies code, so we can compare our results? I get parallellism all over the places so it's hard to know.
Feb 21 '10 at 05:50 PM
Peso
C'mon you others. I'm feeling lonely. Isn't anyone else going to enter the competition?
Feb 21 '10 at 07:15 PM
Phil Factor
(comments are locked)
|
1 2 next page »


A visit is one or more page requests from the same IP address within a set time limit...We should include POSTs as a page request, correct?
Scot - correct - any sort of activity from that same IP keeps the session going...
I wouldn't do this in reality, because an entire enterprise or ISP can share an IP address. You actually have to look closely at the UserAgent to work out individual visitors as well as the IP address, but I suspect that Reggie didn't know about that.
Agreed - Ideally you'd get the session ID and use that...
Even looking at UserAgent isn't the ideal way Phil because in an enterprise they all tend to be on the same one. Session ID is a good alternative however if I tend to have a requirement for user tracking through a system I have some code that saves user details on a page load (ASP.NET/C#) that stores details of the user logged in or otherwise which I can then correlate to the stats.