It must be that time again!
Phil Factor SQL Speed Phreak Challenge #4 is here, and our friends at Red Gate have offered some truly excellent prizes this time:
First Prize: $100 and a license for SQL Data Generator
There will also be three categorised runner-up prizes:
- Best Cursor-based solution
- Best CLR solution
- Best Unusual/experimental attempt
The winners of these prizes will get a licensed copy of either SQL Prompt or SQL Refactor, whichever they prefer.
Of course, the first place entry will also have privilege and honour of displaying this rather nice trophy jpg on your blog / website / bedroom wall in perpetuity.
So, the task...
It turns out that Robert Bar has a younger brother. Reggie Bar was asked by his company to so some analytics work on their web logs. Reggie started by using the information in this article to use LogParser to import the data from his IIS logs into the database. So far, so good.
Reggie then got hold of a geolocation IP database from here. Again, so far, so good.
That’s where the good times ended, however. Reggie decided that the best way to approach this task was to use cursors, and even on a small amount of data the solution is taking longer and longer to execute. Management aren’t happy, they need a solution that is fast and won’t get hugely slower over time.
The data
The main data sources are two standard sources of information – the output of LogParser (in the table [dbo].[IISAudit]
) and the geolocation IP address (in the table [dbo].[ip_group_country]
). The data for both of these tables is kindly being hosted by Peso here. Also included in the data file are the DDL for the tables, the BCP commands to import the data, and Reggie’s solution to the problem.
The data (update)
There is now available another table IISAudit2, which is exactly the same as the old table, except for the fact that it adds a sequence number to the data, to aid sorting the rows in a consistent and repeatable manner (something you would probably do if loading this data in from your own IIS server anyway). Feel free to use this in your solutions if it makes it easier for you - the file is available (again kindly hosted by Peso) here.
The data (update 2)
Peso found a bug in Reggie's code, where the lookup on the ip_group_country got it wrong for one value because it used < rather than <=. New code, both data files, DDL & bcp commands, again kindly hosted by Peso, are here.
The output
The output required is three result sets:
- Visitor summary by country by day
- Visitor summary by country
- Page summary by day
Note that in this challenge, the order of the data in the result sets does matter.
The visitor summaries include the following calculations:
VisitCount
– this is the number of unique visits. A visit is one or more page requests from the same IP address within a set time limit. The time limit is configurable through the @SessionTimeout parameter in Reggie’s solution. Note that only pages with the extensions .asp, .aspx, .htm, .html and .php are considered as page views.PageCount
– this is the total number of pages viewed. Again note that only pages with the extensions .asp, .aspx, .htm, .html and .php are considered as page views.MaxPagesPerVisit
– this is the highest number of pages per visit.AvgPagesPerVisit
– this is the average number of pages per visit, to 2 decimal places.MaxTimeOnSite
– this is the maximum time that any particular visitor spent on the site. The time on site is defined as the difference between the first and last page load for a particular visit.AvgTimeOnSite
- this is the average time that any particular visitor spent on the site.GoalCount
– this is the number of times that any of a particular set of pages were loaded. A single visit may build up multiple goals, and viewing the same goal page multiple times counts as multiple goals.BounceRate
– this is the number of visits where only a single page was loaded, divided by the total number of visits, and expressed as a percentage to two decimal places.
The page summary includes the following calculations:
PageCount
– this is the number of times that the page was loaded that day.EntryRate
– this is the number of times that the page was the first page of the visit, divided byPageCount
, and expressed as a percentage to two decimal places.ExitRate
– this is the number of times that the page was the last page of the visit, divided byPageCount
, and expressed as a percentage to two decimal places.BounceRate
– this is the number of times that the page was both the first and the last page of the visit, divided byPageCount
, and expressed as a percentage to two decimal places.
Note that for the country summary by day report, the day to which a visit is added is the day on which the visit started.
To lookup the geolocation data for an IP A.B.C.D
, you need to find the row in [dbo].[ip_group_country]
that has the highest [ip_start]
value that is less than or equal to:
(A * 16777216) + (B * 65536) + (C * 256) + D
The rules
The rules for the challenge are here.
The finishing date: Midnight GMT 28th February 2010
Submissions must be in by 28th Feb 2010
Bootnote
Reggie Bar's solution will only return the correct result from a table loaded directly from the source data. He really should have put an order by on the cursor select, shouldn't he? *cough*
Leaderboard
+---------------------------------+----------+----------+----------+----------+----------+----------+
| Name | Min | Avg | Med | Max | Total | Runs |
+---------------------------------+----------+----------+----------+----------+----------+----------+
| Matts CLR Not An Entry | 292 | 376 | 301 | 1,807 | 7,522 | 20 |
| Peso 2c | 1,718 | 1,863 | 1,779 | 3,572 | 37,263 | 20 |
| Peso 2a (Invalid) | 1,681 | 2,385 | 1,925 | 4,745 | 47,706 | 20 |
| Dave1b (Almost Valid!!) | 2,259 | 2,466 | 2,269 | 5,967 | 49,334 | 20 |
| Dave1a (Invalid) | 2,469 | 2,533 | 2,483 | 3,287 | 50,660 | 20 |
| Peso 3a (Invalid) | 2,697 | 2,863 | 2,723 | 4,927 | 57,279 | 20 |
| Peso 1a 20100216 (Invalid) | 2,855 | 3,020 | 2,866 | 4,037 | 60,417 | 20 |
| Phil 1b (Invalid) | 3,743 | 4,013 | 3,865 | 6,788 | 80,275 | 20 |
| Phil 1a (Invalid) | 5,040 | 5,422 | 5,172 | 10,274 | 108,451 | 20 |
+---------------------------------+----------+----------+----------+----------+----------+----------+
Result
Congratulations to Peso - his 2c entry is the winner!