x
login about faq Site discussion (meta-askssc)

Phil Factor SQL Speed Phreak Challenge #4 - The Log Parsing Problem

It must be that time again!

Phil Factor SQL Speed Phreak Challenge #4 is here, and our friends at Red Gate have offered some truly excellent prizes this time:

First Prize: $100 and a license for SQL Data Generator

There will also be three categorised runner-up prizes:

  • Best Cursor-based solution
  • Best CLR solution
  • Best Unusual/experimental attempt

The winners of these prizes will get a licensed copy of either SQL Prompt or SQL Refactor, whichever they prefer.

Of course, the first place entry will also have privilege and honour of displaying this rather nice trophy jpg on your blog / website / bedroom wall in perpetuity.

Speed Phreak Trophy

So, the task...

It turns out that Robert Bar has a younger brother. Reggie Bar was asked by his company to so some analytics work on their web logs. Reggie started by using the information in this article to use LogParser to import the data from his IIS logs into the database. So far, so good.

Reggie then got hold of a geolocation IP database from here. Again, so far, so good.

That’s where the good times ended, however. Reggie decided that the best way to approach this task was to use cursors, and even on a small amount of data the solution is taking longer and longer to execute. Management aren’t happy, they need a solution that is fast and won’t get hugely slower over time.

The data

The main data sources are two standard sources of information – the output of LogParser (in the table [dbo].[IISAudit]) and the geolocation IP address (in the table [dbo].[ip_group_country]). The data for both of these tables is kindly being hosted by Peso here. Also included in the data file are the DDL for the tables, the BCP commands to import the data, and Reggie’s solution to the problem.

The data (update)

There is now available another table IISAudit2, which is exactly the same as the old table, except for the fact that it adds a sequence number to the data, to aid sorting the rows in a consistent and repeatable manner (something you would probably do if loading this data in from your own IIS server anyway). Feel free to use this in your solutions if it makes it easier for you - the file is available (again kindly hosted by Peso) here.

The data (update 2)

Peso found a bug in Reggie's code, where the lookup on the ip_group_country got it wrong for one value because it used < rather than <=. New code, both data files, DDL & bcp commands, again kindly hosted by Peso, are here.

The output

The output required is three result sets:

  • Visitor summary by country by day
  • Visitor summary by country
  • Page summary by day

Note that in this challenge, the order of the data in the result sets does matter.

The visitor summaries include the following calculations:

  • VisitCount – this is the number of unique visits. A visit is one or more page requests from the same IP address within a set time limit. The time limit is configurable through the @SessionTimeout parameter in Reggie’s solution. Note that only pages with the extensions .asp, .aspx, .htm, .html and .php are considered as page views.
  • PageCount – this is the total number of pages viewed. Again note that only pages with the extensions .asp, .aspx, .htm, .html and .php are considered as page views.
  • MaxPagesPerVisit – this is the highest number of pages per visit.
  • AvgPagesPerVisit – this is the average number of pages per visit, to 2 decimal places.
  • MaxTimeOnSite – this is the maximum time that any particular visitor spent on the site. The time on site is defined as the difference between the first and last page load for a particular visit.
  • AvgTimeOnSite - this is the average time that any particular visitor spent on the site.
  • GoalCount – this is the number of times that any of a particular set of pages were loaded. A single visit may build up multiple goals, and viewing the same goal page multiple times counts as multiple goals.
  • BounceRate – this is the number of visits where only a single page was loaded, divided by the total number of visits, and expressed as a percentage to two decimal places.

The page summary includes the following calculations:

  • PageCount – this is the number of times that the page was loaded that day.
  • EntryRate – this is the number of times that the page was the first page of the visit, divided by PageCount, and expressed as a percentage to two decimal places.
  • ExitRate – this is the number of times that the page was the last page of the visit, divided by PageCount, and expressed as a percentage to two decimal places.
  • BounceRate – this is the number of times that the page was both the first and the last page of the visit, divided by PageCount, and expressed as a percentage to two decimal places.

Note that for the country summary by day report, the day to which a visit is added is the day on which the visit started.

To lookup the geolocation data for an IP A.B.C.D, you need to find the row in [dbo].[ip_group_country] that has the highest [ip_start] value that is less than or equal to:

(A * 16777216) + (B * 65536) + (C * 256) + D

The rules

The rules for the challenge are here.

The finishing date: Midnight GMT 28th February 2010

Submissions must be in by 28th Feb 2010

Bootnote

Reggie Bar's solution will only return the correct result from a table loaded directly from the source data. He really should have put an order by on the cursor select, shouldn't he? *cough*

Leaderboard

+---------------------------------+----------+----------+----------+----------+----------+----------+
| Name                            | Min      | Avg      | Med      | Max      | Total    | Runs     |
+---------------------------------+----------+----------+----------+----------+----------+----------+
| Matts CLR Not An Entry          |      292 |      376 |      301 |    1,807 |    7,522 |       20 |
| Peso 2c                         |    1,718 |    1,863 |    1,779 |    3,572 |   37,263 |       20 |
| Peso 2a (Invalid)               |    1,681 |    2,385 |    1,925 |    4,745 |   47,706 |       20 |
| Dave1b (Almost Valid!!)         |    2,259 |    2,466 |    2,269 |    5,967 |   49,334 |       20 |
| Dave1a (Invalid)                |    2,469 |    2,533 |    2,483 |    3,287 |   50,660 |       20 |
| Peso 3a (Invalid)               |    2,697 |    2,863 |    2,723 |    4,927 |   57,279 |       20 |
| Peso 1a 20100216 (Invalid)      |    2,855 |    3,020 |    2,866 |    4,037 |   60,417 |       20 |
| Phil 1b (Invalid)               |    3,743 |    4,013 |    3,865 |    6,788 |   80,275 |       20 |
| Phil 1a (Invalid)               |    5,040 |    5,422 |    5,172 |   10,274 |  108,451 |       20 |
+---------------------------------+----------+----------+----------+----------+----------+----------+

Result

Congratulations to Peso - his 2c entry is the winner!

more ▼

asked Feb 11 '10 at 10:49 AM in Default

Matt Whitfield gravatar image

Matt Whitfield ♦♦
29.2k 56 63 87

A visit is one or more page requests from the same IP address within a set time limit...We should include POSTs as a page request, correct?

Feb 12 '10 at 01:16 AM Scot Hauder

Scot - correct - any sort of activity from that same IP keeps the session going...

Feb 12 '10 at 07:46 AM Matt Whitfield ♦♦

I wouldn't do this in reality, because an entire enterprise or ISP can share an IP address. You actually have to look closely at the UserAgent to work out individual visitors as well as the IP address, but I suspect that Reggie didn't know about that.

Feb 13 '10 at 04:27 PM Phil Factor

Agreed - Ideally you'd get the session ID and use that...

Feb 13 '10 at 06:40 PM Matt Whitfield ♦♦

Even looking at UserAgent isn't the ideal way Phil because in an enterprise they all tend to be on the same one. Session ID is a good alternative however if I tend to have a requirement for user tracking through a system I have some code that saves user details on a page load (ASP.NET/C#) that stores details of the user logged in or otherwise which I can then correlate to the stats.

Feb 14 '10 at 06:20 AM Ian Roke
(comments are locked)
10|1200 characters needed characters left

11 answers: sort voted first

Peso 2c 2010-02-23

Same setup and teardown as Peso 2a (single index).

The main code

SET NOCOUNT ON

DECLARE @SessionTimeout INT = 900

CREATE TABLE    #Goal
        (
            theURL VARCHAR(255),
            CONSTRAINT PK_Goal PRIMARY KEY CLUSTERED
            (
                theURL
            )
        )

INSERT  #Goal
    (
        theURL
    )
VALUES  ('ASPNETPage29.aspx'),
    ('ASPNETPage9.aspx')

CREATE TABLE    #Stats
        (
            RowID INT IDENTITY(1, 1),
            CONSTRAINT PK_Stats PRIMARY KEY CLUSTERED
            (
                RowID
            ), 
            cIP VARCHAR(15) NOT NULL,
            DateStr CHAR(10) NOT NULL,
            PageTime DATETIME NOT NULL,
            isGoal TINYINT NOT NULL,
            theURL VARCHAR(255) NOT NULL,
            CountryCode CHAR(2) NOT NULL,
            theVisit INT,
            isBounce TINYINT,
            isEntry TINYINT,
            isExit TINYINT,
            [PageCount] INT,
            theTime INT
        )

INSERT      #Stats
        (
            cIP,
            DateStr,
            PageTime,
            isGoal,
            theURL,
            CountryCode
        )
SELECT      i.cIP,
        i.DateStr,
        CAST(i.DateStr + ' ' + i.TimeStr AS DATETIME) AS PageTime,
        CASE
            WHEN g.theURL IS NULL THEN 0
            ELSE 1
        END AS isGoal,
        i.csUriStem AS theURL,
        igc.CountryCode
FROM        dbo.IISAudit2 AS i
LEFT JOIN   #Goal AS g ON g.theURL = i.csUriStem
CROSS APPLY (
            SELECT TOP(1)   w.Country_Code
            FROM        dbo.Ip_Group_Country AS w
            WHERE       w.ip_start <= CAST(16777216 AS BIGINT) * PARSENAME(i.cIP, 4) + 65536 * PARSENAME(i.cIP, 3) + 256 * PARSENAME(i.cIP, 2) + PARSENAME(i.cIP, 1)
            ORDER BY    w.ip_start DESC
        ) AS igc(CountryCode)
WHERE       PARSENAME(i.csUriStem, 1) IN ('asp', 'aspx', 'htm', 'html', 'php')
ORDER BY    i.cIP,
        i.Seq_No

DROP TABLE #Goal

DECLARE @IP VARCHAR(15),
    @PageTime DATETIME,
    @Visit INT = 0

UPDATE  #Stats
SET @Visit = theVisit = CASE
                    WHEN cIP <> @IP THEN @Visit + 1
                    WHEN DATEDIFF(SECOND, @PageTime, PageTime) < @SessionTimeout THEN @Visit
                    ELSE @Visit + 1
                END,
    @IP = cIP,
    @PageTime = PageTime

UPDATE      s
SET     s.isBounce =    CASE 
                    WHEN q.minID = s.RowID THEN q.isBounce
                    END,
        s.isEntry = CASE
                    WHEN q.minID = s.RowID THEN 1
                END,
        s.isExit =  CASE
                    WHEN q.maxID = s.RowID THEN 1
                END,
        s.[PageCount] = CASE
                    WHEN q.minID = s.RowID THEN q.[PageCount]
                END,
        s.theTime = CASE
                    WHEN q.minID = s.RowID THEN q.theTime
                END,
        s.PageTime = q.theDate
FROM        #Stats AS s
INNER JOIN  (
            SELECT      w.theVisit,
                    MIN(w.minID) AS minID,
                    MAX(w.maxID) AS maxID,
                    CASE MAX(CASE WHEN s.RowID = w.minID THEN s.theURL ELSE NULL END)
                        WHEN MAX(CASE WHEN s.RowID = w.maxID THEN s.theURL ELSE NULL END) THEN 1
                        ELSE 0
                    END AS isBounce,
                    COUNT(*) AS [PageCount],
                    MAX(w.theTime) AS theTime,
                    DATEDIFF(DAY, 0, MIN(theDate)) AS theDate
            FROM        (
                        SELECT      theVisit,
                                MIN(RowID) AS minID,
                                MAX(RowID) AS maxID,
                                DATEDIFF(SECOND, MIN(PageTime), MAX(PageTime)) AS theTime,
                                MIN(PageTime) AS theDate
                        FROM        #Stats
                        GROUP BY    theVisit
                    ) AS w
            INNER JOIN  #Stats AS s ON s.theVisit = w.theVisit
            GROUP BY    w.theVisit
        ) AS q ON q.theVisit = s.theVisit

-- 1
SELECT      CountryCode,
        PageTime AS DayOnly,
        COUNT(DISTINCT theVisit) AS VisitCount,
        SUM([PageCount]) AS [PageCount],
        MAX([PageCount]) AS MaxPagesPerVisit,
        CAST(1.0E * SUM([PageCount]) / COUNT(DISTINCT theVisit) AS DECIMAL(12, 2)) AS AvgPagesPerVisit,
        MAX(theTime) AS MaxTimeOnSite,
        SUM(theTime) / COUNT(DISTINCT theVisit) AS AvgTimeOnSite,
        SUM(isGoal) AS GoalCount,
        CAST(100.0E * SUM(CASE WHEN [PageCount] = 1 THEN 1 ELSE 0 END) / COUNT(DISTINCT theVisit) AS DECIMAL(12, 2)) AS BounceRate
FROM        #Stats
GROUP BY    CountryCode,
        PageTime
ORDER BY    PageTime,
        CountryCode

-- 2
SELECT      CountryCode,
        COUNT(DISTINCT theVisit) AS VisitCount,
        SUM([PageCount]) AS [PageCount],
        MAX([PageCount]) AS MaxPagesPerVisit,
        CAST(1.0E * SUM([PageCount]) / COUNT(DISTINCT theVisit) AS DECIMAL(12, 2)) AS AvgPagesPerVisit,
        MAX(theTime) AS MaxTimeOnSite,
        SUM(theTime) / COUNT(DISTINCT theVisit) AS AvgTimeOnSite,
        SUM(isGoal) AS GoalCount,
        CAST(100.0E * SUM(CASE WHEN [PageCount] = 1 THEN 1 ELSE 0 END) / COUNT(DISTINCT theVisit) AS DECIMAL(12, 2)) AS BounceRate
FROM        #Stats
GROUP BY    CountryCode
ORDER BY    CountryCode

-- 3
SELECT      uriStem,
        PageDay,
        SUM([PageCount]) AS [PageCount],
        CAST(100.0E * SUM(isEntry) / SUM([PageCount]) AS DECIMAL(12, 2)) AS EntryRate,
        CAST(100.0E * SUM(isExit) / SUM([PageCount]) AS DECIMAL(12, 2)) AS ExitRate,
        CAST(100.0E * SUM(isBounce) / SUM([PageCount]) AS DECIMAL(12, 2)) AS BounceRate
FROM        (
            SELECT      theURL AS uriStem,
                    PageTime AS PageDay,
                    0 AS [PageCount],
                    SUM(isEntry) AS isEntry,
                    SUM(isExit) AS isExit,
                    SUM(isBounce) AS isBounce
            FROM        #Stats
            GROUP BY    PageTime,
                    theUrl

            UNION ALL

            SELECT      theURL AS uriStem,
                    DateStr AS PageDay,
                    COUNT(*) AS [PageCount],
                    0 AS isEntry,
                    0 AS isExit,
                    0 AS isBounce
            FROM        #Stats
            GROUP BY    DateStr,
                    theUrl
        ) AS d
GROUP BY    PageDay,
        uriStem
ORDER BY    PageDay,
        uriStem

DROP TABLE  #Stats
more ▼

answered Feb 23 '10 at 09:24 AM

Peso gravatar image

Peso
1.6k 4 6 8

It's getting close now!!

Feb 23 '10 at 06:34 PM Matt Whitfield ♦♦

I did compare with COMP and they are exact matches, using IISAudit2 table and the new code from Reggie with ORDER BY seq_no added to the CURSOR definition.

Feb 23 '10 at 07:31 PM Peso

+1 Got this to work. Very fast good effort Peso. I can only dream of writing SQL like you! :-D

Feb 23 '10 at 07:33 PM Ian Roke
(comments are locked)
10|1200 characters needed characters left
/*
    Phil1a --runs in 3.5 secs on my machine not including set up and tear down. 
    Sorry, didn't spot the code for the goal stuff but as it seemed to be all
    zeros in Reggie's results I just put a zero in. i'll add it for the next try
    DECLARE @timer TABLE
  (
   Time_ID INT,
   whatWasIt VARCHAR(30),
   whenWasIt DATETIME DEFAULT GETDATE()
  )*/

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#CountrySummary%' ) 
  DROP TABLE #CountrySummary
CREATE TABLE #CountrySummary
  (
   CountryCode [varchar](2),
   DayOnly [datetime],
   VisitCount [int],
   [PageCount] [int],
   MaxPagesPerVisit [int],
   AvgPagesPerVisit [decimal](12, 2),
   MaxTimeOnSite [int],
   AvgTimeOnSite [int],
   GoalCount [int],
   BounceRate [decimal](12, 2)
  )

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#CountryOverallSummary%' ) 
  DROP TABLE #CountryOverallSummary
CREATE TABLE #CountryOverallSummary
  (
   CountryCode [varchar](2),
   VisitCount [int],
   [PageCount] [int],
   MaxPagesPerVisit [int],
   AvgPagesPerVisit [decimal](12, 2),
   MaxTimeOnSite [int],
   AvgTimeOnSite [int],
   GoalCount [int],
   BounceRate [decimal](12, 2)
  )

----this is the main table used for processing the relevant log entries
IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#session%' ) 
  DROP TABLE #session
CREATE TABLE #session
  (
   Surrogate INT IDENTITY(1, 1),
   TheDateTime DATETIME,
   URIStem VARCHAR(20),
   TheIP VARCHAR(15),
   session INT DEFAULT 0,
   firstdot INT,
   Seconddot INT,
   Thirddot INT,
   BigIP BIGINT,
   CountryCode VARCHAR(2),
   IsLastHit INT DEFAULT 0,
   IsFirstHit INT
  )

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#countries%' ) 
  DROP TABLE #countries
CREATE TABLE #Countries
  (
   surrogate_id INT IDENTITY(1, 1),
   Previous_IP BIGINT DEFAULT 0,
   IP_start BIGINT CONSTRAINT PK_Goal PRIMARY KEY CLUSTERED,
   countryCode VARCHAR(2)
  )

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#aggregate%' ) 
  DROP TABLE #aggregate
CREATE TABLE #aggregate
  (
   [Pagecount] INT,
   [day] DATETIME,
   TheIP VARCHAR(15),
   Session INT,
   LastHit DATETIME,
   FirstHit DATETIME,
   CountryCode VARCHAR(2)
  )

-- INSERT  INTO @timer (WhatWasIt) SELECT  'start'

DECLARE @SessionTimeout INT, --how many seconds to assume new session
  @Session INT, --session number
  @LastDateTime DATETIME, --the date/time of the previous hit
  @ThePreviousIP VARCHAR(15),--the IP of the previous hit
  @firstdot INT, --the place in the IP string of the first dot
  @Seconddot INT,--the place in the IP string of the Second dot
  @Thirddot INT,--the place in the IP string of the third dot
  @bigIP BIGINT,--the int rendering of an IP address
  @IsFirstHit INT


SELECT  @SessionTimeout = 900, @Session = 0,
		@LastDateTime = '1 jan 2000', @ThePreviousIP = ''
--firstly, we get the relevant parts of the log table in a lightweight 
--temporary table
INSERT  INTO #session
		(TheDateTime, TheIP, URIStem)
		SELECT  TheDateTime, TheIP, csuriStem
		FROM    (SELECT CONVERT(DATETIME, datestr + ' ' + Timestr, 120)
						AS TheDateTime,
						LEFT(cip, 15) AS TheIP, csuriStem
				 FROM   iisaudit
				 WHERE  RIGHT(csuriStem, 5) IN ('.aspx',
						'.html') OR RIGHT(csuriStem, 4) IN (
						'.asp', '.htm', '.php')
				) f
		ORDER BY TheIP, TheDateTime
--and work out which seesion everything belongs to. This code relies 
--on a Quirky update

UPDATE  #session
SET     @Session = session = CASE WHEN TheIP <> @ThePreviousIP OR DATEDIFF(Second, @LastDateTime, TheDateTime) > @SessionTimeout
								  THEN @Session + 1
								  ELSE @Session
							 END,
		@isFirstHit = isFirstHit = CASE WHEN TheIP <> @ThePreviousIP OR DATEDIFF(Second, @LastDateTime, TheDateTime) > @SessionTimeout
										THEN 1
										ELSE 0
								   END,
		@LastDateTime = TheDateTime,
		@ThePreviousIP = TheIP,
		@firstdot = firstdot = CHARINDEX('.', TheIP, 1),
		@Seconddot = Seconddot = CHARINDEX('.', TheIP,
										   @firstdot + 1),
		@Thirddot = Thirddot = CHARINDEX('.', TheIP,
										 @SecondDot + 1),
		@BigIP = BigIP = CONVERT(BIGINT, 16777216) * (LEFT(TheIP, @FirstDot - 1)) + CONVERT(BIGINT, 65536) * (SUBSTRING(TheIP, @FirstDot + 1, @SecondDot - @FirstDot - 1)) + CONVERT(BIGINT, 256) * (SUBSTRING(TheIP, @SecondDot + 1, @ThirdDot - @SecondDot - 1)) + CONVERT(BIGINT, SUBSTRING(TheIP, @ThirdDot + 1, 20))

-- To lookup the geolocation data for an IP A.B.C.D, you need to find the row in [dbo].[ip_group_country] 
-- that has the highest [ip_start] value that is less than or equal to:
-- (A * 16777216) + (B * 65536) + (C * 256) + D
-- INSERT  INTO @timer (WhatWasIt) SELECT  'start loading countries'
--now we stock a country table in order to work out where the IP is from

INSERT  INTO #Countries
		(IP_Start, CountryCode)
		SELECT  ip_Start, Country_Code
		FROM    ip_group_country
		ORDER BY ip_start
--now we chamelessly do a quirky update to get the
-- bigint version of the previous IP address
-- INSERT  INTO @timer (WhatWasIt) SELECT  'update countries'

SELECT  @BigIP = 0
UPDATE  #countries
SET     @BigIP = Previous_IP = @BigIP,
		@BigIP = IP_Start

--create index IPStartPreviousIndex on #countries(Previous_IP,IP_Start)--
--create index BigIP_CoverTheDateTime on #Session(BigIP,TheDateTime)

--we only want to do this once as it is costly so we put
--it in a temporary table
-- INSERT  INTO @timer (WhatWasIt)  SELECT  'do the aggregate'

INSERT  INTO #aggregate
		SELECT  COUNT(*),
				DATEADD(d, DATEDIFF(d, 0, [TheDateTime]), 0)
				AS day, TheIP, session, MAX(TheDateTime),
				MIN(TheDateTime), c.countryCode
		FROM    #session s INNER JOIN #countries c
				  ON BigIP BETWEEN Previous_IP AND IP_Start
		GROUP BY TheIP, session,
				DATEADD(d, DATEDIFF(d, 0, [TheDateTime]), 0),
				c.countryCode

-- INSERT  INTO @timer (WhatWasIt) SELECT  'find last hits'
UPDATE  #session
SET     islasthit = 1
FROM    #session s INNER JOIN #aggregate
		  ON lastHit = theDateTime
-- INSERT  INTO @timer (WhatWasIt) SELECT  'report aggregates'
--and now we create the result tables.

INSERT  INTO [#CountrySummary]
		([CountryCode], [DayOnly], [VisitCount], [PageCount],
		 [MaxPagesPerVisit], [AvgPagesPerVisit],
		 [MaxTimeOnSite], [AvgTimeOnSite], goalcount,
		 [BounceRate])
		SELECT  countryCode, day, COUNT(*), SUM([PageCount]),
				MAX([PageCount]), AVG([PageCount] * 1.0),
				MAX(DATEDIFF(s, [FirstHit], [LastHit])),
				AVG(DATEDIFF(s, [FirstHit], [LastHit])), 0,
				SUM(CASE WHEN [PageCount] = 1 THEN 1
						 ELSE 0
					END) * 100.0 / (COUNT(*) * 1.0)
		FROM    #aggregate
		GROUP BY [CountryCode], day

INSERT  INTO [#CountryOverallSummary]
		([CountryCode], [VisitCount], [PageCount],
		 [MaxPagesPerVisit], [AvgPagesPerVisit],
		 [MaxTimeOnSite], [AvgTimeOnSite], [GoalCount],
		 [BounceRate])
		SELECT  countryCode, COUNT(*), SUM([PageCount]),
				MAX([PageCount]), AVG([PageCount] * 1.0),
				MAX(DATEDIFF(s, [FirstHit], [LastHit])),
				AVG(DATEDIFF(s, [FirstHit], [LastHit])), 0,
				SUM(CASE WHEN [PageCount] = 1 THEN 1
						 ELSE 0
					END) * 100.0 / (COUNT(*) * 1.0)
		FROM    #aggregate
		GROUP BY [CountryCode] 

Select * from [#CountryOverallSummary]

Select * from [#CountrySummary]

SELECT  uriStem,
		DATEADD(d, DATEDIFF(d, 0, [TheDateTime]), 0) AS PageDay,
		COUNT(*) AS pagecount,
		CONVERT([decimal](12, 2), SUM(isFirstHit) * 100.0 / COUNT(*))
		AS [EntryRate],
		CONVERT([decimal](12, 2), SUM(isLastHit) * 100.0 / COUNT(*))
		AS [ExitRate],
		CONVERT([decimal](12, 2), SUM(CASE WHEN isFirstHit + isLastHit = 2
										   THEN 1
										   ELSE 0
									  END) * 100.0 / COUNT(*))
		AS [BounceRate]
FROM    #session
GROUP BY uriStem,
		DATEADD(d, DATEDIFF(d, 0, [TheDateTime]), 0)
ORDER BY PageDay, uristem

/*-- INSERT  INTO @timer  (WhatWasIt)  SELECT  'end'
SELECT  * FROM    @timer*/

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#CountrySummary%' ) 
  DROP TABLE #CountrySummary

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#CountryOverallSummary%' ) 
  DROP TABLE #CountryOverallSummary

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#session%' ) 
  DROP TABLE #session

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#countries%' ) 
  DROP TABLE #countries

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#aggregate%' ) 
  DROP TABLE #aggregate
more ▼

answered Feb 16 '10 at 08:45 PM

Phil Factor gravatar image

Phil Factor
3.2k 8 9 14

This runs in about 5 seconds on my machine which is hugely impressive. The Goal URL score could be put in the aggregation stage so it wouldn't be too much of an additional overhead.

Feb 17 '10 at 10:18 AM Ian Roke

Thanks for that. I haven't done any work in optimizing it yet so there is a potential for shaving quite a bit off the timings. Last time I checked, the main report was exactly the same as Reggie's but the country reports were quite different.

Feb 17 '10 at 01:41 PM Phil Factor

Phil - do you want me to split this into setup, main & teardown and add deletes from the temporary table to the beginning of main? Also with the results - could well be the ordering issue, because there are multiple page requests from the same IP in the same second, which idiot me didn't notice previously... I thought of adding an IDENTITY to the set after bulk loading - what do you think?

Feb 17 '10 at 02:19 PM Matt Whitfield ♦♦

As this is the first cut, I'd say do the timings on the code as it is if is less trouble. We only want rough timings at this stage.

I'd say leave the import table as is now. I reckon it is good practice to take data in as near as dammit to what it was like when it was in the import file. It is then part of the task to cope with things like this. This is why I put in the surrogate key into the session table. The problem is finding a primary key that will actually add value. It doesn't help to use the surrogate, of course.

Feb 17 '10 at 03:45 PM Phil Factor

I did the algorithm to find the source of the IP address using excel and even a pocket calculator, and I still get different results to Reggie, so I must have got my understanding of the process entirely wrong.

Feb 17 '10 at 03:45 PM Phil Factor
(comments are locked)
10|1200 characters needed characters left

Dave Ver 1.b - I think the results are now spot on ,bold claim i know

The Setup

Create unique clustered index idxCtryIp on ip_group_country (Ip_start) 
go
Create index idxIpVisitTime on IISAudit (Cip,DateStr,TimeStr)
go
create function SplitIp(@IpAddress char(20))
returns table
as
return(
with cteposn1
as(
select CHARINDEX('.',@ipaddress) as pos1
),
cteposn2
as
(
select pos1,CHARINDEX('.',@ipaddress,pos1+1) as pos2
 from cteposn1
),
cteposn3
as
(
select pos1,pos2,CHARINDEX('.',@ipaddress,pos2+1) as pos3
 from cteposn2
),
cteposn4
as
(
select pos1,pos2,pos3
  from cteposn3
)

select cast(LEFT(@ipaddress,pos1-1) as bigint) * 16777216 +
       cast(SUBSTRING(@ipaddress,pos1+1,(pos2-pos1)-1) as integer) *  65536 +
       cast(SUBSTRING(@ipaddress,pos2+1,(pos3-pos2)-1) as integer) * 256 +
       cast(SUBSTRING(@ipaddress,pos3+1,(LEN(@ipaddress)-pos3))  as integer) as IpNum   
  from cteposn4
)

go
create Function isValidExt(@UriStem varchar(255))
returns table
as
return
(

with ctePos

as
(
   Select right(@UriStem,4) as rght
)
,
cteExt
as
(
Select case when left(rght,1) = '.' then
          substring(rght,2,4) else
          rght end as ext
 from ctePos
)
Select 1 as ValidExt
  from cteExt
  join (Select 'asp' as tExt union all
        Select 'aspx' union all
        Select 'htm' union all
        select 'html' union all
        select 'php') as tExt on cteExt.ext = tExt.tExt
)

go

The Teardown

drop index ip_group_country.idxCtryIp
drop index idxIpVisitTime.IISAudit
Drop function SplitIp
drop Function isValidExt

And the execution code

set nocount on 

CREATE TABLE #goalURLs (csUriStem [varchar](255))
INSERT INTO [#goalURLs] ([csUriStem])
VALUES ('ASPNETPage29.aspx')
INSERT INTO [#goalURLs] ([csUriStem])
VALUES ('ASPNETPage9.aspx')




Create TAble #VisitLog
(
IpAddress char(20),
DateStr  char(10),
TimeStr  Char(10),
DateTime dateTime,
IpVisit   integer,
GoalCount integer,
Uri       varchar(255),
LogId     integer 
)
Create unique clustered index visitlogidx on #visitlog(IpAddress,DateTime,LogId)
insert into #VisitLog(IpAddress,DateStr,TimeStr,DateTime,ipVisit,GoalCount,Uri,LogId)
select IIS.cIp,DateStr,Timestr,
       convert(datetime,DateStr +' '+TimeStr+'.000' ) as SinceDTime,0,
       case when #goalURLs.csUriStem is not null then  1 else 0 end,
       IIS.csUriStem,
       ROW_NUMBER() over (partition by IIS.cIp order by DateStr,Timestr)
from IISAudit IIS
cross apply dbo.isValidExt(csUriStem)
left join #goalURLs
    on #goalURLs.csUriStem = IIS.csUriStem

Declare @SessionTimeOut integer
Declare @VisitCount integer
Declare @SessionTimesOutWhen datetime
Declare @IpOn                varchar(20)
Declare @VisitView           integer
Select @IpOn = '0.0'
select @VisitCount =0
Declare @LVisit integer
Select @SessionTimeOut = 900


update #VisitLog
   set @VisitCount = case when @IpOn <> #VisitLog.IpAddress then 0 
                          when #VisitLog.DateTime  > @SessionTimesOutWhen then @VisitCount+1
                          else @VisitCount  end,
       @SessionTimesOutWhen =  DATEADD(s,@SessionTimeOut,#VisitLog.DateTime),
       @VisitCount = case when @VisitCount =0 then 1 else @VisitCount end,

       IpVisit = @VisitCount,
       @IpOn=  #VisitLog.IpAddress
 from  #VisitLog
 OPTION (MAXDOP 1)

Create index uri on #visitlog(uri,DateTime)

Create Table #VisitSummary
(
IpAddress char(20) not null,
CountryCode char(2) not null,
SessionStart datetime not null,
SessionEnd Datetime not null,
VisitCount integer not null,
Pages      decimal(6,2) not null,
GoalCount  integer not null,
SessionEntryPage varchar(255) not null,
SessionExitPage varchar(255) not null
)
Create unique Clustered index idxVisitSummary on #VisitSummary(IpAddress,SessionStart)

;with ctePageSum
as
(
 select ipAddress,DateTime,IpVisit,GoalCount,
        Uri,
        LogId,
        ROW_NUMBER() over (partition by ipAddress,IpVisit order by LogId) as StartRowN,
        ROW_NUMBER() over (partition by ipAddress,IpVisit order by LogId desc) as EndRowN
  from #VisitLog
)
insert into #VisitSummary(IpAddress,SessionStart,SessionEnd,VisitCount,Pages,GoalCount,SessionEntryPage,SessionExitPage,CountryCode)
Select ipAddress,
       min(case when StartRowN =1 then DateTime end),
       min(case when EndRowN =1 then DateTime end),
       IpVisit,
       count(*),
       SUM(GoalCount),
       min(case when StartRowN =1 then Uri end),
       min(case when EndRowN =1 then Uri end),
       ctry.country_code

  from ctePageSum   
  cross apply dbo.SplitIp(ctePageSum.IpAddress) nIp
  cross apply ( Select top(1) country_code
                 from ip_group_country 
                where ip_start  < = nIp.IpNum
                order by ip_start desc)  as Ctry

 group by ipAddress,IpVisit, nIp.IpNum,Ctry.country_code;



;with cteResults
as
(
select countrycode , 
       cast (DATEDIFF(DAY,0,SessionStart) AS datetime) as DayOnly,
       count(*)   as VisitCount,
       sum(#VisitSummary.Pages) as PageCount,
       max(#VisitSummary.Pages) as MaxPagesPerVisit,
       round(Avg(#VisitSummary.Pages),2) as AvgPagesPerVisit,
       max(datediff(SECOND,SessionStart,SessionEnd)) as MaxTimeOnSite,
       round(avg(datediff(SECOND,SessionStart,SessionEnd)),2) as AvgTimeOnSite,
       SUM(#VisitSummary.GoalCount) as GoalCount,
       sum(case when #VisitSummary.Pages = 1 then 1 else 0 end) BounceCount
  from #VisitSummary
group by countrycode,DATEDIFF(DAY,0,SessionStart)
)
Select CountryCode,dayOnly,visitcount,pagecount,maxPagesPerVisit,AvgPagesPerVisit,
       MaxTimeOnSite,AvgTimeOnSite,GoalCount,
       round(case when BounceCount =0 then 0 else 100.0/(VisitCount/cast(BounceCount as decimal(15,5))) end ,2) as BounceRate
 from  cteResults
order by 2,1

;with cteResults
as
(
select countrycode , 
       count(*)   as VisitCount,
       sum(#VisitSummary.Pages) as PageCount,
       max(#VisitSummary.Pages) as MaxPagesPerVisit,
       round(Avg(#VisitSummary.Pages),2) as AvgPagesPerCount,
       max(datediff(second,SessionStart,SessionEnd)) as MaxTimeOnSite,
       avg(datediff(second,SessionStart,SessionEnd)) as AvgTimeOnSite,
       SUM(#VisitSummary.GoalCount) as GoalCount,
       sum(case when #VisitSummary.Pages = 1 then 1 else 0 end) BounceCount
  from #VisitSummary
group by countrycode
)
Select CountryCode,visitcount,pagecount,maxPagesPerVisit,AvgPagesPerCount,
       MaxTimeOnSite,AvgTimeOnSite,GoalCount,
       round(case when BounceCount =0 then 0 else 100.0/(VisitCount/cast(BounceCount as decimal(15,5)))end ,2) as BounceRate

 from  cteResults
order by 1


;with cteEntry
as
(
select DATEDIFF(DAY,0,SessionStart) as SessDate,
       SessionEntryPage,
       count(*) as EntryCount,
       SUM(case when SessionExitPage = SessionEntryPage then 1 else 0 end) as BounceCount
  from #VisitSummary
 group by DATEDIFF(DAY,0,SessionStart),
       SessionEntryPage
),
cteExit
as
(
select DATEDIFF(DAY,0,SessionStart) as SessDate,
       SessionExitPage,
       count(*) as ExitCount
  from #VisitSummary
 group by DATEDIFF(DAY,0,SessionStart),
       SessionExitPage
),
cteVisit
as
( 
select DATEDIFF(DAY,0,DateTime) as VisitDate,
       uri,
       count(*) as VisitCount
  from #VisitLog
 group by DATEDIFF(DAY,0,DateTime),Uri
),
ctePageSumm
as
(
Select VisitDate,
       uri,
       cast(isNull(EntryCount,0) as decimal(10,5)) as EntryCount,
       cast(isNull(ExitCount,0)  as decimal(10,5)) as Exitcount,
       cast(isNull(VisitCount,0) as decimal(10,5)) as VisitCount,
       cast(ISNULL(BounceCount,0) as decimal(10,5)) as BounceCount
  from cteVisit 
  left join cteEntry
    on cteEntry.SessDate  = cteVisit.VisitDate
   and cteEntry.SessionEntryPage = cteVisit.uri
  left join cteExit
    on cteExit.SessDate  = cteVisit.VisitDate
   and cteExit.SessionExitPage = cteVisit.uri
)
Select uri   as UriStem,
       CAST(VisitDate AS datetime) as PageDay,
       VisitCount AS PageCount,
       EntryCount,
       round(case when EntryCount =0 then 0 else 100.0 / (VisitCount / EntryCount  ) end,2)  as EntryRate,
       round(case when ExitCount =0 then 0 else 100.0 / (VisitCount / ExitCount  ) end ,2)   as ExitRate,
       round(case when BounceCount =0 then 0 else 100.0 / (VisitCount / BounceCount  ) end,2) as BounceRate
 from ctePageSumm
order by 1,2

Drop TAble #goalURLs
Drop TAble #VisitLog
Drop Table #VisitSummary
more ▼

answered Feb 24 '10 at 07:42 PM

dave ballantyne gravatar image

dave ballantyne
918 1 1 4

(comments are locked)
10|1200 characters needed characters left

Peso 3a, almost there. In this solution there are no "quirkyness" at all. Resultset 1 and 2 are perfect match, but resultset #3 still need some work.

SET NOCOUNT ON

CREATE TABLE    #Goal
        (
            theURL VARCHAR(255) COLLATE FINNISH_SWEDISH_CI_AS,
            CONSTRAINT PK_Goal PRIMARY KEY CLUSTERED
            (
                theURL
            )
        )

INSERT  #Goal
    (
        theURL
    )
VALUES  ('ASPNETPage29.aspx'),
    ('ASPNETPage9.aspx')

DECLARE @SessionTimeout INT = 900

CREATE TABLE    #Source
        (
            cIP VARCHAR(15) NOT NULL,
            PageTime DATETIME NOT NULL,
            recID INT NOT NULL,
            CONSTRAINT PK_Source PRIMARY KEY CLUSTERED
            (
                cIP,
                PageTime,
                recID
            ),
            csUriStem VARCHAR(255) NOT NULL,
            IsGoal INT NOT NULL,
            RowID INT
        )

INSERT      #Source
        (
            cIP,
            PageTime,
            recID,
            csUriStem,
            IsGoal
        )
SELECT      i.cIP,
        i.DateStr + ' ' + i.TimeStr AS PageTime,
        ROW_NUMBER() OVER (PARTITION BY i.cIP ORDER BY i.DateStr, i.TimeStr) AS recID,
        i.csUriStem,
        CASE
            WHEN g.theURL IS NULL THEN 0
            ELSE 1
        END AS IsGoal
FROM        dbo.IISAudit AS i
LEFT JOIN   #Goal AS g ON g.theURL = i.csUriStem
WHERE       PARSENAME(i.csUriStem, 1) IN ('asp', 'aspx', 'htm', 'html', 'php')
        --AND cIP = '82.75.9.215'

DROP TABLE  #Goal

CREATE TABLE    #Lower
        (
            cIP VARCHAR(15) NOT NULL,
            StartTime DATETIME NOT NULL,
            recID INT NOT NULL,
            CONSTRAINT PK_Lower PRIMARY KEY CLUSTERED
            (
                cIP,
                StartTime,
                recID
            ),
            csUriStem VARCHAR(255) NOT NULL
        )

INSERT      #Lower
        (
            cIP,
            StartTime,
            recID,
            csUriStem
        )
SELECT      s.cIP,
        s.PageTime AS StartTime,
        ROW_NUMBER() OVER (PARTITION BY s.cIP ORDER BY s.PageTime) AS recID,
        s.csUriStem
FROM        #Source AS s
LEFT HASH JOIN  #Source AS t ON t.cIP = s.cIP
            AND t.recID = s.recID - 1
WHERE       DATEADD(SECOND, @SessionTimeout, t.PageTime) < s.PageTime
        OR t.cIP IS NULL

CREATE TABLE    #Upper
        (
            cIP VARCHAR(15) NOT NULL,
            EndTime DATETIME NOT NULL,
            recID INT NOT NULL,
            CONSTRAINT PK_Upper PRIMARY KEY CLUSTERED
            (
                cIP,
                EndTime,
                recID
            ),
            csUriStem VARCHAR(255) NOT NULL
        )

INSERT      #Upper
        (
            cIP,
            EndTime,
            recID,
            csUriStem
        )
SELECT      s.cIP,
        DATEADD(SECOND, @SessionTimeout, s.PageTime) AS EndTime,
        ROW_NUMBER() OVER (PARTITION BY s.cIP ORDER BY s.PageTime) AS recID,
        s.csUriStem
FROM        #Source AS s
LEFT HASH JOIN  #Source AS t ON t.cIP = s.cIP
            AND t.recID = s.recID + 1
WHERE       DATEADD(SECOND, @SessionTimeout, s.PageTime) < t.PageTime
        OR t.cIP IS NULL

CREATE TABLE    #Result
        (
            RowID INT IDENTITY(1, 1) NOT NULL,
            cIP VARCHAR(15) NOT NULL,
            StartTime DATETIME NOT NULL,
            EndTime DATETIME NOT NULL,
            CONSTRAINT PK_Result PRIMARY KEY CLUSTERED
            (
                cIP,
                StartTime,
                EndTime
            ),
            DayOnly DATETIME NOT NULL,
            TimeOnSite INT NOT NULL,
            firstURL VARCHAR(255) NOT NULL,
            lastURL VARCHAR(255) NOT NULL,
            IsGoal INT,
            [PageCount] INT,
            CountryCode CHAR(2)
        )

INSERT      #Result
        (
            cIP,
            StartTime,
            EndTime,
            DayOnly,
            TimeOnSite,
            firstURL,
            lastURL
        )
SELECT      l.cIP,
        l.StartTime,
        u.EndTime,
        DATEDIFF(DAY, 0, l.StartTime) AS DayOnly,
        DATEDIFF(SECOND, l.StartTime, u.EndTime) - 900 AS TimeOnSite,
        l.csUriStem AS firstURL,
        u.csUriStem AS lastURL
FROM        #Lower AS l
INNER HASH JOIN #Upper AS u ON u.cIP = l.cIP
WHERE       l.recID = u.recID
ORDER BY    l.cIP,
        l.recID

DROP TABLE  #Lower,
        #Upper

UPDATE      s
SET     s.RowID = r.RowID
FROM        #Result AS r
INNER JOIN  #Source AS s ON s.PageTime BETWEEN r.StartTime AND r.EndTime
WHERE       s.cIP = r.cIP

UPDATE      r
SET     r.IsGoal = w.IsGoal,
        r.[PageCount] = w.[PageCount]
FROM        #Result AS r
INNER JOIN  (
            SELECT      RowID,
                    SUM(IsGoal) AS IsGoal,
                    COUNT(*) AS [PageCount]
            FROM        #Source
            GROUP BY    RowID
        ) AS w ON w.RowID = r.RowID

CREATE TABLE    #Lookup
        (
            cIP VARCHAR(15) NOT NULL,
            IP BIGINT,
            CountryCode VARCHAR(2)
        )

INSERT      #Lookup
        (
            cIP
        )
SELECT      cIP
FROM        #Result
GROUP BY    cIP

UPDATE  #Lookup
SET IP = CAST(16777216 AS BIGINT) * PARSENAME(cIP, 4) + 65536 * PARSENAME(cIP, 3) + 256 * PARSENAME(cIP, 2) + PARSENAME(cIP, 1)

UPDATE  l
SET l.CountryCode = (SELECT TOP(1) x.Country_Code FROM ip_group_country AS x WHERE x.ip_start <= l.IP ORDER BY x.ip_start DESC)
FROM    #Lookup AS l

UPDATE      r
SET     r.CountryCode = l.CountryCode
FROM        #Result AS r
INNER JOIN  #Lookup AS l ON l.cIP = r.cIP

DROP TABLE  #Lookup

-- 1
SELECT      CountryCode,
        DayOnly,
        COUNT(*) AS VisitCount,
        SUM([PageCount]) AS [PageCount],
        MAX([PageCount]) AS MaxPagesPerVisit,
        CAST(1.0E * SUM([PageCount]) / COUNT(*) AS DECIMAL(12, 2)) AS AvgPagesPerVisit,
        MAX(TimeOnSite) AS MaxTimeOnSite,
        SUM(TimeOnSite) / COUNT(*) AS AvgTimeOnSite,
        SUM(isGoal) AS GoalCount,
        CAST(100.0E * SUM(CASE WHEN [PageCount] = 1 THEN 1 ELSE 0 END) / COUNT(*) AS DECIMAL(12, 2)) AS BounceRate
FROM        #Result
GROUP BY    CountryCode,
        DayOnly
ORDER BY    DayOnly,
        CountryCode

-- 2
SELECT      CountryCode,
        COUNT(*) AS VisitCount,
        SUM([PageCount]) AS [PageCount],
        MAX([PageCount]) AS MaxPagesPerVisit,
        CAST(1.0E * SUM([PageCount]) / COUNT(*) AS DECIMAL(12, 2)) AS AvgPagesPerVisit,
        MAX(TimeOnSite) AS MaxTimeOnSite,
        SUM(TimeOnSite) / COUNT(*) AS AvgTimeOnSite,
        SUM(isGoal) AS GoalCount,
        CAST(100.0E * SUM(CASE WHEN [PageCount] = 1 THEN 1 ELSE 0 END) / COUNT(*) AS DECIMAL(12, 2)) AS BounceRate
FROM        #Result
GROUP BY    CountryCode
ORDER BY    CountryCode

-- 3
SELECT      uriStem,
        PageDay,
        SUM([PageCount]) AS [PageCount],
        CAST(100.0E * SUM(IsEntry) / SUM([PageCount]) AS DECIMAL(12, 2)) AS EntryRate,
        CAST(100.0E * SUM(IsExit) / SUM([PageCount]) AS DECIMAL(12, 2)) AS ExitRate,
        CAST(100.0E * SUM(IsBounce) / SUM([PageCount]) AS DECIMAL(12, 2)) AS BounceRate
FROM        (
            SELECT  firstURL AS uriStem,
                DayOnly AS PageDay,
                1 AS IsEntry,
                0 AS IsExit,
                CASE firstURL
                    WHEN lastURL THEN 1
                    ELSE 0
                END AS IsBounce,
                0 AS [PageCount]
            FROM    #Result

            UNION ALL

            SELECT  lastURL AS uriStem,
                DayOnly AS PageDay,
                0 AS IsEntry,
                1 AS IsExit,
                0 AS IsBounce,
                0 AS [PageCount]
            FROM    #Result

            UNION ALL

            SELECT  csUriStem AS uriStem,
                DATEDIFF(DAY, 0, PageTime) AS PageDay,
                0 AS IsEntry,
                0 AS IsExit,
                0 AS IsBounce,
                1 AS [PageCount]
            FROM    #Source
        ) AS d
GROUP BY    PageDay,
        uriStem
ORDER BY    PageDay,
        uriStem

DROP TABLE  #Result,
        #Source
more ▼

answered Feb 25 '10 at 07:51 PM

Peso gravatar image

Peso
1.6k 4 6 8

Do you want me to add this to the harness or wait until the match is there? I'll do another timing run tomorrow (just off to bed)

Feb 25 '10 at 07:52 PM Matt Whitfield ♦♦

Add it, so I can see if I am going the right direction.

Feb 26 '10 at 08:01 AM Peso

It's now up there...

Feb 26 '10 at 05:47 PM Matt Whitfield ♦♦

And it's almost valid, or just plain invalid? ;-)

Feb 26 '10 at 06:36 PM Peso

I'm not sure actually - I was so tired yesterday I just ran it quickly and went to sleep! But it looks like 2c is doing to clinch it... unless someone submits an amazingly fast and correct solution in the next 4 minutes

Feb 27 '10 at 08:57 PM Matt Whitfield ♦♦
(comments are locked)
10|1200 characters needed characters left
/*Phil1b --runs in 2.5 secs on my machine not including set up and tear down. 
Again forgot the code for the goal stuff. It is a lot more accurate this time, and
where I've checked it against pesos and Reggie's code, mine is more accurate than 
either.

DECLARE @timer    TABLE
  (
   Time_ID INT IDENTITY(1, 1),
   whatWasIt VARCHAR(30),
   whenWasIt DATETIME DEFAULT GETDATE()
  ) */

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#CountrySummary%' ) 
  DROP TABLE #CountrySummary
CREATE TABLE #CountrySummary
  (
   CountryCode [varchar](2),
   DayOnly [datetime],
   VisitCount [int],
   [PageCount] [int],
   MaxPagesPerVisit [int],
   AvgPagesPerVisit [decimal](12, 2),
   MaxTimeOnSite [int],
   AvgTimeOnSite [int],
   GoalCount [int],
   BounceRate [decimal](12, 2)
  )

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#CountryOverallSummary%' ) 
  DROP TABLE #CountryOverallSummary
CREATE TABLE #CountryOverallSummary
  (
   CountryCode [varchar](2),
   VisitCount [int],
   [PageCount] [int],
   MaxPagesPerVisit [int],
   AvgPagesPerVisit [decimal](12, 2),
   MaxTimeOnSite [int],
   AvgTimeOnSite [int],
   GoalCount [int],
   BounceRate [decimal](12, 2)
  )

----this is the main table used for processing the relevant log entries
IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#session%' ) 
  DROP TABLE #session
/*CREATE TABLE #session
  (
   Surrogate INT IDENTITY(1, 1),
   TheDay int,
   TheDateTime DATETIME,
   URIStem VARCHAR(20),
   TheIP VARCHAR(15),
   session INT DEFAULT 0,
   firstdot INT,
   Seconddot INT,
   Thirddot INT,
   BigIP BIGINT,
   IsLastHit INT DEFAULT 0,
   IsFirstHit INT
  )*/

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#countries%' ) 
  DROP TABLE #countries
CREATE TABLE #Countries
  (
   surrogate_id INT IDENTITY(1, 1),
   Next_IP BIGINT DEFAULT 0,
   IP_start BIGINT,
   countryCode VARCHAR(2),
   CONSTRAINT idxClusterd PRIMARY KEY (IP_Start DESC)
  )


IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#aggregate%' ) 
  DROP TABLE #aggregate
CREATE TABLE #aggregate
  (
   [Pagecount] INT,
   [day] INT,
   TheIP VARCHAR(15),
   bigIP BIGINT,
   Session INT,
   LastHit DATETIME,
   FirstHit DATETIME,
   LastSessionID INT,
   CountryCode VARCHAR(2)
  )

-- INSERT  INTO @timer (WhatWasIt) SELECT  'start'

DECLARE @SessionTimeout INT, --how many seconds to assume new session
  @Session INT, --session number
  @LastDateTime DATETIME, --the date/time of the previous hit
  @ThePreviousIP VARCHAR(15),--the IP of the previous hit
  @firstdot INT, --the place in the IP string of the first dot
  @Seconddot INT,--the place in the IP string of the Second dot
  @Thirddot INT,--the place in the IP string of the third dot
  @bigIP BIGINT,--the int rendering of an IP address
  @IsFirstHit INT,
  @NextCountry VARCHAR(2)


SELECT  @SessionTimeout = 900, @Session = 0,
		@LastDateTime = '1 jan 2000', @ThePreviousIP = ''
--firstly, we get the relevant parts of the log table in a lightweight 
--temporary table
SELECT  IDENTITY( INT,1, 1 ) AS Session_ID,
		CONVERT(INT, 0) AS TheDay, TheDateTime,
		CONVERT (VARCHAR(20), csURIStem) AS URIStem,
		CONVERT(VARCHAR(15), TheIP) AS TheIP, 0 AS session,
		0 AS firstdot, 0 AS Seconddot, 0 AS Thirddot,
		CONVERT(BIGINT, 0) AS BigIP, 0 AS IsLastHit,
		0 AS IsFirstHit,
		CONVERT(VARCHAR(2), ' ') AS countrycode
INTO    #session
FROM    (SELECT CONVERT(DATETIME, datestr + ' ' + Timestr, 120)
				AS TheDateTime, LEFT(cip, 15) AS TheIP,
				csuriStem
		 FROM   iisaudit
		 WHERE  RIGHT(csuriStem, 5) IN ('.aspx', '.html') OR RIGHT(csuriStem, 4) IN (
				'.asp', '.htm', '.php')
		) f
ORDER BY TheIP, TheDateTime
--and work out which seesion everything belongs to. This code relies 
--on a Quirky update

UPDATE  #session
SET     @Session = session = CASE WHEN TheIP <> @ThePreviousIP OR DATEDIFF(Second, @LastDateTime, TheDateTime) > @SessionTimeout
								  THEN @Session + 1
								  ELSE @Session
							 END,
		TheDay = DATEDIFF(d, 0, [TheDateTime]),
		@isFirstHit = isFirstHit = CASE WHEN TheIP <> @ThePreviousIP OR DATEDIFF(Second, @LastDateTime, TheDateTime) > @SessionTimeout
										THEN 1
										ELSE 0
								   END,
		@LastDateTime = TheDateTime,
		@ThePreviousIP = TheIP,
		@firstdot = firstdot = CHARINDEX('.', TheIP, 1),
		@Seconddot = Seconddot = CHARINDEX('.', TheIP,
										   @firstdot + 1),
		@Thirddot = Thirddot = CHARINDEX('.', TheIP,
										 @SecondDot + 1),
		@BigIP = BigIP = CONVERT(BIGINT, 16777216) * (LEFT(TheIP, @FirstDot - 1)) + CONVERT(BIGINT, 65536) * (SUBSTRING(TheIP, @FirstDot + 1, @SecondDot - @FirstDot - 1)) + CONVERT(BIGINT, 256) * (SUBSTRING(TheIP, @SecondDot + 1, @ThirdDot - @SecondDot - 1)) + CONVERT(BIGINT, SUBSTRING(TheIP, @ThirdDot + 1, 20))

-- To lookup the geolocation data for an IP A.B.C.D, you need to find the row in [dbo].[ip_group_country] 
-- that has the highest [ip_start] value that is less than or equal to:
-- (A * 16777216) + (B * 65536) + (C * 256) + D
-- INSERT  INTO @timer (WhatWasIt) SELECT  'start loading countries'
--now we stock a country table in order to work out where the IP is from

INSERT  INTO #Countries
		(IP_Start, CountryCode)
		SELECT  ip_Start, Country_Code
		FROM    ip_group_country

--now we chamelessly do a quirky update to get the
-- bigint version of the previous IP address
-- INSERT  INTO @timer (WhatWasIt) SELECT  'update countries'

SELECT  @BigIP = 0, @NextCountry = ''
UPDATE  #countries
SET     @BigIP = Next_IP = @BigIP - 1,
		@BigIP = IP_Start

--we only want to do this once as it is costly so we put
--it in a temporary table
-- INSERT  INTO @timer (WhatWasIt)  SELECT  'do the aggregate'
INSERT  INTO #aggregate
		SELECT  COUNT(*), Theday, MAX(TheIP), MAX(BigIP),
				session, MAX(TheDateTime), MIN(TheDateTime),
				MAX(Session_ID), 0--, countryCode
		FROM    #session
		GROUP BY session, TheDay
--, countryCode
--now do the countrycode
UPDATE  #aggregate
SET     countrycode = c.countrycode
FROM    #aggregate s INNER JOIN #countries c
		  ON BigIP BETWEEN IP_Start AND Next_IP

-- INSERT  INTO @timer (WhatWasIt) SELECT  'find last hits'
UPDATE  #session
SET     islasthit = 1
FROM    #session s INNER JOIN #aggregate
		  ON session_ID = LastSessionID
-- INSERT  INTO @timer (WhatWasIt) SELECT  'report aggregates'
--and now we create the result tables.

/*INSERT  INTO [#CountrySummary]
		([CountryCode], [DayOnly], [VisitCount], [PageCount],
		 [MaxPagesPerVisit], [AvgPagesPerVisit],
		 [MaxTimeOnSite], [AvgTimeOnSite], goalcount,
		 [BounceRate])*/
SELECT  countryCode, DATEADD(d, day, 0), COUNT(*),
		SUM([PageCount]), MAX([PageCount]),
		AVG([PageCount] * 1.0),
		MAX(DATEDIFF(s, [FirstHit], [LastHit])),
		AVG(DATEDIFF(s, [FirstHit], [LastHit])), 0,
		SUM(CASE WHEN [PageCount] = 1 THEN 1
				 ELSE 0
			END) * 100.0 / (COUNT(*) * 1.0)
FROM    #aggregate
GROUP BY [CountryCode], day
ORDER BY countrycode, day

/*INSERT  INTO [#CountryOverallSummary]
		([CountryCode], [VisitCount], [PageCount],
		 [MaxPagesPerVisit], [AvgPagesPerVisit],
		 [MaxTimeOnSite], [AvgTimeOnSite], [GoalCount],
		 [BounceRate])*/
SELECT  countryCode, COUNT(*), SUM([PageCount]),
		MAX([PageCount]), AVG([PageCount] * 1.0),
		MAX(DATEDIFF(s, [FirstHit], [LastHit])),
		AVG(DATEDIFF(s, [FirstHit], [LastHit])), 0,
		SUM(CASE WHEN [PageCount] = 1 THEN 1
				 ELSE 0
			END) * 100.0 / (COUNT(*) * 1.0)
FROM    #aggregate
GROUP BY [CountryCode]
ORDER BY countrycode


--Select * from [#CountrySummary]

SELECT  uriStem,
		DATEADD(d, DATEDIFF(d, 0, [TheDateTime]), 0) AS PageDay,
		COUNT(*) AS pagecount,
		CONVERT([decimal](12, 2), SUM(isFirstHit) * 100.0 / COUNT(*))
		AS [EntryRate],
		CONVERT([decimal](12, 2), SUM(isLastHit) * 100.0 / COUNT(*))
		AS [ExitRate],
		CONVERT([decimal](12, 2), SUM(CASE WHEN isFirstHit + isLastHit = 2
										   THEN 1
										   ELSE 0
									  END) * 100.0 / COUNT(*))
		AS [BounceRate]
FROM    #session
GROUP BY uriStem,
		DATEADD(d, DATEDIFF(d, 0, [TheDateTime]), 0)
ORDER BY PageDay, uristem


-- INSERT  INTO @timer  (WhatWasIt)  SELECT  'end'
/*SELECT  t.WhatWasIt, DATEDIFF(ms, t.WhenWasIt, y.WhenWasIt)
FROM    @timer t INNER JOIN @timer y
		  ON y.time_ID = t.time_ID + 1
UNION ALL
SELECT  'total',
		DATEDIFF(ms,
				 (SELECT WhenWasIt FROM @timer WHERE WhatWasIt
						  = 'start'
				 ),
				 (SELECT WhenWasIt FROM @timer WHERE WhatWasIt
						  = 'end'
				 )) 
    */

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#CountrySummary%' ) 
  DROP TABLE #CountrySummary

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#CountryOverallSummary%' ) 
  DROP TABLE #CountryOverallSummary

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#session%' ) 
  DROP TABLE #session

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#countries%' ) 
  DROP TABLE #countries

IF EXISTS ( SELECT  *
			FROM    tempDB.information_schema.tables
			WHERE   Table_name LIKE '#aggregate%' ) 
  DROP TABLE #aggregate
more ▼

answered Feb 20 '10 at 08:16 PM

Phil Factor gravatar image

Phil Factor
3.2k 8 9 14

This is definitely closer - There's a few foibles now (for example, country 'BA' in the country summary shows 9 visits in your results, and it should be 8), but definitely much closer! Also the ordering in your country by day summary should be day, country and not country, day... But this is definitely the closest :)

Feb 21 '10 at 12:50 PM Matt Whitfield ♦♦

But is Phil allowed to enter his own competition? ;-p

Feb 21 '10 at 03:25 PM Ian Roke

Of course! :) At the rate he's going, he's going to be the winner!

Feb 21 '10 at 04:23 PM Matt Whitfield ♦♦

What is the correct ORDER BY clause for Reggies code, so we can compare our results? I get parallellism all over the places so it's hard to know.

Feb 21 '10 at 05:50 PM Peso

C'mon you others. I'm feeling lonely. Isn't anyone else going to enter the competition?

Feb 21 '10 at 07:15 PM Phil Factor
(comments are locked)
10|1200 characters needed characters left
Your answer
toggle preview:

Up to 2 attachments (including images) can be used with a maximum of 524.3 kB each and 1.0 MB total.

Follow this question

By Email:

Once you sign in you will be able to subscribe for any updates here

By RSS:

Answers

Answers and Comments



Facebook logo Follow Ask SSC on Facebook
Find Ask SSC on Google+
linkedin logo Find us on LinkedIn

Topics:

x913
x272
x14
x8
x7

asked: Feb 11 '10 at 10:49 AM

Seen: 4600 times

Last Updated: Feb 27 '10 at 09:00 PM

Copyright © 2002-2012 Simple Talk Publishing. All Rights Reserved. If you have any queries, please contact the site administrators.
Ask SQL Server Central is a community service provided by Red Gate.