question

GPO avatar image
GPO asked

Can this linear regression algorithm for SQL Server 2K5 be made reusable?

The following block of code calculates the formula for a trend line using linear regression (method of least squares). It takes as its input two columns of data (i.e. a table) representing the x and y axes of a chart. It's output could be a single row (3 cols) of data representing the slope and y-intercept of the trendline, and the correlation coefficient. Is there a way in SQL 2K5 to create an in-line table-valued function, or a stored procedure that can take a table of x and y coordinates as a parameter? Is there some other way to make this code reusable. Otherwise whenever I want a trendline in the reporting client (SSRS 2008 R2) I'm going to have to copy/paste/hack this code into my work. Supplementary questions: Are there any gotchas in using the float data type here? Often the x axis will be datetime (eg week starting on mm/dd/yyyy). Are there any gotchas in converting a datetime to a numeric? --======= ================================================================================================= --======= Author: (originally) http://stackoverflow.com/users/92092/stephan See: --======= http://stackoverflow.com/questions/2536895/are-there-any-linear-regression-function-in-sql-server --======= Date: Mar 29 2010 09:58 --======= Purpose: Linear regression (method of least squares) --======= Comments: --======= Revisions: WHO WHEN WHAT --======= GPO 20121216 Simplified query (well... kinda) to limit calcs to one --======= regression equation per input dataset. --======= ================================================================================================= SET NOCOUNT ON --------- test data IF OBJECT_ID('tempdb..#some_table') IS NOT NULL DROP TABLE #some_table; SELECT 0 as sourceID, 1.2 as x, 1.0 as y INTO #some_table UNION ALL SELECT 1, 1.6, 1 UNION ALL SELECT 2, 2.0, 1.5 UNION ALL SELECT 3, 2.0, 1.75 UNION ALL SELECT 4, 2.1, 1.85 UNION ALL SELECT 5, 2.1, 2 UNION ALL SELECT 6, 2.2, 3 UNION ALL SELECT 7, 2.2, 3 UNION ALL SELECT 8, 2.3, 3.5 UNION ALL SELECT 9, 2.4, 4 UNION ALL SELECT 10, 2.5, 4 UNION ALL SELECT 11, 3, 4.5; --======= ================================================================================================= --======= linear regression query --======= Get average x==================================================================================== DECLARE @xbar as float; SET @xbar = ( SELECT avg(x) FROM #some_table ); --======= Get average y==================================================================================== DECLARE @ybar as float; SET @ybar = ( SELECT avg(y) FROM #some_table ); --======= Get beta (slope) estimate======================================================================== DECLARE @Beta as float; SET @Beta = ( SELECT SUM((x-@xbar)*(y-@ybar)) / --nullif to stop divided by zero nullif(SUM((x-@xbar)*(x-@xbar)),0) FROM #some_table pd ); --======= Get alpha (constant) estimate==================================================================== DECLARE @Alpha as float; SET @Alpha = @ybar - @xbar * @Beta; --======= Get Total Sum of Squares========================================================================= DECLARE @SS_tot as float; SET @SS_tot = ( SELECT SUM((y-@ybar)*(y-@ybar)) FROM #some_table ); --======= Get Total Sum of Squares due to Error============================================================ DECLARE @SS_err as float; SET @SS_err = ( SELECT SUM((y-(@Alpha+@Beta*x)) * (y-(@Alpha+@Beta*x))) FROM #some_table ); --======= Get r-squared (the correlation coefficient)====================================================== DECLARE @r_squared as float; SET @r_squared = CASE WHEN @SS_tot = 0 THEN 1.0 ELSE 1.0 - (@SS_err / @SS_tot) END; --======= Joining back to the source data allows the plotting of of the trend line along with the usual --======= plotting of x against y========================================================================== SELECT sourceID ,x ,y ,@Beta * x + @Alpha as y_trend --------- the final output from the sproc/iltvf could be a single row holding the following three values. ,@r_squared as r_squared ,@Alpha as Alpha ,@Beta as Beta FROM #some_table;
sql-server-2005stored-proceduresssrs-2008-r2inline-table-valued-functionlinear-regression
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

scott2718281828 avatar image
scott2718281828 answered
,Arrays can be passed as parameters in SQL 2005 by converting them to character strings, either comma-separated lists or as XML. A minimalist XML version of your test data would look like: Your linear regression algorithm as an inline table-valued function that accepts this parameter could be: CREATE FUNCTION dbo.LinearReqression (@Data AS XML) RETURNS TABLE AS RETURN ( WITH Array AS ( SELECT x = n.value('@x', 'float'), y = n.value('@y', 'float') FROM @Data.nodes('/r/n') v(n) ), Medians AS ( SELECT xbar = AVG(x), ybar = AVG(y) FROM Array ), BetaCalc AS ( SELECT Beta = SUM(xdelta * (y - ybar)) / NULLIF(SUM(xdelta * xdelta), 0) FROM Array CROSS JOIN Medians CROSS APPLY ( SELECT xdelta = (x - xbar) ) xd ), AlphaCalc AS ( SELECT Alpha = ybar - xbar * beta FROM Medians CROSS JOIN BetaCalc), SSCalc AS ( SELECT SS_tot = SUM((y - ybar) * (y - ybar)), SS_err = SUM((y - (Alpha + Beta * x)) * (y - (Alpha + Beta * x))) FROM Array CROSS JOIN Medians CROSS JOIN AlphaCalc CROSS JOIN BetaCalc ) SELECT r_squared = CASE WHEN SS_tot = 0 THEN 1.0 ELSE 1.0 - ( SS_err / SS_tot ) END, Alpha, Beta FROM AlphaCalc CROSS JOIN BetaCalc CROSS JOIN SSCalc ) You can't code your algorithm in a simple query because you have to define intermediate variables, and SQL scoping rules don't allow this. The multiple CTEs and occasional CROSS APPLY let you get around that limitation. Now you just need to know how to turn your data sets into the expected XML structure. DECLARE @DataTable TABLE ( SourceID INT, x FLOAT, y FLOAT ) ; INSERT INTO @DataTable ( SourceID, x, y ) SELECT ID = 0, x = 1.2, y = 1.0 UNION ALL SELECT 1, 1.6, 1 UNION ALL SELECT 2, 2.0, 1.5 UNION ALL SELECT 3, 2.0, 1.75 UNION ALL SELECT 4, 2.1, 1.85 UNION ALL SELECT 5, 2.1, 2 UNION ALL SELECT 6, 2.2, 3 UNION ALL SELECT 7, 2.2, 3 UNION ALL SELECT 8, 2.3, 3.5 UNION ALL SELECT 9, 2.4, 4 UNION ALL SELECT 10, 2.5, 4 UNION ALL SELECT 11, 3, 4.5 ; -- Create and view XML data array DECLARE @DataXML XML ; SET @DataXML = ( SELECT -- FLOAT values are formatted in XML like "1.000000000000000e+000", increasing the character count -- Converting them to VARCHAR first keeps the XML small without sacrificing precision -- They are unpacked as FLOAT in the function either way [@x] = CAST(x AS VARCHAR(20)), [@y] = CAST(y AS VARCHAR(20)) FROM @DataTable FOR XML PATH('n'), ROOT('r') ) ; SELECT @DataXML ; -- Get the results SELECT * FROM dbo.LinearReqression (@DataXML) ; If you have a large dataset with multiple sets of x and y data points associated with some attribute, you can do a linear regression on each set with one query. DECLARE @MultiDataSet TABLE ( PK INT IDENTITY NOT NULL PRIMARY KEY CLUSTERED, Attribute VARCHAR(50) NOT NULL, x FLOAT, y FLOAT ) ; -- Get four different subsets of your test data, with the help of spt_values INSERT INTO @MultiDataSet (Attribute, x, y) SELECT v.name, d.x, d.y FROM master.dbo.spt_values v INNER JOIN @DataTable d ON d.SourceID - v.number BETWEEN 0 AND 5 WHERE v.type = 'E' ; -- Build a separate XML array for each Attribute, and call the function WITH AttrList AS ( SELECT DISTINCT Attribute FROM @MultiDataSet ) SELECT AttrList.Attribute, r_squared, Alpha, Beta FROM AttrList CROSS APPLY ( SELECT DataXML = ( SELECT [@x] = CAST(x AS VARCHAR(20)), [@y] = CAST(y AS VARCHAR(20)) FROM @MultiDataSet WHERE Attribute = AttrList.Attribute FOR XML PATH('n'), ROOT('r'), TYPE ) ) x CROSS APPLY ( SELECT r_squared, Alpha, Beta FROM dbo.LinearReqression (x.DataXML) ) lr The XML was constructed with attributes rather than elements because it is more compact, it usually is parsed more efficiently, and elements are much more likely (than attributes) to have NULL numeric values come back as 0 after a round trip XML conversion.
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

robbin avatar image
robbin answered
how about these [link text][1]Are there any Linear Regression Function in SQL Server? [link text][2]linear regression The first link is closest to what you want. [1]: http://stackoverflow.com/questions/2536895/are-there-any-linear-regression-function-in-sql-server [2]: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=21948
1 comment
10 |1200 characters needed characters left characters exceeded

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.

Hi Robbin Thanks for you answer. The links you suggest suffer from exactly the same problem as the above block of code (in fact the link in the header of my code is the same as the one you've posted). My problem is that in SS2K5 I can't see a way to wrap them in an ILTVF or a stored procedure to make the code reusable. This is because you need to pass in a table. I was thinking I could (a) put the x and y values in a permanent table and use newid() to create a unique identifier for that block of x and y values, then (b) pass the newid() value into the sproc then (c) do the calcs, then (d) delete those rows based on the newid() value once the r-squared, slope and intercept are determined. That's getting pretty messy though. There has to be a more robust way. Surely.
0 Likes 0 ·

Write an Answer

Hint: Notify or tag a user in this post by typing @username.

Up to 2 attachments (including images) can be used with a maximum of 512.0 KiB each and 1.0 MiB total.