This time I will take the opportunity to talk about how to calculate the covariance function. Not only because I recently answered a question on a forum how to write and calculate the formula in SQL Server, but also since it is interesting in terms of performance. The general formula for calculating the covariance value is
Now, let’s test this formula in SQL Server. You can use this small sample set.
Translating the formula above into T-SQL is simply made of returning the x value, the average x value, the y value and the average y value, so the code is trivial to write and it look like this.
This, of course, returns the correct value of 5.2 to you, but rather slow. When running on a slightly larger sample set of 2,500 rows the performance starts to degrade dramatically. This is due to the 3 Table Spool operators in the execution plan. When running the code on the larger set of sample rows, I get this output from SET STATISTICS ON
As you can see, the 3 Table Spools creates a very high amount of reads.
Now my priority was to try to get rid of all Table Spools and get a much better execution plan.
The first task was to rewrite the formula. I noticed I could expand and collapse the formula to get a nicer looking one. Doing that, I got this equivalent algorithm instead
Writing the T-SQL code for this formula is actually easier to write, than the code for the first formula!
Now there is no need for a CTE or a derived table. Running this code instead on the larger sample set, returns this data from SET STATISTICS IO ON.
As you can see, there is now minimum CPU overhead and very little duration. Running both solutions on a much larger sample set of 6,250,000 rows returns this data from SET STATISTICS IO ON.