CHECKSUM


The idea for this article started when someone got the same CHECKSUM value for the string “bQ” and “aa”. Then, just a few hours later, someone else reported a collision for the string “ABA” and “ACQ”. They both used CHECKSUM for creating hash-values for their data.
However, the collisions made me curious and I decided to investigate the internals of CHECKSUM. Armed with a paper and pencil I decided to do some basic check. As always, I use a prime number for the investigation and this time I decided to go for lower case “a” (ascii value 97, a prime number).

I wrote the following query

And I got the following result

And here we have very interesting findings!

  1. It seems there is a cycle of 16 characters
  2. All values up to seven characters are 16 times greater than the previous value

16 times greater (in binary) is a value shifted 4 times. Also, shifting a value 4 bits for 8 times, is 4 bytes (INT32) which also is the return data type of CHECKSUM.

With this in mind, I dissected all the values above into factors of 16 and offset of 97 (the prime number I used).

My assumption was correct! Every value up to seven characters in length follows the same rule of 16 to the power of length, times the character value. But for the 8th character, something happens.

It’s time to investigate on a bit level what is going on. Diagramming the 36 bits used for the calculation, reveals an interesting thing. As seen in this image it is evident that when the value overflows an INT32, the last 4 bits is also XOR’ed to the result.

And that is all. We know how the internals work. The last piece of the puzzle was the “overflow” thing.

Now, when we know how CHECKSUM is calculated, it’s easy to write a function to emulate the built-in function in SQL, and also accept IMAGE datatype. Original CHECKSUM does not work on IMAGE or TEXT/NTEXT/XML.

With this algorithm, you could easyily write a SQLCLR replacement as well.

CREATE FUNCTION dbo.fnBinaryChecksum
(
        @Data VARBINARY(MAX)
)
RETURNS INT
AS
BEGIN
        DECLARE @Index INT = 1,
                @MaxIndex INT = DATALENGTH(@Data),
                @SUM BIGINT = 0,
                @Overflow TINYINT;

        WHILE @Index <= @MaxIndex
                SELECT  @SUM = (16 * @SUM) ^ SUBSTRING(@Data, @Index, 1),
                        @Overflow = @SUM / 4294967296,
                        @SUM = @SUM - @Overflow * 4294967296,
                        @SUM = @SUM ^ @Overflow,
                        @Index = @Index + 1;

        IF @SUM > 2147483647
                SET @SUM = @SUM - 4294967296;
        ELSE IF @SUM BETWEEN 32768 AND 65535
                SET @SUM = @SUM - 65536;
        ELSE IF @SUM BETWEEN 128 AND 255
                SET @SUM = @SUM - 256;

        RETURN @SUM;
END