[svn.haxx.se] · SVN Dev · SVN Users · SVN Org · TSVN Dev · TSVN Users · Subclipse Dev · Subclipse Users · this month's index

Checksums vs. Hashes (was Re: Linux Kernel Summit)

From: Deven T. Corzine <deven_at_ties.org>
Date: 2001-04-02 20:39:04 CEST

On Mon, 2 Apr 2001, Greg Stein wrote:

> 2) MD5 is a hash algorithm rather than a checksum
> (although, now that I think of it, is any checksum algorithm anything but
> a hash? mebbe GregH can clarify the hash vs checksum terminology)

There's no reason not to use MD5 or SHA "hash algorithms" as checksums.

Any function with a smaller range than domain qualifies as a hash function,
even the useless "hash" function that returns a constant value. Thus, any
checksum algorithm qualifies as a hash function. Whether or not a given
hash function might be described as a "checksum algorithm" depends on what
the intended purpose of the function is, and how well it detects errors.

A checksum is generally designed to preserve data integrity by detecting
*accidental* corruption of the data. That's why the simplest checksum
algorithms just add together the bytes (tossing or wrapping the overflow)
and "check the sum" against the known value. Of course, CRC-32 became much
more common, because it did a better job of catching more errors than a
simplistic checksum. But they're all hash functions.

There are hash functions that would do a poor job of detecting corrupted
data, and hence would never be described as "checksum" algorithms. Suppose
you have a simplistic checksum routine that gives you an 8-bit additive
checksum. Then divide that 8-bit number by 11 (a prime number) and take
the remainder (modulo operation). The result is only in the range of 0-10,
and would have a 9% chance of returning the same result for corrupted data.
This would be a truly lousy checksum algorithm, but it might be acceptable
to select hash buckets in a small, simplistic hash table implementation --
it's also a hash function, but nobody in their right mind would call it a
"checksum" algorithm, simply because it would perform that role so poorly!

MD5, however, is designed to be a cryptographically-secure hash algorithm,
which means it is designed to detect *malicious* corruption of data as well
as accidental corruption. This is much more difficult, and there is always
some risk of weaknesses in the algorithm, as with encryption algorithms.
In fact, there are some potential weaknesses in MD5 that have been found,
but it hasn't been completely broken. I believe that SHA is considered
more secure than MD5 these days. A typical feature of cryptographically-
secure hash alrogithms is that even tiny changes in the original data will
often change a large proportion of the bits in the hash result. (Also,
cryptographically-secure hash algorithms often use 128 bits or more for the
result, while checksum algorithms more commonly use 32 bits or less.)

Since cryptographically-secure hash algorithms are generally used as a tool
to "fingerprint" data, the uses are a bit more broad than the traditional
error-checking purpose of a checksum, which is probably why they're not
generally called "checksum algorithms". Nevertheless, they should actually
serve quite well as checksums. The main downside of using MD5 (128 bits),
SHA (160 bits) or a similar algorithm, is the additional space needed
beyond traditional checksums like CRC-32 (32 bits). Also, analysis of the
algorithms is more likely to have focused on resistance to attack rather
than error-checking efficacy, but their larger size should make them more
effective than they would be with fewer bits.

MD5 or SHA should make a fine checksum, and may protect against malicious
changes as well. (SHA may be slightly preferable for that.)

Deven
Received on Sat Oct 21 14:36:27 2006

This is an archived mail posted to the Subversion Dev mailing list.

This site is subject to the Apache Privacy Policy and the Apache Public Forum Archive Policy.