On Mon, 2 Apr 2001, Greg Stein wrote:

*> 2) MD5 is a hash algorithm rather than a checksum
*

*> (although, now that I think of it, is any checksum algorithm anything but
*

*> a hash? mebbe GregH can clarify the hash vs checksum terminology)
*

There's no reason not to use MD5 or SHA "hash algorithms" as checksums.

Any function with a smaller range than domain qualifies as a hash function,

even the useless "hash" function that returns a constant value. Thus, any

checksum algorithm qualifies as a hash function. Whether or not a given

hash function might be described as a "checksum algorithm" depends on what

the intended purpose of the function is, and how well it detects errors.

A checksum is generally designed to preserve data integrity by detecting

*accidental* corruption of the data. That's why the simplest checksum

algorithms just add together the bytes (tossing or wrapping the overflow)

and "check the sum" against the known value. Of course, CRC-32 became much

more common, because it did a better job of catching more errors than a

simplistic checksum. But they're all hash functions.

There are hash functions that would do a poor job of detecting corrupted

data, and hence would never be described as "checksum" algorithms. Suppose

you have a simplistic checksum routine that gives you an 8-bit additive

checksum. Then divide that 8-bit number by 11 (a prime number) and take

the remainder (modulo operation). The result is only in the range of 0-10,

and would have a 9% chance of returning the same result for corrupted data.

This would be a truly lousy checksum algorithm, but it might be acceptable

to select hash buckets in a small, simplistic hash table implementation --

it's also a hash function, but nobody in their right mind would call it a

"checksum" algorithm, simply because it would perform that role so poorly!

MD5, however, is designed to be a cryptographically-secure hash algorithm,

which means it is designed to detect *malicious* corruption of data as well

as accidental corruption. This is much more difficult, and there is always

some risk of weaknesses in the algorithm, as with encryption algorithms.

In fact, there are some potential weaknesses in MD5 that have been found,

but it hasn't been completely broken. I believe that SHA is considered

more secure than MD5 these days. A typical feature of cryptographically-

secure hash alrogithms is that even tiny changes in the original data will

often change a large proportion of the bits in the hash result. (Also,

cryptographically-secure hash algorithms often use 128 bits or more for the

result, while checksum algorithms more commonly use 32 bits or less.)

Since cryptographically-secure hash algorithms are generally used as a tool

to "fingerprint" data, the uses are a bit more broad than the traditional

error-checking purpose of a checksum, which is probably why they're not

generally called "checksum algorithms". Nevertheless, they should actually

serve quite well as checksums. The main downside of using MD5 (128 bits),

SHA (160 bits) or a similar algorithm, is the additional space needed

beyond traditional checksums like CRC-32 (32 bits). Also, analysis of the

algorithms is more likely to have focused on resistance to attack rather

than error-checking efficacy, but their larger size should make them more

effective than they would be with fewer bits.

MD5 or SHA should make a fine checksum, and may protect against malicious

changes as well. (SHA may be slightly preferable for that.)

Deven

Received on Sat Oct 21 14:36:27 2006