Re: [OT] Redundancy eliminating file systems, breaking MD5, donating money to OSDL
From: Bart Samwel
Date: Wed Jan 21 2004 - 06:48:21 EST
Matthias Schniedermeyer wrote:
> On Sat, Jan 17, 2004 at 02:15:31PM +0100, Bart Samwel wrote:
> [...]
>> Let's take a look at the chances. 30 terabytes is, in a best-case
>> scenario (with 512-byte blocks), about 6e10 blocks. The chance of a
>> collision within one machine would then be roughly
>> 6e10 * 6e10 * 2^(-128), or about 1e-17. With a hundred million machines,
>> the chance of a collision would be about 1e-9 -- disregarding the fact
>> that all these machines are likely to contain similar blocks: their data
>> isn't truly random, so some blocks have a larger chance of occurring
>> than others. The data sets on the machines are probably reasonably
>> static, so if the collision isn't found *at once*, the chances of it
>> occurring later are much smaller. So, even under these most
>> collision-friendly assumptions, with a hundred million machines holding
>> 30 terabytes of storage each, it's extremely probable that you won't
>> find a collision. (A 96-bit hash could have been broken with this
>> setup, however. :) )
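(A quick back-of-the-envelope check of the numbers quoted above, in
Python; the 512-byte block size and the hundred-million-machine count are
the assumptions from that text, and the collision estimate is the same
naive n * n / N product used there:)

    # Sanity check of the collision estimates quoted above.
    N128 = 2**128                     # number of possible 128-bit digests
    blocks = 30e12 / 512              # 30 TB in 512-byte blocks, ~6e10
    machines = 1e8                    # a hundred million machines

    p_machine = blocks * blocks / N128
    print(p_machine)                  # ~1e-17 per machine
    print(p_machine * machines)       # ~1e-9 across all machines

    # The same setup against a 96-bit hash:
    print(blocks * blocks / 2**96 * machines)   # ~4.3: collisions expected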
> There is one fundamental braino in the discussion.
> Only HALF the bits are for preventing "accidental" collisions (the
> "birthday" thing). The rest is for preventing someone from "brute
> forcing" an input that produces the same MD5. (*)
From RFC-1321:
"It is conjectured that the difficulty of coming up with two messages
having the same message digest is on the order of 2^64 operations,
and that the difficulty of coming up with any message having a given
message digest is on the order of 2^128 operations."
So, they say it's on the order of 2^64 operations, whatever their
definition of "operation" is. It probably means that they think you need
to take the MD5 of on the order of 2^64 random messages to have a
reasonable chance of finding a duplicate. The RFC says nothing about half
of the bits being for one purpose and the other half for another; in the
algorithm, all bits seem to be processed in the same way.
Let's see where they got their 2^64. If you've got 2^64 messages, then
(with the same logic as above) you've got approximately
2^64 * 2^64 * 2^(-128) = 1 -- that is, a (nominal) 100% chance of a
birthday collision. This seems to support the idea that the kind of
analysis I just did corresponds to the way they look at it.
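(Strictly speaking, the naive product overshoots once it approaches 1;
the standard birthday approximation is p ~= 1 - e^(-n(n-1)/(2N)), which
at n = 2^64 gives about 39% -- the same order of magnitude, so this
reading of the RFC's 2^64 still holds. A quick sketch, assuming an
ideally-distributed 128-bit hash:)

    # Standard birthday-bound approximation for an ideal 128-bit hash.
    import math

    N = 2.0**128                      # size of the digest space

    def p_collision(n):
        """Approximate chance of a collision among n random digests."""
        return -math.expm1(-n * (n - 1) / (2 * N))

    print(p_collision(2.0**63))       # ~0.12
    print(p_collision(2.0**64))       # ~0.39: likely around 2^64 messages
    print(p_collision(2.0**66))       # ~0.9997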
> *: AFAIR I read this in the specs of SHA1 (160 bits), so I guess this
> is also true for MD5.
It might be that you took the "2^64 operations" as meaning "64 bits" (or
2^80 as "80 bits" in the SHA1 case), while it now appears to be the
result of a birthday-collision calculation. Not a surprising
misinterpretation, by the way, because the RFC doesn't specify at all
where this number comes from.
> BTW, I already ran into an MD5 collision(*2) in my life.
> *2: I had a directory of about 1.5 million images and "md5sum"med them
> to eliminate duplicates. The log file, at one point, had the same md5sum
> as one of the pictures.
Wow! Then it might be that MD5 is not such a good hash function after
all. Its security is of course based purely on the hashes of messages
being distributed (close to) uniformly at random. If they really were,
the chance of this happening to you would have been extremely slim (less
than 1e-20). I think the more probable explanation is that MD5 hashes
aren't truly randomly distributed after all. :)
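(To put a number on that 1e-20 bound: with the ~1.5 million files from
the anecdote above and an ideally-distributed 128-bit hash, the same
birthday approximation as before gives roughly 3e-27:)

    # Chance of any two of ~1.5 million ideal 128-bit digests colliding.
    import math

    n = 1.5e6                         # file count from the anecdote above
    N = 2.0**128                      # number of possible MD5 digests
    print(-math.expm1(-n * (n - 1) / (2 * N)))   # ~3.3e-27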
-- Bart