Re: Transparent compression in the FS

From: Mike Fedyk
Date: Sun Oct 26 2003 - 21:23:14 EST


On Thu, Oct 16, 2003 at 04:04:48PM -0700, jw schultz wrote:
> 2. External collision: The smaller the hash-size-to-unit-count
> ratio, the greater the likelihood of false positives. To put
> it in the extreme: if you have 10,000 blocks you are
> almost guaranteed to get false positives on a 16-bit
> hash. This is similar to filling the namespace. It is
> external collision that was killing rsync on iso images.
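
As a back-of-the-envelope check on that claim (my arithmetic, not from the
quoted post): a 16-bit hash has 2^16 = 65,536 possible values, and 10,000
blocks form about 10,000 * 9,999 / 2 ~= 5 x 10^7 pairs, so the birthday
bound predicts roughly 5 x 10^7 / 65,536 ~= 760 colliding pairs. False
positives really are all but guaranteed at that scale.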

What about making a one-bit change in the block at a random location (the
same bit on both sides) and comparing the hashes again?

This would work for something that is trying to save bandwidth over a
network link, but it wouldn't save anything in a compare-by-hash file system
(why hash again when you need to read the block anyway?).
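
A minimal sketch of the bit-flip recheck idea (Python, purely illustrative;
the choice of adler32/SHA1 and the way the shared bit position gets agreed
on are my assumptions, not anything rsync actually does):

import hashlib
import zlib

def hashes(block):
    """Weak checksum plus a strong hash for one block."""
    return zlib.adler32(block), hashlib.sha1(block).digest()

def recheck_with_bit_flip(block, bit_index):
    """Flip one agreed-upon bit (same position on both ends) and rehash.

    If two different blocks happened to collide on the original hashes,
    it is extremely unlikely they still collide after the flip.
    bit_index must be less than len(block) * 8.
    """
    flipped = bytearray(block)
    flipped[bit_index // 8] ^= 1 << (bit_index % 8)
    return hashes(bytes(flipped))

Both ends would run recheck_with_bit_flip() on their own copy of the block
with the same bit_index (derived, say, from a shared seed) and compare only
the resulting digests, so no extra block data crosses the wire.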

One thought for rsync would be a four-step process (the first two are how
it operates now, according to another sub-thread; a sketch of the whole
thing follows the list):

1) Check the block with a weak 16-bit hash

2) Check the block with a stronger 32-bit hash (if step 1's hashes matched,
i.e. a possible collision)

3) Check the block with SHA1 (if step 2's hashes matched as well)

4) Make a one-bit change at a random location in the block, at the same
place on both ends of the transfer, and repeat steps 1 to 3
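
Roughly what I have in mind, as a sketch (Python; adler32 and crc32 stand
in for the 16-bit and 32-bit checksums, the digest field names are made up,
rsync's real rolling checksum and block handling obviously differ, and step
4 here only rechecks SHA1 on the flipped block rather than repeating all
three stages, for brevity):

import hashlib
import zlib

def weak16(block):
    return zlib.adler32(block) & 0xffff      # step 1: cheap 16-bit check

def weak32(block):
    return zlib.crc32(block) & 0xffffffff    # step 2: 32-bit check

def strong(block):
    return hashlib.sha1(block).digest()      # step 3: SHA1

def blocks_match(local, remote, bit_index):
    """Decide whether the local block matches the remote one.

    `remote` is a dict of the same digests computed on the other end.
    Each stage only runs if the previous, cheaper one matched, and the
    final stage (step 4) flips the same bit on both sides and compares
    SHA1 once more to rule out a lingering collision.
    """
    if weak16(local) != remote['w16']:
        return False
    if weak32(local) != remote['w32']:
        return False
    if strong(local) != remote['sha1']:
        return False
    flipped = bytearray(local)
    flipped[bit_index // 8] ^= 1 << (bit_index % 8)
    return strong(bytes(flipped)) == remote['sha1_flipped']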

It's arguable how much of an optimization the 16-bit hash is; it's there,
and it may or may not be a win. It's also arguable how much bandwidth the
extra step 3 and the recursive step 4 would actually save. Maybe step 4
should just be a recheck with SHA1, with no recursive check needed, but I
haven't done any hash benchmarks.
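
For whoever wants to get a feel for the relative costs, a trivial timing
sketch (Python's zlib/hashlib rather than rsync's own checksum code, so the
numbers are only a rough indication):

import hashlib
import os
import time
import zlib

BLOCK = os.urandom(4096)   # one 4 KiB block of random data
ROUNDS = 100000

def bench(name, fn):
    start = time.perf_counter()
    for _ in range(ROUNDS):
        fn(BLOCK)
    elapsed = time.perf_counter() - start
    print("%-8s %.3fs for %d blocks" % (name, elapsed, ROUNDS))

bench("adler32", zlib.adler32)                        # weak checksum stand-in
bench("crc32", zlib.crc32)                            # 32-bit stand-in
bench("sha1", lambda b: hashlib.sha1(b).digest())     # strong hash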