Re: Transparent compression in the FS

From: jw schultz
Date: Sun Oct 26 2003 - 21:46:28 EST


On Sun, Oct 26, 2003 at 06:22:31PM -0800, Mike Fedyk wrote:
> On Thu, Oct 16, 2003 at 04:04:48PM -0700, jw schultz wrote:
> > 2. External collision: The smaller hash size to unit count
> > ratio the greater liklihood of false positives. To put
> > it in the extreme: if you have 10,000 blocks you are
> > almost guaranteed to get false positives on a 16 bit
> > hash. This is similar to filling the namespace. It is
> > external collision that was killing rsync on iso images.
>
> What about making one bit changes in the block at a random location (the
> same bit on both sides), and comparing the hashes again.
>
> This would work for something that is trying to save bandwidth over a
> network link, but wouldn't save anything in a Compare-by-hash File System
> (why hash again when you need to read the block anyway?).
>
> One thought for rsync would be to have a 4 step process (the first two are
> how it operates now according to another sub-thread):
>
> 1) Check the block with a weak 16bit hash
>
> 2) Check the block with a stronger 32bit hash (if step 1 had a collision)
>
> 3) Check the block with SHA1 (if step 2 had a collision)
>
> 4) Make a one bit change at a random location in the block, but at the same
> place on both ends of the transfer and repeat steps 1 to 3
>
> It's arguable how much of an optimization the 16bit hash is, but it's there
> and maybe it's a win, and maybe not. It's also arguable just how much
> bandwidth you're going to save with the extra steps 3, and the recursive 4.
> Maybe 4 should be just a recheck with SHA1, and the recursive check isn't
> needed, but I haven't done any hash benchmarks.

My "extreme" example was for illustration only. 16 bits was
chosen as a size to get the numberspace into a scale where
the issue become obvious.

Rsync uses a minimum of a 48 bit hash per block.

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: jw@xxxxxxxxxx

Remember Cernan and Schmitt
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/