Re: Transparent compression in the FS

From: Jeff Garzik
Date: Thu Oct 16 2003 - 16:03:43 EST


Val Henson wrote:
Abstract:

"Recent research has produced a new and perhaps dangerous technique
for uniquely identifying blocks that I will call
compare-by-hash. Using this technique, we decide whether two blocks
are identical to each other by comparing their hash values, using a
collision-resistant hash such as SHA-1. If the hash values match,
we assume the blocks are identical without further ado. Users of
compare-by-hash argue that this assumption is warranted because the
chance of a hash collision between any two randomly generated blocks
is estimated to be many orders of magnitude smaller than the chance
of many kinds of hardware errors. Further analysis shows that this
approach is not as risk-free as it seems at first glance."


I'm curious if anyone has done any work on using multiple different checksums? For example, the cost of checksumming a single block with multiple algorithms (sha1+md5+crc32 for a crazy example), and storing each checksum (instead of just one sha1 sum), may be faster than reading the block off of disk to compare it with the incoming block. OTOH, there is still a mathematical possibility (however-more-remote) of a collission...

With these sorts of schemes, from basic compare-by-hash to one that actually checks the data for a collission, you take a hit no matter what when writing, but reading is still full-speed-ahead. (if anyone is curious, Plan9's venti uses a multi-GB "write buffer", to mitigate the effect of the checksumming on throughput)

So it's easy to tweak the writing algorithms pretty much any which way, as long as the outcome is a unique tag for each block. Hash collisions between two files, for example, could be resolved easily by making each unique tag "$sha1_hash $n", where $n is the unique number differentiating two blocks with the same SHA1 hash.

Jeff



-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/