Re: Using compression before encryption in device-mapper

From: Guillaume Lacôte
Date: Wed Apr 14 2004 - 10:27:29 EST


On Wednesday 14 April 2004 17:02, Pascal Schmidt wrote:
> On Wed, 14 Apr 2004 16:10:20 +0200, you wrote in linux.kernel:
> > Actually (see my reply to Timothy Miller) I really want to do
> > "compression" even if it does not reduce space: it is a matter of
> > increasing the per-bit entropy rather than gaining space (see
> > http://jsam.sourceforge.net).
>
> How is the per-bit entropy higher when the same amount of data (and
> thus entropy on that data) is sometimes contained in *more* bits?
>
> I can see the argument if data is really compressed, because then more
> bits than would normally fit into, say, a sector, contribute to the
> entropy of the final sector.
You are perfectly right. If I recall correctly from Ming Li and Paul Vitányi,
An Introduction to Kolmogorov Complexity and Its Applications, the length of
an optimal static encoding (e.g. a static two-pass Huffman encoding) lies
between the entropy of the text and one more bit per symbol (asymptotically
this is also the Kolmogorov complexity of the text). In particular the static
encoding cannot be longer than the original text.
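
To make that bound concrete, here is a small sketch (my own illustration, not
part of the original argument): it builds a static Huffman code for a block
and checks that the average code length lies between the empirical entropy H
and H + 1 bits per byte, and never exceeds the 8 bits per byte of the raw
text:

  # Sketch: static Huffman average length vs. empirical entropy.
  import heapq
  from collections import Counter
  from math import log2

  def entropy_bits_per_byte(data):
      counts = Counter(data)
      n = len(data)
      return -sum(c / n * log2(c / n) for c in counts.values())

  def huffman_code_lengths(data):
      # Returns {byte value: code length}; standard bottom-up Huffman merge.
      counts = Counter(data)
      if len(counts) == 1:               # degenerate one-symbol block
          return {next(iter(counts)): 1}
      heap = [(c, i, {sym: 0}) for i, (sym, c) in enumerate(counts.items())]
      heapq.heapify(heap)
      tiebreak = len(heap)               # keeps tuple comparison off the dicts
      while len(heap) > 1:
          f1, _, d1 = heapq.heappop(heap)
          f2, _, d2 = heapq.heappop(heap)
          merged = {s: l + 1 for s, l in (*d1.items(), *d2.items())}
          heapq.heappush(heap, (f1 + f2, tiebreak, merged))
          tiebreak += 1
      return heap[0][2]

  data = b"an example block of text to be written to the device"
  counts = Counter(data)
  lengths = huffman_code_lengths(data)
  avg = sum(counts[s] * l for s, l in lengths.items()) / len(data)
  print("entropy        : %.3f bits/byte" % entropy_bits_per_byte(data))
  print("static Huffman : %.3f bits/byte (never more than 8)" % avg)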

However I do _not_ want to have any form of meta-data, dictionary, etc. Thus
I need to use a dynamic encoding, where the Huffman tree is dynamically
updated (both while compressing and while decompressing) as characters are
read/written. The problem is that the encoding "evolves", and in particular
it can be (and usually is) worse than the static encoding.
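
As an illustration of the dictionary-free idea, here is a naive sketch (my
own toy, a rebuild-from-counts scheme, _not_ the FGK or Vitter algorithm):
encoder and decoder start from the same model (all 256 byte values with count
1) and apply the identical update after every symbol, so no tree or
dictionary ever has to be stored alongside the data:

  # Toy adaptive coder: both sides rebuild the same canonical Huffman
  # code from shared, identically-evolving symbol counts.
  import heapq

  def code_lengths(counts):
      # Huffman code lengths from symbol frequencies.
      heap = [(c, i, {s: 0}) for i, (s, c) in enumerate(counts.items())]
      heapq.heapify(heap)
      tiebreak = len(heap)
      while len(heap) > 1:
          f1, _, d1 = heapq.heappop(heap)
          f2, _, d2 = heapq.heappop(heap)
          merged = {s: l + 1 for s, l in (*d1.items(), *d2.items())}
          heapq.heappush(heap, (f1 + f2, tiebreak, merged))
          tiebreak += 1
      return heap[0][2]

  def canonical_codes(lengths):
      # Deterministic (canonical) codeword assignment so that encoder and
      # decoder derive exactly the same codes from the same lengths.
      codes, code, prev = {}, 0, 0
      for sym in sorted(lengths, key=lambda s: (lengths[s], s)):
          code <<= lengths[sym] - prev
          codes[sym] = format(code, "0%db" % lengths[sym])
          code += 1
          prev = lengths[sym]
      return codes

  def encode(data):
      counts = {b: 1 for b in range(256)}   # shared initial model
      bits = []
      for sym in data:
          bits.append(canonical_codes(code_lengths(counts))[sym])
          counts[sym] += 1                  # model evolves as we write
      return "".join(bits)

  def decode(bits, n):
      counts = {b: 1 for b in range(256)}   # identical initial model
      out, pos = [], 0
      for _ in range(n):
          inv = {c: s for s, c in canonical_codes(code_lengths(counts)).items()}
          end = pos + 1
          while bits[pos:end] not in inv:   # grow until a codeword matches
              end += 1
          out.append(inv[bits[pos:end]])
          counts[out[-1]] += 1              # same update as the encoder
          pos = end
      return bytes(out)

  msg = b"abracadabra"
  assert decode(encode(msg), len(msg)) == msg  # round-trips, no metadata

Rebuilding the whole tree for every symbol is of course far too slow for a
block device; FGK and Vitter maintain the tree incrementally, which is
precisely the point of those papers.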

J. S. Vitter showed that the dynamic algorithm by Faller, Gallager and Knuth
(FGK) can use up to twice as many bits as the optimal static Huffman
encoding, plus one bit per input byte. On the other hand, Vitter gives a
dynamic algorithm that uses at most one more bit per input byte than the
static encoding in the worst case.
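
A back-of-the-envelope check of those bounds as recalled above (S = bits used
by the optimal static Huffman encoding, t = number of input bytes; the block
size and the 6 bits/byte figure are made-up numbers for illustration):

  t = 4096                  # bytes in one block
  S = 6 * t                 # suppose static Huffman averages 6 bits/byte
  print("static Huffman :", S, "bits")
  print("FGK worst case : <=", 2 * S + t, "bits")
  print("Vitter worst   : <=", S + t, "bits (one extra bit per byte)")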

You are right that in this worst case the per-bit entropy will be lower than
in the original text: one extra bit per byte turns 8 bits into 9, so the
per-bit entropy drops by 1 - 1/(1 + 1/8) = 1/9, i.e. about 11%. The point is
that this case (which has nothing to do with whether a text compresses well;
it is the worst _relative_ performance of dynamic versus static encoding)
does not happen "too often".

Note that I wish to prepend random bytes to the block of real text before
compressing and ciphering, so as to make the distribution of Huffman trees
uniform. The ultimate goal is to leave an attacker no option other than
brute-force testing of all possible keys. An indirect consequence of this is
that the probability of hitting the "poor dynamic performance" case does not
depend on the data itself, but only on the random draw. A sketch of this step
follows below.
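
Here is a hypothetical sketch of that whitening step (the prefix length k,
zlib as a stand-in for the dynamic Huffman coder, and AES-CTR via the Python
"cryptography" package are all my own choices for illustration, not the
actual jsam design):

  import os, zlib
  from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

  def seal_block(plaintext, key, nonce, k=16):
      whitened = os.urandom(k) + plaintext  # random prefix randomizes the model
      compressed = zlib.compress(whitened)  # stand-in compressor
      enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
      return enc.update(compressed) + enc.finalize()

  def open_block(sealed, key, nonce, k=16):
      dec = Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor()
      return zlib.decompress(dec.update(sealed) + dec.finalize())[k:]  # drop prefix

  key, nonce = os.urandom(32), os.urandom(16)
  block = b"sector payload"
  assert open_block(seal_block(block, key, nonce), key, nonce) == block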

I hope that I made things clearer and that I didn't make any mistake (please
feel free to correct me).

Guillaume.
