Compressions (was Re: Transparent compression in the FS)

From: Jörn Engel
Date: Sat Oct 18 2003 - 11:19:28 EST


(The newline above is deliberate, no need for a flamewar)
In-Reply-To: <3F90175B.2000502@xxxxxxxx>

On Fri, 17 October 2003 11:22:51 -0500, Eli Carter wrote:
> John Bradford wrote:
> >>
> >>Note that a file compressed with bzip2 is not necessarily smaller than
> >>the same file compressed with gzip. (It can be quite a bit larger in
> >>fact.)
> >
> >Have you noticed that with real-life data, or only test cases?
>
> Real-life data. I don't remember the exact details for certain, but as
> best as I can recall: I was dealing with copies of output from build
> logs, telnet sessions, messages files, or the like (i.e. text) that were
> (many,) many MB in size (and probably highly repetitititititive). I
> wound up with a loop that compressed each file into a gzip and a bzip2,
> compared the sizes, and killed the larger. There were a number of .gz's
> that won. (I have also read that gzip is better at text compression
> whereas bzip2 is better at binary compression. No, I don't remember the
> source.)

Simple example:

dd if=/dev/urandom of=foo count=1 bs=31k
for i in `seq 10`; do cat foo >> bar; done
cp bar foo
time gzip foo
time bzip2 bar
ls -l foo.gz bar.bz2

Notice how bzip2 runs ~80 times longer and output remains ~60% bigger.
And now try this:

time bzip2 foo.gz
ls -l foo.gz.bz2 bar.bz2

How wonderful, bzip2 runs fast again and the result is even smaller.
:)


While this example is constructed, it demonstrates nicely where bzip2
sucks. It also demonstrates how simply bzip2 could be improved.
Anyone up for a fun hack?

Jörn

--
Ninety percent of everything is crap.
-- Sturgeon's Law
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/