Re: Re: more git updates..

From: Andrea Arcangeli
Date: Tue Apr 12 2005 - 19:00:20 EST


On Tue, Apr 12, 2005 at 02:21:58PM -0700, Linus Torvalds wrote:
> The full .git archive for 199 versions of the kernel (the 2.6.12-rc2 one
> and a test-run of 198 patches from Andrew) is 111MB. In other words,
> adding 198 "full" new kernels only grew the archive by 9MB (that's all
> "actual disk usage" btw - the files themselves are smaller, but since they
> all end up taking up a full disk block..)

reiserfs can do tail packing, and anyway the disk-block overhead is
irrelevant when fetching the data over the network, which is the real
cost to worry about when synchronizing and downloading (disk cost isn't
a big deal).

The pagecache cost sounds like a very minor one too, since you don't
need the whole data in RAM, and not even all dentries need to be in
cache. This is one of the reasons why you don't need to run readdir,
and why you can discard the old trees at any time.

At the rate of 9M for every 198 changeset checkins, that means I'll
have to download 2.7G of _incompressible_ data (i.e. already
compressed, with a bad per-file ratio due to the too-small files) for a
whole archive including all changesets, not counting the original 111MB
tree, with rsync -z of git. That compares with 514M of _compressible_
data with the CVS format on-disk, and with ~79M for the CVS network
download with rsync -z of the CVS repository (assuming the default gzip
compression level).

What BKCVS provided with 79M of rsync -z is now provided with 2.8G of
rsync -z, i.e. a network-bound efficiency loss of about 97.2%. Similar
slowdowns should be expected for synchronizations over time while
fetching new blobs etc...

Ok, BKCVS has fewer than 60000 checkins due to the linearization and
coalescing of pulls that couldn't be represented losslessly in CVS, so
the real network-bound slowdown is somewhat smaller than 97.2%; my math
is approximate, but the order of magnitude should remain the same.

Clearly one can write an ad-hoc network protocol instead of using
rsync/wget, but then the server will need quite a bit of CPU and RAM to
serve a checkout/update/sync efficiently, since it has to unpack all
the data and generate all the changesets to gzip and transfer.

Anyway, git's simplicity and the robustness of its immutable hashes
certainly make it an ideal interim format (and it may even be a very
practical live format for the local disk, except for the backups). I'm
only unsure whether it's a wise idea to build an SCM on top of the
current git format, or whether it's better to use something like SCCS
or CVS to coalesce all the diffs of a single file together, both to
save space and to make rsync -z very efficient (or an approach like
arch and darcs that stores changesets per file, i.e. patches).
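
To illustrate that last point with a toy example (purely synthetic
data, a hypothetical ~10K file, nothing measured on real kernel
sources): storing each revision as its own compressed blob hides the
redundancy between revisions from both zlib and rsync -z, while keeping
the whole per-file history together lets one compression pass exploit
it:

    import zlib, random

    # Fabricate 50 revisions of one hypothetical ~10K source file, each
    # differing from the previous one by a single changed line -- roughly
    # what consecutive checkins of a single file look like.
    random.seed(0)
    lines = ["#define SYM_%04d 0x%08x" % (i, random.getrandbits(32))
             for i in range(400)]
    revisions = []
    for n in range(50):
        lines[random.randrange(len(lines))] = \
            "#define NEW_%04d 0x%08x" % (n, random.getrandbits(32))
        revisions.append(("\n".join(lines) + "\n").encode())

    # git-style: every revision is its own zlib-compressed blob, so the
    # network sees many small, already-compressed, unrelated-looking files.
    per_blob = sum(len(zlib.compress(r, 9)) for r in revisions)

    # SCCS/CVS-style: the whole history of the file is kept in one place,
    # and a single compression pass (roughly what rsync -z gets to work
    # with on a ,v file) exploits the redundancy between revisions.
    coalesced = len(zlib.compress(b"".join(revisions), 9))

    print(per_blob, coalesced)   # the coalesced form comes out far smaller

The redundancy between revisions of the same file is exactly what a
per-blob store throws away before the network transfer even starts.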