Re: Ok, explained.. (was Re: [PATCH] mm: fix page_mkclean_one)

From: Linus Torvalds
Date: Fri Dec 29 2006 - 19:00:27 EST




On Fri, 29 Dec 2006, Theodore Tso wrote:
>
> If we do get this fixed for ext4, one interesting question is whether
> people would accept a patch to backport the fixes to ext3, given the
> the grief this is causing the page I/O and VM routines.

I don't think backporting is the smartest option (unless it's done _way_
later), but the real problem with it isn't actually the VM behaviour, but
simply the fact that cached performance absoluely _sucks_ with the buffer
cache.

With the physically indexed buffer cache thing, you end up always having
to do these complicated translations into block numbers for every single
access, and at some point when I benchmarked it, it was a huge overhead
for doing simple things like readdir.

It's also a major pain for read-ahead, exactly partly due to the high cost
of translation - because you can't cheaply check whether the next block is
there, the cost of even asking the question "should I try to read ahead?"
is much much higher. As a result, read-ahead is seriously limited, because
it's so expensive for the cached case (which is still hopefully the
_common_ case).

So because read-ahead is limited, the non-cached case then _really_ sucks.

It was somewhat fixed in a really god-awful fashion by having
ext3_readdir() actually do _readahead_ though the page cache, even though
it does everything else through the buffer cache. And that just happens to
work because we hopefully have physically contiguous blocks, but when that
isn't true, the readahead doesn't do squat.

It's really quite fundamentally broken. But none of that causes any
problems for the VM, since directories cannot be mmap'ed anyway. But it's
really pitiful, and it really doesn't work very well. Of course, other
filesystems _also_ suck at this, and other operating systems haev even
MORE problems, so people don't always seem to realize how horribly
horribly broken this all is.

I really wish somebody would write a filesystem that did large cold-cache
directories well. Open some horrible file manager on /usr/bin with cold
caches, and weep. The biggest problem is the inode indirection, but at
some point when I looked at why it sucked, it was doing basically
synchronous single-buffer reads on the directory too, because readahead
didn't work properly.

I was hoping that something like SpadFS would actually take off, because
it seemed to do a lot of good design choices (having inodes in-line in the
directory for when there are no hardlinks is probably a requirement for a
good filesystem these days. The separate inode table had its uses, but
indirection in a filesystem really does suck, and stat information is too
important to be indirect unless it absolutely has to).

But I suspect it needs more than somebody who just wants to get his thesis
written ;)

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/