Re: [RFC] Per-inode metadata cache.

Andrea Arcangeli (andrea@suse.de)
Mon, 18 Oct 1999 20:28:53 +0200 (CEST)


On Mon, 18 Oct 1999, Alexander Viro wrote:

>? The same way you are doing it with pagecache.

I think if I had already understood it I wouldn't be asking here ;).

>WTF would we _need_ to know? Think about it as about memory caching. You
>can cache by virtual address and you can cache by physical address. And
>since we have no aliasing here... Andrea, consider the thing as VM. You
>have virtual addresses (offsets in file for data, offsets in directory for
>directory contents, beginning of covered range for indirect blocks, etc.)
>and you have physical address (offset on a disk). You have mapping of VA
>to PA (->bmap(), ->get_block()) (context being the file). Old scheme was
>equivalent to caching by PA and recalculation of mapping on each use. New
>scheme gives caching by context+VA and provides a working TLB. We already
>have it for data. For metadata we are still caching by PA. And getting the
>nasty pile of it with cache coherency. Would you like to work with MMU of
>such architecture?

So far so good; that much I already know. What I can't see is how you want
to efficiently enforce coherency by moving the dirty metadata elsewhere.
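
Just to fix the terminology (this is a userspace toy, not kernel code, and
all the names are invented): caching by PA means the key is only the
physical block number, while caching by context+VA means the key is the
inode plus the logical offset, so a lookup never needs the
->bmap()/->get_block() translation.

#include <stdio.h>

#define NR_BUCKETS 64

/* old scheme: the key is the physical block only ("PA") */
static unsigned int hash_by_pa(unsigned long phys_block)
{
        return phys_block % NR_BUCKETS;
}

/*
 * new scheme: the key is the context (the inode) plus the logical
 * offset in the file ("context + VA"), no translation needed on lookup
 */
static unsigned int hash_by_va(unsigned long inode_nr,
                               unsigned long logical_block)
{
        return (inode_nr * 37 + logical_block) % NR_BUCKETS;
}

int main(void)
{
        /* made up numbers: block 3 of inode 12 lives in disk block 2048 */
        printf("PA bucket %u, VA bucket %u\n",
               hash_by_pa(2048), hash_by_va(12, 3));
        return 0;
}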

>> Right now you know only that the stale buffer can't come from inode
>> metadata, as we have the race-prone slowww trick of looping in polling mode
>> inside ext2_truncate until we're sure bforget will run in hard mode.
>
>And? Don't use bread() on metadata. It should never enter the buffer hash,
>just as buffer_heads of _data_ never enter it. Leave bread() for
>statically allocated stuff.

Hmm, I think the rewrite is so invasive that I can't see it following the
current code. I have to think about a partial rewrite of ext2 to make it
work in my mind without bread, using the buffer cache only to allow
bdflush/kupdate/sync to do their work.
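
(To be clear about what "never enter the buffer hash" would mean in
practice, here is another standalone sketch with invented names, not the
real code: a bread()-style buffer is inserted in the global hash and is
reachable by everybody via its physical block, a private metadata buffer
is reachable only through its owner.)

#include <stdio.h>
#include <stdlib.h>

#define NR_BUCKETS 64

/* a toy buffer head: only the fields needed for the example */
struct toy_bh {
        unsigned long   b_blocknr;      /* physical block */
        struct toy_bh   *b_next;        /* global hash chain */
        char            data[512];
};

static struct toy_bh *hash_table[NR_BUCKETS];

/* bread()-style: allocate and make it visible in the global hash */
static struct toy_bh *shared_getblk(unsigned long block)
{
        struct toy_bh *bh = calloc(1, sizeof(*bh));
        unsigned int idx = block % NR_BUCKETS;

        bh->b_blocknr = block;
        bh->b_next = hash_table[idx];
        hash_table[idx] = bh;
        return bh;
}

/* private metadata buffer: never inserted, only the owner keeps a pointer */
static struct toy_bh *private_getblk(unsigned long block)
{
        struct toy_bh *bh = calloc(1, sizeof(*bh));

        bh->b_blocknr = block;
        return bh;
}

static struct toy_bh *lookup(unsigned long block)
{
        struct toy_bh *bh;

        for (bh = hash_table[block % NR_BUCKETS]; bh; bh = bh->b_next)
                if (bh->b_blocknr == block)
                        return bh;
        return NULL;
}

int main(void)
{
        shared_getblk(100);
        private_getblk(200);
        printf("block 100 %s, block 200 %s\n",
               lookup(100) ? "found" : "missing",
               lookup(200) ? "found" : "missing");
        return 0;
}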

>> And even if you know such information you would have to do a linear search
>> in the list, which is slower than a hash lookup anyway (or you could start
>> slowly unmapping all the other stale buffers, even the ones you are not
>> interested in, so you may block more than necessary). Doing a per-object
>> hash is not an option IMHO.
>
>You are kinda late with it - it's already done for data. The question

The page cache is _not_ a per-object hash. You don't have a separate hash
for each inode. You have a _global_ hash for all inodes.

At the same time, right now (in 2.3.22) you already have a global hash for
all and only the metadata, and it's the so-called "buffer cache". So if you
can reach your object in O(1) with an additional structure that may be
reasonable, but if you'll have to search again (a linear search or a hash
search) then you are better off using the buffer hash to get rid of the
stale buffer.
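
In other words (toy sketch again, invented names, just to show the
complexity argument): to kill the stale alias of a given physical block you
want an O(1) lookup on the physical key, not a walk of some per-object
list.

#include <stdio.h>
#include <string.h>

#define NR_BLOCKS 1024

/* toy "buffer cache": one slot per physical block, O(1) by construction */
static int cached[NR_BLOCKS];

/* toy per-inode list of cached metadata blocks: O(n) to find one block */
static unsigned long inode_blocks[16];
static int nr_inode_blocks;

static int find_in_inode_list(unsigned long block)
{
        int i;

        for (i = 0; i < nr_inode_blocks; i++)   /* linear search */
                if (inode_blocks[i] == block)
                        return 1;
        return 0;
}

int main(void)
{
        memset(cached, 0, sizeof(cached));
        cached[2048 % NR_BLOCKS] = 1;           /* O(1) insert by physical key */
        inode_blocks[nr_inode_blocks++] = 2048; /* per-object list insert */

        printf("hash lookup: %d, list walk: %d\n",
               cached[2048 % NR_BLOCKS], find_in_inode_list(2048));
        return 0;
}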

I think you are missing the reason why you are suggesting to move the
metadata writes elsewhere.

The reason for using the pagecache is that we'll reach buffer->b_data
without having to enter the filesystem code (so we'll be able to write and
to read without doing the virtual->physical translation; the pagecache is
our filesystem TLB, as you said).

We are instead _not_ using the page cache to enforce coherency (in fact it
_breaks_ coherency, and we have to query the buffer hash to know whether we
are going to collide with the buffer cache).

So I can't see how an additional metadata-virtual layer can help to enforce
coherency between the page cache, the newly implemented metadata cache and
the old buffer cache (now used only for taking care of writes).
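
The aliasing problem in two lines (again a toy, invented names): once the
same physical block has a copy in the pagecache and a copy in the buffer
cache, whoever writes through one side must find the other copy by its
physical key and invalidate it by hand, and that is exactly the
coherency-by-hand I'm complaining about.

#include <stdio.h>

#define NR_BLOCKS 16

/* two independent caches aliasing the same physical blocks */
static char pagecache_copy[NR_BLOCKS];          /* reached via (inode, offset) */
static char buffercache_copy[NR_BLOCKS];        /* reached via physical block */
static int buffercache_valid[NR_BLOCKS];

/* writing through the pagecache side: must invalidate the alias by hand */
static void write_via_pagecache(int phys_block, char value)
{
        pagecache_copy[phys_block] = value;
        if (buffercache_valid[phys_block]) {
                /* the "query the buffer hash" step: drop the stale copy */
                buffercache_valid[phys_block] = 0;
        }
}

int main(void)
{
        buffercache_copy[5] = 'A';
        buffercache_valid[5] = 1;

        write_via_pagecache(5, 'B');

        printf("buffer copy of block 5 is %s\n",
               buffercache_valid[5] ? "still valid (incoherent!)" : "invalidated");
        return 0;
}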

Note also that the pagecache just allows us to avoid entering the bmap path
(so we stay at the virtual layer whenever we can), and we don't need to
optimize the metadata itself. We just avoid querying the metadata by using
the pagecache, so I can't see a performance downside to keeping the
metadata at the physical layer. And as just said, I can't see how moving
the metadata to the virtual layer can help to enforce coherency (we'll have
to handle coherency by hand with yet another additional layer instead).

>Right. And you may want different policies for data and metadata - the
>latter should always be in kvm. AFAICS that's what SCT refered to.

That sounds reasonable of course, otherwise you'll have to clutter the
filesystem with lots of kmaps. Anyway 64G is a red herring too, as we could
_just_ use the bigmem memory (between 1/2G and 4G) as pagecache by using
bounce buffers to do the I/O slowly and in a deadlock-prone way (remember
ll_rw_block can't fail right now and it's supposed not to allocate memory,
since it's called from swapout). Solving the deadlock issue has nothing to
do with 64G (and you'll always go dogslow during I/O then). For raw-io
everything has been easy with the bounce buffers, since raw-io can fail
instead.
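
For reference, the bounce buffer trick is nothing more than this (userspace
sketch, invented names): the I/O layer keeps seeing only "low" memory, and
the price is an extra copy for every "high" page.

#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 4096

/* stands for the block I/O entry point: pretend it only sees "low" memory */
static void toy_do_io(const char *low_buf, unsigned long block)
{
        printf("writing block %lu from %p\n", block, (const void *)low_buf);
}

/*
 * Writing a "high" page: copy it into a preallocated low bounce buffer
 * and hand only the bounce buffer to the I/O layer.  The bounce buffer
 * must exist beforehand, because this path (think swapout) is not
 * allowed to fail or to allocate memory.
 */
static char bounce[PAGE_SIZE];          /* preallocated "low" memory */

static void write_high_page(const char *high_page, unsigned long block)
{
        memcpy(bounce, high_page, PAGE_SIZE);   /* the extra copy */
        toy_do_io(bounce, block);
}

int main(void)
{
        static char high_page[PAGE_SIZE] = "pretend this lives above 1G";

        write_high_page(high_page, 42);
        return 0;
}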

Andrea

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.rutgers.edu
Please read the FAQ at http://www.tux.org/lkml/