Re: proposed change for async vbuffer heads

Benjamin C R LaHaise (blah@dot.superaje.com)
Wed, 27 Aug 1997 00:12:37 +0000


On Tue, 26 Aug 1997, Theodore Y. Ts'o wrote:
...
> If we can allow the page cache to handle 64 bit offsets even on the
> i386, then there's a very simple solution. We simply access all of the
> meta-data information (the superblocks, inode table, indirect block
> pointers, etc.) via the page cache with the entire block device as the
> blocking store. (i.e., what happens if you open /dev/hda1 directly and
> go through the page cache).

This would be by far the right way to go if it weren't for one can of
worms: aliasing. If an indirect block (say it's a 1024-byte block) is
accessed via the device's inode and is immediately followed on disk by a
block of some file's data, writing to the indirect block would cause the
dirty 'page' to be flushed, overwriting any update made to the file
through its own inode.
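
To make the hazard concrete, here's a minimal sketch (plain user-space C,
block numbers made up) of why two adjacent 1K blocks end up aliased
inside one 4K page of the device inode's cache:

/* Sketch only: shows why two adjacent 1K blocks alias in one 4K page
 * of the device inode's page cache.  Block numbers are made up. */
#include <stdio.h>

#define PAGE_SIZE  4096UL
#define BLOCK_SIZE 1024UL

int main(void)
{
        unsigned long indirect_blk = 5;  /* hypothetical indirect block  */
        unsigned long data_blk     = 6;  /* the data block right after it */

        unsigned long ind_off  = indirect_blk * BLOCK_SIZE;
        unsigned long data_off = data_blk * BLOCK_SIZE;

        /* Both fall into the same page index on the device inode... */
        printf("indirect block -> page %lu\n", ind_off / PAGE_SIZE);
        printf("data block     -> page %lu\n", data_off / PAGE_SIZE);

        /* ...so flushing the 'dirty' device page after touching the
         * indirect block also writes back a stale copy of the data
         * block, clobbering whatever the file's own inode holds. */
        return 0;
}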

One possibility is that we could allow page cache pages to not be
entirely 'present' for device inode pages, by having either a max-valid
offset or perhaps valid and dirty bitmaps of 512 byte (or smaller) chunks.
This isn't 100% efficient, but if we only do it for the superblock and as
little meta-data as possible, I think the couple of k lost would be ok.
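
Something along these lines, purely as a sketch (field and helper names
are my invention), would do for tracking which 512 byte chunks of a
device page are actually there:

/* Sketch of a partially-valid device-inode page; names hypothetical.
 * 4096 / 512 = 8 chunks, so one byte per bitmap suffices. */
#define CHUNKS_PER_PAGE (4096 / 512)

struct partial_page {
        unsigned char valid;    /* bit n set => chunk n holds good data */
        unsigned char dirty;    /* bit n set => chunk n must be written */
};

/* Only chunks that are both valid and dirty ever get flushed, so a
 * device page that merely caches the superblock can't clobber its
 * neighbours. */
static inline int chunk_needs_write(struct partial_page *pp, int n)
{
        return (pp->valid & pp->dirty) & (1 << n);
}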

Alternatively, the page cache could be extended to deal with aliasing,
and it should be, but... how do we do that without taking a performance
hit and/or ruining the simplicity of the page cache? Think think think...
Hmm: in 99% of cases there will be no alias to worry about. How about
moving the page cache tag for a page out of struct page, so we end up
with something like:

struct page_cache_entry {
        struct inode *inode;
        loff_t offset;                  /* -ve for per-inode meta data */
        unsigned size;                  /* normally PAGE_SIZE */
        char *data;                     /* normally page aligned */
        struct page_cache_entry *next_in_hash;
        struct page_cache_entry *next_in_inode;
        struct page_cache_entry *next_in_backing_store; /* on backing_store_list */
        struct page_cache_entry *backing_store_list;    /* not for device inodes */
};

The page cache would be managed pretty much as it is currently, but upon
loading a page, additional page_cache_entries would be added keyed by the
device's inode and offset (coalescing them, of course). These entries
become the 'backing store' that the file's entry points to.
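
Roughly like this - a sketch reusing the struct above, with a hypothetical
find_or_create_entry() standing in for the usual hash lookup/insert (much
like the existing find_page()):

/* Sketch: wire a freshly loaded file page to its alias under the
 * device's inode.  find_or_create_entry() is hypothetical. */
extern struct page_cache_entry *find_or_create_entry(struct inode *inode,
                                                     loff_t offset,
                                                     unsigned size);

static void attach_backing_store(struct page_cache_entry *file_ent,
                                 struct inode *dev_inode, loff_t dev_off)
{
        struct page_cache_entry *dev_ent;

        dev_ent = find_or_create_entry(dev_inode, dev_off, file_ent->size);
        dev_ent->data = file_ent->data;         /* same physical page */

        /* chain the device entry onto the file entry's backing store list */
        dev_ent->next_in_backing_store = file_ent->backing_store_list;
        file_ent->backing_store_list = dev_ent;
}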

A filesystem's write routine would become much simpler - simply flush the
page out through its backing_store entries. Since we want to do everything
through the page cache, to make life easier on filesystems a vfs op
called expand could be created, plus an option to bmap that forces
allocation of blocks... that way a generic_writepage could be shared
amongst filesystems.
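
A generic_writepage could then be as dumb as this sketch (write_entry()
is a made-up stand-in for queueing the actual block I/O; expand/bmap
would already have allocated the blocks by the time we get here):

/* Sketch of a shared write path: push the page's data out through every
 * aliased device-inode entry. */
extern int write_entry(struct page_cache_entry *ent);   /* hypothetical */

static int generic_writepage(struct page_cache_entry *ent)
{
        struct page_cache_entry *bs;
        int err = 0;

        for (bs = ent->backing_store_list; bs; bs = bs->next_in_backing_store) {
                int ret = write_entry(bs);
                if (ret && !err)
                        err = ret;
        }
        return err;
}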

Unfortunately this adds one additional entry per page for normal files
(is that significant?). What we end up with is a pruned-down buffer_head
style tag carrying the page cache functionality. Reaping would still occur
via the page_map mechanism (a page would have a pointer to the
page_cache_entry of the inode that loaded it).

Hmm: partition devices should merely translate the offset and return
/dev/hda's inode. Furthermore, by using inodes for filesystem-level I/O
we'd eliminate the need for the loop device (and suddenly it'd work
over nfs ;).
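
i.e. something as trivial as this sketch (names invented; the start
offset would come from the partition table we already have):

/* Sketch, hypothetical names throughout: a partition device rebases the
 * offset onto the whole-disk inode so both share one set of pages. */
extern struct inode *whole_disk_inode;          /* /dev/hda's inode */
extern loff_t        partition_start_bytes;     /* from the partition table */

static struct inode *partition_backing_inode(loff_t *offset)
{
        *offset += partition_start_bytes;
        return whole_disk_inode;
}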

...
As for indirect blocks, negative offsets look like the way to go. That
way there isn't as much of a chance of hash table collisions (having
lots of meta-data pages tied to a device's inode could otherwise trigger
them).
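
For example (sketch, key layout invented), the n'th meta-data block of a
file could be keyed at:

/* Sketch, layout invented: key the n'th meta-data block of a file at a
 * negative offset so it hangs off the file's own inode instead of piling
 * onto the device inode's hash chains. */
static loff_t metadata_offset(unsigned long meta_index, unsigned blocksize)
{
        return -((loff_t)(meta_index + 1) * blocksize);
}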

> The reason why we need to handle 64-bit offsets in the page cache is to
> support filesystems greater than 4GB, which do exist today.

Definitely. With all the talk about 64-bit file APIs, the remaining parts
of the vfs/mm should be tweaked before 2.2.

>
> - Ted
>
> P.S. The other thing we need to fix about the VM system is that today,
> if three processes have a shared map, and it is dirtied by the three
> processes, the pages in that VM mapping apparently get written out three
> times, because the dirty bit is per process instead of per-page. This
> can be severely non-optimal for certain performance critical
> applications that have this access/modification pattern.

Per-page dirty bits and pte lists along the lines of what either Mark or
I am doing will solve this issue nicely. (I've just been too busy
lately to get on with these interesting developments now that I have
the basic pte lists working.)

-ben