Re: The Central Mystery

Colin Plumb (colin@nyx.net)
Thu, 24 Jul 97 23:28:32 MDT


> Böhme et al., Linux Kernel Programming, have about 40 pages on
> memory management. Maybe this helps.

Yes, that has been the most helpful source so far. But your explanation
is even clearer!

> What I want to figure out is how a file system brings data into memory
> in such a way that it's properly aligned and packed into MMU-sized
> (e.g. 4K) pages despite the huge variability in the number of ways that
> it can get there. How do permissions get checked and used? How
> do writes get propagated back to the file system?

> There are several ways in which a file system can support mmap:
> a) implement a file mmap operation
> b) use the generic mmap, but implement the inode's readpage
> c) use the generic readpage, and support the inode's bmap

For the television audience: several parts of Linux, notably the VFS,
use vectors of function pointers to implement operations that a
file system might *possibly* want to be involved with. The file
system implementation can supply a NULL pointer for a given function,
in which case Linux falls back to a generic implementation
(often built on the other functions the file system does supply).

C++ programmers will recognize this as "subclassing". The base
class's functions are virtual, but a lot of them come with default
implementations.

The only difference is that C++ always fills in the pointers, while
Linux sometimes fills in a pointer (e.g. to generic_mmap) and
sometimes uses an explicit if(). Often, this if() has the generic
alternative in-line, which may end up being faster even with the
if() overhead.

Note that this means that, given a bmap() function, a file
system doesn't have to implement read()! (generic_file_read
is available and will do all sorts of hairy read-ahead for you.)
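
If an illustration helps, here's a toy user-space sketch of the pattern.
All the names are made up; none of this is the real kernel code, just
the shape of the "NULL pointer means use the generic version" idea:

#include <stdio.h>

struct toy_file;

/* The "operations vector": a struct of function pointers.  A NULL
 * entry means "I don't implement this; use the generic version". */
struct toy_file_ops {
    int (*read)(struct toy_file *f);
};

struct toy_file {
    const struct toy_file_ops *ops;
    const char *name;
};

static int generic_read(struct toy_file *f)
{
    printf("generic read of %s\n", f->name);
    return 0;
}

static int fancy_read(struct toy_file *f)
{
    printf("fancy read of %s\n", f->name);
    return 0;
}

/* The explicit if(): try the file system's own function, otherwise
 * fall back to the generic implementation. */
static int do_read(struct toy_file *f)
{
    if (f->ops->read)
        return f->ops->read(f);
    return generic_read(f);
}

int main(void)
{
    static const struct toy_file_ops plain_ops = { NULL };
    static const struct toy_file_ops fancy_ops = { fancy_read };
    struct toy_file a = { &plain_ops, "a" };
    struct toy_file b = { &fancy_ops, "b" };

    do_read(&a);    /* generic path */
    do_read(&b);    /* file-system-specific path */
    return 0;
}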

> Upon sys_mmap, the system calls the file's mmap operation. In case a),
> the file system needs to fill a vm_area_struct. In particular, it needs
> to set the operations pointer. Upon page fault, swap-out, write-back
> and so on the system will then call the operations of the vm area. In
> particular, upon page read, the nopage operation is called.

This is the really hairy way, used mostly for network file systems,
and not of general interest to local filesystem writers.
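
For case a), the flow is roughly: fault on a missing page, find the vm
area covering the address, call its nopage operation to produce the
page. Here's a toy user-space model of just that flow (everything here
is made up, not the 2.1 structures):

#include <stdio.h>
#include <stdlib.h>

#define TOY_PAGE_SIZE 16
#define TOY_NPAGES    4

struct toy_vma;

/* Per-mapping operations; the real vm_area_struct carries a pointer
 * like this, and the fault handler calls ->nopage. */
struct toy_vm_ops {
    void *(*nopage)(struct toy_vma *vma, unsigned long pgoff);
};

struct toy_vma {
    const struct toy_vm_ops *ops;
    const char *backing_name;       /* stands in for the mapped file */
    void *pages[TOY_NPAGES];        /* NULL == not faulted in yet */
};

/* A "file system's" nopage: produce a filled page for this offset. */
static void *toyfs_nopage(struct toy_vma *vma, unsigned long pgoff)
{
    char *page = malloc(TOY_PAGE_SIZE);
    snprintf(page, TOY_PAGE_SIZE, "%s:%lu", vma->backing_name, pgoff);
    return page;
}

static const struct toy_vm_ops toyfs_vm_ops = { toyfs_nopage };

/* The "page fault": if the page isn't there, ask the vma's ops for it. */
static void *fault_in(struct toy_vma *vma, unsigned long pgoff)
{
    if (!vma->pages[pgoff])
        vma->pages[pgoff] = vma->ops->nopage(vma, pgoff);
    return vma->pages[pgoff];
}

int main(void)
{
    struct toy_vma vma = { &toyfs_vm_ops, "somefile", { NULL } };

    printf("%s\n", (char *)fault_in(&vma, 2));  /* faults in page 2 */
    printf("%s\n", (char *)fault_in(&vma, 2));  /* already present  */
    return 0;
}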

> The generic nopage then calls the inode's readpage function, asking
> for the page. The readpage function is asked to fill the page. It's
> entirely up to the fs how it achieves this. If the file is smaller than
> the page, the file system should fill only the beginning of the page.

Um, correction: it should zero-fill the rest of the page (see
brw_page() in fs/buffer.c).
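
In other words, a readpage-style routine has to do something like the
following for the last page of a file (my own sketch, not the kernel's
code; file_data/file_size just stand in for the real backing store):

#include <string.h>

#define TOY_PAGE_SIZE 4096

/* Copy whatever part of the file covers this page and clear the rest,
 * so stale data never leaks into the mapping. */
void toy_readpage(char *page, unsigned long offset,
                  const char *file_data, unsigned long file_size)
{
    unsigned long n = 0;

    if (offset < file_size) {
        n = file_size - offset;
        if (n > TOY_PAGE_SIZE)
            n = TOY_PAGE_SIZE;
        memcpy(page, file_data + offset, n);
    }
    memset(page + n, 0, TOY_PAGE_SIZE - n); /* zero-fill the remainder */
}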

> Most file systems use the generic readpage here. This in turn calls
> bmap. bmap translates the file offset to a block number on the device
> where the inode resides, and the page data better starts on a block
> boundary (or the file system cannot use generic_readpage).
> generic_readpage then fakes a buffer that shares the data with page
> to fill, and performs the actual IO.

Ah, here's the fun. generic_readpage() (fs/buffer.c) calls bmap()
for each (file system) block in the (MMU) page and sticks the
block numbers (where 0 means "zero-fill") in an array, which it
then passes to brw_page().
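
In outline, that loop looks something like this (my own sketch; the
toy_* names are stand-ins, not the fs/buffer.c functions):

#define TOY_PAGE_SIZE   4096
#define TOY_BLOCK_SIZE  1024
#define BLOCKS_PER_PAGE (TOY_PAGE_SIZE / TOY_BLOCK_SIZE)

struct toy_inode;   /* opaque stand-in for the kernel's inode */

extern long toy_bmap(struct toy_inode *inode, long file_block); /* 0 == hole */
extern void toy_brw_page(char *page, const long *blocknr, int nblocks);

/* Ask bmap() for the device block number of each file system block
 * covered by the page (0 meaning "hole, zero-fill"), then hand the
 * whole array to the I/O routine. */
void toy_generic_readpage(char *page, struct toy_inode *inode,
                          long first_block)
{
    long blocknr[BLOCKS_PER_PAGE];
    int i;

    for (i = 0; i < BLOCKS_PER_PAGE; i++)
        blocknr[i] = toy_bmap(inode, first_block + i);
    toy_brw_page(page, blocknr, BLOCKS_PER_PAGE);
}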

?> Should zeromap_page_range be used on page-sized holes in files?
?> I guess generic_readpage() is a little too late, but it could
?> have a special return value that could be noticed at the 7 places
?> (all in mm/filemap.c) where it's called.

brw_page() proceeds to create dummy buffer_head structures for each
file system block, then reads them all in asynchronously.
There are two special cases:
- If the block number is zero, it's just a memset(). It appears that
  this code doesn't check that rw == READ before invoking this special
  case, so I hope it never happens the other way around...
- If the block is already in the buffer cache, it is copied from there.
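
So the per-block logic amounts to something like this (again my own
sketch with made-up helpers, not the brw_page() source):

#include <string.h>

#define TOY_BLOCK_SIZE 1024

/* Stand-ins declared only so the sketch compiles; they are not the
 * real buffer-cache interfaces. */
extern char *toy_find_in_buffer_cache(long blocknr);
extern void toy_queue_async_read(char *dest, long blocknr);

/* One file system block of the page: hole -> memset, already cached
 * -> copy, otherwise queue an asynchronous read for the real I/O. */
void toy_fill_block(char *dest, long blocknr)
{
    char *cached;

    if (blocknr == 0)
        memset(dest, 0, TOY_BLOCK_SIZE);        /* hole in the file */
    else if ((cached = toy_find_in_buffer_cache(blocknr)) != NULL)
        memcpy(dest, cached, TOY_BLOCK_SIZE);   /* copy from cache */
    else
        toy_queue_async_read(dest, blocknr);    /* real I/O needed */
}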

?> Is it worth writing a memset_aligned(), or even a zerofill_aligned(),
?> which could avoid a lot of tedious checking on, e.g. the Alpha?

Any actual I/O needed is put into an array and passed to ll_rw_block.
If there is no I/O at all, it falls through to the tail of unlock_buffer,
which updates the page status. (Question: is it dangerous to clear
the PG_locked bit before setting the PG_uptodate bit? I ask only
because the I/O case does them in the opposite order.)

?> Is it correct to do ++current->maj_flt even in the !nr
?> (no actual I/O) case?

This bubbles down through some code that I haven't explored yet,
but eventually gets satisfied and comes back to end_request() in
include/linux/blk.h. This uses macro trickery to customize it
to each device, but basically the "uptodate" flag is a success
flag, and it calls mark_buffer_uptodate and then unlock_buffer
(I'm not clear why those are two separate functions) on each
buffer_head affected by the current I/O request.

mark_buffer_uptodate checks to see if every buffer in the page is
now up to date, and if so, marks the page as up to date.

?> I don't understand what happens if the buffer_head is *not*
?> satisfying a page-in request here... does it trigger the
?> "(tmp=tmp->b_this_page) == NULL" case and redundantly set
?> the page as up to date? If so, why not
?> - Avoid testing the BH_Uptodate bit of the current bh,
?> since we just set it, and
?> - Avoid testing for tmp == NULL after the first step, since
?> it'll either be NULL or circularly linked, and
?> - Avoid the extra work on the page tables if it doesn't matter
?> what the PG_uptodate bit of buffer-cache pages is.
?>
?> The code then looks like the following:

void mark_buffer_uptodate(struct buffer_head * bh, int on)
{
    if (on) {
        struct buffer_head *tmp = bh->b_this_page;
        set_bit(BH_Uptodate, &bh->b_state);
        if (tmp) {
            /* If a page has buffers and all these buffers
             * are uptodate, then the page is uptodate. */
            do {
                if (!test_bit(BH_Uptodate, &tmp->b_state))
                    return;
                tmp = tmp->b_this_page;
            } while (tmp != bh);
            set_bit(PG_uptodate, &mem_map[MAP_NR(bh->b_data)].flags);
        }
        return;
    }
    clear_bit(BH_Uptodate, &bh->b_state);
}

I'm not quite sure yet what effect the PG_uptodate flag has on things.

It appears that whether this is a temporary buffer is stored in two
places... b_this_page == NULL for permanent buffer_heads, and the
BH_FreeOnIO flag. (I base this on the fact that after checking the flag,
unlock_buffer doesn't check for the bh->b_this_page == NULL case.)

?> If so, why not get rid of b_this_page and just use b_next in this case?
?> The permanent buffer_heads in the buffer cache aren't using the field.

Anyway, unlock_buffer clears the lock flag and wakes up any sleepers
on the buffer head. If this isn't a page I/O buffer, that's it.
If it *is*, however, there's more.

There's some careful checking with interrupts disabled to see if we're the
*last* buffer in the page to finish. (Unlike mark_buffer_uptodate, which
just redoes a trivial amount of idempotent work if it loses a race and so
doesn't need careful locking, this check really matters.)
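
The shape of that check, as I read it (a sketch only; the real code
does the walk with interrupts off, and uses the BH_ state bits rather
than a plain int):

/* Walk the circular b_this_page ring and see whether any other buffer
 * in the page is still locked.  This only gives a stable answer if it
 * can't race with another buffer completing, hence the interrupts-
 * disabled region around it. */
struct toy_bh {
    int locked;
    struct toy_bh *b_this_page; /* circular list of buffers in the page */
};

int page_io_finished(struct toy_bh *bh)
{
    struct toy_bh *tmp = bh->b_this_page;

    while (tmp != bh) {
        if (tmp->locked)
            return 0;   /* someone else is still in flight */
        tmp = tmp->b_this_page;
    }
    return 1;           /* we were the last buffer to complete */
}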

Anyway, it calls free_async_buffers on the buffer_head list, unlocks
the page, and wakes up waiters on the page. Some bookkeeping,
and it's finally done.

> Since Larry was asking for read/write: In Linux, read/write are still
> separate file operations, since not every file system needs to support
> mmap (and there are files w/o a file system behind them). Those file
> systems that do support readpage (directly or indirectly via bmap)
> don't need to implement read/write, they can use the generic file
> read/write routines instead.

> Please note that this is a Linux 2.1 description; in Linux 2.0, things
> are slightly different.

I know a *big* difference is that the buffer cache is used much less now.
Basically, the buffer cache is "physically tagged", indexed by device and
block number. The page cache is "virtually tagged", indexed by inode
and offset. It didn't used to be that way.
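
Put differently, the two caches are keyed by different things (struct
names below are made up, just to show the keys):

struct buffer_cache_key {
    int  dev;       /* which device */
    long blocknr;   /* block on that device -- a "physical" tag */
};

struct page_cache_key {
    unsigned long inode;    /* which file */
    unsigned long offset;   /* offset within that file -- a "virtual" tag */
};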

?> Why not support FS block sizes larger than the HW page? In particular,
?> why not allow 8K block sizes on non-Alpha platforms? It seems that
?> you could call bmap() for the 8K block and then choose one of its 4K
?> halves when making up the buffer_head to pass to ll_rw_block, or when
?> copying in brw_page.

That should be enough questions and explanations for now...

Thanks to everyone who's making this clear.

-- 
	-Colin