Re: [PATCH 002 of 8] md/bitmap: Remove bitmap writeback daemon.

From: Neil Brown
Date: Sat May 13 2006 - 00:40:29 EST


On Friday May 12, akpm@xxxxxxxx wrote:
> NeilBrown <neilb@xxxxxxx> wrote:
> >
> > ./drivers/md/bitmap.c | 115 ++----------------------------------------
>
> hmm. I hope we're not doing any of that filesystem I/O within the context
> of submit_bio() or kblockd or anything like that. Looks OK from a quick
> scan.

No. We do all the I/O from the context of the per-array thread.
However some IO requests cannot complete until the filesystem I/O
completes, so we need to be sure that the filesystem I/O won't block
waiting for memory, or fail with -ENOMEM.

>
> a_ops->commit_write() already ran set_page_dirty(), so you don't need that
> in there.

Is that documented somewhere..... but yes, that seems to be right. Thanks.

>
> I assume this always works in units of a complete page? It's strange to do
> prepare_write() followed immediately by commit_write(). Normally
> prepare_write() will do some prereading, but it's smart enough to not do
> that if the caller is preparing to write the whole page.
>

Yes, it is strange. That was one of the things that made me want to
review this code and figure out how to do it "properly".

As far as I can see, much of 'address_space' is really an internal
interface to support routines used by the filesystem. A filesystem
may choose to use address spaces, and has a fair degree of freedom
when it comes to which bits to make use of and exactly what they
mean.
About the only thing that *has* to be supported is ->writepages --
which has a fair degree of latitude in exactly what it does -- and
->writepage -- which can only be called after locking a page and
rechecking the ->mapping.

bitmap.c is currently trying to do something every different.
It uses ->readpage to get pages in the page cache (even though some
address spaces don't define ->readpage) and then holds onto those
pages without holding the page lock, and then calls ->writepage to
flush them out from time to time.
Before calling writepage it gets the pagelock, but doesn't re-check
that ->mapping is correct (there is nothing much it can do if it isn't
correct..).

I noticed this is particularly a problem with tmpfs. When you call
writepage on a tmpfs page, the page is swizzled into the swap cache,
and ->mapping becomes NULL - not the behaviour that bitmap is
expecting.

Now I agree that tmpfs is an unusual case, and that storing a bitmap
in tmpfs doesn't make a lot of sense (though it can make some...) but
the point is that if a filesystem is allowed to move pages around like
that, then bitmap cannot hold on to pages in the page cache like it
wants to. It simply isn't a well defined thing to do.


> We normally use PAGE_CACHE_SIZE for these things, not PAGE_SIZE. Same diff.
>

Yeah.... why is that? Why have two names for exactly the same value?
How does a poor develop know when to use one and when the other? More
particularly, how does one remember.
I would argue that the "same diff" should be no difference - not even
in spelling....

> If you have a page and you want to write the whole thing out then there's
> really no need to run prepare_write or commit_write at all. Just
> initialise the whole page, run set_page_dirty() then write_one_page().
>

I see that now. But only after locking the page, and rechecking that
->mapping is correct, and if it isn't .... well, more work is involved
that bitmap is in a position to do.

> Perhaps it should check that the backing filesystem actually implements
> commit_write(), prepare_write(), readpage(), etc. Some might not, and the
> user will get taught not to do that via an oops.
>

Might help, but as I think you've gathered, I really want a whole
different approach to writing to the file. One that I can justify as
being a "correct" use of interfaces, and also that I can be certain
will not block or fail on a kmalloc or similar. Hence the bmap thing
later.


Thanks for your feedback.

NeilBrown
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/