Re: The INN/mmap bug

From: Linus Torvalds (torvalds@transmeta.com)
Date: Sun Sep 17 2000 - 22:30:29 EST


On Sun, 17 Sep 2000, Linus Torvalds wrote:
>
> I bet your patch fixes the corruption, but I want to understand _why_.
> Call me dense, but __block_commit_write() seems to do everything we want
> done..

Ok, I may be dense, but I see the bug.

And no, your patch was not the right thing either. It _will_ make that
particular corruption go away, but it's a more insidious problem, and your
patch only addreses part of the whole problem.

The basic problem is that right now we have code that does

        if (!page->buffers)
                create_empty_buffers(page, inode, inode->i_sb->s_blocksize);
        ...
        if (!buffer_uptodate(bh))
                ll_rw_block(READ, 1, &bh);

And this is WRONG.

If the whole page is up-to-date, we must NOT try to read the buffer in
from disk, because the in-memory copy is always more up-to-date.

Normally this bug shows itself only as a small performance issue - if we
only use read() and write(), all changes to the page will be done
"synchronously" with "page->buffers" being held, and as such the page will
never have contents that are newer than the on-disk image. But whenever
somebody writes to the page through a shared mapping, that is no longer
true - we MUST NOT do the buffer read, because it's going to overwrite
data that is newer than the disk contents.

The bug again didn't show up in the trivial test-cases, because it depends
on us "losing" the page buffers and having to re-create them in order to
show up, and that only happens under memory pressure.

Basically, both "truncate()" and "write()" have this bug where they can
end up re-reading stuff from disk even though the in-memory copy is newer.

And because write() had this bug, the bug also got into
block_write_full_page(). Not because block_write_full_page() was buggy in
itself, but because it used a buggy routine.

And your patch fixes the corruption, not by fixing the bug, but by
avoiding the buggy routing in block_write_full_page().

We need to fix the real bug - otherwise anybody doing both write() and
shared mmap's to the same file is going to be bit by it later on...

The easy fix is probably to do something like

        /* Map the buffer */
        if (!buffer_mapped(bh)) {
                ...
        }
+ /* If the page is up-to-date, so is the buffer */
+ if (Page_Uptodate(page))
+ set_bit(BH_Uptodate, &bh->b_state);

        /* Ok, now it was _truly_ not uptodate */
        if (!buffer_uptodate(bh))
                ll_rw_block(READ, 1, &bh);

Comments? The above should fix block_write_full_page() automatically, as
well as fixing the other cases too - and actually improve performance at
the same time by getting rid of unnecessary re-reads.

Looks fairly simple. It only happens in __block_prepare_write() and in
block_truncate_page(), so there's just two places to fix.

Can you se anything else wrong?

                Linus

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/



This archive was generated by hypermail 2b29 : Sat Sep 23 2000 - 21:00:16 EST