Re: [PATCH] mm: fix page_mkclean_one (was: 2.6.19 file contentcorruption on ext3)

From: Linus Torvalds
Date: Tue Dec 26 2006 - 14:29:22 EST

On Tue, 26 Dec 2006, Nick Piggin wrote:

> Linus Torvalds wrote:
> >
> > Ok, so how about this diff.
> >
> > I'm actually feeling good about this one. It really looks like
> > "do_no_page()" was simply buggy, and that this explains everything.
> Still trying to catch up here, so I'm not going to reply to any old
> stuff and just start at the tip of the thread... Other than to say
> that I really like cancel_page_dirty ;)

Yeah, I think that part is a bit clearer about what's going on now.

> I think your patch is quite right so that's a good catch.

Actually, since people told me it didn't matter, I went back and looked at
_why_ - the thing is, "vma->vm_page_prot" should always be read-only
anyway, except for mappings that don't do dirty accounting at all, so I
think my patch only found cases that are unimportant (ie pages that get
faulted on on filesystems like ramfs that doesn't do any dirty page
accounting because they're all dirty anyway).

> But I'm not too surprised that it does not help the problem, because I
> don't think we have started shedding any old pte_dirty tests at
> unmap/reclaim-time, have we? So the dirty bit isn't going to get lost,
> as such.

True. We should no longer _need_ those dirty bit reclaims at
unmap/reclaim, but we still do them, so you're right, even if we were
buggy in this area, it should only really matter for the dirty page
counting, not for any lost data.

> I was hoping that you've almost narrowed it down to the filesystem
> writeback code, with the last few mails?

I think so, yes.

However, I've checked, and "rtorrent" really does seem to be fairly
well-behaved wrt any filesystem activity. It does

- no threading. It's 100% single-threaded, and doesn't even appear to use

- exactly _one_ "ftruncate()", and it does it at the beginning, for the
full final size.

IOW, it's not anything subtle with truncate and dirty page cancel.

- It never uses mprotect on the shared mappings, but it _does_ do:
"mincore()" - but the return values don't much matter (it's used
as a heuristic on which parts to hash, apparently)

I double- and triple-checked this one, because I
did make changes to "mincore()", but those didn't go
into the affected kernels anyway (ie they are not in
plain 2.6.19, nor in either)

"msync(MS_ASYNC)" (or MS_SYNC if you use a command line flag)
"munmap()" of course

- it never seems to mix mmap() and write() - it does _only_ mmap.

- it seems to mmap/munmap the shared files in nice 64-page chunks, all
64-page aligned in the file (ie it does NOT create one big mapping, it
has some kind of LRU of thse 64-page chunks). The only exception being
the last chunk, which it maps byte-accurate to the size.

- I haven't checked whether it only ever has the same chunk mapped once
at a time.

Anyway, the _one_ half-way interesting thing is the fact that it doesn't
allocate any backing store at all for the file, and as such the page
writeback needs to create all the underlying buffers on the filesystem. I
really don't see why that would be a problem either, but I could imagine
that if we have some writeback bug where we can end up writing back the
_same_ page concurrently, we'd actually end up racing in the kernel, and
allocating two different backing stores, and then maybe the other one
would effectively "get lost" (and the earlier writeback would win the
race, explaining why we'd end up with zeroes at the end of a block).

Or something.

However, all the codepaths _seem_ to test for PG_writeback, and not even
try to start another writeback while the first one is still active.

What would also actually be interesting is whether somebody can reproduce
this on Reiserfs, for example. I _think_ all the reports I've seen are on
ext2 or ext3, and if this is somehow writeback-related, it could be some
bug that is just shared between the two by virtue of them still having a
lot of stuff in common.

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at