On 2007.12.20 08:25:56 -0800, Linus Torvalds wrote:
On Thu, 20 Dec 2007, Bj?rn Steinbrink wrote:
OK, so I looked for PG_dirty anyway.
In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
bail out if the page is dirty.
Then in 3e67c0987d7567ad666641164a153dca9a43b11d, Andrew fixed
truncate_complete_page, because it called cancel_dirty_page (and thus
cleared PG_dirty) after try_to_free_buffers was called via
do_invalidatepage.
Now, if I'm not mistaken, we can end up as follows.
truncate_complete_page()
cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
do_invalidatepage()
ext3_invalidatepage()
journal_invalidatepage()
journal_unmap_buffer()
__dispose_buffer()
__journal_unfile_buffer()
__journal_temp_unlink_buffer()
mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
Good, this seems to be the exact path that actually triggers it. I got to
journal_unmap_buffer(), but was too lazy to actually then bother to follow
it all the way down - I decided that I didn't actually really even care
what the low-level FS layer did, I had already convinced myself that it
obviously must be dirtying the page some way, since that matched the
symptoms exactly (ie only the journaling case was impacted, and this was
all about the journal).
But perhaps more importantly: regardless of what the low-level filesystem
did at that point, the VM accounting shouldn't care, and should be robust
in the face of a low-level filesystem doing strange and wonderful things.
But thanks for bothering to go through the whole history and figure out
what exactly is up.
Oh well, after seeing the move of cancel_dirty_page, I just went
backwards from __set_page_dirty using cscope + some smart guessing and
quickly ended up at ext3_invalidatepage, so it wasn't that hard :-)
As try_to_free_buffers got its ext3 hack back in
ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
3e67c0987d7567ad666641164a153dca9a43b11d should be reverted? (Except for
the accounting fix in cancel_dirty_page, of course).
Yes, I think we have room for cleanups now, and I agree: we ended up
reinstating some questionable code in the VM just because we didn't really
know or understand what was going on in the ext3 journal code.
Hm, you attributed more to my mail than there was actually in it. I
didn't even start to think of cleanups (because I don't know jack about
the whole ext3/jdb stuff, so I simply cannot come up with any cleanups
(yet?)).What I meant is that we only did a half-revert of that hackery.
When try_to_free_buffers started to check for PG_dirty, the
cancel_dirty_page call had to be called before do_invalidatepage, to
"fix" a _huge_ leak. But that caused the accouting breakage we're now
seeing, because we never account for the pages that got redirtied during
do_invalidatepage.
Then the change to try_to_free_buffers got reverted, so we no longer
need to call cancel_dirty_page before do_invalidatepage, but still we
do. Thus the accounting bug remains. So what I meant to suggest was
simply to actually "finish" the revert we started.
Or expressed as a patch:
diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..2974903 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -98,11 +98,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
if (page->mapping != mapping)
return;
- cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
if (PagePrivate(page))
do_invalidatepage(page, 0);
+ cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
remove_from_page_cache(page);
ClearPageUptodate(page);
ClearPageMappedToDisk(page);
I'll be the last one to comment on whether or not that causes inaccurate
accouting, so I'll just watch you and Jan battle that out until someone
comes up with a post-.24 patch to provide a clean fix for the issue.
Krzysztof, could you give this patch a test run?
If that "fixes" the problem for now, I'll try to come up with some
usable commit message, or if somehow wants to beat me to it, you can
already have my
Signed-off-by: Björn Steinbrink <B.Steinbrink@xxxxxx>