Re: [Bug 9182] Critical memory leak (dirty pages)

From: Linus Torvalds
Date: Wed Dec 19 2007 - 12:45:52 EST




On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
>
> I'll confirm this tomorrow but it seems that even switching to data=ordered
> (AFAIK default o ext3) is indeed enough to cure this problem.

Ok, do we actually have any ext3 expert following this? I have no idea
about what the journalling code does, but I have painful memories of ext3
doing really odd buffer-head-based IO and totally bypassing all the normal
page dirty logic.

Judging by the symptoms (sorry for not following this well, it came up
while I was mostly away travelling), something probably *does* clear the
dirty bit on the pages, but the dirty *accounting* is not done properly,
so the kernel keeps thinking it has dirty pages.

Now, a simple "grep" shows that ext3 does not actually do any
ClearPageDirty() or similar on its own, although maybe I missed some other
subtle way this can happen. And the *normal* VFS routines that do
ClearPageDirty should all be doing the proper accounting.

So I see a couple of possible cases:

- actually clearing the PG_dirty bit somehow, without doing the
accounting.

This looks very unlikely. PG_dirty is always cleared by some variant of
"*ClearPageDirty()", and that bit definition isn't used for anything
else in the whole kernel judging by "grep" (the page allocator tests
the bit, that's it).

And there aren't that many hits for ClearPageDirty, and they all seem
to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if the
mapping has dirty state accounting.

The exceptions seem to be:
- the page freeing path, but that path checks that "mapping" is NULL
(so no accounting), and would complain loudly if it wasn't
- the swap state stuff ("move_from_swap_cache()"), but that should
only ever trigger for swap cache pages (we have a BUG_ON() in that
path), and those don't do dirty accounting anyway.
- pageout(), but again only for pages that have a NULL mapping.

- ext3 might be clearing (probably indirectly) the "page->mapping" thing
or similar, which in turn will make the VFS think that even a dirty
page isn't actually to be accounted for - so when the page *turned*
dirty, it was accounted as a dirty page, but then, when it was cleaned,
the accounting wasn't reversed because ->mapping had become NULL.

This would be some interaction with the truncation logic, and quite
frankly, that should be all shared with the non-journal case, so I find
this all very unlikely.

However, that second case is interesting, because the pageout case
actually has a comment like this:

/*
* Some data journaling orphaned pages can have
* page->mapping == NULL while being dirty with clean buffers.
*/

which really sounds like the case in question.

I may know the VM, but that special case was added due to insane
journaling filesystems, and I don't know what insane things they do. Which
is why I'm wondering if there is any ext3 person who knows the journaling
code?

How/when does it ever "orphan" pages? Because yes, if it ever does that,
and clears the ->mapping field on a mapped page, then that page will have
incremented the dirty counts when it became dirty, but will *not*
decrement the dirty count when it is an orphan.

> Two questions remain then: why system dies when dirty reaches ~200MB and what
> is wrong with ext3+data=journal with >=2.6.20-rc2?

Well, that one is probably pretty straightforward: since the kernel thinks
that there are too many dirty pages, it will ask everybody who creates
more dirty pages to clean out some *old* dirty pages, but since they don't
exist, the whole thing will basically wait forever for a writeout to clean
things out that will never happen.

200MB is 10% of your 2GB of low-mem RAM, and 10% is the default
dirty_ratio that causes synchronous waits for writeback. If you use the
normal 3:1 VM split, the hang should happen even earlier (at the ~100MB
"dirty" mark).

So that part isn't the bug. The bug is in the accounting, but I'm pretty
damn sure that the core VM itself is pretty ok, since that code has now
been stable for people for the last year or so. It seems that ext3 (with
data journaling) does something dodgy wrt some page.

But how about trying this appended patch. It should warn a few times if
some page is ever removed from a mapping while it's dirty (and the mapping
is one that should have been accouned). It also tries to "fix up" the
case, so *if* this is the cause, it should also fix the bug.

I'd love to hear if you get any stack dumps with this, and what the
backtrace is (and whether the dirty counts then stay ok).

The patch is totally untested. It compiles for me. That's all I can say.

(There's a few other places that set ->mapping to NULL, but they're pretty
esoteric. Page migration? Stuff like that).

Linus

---
mm/filemap.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 188cf5f..7560843 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -124,6 +124,18 @@ void __remove_from_page_cache(struct page *page)
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
BUG_ON(page_mapped(page));
+
+ if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+ static int count = 10;
+ if (count) {
+ count--;
+ WARN_ON(1);
+ }
+
+ /* Try to fix up the bug.. */
+ dec_zone_page_state(page, NR_FILE_DIRTY);
+ dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+ }
}

void remove_from_page_cache(struct page *page)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/