Re: [PATCH 3.16 327/346] mm: workingset: fix crash in shadow node shrinker caused by replace_page_cache_page()

From: Johannes Weiner
Date: Mon Nov 14 2016 - 10:43:13 EST


On Mon, Nov 14, 2016 at 12:14:20AM +0000, Ben Hutchings wrote:
> 3.16.39-rc1 review patch. If anyone has any objections, please let me know.
>
> ------------------
>
> From: Johannes Weiner <hannes@xxxxxxxxxxx>
>
> commit 22f2ac51b6d643666f4db093f13144f773ff3f3a upstream.
>
> Antonio reports the following crash when using fuse under memory pressure:
>
> kernel BUG at /build/linux-a2WvEb/linux-4.4.0/mm/workingset.c:346!
> invalid opcode: 0000 [#1] SMP
> Modules linked in: all of them
> CPU: 2 PID: 63 Comm: kswapd0 Not tainted 4.4.0-36-generic #55-Ubuntu
> Hardware name: System manufacturer System Product Name/P8H67-M PRO, BIOS 3904 04/27/2013
> task: ffff88040cae6040 ti: ffff880407488000 task.ti: ffff880407488000
> RIP: shadow_lru_isolate+0x181/0x190
> Call Trace:
> __list_lru_walk_one.isra.3+0x8f/0x130
> list_lru_walk_one+0x23/0x30
> scan_shadow_nodes+0x34/0x50
> shrink_slab.part.40+0x1ed/0x3d0
> shrink_zone+0x2ca/0x2e0
> kswapd+0x51e/0x990
> kthread+0xd8/0xf0
> ret_from_fork+0x3f/0x70
>
> which corresponds to the following sanity check in the shadow node
> tracking:
>
> BUG_ON(node->count & RADIX_TREE_COUNT_MASK);
>
> The workingset code tracks radix tree nodes that exclusively contain
> shadow entries of evicted pages in them, and this (somewhat obscure)
> line checks whether there are real pages left that would interfere with
> reclaim of the radix tree node under memory pressure.
>
> While discussing ways how fuse might sneak pages into the radix tree
> past the workingset code, Miklos pointed to replace_page_cache_page(),
> and indeed there is a problem there: it properly accounts for the old
> page being removed - __delete_from_page_cache() does that - but then
> does a raw raw radix_tree_insert(), not accounting for the replacement
> page. Eventually the page count bits in node->count underflow while
> leaving the node incorrectly linked to the shadow node LRU.
>
> To address this, make sure replace_page_cache_page() uses the tracked
> page insertion code, page_cache_tree_insert(). This fixes the page
> accounting and makes sure page-containing nodes are properly unlinked
> from the shadow node LRU again.
>
> Also, make the sanity checks a bit less obscure by using the helpers for
> checking the number of pages and shadows in a radix tree node.
>
> Fixes: 449dd6984d0e ("mm: keep page cache radix tree nodes in check")
> Link: http://lkml.kernel.org/r/20160919155822.29498-1-hannes@xxxxxxxxxxx
> Signed-off-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Reported-by: Antonio SJ Musumeci <trapexit@xxxxxxxxxx>
> Debugged-by: Miklos Szeredi <miklos@xxxxxxxxxx>
> Signed-off-by: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
> Signed-off-by: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> [bwh: Backported to 3.16:
> - Implementation of page_cache_tree_insert() is different
> - Adjust context]
> Signed-off-by: Ben Hutchings <ben@xxxxxxxxxxxxxxx>

The added sanity checks in this patch can crash kernels built with
CONFIG_DEBUG_VM. While I doubt many people run 3.16 with that enabled,
please also consider taking the following two changes. The first one
changes the new VM_BUG_ONs to mere warnings, and the second patch
addresses the underlying issue that triggered them in the first place.

21f54ddae449 Using BUG_ON() as an assert() is _never_ acceptable
d3798ae8c6f3 mm: filemap: don't plant shadow entries without radix tree node

Attaching the latter below, since it's drastically different than the
upstream change, but a lot simpler because it predates the DAX stuff.

Thanks

---