Re: BUG_ON() in workingset_node_shadows_dec() triggers
From: Johannes Weiner
Date: Wed Oct 05 2016 - 05:26:09 EST
On Tue, Oct 04, 2016 at 06:21:37PM -0700, Linus Torvalds wrote:
> On Tue, Oct 4, 2016 at 2:32 AM, Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
> > In the workingset code, if we detect radix tree nodes in a state in
> > which they shouldn't be on the shadow node LRU, we could simply warn,
> > abort the reclaim process and leave them off the LRU. Something like
> > the below patch.
> I don't hate that patch, but I wonder why the counts get corrupted and
> the workingset_node_shadows_dec() thing triggered in the first place.
Okay, I kept digging, and I think I found the culprit: yes, the actual
bug is ancient (3.15), but also requires a fairly specific combination
of access pattern and memory pressure to trigger, which is why testing
didn't reveal it right away. The symptom before the sanity check was
an occasionally leaked radix tree node, which is mostly harmless and
hard to notice [ really drives home the BUG_ON point, doesn't it... ]
The problem is in the way we account shadow entries inside the upper
bits of node->count behind the back of the radix tree code. We don't
use the insert and delete functions directly from the page cache and
then do our own node->count updates. And other than that the radix
tree code mostly accounts only its own childpointers in node->count
and doesn't make assumptions about how user entries are accounted.
The notable exception is during tree extension: when there is a single
entry at index 0, it's stored as a direct pointer from root->rnode. A
fault at a higher index will copy that direct entry to a proper node
and do node->count = 1. That's wrong if the direct entry is a shadow.
So the scenario required to trigger this issue is this: you need a
file with only the first page faulted in and then evicted before
touching another page in the file. When that first page faults back,
that's when the counter underflows.
Here is a reproducer that triggers the warning instantly for me:
* When index 0 is present, then evicted, rnode is a direct pointer to
* a shadow entry. Extending the tree from there does the wrong thing
* by setting node->count = 1; that's the page counter bits, it should
* be setting the upper shadow bits. When index 0 faults back and we
* try to remove the shadow entry, the shadow counter is 0.
/* Set to slightly bigger than memory */
#define PRESSURE_SIZE (12UL << 30)
unsigned long off;
int fd1, fd2;
fd1 = open("test-twopages", O_CREAT|O_RDWR);
/* fault page at index 0 */
pread(fd1, buf, 4096, 0);
/* create pressure to plant shadow at fd1's index 0 (knock on wood) */
fd2 = open("test-pressure", O_CREAT|O_RDWR);
for (off = 0; off < PRESSURE_SIZE; off += 4096)
read(fd2, buf, 4096);
/* fault page at index 1, extending tree sets wrong node->count */
pread(fd1, buf, 4096, 4096);
/* refault index 0, deletes shadow with no shadows in node->count */
pread(fd1, buf, 4096, 0);
That radix tree node management needs some cleaning up. It probably
makes sense to split node->count into actually separate members for
clarity, and then add a root tag to distinguish shadows from regular
entries in root->rnode. I have to think about this more, the current
situation is too fragile and ugly.
But in the meantime, there is an obvious fix: don't ever store shadow
entries in root->rnode, seeing as we need nodes for proper accounting.
It means we temporarily lose the ability to detect refaults from
single-page files, but it's probably better to keep the stable fix
small and restore that functionality in a new release.
Patch below. NOTE: I'm traveling without access to my test rig right
now and so I have only lightly tested this on my laptop. I'm also
jetlagged like crazy, so please triple check my thinking. The patch
does fix the reproducer case and has otherwise been stable here.