Re: GPF in __d_lookup_rcu after hibernate

From: Al Viro
Date: Sat Mar 19 2016 - 16:18:21 EST


On Sat, Mar 19, 2016 at 07:24:30PM +0000, Al Viro wrote:
> Hard to tell without your .config, but at a guess that's
> 	while (kn->parent && base != kn)
> 		kn = kn->parent;
> in kernfs_get_target_path() running into kn equal to 0x008f0000008e0000,
> which is not a valid pointer.
>
> Note that all of those are of the same pattern:
> 00 00 N 00 00 00 N+1 00
> where a pointer should've been. In these traces we'd seen N equal to 0xa,
> 0x9a and 0x8e. Hell knows what it is, but the patterns are too similar to
> be a coincidence; it's the same kind of memory corruption. Have it hit
> a dentry and you've got yourself a persistent oops in dcache hash chain
> traversals.
>
> FWIW, it might be a single table of that form, with the previous pointer
> in the chain corrupted so it points into it. Hell knows... AFAICS,
> by that point the previous addresses are already lost, both in __d_lookup_rcu()
> and kernfs_get_target_path() cases.

As a matter of fact, it looks like similar values pop up in traces posted
at least a couple of years ago - http://pastebin.com/Nhewn8xP, for example,
is full of such stuff, also on resume from suspend-to-disk. That one is with
a 3.13 kernel, and includes things like a pte equal to 0x0095000000940000, etc.
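
FWIW, the pattern is easy to check for mechanically. A minimal userland
sketch, not anything that exists in the tree: matches_corruption_pattern()
is a hypothetical helper, and it assumes a little-endian 64-bit box (so the
bytes 00 00 N 00 00 00 N+1 00 in memory read back as (N+1) << 48 | N << 16)
and that N fits in a single byte, which is all we've seen so far.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel function): returns true if the
 * 64-bit value v, as loaded on a little-endian machine, has the byte
 * layout 00 00 N 00 00 00 N+1 00 in memory.
 * Assumption: N is a single byte; wider N would need a wider mask.
 */
static bool matches_corruption_pattern(uint64_t v)
{
	const uint64_t mask = (0xffULL << 16) | (0xffULL << 48);
	uint64_t n      = (v >> 16) & 0xff;	/* byte 2 in memory */
	uint64_t n_next = (v >> 48) & 0xff;	/* byte 6 in memory */

	/* all bytes other than 2 and 6 must be zero */
	if (v & ~mask)
		return false;

	/* byte 6 must be exactly one more than byte 2 */
	return n_next == n + 1;
}
```

With that, both 0x008f0000008e0000 and 0x0095000000940000 match (N equal to
0x8e and 0x94 resp.), while a normal kernel pointer does not.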

So it smells like a repeated pattern of memory corruption on resume from
disk, going back at least that far. What gets corrupted varies, so I suspect
that dcache is simply something that contains lists long enough and traversed
frequently enough to be likely to run into the corrupted memory. Page tables
are another place where it's likely to show up...