Re: GPF in __d_lookup_rcu after hibernate
From: Johan Hovold
Date: Sun Mar 20 2016 - 09:28:40 EST

On Sat, Mar 19, 2016 at 08:17:59PM +0000, Al Viro wrote:
> On Sat, Mar 19, 2016 at 07:24:30PM +0000, Al Viro wrote:
> > Hard to tell without your .config, but at a guess that's
> > 	while (kn->parent && base != kn)
> > 		kn = kn->parent;
> > in kernfs_get_target_path() running into kn equal to 0x008f0000008e0000,
> > which is not a valid pointer.
> >
> > Note that all of those are of the same pattern:
> > 00 00 N 00 00 00 N+1 00
> > where a pointer should've been. In these traces we'd seen N equal to 0xa,
> > 0x9a and 0x8e. Hell knows what it is, but the patterns are too similar to
> > be a coincidence; it's the same kind of memory corruption. Have it hit
> > a dentry and you've got yourself a persistent oops in dcache hash chain
> > traversals.
> >
> > FWIW, it might be a single table of that form, with the previous pointer
> > in the chain corrupted so it points into it. Hell knows... AFAICS,
> > by that point the previous addresses are already lost, both in __d_lookup_rcu()
> > and kernfs_get_target_path() cases.
>
> As a matter of fact, it looks like similar values pop up in traces posted
> at least a couple of years ago - http://pastebin.com/Nhewn8xP, for example,
> is full of such stuff, also on resume from suspend-on-disk. With 3.13
> kernel, including the things like pte equal to 0x0095000000940000, etc.
>
> So it smells like a repeated pattern of memory corruption on resume from
> disk, going back at least that far. What gets corrupted varies, so I suspect
> that dcache is simply something that contains lists long enough and traversed
> frequently enough to be likely to catch that. Page tables are another
> place where it's likely to show up...

Ouch. Thanks for looking into this.

I followed your advice and dumped the swap partition. The pattern is
definitely there: it shows up at least 250 times in my 8GB partition,
possibly as leftovers from earlier suspends.
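
For anyone who wants to grep a dump for the same thing, here is a minimal
sketch of a check for the "00 00 N 00 00 00 N+1 00" layout described in
the quoted text, applied to one 64-bit little-endian value at a time. The
function name matches_n_pattern() is made up for illustration; it is not
an existing kernel helper, and it only covers the 16-bit-shifted variant
of the pattern.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if v, stored little-endian, has the
 * byte layout 00 00 N 00 00 00 N+1 00 for some byte N. Equivalently, the
 * low 32 bits are N << 16 and the high 32 bits are (N + 1) << 16.
 */
static bool matches_n_pattern(uint64_t v)
{
	uint32_t lo = (uint32_t)v;
	uint32_t hi = (uint32_t)(v >> 32);

	/* Only bits 16..23 may be set in each 32-bit half. */
	if ((lo & ~UINT32_C(0x00ff0000)) || (hi & ~UINT32_C(0x00ff0000)))
		return false;

	/* The upper half must carry exactly N + 1. */
	return (hi >> 16) == (lo >> 16) + 1;
}
```

The bogus values mentioned in this thread (0x008f0000008e0000,
0x0095000000940000) pass this check, while an ordinary kernel pointer
such as ffff8800da802780 does not.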
The pattern itself appears to repeat from at least 0x4400 to 0x7ff00,
with a similar series before it (with some values overwritten):
75024120 2a00290028002700 2d002c0182002b00
75024140 310030002f002e00 3500340033003200
75024160 3900380037003600 3d003c003b003a00
75024200 410040003f003e00 4500440043004200
75024220 4900480047004600 4d004c004b004a00
...
75025200 0000003200000031 0000003400000033
75025220 0000003600000035 0000003800000037
75025240 0000003a00000039 0000003c0000003b
75025260 0000003e0000003d a02400400000003f
75025300 0000000028042277 0000000000000000
75025320 0000000000000000 0000000000000000
75025340 0000432600000000 0000450000004400
75025360 0000470000004600 0000490000004800
75025400 00004b0000004a00 00004d0000004c00
...
75044620 0007ef000007ee00 0007f1000007f000
75044640 0007f3000007f200 0007f5000007f400
75044660 0007f7000007f600 0007f9000007f800
75044700 0007fb000007fa00 0007fd000007fc00
75044720 0007ff000007fe00 00000007140e2000
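
Read as 32-bit words, the series above steps by a constant amount (0x100
in this part of the dump). A rough sketch of how one could count such
series in a raw buffer follows. It assumes the dump shows native
little-endian 64-bit values, so that the low half of each value comes
first in memory; count_stride_runs() and its parameters are hypothetical
names, not part of any existing tool.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: counts maximal runs of at least min_len
 * consecutive 32-bit words in which every word equals the previous one
 * plus stride. A nonzero stride is assumed; with stride == 0 any block
 * of identical words (e.g. zeroed pages) would count as a run.
 */
static size_t count_stride_runs(const uint32_t *w, size_t n,
				uint32_t stride, size_t min_len)
{
	size_t runs = 0;
	size_t len = 1;	/* length of the run ending at w[i - 1] */

	for (size_t i = 1; i <= n; i++) {
		if (i < n && w[i] == w[i - 1] + stride) {
			len++;
			continue;
		}
		/* The run ended here (or the buffer did): count it if long enough. */
		if (len >= min_len)
			runs++;
		len = 1;
	}
	return runs;
}
```

Calling it with stride 0x100 on the swap image and with stride 0x10000 on
a memory dump would cover both variants seen in this thread; min_len
trades false positives against missing series that were partly
overwritten.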
It seemed harder to trigger the GPF with your debugging code added, but
that was probably just a coincidence:
[ 2660.062886] buggered at ffff8800da802780 00500000 00510000 00520000 00530000 00540000 00550000 00560000 00570000 00580000 00590000 005a0000 005b0000 005c0000 005d0000 005e0000 005f0000 00600000 00610000 00620000 00630000 00640000 00650000 00660000 00670000 00680000 00690000 006a0000 006b0000 006c0000 006d0000 006e0000 006f0000 00700000 00710000 00720000 00730000 00740000 00750000 00760000 00770000 00780000 00790000 007a0000 007b0000 007c0000 007d0000 007e0000 007f0000
[ 2660.062912] general protection fault: 0000 [#1] PREEMPT SMP
[ 2660.063016] Modules linked in: intel_rapl iosf_mbi uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_core
[ 2660.063218] CPU: 1 PID: 8036 Comm: rsync Not tainted 4.4.6 #131
[ 2660.063313] Hardware name: SAMSUNG ELECTRONICS CO., LTD. 940X3G/NP940X3G-K03SE, BIOS P02ACJ.101.130926.dg 09/26/2013
[ 2660.063480] task: ffff8800d53a8000 ti: ffff8800d46ec000 task.ti: ffff8800d46ec000
[ 2660.063599] RIP: 0010:[<ffffffff811539b2>] [<ffffffff811539b2>] __d_lookup_rcu+0x72/0x210
[ 2660.063739] RSP: 0018:ffff8800d46efc80 EFLAGS: 00010206
[ 2660.063823] RAX: 0000000000510000 RBX: 0053000000520000 RCX: 0000000000000000
[ 2660.063942] RDX: ffff88021fa8e040 RSI: ffff88021fa8c9b8 RDI: ffff88021fa8c9b8
[ 2660.064010] RBP: ffff8800bee21600 R08: ffffffff81dcf4a8 R09: ffff88002a035038
[ 2660.064078] R10: 0000000000000019 R11: ffffffffffffffff R12: ffff8800da802780
[ 2660.064146] R13: 00000019d2947805 R14: ffff8800d46efdb0 R15: ffff8800d46efcec
[ 2660.064214] FS: 00007fb0608ac700(0000) GS:ffff88021fa80000(0000) knlGS:0000000000000000
[ 2660.064291] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2660.064345] CR2: 0000000001fcf000 CR3: 000000013045d000 CR4: 00000000001406e0
[ 2660.064413] Stack:
[ 2660.064433] ffff88002a035038 ffff880000000019 ffffffffffffffff ffffffffffffffff
[ 2660.064509] ffff8800d46efda0 0000000000000000 ffff8800d46efd38 ffff8800bee21600
[ 2660.064585] ffff8800d46efd30 ffff880215df4f20 ffffffff8114973d ffff8800d46efd2c
[ 2660.064662] Call Trace:
[ 2660.064708] [<ffffffff8114973d>] ? lookup_fast+0x3d/0x2d0
[ 2660.064767] [<ffffffff81149a71>] ? walk_component+0x31/0x2b0
[ 2660.064829] [<ffffffff811483db>] ? path_init+0x17b/0x3c0
[ 2660.064887] [<ffffffff8114a2db>] ? path_lookupat+0x5b/0x110
[ 2660.064947] [<ffffffff8114bce3>] ? filename_lookup+0x93/0x110
[ 2660.065009] [<ffffffff8114b9e4>] ? getname_flags+0x44/0x180
[ 2660.065071] [<ffffffff81142fd4>] ? vfs_fstatat+0x44/0x90
[ 2660.065130] [<ffffffff81143560>] ? SyS_newlstat+0x10/0x30
[ 2660.065189] [<ffffffff81001039>] ? syscall_trace_enter_phase1+0xb9/0x110
[ 2660.065262] [<ffffffff810a4ae3>] ? vtime_user_enter+0x23/0x40
[ 2660.065325] [<ffffffff81102345>] ? __context_tracking_enter+0x45/0x90
[ 2660.065396] [<ffffffff817a0c17>] ? entry_SYSCALL_64_fastpath+0x12/0x6a
[ 2660.065466] Code: 8b 18 48 83 e3 fe 0f 84 bc 00 00 00 4c 89 e8 49 c7 c3 ff ff ff ff 48 c1 e8 20 49 89 c2 eb 0c 48 8b 1b 48 85 db 0f 84 9d 00 00 00 <4c> 8b 03 4c 8d 63 f8 4c 89 c2 41 8d 80 00 00 01 00 48 c1 ea 20
[ 2660.065768] RIP [<ffffffff811539b2>] __d_lookup_rcu+0x72/0x210
[ 2660.065827] RSP <ffff8800d46efc80>
[ 2660.084547] ---[ end trace 2546bba214fdadef ]---
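
As a side note on why this shows up as a general protection fault rather
than an ordinary page fault: RBX in the trace above (0053000000520000) is
not a canonical x86-64 address. The sketch below assumes 48-bit virtual
addresses (4-level paging), which I believe is what this 4.4 kernel uses;
is_canonical_48() is a made-up name for illustration only.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: true if addr is canonical under 48-bit virtual
 * addressing, i.e. bits 63..47 are all zeros or all ones.
 */
static bool is_canonical_48(uint64_t addr)
{
	uint64_t top = addr >> 47;	/* bits 63..47, 17 bits in total */

	return top == 0 || top == 0x1ffff;
}
```

With this check, 0x0053000000520000 and 0x008f0000008e0000 are both
rejected, while the dentry address from the "buggered at" line
(ffff8800da802780) and the user-space FS base (00007fb0608ac700) are
accepted.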
Thanks,
Johan