xarray soft lockups on 5.2-rc's

From: Dave Hansen
Date: Wed Jun 26 2019 - 14:01:32 EST


On 6/3/19 3:59 PM, Matthew Wilcox wrote:
>> I'm seeing a similar softlockup:
> I've taken that commit out of my xarray tree. It's only a performance
> optimisation, so I don't mind dropping it until it's debugged.

I'm also seeing these on mainline pretty regularly (~once a week). The
latest was on -rc4, but the xarray code doesn't look like it's been
updated in mainline since 5.1.

I only seem to see these for the swap cache path, never the page cache
path, which makes me suspicious (typical dump is below). The stack is
always the same path, but the exact instruction pointer address seems to
bounce around between find_get_entry() and deeper into the xarray code.

find_get_entry() seems to have some recent changes, and this patch of
backtraces would be consistent with it getting stuck in a "goto repeat"
loop in there.

12 RIP: 0010:xas_load+0x35/0x80
9 RIP: 0010:xas_load+0x27/0x80
7 RIP: 0010:xas_load+0x13/0x80
4 RIP: 0010:xas_load+0x66/0x80
4 RIP: 0010:xas_load+0x45/0x80
2 RIP: 0010:xas_start+0x0/0x90
2 RIP: 0010:xas_load+0x3f/0x80
2 RIP: 0010:xas_load+0x31/0x80
2 RIP: 0010:xas_load+0x20/0x80
2 RIP: 0010:xas_load+0x0/0x80
2 RIP: 0010:find_get_entry+0x87/0x1a0
2 RIP: 0010:find_get_entry+0x74/0x1a0
1 RIP: 0010:xas_start+0x57/0x90
1 RIP: 0010:xas_start+0x42/0x90
1 RIP: 0010:xas_load+0xb/0x80
1 RIP: 0010:xas_load+0x72/0x80
1 RIP: 0010:xas_load+0x5e/0x80
1 RIP: 0010:xas_load+0x43/0x80
1 RIP: 0010:xas_load+0x39/0x80
1 RIP: 0010:xas_load+0x24/0x80
1 RIP: 0010:find_get_entry+0x63/0x1a0
1 RIP: 0010:dev_watchdog+0x273/0x280


> CPU: 2 PID: 1025 Comm: TaskSchedulerFo Not tainted 5.2.0-rc4 #15
> Hardware name: LENOVO 20F5S7V800/20F5S7V800, BIOS R02ET70W (1.43 ) 01/28/2019
> RIP: 0010:xas_start+0x0/0x90
> Code: 48 85 d2 75 e3 48 8b 17 b8 00 00 80 00 89 f1 d3 e0 8b 7a 18 85 c7 75 07 09 f8 89 42 18 f3 c3 f3 c3 f3 c3 0f 1f 80 00 00 00 00 <48> 8b 47 18 48 89 c2 83 e2 03 74 19 48 83 fa 02 75 26 48 3d 05 c0
> RSP: 0000:ffffc9000996fcf0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
> RAX: 0000000000000000 RBX: ffffc9000996fd08 RCX: 0000000000000000
> RDX: 0000000000000021 RSI: ffff888168184490 RDI: ffffc9000996fd08
> RBP: 000000000027db21 R08: 000fffffffe00000 R09: ffff8884216fa5c0
> R10: ffff8884216fa000 R11: 0000000000000822 R12: 0000000000000000
> R13: ffff888409fc9018 R14: 000000000027db21 R15: 0000000000000001
> FS: 00007f4f44d01700(0000) GS:ffff888411900000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f4f4cb21048 CR3: 000000029751a003 CR4: 00000000003626e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Call Trace:
> xas_load+0x9/0x80
> find_get_entry+0x74/0x1a0
> pagecache_get_page+0x27/0x250
> lookup_swap_cache+0x40/0x130
> do_swap_page+0x84/0x860
> __handle_mm_fault+0x865/0xfb0
> handle_mm_fault+0x103/0x210
> __do_page_fault+0x2f5/0x530
> ? page_fault+0x8/0x30
> page_fault+0x1e/0x30
> RIP: 0033:0x7f4ff44cb836
> Code: 39 f7 0f 84 48 fd ff ff 44 89 ca 83 4b 04 04 e9 f3 fe ff ff 48 8b 53 68 49 8d 04 2f 4c 39 62 18 75 30 48 81 3c 24 ff 03 00 00 <4c> 89 60 18 48 89 50 10 48 89 43 68 48 89 42 18 0f 86 55 fe ff ff
> RSP: 002b:00007f4f44cffa50 EFLAGS: 00010212
> RAX: 00007f4f4cb21030 RBX: 00007f4f4c000020 RCX: 00007f4f4c75c5f0
> RDX: 00007f4f4c000078 RSI: 00000000000030f1 RDI: 00000000007117f0
> RBP: 00000000001eda40 R08: 00007f4f4c350370 R09: 0000000004422604
> R10: 0000000000184cb5 R11: 0000000000000000 R12: 00007f4f4c000078
> R13: 00007f4f4c75c5f0 R14: 00000000008ff230 R15: 00007f4f4c9335f0
> watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [TaskSchedulerFo:1025]