Re: [Xen-devel] [PATCH v2] xen: Fix x86 sched_clock() interface for xen

From: Hans van Kranenburg
Date: Fri Jan 11 2019 - 10:57:31 EST


On 1/11/19 3:01 PM, Juergen Gross wrote:
> On 11/01/2019 14:12, Hans van Kranenburg wrote:
>> Hi,
>>
>> On 1/11/19 1:08 PM, Juergen Gross wrote:
>>> Commit f94c8d11699759 ("sched/clock, x86/tsc: Rework the x86 'unstable'
>>> sched_clock() interface") broke Xen guest time handling across
>>> migration:
>>>
>>> [ 187.249951] Freezing user space processes ... (elapsed 0.001 seconds) done.
>>> [ 187.251137] OOM killer disabled.
>>> [ 187.251137] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) done.
>>> [ 187.252299] suspending xenstore...
>>> [ 187.266987] xen:grant_table: Grant tables using version 1 layout
>>> [18446743811.706476] OOM killer enabled.
>>> [18446743811.706478] Restarting tasks ... done.
>>> [18446743811.720505] Setting capacity to 16777216
>>>
>>> Fix that by setting xen_sched_clock_offset at resume time to ensure a
>>> monotonic clock value.
>>>
>>> [...]
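
(The idea of the fix, as I understand it, is roughly the sketch below.
xen_clocksource_read() is the existing Xen clocksource helper; the
xen_sched_clock_save()/xen_sched_clock_restore() names are only
illustrative here, since the real patch does this inside the existing
Xen suspend/resume callbacks.)

#include <linux/types.h>

/*
 * Sketch only, not the literal patch: sched_clock() is derived from the
 * Xen clocksource minus an offset.  Saving the clock value at suspend
 * and re-deriving the offset at resume keeps sched_clock() monotonic
 * across migration, instead of jumping to the huge [18446743811.x]
 * timestamps shown above.
 */
extern u64 xen_clocksource_read(void);	/* existing Xen time helper */

static u64 xen_sched_clock_offset;	/* subtracted from each reading */
static u64 xen_clock_value_saved;	/* sched_clock() value at suspend */

static u64 xen_sched_clock(void)
{
	return xen_clocksource_read() - xen_sched_clock_offset;
}

/* illustrative name: called from the suspend path */
static void xen_sched_clock_save(void)
{
	xen_clock_value_saved = xen_clocksource_read() - xen_sched_clock_offset;
}

/* illustrative name: called from the resume path */
static void xen_sched_clock_restore(void)
{
	/*
	 * Pick a new offset so the first post-resume reading continues
	 * from the value saved at suspend.
	 */
	xen_sched_clock_offset = xen_clocksource_read() - xen_clock_value_saved;
}
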
>>
>> I'm bouncing a PV domU around a bunch of test servers with live
>> migrate now, and in between the regular kernel log output I'm seeing this:
>>
>> [Fri Jan 11 13:58:42 2019] Freezing user space processes ... (elapsed
>> 0.002 seconds) done.
>> [Fri Jan 11 13:58:42 2019] OOM killer disabled.
>> [Fri Jan 11 13:58:42 2019] Freezing remaining freezable tasks ...
>> (elapsed 0.000 seconds) done.
>> [Fri Jan 11 13:58:42 2019] suspending xenstore...
>> [Fri Jan 11 13:58:42 2019] ------------[ cut here ]------------
>> [Fri Jan 11 13:58:42 2019] Current state: 1
>> [Fri Jan 11 13:58:42 2019] WARNING: CPU: 3 PID: 0 at
>> kernel/time/clockevents.c:133 clockevents_switch_state+0x48/0xe0
>> [Fri Jan 11 13:58:42 2019] Modules linked in:
>> [Fri Jan 11 13:58:42 2019] CPU: 3 PID: 0 Comm: swapper/3 Not tainted
>> 4.19.14+ #1
>> [Fri Jan 11 13:58:42 2019] RIP: e030:clockevents_switch_state+0x48/0xe0
>> [Fri Jan 11 13:58:42 2019] Code: 8b 0c cd 40 ee 00 82 e9 d6 5b d1 00 80
>> 3d 8e 8d 43 01 00 75 17 89 c6 48 c7 c7 92 62 1f 82 c6 05 7c 8d 43 01 01
>> e8 f8 22 f8 ff <0f> 0b 5b 5d f3 c3 83 e2 01 74 f7 48 8b 47 48 48 85 c0
>> 74 69 48 89
>> [Fri Jan 11 13:58:42 2019] RSP: e02b:ffffc90000787e30 EFLAGS: 00010082
>> [Fri Jan 11 13:58:42 2019] RAX: 0000000000000000 RBX: ffff88805df94d80
>> RCX: 0000000000000006
>> [Fri Jan 11 13:58:42 2019] RDX: 0000000000000007 RSI: 0000000000000001
>> RDI: ffff88805df963f0
>> [Fri Jan 11 13:58:42 2019] RBP: 0000000000000004 R08: 0000000000000000
>> R09: 0000000000000119
>> [Fri Jan 11 13:58:42 2019] R10: 0000000000000020 R11: ffffffff82af4e2d
>> R12: ffff88805df9ca40
>> [Fri Jan 11 13:58:42 2019] R13: 0000000dd28d6ca6 R14: 0000000000000000
>> R15: 0000000000000000
>> [Fri Jan 11 13:58:42 2019] FS: 00007f34193ce040(0000)
>> GS:ffff88805df80000(0000) knlGS:0000000000000000
>> [Fri Jan 11 13:58:42 2019] CS: e033 DS: 002b ES: 002b CR0: 0000000080050033
>> [Fri Jan 11 13:58:42 2019] CR2: 00007f6220be50e1 CR3: 000000005ce5c000
>> CR4: 0000000000002660
>> [Fri Jan 11 13:58:42 2019] Call Trace:
>> [Fri Jan 11 13:58:42 2019] tick_program_event+0x4b/0x70
>> [Fri Jan 11 13:58:42 2019] hrtimer_try_to_cancel+0xa8/0x100
>> [Fri Jan 11 13:58:42 2019] hrtimer_cancel+0x10/0x20
>> [Fri Jan 11 13:58:42 2019] __tick_nohz_idle_restart_tick+0x45/0xd0
>> [Fri Jan 11 13:58:42 2019] tick_nohz_idle_exit+0x93/0xa0
>> [Fri Jan 11 13:58:42 2019] do_idle+0x149/0x260
>> [Fri Jan 11 13:58:42 2019] cpu_startup_entry+0x6a/0x70
>> [Fri Jan 11 13:58:42 2019] ---[ end trace 519c07d1032908f8 ]---
>> [Fri Jan 11 13:58:42 2019] xen:grant_table: Grant tables using version 1
>> layout
>> [Fri Jan 11 13:58:42 2019] OOM killer enabled.
>> [Fri Jan 11 13:58:42 2019] Restarting tasks ... done.
>> [Fri Jan 11 13:58:42 2019] Setting capacity to 6291456
>> [Fri Jan 11 13:58:42 2019] Setting capacity to 10485760
>>
>> This happens on every *first* live migrate that I do after starting
>> the domU.
>
> Yeah, it's a WARN_ONCE().
>
> And you didn't see it with v1 of the patch?

No.

> At first glance this might be another bug that is just being exposed
> by my patch.
>
> I'm investigating further, but this might take some time. Could you
> meanwhile verify whether the same happens with kernel 5.0-rc1? That was
> the one I tested with, and I didn't spot that WARN.

I have Linux 5.0-rc1 with v2 on top now, which gives me this on live
migrate:

[ 51.845967] xen:grant_table: Grant tables using version 1 layout
[ 51.871076] BUG: unable to handle kernel NULL pointer dereference at
0000000000000098
[ 51.871091] #PF error: [normal kernel read fault]
[ 51.871100] PGD 0 P4D 0
[ 51.871109] Oops: 0000 [#1] SMP NOPTI
[ 51.871117] CPU: 0 PID: 36 Comm: xenwatch Not tainted 5.0.0-rc1 #1
[ 51.871132] RIP: e030:blk_mq_map_swqueue+0x103/0x270
[ 51.871141] Code: 41 39 45 30 76 97 8b 0a 85 c9 74 ed 89 c1 48 c1 e1
04 49 03 8c 24 c0 05 00 00 48 8b 09 42 8b 3c 39 49 8b 4c 24 58 48 8b 0c
f9 <4c> 0f a3 b1 98 00 00 00 72 c5 f0 4c 0f ab b1 98 00 00 00 44 0f b7
[ 51.871161] RSP: e02b:ffffc900008afca8 EFLAGS: 00010282
[ 51.871173] RAX: 0000000000000000 RBX: ffffffff82541728 RCX:
0000000000000000
[ 51.871184] RDX: ffff88805d0fae70 RSI: ffff88805deaa940 RDI:
0000000000000001
[ 51.871196] RBP: ffff88805be8b720 R08: 0000000000000001 R09:
ffffea0001699900
[ 51.871206] R10: 0000000000000000 R11: 0000000000000001 R12:
ffff88805be8b218
[ 51.871217] R13: ffff88805d0fae68 R14: 0000000000000001 R15:
0000000000000004
[ 51.871237] FS: 00007faa50fac040(0000) GS:ffff88805de00000(0000)
knlGS:0000000000000000
[ 51.871252] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 51.871261] CR2: 0000000000000098 CR3: 000000005c6e6000 CR4:
0000000000002660
[ 51.871275] Call Trace:
[ 51.871285] blk_mq_update_nr_hw_queues+0x2fd/0x380
[ 51.871297] blkfront_resume+0x200/0x3f0
[ 51.871307] xenbus_dev_resume+0x48/0xf0
[ 51.871317] ? xenbus_dev_probe+0x120/0x120
[ 51.871326] dpm_run_callback+0x3c/0x160
[ 51.871336] device_resume+0xce/0x1d0
[ 51.871344] dpm_resume+0x115/0x2f0
[ 51.871352] ? find_watch+0x40/0x40
[ 51.871360] dpm_resume_end+0x8/0x10
[ 51.871370] do_suspend+0xef/0x1b0
[ 51.871378] shutdown_handler+0x123/0x150
[ 51.871387] xenwatch_thread+0xbb/0x160
[ 51.871397] ? wait_woken+0x80/0x80
[ 51.871406] kthread+0xf3/0x130
[ 51.871416] ? kthread_create_worker_on_cpu+0x70/0x70
[ 51.871427] ret_from_fork+0x35/0x40
[ 51.871435] Modules linked in:
[ 51.871443] CR2: 0000000000000098
[ 51.871452] ---[ end trace 84a3a6932d70aa71 ]---
[ 51.871461] RIP: e030:blk_mq_map_swqueue+0x103/0x270
[ 51.871471] Code: 41 39 45 30 76 97 8b 0a 85 c9 74 ed 89 c1 48 c1 e1
04 49 03 8c 24 c0 05 00 00 48 8b 09 42 8b 3c 39 49 8b 4c 24 58 48 8b 0c
f9 <4c> 0f a3 b1 98 00 00 00 72 c5 f0 4c 0f ab b1 98 00 00 00 44 0f b7
[ 51.871491] RSP: e02b:ffffc900008afca8 EFLAGS: 00010282
[ 51.871501] RAX: 0000000000000000 RBX: ffffffff82541728 RCX:
0000000000000000
[ 51.871512] RDX: ffff88805d0fae70 RSI: ffff88805deaa940 RDI:
0000000000000001
[ 51.871523] RBP: ffff88805be8b720 R08: 0000000000000001 R09:
ffffea0001699900
[ 51.871533] R10: 0000000000000000 R11: 0000000000000001 R12:
ffff88805be8b218
[ 51.871545] R13: ffff88805d0fae68 R14: 0000000000000001 R15:
0000000000000004
[ 51.871562] FS: 00007faa50fac040(0000) GS:ffff88805de00000(0000)
knlGS:0000000000000000
[ 51.871573] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 51.871582] CR2: 0000000000000098 CR3: 000000005c6e6000 CR4:
0000000000002660

When I start it on another test dom0, to see whether the direction of
migration matters, it mostly fails to boot with:

[Fri Jan 11 16:16:34 2019] BUG: unable to handle kernel paging request
at ffff88805c61e9f0
[Fri Jan 11 16:16:34 2019] #PF error: [PROT] [WRITE]
[Fri Jan 11 16:16:34 2019] PGD 2410067 P4D 2410067 PUD 2c00067 PMD
5ff26067 PTE 801000005c61e065
[Fri Jan 11 16:16:34 2019] Oops: 0003 [#1] SMP NOPTI
[Fri Jan 11 16:16:34 2019] CPU: 3 PID: 1943 Comm: apt-get Not tainted
5.0.0-rc1 #1
[Fri Jan 11 16:16:34 2019] RIP: e030:move_page_tables+0x669/0x970
[Fri Jan 11 16:16:34 2019] Code: 8a 00 48 8b 03 31 ff 48 89 44 24 18 e8
c6 ab e7 ff 66 90 48 89 c6 48 89 df e8 c3 cc e7 ff 66 90 48 8b 44 24 18
b9 0c 00 00 00 <48> 89 45 00 48 8b 44 24 08 f6 40 52 40 0f 85 69 02 00
00 48 8b 44
[Fri Jan 11 16:16:34 2019] RSP: e02b:ffffc900008c7d70 EFLAGS: 00010282
[Fri Jan 11 16:16:34 2019] RAX: 0000000cb064b067 RBX: ffff88805c61ea58
RCX: 000000000000000c
[Fri Jan 11 16:16:34 2019] RDX: 0000000000000000 RSI: 0000000000000000
RDI: 0000000000000201
[Fri Jan 11 16:16:34 2019] RBP: ffff88805c61e9f0 R08: 0000000000000000
R09: 00000000000260a0
[Fri Jan 11 16:16:34 2019] R10: 0000000000007ff0 R11: ffff88805fd23000
R12: ffffea00017187a8
[Fri Jan 11 16:16:34 2019] R13: ffffea00017187a8 R14: 00007f04e9800000
R15: 00007f04e9600000
[Fri Jan 11 16:16:34 2019] FS: 00007f04ef355100(0000)
GS:ffff88805df80000(0000) knlGS:0000000000000000
[Fri Jan 11 16:16:34 2019] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Jan 11 16:16:34 2019] CR2: ffff88805c61e9f0 CR3: 000000005c5fc000
CR4: 0000000000002660
[Fri Jan 11 16:16:34 2019] Call Trace:
[Fri Jan 11 16:16:34 2019] move_vma.isra.34+0xd1/0x2d0
[Fri Jan 11 16:16:34 2019] __x64_sys_mremap+0x1b3/0x370
[Fri Jan 11 16:16:34 2019] do_syscall_64+0x49/0x100
[Fri Jan 11 16:16:34 2019] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Fri Jan 11 16:16:34 2019] RIP: 0033:0x7f04ee2e227a
[Fri Jan 11 16:16:34 2019] Code: 73 01 c3 48 8b 0d 1e fc 2a 00 f7 d8 64
89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 49 89 ca b8 19
00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ee fb 2a 00 f7 d8
64 89 01 48
[Fri Jan 11 16:16:34 2019] RSP: 002b:00007fffb3da3e38 EFLAGS: 00000246
ORIG_RAX: 0000000000000019
[Fri Jan 11 16:16:34 2019] RAX: ffffffffffffffda RBX: 000056533fa1bf50
RCX: 00007f04ee2e227a
[Fri Jan 11 16:16:34 2019] RDX: 0000000001a00000 RSI: 0000000001900000
RDI: 00007f04e95ac000
[Fri Jan 11 16:16:34 2019] RBP: 0000000001a00000 R08: 2e8ba2e8ba2e8ba3
R09: 0000000000000040
[Fri Jan 11 16:16:34 2019] R10: 0000000000000001 R11: 0000000000000246
R12: 00007f04e95ac060
[Fri Jan 11 16:16:34 2019] R13: 00007f04e95ac000 R14: 000056533fa45d73
R15: 000056534024bd10
[Fri Jan 11 16:16:34 2019] Modules linked in:
[Fri Jan 11 16:16:34 2019] CR2: ffff88805c61e9f0
[Fri Jan 11 16:16:34 2019] ---[ end trace 443702bd9ba5d6b2 ]---
[Fri Jan 11 16:16:34 2019] RIP: e030:move_page_tables+0x669/0x970
[Fri Jan 11 16:16:34 2019] Code: 8a 00 48 8b 03 31 ff 48 89 44 24 18 e8
c6 ab e7 ff 66 90 48 89 c6 48 89 df e8 c3 cc e7 ff 66 90 48 8b 44 24 18
b9 0c 00 00 00 <48> 89 45 00 48 8b 44 24 08 f6 40 52 40 0f 85 69 02 00
00 48 8b 44
[Fri Jan 11 16:16:34 2019] RSP: e02b:ffffc900008c7d70 EFLAGS: 00010282
[Fri Jan 11 16:16:34 2019] RAX: 0000000cb064b067 RBX: ffff88805c61ea58
RCX: 000000000000000c
[Fri Jan 11 16:16:34 2019] RDX: 0000000000000000 RSI: 0000000000000000
RDI: 0000000000000201
[Fri Jan 11 16:16:34 2019] RBP: ffff88805c61e9f0 R08: 0000000000000000
R09: 00000000000260a0
[Fri Jan 11 16:16:34 2019] R10: 0000000000007ff0 R11: ffff88805fd23000
R12: ffffea00017187a8
[Fri Jan 11 16:16:34 2019] R13: ffffea00017187a8 R14: 00007f04e9800000
R15: 00007f04e9600000
[Fri Jan 11 16:16:34 2019] FS: 00007f04ef355100(0000)
GS:ffff88805df80000(0000) knlGS:0000000000000000
[Fri Jan 11 16:16:34 2019] CS: e030 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Jan 11 16:16:34 2019] CR2: ffff88805c61e9f0 CR3: 000000005c5fc000
CR4: 0000000000002660

I can log in over ssh, but a command like ps afxu hangs. Oh, it seems
that 5.0-rc1 is doing this all the time; the next time it happened was
after 500 seconds of uptime.

After an xl destroy and trying again, it boots. The first live migrate
is successful (and no clockevents_switch_state complaints); the second
one explodes in blk_mq_update_nr_hw_queues again.

Hm, ok: as long as I live migrate the 5.0-rc1 domU around between dom0s
with Xen 4.11.1-pre from commit 5acdd26fdc (the one we had in Debian
until yesterday) and Linux 4.19.9 in the dom0, it works. As soon as I
live migrate to the one box running the new Xen 4.11.1 package from
Debian unstable and Linux 4.19.12, I get the blk_mq_update_nr_hw_queues
crash.

If I do the same with 4.19 in the domU, I don't see the
blk_mq_update_nr_hw_queues crash.

Now, back to 4.19.14 + guard_hole + v2, I can't seem to reproduce the
clockevents_switch_state warning any more. I'll take a break and then
try to find out whether I'm doing anything different from earlier today,
when I could reproduce it 100% consistently.

O_o :)

Hans