RE: [PATCH] x86/hyper-v: guard against cpu mask changes in hyperv_flush_tlb_others()

From: Michael Kelley
Date: Fri Aug 06 2021 - 06:43:33 EST


From: David Moses <mosesster@xxxxxxxxx> Sent: Friday, August 6, 2021 2:20 AM

> Hi Michael,
> We are running kernel 4.19.195 (the fix Wei Liu suggested, moving the
> cpumask_empty check after disabling interrupts, is included in this version)
> with the default Hyper-V version.
> I'm getting the 4-byte garbage read (trace included) almost every night.
> We are running on Azure Standard D64s_v4 VMs with 64 cores (our system includes
> three such VMs). The application generates very heavy I/O traffic involving iSCSI.
> We believe this issue causes the stack corruption in the RT scheduler that I
> forwarded in the previous mail.
>
> Let us know what else is needed to clarify the problem.
> Is it just Hyper-V related, or could it be a general kernel issue?
>
> Thx David
>
> Moreover, when I add the patch/fix below
>
> diff --git a/arch/x86/include/asm/mshyperv.h b/arch/x86/include/asm/mshyperv.h
> index 5b58a6c..165727a 100644
> --- a/arch/x86/include/asm/mshyperv.h
> +++ b/arch/x86/include/asm/mshyperv.h
> @@ -298,6 +298,9 @@ static inline struct hv_vp_assist_page *hv_get_vp_assist_page(unsigned int cpu)
> */
> static inline int hv_cpu_number_to_vp_number(int cpu_number)
> {
> +       if (WARN_ON_ONCE(cpu_number < 0 || cpu_number >= num_possible_cpus()))
> +               return VP_INVAL;
> +
>         return hv_vp_index[cpu_number];
> }
>
> we have evidence that the new WARN_ON_ONCE() check is hit.
>
> See below:
> Aug  5 21:03:01 c-node11 kernel: [17147.089261] WARNING: CPU: 15 PID: 8973 at arch/x86/include/asm/mshyperv.h:301 hyperv_flush_tlb_others+0x1f7/0x760
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] RIP: 0010:hyperv_flush_tlb_others+0x1f7/0x760
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] Code: ff ff be 40 00 00 00 48 89 df e8 c4 ff 3a 00
> 85 c0 48 89 c2 78 14 48 8b 3d be 52 32 01 f3 48 0f b8 c7 39 c2 0f 82 7e 01 00 00 <0f> 0b ba ff ff ff ff
> 89 d7 48 89 de e8 68 87 7d 00 3b 05 66 54 32
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] RSP: 0018:ffff8c536bcafa38 EFLAGS: 00010046
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] RAX: 0000000000000040 RBX: ffff8c339542ea00 RCX: ffffffffffffffff
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] RDX: 0000000000000040 RSI: ffffffffffffffff RDI: ffffffffffffffff
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] RBP: ffff8c339878b000 R08: ffffffffffffffff R09: ffffe93ecbcaa0e8
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] R10: 00000000020e0000 R11: 0000000000000000 R12: ffff8c536bcafa88
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] R13: ffffe93efe1ef980 R14: ffff8c339542e600 R15: 00007ffcbc390000
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] FS:  00007fcb8eae37a0(0000) GS:ffff8c339f7c0000(0000) knlGS:0000000000000000
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] CR2: 000000000135d1d8 CR3: 0000004037137005 CR4: 00000000003606e0
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Aug  5 21:03:01 c-node11 kernel: [17147.089275] Call Trace:
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  flush_tlb_mm_range+0xc3/0x120
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  ptep_clear_flush+0x3a/0x40
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  wp_page_copy+0x2e6/0x8f0
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  ? reuse_swap_page+0x13d/0x390
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  do_wp_page+0x99/0x4c0
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  __handle_mm_fault+0xb4e/0x12c0
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  ? memcg_kmem_get_cache+0x76/0x1a0
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  handle_mm_fault+0xd6/0x200
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  __get_user_pages+0x29e/0x780
> Aug  5 21:03:01 c-node11 kernel: [17147.089275]  get_user_pages_remote+0x12c/0x1b0

(FYI -- email to the Linux kernel mailing lists should be in plaintext format, and
not use HTML or other formatting.)

This is an excellent experiment. It certainly suggests that the cpumask that is
passed to hyperv_flush_tlb_others() has bits set for CPUs numbered 64 or higher,
which don't exist on a 64-CPU VM. If that's the case, it would seem to be a general
kernel issue rather than something specific to Hyper-V.

Since it looks like you are able to add debugging code to the kernel, here are a
couple of thoughts:

1) In hyperv_flush_tlb_others() after the call to disable interrupts, check the value
of cpumask_last(cpus), and if it is greater than or equal to num_possible_cpus(),
execute a printk() statement that outputs the entire contents of the cpumask that is
passed in. There's a special printk format string for printing out bitmaps like
cpumasks ("%*pb" or "%*pbl", used with cpumask_pr_args()). Let me know if you would
like some help on this code -- I can provide a diff later today. Seeing what the
"bad" cpumask looks like might give some clues as to the problem.

2) As a different experiment, you can disable the Hyper-V specific flush routines
entirely. At the end of the mmu.c source file, have hyperv_setup_mmu_ops()
always return immediately. In this case, the generic Linux kernel flush routines
will be used instead of the Hyper-V ones. The code may be marginally slower,
but it will then be interesting to see if a problem shows up elsewhere.
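
To make the check in (1) concrete, here is a rough userspace model of the logic.
It is only a sketch and not kernel code: every model_* name is hypothetical and
made up for illustration. In the kernel, the corresponding pieces would be
cpumask_last(), num_possible_cpus(), and a printk() of the mask. The sketch
assumes the find_last_bit() convention, where an empty mask yields the bitmap
size rather than a valid index, so an empty mask is flagged as well.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Userspace model only; all model_* names are hypothetical. */
#define MODEL_BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/* Set bit 'bit' in the bitmap. */
static void model_mask_set(unsigned long *mask, unsigned int bit)
{
	assert(mask != NULL);
	mask[bit / MODEL_BITS_PER_LONG] |= 1UL << (bit % MODEL_BITS_PER_LONG);
}

/*
 * Return the index of the highest set bit among the first 'nbits' bits,
 * or 'nbits' if no bit is set (assumed to mirror find_last_bit()).
 */
static unsigned int model_mask_last(const unsigned long *mask,
				    unsigned int nbits)
{
	unsigned int i = nbits;

	assert(mask != NULL);
	while (i-- > 0) {
		if (mask[i / MODEL_BITS_PER_LONG] &
		    (1UL << (i % MODEL_BITS_PER_LONG)))
			return i;
	}
	return nbits;
}

/* Dump the whole bitmap as hex words, most significant word first. */
static void model_mask_print(const unsigned long *mask, unsigned int nbits)
{
	size_t words = (nbits + MODEL_BITS_PER_LONG - 1) / MODEL_BITS_PER_LONG;
	size_t w;

	printf("suspect cpumask:");
	for (w = words; w-- > 0; )
		printf(" %0*lx", (int)(sizeof(unsigned long) * 2), mask[w]);
	printf("\n");
}

/*
 * Return 1 and print the mask if its last set bit is at or beyond
 * 'possible_cpus' (this also catches an empty mask); return 0 otherwise.
 */
static int model_check_mask(const unsigned long *mask, unsigned int nbits,
			    unsigned int possible_cpus)
{
	if (model_mask_last(mask, nbits) >= possible_cpus) {
		model_mask_print(mask, nbits);
		return 1;
	}
	return 0;
}
```

With a 128-bit bitmap and 64 possible CPUs, a mask whose highest set bit is 63
passes the check, while an empty mask or one with bit 64 set is reported and
dumped.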

But based on your experiment, I'm guessing that there's a general kernel issue
rather than something specific to Hyper-V.

Have you run 4.19 kernels previous to 4.19.195 that didn't have this problem? If
you have a kernel version that is good, the ultimate step would be to do
a bisect and find out where the problem was introduced in the 4.19-series. That
could take a while, but it would almost certainly identify the problematic
code change and would be beneficial to the Linux kernel community in
general.

Michael