Re: [PATCH v6 00/12] SVM cleanup and INVPCID feature support
From: Hugh Dickins
Date: Wed Mar 24 2021 - 20:07:17 EST
On Wed, 24 Mar 2021, Borislav Petkov wrote:
> Ok,
>
> some more experimenting that Babu and I did led us to:
>
> ---
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index f5ca15622dc9..259aa4889cad 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -250,6 +250,9 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>  	 */
>  	if (kaiser_enabled)
>  		invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
> +	else
> +		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> +
>  	invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
>  }
>
> applied on the guest kernel which fixes the issue. And let me add Hugh
> who did that PCID stuff at the time. So lemme summarize for Hugh and to
> ask him nicely to sanity-check me. :-)

Just a brief interim note to assure you that I'm paying attention,
but wow, it's a long time since I gave any thought down here!
Trying to page it all back in...

I see no harm in your workaround if it works, but it's not as if
this is a previously untried path: so I'm suspicious how an issue
here with Globals could have gone unnoticed for so long, and need
to understand it better.

Hugh

>
> Basically, you have an AMD host which supports PCID and INVPCID and you
> boot on it a 4.9 guest. It explodes like the panic below.
>
> What fixes it is this:
>
> diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
> index f5ca15622dc9..259aa4889cad 100644
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -250,6 +250,9 @@ static inline void __native_flush_tlb_single(unsigned long addr)
>  	 */
>  	if (kaiser_enabled)
>  		invpcid_flush_one(X86_CR3_PCID_ASID_USER, addr);
> +	else
> +		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
> +
>  	invpcid_flush_one(X86_CR3_PCID_ASID_KERN, addr);
>  }
>
> ---
>
> and the reason why it does, IMHO, is because on AMD, kaiser_enabled is
> false because AMD is not affected by Meltdown, which means, there's no
> user/kernel pagetables split.
>
> And that also means you have global TLB entries, which in turn means
> that __native_flush_tlb_single() needs to flush global TLB entries on
> CPUs with X86_FEATURE_INVPCID_SINGLE by doing an INVLPG in the
> kaiser_enabled=0 case. Ergo, the above hunk.
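
The reasoning in the paragraph above can be sketched as a toy user-space
model. This is not kernel code and it does not touch a real TLB; every
model_* name, the MODEL_PCID_* values and the four-entry table are made
up for illustration. It relies on two documented instruction behaviours:
an individual-address INVPCID leaves global translations alone, while
INVLPG also drops global entries for the page. It further assumes, as
described above, that kernel mappings are global only when
kaiser_enabled is false.

```c
#include <stdbool.h>

/* Toy model only: one cached translation in a pretend TLB. */
struct tlb_entry {
	unsigned long addr;	/* virtual page address */
	unsigned int pcid;	/* PCID the entry was loaded under */
	bool global;		/* mapping had the Global bit set */
	bool valid;
};

#define MODEL_TLB_SIZE	4
#define MODEL_PCID_KERN	0	/* arbitrary values for the model */
#define MODEL_PCID_USER	1

static struct tlb_entry tlb[MODEL_TLB_SIZE];

/* Individual-address INVPCID: skips global entries. */
static void model_invpcid_flush_one(unsigned int pcid, unsigned long addr)
{
	for (int i = 0; i < MODEL_TLB_SIZE; i++)
		if (tlb[i].valid && !tlb[i].global &&
		    tlb[i].pcid == pcid && tlb[i].addr == addr)
			tlb[i].valid = false;
}

/* INVLPG: drops the page for the current PCID and any global entry. */
static void model_invlpg(unsigned int cur_pcid, unsigned long addr)
{
	for (int i = 0; i < MODEL_TLB_SIZE; i++)
		if (tlb[i].valid && tlb[i].addr == addr &&
		    (tlb[i].global || tlb[i].pcid == cur_pcid))
			tlb[i].valid = false;
}

static bool model_tlb_has(unsigned long addr)
{
	for (int i = 0; i < MODEL_TLB_SIZE; i++)
		if (tlb[i].valid && tlb[i].addr == addr)
			return true;
	return false;
}

/* Same shape as the hunk above; with_fix selects the added else branch. */
static void model_flush_tlb_single(bool kaiser_enabled, bool with_fix,
				   unsigned long addr)
{
	if (kaiser_enabled)
		model_invpcid_flush_one(MODEL_PCID_USER, addr);
	else if (with_fix)
		model_invlpg(MODEL_PCID_KERN, addr);

	model_invpcid_flush_one(MODEL_PCID_KERN, addr);
}

/* Load one kernel mapping, flush it, report whether it survived. */
static bool model_stale_after_flush(bool kaiser_enabled, bool with_fix)
{
	const unsigned long addr = 0x1000;

	/* Model assumption: kernel mappings are global iff KAISER is off. */
	tlb[0] = (struct tlb_entry){ .addr = addr, .pcid = MODEL_PCID_KERN,
				     .global = !kaiser_enabled, .valid = true };
	model_flush_tlb_single(kaiser_enabled, with_fix, addr);
	return model_tlb_has(addr);
}
```

In this model, model_stale_after_flush(false, false) leaves the global
entry behind, while model_stale_after_flush(false, true) and
model_stale_after_flush(true, false) both end with the entry gone. It
says nothing about why the path went unnoticed on other hardware.
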
>
> But I might be completely off here thus this note...
>
> Thoughts?
>
> Thx.
>
>
> [ 1.235726] ------------[ cut here ]------------
> [ 1.237515] kernel BUG at /build/linux-dqnRSc/linux-4.9.228/arch/x86/kernel/alternative.c:709!
> [ 1.240926] invalid opcode: 0000 [#1] SMP
> [ 1.243301] Modules linked in:
> [ 1.244585] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.9.0-13-amd64 #1 Debian 4.9.228-1
> [ 1.247657] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
> [ 1.251249] task: ffff909363e94040 task.stack: ffffa41bc0194000
> [ 1.253519] RIP: 0010:[<ffffffff8fa2e40c>] [<ffffffff8fa2e40c>] text_poke+0x18c/0x240
> [ 1.256593] RSP: 0018:ffffa41bc0197d90 EFLAGS: 00010096
> [ 1.258657] RAX: 000000000000000f RBX: 0000000001020800 RCX: 00000000feda3203
> [ 1.261388] RDX: 00000000178bfbff RSI: 0000000000000000 RDI: ffffffffff57a000
> [ 1.264168] RBP: ffffffff8fbd3eca R08: 0000000000000000 R09: 0000000000000003
> [ 1.266983] R10: 0000000000000003 R11: 0000000000000112 R12: 0000000000000001
> [ 1.269702] R13: ffffa41bc0197dcf R14: 0000000000000286 R15: ffffed1c40407500
> [ 1.272572] FS: 0000000000000000(0000) GS:ffff909366300000(0000) knlGS:0000000000000000
> [ 1.275791] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1.278032] CR2: 0000000000000000 CR3: 0000000010c08000 CR4: 00000000003606f0
> [ 1.280815] Stack:
> [ 1.281630] ffffffff8fbd3eca 0000000000000005 ffffa41bc0197e03 ffffffff8fbd3ecb
> [ 1.284660] 0000000000000000 0000000000000000 ffffffff8fa2e835 ccffffff8fad4326
> [ 1.287729] 1ccd0231874d55d3 ffffffff8fbd3eca ffffa41bc0197e03 ffffffff90203844
> [ 1.290852] Call Trace:
> [ 1.291782] [<ffffffff8fbd3eca>] ? swap_entry_free+0x12a/0x300
> [ 1.294900] [<ffffffff8fbd3ecb>] ? swap_entry_free+0x12b/0x300
> [ 1.297267] [<ffffffff8fa2e835>] ? text_poke_bp+0x55/0xe0
> [ 1.299473] [<ffffffff8fbd3eca>] ? swap_entry_free+0x12a/0x300
> [ 1.301896] [<ffffffff8fa2b64c>] ? arch_jump_label_transform+0x9c/0x120
> [ 1.304557] [<ffffffff9073e81f>] ? set_debug_rodata+0xc/0xc
> [ 1.306790] [<ffffffff8fb81d92>] ? __jump_label_update+0x72/0x80
> [ 1.309255] [<ffffffff8fb8206f>] ? static_key_slow_inc+0x8f/0xa0
> [ 1.311680] [<ffffffff8fbd7a57>] ? frontswap_register_ops+0x107/0x1d0
> [ 1.314281] [<ffffffff9077078c>] ? init_zswap+0x282/0x3f6
> [ 1.316547] [<ffffffff9077050a>] ? init_frontswap+0x8c/0x8c
> [ 1.318784] [<ffffffff8fa0223e>] ? do_one_initcall+0x4e/0x180
> [ 1.321067] [<ffffffff9073e81f>] ? set_debug_rodata+0xc/0xc
> [ 1.323366] [<ffffffff9073f08d>] ? kernel_init_freeable+0x16b/0x1ec
> [ 1.325873] [<ffffffff90011d50>] ? rest_init+0x80/0x80
> [ 1.327989] [<ffffffff90011d5a>] ? kernel_init+0xa/0x100
> [ 1.330092] [<ffffffff9001f424>] ? ret_from_fork+0x44/0x70
> [ 1.332311] Code: 00 0f a2 4d 85 e4 74 4a 0f b6 45 00 41 38 45 00 75 19 31 c0 83 c0 01 48 63 d0 49 39 d4 76 33 41 0f b6 4c 15 00 38 4c 15 00 74 e9 <0f> 0b 48 89 ef e8 da d6 19 00 48 8d bd 00 10 00 00 48 89 c3 e8
> [ 1.342818] RIP [<ffffffff8fa2e40c>] text_poke+0x18c/0x240
> [ 1.345859] RSP <ffffa41bc0197d90>
> [ 1.347285] ---[ end trace 0a1c5ab5eb16de89 ]---
> [ 1.349169] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [ 1.349169]
> [ 1.352885] Kernel Offset: 0xea00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [ 1.357039] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000000b
> [ 1.357039]
>
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
>