Re: [PATCH] x86/bhi: avoid hardware mitigation for 'spectre_bhi=vmexit'

From: Jon Kohler
Date: Fri Sep 13 2024 - 14:01:50 EST




> On Sep 13, 2024, at 1:33 PM, Pawan Gupta <pawan.kumar.gupta@xxxxxxxxxxxxxxx> wrote:
>
> On Fri, Sep 13, 2024 at 03:51:01PM +0000, Jon Kohler wrote:
>>
>>
>>> On Sep 13, 2024, at 1:28 AM, Chao Gao <chao.gao@xxxxxxxxx> wrote:
>>>
>>> On Thu, Sep 12, 2024 at 09:24:40AM -0700, Pawan Gupta wrote:
>>>> On Thu, Sep 12, 2024 at 03:44:38PM +0000, Jon Kohler wrote:
>>>>>> It is only worth implementing the long sequence in VMEXIT_ONLY mode if it is
>>>>>> significantly better than toggling the MSR.
>>>>>
>>>>> Thanks for the pointer! I hadn’t seen that second sequence. I’ll do measurements on
>>>>> three cases and come back with data from an SPR system.
>>>>> 1. as-is (wrmsr on entry and exit)
>>>>> 2. Short sequence (as a baseline)
>>>>> 3. Long sequence
>>>>
>>
>> Pawan,
>>
>> Thanks for the pointer to the long sequence. I've tested it along with
>> Listing 3 (TSX Abort sequence) using KUT tscdeadline_immed test. TSX
>> abort sequence performs better unless BHI mitigation is off or
>> host/guest spec_ctrl values match, avoiding WRMSR toggling. Having the
>> values match the DIS_S value is easier said than done across a fleet
>> that is already using eIBRS heavily.
>>
>> Test System:
>> - Intel Xeon Gold 6442Y, microcode 0x2b0005c0
>> - Linux 6.6.34 + patches, qemu 8.2
>> - KVM Unit Tests @ latest (17f6f2fd) with tscdeadline_immed + edits:
>> - Toggle spec ctrl before test in main()
>> - Use cpu type SapphireRapids-v2
>>
>> Test string:
>> TESTNAME=vmexit_tscdeadline_immed TIMEOUT=90s MACHINE= ACCEL= taskset -c 26 ./x86/run x86/vmexit.flat \
>> -smp 1 -cpu SapphireRapids-v2,+x2apic,+tsc-deadline -append tscdeadline_immed |grep tscdeadline
>>
>> Test Results:
>> 1. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1: tscdeadline_immed 3878 (WRMSR toggling)
>> 2. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1025: tscdeadline_immed 3153 (no WRMSR toggling)
>> 3. spectre_bhi=vmexit, BHB long sequence, host/guest spec_ctrl=1: tscdeadline_immed 3629 (still better than test 1, penalty only on exit)
>> 4. spectre_bhi=vmexit, TSX abort sequence, host/guest spec_ctrl=1: tscdeadline_immed 3294 (best general purpose performance)
>
> This looks promising.

Thanks! I’ll send out a v2 so you can see how it comes together.
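
For a rough sense of scale, here is a small Python sketch that expresses
each result as overhead relative to test 2 (the no-toggle case). The
numbers are copied from the results above; the dictionary labels and the
helper name are mine, purely for illustration, and lower is assumed to be
better, consistent with test 3 being described as better than test 1.

```python
# Results copied from the quoted test table (tscdeadline_immed, lower is better).
# The labels are shorthand for illustration, not names used anywhere else.
results = {
    "1: bhi=on, WRMSR toggling": 3878,
    "2: bhi=on, no toggling": 3153,
    "3: bhi=vmexit, BHB long sequence": 3629,
    "4: bhi=vmexit, TSX abort": 3294,
    "5: bhi=vmexit, TSX abort, guest=1025": 4011,
}


def overhead_pct(value, baseline):
    """Percent overhead of value relative to baseline."""
    return (value - baseline) / baseline * 100.0


baseline = results["2: bhi=on, no toggling"]
overheads = {name: round(overhead_pct(v, baseline), 1) for name, v in results.items()}

for name, pct in overheads.items():
    print(f"{name:40s} {pct:+5.1f}%")
```

By this measure the TSX abort sequence (test 4) lands within about 4.5% of
the no-toggle case, while WRMSR toggling (test 1) is about 23% above it.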

>
>> 5. spectre_bhi=vmexit, TSX abort sequence, host spec_ctrl=1, guest spec_ctrl=1025: tscdeadline_immed 4011 (needs optimization)
>
> Once QEMU adds support for exposing BHI_CTRL, this is a very likely
> scenario. To optimize this, host needs to have BHI_DIS_S set. We also need
> to account for the case where some guests set BHI_DIS_S and others don't.

QEMU base enablement is only one part of the puzzle. That would mean
a CPU type change (e.g. SapphireRapids-Vxxx), which VMM control planes
(e.g. libvirt) need to pick up, and guest OSes need to pick it up too.

Even then, it isn’t always automatic. Windows, for example, disables its
BHI mitigation by default, requiring an admin to manually modify the
registry to enable it:
https://msrc.microsoft.com/update-guide/vulnerability/CVE-2022-0001

I don’t know offhand whether that uses BHI_DIS_S or just a clearing loop; the
advisory doesn’t say.

The mitigation is disabled by default on Server SKUs: https://support.microsoft.com/en-us/topic/kb4072698-windows-server-and-azure-stack-hci-guidance-to-protect-against-silicon-based-microarchitectural-and-speculative-execution-side-channel-vulnerabilities-2f965763-00e2-8f98-b632-0d96f30c8c8e
It is also disabled by default on Desktop/Client SKUs: https://support.microsoft.com/en-us/topic/kb4073119-windows-client-guidance-for-it-pros-to-protect-against-silicon-based-microarchitectural-and-speculative-execution-side-channel-vulnerabilities-35820a8a-ae13-1299-88cc-357f104f5b11

>
>> In short, there is a significant speedup to be had here.
>>
>> As for test 5, honestly, that is somewhat invalid because it would be
>> dependent on the VMM user space showing BHI_CTRL.
>
> Right.
>
>> QEMU as an example does not do that, so even with latest qemu and latest
>> kernel, guests will still use BHB loop even on SPR++ today, and they
>> could use the TSX loop with this proposed change if the VMM exposes RTM
>> feature.
>
> I did not know that QEMU does not expose CPUID.BHI_CTRL. Chao, could you
> please help getting this feature exposed in QEMU?
>
>> I'm happy to post a V2 patch with my TSX changes, or take any other
>> suggestions here.
>
> With CPUID.BHI_CTRL exposed to guests, this:
>
>> 2. spectre_bhi=on, host spec_ctrl=1025, guest spec_ctrl=1025: tscdeadline_immed 3153 (no WRMSR toggling)
>
> will be the most common case, which is also the best performing. Isn't it
> better to aim for this?

I agree, but I also honestly think this is a very large hill to climb.

This will only happen when the host and guest have full understanding of this
mitigation and the guest reboots to reinitialize.
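
To make the values in the test results concrete: 1025 is 0x401, i.e. IBRS
(bit 0) plus BHI_DIS_S (bit 10) in IA32_SPEC_CTRL, while 1 is IBRS alone.
A minimal Python sketch of the decode and of the "values match, so no
WRMSR" condition follows. The helper names are hypothetical, and the
toggle check is a deliberately simplified model of the behavior described
in this thread, not a copy of the KVM code.

```python
# Bit positions in IA32_SPEC_CTRL (MSR 0x48). Only a few bits relevant to
# this thread are listed.
SPEC_CTRL_BITS = {
    0: "IBRS",
    1: "STIBP",
    2: "SSBD",
    10: "BHI_DIS_S",
}


def decode_spec_ctrl(value):
    """Return the names of the known bits set in value (hypothetical helper)."""
    return [name for bit, name in sorted(SPEC_CTRL_BITS.items()) if value & (1 << bit)]


def needs_wrmsr_toggle(host, guest):
    """Simplified model: the MSR is rewritten around guest entry/exit only
    when the host and guest values differ."""
    return host != guest


print(decode_spec_ctrl(1025))          # ['IBRS', 'BHI_DIS_S']
print(decode_spec_ctrl(1))             # ['IBRS']
print(needs_wrmsr_toggle(1025, 1))     # True, as in test 1
print(needs_wrmsr_toggle(1025, 1025))  # False, as in test 2
```

Test 5 (host 1, guest 1025) is the mirror image of test 1 under this
model: the values differ, so the toggling cost comes back.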

In both enterprise and cloud environments, it may be an extremely long time
before there is full alignment between these two points at a broader fleet level.

Some guests, such as virtual appliances or older operating systems, may
*never* get updated to understand BHI_CTRL. And as I pointed out above for
Windows SKUs, Microsoft just straight up disabled the mitigation by default,
so we’d be imposing a non-trivial tax on them from the outset.