Re: [PATCH v3] x86/speculation, KVM: only IBPB for switch_mm_always_ibpb on vCPU load

From: Jon Kohler
Date: Sat Apr 30 2022 - 10:52:01 EST




> On Apr 30, 2022, at 5:50 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> On Fri, Apr 29, 2022 at 11:23:32PM +0000, Sean Christopherson wrote:
>> The host kernel is protected via RETPOLINE and by flushing the RSB immediately
>> after VM-Exit.
>
> Ah, right.
>
>> I don't know definitively. My guess is that IBPB is far too costly to do on every
>> exit, and so the onus was put on userspace to recompile with RETPOLINE. What I
>> don't know is why it wasn't implemented as an opt-out feature.
>
> Or, we could add the same logic on the exit path as in cond_mitigation()
> and test for LAST_USER_MM_IBPB when the host has selected
> switch_mm_cond_ibpb and thus allows for certain guests to be
> protected...

This is roughly what the v1 of this patch did. *if* we keep it here, to fix this bug
we’d have to bring this logic here for sure.

>
> Although, that use case sounds kinda meh: AFAIU, the attack vector here
> would be, protecting the guest from a malicious kernel. I guess this
> might matter for encrypted guests tho.
>
>> I'll write up the bits I have my head wrapped around.
>
> That's nice, thanks!

Agreed, thx Sean, I really appreciate it.

>> I don't know of any actual examples. But, it's trivially easy to create multiple
>> VMs in a single process, and so proving the negative that no one runs multiple VMs
>> in a single address space is essentially impossible.
>>
>> The container thing is just one scenario I can think of where userspace might
>> actually benefit from sharing an address space, e.g. it would allow backing the
>> image for large number of VMs with a single set of read-only VMAs.
>
> Why I keep harping about this: so let's say we eventually add something
> and then months, years from now we cannot find out anymore why that
> thing was added. We will likely remove it or start wasting time figuring
> out why that thing was added in the first place.
>
> This very questioning keeps popping up almost on a daily basis during
> refactoring so I'd like for us to be better at documenting *why* we're
> doing a certain solution or function or whatever.

This is 100% a fair ask, I appreciate the diligence, as we’ve all been there
on the ‘other side’ of changes to complex areas and spend hours digging on
git history, LKML threads, SDM/APM, and other sources trying to derive
why the heck something is the way it is. I’m 100% game to make sure this
is a good change, so truly thank you for helping hold a high quality bar.

> And this is doubly important when it comes to the hw mitigations because
> if you look at the problem space and all the possible ifs and whens and
> but(t)s, your head will spin in no time.
>
> So I'm really sceptical when there's not even a single actual use case
> to a proposed change.
>
> So Jon said something about oversubscription and a lot of vCPU
> switching. That should be there in the docs as the use case and
> explaining why dropping IBPB during vCPU switches is redundant.
>
>> I truly have no idea, which is part of the reason I brought it up in the first
>> place. I'd have happily just whacked KVM's IBPB entirely, but it seemed prudent
>> to preserve the existing behavior if someone went out of their way to enable
>> switch_mm_always_ibpb.
>
> So let me try to understand this use case: you have a guest and a bunch
> of vCPUs which belong to it. And that guest gets switched between those
> vCPUs and KVM does IBPB flushes between those vCPUs.
>
> So either I'm missing something - which is possible - but if not, that
> "protection" doesn't make any sense - it is all within the same guest!
> So that existing behavior was silly to begin with so we might just as
> well kill it.

Close, its not 1 guest with a bunch of vCPU, its a bunch of guests with
a small amount of vCPUs, thats the small nuance here, which is one of
the reasons why this was hard to see from the beginning.

AFAIK, the KVM IBPB is avoided when switching in between vCPUs
belonging to the same vmcs/vmcb (i.e. the same guest), e.g. you could
have one VM highly oversubscribed to the host and you wouldn’t see
either the KVM IBPB or the switch_mm IBPB. All good.

Reference vmx_vcpu_load_vmcs() and svm_vcpu_load() and the
conditionals prior to the barrier.

However, the pain ramps up when you have a bunch of separate guests,
especially with a small amount of vCPUs per guest, so the switching is more
likely to be in between completely separate guests. Think small servers or
virtual desktops. Thats what the scalability test I described in my previous note
does, which effectively gets us to the point where each and every load is to
a different guest, so we’re firing KVM IBPB each time.

But even then, we’re in agreement that its silly because the switch_mm
takes care of it.

>
>> Yes, or do it iff switch_mm_always_ibpb is enabled to maintain "compability".
>
> Yap, and I'm questioning the even smallest shred of reasoning for having
> that IBPB flush there *at* *all*.
>
> And here's the thing with documenting all that: we will document and
> say, IBPB between vCPU flushes is non-sensical. Then, when someone comes
> later and says, "but but, it makes sense because of X" and we hadn't
> thought about X at the time, we will change it and document it again and
> this way you'll have everything explicit there, how we arrived at the
> current situation and be able to always go, "ah, ok, that's why we did
> it."
>
> I hope I'm making some sense...

Makes sense to me. Let’s wait for Sean’s writeup to clarify and keep
drilling down on this. This “but but” was exactly why we wanted to leave at
least a “shred” behind :) but agreed, we need to double down on documentation

> --
> Regards/Gruss,
> Boris.
>
> https://urldefense.proofpoint.com/v2/url?u=https-3A__people.kernel.org_tglx_notes-2Dabout-2Dnetiquette&d=DwIBaQ&c=s883GpUCOChKOHiocYtGcg&r=NGPRGGo37mQiSXgHKm5rCQ&m=gz30B4PoWGK0UCpgwq_jMKL5LHzM6w-430LAsSEcVYgm5iUvCuCKcbv8amDHDAUu&s=MOviOxGMpK3YkgYmDtVKwgJQ7RcSFTsEAMWUD8W-SoI&e=