Re: [RFC KVM 00/27] KVM Address Space Isolation

From: Jan Setje-Eilers
Date: Tue May 14 2019 - 17:36:14 EST



On 5/14/19 12:37 AM, Peter Zijlstra wrote:
> On Mon, May 13, 2019 at 07:07:36PM -0700, Andy Lutomirski wrote:
> > On Mon, May 13, 2019 at 2:09 PM Liran Alon <liran.alon@xxxxxxxxxx> wrote:
> > > The hope is that the very vast majority of #VMExit handlers will be
> > > able to run completely without requiring a switch to the full
> > > address space, therefore avoiding the performance hit of (2).
> > > However, for the very few #VMExits that do require the full kernel
> > > address space, we must first kick the sibling hyperthread out of
> > > the guest and only then switch to the full kernel address space;
> > > only once all hyperthreads have returned to the KVM address space
> > > do we allow them to re-enter the guest.
> > What exactly does "kick" mean in this context? It sounds like you're
> > going to need to be able to kick sibling VMs from extremely atomic
> > contexts like NMI and MCE.
> Yeah, doing the full synchronous thing from NMI/MCE context sounds
> exceedingly dodgy, however..
>
> Realistically they only need to send an IPI to the other sibling; they
> don't need to wait for the VMExit to complete or anything else.
>
> And that is something we can do from NMI context -- with a bit of care.
> See also arch_irq_work_raise(); specifically we need to ensure we leave
> the APIC in an idle state, such that if we interrupted an APIC sequence
> it will not suddenly fail/violate the APIC write/state etc.

I've been experimenting with IPI'ing siblings on vmexit, primarily because we know we'll need it if ASI turns out to be viable, but also because I wanted to understand why previous experiments resulted in such poor performance.
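For reference, the kick itself is roughly the following shape. This is only a sketch to make the discussion concrete -- KICK_VECTOR and the helper name are made up, not code from the series -- but it follows the arch_irq_work_raise() pattern Peter points at: raise the IPI, then leave the ICR idle so that an APIC sequence interrupted by this NMI still finds the APIC in the state it expects.

#include <asm/apic.h>

/* KICK_VECTOR is a hypothetical, as-yet-unallocated vector. */
static void kick_sibling_from_nmi(int sibling_cpu)
{
        /* Raise the kick on the sibling; do not wait for its VMExit. */
        apic->send_IPI(sibling_cpu, KICK_VECTOR);

        /*
         * Leave the ICR idle again, as arch_irq_work_raise() does, in
         * case this NMI interrupted an APIC write sequence.
         */
        apic_wait_icr_idle();
}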

You're correct that you don't need to wait for the sibling to come out once you send the IPI. That hardware thread will not do anything other than process the IPI once it's sent. There is still some need for synchronization, at least for the every-vmexit case, since you always want to make sure that one thread is actually doing work while the other one is held. I have this working for some cases, but not enough to call it a general solution. I'm not at all sure that the every-vmexit case can be made to perform for the general case. Even the non-general case uses synchronization that I fear might be overly complex.
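The synchronization is essentially a per-sibling hold flag. Again a sketch with hypothetical names, showing the shape rather than an implementation: the holder raises the flag and kicks, and the sibling spins in the kick handler until released, so the sender never waits on the sibling's exit itself.

#include <linux/atomic.h>
#include <linux/percpu.h>
#include <asm/apic.h>

static DEFINE_PER_CPU(atomic_t, sibling_hold);

/* Called by the thread that needs the full kernel address space. */
static void hold_sibling(int sibling_cpu)
{
        atomic_set(per_cpu_ptr(&sibling_hold, sibling_cpu), 1);
        apic->send_IPI(sibling_cpu, KICK_VECTOR);  /* hypothetical vector */
}

static void release_sibling(int sibling_cpu)
{
        atomic_set(per_cpu_ptr(&sibling_hold, sibling_cpu), 0);
}

/* Runs on the sibling in response to KICK_VECTOR. */
static void sibling_kick_handler(void)
{
        atomic_t *hold = this_cpu_ptr(&sibling_hold);

        while (atomic_read(hold))
                cpu_relax();
}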

For the cases I do have working, simply not pinning the sibling when we exit due to the guest idling is a big enough win to put performance into a much more reasonable range.
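Concretely, that filter can be as simple as checking the basic exit reason before deciding to hold the sibling at all. Which exit reasons are actually safe to skip is exactly the open question, so the set below is illustrative only:

#include <asm/vmx.h>

/* Illustrative only: the real safe-to-skip set is the open question. */
static bool exit_needs_sibling_hold(u32 exit_reason)
{
        switch (exit_reason) {
        case EXIT_REASON_HLT:
        case EXIT_REASON_PAUSE_INSTRUCTION:
                return false;   /* guest is idling */
        default:
                return true;
        }
}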

Based on this, I believe that pinning a sibling HT in a subset of cases, when we interact with the full kernel address space, is almost certainly reasonable.

-jan