Re: [PATCH 0/5] kvm: fix latent guest entry/exit bugs

From: Mark Rutland
Date: Fri Jan 14 2022 - 10:20:12 EST


On Fri, Jan 14, 2022 at 02:51:38PM +0100, Christian Borntraeger wrote:
> Am 14.01.22 um 14:32 schrieb Mark Rutland:
> > On Fri, Jan 14, 2022 at 01:29:46PM +0100, Christian Borntraeger wrote:
> > > Am 14.01.22 um 13:19 schrieb Mark Rutland:
> > > > On Thu, Jan 13, 2022 at 04:20:07PM +0100, Christian Borntraeger wrote:
> > > > > Am 11.01.22 um 16:35 schrieb Mark Rutland:

[...]

> > > > One major thing I wasn't sure about for s390 is the sequence:
> > > >
> > > > guest_enter_irqoff(); // Enters an RCU EQS
> > > > ...
> > > > local_irq_enable();
> > > > ...
> > > > sie64a(...);
> > > > ...
> > > > local_irq_disable();
> > > > ...
> > > > guest_exit_irqoff(); // Exits an RCU EQS
> > > >
> > > > ... since if an IRQ is taken between local_irq_{enable,disable}(), RCU won't be
> > > > watching, and I couldn't spot whether your regular IRQ entry logic would wake
> > > > RCU in this case, or whether there was something else I'm missing that saves
> > > > you here.
> > > >
> > > > For other architectures, including x86 and arm64, we enter the guest with IRQs
> > > > masked and return from the guest with IRQs masked, and don't actually take IRQs
> > > > until we unmask them in the host, after the guest_exit_*() logic has woken RCU
> > > > and so on.
> > > >
> > > > I wasn't able to find documentation on the semantics of SIE, so I couldn't spot
> > > > whether the local_irq_{enable,disable}() calls were necessary, or could be
> > > > removed.
> > >
> > > We run the SIE instruction with interrupts enabled. SIE is interruptible.
> > > The disable/enable pairs are just because guest_enter/exit_irqoff() require them.
> >
> > What I was trying to figure out was when an interrupt is taken between
> > guest_enter_irqoff() and guest_exit_irqoff(), where is RCU woken? I couldn't
> > spot that in the s390 entry code (probably simply because I'm not familiar with
> > it), and so AFAICT that means IRQ code could run without RCU watching, which
> > would cause things to explode.
> >
> > On other architectures that problem is avoided because IRQs asserted during the
> > guest cause a specific guest exit rather than a regular IRQ exception, and the
> > HW enables/disables IRQs when entering/exiting the guest, so the host can leave
> > IRQs masked across guest_enter_irqoff()..guest_exit_irqoff().
> >
> > Am I right in understanding that SIE itself won't enable (host) interrupts
> > while running the guest, and so it *needs* to be run with interrupts already
> > enabled?
>
> yes
>
> > > One thing to be aware of: in our entry.S - after an interrupt - we leave SIE by
> > > setting the return address of the interrupt after the sie instruction so that we
> > > get back into this __vcpu_run loop to check for signals and so.
> >
> > Just to check, that's after the IRQ handler runs, right?
>
> and yes.

Thanks for confirming!

IIUC as above, that means there's a latent RCU bug on s390, and to fix that
we'll need to add something to the IRQ entry logic to wake RCU for any IRQ
taken in the EQS between guest_enter_irqoff() and guest_exit_irqoff(), similar
to what is done for IRQs taken from an idle EQS.

I see s390 uses the common irqentry_{enter,exit}(), so perhaps we could extend
the logic there to check something in addition to is_idle_task()? e.g. add a
noinstr helper to check kvm_running_vcpu, Or add a thread flag that says we're
in this guest EQS.

I also think there is another issue here. When an IRQ is taken from SIE, will
user_mode(regs) always be false, or could it be true if the guest userspace is
running? If it can be true I think tha context tracking checks can complain,
and it *might* be possible to trigger a panic().

In irqentry_enter(), if user_mode(regs) == true, we call
irqentry_enter_from_user_mode -> __enter_from_user_mode(). There we check that
the context is CONTEXT_USER, but IIUC that will be CONTEXT_GUEST at this point.
We also call arch_check_user_regs(), and IIUC this might permit a malicious
guest to trigger a host panic by way of debug_user_asce(), but I may have
misunderstood and that might not be possible.

Thanks,
Mark.