Re: [RFC PATCH 0/3] KVM: Introduce "VM bugged" concept

From: Marc Zyngier
Date: Fri Sep 25 2020 - 12:33:05 EST


Hi Sean,

On Wed, 23 Sep 2020 23:45:27 +0100,
Sean Christopherson <sean.j.christopherson@xxxxxxxxx> wrote:
>
> This series introduces a concept we've discussed a few times in x86 land.
> The crux of the problem is that x86 has a few cases where KVM could
> theoretically encounter a software or hardware bug deep in a call stack
> without any sane way to propagate the error out to userspace.
>
> Another use case would be for scenarios where letting the VM live will
> do more harm than good, e.g. we've been using KVM_BUG_ON for early TDX
> enabling as botching anything related to secure paging all but guarantees
> there will be a flood of WARNs and error messages because lower level PTE
> operations will fail if an upper level operation failed.
>
> The basic idea is to WARN_ONCE if a bug is encountered, kick all vCPUs out
> to userspace, and mark the VM as bugged so that no ioctls() can be issued
> on the VM or its devices/vCPUs.
>
> RFC as I've done nowhere near enough testing to verify that rejecting the
> ioctls(), evicting running vCPUs, etc... works as intended.
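
To make sure I'm reading the above correctly, here is a toy userspace
sketch of the flow as I understand it: warn once, flag the VM, kick
every vCPU, and reject later ioctls. Everything in it is hypothetical
and is not taken from the patches. The toy_* names, the NR_VCPUS value
and the -EIO return code are my assumptions. The single static "warned"
flag is also cruder than the kernel's WARN_ONCE, which warns once per
call site.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

#define NR_VCPUS 4	/* arbitrary, for illustration only */

/* Hypothetical stand-ins for struct kvm_vcpu and struct kvm. */
struct toy_vcpu {
	bool exit_requested;	/* vCPU must go back to userspace */
};

struct toy_vm {
	bool bugged;		/* set once, never cleared */
	struct toy_vcpu vcpus[NR_VCPUS];
};

/* Warn once, mark the VM as bugged, and kick every vCPU. */
static void toy_vm_bugged(struct toy_vm *vm, const char *why)
{
	static bool warned;	/* one global flag, unlike WARN_ONCE */

	if (!warned) {
		warned = true;
		fprintf(stderr, "WARNING: VM bugged: %s\n", why);
	}

	vm->bugged = true;
	for (int i = 0; i < NR_VCPUS; i++)
		vm->vcpus[i].exit_requested = true;
}

/* Returns cond so that callers can write: if (toy_bug_on(...)) return; */
static bool toy_bug_on(struct toy_vm *vm, bool cond, const char *why)
{
	if (cond)
		toy_vm_bugged(vm, why);
	return cond;
}

/* Every ioctl entry point checks the flag before doing any work. */
static int toy_vm_ioctl(struct toy_vm *vm, unsigned int cmd)
{
	(void)cmd;		/* command dispatch elided in this sketch */

	if (vm->bugged)
		return -EIO;	/* assumed error code */
	return 0;
}
```

Note that nothing here covers teardown. The flag only gates entry
points, which is what prompts my questions below.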

I quite like the idea. However, I wonder whether preventing the
vcpus from re-entering the guest is enough. When something goes really
wrong, is it safe to allow the userspace process to terminate normally
and free the associated memory? And is it still safe to allow new VMs
to be started?

I can't really imagine a case where such extreme measures would be
necessary on arm64, but I thought I'd ask.

Thanks,

M.

--
Without deviation from the norm, progress is not possible.