Re: [PATCH v2 07/10] KVM: x86: Always populate L1 GPA for KVM_MAP_MEMORY

From: Paolo Bonzini
Date: Tue Apr 16 2024 - 17:41:33 EST


On Mon, Apr 15, 2024 at 11:17 PM Sean Christopherson <seanjc@googlecom> wrote:
>
> On Mon, Apr 15, 2024, Rick P Edgecombe wrote:
> > I wouldn't call myself much of an expert on nested, but...
> >
> > On Wed, 2024-04-10 at 15:07 -0700, isaku.yamahata@xxxxxxxxx wrote:
> > > There are several options to populate L1 GPA irrelevant to vCPU mode.
> > > - Switch vCPU MMU only: This patch.
> > > Pros: Concise implementation.
> > > Cons: Heavily dependent on the KVM MMU implementation.
>
> Con: Makes it impossible to support other MMUs/modes without extending the uAPI.

+1.

> The first question to answer is, do we want to return an error or "silently"
> install mappings for !SMM, !guest_mode. And so this option becomes relevant only
> _if_ we want to unconditionally install mappings for the 'base" mode.
>
> > > - Return error on guest mode or SMM mode: Without this patch.
> > > Pros: No additional patch.
> > > Cons: Difficult to use.
> >
> > Hmm... For the non-TDX use cases this is just an optimization, right? For TDX
> > there shouldn't be an issue. If so, maybe this last one is not so horrible.

It doesn't even have to be ABI that it gives an error. As you say,
this ioctl can just be advisory only for !confidential machines. Even
if it were implemented, the shadow MMU can drop roots at any moment
and/or kill the mapping via the shrinker.

That said, I can't fully shake the feeling that this ioctl should be
an error for !TDX and that TDX_INIT_MEM_REGION wasn't that bad. The
implementation was ugly but the API was fine. Sorry about this;
patches 3-5 can still be included in kvm-coco-queue sooner rather than
later.

> And the fact there are so variables to control (MAXPHADDR, SMM, and guest_mode)
> basically invalidates the argument that returning an error makes the ioctl() hard
> to use. I can imagine it might be hard to squeeze this ioctl() into QEMU's
> existing code, but I don't buy that the ioctl() itself is hard to use.

Nah, I don't think so. With TDX it's just MAXPHYADDR; just invoke it
after KVM_SET_CPUID2 or TDX_INIT_VCPU which is very early.

> Literally the only thing userspace needs to do is set CPUID to implicitly select
> between 4-level and 5-level paging. If userspace wants to pre-map memory during
> live migration, or when jump-starting the guest with pre-defined state, simply
> pre-map memory before stuffing guest state. In and of itself, that doesn't seem
> difficult, e.g. at a quick glance, QEMU could add a hook somewhere in
> kvm_vcpu_thread_fn() without too much trouble (though that comes with a huge
> disclaimer that I only know enough about how QEMU manages vCPUs to be dangerous).

Hehe :) the machine_done_notifier is probably a better place. /me
checks... yes it's exactly where Xiaoyao did it (tdx_finalize_vm is
the notifier, it calls KVM_TDX_INIT_VCPU, from tdx_post_init_vcpus and
then KVM_MEMORY_MAPPING).

> I would describe the overall cons for this patch versus returning an error
> differently. Switching MMU state puts the complexity in the kernel. Returning
> an error punts any complexity to userspace. Specifically, anything that KVM can
> do regarding vCPU state to get the right MMU, userspace can do too.
>
> Add on that silently doing things that effectively ignore guest state usually
> ends badly, and I don't see a good argument for this patch (or any variant
> thereof).

Agreed.

Paolo