Re: [PATCH v1 1/3] x86/tdx: Check for TDX partitioning during early TDX init

From: Huang, Kai
Date: Thu Dec 07 2023 - 07:59:04 EST

In reply to: Jeremi Piotrowski: "Re: [PATCH v1 1/3] x86/tdx: Check for TDX partitioning during early TDX init"
Next in thread: Jeremi Piotrowski: "Re: [PATCH v1 1/3] x86/tdx: Check for TDX partitioning during early TDX init"

>
> > I think we are lacking background on this usage model and how it works. For
> > instance, typically L2 is created by L1, and L1 is responsible for L2's device
> > I/O emulation. I don't quite understand how L0 could emulate L2's device I/O.
> >
> > Can you provide more information?
>
> Let's differentiate between fast and slow I/O. The whole point of the paravisor in
> L1 is to provide device emulation for slow I/O: TPM, RTC, NVRAM, IO-APIC, serial ports.
>
> But fast I/O is designed to bypass it and go straight to L0. Hyper-V uses paravirtual
> vmbus devices for fast I/O (net/block). The vmbus protocol has awareness of page visibility
> built-in and uses native (GHCI on TDX, GHCB on SNP) mechanisms for notifications. So once
> everything is set up (rings/buffers in swiotlb), the I/O for fast devices does not
> involve L1. This is only possible when the VM manages C-bit itself.

Yeah that makes sense. Thanks for the info.

>
> I think the same thing could work for virtio if someone would "enlighten" vring
> notification calls (instead of I/O or MMIO instructions).
>
> >
> > >
> > > >
> > > > >
> > > > > > What's missing is that the tdx_guest flag is not exposed to userspace in /proc/cpuinfo,
> > > > > and as a result dmesg does not currently display:
> > > > > "Memory Encryption Features active: Intel TDX".
> > > > >
> > > > > That's what I set out to correct.
> > > > >
> > > > > > > So far I see that you are trying to make the kernel think that it runs as a TDX
> > > > > > > guest when it doesn't really. This is not a very convincing model.
> > > > > >
> > > > >
> > > > > No that's not accurate at all. The kernel is running as a TDX guest so I
> > > > > want the kernel to know that.
> > > > >
> > > >
> > > > > But it isn't. It runs on a hypervisor which is a TDX guest, but this doesn't
> > > > > make it a TDX guest itself.
> > >
> > > That depends on your definition of "TDX guest". The TDX 1.5 TD partitioning spec
> > > talks of TDX-enlightened L1 VMM, (optionally) TDX-enlightened L2 VM and Unmodified
> > > Legacy L2 VM. Here we're dealing with a TDX-enlightened L2 VM.
> > >
> > > If a guest runs inside an Intel TDX protected TD, is aware of memory encryption and
> > > issues TDVMCALLs - to me that makes it a TDX guest.
> >
> > The thing I don't quite understand is which enlightenment(s) require L2 to issue
> > TDVMCALLs and know the "encryption bit".
> >
> > The reason that I can think of is:
> >
> > If device I/O emulation of L2 is done by L0 then I guess it's reasonable to make
> > L2 aware of the "encryption bit" because L0 can only write emulated data to
> > shared buffer. The shared buffer must be initially converted by the L2 by using
> > MAP_GPA TDVMCALL to L0 (to zap private pages in S-EPT etc), and L2 needs to know
> > the "encryption bit" to set up its page table properly. L1 must also be aware of
> > such private <-> shared conversions to set up its page table properly, so L1 must
> > be notified as well.
>
> Your description is correct, except that L2 uses a hypercall (hv_mark_gpa_visibility())
> to notify L1 and L1 issues the MAP_GPA TDVMCALL to L0.

In TDX partitioning, IIUC, L1 and L2 use different secure-EPT page tables when
mapping the GPAs of L1 and L2. Therefore, IIUC, the entries in both secure-EPT
tables that map the "to be converted" page need to be zapped.

I am not entirely sure whether using hv_mark_gpa_visibility() is sufficient. If
the MAP_GPA comes from L1, then I am not sure it is easy for L0 to zap the
secure-EPT entry for L2.

But anyway, these are probably details that we don't need to consider here.

>
> C-bit awareness is necessary to set up the whole swiotlb pool to be host visible for
> DMA.

Agreed.

>
> >
> > The concern I have is whether there are other usage models that we need to
> > consider. For instance, running both unmodified L2 and enlightened L2. Or an
> > L2 that only needs TDVMCALL enlightenment but no "encryption bit".
> >
>
> Presumably unmodified L2 and enlightened L2 are already covered by current code but
> require excessive trapping to L1.
>
> I can't see a use case for TDVMCALLs but no "encryption bit".
>
> > In other words, that seems pretty much specific to the L1 hypervisor/paravisor
> > implementation. I am wondering whether we can completely hide the enlightenment
> > logic in hypervisor/paravisor-specific code, rather than generically marking L2
> > as a TDX guest and then still needing to disable TDCALL and similar things.
>
> That's how it currently works - all the enlightenments are in hypervisor/paravisor
> specific code in arch/x86/hyperv and drivers/hv and the vm is not marked with
> X86_FEATURE_TDX_GUEST.

And I believe there's a reason that the VM is not marked as a TDX guest.

>
> But without X86_FEATURE_TDX_GUEST userspace has no unified way to discover that an
> environment is protected by TDX and also the VM gets classified as "AMD SEV" in dmesg.
> This is due to CC_ATTR_GUEST_MEM_ENCRYPT being set but X86_FEATURE_TDX_GUEST not.

Can you provide more information about what _userspace_ does here?

What's the difference if it sees a TDX guest or a normal non-coco guest in
/proc/cpuinfo?
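
To make the question concrete: as far as I can tell, all userspace could do
with this is look for the "tdx_guest" token in the flags line of /proc/cpuinfo.
A minimal sketch of such a check follows. The helper name
cpuinfo_line_has_flag() is made up for illustration and is not an existing
API; the sketch assumes the usual "flags : a b c" layout of that line.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical helper (not an existing API): return true if `flag` appears
 * as a whole, whitespace-separated token in a /proc/cpuinfo "flags" line.
 */
static bool cpuinfo_line_has_flag(const char *line, const char *flag)
{
	assert(line != NULL && flag != NULL);

	size_t flen = strlen(flag);
	const char *p = strchr(line, ':');	/* the flag list follows the colon */

	if (flen == 0)
		return false;
	p = p ? p + 1 : line;

	while ((p = strstr(p, flag)) != NULL) {
		/* Require token boundaries so "tdx_guest_foo" does not match. */
		bool starts = (p == line) || p[-1] == ' ' || p[-1] == '\t' ||
			      p[-1] == ':';
		bool ends = p[flen] == '\0' || p[flen] == ' ' ||
			    p[flen] == '\t' || p[flen] == '\n';

		if (starts && ends)
			return true;
		p += flen;
	}
	return false;
}
```

A caller would read /proc/cpuinfo line by line and pass each line starting
with "flags" to this helper together with "tdx_guest".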

It looks like the whole purpose of this series is to make userspace happy by
advertising the TDX guest flag in /proc/cpuinfo. But if we do that, we get bad
side effects in the kernel, which is why we need the changes in your patch 2/3.
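
For reference, the dmesg classification described in the quoted text above can
be modeled roughly as below. This is a simplified userspace sketch based only
on that description, not the kernel's actual code: the two booleans stand in
for X86_FEATURE_TDX_GUEST and CC_ATTR_GUEST_MEM_ENCRYPT, and the enum and
function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for this sketch; not kernel identifiers. */
enum mem_encrypt_label {
	LABEL_NONE,		/* guest memory encryption not active */
	LABEL_INTEL_TDX,
	LABEL_AMD_SEV,
};

/*
 * tdx_guest_flag models X86_FEATURE_TDX_GUEST,
 * guest_mem_encrypt models CC_ATTR_GUEST_MEM_ENCRYPT.
 */
static enum mem_encrypt_label classify_guest(bool tdx_guest_flag,
					     bool guest_mem_encrypt)
{
	if (!guest_mem_encrypt)
		return LABEL_NONE;
	if (tdx_guest_flag)
		return LABEL_INTEL_TDX;
	/* Encryption attribute set but no TDX flag: reported as AMD. */
	return LABEL_AMD_SEV;
}

/* Text that follows "Memory Encryption Features active:" in this model. */
static const char *label_name(enum mem_encrypt_label label)
{
	assert(label <= LABEL_AMD_SEV);

	switch (label) {
	case LABEL_INTEL_TDX:
		return "Intel TDX";
	case LABEL_AMD_SEV:
		return "AMD SEV";
	default:
		return NULL;
	}
}
```

In this model the Hyper-V case from the quoted text is
classify_guest(false, true), which yields the AMD label even though the VM is
protected by TDX.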

That doesn't seem very convincing. Is there any other mechanism that userspace
can use, e.g., any HV hypervisor/paravisor-specific attributes that are exposed
to userspace?
