Re: [PATCH v1 4/5] KVM: Introduce KVM_EXIT_COCO exit type

From: Binbin Wu
Date: Fri Jul 26 2024 - 03:15:35 EST




On 6/29/2024 8:36 AM, Michael Roth wrote:
On Fri, Jun 28, 2024 at 01:08:19PM -0700, Sean Christopherson wrote:
On Wed, Jun 26, 2024, Michael Roth wrote:
On Wed, Jun 26, 2024 at 07:22:43AM -0700, Sean Christopherson wrote:
On Fri, Jun 21, 2024, Michael Roth wrote:
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index ecfa25b505e7..2eea9828d9aa 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -7122,6 +7122,97 @@ Please note that the kernel is allowed to use the kvm_run structure as the
primary storage for certain register types. Therefore, the kernel may use the
values in kvm_run even if the corresponding bit in kvm_dirty_regs is not set.
+::
+
+ /* KVM_EXIT_COCO */
+ struct kvm_exit_coco {
+ #define KVM_EXIT_COCO_REQ_CERTS 0
+ #define KVM_EXIT_COCO_MAX 1
+ __u8 nr;
+ __u8 pad0[7];
+ union {
+ struct {
+ __u64 gfn;
+ __u32 npages;
+ #define KVM_EXIT_COCO_REQ_CERTS_ERR_INVALID_LEN 1
+ #define KVM_EXIT_COCO_REQ_CERTS_ERR_GENERIC (1 << 31)
Unless I'm mistaken, these error codes are defined by the GHCB, which means the
values matter, i.e. aren't arbitrary KVM-defined values.
They do happen to coincide with the GHCB-defined values:

/*
* The GHCB spec only formally defines INVALID_LEN/BUSY VMM errors, but define
* a GENERIC error code such that it won't ever conflict with GHCB-defined
* errors if any get added in the future.
*/
#define SNP_GUEST_VMM_ERR_INVALID_LEN 1
#define SNP_GUEST_VMM_ERR_BUSY 2
#define SNP_GUEST_VMM_ERR_GENERIC BIT(31)

and not totally by accident. But the KVM_EXIT_COCO_REQ_CERTS_ERR_* are
defined/documented without any reliance on the GHCB spec and are purely
KVM-defined. I just didn't really see any reason to pick different
numerical values since it seems like purposely obfuscating things for
For SNP. For other vendors, the numbers look bizarre, e.g. why bit 31? And the
fact that it appears to be a mask is even more odd.
That's fair. Values 1 and 2 made sense so just re-use, but that results
in a awkward value for _GENERIC that's not really necessary for the KVM
side.

no real reason. But the code itself doesn't rely on them being the same
as the spec defines, so we are free to define these however we'd like as
far as the KVM API goes.
I forget exactly what we discussed in PUCK, but for the error codes, I think KVM
should either define it's own values that are completely disconnected from any
"harware" spec, or KVM should very explicitly #define all hardware values and have
I'd gotten the impression that option 1) is what we were sort of leaning
toward, and that's the approach taken here.
And if we expose things selectively to keep the ABI small, it's a bit
awkward too. For instance, KVM_EXIT_COCO_REQ_CERTS_ERR_* basically needs
a way to indicate success/fail/ENOMEM. Which we have with
(assuming 0==success):

#define KVM_EXIT_COCO_REQ_CERTS_ERR_INVALID_LEN 1
#define KVM_EXIT_COCO_REQ_CERTS_ERR_GENERIC (1 << 31)

But the GHCB also defines other values like:

#define SNP_GUEST_VMM_ERR_BUSY 2

which don't make much sense to handle on the userspace side and doesn't
Why not? If userspace is waiting on a cert update for whatever reason, why can't
it signal "busy" to the guest?
My thinking was that userspace is free to take it's time and doesn't need
to report delays back to KVM. But it would reduce the potential for
soft-lockups in the guest, so it might make sense to work that into the
API.

But more to original point, there could be something added in the future
that really has nothing to do with anything involving KVM<->userspace
interaction and so would make no sense to expose to userspace.
Unfortunately I picked a bad example. :)

really have anything to do with the KVM_EXIT_COCO_REQ_CERTS KVM event,
which is a separate/self-contained thing from the general guest request
protocol. So would we expose that as ABI or not? If not then we end up
with this weird splitting of code. And if yes, then we have to sort of
give userspace a way to discover whenever new error codes are added to
the GHCB spec, because KVM needs to understand these value too and
Not necessarily. So long as KVM doesn't need to manipulate guest state, e.g. to
set RBX (or whatever reg it is) for ERR_INVALID_LEN, then KVM doesn't need to
care/know about the error codes. E.g. userspace could signal VMM_BUSY and KVM
would happily pass that to the guest.
But given we already have an exception to that where KVM does need to
intervene for certain errors codes like ERR_INVALID_LEN that require
modifying guest state, it doesn't seem like a good starting position
to have to hope that it doesn't happen again.

It just doesn't seem necessary to put ourselves in a situation where
we'd need to be concerned by that at all. If the KVM API is a separate
and fairly self-contained thing then these decisions are set in stone
until we want to change it and not dictated/modified by changes to
anything external without our explicit consideration.

I know the certs things is GHCB-specific atm, but when the certs used
to live inside the kernel the KVM_EXIT_* wasn't needed at all, so
that's why I see this as more of a KVM interface thing rather than
a GHCB one. And maybe eventually some other CoCo implementation also
needs some interface for fetching certificates/blobs from userspace
and is able to re-use it still because it's not too SNP-specific
and the behavior isn't dictated by the GHCB spec (e.g.
ERR_INVALID_LEN might result in some other state needing to be
modified in their case rather than what the GHCB dictates.)

TDX GHCI does have a similar PV interface for TDX guest to get quota, i.e.,
TDG.VP.VMCALL<GetQuote>.  This GetQuote PV interface is designed to invoke
a request to generate a TD-Quote signing by a service hosting TD-Quoting
Enclave operating in the host environment for a TD Report passed as a
parameter by the TD.
And the request will be forwarded to userspace for handling.

So like GHCB, TDX needs to pass a shared buffer to userspace, which is
specified by GPA and size (4K aligned) and get the error code from
userspace and forward the error code to guest.

But there are some differences from GHCB interface.
1. TDG.VP.VMCALL<GetQuote> is a a doorbell-like interface used to queue a
   request. I.e., it is an asynchronous request.  The error code represents
   the status of request queuing, *not* the status of TD Quote generation..
2. Besides the error code returned by userspace for GetQuote interface, the
   GHCI spec defines a "Status Code" field in the header of the shared buffer.
   The "Status Code" field is also updated by VMM during the real handling of
   getting quote (after TDG.VP.VMCALL<GetQuote> returned to guest).
   After the TDG.VP.VMCALL<GetQuote> returned and back to TD guest, the TD
   guest can poll the "Status Code" field to check if the processing is
   in-flight, succeeded or failed.
   Since the real handling of getting quota is happening in userspace, and
   it will interact directly with guest, for TDX, it has to expose TDX
   specific error code to userspace to update the result of quote generation.

Currently, TDX is about to add a new TDX specific KVM exit reason, i.e.,
KVM_EXIT_TDX_GET_QUOTE and its related data structure based on a previous
discussion. https://lore.kernel.org/kvm/Zg18ul8Q4PGQMWam@xxxxxxxxxx/
For the error code returned by userspace, KVM simply forward the error code
to guest without further translation or handling.

I am neutral to have a common KVM exit reason to handle both GHCB for
REQ_CERTS and GHCI for GET_QUOTE.  But for the error code, can we uses vendor
specific error codes if KVM cares about the error code returned by userspace
in vendor specific complete_userspace_io callback?

BTW, here is the plan of 4 hypercalls needing to exit to userspace for
TDX basic support series:
TDG.VP.VMCALL<SetupEventNotifyInterrupt>
- Add a new KVM exit reason KVM_EXIT_TDX_SETUP_EVENT_NOTIFY
TDG.VP.VMCALL<GetQuote>
- Add a new KVM exit reason KVM_EXIT_TDX_GET_QUOTE
TDG.VP.VMCALL<MapGPA>
- Reuse KVM_EXIT_HYPERCALL with KVM_HC_MAP_GPA_RANGE
TDG.VP.VMCALL<ReportFatalError>
- Reuse KVM_EXIT_SYSTEM_EVENT but add a new type
  KVM_SYSTEM_EVENT_TDX_FATAL_ERROR