We can consider restricting the list of system calls that this hypercall can
execute. The user-space changes for gVisor already keep a list of system calls
that are never executed via this hypercall. For example, sigprocmask is never
executed by this hypercall, because the KVM vCPU has its own signal mask.
Another example is the ioctl syscall, because it can be one of the KVM ioctls.
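As a rough illustration of such a restriction, here is a minimal sketch of a
deny-list check. The names hypercall_denied and hypercall_syscall_allowed are
hypothetical and are not taken from the patches or from gVisor; the list only
contains the two examples mentioned above (sigprocmask, which is
rt_sigprocmask at the syscall level, and ioctl).

```c
#include <stdbool.h>
#include <stddef.h>
#include <sys/syscall.h>

/*
 * Hypothetical deny list: syscalls that must not go through the hypercall.
 * Only the two examples from the text are listed; a real list may differ.
 */
static const long hypercall_denied[] = {
	SYS_rt_sigprocmask,	/* the KVM vCPU has its own signal mask */
	SYS_ioctl,		/* could be one of the KVM ioctls */
};

/* Return true if syscall number nr may be executed via the hypercall. */
static bool hypercall_syscall_allowed(long nr)
{
	size_t i;

	for (i = 0; i < sizeof(hypercall_denied) / sizeof(hypercall_denied[0]); i++) {
		if (hypercall_denied[i] == nr)
			return false;
	}
	return true;
}
```

A caller would check hypercall_syscall_allowed(nr) first and fall back to
another path for any syscall that is denied.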
== Host Ring3/Guest ring0 mixed mode ==
This is how the gVisor KVM platform works right now. There is no separate
hypervisor; the Sentry performs the hypervisor's functions itself. The Sentry
creates a KVM virtual machine instance, sets it up, and handles VMEXITs. As a
result, the Sentry runs in both host ring 3 (hr3) and guest ring 0 (gr0) and
can transparently switch between these two contexts. In this scheme, the
Sentry syscall time is 3600ns when a system call is made from gr0.
The benefit of this approach is that only the first system call triggers a
vmexit, and all subsequent syscalls are executed natively on the host.
But it has downsides:
* Each Sentry system call triggers a full exit to hr3.
* Each vmenter/vmexit requires triggering a signal, which is expensive.
* It cannot support Confidential Computing (SEV-ES/SGX). The Sentry has to be
fully enclosed in a VM to be able to support these technologies.
== Execute system calls from a user-space VMM ==
In this case, the Sentry always runs inside the VM. A syscall handler in gr0
triggers a vmexit to transfer control to the VMM (a user process running in
hr3), the VMM executes the required system call and then transfers control
back to the Sentry. In effect, this implements the suggested hypercall in
user space.
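The VMM side of this round trip can be sketched as follows. This is only an
illustration under stated assumptions: struct sentry_syscall_req and
vmm_handle_syscall_exit are hypothetical names, and the way the request is
passed between the guest and the VMM (shared memory, registers, etc.) is not
specified here. The sketch covers only the step where the VMM executes the
requested system call on behalf of the Sentry and stores the result.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical request layout shared between the Sentry and the VMM. */
struct sentry_syscall_req {
	long nr;	/* syscall number requested by the Sentry */
	long args[6];	/* up to six syscall arguments */
	long ret;	/* result, or -errno on failure (kernel convention) */
};

/*
 * Called by the VMM (running in hr3) after a vmexit caused by the Sentry's
 * syscall handler. Executes the syscall and records the result so it can be
 * handed back to the guest before re-entering the VM.
 */
static void vmm_handle_syscall_exit(struct sentry_syscall_req *req)
{
	long ret;

	assert(req != NULL);

	ret = syscall(req->nr, req->args[0], req->args[1], req->args[2],
		      req->args[3], req->args[4], req->args[5]);

	/* syscall(2) returns -1 and sets errno; convert to -errno. */
	req->ret = (ret == -1) ? -errno : ret;
}
```

The cost of this scheme comes from the transitions around this function: each
request needs a vmexit to hr3 before the call and a vmenter after it.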
The Sentry syscall time is 2100ns in this case.
The new hypercall does the same thing, but without switching to host ring 3.
This reduces the Sentry syscall time to 1000ns.