Re: [PATCH] ptp: Add vDSO-style vmclock support

From: David Woodhouse
Date: Thu Jul 25 2024 - 18:21:34 EST


On Thu, 2024-07-25 at 17:47 -0400, Michael S. Tsirkin wrote:
> On Thu, Jul 25, 2024 at 10:29:18PM +0100, David Woodhouse wrote:
> > Those people included me. I wanted to interrupt all the vCPUs, even the
> > ones which were in userspace at the moment of migration, and have the
> > kernel deal with passing it on to userspace via a different ABI.
> >
> > It ends up being complex and intricate, and requiring a lot of new
> > kernel and userspace support. I gave up on it in the end for snapshots,
> > and didn't go there again for this.
>
> ok I believe you, I am just curious how come you need userspace
> support - what I imagine would live completely in kernel ...

Userspace doesn't even make a system call for gettimeofday() any more;
the relevant information is exposed to userspace through the vDSO.

If userspace needs to know that the time has been disrupted by live
migration (LM), then fundamentally that either needs to be exposed
directly to it as well, or userspace needs to go back to making actual
system calls to get the time (which is slow, and not acceptable for the
same use cases which care about it being accurate).

So how do we make it available in a form that's mappable directly to
userspace?

Well, we could have a hypervisor enlightenment, where the guest kernel
uses an MSR or hypercall to tell the hypervisor "please write the
information to <this> GPA", and provides an address within the vDSO
information page. That isn't nice for Confidential Compute, makes it
hard to allow for expansion in the size of the structure, and is much
more complex to support consistently across different hypervisors and
different architectures.

We *could* attempt to contrive a system where we indeed interrupt *all*
vCPUs and the kernel then updates something in the vDSO page before
running userspace again. That could work in theory and *might* be a bit
simpler than what we were trying to do for VMGENID/snapshots, but it's
still complex, would take an eternity to deploy to actual users, and
would probably never work for non-Linux guests. It also imposes an even
higher cost on the guest kernel when LM occurs.

Or there's this method, where the hypervisor puts it in a shared memory
region which is just a PCI BAR or an ACPI _CRS or attached to virtio
(we really don't care how it's discovered). There's a nit that it now
has to be page sized, and a guest which has larger pages than the
hypervisor expects is going to have to use a small PTE to map it (or
not support that mode). But I think that's reasonable.

Having gone around in circles a few times, I'm fairly sure that
exposing a memory region which the hypervisor updates directly is the
simplest and cleanest way of doing it and getting it in the hands of
users.
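
For illustration only, here is a minimal sketch of how a guest might
take a consistent snapshot of a region that the hypervisor updates in
place. It assumes a seqcount-style protocol, where a counter is odd
while an update is in progress. The struct layout and the names
shared_clock_page, clock_snapshot and read_clock_page are hypothetical
and are not taken from the patch or the device ABI.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout, for illustration only; not the real device ABI. */
struct shared_clock_page {
    _Atomic uint32_t seq;               /* assumed odd while being updated */
    _Atomic uint64_t disruption_marker; /* assumed to change on disruption */
    _Atomic uint64_t time_ns;           /* assumed clock payload */
};

struct clock_snapshot {
    uint64_t disruption_marker;
    uint64_t time_ns;
};

/*
 * Copy a consistent view of the page into *out.
 * Returns false if no stable view was seen within max_retries attempts.
 */
static bool read_clock_page(struct shared_clock_page *page,
                            struct clock_snapshot *out, int max_retries)
{
    for (int attempt = 0; attempt < max_retries; attempt++) {
        uint32_t before = atomic_load_explicit(&page->seq,
                                               memory_order_acquire);
        if (before & 1u)
            continue; /* writer is mid-update; try again */

        uint64_t marker = atomic_load_explicit(&page->disruption_marker,
                                               memory_order_relaxed);
        uint64_t now = atomic_load_explicit(&page->time_ns,
                                            memory_order_relaxed);

        /* Order the payload loads before re-reading the counter. */
        atomic_thread_fence(memory_order_acquire);
        uint32_t after = atomic_load_explicit(&page->seq,
                                              memory_order_relaxed);
        if (after == before) {
            out->disruption_marker = marker;
            out->time_ns = now;
            return true;
        }
    }
    return false;
}
```

The retry bound is a design choice in this sketch: a reader that cannot
get a stable view gives up and lets the caller decide, rather than
spinning forever.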

We're rolling out the AMZNVCLK device for internal use cases, and plan
to add it in public instances some time later. This is the guest driver
which consumes that, and I've separately posted the QEMU patch to
provide the same device, because I absolutely do want this to be
standardised across hypervisors, for the reasons you point out. You're
preaching to the choir there; I even got Microsoft to implement the
same 15-bit MSI extensions that we added to KVM :)

Supporting the disruption signal is the critical part, which allows
applications to abort operations until their clock is good again.
Providing the actual clock information on the new host, so that
applications can keep running immediately, is what I'll be working on
next.
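
As a rough sketch of how an application could use such a signal,
assume a hypothetical 64-bit marker in the shared region that changes
whenever the clock is disrupted. The application records the marker
before a time-sensitive operation and rechecks it before trusting the
result. The names disruption_marker_t, timed_op, timed_op_begin and
timed_op_still_valid are made up for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: a marker the hypervisor changes on every disruption. */
typedef _Atomic uint64_t disruption_marker_t;

struct timed_op {
    uint64_t marker_at_start; /* marker value seen when the op began */
};

/* Record the marker before doing any work that depends on the clock. */
static void timed_op_begin(struct timed_op *op, disruption_marker_t *marker)
{
    op->marker_at_start = atomic_load_explicit(marker, memory_order_acquire);
}

/*
 * Returns true if no disruption has been signalled since timed_op_begin().
 * On false, the caller should abort or retry once its clock is good again.
 */
static bool timed_op_still_valid(const struct timed_op *op,
                                 disruption_marker_t *marker)
{
    return atomic_load_explicit(marker, memory_order_acquire)
           == op->marker_at_start;
}
```

A caller would wrap the operation between the two calls and discard any
timestamps it took if timed_op_still_valid() returns false.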

I'd love virtio-rtc to adopt this structure too, and I've done my best
to ensure that that's feasible, but I can't take a dependency on that
and wait for it (and as discussed, wouldn't use the virtio form in my
environment anyway).

> mutt sucks less ;)

So does 'nc' but Evolution can talk to the corporate Exchange calendar
and email. And I'm used to it and can mostly cope with its quirks :)
