Re: vm events, userspace, the vmgenid driver, and the future [was: the uevent revert thread]

From: Alexander Graf
Date: Wed Sep 18 2024 - 19:02:33 EST



On 19.09.24 00:27, Jason A. Donenfeld wrote:
[broadened subject line and added relevant parties to cc list]

On Tue, Sep 17, 2024 at 10:55:20PM +0200, Alexander Graf wrote:
What is still open are user space applications that require event based
notification on VM clone events - and *only* VM clone events. This
mostly caters for tools like systemd which need to execute policy - such
as regenerating random MAC addresses - in the event a VM was
cloned.

That's the use case this patch "vmgenid: emit uevent when VMGENID
updates" is about and I think the best path forward is to just revert
the revert. A uevent from the device driver is a well established, well
fitting Linux mechanism for that type of notification.
The thing that worries me is that vmgenid is just some weird random
Microsoft ACPI driver. It's one sort of particular device, and not a
very good one at that. There's still room for virtio/qemu to improve on
it with their own thing, or for vbox or whatever else to have their
version, and xen theirs, and so forth. That is to say, I'm not sure that
this virtual hardware is *the* way of doing it.


I agree, but given that it's been a few years and nobody else has really come up with a different device, the current semantics for the scope of what the device is doing seem to be close to "good enough". So I don't expect a lot of innovation here. And if there is innovation - as you point out - it will bring different semantics that will then also require user space changes anyway.


Even in terms of the entropy stuff (which I know you no longer care
about, but I do), mst's original virtio-rng draft mentioned reporting
events beyond just VM forks, extending it generically to any kind of
entropy reduction situation. For example, migration or suspend or
whatever might be interesting things to trigger. Heck, one could imagine
those coming through vmgenid at some point, which would then change the
semantics you're after for systemd.


If they come through vmgenid, it would need to gain a new type of event at which point the uevent notification would also change.

I'm also not sure why live migration would trigger either a VM clone or any rng relevant event. And suspend is something we already have the machinery to detect.


Even in terms of reporting exclusively about external VM events, there's
a subtle thing to consider between clones/forks and rollbacks, as well
as migrations. Vmgenid kind of lumps it all together, and hopefully the


It's the opposite: VMGenID is exclusively concerned with clones. It doesn't care about rollbacks. It doesn't care about migrations. Its value effectively changes when you clone a VM, and only then.


hypervisor notifies in a way consistent with what userspace was hoping
to learn about. (Right now, maybe we're doing what Hyper-V does, maybe,
but also maybe not; it's kind of loose.) So at some point, there's a
question about the limitations of vmgenid and the possible extensions of
it, or whether this will come in a different driver or virtual hardware,
and how.


To me a lot of this is too vague to be actionable. Unless someone comes in with real use cases that need other kinds of events, it sounds to me like the one scenario that vmgenid covers is what system level user space cares about. If in a few years we realize that we need 3 different types of events, we can start looking at ways to funnel those in a more abstract way. Until then, because we don't know what these events will be, we can't even design an API that would address them.

Keep in mind that we're not really talking here about building a generic API for any random user space application. We only want to give system software the ability to reason about system events. IMHO any more abstract layer to funnel multiple different events of this kind to downstream user space (if we ever care) would be a user space problem to solve, for example with a dbus event.


Right now, this is mostly unexplored. The virtio-rng avenue was the largest
step in terms of exploring this problem space, but there are obviously a
few directions to go, depending on what your primary concern is.

But all of that makes me think that exposing the particulars of this
virtual hardware driver to userspace is not the best option, or at least
not an option to rush into (or to trick Greg into), and will both limit


I'm pretty sure I never tricked Greg into anything :)


what we can do with it later, and potentially burden userspace with
having to check multiple different things with confusing interactions
down the road. So I think it's worth stepping back a bit and thinking


This interface here is effectively only available to udev/systemd-type software. Any abstraction above that should be on them. And if we eventually decide that we need a better interface for generic user space, we can still build it.


about what we actually want from this and what those semantics should
be.

I'd also love to hear from the QEMU guys on this and get their input. To
that end, I've added qemu and virtio mailing lists, as well as mst.

Also, I'd be interested to learn specifically what you (Amazon) want
this for and what the larger picture there is. I get the systemd case,
but I'm under the assumption you've got a different project in your
woods.


The purpose for Amazon here is to accelerate serverless compute VMs [1].

We want to snapshot a VM post-init, before it receives any operation. Then we resume it, initiate logic for it to resanitize itself, and serve the request. The reason we want this particular vmgenid interface is so that we can create a notion of "resanitization" in user space at all. Once we have the event, systemd can start establishing service actions based on it, which will lead the user space ecosystem to grow interfaces to say "sanitize yourself", which we can then also invoke in VM post-init - probably without systemd :).

We built such event logic for Java today [2], but we would like to expand beyond that. And that will become an unmaintainable mess without viable ecosystem support, so we may as well enable "normal" VM clones with the same logic. Given pretty much all hypervisors (including QEMU) out there already implement vmgenid, it seems to be the de facto standard to do exactly this notification.
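
For illustration only, here is a minimal Python sketch of the kind of system-level consumer discussed above: it reads kernel uevents from the netlink socket and runs "resanitize" hooks when a VM clone is reported. The `NEW_VMGENID=1` variable is an assumption about what the driver patch emits, and `parse_uevent`, `is_vm_clone` and `listen` are hypothetical names. A real deployment would more likely go through udev rules or systemd than a hand-rolled listener.

```python
import socket

# Netlink protocol number for kernel uevents, from <linux/netlink.h>.
NETLINK_KOBJECT_UEVENT = 15
# ASSUMPTION: the environment variable the vmgenid uevent carries.
CLONE_KEY = "NEW_VMGENID"


def parse_uevent(payload: bytes) -> dict:
    """Split a kernel uevent datagram into its header and KEY=VALUE fields."""
    parts = [p for p in payload.split(b"\0") if p]
    if not parts:
        return {}
    # The first field is "action@devpath"; the rest are KEY=VALUE pairs.
    event = {"_header": parts[0].decode(errors="replace")}
    for raw in parts[1:]:
        key, sep, value = raw.decode(errors="replace").partition("=")
        if sep:
            event[key] = value
    return event


def is_vm_clone(event: dict) -> bool:
    """True if the event looks like a vmgenid clone notification."""
    return event.get("ACTION") == "change" and event.get(CLONE_KEY) == "1"


def listen(hooks):
    """Block on kernel uevents and call every hook on a clone (Linux only)."""
    with socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                       NETLINK_KOBJECT_UEVENT) as sock:
        sock.bind((0, 1))  # port id 0: kernel picks one; group 1: kernel uevents
        while True:
            event = parse_uevent(sock.recv(8192))
            if is_vm_clone(event):
                for hook in hooks:
                    hook(event)
```

A caller would pass its own callbacks, for example `listen([regenerate_mac_addresses])`, where the callback name is again hypothetical.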


Alex

[1] https://docs.aws.amazon.com/lambda/latest/dg/snapstart.html
[2] https://docs.aws.amazon.com/lambda/latest/dg/snapstart-runtime-hooks.html



