Re: [PATCH v3] drivers/virt: vmgenid: add vm generation id driver

From: Alexander Graf
Date: Mon Dec 07 2020 - 08:12:21 EST




On 27.11.20 19:26, Catangiu, Adrian Costin wrote:
- Background

The VM Generation ID is a feature defined by Microsoft (paper:
http://go.microsoft.com/fwlink/?LinkId=260709) and supported by
multiple hypervisor vendors.

The feature is required in virtualized environments by apps that work
with local copies/caches of world-unique data such as random values,
uuids, monotonically increasing counters, etc.
Such apps can be negatively affected by VM snapshotting when the VM
is either cloned or returned to an earlier point in time.

The VM Generation ID is a simple concept meant to alleviate the issue
by providing a unique ID that changes each time the VM is restored
from a snapshot. The hw provided UUID value can be used to
differentiate between VMs or different generations of the same VM.

- Problem

The VM Generation ID is exposed through an ACPI device by multiple
hypervisor vendors, but neither the vendors nor upstream Linux provide
a default driver for it, leaving users to fend for themselves.

Furthermore, simply finding out about a VM generation change is only
the starting point of a process to renew internal states of possibly
multiple applications across the system. This process could benefit
from a driver that provides an interface through which orchestration
can be easily done.

- Solution

This patch adds a driver that exposes a monotonically incrementing
Virtual Machine Generation u32 counter via a char-dev FS interface. The
FS interface provides sync and async VmGen counter update notifications.
It also provides VmGen counter retrieval and confirmation mechanisms.

The generation counter and the interface through which it is exposed
are available even when there is no acpi device present.

When the device is present, the hw provided UUID is not exposed to
userspace; the driver uses it internally to keep accounting for
the exposed VmGen counter. The counter starts from zero when the
driver is initialized and monotonically increments every time the hw
UUID changes (the VM generation changes).
On each hw UUID change, the new hypervisor-provided UUID is also fed
to the kernel RNG.

If there is no acpi vmgenid device present, the generation changes are
not driven by hw vmgenid events but can be driven by software through
a dedicated driver ioctl.

This patch builds on top of Or Idgar <oridgar@xxxxxxxxx>'s proposal
https://lkml.org/lkml/2018/3/1/498

- Future improvements

Ideally we would want the driver to register itself based on devices'
_CID and not _HID, but unfortunately I couldn't find a way to do that.
The problem is that ACPI device matching is done by
'__acpi_match_device()' which exclusively looks at
'acpi_hardware_id *hwid'.

There is a path for platform devices to match on _CID when _HID is
'PRP0001' - but this is not the case for the Qemu vmgenid device.

Guidance and help here would be greatly appreciated.

Signed-off-by: Adrian Catangiu <acatan@xxxxxxxxxx>

---

v1 -> v2:

  - expose to userspace a monotonically increasing u32 Vm Gen Counter
    instead of the hw VmGen UUID
  - since the hw/hypervisor-provided 128-bit UUID is not public
    anymore, add it to the kernel RNG as device randomness
  - insert driver page containing Vm Gen Counter in the user vma in
    the driver's mmap handler instead of using a fault handler
  - turn driver into a misc device driver to auto-create /dev/vmgenid
  - change ioctl arg to avoid leaking kernel structs to userspace
  - update documentation
  - various nits
  - rebase on top of linus latest

v2 -> v3:

  - separate the core driver logic and interface, from the ACPI device.
    The ACPI vmgenid device is now one possible backend.
  - fix issue when timeout=0 in VMGENID_WAIT_WATCHERS
  - add locking to avoid races between fs ops handlers and hw irq
    driven generation updates
  - change VMGENID_WAIT_WATCHERS ioctl so if the current caller is
    outdated or a generation change happens while waiting (thus making
    current caller outdated), the ioctl returns -EINTR to signal the
    user to handle event and retry. Fixes blocking on oneself.
  - add VMGENID_FORCE_GEN_UPDATE ioctl conditioned by
    CAP_CHECKPOINT_RESTORE capability, through which software can force
    generation bump.
---
 Documentation/virt/vmgenid.rst | 240 +++++++++++++++++++++++
 drivers/virt/Kconfig           |  17 ++
 drivers/virt/Makefile          |   1 +
 drivers/virt/vmgenid.c         | 435 +++++++++++++++++++++++++++++++++++++++++
 include/uapi/linux/vmgenid.h   |  14 ++
 5 files changed, 707 insertions(+)
 create mode 100644 Documentation/virt/vmgenid.rst
 create mode 100644 drivers/virt/vmgenid.c
 create mode 100644 include/uapi/linux/vmgenid.h


[...]

diff --git a/drivers/virt/Kconfig b/drivers/virt/Kconfig
index 80c5f9c1..5d5f37b 100644
--- a/drivers/virt/Kconfig
+++ b/drivers/virt/Kconfig
@@ -13,6 +13,23 @@ menuconfig VIRT_DRIVERS
 if VIRT_DRIVERS
+config VMGENID
+    tristate "Virtual Machine Generation ID driver"
+    depends on ACPI

I think you want to split the Kconfig bit into two now: one option for generic /dev/vmgenid support and another one, ACPI_VMGENID, that automatically bumps the generation when the hypervisor signals a change.

In fact, you can probably make this two separate patches with two separate files (read: kernel modules) even. The generic code can just export symbols to bump the system genid.

I'm also not fully convinced that calling the generic mechanism "vmgenid" is still accurate at this point. Can you think of a better name? "System Generation ID", so "sysgenid" maybe?
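
To make the split concrete, here is a rough userspace model of what I have in mind. It is only a sketch: the names (sysgenid_bump_generation, sysgenid_read, acpi_vmgenid_notify, struct acpi_vmgenid_state) are made up for illustration and are not taken from the patch, and locking is left out entirely. The generic part owns the counter and offers a bump function (in the kernel this would be an exported symbol); the ACPI part only compares the hypervisor UUID and calls into the generic part when it changes.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* ---- generic part (hypothetical "sysgenid" core) ---- */

/* Generation counter; starts at zero, as the patch description says. */
static uint32_t sysgenid_counter;

/* In the kernel this would be exported so that backends can call it. */
uint32_t sysgenid_bump_generation(void)
{
	return ++sysgenid_counter;
}

/* Read the current generation. */
uint32_t sysgenid_read(void)
{
	return sysgenid_counter;
}

/* ---- ACPI backend part (hypothetical, separate module) ---- */

struct acpi_vmgenid_state {
	uint8_t uuid[16];	/* last UUID seen from the hypervisor */
};

/*
 * Model of the notification path: bump the generic counter only when
 * the hypervisor-provided UUID actually changed.
 * Returns true if a bump happened.
 */
bool acpi_vmgenid_notify(struct acpi_vmgenid_state *st,
			 const uint8_t new_uuid[16])
{
	if (memcmp(st->uuid, new_uuid, sizeof(st->uuid)) == 0)
		return false;

	memcpy(st->uuid, new_uuid, sizeof(st->uuid));
	sysgenid_bump_generation();
	return true;
}
```

With that layout the generic file has no ACPI dependency at all, and other backends (or the software-driven ioctl) can use the same bump entry point.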

+    default N
+    help
+      This is a Virtual Machine Generation ID driver which provides
+      a virtual machine generation counter. The driver exposes FS ops
+      on /dev/vmgenid through which it can provide information and
+      notifications on VM generation changes that happen on snapshots
+      or cloning.
+      This enables applications and libraries that store or cache
+      sensitive information, to know that they need to regenerate it
+      after process memory has been exposed to potential copying.
+
+      To compile this driver as a module, choose M here: the
+      module will be called vmgenid.
+
 config FSL_HV_MANAGER
     tristate "Freescale hypervisor management driver"
     depends on FSL_SOC

[...]

+    case VMGENID_FORCE_GEN_UPDATE:
+        if (!checkpoint_restore_ns_capable(current_user_ns()))
+            return -EACCES;
+        vmgenid_bump_generation();

I think this is racy and needs to be slightly different. Imagine the following:

- container is running with genid 5
- I take a snapshot of the container
- Target system has genid 4
- I resume the container
- I call the genid update (genid = 5)

Then the container still sees genid 5, so *maybe* it won't adapt to the new environment. This will depend on whether the container gets enough time to adjust to genid=4 before we bump it to 5.

How about we pass a "bump, but not to this value" argument to the ioctl? Then it would look like this:

- container is running with genid 5
- I take a snapshot of the container and its genid (5)
- Target system has genid 4
- I resume the container
- I call the genid update with avoid=5 (so we bump genid to 6)

Now all processes in the system will adapt to genid=6, including the resumed container.
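
As a minimal sketch of the "bump, but not to this value" semantics (a userspace model; the helper name genid_bump_avoiding and its signature are hypothetical, not part of the patch, and locking is omitted):

```c
#include <stdint.h>

/*
 * Bump the generation counter, skipping over the value the caller
 * wants to avoid (for example the genid captured together with a
 * container snapshot). Returns the new generation.
 *
 * Unsigned arithmetic wraps around, so this also behaves sensibly
 * at UINT32_MAX.
 */
uint32_t genid_bump_avoiding(uint32_t *genid, uint32_t avoid)
{
	(*genid)++;
	if (*genid == avoid)
		(*genid)++;
	return *genid;
}
```

In the scenario above the target system is at genid 4 and the snapshot carried genid 5, so the call moves the counter to 6. If the avoided value is not the next one, this is just a regular bump by one.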


Alex


