Re: [PATCH RFC v2 00/15] Add virtualization support for EGM

From: Alex Williamson

Date: Thu Mar 12 2026 - 11:01:33 EST


On Thu, 12 Mar 2026 13:51:20 +0000
Ankit Agrawal <ankita@xxxxxxxxxx> wrote:

> >> > nvgrace-gpu is manipulating sysfs
> >> > on devices owned by nvgrace-egm, we don't have mechanisms to manage the
> >> > aux device relative to the state of the GPU, we're trying to add a
> >> > driver that can bind to a device created by an out-of-tree driver, and
> >> > we're inventing new uAPIs on the chardev for things that already exist
> >> > for vfio regions.
> >>
> >> Sorry for the confusion. The nvgrace-egm would not bind to the device
> >> created by the out-of-tree driver. We would have a separate out-of-tree
> >> equivalent of nvgrace-egm to bind to the device created by the out-of-tree
> >> vfio driver. Maybe we can consider exposing register/unregister APIs from
> >> nvgrace-egm with which a module (in-tree nvgrace / out-of-tree) can
> >> register a pdev that nvgrace-egm can then use to fetch the region info.
> >
> > Ok, this wasn't clear to me, but does that also mean that if some GPUs
> > are managed by nvgrace-gpu and others by out-of-tree drivers that the
> > in-kernel and out-of-tree equivalent drivers are both installing
> > chardevs as /dev/egmXX?  Playing in the same space is ugly, but what
> > happens when the 2 GPUs per socket are split between drivers and they
> > both try to add the same chardev?
>
> But that would be an unsupported configuration. It is expected that all the
> GPUs on the system and the EGM char devices are attached to the same
> VM for full functionality. So either all the devices (GPU and EGM chardev)
> would be bound to nvgrace or all to the out-of-tree module. Please refer to
> sec 8.1 of
> https://docs.nvidia.com/multi-node-nvlink-systems/partition-guide-v1-2.pdf
> Perhaps I should add this information to the commit message.

Just because it can be documented as a policy doesn't make it an
agreeable architecture.

> > However, I'd then ask the question why we're associating EGM to the GPU
> > PCI driver at all.  For instance, why should nvgrace-gpu spawn aux
> > devices to feed into an nvgrace-egm driver, and duplicate that whole
> > thing in an out-of-tree driver, when we could just have one in-kernel
> > platform(?) driver walk ACPI, find these ranges, and expose them as
> > chardev entirely independent of the PCI driver bound to the GPU?
>
> So a new platform driver that walks the ACPI tables looking for EGM
> properties and creates the EGM char devs?
>
> Maybe it is okay, but given that all 4 EGM properties are under the GPU's
> ACPI node, with no independent ACPI _HID device identity, it sounds
> a bit off to me. Do we have a precedent like that?
>
> But as I mentioned above, the expectation is that the EGM devices and the GPU
> devices are assigned to the same VM. So would it not make sense to
> keep the association between the EGM devices and the GPU devices?

You're telling me that the EGM access is 100% independent of any state
related to the GPU, so why would we tie the lifecycle of these aux
devices to any particular driver for the GPU or re-implement it across
multiple drivers? That doesn't make sense to me. Thanks,

Alex