Re: [PATCH RFC v2 00/15] Add virtualization support for EGM

From: Ankit Agrawal

Date: Wed Mar 11 2026 - 02:47:57 EST


Thanks Alex for the review.

>> The patch series introduces a new nvgrace-egm auxiliary driver module
>> to manage and map the HI/EGM region on Grace Blackwell systems.
>> It binds to the auxiliary device created by the parent
>> nvgrace-gpu (in-tree module for device assignment) / nvidia-vgpu-vfio
>> (out-of-tree open source module for SRIOV vGPU) to manage the
>> EGM region for the VM. Note that there is a unique EGM region per
>> socket and an auxiliary device gets created for every region. The
>> parent module fetches the EGM region information from the ACPI
>> tables and populates it into the data structures shared with the
>> auxiliary nvgrace-egm module.
>>
>> nvgrace-egm module handles the following:
>> 1. Fetch the EGM memory properties (base HPA, length, proximity domain)
>> from the parent device shared EGM region structure.
>> 2. Create a char device that can be used as memory-backend-file by Qemu
>> for the VM and implement file operations. The char device is /dev/egmX,
>> where X is the PXM node ID of the EGM region being mapped, fetched in 1.
>> 3. Zero the EGM memory on first device open().
>> 4. Map the QEMU VMA to the EGM region using remap_pfn_range.
>> 5. Clean up state and destroy the chardev on device unbind.
>> 6. Handle presence of retired poisoned pages on the EGM region.
>>
>> Since nvgrace-egm is an auxiliary module to the nvgrace-gpu, it is kept
>> in the same directory.
>
>
> Pondering this series for a bit, is this auxiliary chardev approach
> really the model we should be pursuing?
>
> I know we're trying to disassociate the EGM region from the GPU, and
> de-duplicate it between GPUs on the same socket, but is there actually a
> use case of the EGM chardev separate from the GPU?

It is not just de-duplication. The EGM is a carveout of system memory,
logically and physically separate and disconnected from the GPU. What is
unique here is that the information (SPA, size) of the region is present
in the GPU's ACPI tables.
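For reference, the consumer-facing side of the chardev (items 2 and 4 in
the cover letter above) amounts to an ordinary open + mmap from
userspace. A minimal sketch follows; the /dev/egm0 path and the mapping
length are illustrative placeholders, not values defined by this series:

```c
/*
 * Userspace sketch of mapping an EGM chardev, roughly as QEMU's
 * memory-backend-file would.  The device path and length passed by the
 * caller are illustrative placeholders.
 */
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map @len bytes of @path shared and read/write; return the VA or NULL. */
static void *map_egm(const char *path, size_t len)
{
	void *va;
	int fd = open(path, O_RDWR);

	if (fd < 0)
		return NULL;

	va = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);	/* The mapping outlives the fd. */

	return va == MAP_FAILED ? NULL : va;
}
```

QEMU would hand such a mapping to the guest as the backing for a NUMA
node; the zeroing on first open() (item 3) guarantees the guest never
sees stale contents.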

>
> The independent lifecycle of this aux device is troubling and it hasn't
> been confirmed whether or not access to the EGM region has some
> dependency on the state of the GPU.

The EGM region is independent of the state of the GPU. One could
plausibly boot the VM with just the EGM memory chardev as the backend
file and no GPU.
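To illustrate, such an EGM-only boot would use the chardev as an
ordinary file-backed memory backend. The fragment below is a sketch
only; the node ID, size, and machine options are hypothetical and not
taken from the series:

```shell
# Hypothetical fragment: back guest NUMA node 0 with the socket-0 EGM
# region via its chardev.  Size and ids are illustrative.
qemu-system-aarch64 \
    -object memory-backend-file,id=egm0,mem-path=/dev/egm0,size=64G,share=on \
    -numa node,nodeid=0,memdev=egm0 \
    ...
```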

> nvgrace-gpu is manipulating sysfs
> on devices owned by nvgrace-egm, we don't have mechanisms to manage the
> aux device relative to the state of the GPU, we're trying to add a
> driver that can bind to device created by an out-of-tree driver, and
> we're inventing new uAPIs on the chardev for things that already exist
> for vfio regions.

Sorry for the confusion. nvgrace-egm would not bind to the device
created by the out-of-tree driver. We would have a separate out-of-tree
equivalent of nvgrace-egm to bind to the device created by the
out-of-tree vfio driver. Maybe we can consider exposing register /
unregister APIs from nvgrace-egm, where a module (in-tree nvgrace-gpu or
the out-of-tree driver) can register a pdev that nvgrace-egm can use to
fetch the region info.
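To make the idea concrete, here is a userspace mock of the shape such an
interface could take. Every name and field below is a hypothetical
sketch of the suggestion above, not an API from the series; a real
implementation would live in nvgrace-egm and key off a struct pci_dev
rather than a bare PXM-indexed table:

```c
/*
 * Userspace mock of a possible nvgrace-egm register/unregister
 * interface.  All identifiers here are hypothetical illustrations.
 */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct egm_region {		/* What the parent driver would supply. */
	uint64_t base_hpa;	/* Start of the EGM carveout. */
	uint64_t length;	/* Carveout size in bytes. */
	int pxm;		/* Proximity domain; names /dev/egm<pxm>. */
};

#define EGM_MAX_SOCKETS 8	/* Illustrative bound: one region per socket. */

static struct egm_region *regions[EGM_MAX_SOCKETS];

/* Parent module (in-tree or out-of-tree) hands over its region info. */
static int egm_register_region(struct egm_region *r)
{
	if (r->pxm < 0 || r->pxm >= EGM_MAX_SOCKETS || regions[r->pxm])
		return -1;	/* invalid pxm or duplicate registration */
	regions[r->pxm] = r;
	return 0;
}

/* nvgrace-egm would look the region up when creating /dev/egm<pxm>. */
static struct egm_region *egm_lookup_region(int pxm)
{
	return (pxm >= 0 && pxm < EGM_MAX_SOCKETS) ? regions[pxm] : NULL;
}

static void egm_unregister_region(int pxm)
{
	if (pxm >= 0 && pxm < EGM_MAX_SOCKETS)
		regions[pxm] = NULL;
}
```

The point of the shape is that either parent (in-tree or out-of-tree)
can feed region info through one interface, without nvgrace-egm carrying
an ID table that references the out-of-tree module.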

> Therefore, does it actually make more sense to expose EGM as a device
> specific region on the vfio device fd?
>
> For example, nvgrace-gpu might manage the de-duplication by only
> exposing this device specific region on the lowest BDF GPU per socket.
> The existing REGION_INFO ioctl handles reporting the size to the user.
> The direct association to the GPU device handles reporting the node
> locality.  If necessary, a capability on the region could report the
> associated PXM, and maybe even the retired page list.
>
> All of the lifecycle issues are automatically handled, there's no
> separate aux device.  If necessary, zapping and faulting across reset
> is handled just like a BAR mapping.

The EGM memory (which becomes the system memory of the VM) cannot be
tied to GPU reset, as it is unrelated to the GPU device. We would not
want that to happen to system memory on a GPU reset.

> If we need to expose the EGM size and GPU association via sysfs for
> management tooling, nvgrace-gpu could add an "egm_size" attribute to the
> PCI device's sysfs node.  This could also avoid the implicit
> implementation knowledge about which GPU exposes the EGM device
> specific region.
>
> Was such a design considered?  It seems much, much simpler and could be
> implemented by either nvgrace-gpu or identically by an out-of-tree
> driver without references in an in-kernel ID table.
>
> I'd like to understand the pros and cons of such an approach vs the one
> presented here.  Thanks,

We didn't consider a separate BAR / region because the EGM memory (part
of the host system memory) is unrelated to the GPU device beyond having
its information in the GPU ACPI table, and it becomes the system memory
of the VM. Exposing it as part of the device BAR / region would tie the
lifecycle of the EGM region to the GPU device. Also, we cannot consider
zapping/faulting across GPU reset, as it is the system memory of the VM.

Thanks
Ankit Agrawal