Re: [PATCH RFC v2 00/15] Add virtualization support for EGM
From: Alex Williamson
Date: Thu Mar 05 2026 - 12:42:44 EST
On Mon, 23 Feb 2026 15:54:59 +0000
<ankita@xxxxxxxxxx> wrote:
> From: Ankit Agrawal <ankita@xxxxxxxxxx>
>
> Background
> ----------
> Grace Hopper/Blackwell systems support the Extended GPU Memory (EGM)
> feature, which enables the GPU to access system memory allocations
> within and across nodes over a high-bandwidth path. This access path
> goes as: GPU <--> NVswitch <--> GPU <--> CPU. The GPU can utilize
> system memory located on the same socket, on a different socket, or
> even on a different node in a multi-node system [1]. This feature is
> being extended to virtualization.
>
>
> Design Details
> --------------
> When EGM is enabled in the virtualization stack, the host memory
> is partitioned into two parts: one partition for Host OS usage,
> called the Hypervisor region, and a second Hypervisor-Invisible (HI)
> region for the VM. Only the Hypervisor region is part of the host EFI
> map and is thus visible to the host OS on bootup. Since the entire VM
> sysmem is eligible for EGM allocations within the VM, the HI partition
> is interchangeably called the EGM region in this series. The base SPA
> and size of the HI/EGM region range are exposed through ACPI DSDT
> properties.
>
> Whilst the EGM region is accessible on the host, it is not added to
> the kernel. The HI region is assigned to a VM by mapping the QEMU VMA
> to the SPA using remap_pfn_range().
>
> The following figure shows the memory map in the virtualization
> environment.
>
> |---- Sysmem ----| |--- GPU mem ---| VM Memory
> | | | |
> |IPA <-> SPA map | |IPA <-> SPA map|
> | | | |
> |--- HI / EGM ---|-- Host Mem --| |--- GPU mem ---| Host Memory
>
> The patch series introduces a new nvgrace-egm auxiliary driver module
> to manage and map the HI/EGM region on Grace Blackwell systems.
> It binds to the auxiliary device created by the parent
> nvgrace-gpu (in-tree module for device assignment) / nvidia-vgpu-vfio
> (out-of-tree open source module for SRIOV vGPU) to manage the
> EGM region for the VM. Note that there is a unique EGM region per
> socket, and an auxiliary device is created for every region. The
> parent module fetches the EGM region information from the ACPI
> tables and populates the data structures shared with the auxiliary
> nvgrace-egm module.
>
> The nvgrace-egm module handles the following:
> 1. Fetch the EGM memory properties (base HPA, length, proximity domain)
> from the EGM region structure shared by the parent device.
> 2. Create a char device that QEMU can use as a memory-backend-file for
> the VM, and implement its file operations. The char device is /dev/egmX,
> where X is the PXM node ID of the EGM region fetched in step 1.
> 3. Zero the EGM memory on first device open().
> 4. Map the QEMU VMA to the EGM region using remap_pfn_range().
> 5. Clean up state and destroy the chardev on device unbind.
> 6. Handle the presence of retired poisoned pages in the EGM region.
>
> Since nvgrace-egm is an auxiliary module to nvgrace-gpu, it is kept
> in the same directory.
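[For readers unfamiliar with the mechanism described in the cover letter,
the chardev mmap path (steps 2 and 4 above) amounts to roughly the
following; the struct and function names here are an illustrative sketch,
not code from the series:

```c
/*
 * Hypothetical sketch of the nvgrace-egm chardev mmap path; the names
 * egm_region, egm_mmap, base_spa, etc. are illustrative, not taken
 * from the actual patches.
 */
struct egm_region {
	phys_addr_t base_spa;	/* base SPA of the HI/EGM carve-out */
	size_t size;		/* length of the region in bytes */
};

static int egm_mmap(struct file *file, struct vm_area_struct *vma)
{
	struct egm_region *egm = file->private_data;
	size_t len = vma->vm_end - vma->vm_start;

	if (len > egm->size)
		return -EINVAL;

	/* Map the QEMU VMA directly onto the EGM SPA range. */
	return remap_pfn_range(vma, vma->vm_start,
			       egm->base_spa >> PAGE_SHIFT,
			       len, vma->vm_page_prot);
}
```
]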
Pondering this series for a bit, is this auxiliary chardev approach
really the model we should be pursuing?
I know we're trying to disassociate the EGM region from the GPU, and
de-duplicate it between GPUs on the same socket, but is there actually a
use case of the EGM chardev separate from the GPU?
The independent lifecycle of this aux device is troubling and it hasn't
been confirmed whether or not access to the EGM region has some
dependency on the state of the GPU. nvgrace-gpu is manipulating sysfs
on devices owned by nvgrace-egm, we don't have mechanisms to manage the
aux device relative to the state of the GPU, we're trying to add a
driver that can bind to a device created by an out-of-tree driver, and
we're inventing new uAPIs on the chardev for things that already exist
for vfio regions.
Therefore, does it actually make more sense to expose EGM as a device
specific region on the vfio device fd?
For example, nvgrace-gpu might manage the de-duplication by only
exposing this device specific region on the lowest BDF GPU per socket.
The existing REGION_INFO ioctl handles reporting the size to the user.
The direct association to the GPU device handles reporting the node
locality. If necessary, a capability on the region could report the
associated PXM, and maybe even the retired page list.
All of the lifecycle issues are automatically handled, there's no
separate aux device. If necessary, zapping and faulting across reset
is handled just like a BAR mapping.
If we need to expose the EGM size and GPU association via sysfs for
management tooling, nvgrace-gpu could add an "egm_size" attribute to the
PCI device's sysfs node. This could also avoid the implicit
implementation knowledge about which GPU exposes the EGM device
specific region.
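[Such an attribute would be a stock sysfs pattern; a rough sketch, where
the attribute name, the nvgrace_gpu structure, and the drvdata lookup
are all illustrative rather than taken from the series:

```c
/*
 * Hypothetical egm_size attribute on the PCI device's sysfs node;
 * the nvgrace_gpu structure and drvdata lookup are illustrative.
 */
static ssize_t egm_size_show(struct device *dev,
			     struct device_attribute *attr, char *buf)
{
	struct nvgrace_gpu *gpu = dev_get_drvdata(dev);

	return sysfs_emit(buf, "%llu\n",
			  (unsigned long long)gpu->egm_size);
}
static DEVICE_ATTR_RO(egm_size);
```
]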
Was such a design considered? It seems much, much simpler and could be
implemented by either nvgrace-gpu or identically by an out-of-tree
driver without references in an in-kernel ID table.
I'd like to understand the pros and cons of such an approach vs the one
presented here. Thanks,
Alex