Re: [PATCH RFC v2 14/15] vfio/nvgrace-gpu: Add link from pci to EGM

From: Alex Williamson

Date: Wed Mar 04 2026 - 18:38:22 EST


On Mon, 23 Feb 2026 15:55:13 +0000
<ankita@xxxxxxxxxx> wrote:

> From: Ankit Agrawal <ankita@xxxxxxxxxx>
>
> To replicate the host EGM topology in the VM in terms of
> GPU affinity, userspace needs to be aware of which GPUs
> belong to the same socket as the EGM region.
>
> Expose the list of GPUs associated with an EGM region
> through sysfs. The list can be queried from the auxiliary
> device path.
>
> On a 2-socket, 4 GPU Grace Blackwell setup, the GPUs show
> up at /sys/class/egm/egmX.
>
> E.g. ls /sys/class/egm/egm4/

If we end up with a sysfs representation of the EGM device, why did we
go to the trouble of naming them based on their PXM?

Shouldn't we just have a node association in sysfs rather than the GPUs?
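
A node attribute on the chardev would be a one-liner per device.
Roughly like this, purely as a sketch (the egm_dev field name and
drvdata wiring are assumptions, not the actual driver structures):

```c
/* Hypothetical sketch: expose the EGM region's NUMA node on the
 * egm chardev, so userspace can learn the socket association
 * directly instead of walking GPU symlinks. */
static ssize_t numa_node_show(struct device *dev,
			      struct device_attribute *attr, char *buf)
{
	struct nvgrace_egm_dev *egm_dev = dev_get_drvdata(dev);

	return sysfs_emit(buf, "%d\n", egm_dev->numa_node);
}
static DEVICE_ATTR_RO(numa_node);
```

Userspace could then match /sys/class/egm/egmX/numa_node against
each GPU's existing numa_node attribute.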

AIUI, the PXM value doesn't necessarily align to the kernel's node
index anyway, so what is the value of exposing the PXM? If the node
association is learned through sysfs, we could just use an ida for
assigning minors and avoid the address space problem of PXM values
aligning to reserved minor numbers.
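
The ida approach would look something like the below (names are
made up; ida_alloc_range()/ida_free() are the stock IDA API):

```c
/* Sketch: dense minor allocation with an ida, independent of
 * firmware PXM numbering. */
static DEFINE_IDA(egm_minor_ida);

static int egm_alloc_minor(void)
{
	/* smallest free minor in [0, MINORMASK] */
	return ida_alloc_range(&egm_minor_ida, 0, MINORMASK, GFP_KERNEL);
}

static void egm_free_minor(int minor)
{
	ida_free(&egm_minor_ida, minor);
}
```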

> 0008:01:00.0  0009:01:00.0  dev  device  egm_size  power  subsystem  uevent
>
> Suggested-by: Matthew R. Ochs <mochs@xxxxxxxxxx>
> Signed-off-by: Ankit Agrawal <ankita@xxxxxxxxxx>
> ---
> drivers/vfio/pci/nvgrace-gpu/egm_dev.c | 47 +++++++++++++++++++++++++-
> 1 file changed, 46 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> index 6d716c3a3257..3bdd5bb41e1b 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/egm_dev.c
> @@ -56,6 +56,50 @@ int nvgrace_gpu_fetch_egm_property(struct pci_dev *pdev, u64 *pegmphys,
> 	return ret;
> }
>
> +static struct device *egm_find_chardev(struct nvgrace_egm_dev *egm_dev)
> +{
> +	char name[32] = { 0 };
> +
> +	scnprintf(name, sizeof(name), "egm%lld", egm_dev->egmpxm);

%llu

> +	return device_find_child_by_name(&egm_dev->aux_dev.dev, name);
> +}
> +
> +static int nvgrace_egm_create_gpu_links(struct nvgrace_egm_dev *egm_dev,
> +					struct pci_dev *pdev)
> +{
> +	struct device *chardev_dev = egm_find_chardev(egm_dev);
> +	int ret;
> +
> +	if (!chardev_dev)
> +		return 0;
> +
> +	ret = sysfs_create_link(&chardev_dev->kobj,
> +				&pdev->dev.kobj,
> +				dev_name(&pdev->dev));
> +
> +	put_device(chardev_dev);
> +
> +	if (ret && ret != -EEXIST)
> +		return ret;
> +
> +	return 0;
> +}
> +
> +static void remove_egm_symlinks(struct nvgrace_egm_dev *egm_dev,
> +				struct pci_dev *pdev)
> +{
> +	struct device *chardev_dev;
> +
> +	chardev_dev = egm_find_chardev(egm_dev);
> +	if (!chardev_dev)
> +		return;
> +
> +	sysfs_remove_link(&chardev_dev->kobj,
> +			  dev_name(&pdev->dev));
> +
> +	put_device(chardev_dev);
> +}
> +
> int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> {
> 	struct gpu_node *node;
> @@ -68,7 +112,7 @@ int add_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
>
> 	list_add_tail(&node->list, &egm_dev->gpus);
>
> -	return 0;
> +	return nvgrace_egm_create_gpu_links(egm_dev, pdev);
> }
>
> void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
> @@ -77,6 +121,7 @@ void remove_gpu(struct nvgrace_egm_dev *egm_dev, struct pci_dev *pdev)
>
> 	list_for_each_entry_safe(node, tmp, &egm_dev->gpus, list) {
> 		if (node->pdev == pdev) {
> +			remove_egm_symlinks(egm_dev, pdev);
> 			list_del(&node->list);
> 			kfree(node);
> 		}

This is really broken layering for nvgrace-gpu to be adding sysfs
attributes to the chardev devices. Thanks,

Alex