Re: [PATCH v2 3/3] vfio/nvgrace-gpu: Check the HBM training and C2C link status

From: Alex Williamson
Date: Thu Jan 09 2025 - 15:21:11 EST


On Sun, 5 Jan 2025 17:36:15 +0000
<ankita@xxxxxxxxxx> wrote:

> From: Ankit Agrawal <ankita@xxxxxxxxxx>
>
> In contrast to Grace Hopper systems, the HBM training has been moved
> out of the UEFI on the Grace Blackwell systems. This reduces the system
> bootup time significantly.
>
> The onus of checking whether the HBM training has completed thus falls
> on the module.
>
> The HBM training status can be determined from a BAR0 register.
> Similarly, another BAR0 register exposes the status of the CPU-GPU
> chip-to-chip (C2C) cache coherent interconnect.
>
> Based on testing, 30s is determined to be sufficient to ensure
> initialization completion on all the Grace based systems. Thus poll
> these registers for up to 30s. If the HBM training is not complete
> or if the C2C link is not ready, fail the probe.
>
> While the wait is not required on Grace Hopper systems, it is
> beneficial to make the check to ensure the device is in an
> expected state. Hence keep the check common to both generations.
>
> Signed-off-by: Ankit Agrawal <ankita@xxxxxxxxxx>
> ---
> drivers/vfio/pci/nvgrace-gpu/main.c | 53 +++++++++++++++++++++++++++++
> 1 file changed, 53 insertions(+)
>
> diff --git a/drivers/vfio/pci/nvgrace-gpu/main.c b/drivers/vfio/pci/nvgrace-gpu/main.c
> index 44a276c886e1..cf020496743e 100644
> --- a/drivers/vfio/pci/nvgrace-gpu/main.c
> +++ b/drivers/vfio/pci/nvgrace-gpu/main.c
> @@ -5,6 +5,7 @@
>
> #include <linux/sizes.h>
> #include <linux/vfio_pci_core.h>
> +#include <linux/delay.h>
>
> /*
> * The device memory usable to the workloads running in the VM is cached
> @@ -28,6 +29,13 @@
>
> #define GPU_CAP_DVSEC_REGISTER 3
>
> +#define C2C_LINK_BAR0_OFFSET 0x1498
> +#define HBM_TRAINING_BAR0_OFFSET 0x200BC
> +#define STATUS_READY 0xFF
> +
> +#define POLL_QUANTUM_MS 1000
> +#define POLL_TIMEOUT_MS (30 * 1000)
> +
> /*
> * The state of the two device memory region - resmem and usemem - is
> * saved as struct mem_region.
> @@ -848,6 +856,47 @@ static bool nvgrace_gpu_has_mig_hw_bug_fix(struct pci_dev *pdev)
> return false;
> }
>
> +/*
> + * To reduce the system bootup time, the HBM training has
> + * been moved out of the UEFI on the Grace-Blackwell systems.
> + *
> + * The onus of checking whether the HBM training has completed
> + * thus falls on the module. The HBM training status can be
> + * determined from a BAR0 register.
> + *
> + * Similarly, another BAR0 register exposes the status of the
> + * CPU-GPU chip-to-chip (C2C) cache coherent interconnect.
> + *
> + * Poll these registers for up to 30s. If the HBM training is
> + * not complete or if the C2C link is not ready, fail the probe.
> + *
> + * While the wait is not required on Grace Hopper systems, it
> + * is beneficial to make the check to ensure the device is in an
> + * expected state.
> + */
> +static int nvgrace_gpu_check_device_status(struct pci_dev *pdev)

"nvgrace_gpu_wait_device_ready()"?

> +{
> + void __iomem *io;
> + int time_elasped;
> +
> + io = pci_iomap(pdev, 0, ~0UL);

The documentation is unclear, but existing code suggests passing 0
rather than ~0UL as maxlen to map the full BAR. The two end up
equivalent since the code doesn't error when asked to map more than the
BAR, but there's no reason to add a bad example.
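
As I read pci_iomap_range(), the length is only clamped when maxlen is
non-zero and smaller than what the BAR provides, which is why both
spellings map the whole BAR. A minimal userspace model of that clamp,
where model_map_len() is a made-up name for illustration and not a
kernel function:

```c
/*
 * Hypothetical userspace model of the length selection done when
 * mapping a BAR: a maxlen of 0 means "no limit", and a maxlen larger
 * than the BAR is silently ignored rather than treated as an error.
 */
static unsigned long model_map_len(unsigned long bar_len, unsigned long maxlen)
{
	unsigned long len = bar_len;

	/* Clamp only when a non-zero limit is smaller than the BAR. */
	if (maxlen && len > maxlen)
		len = maxlen;

	return len;
}
```

With a 16MB BAR, both model_map_len(SZ_16M, 0) and
model_map_len(SZ_16M, ~0UL) would give the full 16MB, so 0 is simply
the more conventional way to ask for it.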

> + if (!io)
> + return -ENOMEM;
> +
> + for (time_elasped = 0; time_elasped < POLL_TIMEOUT_MS;
> + time_elasped += POLL_QUANTUM_MS) {
> + if ((ioread32(io + C2C_LINK_BAR0_OFFSET) == STATUS_READY) &&
> + (ioread32(io + HBM_TRAINING_BAR0_OFFSET) == STATUS_READY)) {
> + pci_iounmap(pdev, io);
> + return 0;
> + }
> + msleep(POLL_QUANTUM_MS);
> + }

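Computing a jiffies deadline up front and testing it with time_after()
would simplify things here and drop the hand-rolled elapsed-time
counter. I'd also suggest a common exit path so that pci_iounmap() is
called in one place.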
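
Roughly the shape I have in mind, as an untested userspace model rather
than driver code. Everything prefixed fake_ is a stand-in I made up for
jiffies, ioread32(), msleep() and pci_iounmap(), one tick is assumed to
be 1ms so no msecs_to_jiffies() conversion appears, and time_after() is
redefined locally in its simplest form:

```c
#include <errno.h>
#include <stdint.h>

#define STATUS_READY	0xFF
#define POLL_QUANTUM_MS	1000
#define POLL_TIMEOUT_MS	(30 * 1000)

/* Wrap-safe "a is after b", simplified from the kernel macro. */
#define time_after(a, b)	((long)((b) - (a)) < 0)

/* Hypothetical stand-ins for kernel state; 1 tick == 1 ms here. */
static unsigned long fake_jiffies;
static unsigned long c2c_ready_at;	/* tick when the C2C link reports ready */
static unsigned long hbm_ready_at;	/* tick when HBM training reports done */
static int unmap_calls;			/* counts fake pci_iounmap() calls */

static uint32_t fake_read_c2c(void)
{
	return fake_jiffies >= c2c_ready_at ? STATUS_READY : 0;
}

static uint32_t fake_read_hbm(void)
{
	return fake_jiffies >= hbm_ready_at ? STATUS_READY : 0;
}

static void fake_msleep(unsigned int ms)
{
	fake_jiffies += ms;
}

static void fake_iounmap(void)
{
	unmap_calls++;
}

static int wait_device_ready(void)
{
	unsigned long timeout = fake_jiffies + POLL_TIMEOUT_MS;
	int ret = -ETIME;

	do {
		if (fake_read_c2c() == STATUS_READY &&
		    fake_read_hbm() == STATUS_READY) {
			ret = 0;
			goto out;
		}
		fake_msleep(POLL_QUANTUM_MS);
	} while (!time_after(fake_jiffies, timeout));

out:
	/* Single exit path: the mapping is released exactly once. */
	fake_iounmap();
	return ret;
}
```

The deadline is computed once, the loop body only reads and sleeps, and
both the success and timeout cases fall through the same unmap.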

> +
> + pci_iounmap(pdev, io);
> + return -ENODEV;

-ETIME could work for the error code too, since this is a timeout
rather than a missing device. Thanks,

Alex

> +}
> +
> static int nvgrace_gpu_probe(struct pci_dev *pdev,
> const struct pci_device_id *id)
> {
> @@ -856,6 +905,10 @@ static int nvgrace_gpu_probe(struct pci_dev *pdev,
> u64 memphys, memlength;
> int ret;
>
> + ret = nvgrace_gpu_check_device_status(pdev);
> + if (ret)
> + return ret;
> +
> ret = nvgrace_gpu_fetch_memory_property(pdev, &memphys, &memlength);
> if (!ret)
> ops = &nvgrace_gpu_pci_ops;