Re: [PATCH v6 1/5] x86/amd_nb: Add support for northbridges on Aldebaran
From: Borislav Petkov
Date: Tue Nov 02 2021 - 14:03:47 EST
On Thu, Oct 28, 2021 at 06:31:02PM +0530, Naveen Krishna Chatradhi wrote:
Staring at this more...
> +/*
> + * Newer AMD CPUs and GPUs whose data fabrics can be connected via custom xGMI
> + * links, comes with registers to gather local and remote node type map info.
> + *
> + * "Local Node Type" refers to nodes with the same type as that from which the
> + * register is read, and "Remote Node Type" refers to nodes with a different type.
> + *
> + * This function, reads the registers from GPU DF function 1.
> + * Hence, local nodes are GPU and remote nodes are CPUs.
> + */
> +static int amd_get_node_map(void)
... so this is a generic function name...
> +{
> + struct amd_node_map *nodemap;
> + struct pci_dev *pdev;
> + u32 tmp;
> +
> + pdev = pci_get_device(PCI_VENDOR_ID_AMD,
> + PCI_DEVICE_ID_AMD_ALDEBARAN_DF_F1, NULL);
... but this here is trying to get the Aldebaran PCI device function.
So what happens if in the future, the GPU is a different one and it
gets RAS functionality and PCI device functions too? You'd probably need
to add that new GPU support too.
And then looking at that patch again, see how this new code is bolted on
and sure, it all is made to work, but it is strenuous and you have to
always pay attention to what type of devices you're dealing with.
And the next patch does:
... if (bank_type == SMCA_UMC_V2) {
/* do UMC v2 special stuff here. */
which begs the question: wouldn't this GPU PCI devices enumeration be a
lot cleaner if it were separate?
I.e., in amd_nb.c you'd have
init_amd_nbs:
amd_cache_northbridges();
amd_cache_gart();
amd_cache_gpu_devices();
and in this last one you do your enumeration. Completely separate data
structures and all. Adding a new device support would then be trivial.
And then looking at the next patch again, you have:
+ } else if (bank_type == SMCA_UMC_V2) {
+ /*
+ * SMCA_UMC_V2 exists on GPU nodes, extract the node id
+ * from register MCA_IPID[47:44](InstanceIdHi).
+ * The InstanceIdHi field represents the instance ID of the GPU.
+ * Which needs to be mapped to a value used by Linux,
+ * where GPU nodes are simply numerically after the CPU nodes.
+ */
+ node_id = ((m->ipid >> 44) & 0xF) -
+ amd_gpu_node_start_id() + amd_cpu_node_count();
where instead of exporting those functions and having the caller do the
calculations, you'd have a function in amd_nb.c which is called
amd_get_gpu_node_id(unsigned long ipid)
which will use those separate data structures mentioned above and give
you the node id.
And those GPU node IDs are placed numerically after the CPU nodes so
your code doesn't need to do anything special - just read out registers
and cache them.
And you don't need those exports either - it is all nicely encapsulated
and a single function is used to get the callers what they wanna know.
Hmmm?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette