Re: [PATCH v6 1/5] x86/amd_nb: Add support for northbridges on Aldebaran

From: Chatradhi, Naveen Krishna
Date: Thu Nov 04 2021 - 09:21:31 EST

Next message: Ilya Leoshkevich: "[PATCH] perf: Use __BYTE_ORDER__"
Previous message: kernel test robot: "Re: [PATCH 1/2] drm/input_helper: Add new input-handling helper"
In reply to: Borislav Petkov: "Re: [PATCH v6 1/5] x86/amd_nb: Add support for northbridges on Aldebaran"
Next in thread: Chatradhi, Naveen Krishna: "Re: [PATCH v6 1/5] x86/amd_nb: Add support for northbridges on Aldebaran"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Boris,

On 11/2/2021 11:33 PM, Borislav Petkov wrote:

[CAUTION: External Email]

On Thu, Oct 28, 2021 at 06:31:02PM +0530, Naveen Krishna Chatradhi wrote:

Staring at this more...

Thanks for taking the time.

+/*
+ * Newer AMD CPUs and GPUs whose data fabrics can be connected via custom xGMI
+ * links, comes with registers to gather local and remote node type map info.
+ *
+ * "Local Node Type" refers to nodes with the same type as that from which the
+ * register is read, and "Remote Node Type" refers to nodes with a different type.
+ *
+ * This function, reads the registers from GPU DF function 1.
+ * Hence, local nodes are GPU and remote nodes are CPUs.
+ */
+static int amd_get_node_map(void)

... so this is a generic function name...

+{
+ struct amd_node_map *nodemap;
+ struct pci_dev *pdev;
+ u32 tmp;
+
+ pdev = pci_get_device(PCI_VENDOR_ID_AMD,
+ PCI_DEVICE_ID_AMD_ALDEBARAN_DF_F1, NULL);

... but this here is trying to get the Aldebaran PCI device function.

I know, this is confusion. we will try to give a meaning for definition here.

So what happens if in the future, the GPU is a different one and it
gets RAS functionality and PCI device functions too? You'd probably need
to add that new GPU support too.

Yes, might happen

And then looking at that patch again, see how this new code is bolted on
and sure, it all is made to work, but it is strenuous and you have to
always pay attention to what type of devices you're dealing with.

And the next patch does:

... if (bank_type == SMCA_UMC_V2) {

/* do UMC v2 special stuff here. */

which begs the question: wouldn't this GPU PCI devices enumeration be a
lot cleaner if it were separate?

I.e., in amd_nb.c you'd have

init_amd_nbs:

amd_cache_northbridges();
amd_cache_gart();
amd_cache_gpu_devices();

Agreed. however, a slight modification to the suggestion

Instead of modifying the init_amd_nbs()

How about, defining a new struct

+struct system_topology {
+       const struct pci_device_id *misc_ids;
+       const struct pci_device_id *link_ids;
+       const struct pci_device_id *root_ids;
+       u16 roots_per_misc;
+       u16 misc_count;
+       u16 root_count;
+};

and modifying the amd_cache_northbridges() to

+int amd_cache_northbridges(void)
+{
+        struct system_toplogy topo;
+        int ret;
+
+        if (amd_northbridges.num)
+                return 0;
+
+        ret = amd_cpu_nbs(&topo);
+        printk("==> misc:%d\n", ret);
+
+        if (look_for_remote_nodes()) {
+                ret = amd_gpu_nbs(&topo);
+                printk("==> gpu_misc:%d\n", ret);
+        }
+
+        get_next_northbridges(&topo);

This way, creating appropriate number MCs under EDAC and existing exported APIs can remain the same.

Let me know your thoughts on this. I can send an updated version with your comments addressed.

and in this last one you do your enumeration. Completely separate data
structures and all. Adding a new device support would then be trivial.

And then looking at the next patch again, you have:

+ } else if (bank_type == SMCA_UMC_V2) {
+ /*
+ * SMCA_UMC_V2 exists on GPU nodes, extract the node id
+ * from register MCA_IPID[47:44](InstanceIdHi).
+ * The InstanceIdHi field represents the instance ID of the GPU.
+ * Which needs to be mapped to a value used by Linux,
+ * where GPU nodes are simply numerically after the CPU nodes.
+ */
+ node_id = ((m->ipid >> 44) & 0xF) -
+ amd_gpu_node_start_id() + amd_cpu_node_count();

where instead of exporting those functions and having the caller do the
calculations, you'd have a function in amd_nb.c which is called

amd_get_gpu_node_id(unsigned long ipid)

which will use those separate data structures mentioned above and give
you the node id.

Sure, we can modify this way.

And those GPU node IDs are placed numerically after the CPU nodes so
your code doesn't need to do anything special - just read out registers
and cache them.

And you don't need those exports either - it is all nicely encapsulated
and a single function is used to get the callers what they wanna know.

Got it, thank you.

Hmmm?

--
Regards/Gruss,
Boris.

https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.kernel.org%2Ftglx%2Fnotes-about-netiquette&data=04%7C01%7CNaveenKrishna.Chatradhi%40amd.com%7Cdd5b3586178441f4886808d99e2b1ef3%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637714730331703852%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=oXDojOFqVVhxn4P1tgwLycaJgc2rvwo8EoUj3i971Mw%3D&reserved=0

Next message: Ilya Leoshkevich: "[PATCH] perf: Use __BYTE_ORDER__"
Previous message: kernel test robot: "Re: [PATCH 1/2] drm/input_helper: Add new input-handling helper"
In reply to: Borislav Petkov: "Re: [PATCH v6 1/5] x86/amd_nb: Add support for northbridges on Aldebaran"
Next in thread: Chatradhi, Naveen Krishna: "Re: [PATCH v6 1/5] x86/amd_nb: Add support for northbridges on Aldebaran"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]