[CAUTION: External Email]Sure.
On Thu, Oct 28, 2021 at 06:31:06PM +0530, Naveen Krishna Chatradhi wrote:
On newer heterogeneous systems with AMD CPUs, the data fabrics of the GPUsFor all your future commit messages, from
are connected directly via custom links.
One such system, where Aldebaran GPU nodes are connected to the
Family 19h, model 30h family of CPU nodes, the Aldebaran GPUs can report
memory errors via SMCA banks.
Aldebaran GPU support was added to DRM framework
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.freedesktop.org%2Farchives%2Famd-gfx%2F2021-February%2F059694.html&data=04%7C01%7CNaveenKrishna.Chatradhi%40amd.com%7Ca6aa66df219641c880d008d9a514e5bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637722331481685552%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=Pnx%2FQxgmXS2MfWzu8lVEs22As3yJSWoAsjOvp%2FFOJYw%3D&reserved=0
The GPU nodes comes with HBM2 memory in-built, ECC support is enabled by
default and the UMCs on GPU node are different from the UMCs on CPU nodes.
GPU specific ops routines are defined to extend the amd64_edac
module to enumerate HBM memory leveraging the existing edac and the
amd64 specific data structures.
Note: The UMC Phys on GPU nodes are enumerated as csrows and the UMC
channels connected to HBM banks are enumerated as ranks.
Documentation/process/submitting-patches.rst:
"Describe your changes in imperative mood, e.g. "make xyzzy do frotz"
instead of "[This patch] makes xyzzy do frotz" or "[I] changed xyzzy
to do frotz", as if you are giving orders to the codebase to change
its behaviour."
Also, do not talk about what your patch does - that should hopefully be
visible in the diff itself. Rather, talk about *why* you're doing what
you're doing.
Will modify the names to use prefixes
Cc: Yazen Ghannam <yazen.ghannam@xxxxxxx>The function pointer this gets assigned to is called *get*_misc_regs.
Co-developed-by: Muralidhara M K <muralimk@xxxxxxx>
Signed-off-by: Muralidhara M K <muralimk@xxxxxxx>
Signed-off-by: Naveen Krishna Chatradhi <nchatrad@xxxxxxx>
Link: https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.kernel.org%2Fr%2F20210823185437.94417-4-nchatrad%40amd.com&data=04%7C01%7CNaveenKrishna.Chatradhi%40amd.com%7Ca6aa66df219641c880d008d9a514e5bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637722331481685552%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=aD3jFWS4RrPrIRF5mFoggBs%2BYqUseU3adJFqlrbD2D8%3D&reserved=0
---
Changes since v5:
Removed else condition in per_family_init for 19h family
Changes since v4:
Split "f17_addr_mask_to_cs_size" instead as did in 3rd patch earlier
Changes since v3:
1. Bifurcated the GPU code from v2
Changes since v2:
1. Restored line deletions and handled minor comments
2. Modified commit message and some of the function comments
3. variable df_inst_id is introduced instead of umc_num
Changes since v1:
1. Modifed the commit message
2. Change the edac_cap
3. kept sizes of both cpu and noncpu together
4. return success if the !F3 condition true and remove unnecessary validation
drivers/edac/amd64_edac.c | 298 +++++++++++++++++++++++++++++++++-----
drivers/edac/amd64_edac.h | 27 ++++
2 files changed, 292 insertions(+), 33 deletions(-)
diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 7953ffe9d547..b404fa5b03ce 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -1121,6 +1121,20 @@ static void debug_display_dimm_sizes_df(struct amd64_pvt *pvt, u8 ctrl)
}
}
+static void debug_display_dimm_sizes_gpu(struct amd64_pvt *pvt, u8 ctrl)
+{
+ int size, cs = 0, cs_mode;
+
+ edac_printk(KERN_DEBUG, EDAC_MC, "UMC%d chip selects:\n", ctrl);
+
+ cs_mode = CS_EVEN_PRIMARY | CS_ODD_PRIMARY;
+
+ for_each_chip_select(cs, ctrl, pvt) {
+ size = pvt->ops->dbam_to_cs(pvt, ctrl, cs_mode, cs);
+ amd64_info(EDAC_MC ": %d: %5dMB\n", cs, size);
+ }
+}
+
static void __dump_misc_regs_df(struct amd64_pvt *pvt)
{
struct amd64_umc *umc;
@@ -1165,6 +1179,27 @@ static void __dump_misc_regs_df(struct amd64_pvt *pvt)
pvt->dhar, dhar_base(pvt));
}
+static void __dump_misc_regs_gpu(struct amd64_pvt *pvt)
But this function is is doing *dump*.
When I see __dump_misc_regs_gpu() I'd expect this function to be called
by a higher level dump_misc_regs() as the "__" denotes layering usually.
There is, in fact, dump_misc_regs() but that one calls
->get_misc_regs().
So this all needs fixing - right now I see a mess.
Also, there's <function_name>_gpu and gpu_<function_name> with the "gpu"
being either prefix or suffix. You need to fix the current ones to be
only prefixes - in a pre-patch - and then add yours as prefixes only too.
Sure, i will split this.
And in talking about pre-patches - this one is doing a bunch of things
at once and needs splitting. You do the preparatory work like carving
out common functionality and other refactoring in pre-patches, and then
you add the new functionality ontop.
Yes, adding few more ops routines should help get rid of this if conditions.
Also, I don't like ->is_gpu one bit, it being sprinkled like that around
the code. This says that the per-family attrs and ops assignment is
lacking.
@@ -2982,7 +3132,17 @@ static void decode_umc_error(int node_id, struct mce *m)This comment doesn't look like it is destined for human beings to read
pvt->ops->get_umc_err_info(m, &err);
- if (umc_normaddr_to_sysaddr(m->addr, pvt->mc_node_id, err.channel, &sys_addr)) {
+ /*
+ * GPU node has #phys[X] which has #channels[Y] each.
+ * On GPUs, df_inst_id = [X] * num_ch_per_phy + [Y].
+ * On CPUs, "Channel"="UMC Number"="DF Instance ID".
+ */
but maybe to be parsed by programs?
+ if (pvt->is_gpu)I'm not sure we want to log ECCs from the GPUs: what is going to happen
+ df_inst_id = (err.csrow * pvt->channel_count / mci->nr_csrows) + err.channel;
to them, how does the further error recovery look like?
Also, EDAC sysfs structure currently has only DIMMs, below's from my
box.
I don't see how that structure fits the GPUs so here's the $10^6
question: why does EDAC need to know about GPUs?
Sure, i will update with the why.
What is the strategy here?
Your 0th message talks about the "what" but what gets added is not
important - the "why" is.
$ grep -r . /sys/devices/system/edac/mc/ 2>/dev/null
/sys/devices/system/edac/mc/power/runtime_active_time:0
/sys/devices/system/edac/mc/power/runtime_status:unsupported
/sys/devices/system/edac/mc/power/runtime_suspended_time:0
/sys/devices/system/edac/mc/power/control:auto
/sys/devices/system/edac/mc/mc0/ce_count:0
/sys/devices/system/edac/mc/mc0/rank7/dimm_ue_count:0
/sys/devices/system/edac/mc/mc0/rank7/dimm_mem_type:Unbuffered-DDR4
/sys/devices/system/edac/mc/mc0/rank7/power/runtime_active_time:0
/sys/devices/system/edac/mc/mc0/rank7/power/runtime_status:unsupported
/sys/devices/system/edac/mc/mc0/rank7/power/runtime_suspended_time:0
/sys/devices/system/edac/mc/mc0/rank7/power/control:auto
/sys/devices/system/edac/mc/mc0/rank7/dimm_dev_type:Unknown
/sys/devices/system/edac/mc/mc0/rank7/size:8192
/sys/devices/system/edac/mc/mc0/rank7/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/rank7/dimm_label:mc#0csrow#3channel#1
/sys/devices/system/edac/mc/mc0/rank7/dimm_location:csrow 3 channel 1
/sys/devices/system/edac/mc/mc0/rank7/dimm_edac_mode:SECDED
/sys/devices/system/edac/mc/mc0/topmem:0x00000000e0000000
/sys/devices/system/edac/mc/mc0/mc_name:F17h
/sys/devices/system/edac/mc/mc0/rank5/dimm_ue_count:0
/sys/devices/system/edac/mc/mc0/rank5/dimm_mem_type:Unbuffered-DDR4
/sys/devices/system/edac/mc/mc0/rank5/power/runtime_active_time:0
/sys/devices/system/edac/mc/mc0/rank5/power/runtime_status:unsupported
/sys/devices/system/edac/mc/mc0/rank5/power/runtime_suspended_time:0
/sys/devices/system/edac/mc/mc0/rank5/power/control:auto
/sys/devices/system/edac/mc/mc0/rank5/dimm_dev_type:Unknown
/sys/devices/system/edac/mc/mc0/rank5/size:8192
/sys/devices/system/edac/mc/mc0/rank5/dimm_ce_count:0
/sys/devices/system/edac/mc/mc0/rank5/dimm_label:mc#0csrow#2channel#1
/sys/devices/system/edac/mc/mc0/rank5/dimm_location:csrow 2 channel 1
/sys/devices/system/edac/mc/mc0/rank5/dimm_edac_mode:SECDED
/sys/devices/system/edac/mc/mc0/dram_hole:0 0 0
/sys/devices/system/edac/mc/mc0/ce_noinfo_count:0
...
--
Regards/Gruss,
Boris.
https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.kernel.org%2Ftglx%2Fnotes-about-netiquette&data=04%7C01%7CNaveenKrishna.Chatradhi%40amd.com%7Ca6aa66df219641c880d008d9a514e5bf%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C637722331481685552%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=fw%2B%2F8rCVRhhZ0xp8XLUW5rvoDUnfNQzwJleHVoPmDkw%3D&reserved=0