Re: [PATCH] EDAC/mce_amd: Fix Hygon UMC ECC error decoding with logical_die_id
From: Aichun Shi
Date: Sun Mar 01 2026 - 09:26:58 EST
On Mon, Feb 16, 2026 at 03:32:11PM -0500, Yazen Ghannam wrote:
> On Sat, Feb 14, 2026 at 02:42:03PM +0800, Aichun Shi wrote:
> > cpuinfo_topology.amd_node_id is populated via CPUID or MSR, as introduced
> > by commit f7fb3b2dd92c ("x86/cpu: Provide an AMD/HYGON specific topology
> > parser") and commit 03fa6bea5a3e ("x86/cpu: Make topology_amd_node_id()
> > use the actual node info"). However, this value may be non-continuous for
> > Hygon processors while EDAC uses continuous node IDs, which leads to
> > incorrect UMC ECC error decoding.
>
> Can you please share an example?

Yazen, thanks for your reply!

Certainly. For example, on some Hygon processors with 2 sockets and 4 dies
per socket, amd_node_id is populated as 0,1,2,3 for the 4 dies on socket 0
and as 16,17,18,19 for the 4 dies on socket 1, so the node IDs are not
contiguous across sockets.

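To make the mismatch concrete, here is a minimal user-space sketch (not kernel
code). It assumes the 2-socket, 4-dies-per-socket layout from the example, and
it assumes the consumer keeps one table entry per die, indexed by a contiguous
ID. All function and macro names below are hypothetical and exist only for
illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_SOCKETS            2
#define DIES_PER_SOCKET       4
#define NR_NODES              (NR_SOCKETS * DIES_PER_SOCKET)

/* Assumption taken from the example: socket 1 starts at node ID 16. */
#define NODE_ID_SOCKET_STRIDE 16

/* Hypothetical model of the amd_node_id values seen on such a system. */
unsigned int model_amd_node_id(unsigned int socket, unsigned int die)
{
	assert(socket < NR_SOCKETS && die < DIES_PER_SOCKET);
	return socket * NODE_ID_SOCKET_STRIDE + die;
}

/* Hypothetical model of a contiguous per-die ID (0..NR_NODES-1). */
unsigned int model_contiguous_id(unsigned int socket, unsigned int die)
{
	assert(socket < NR_SOCKETS && die < DIES_PER_SOCKET);
	return socket * DIES_PER_SOCKET + die;
}

/* True if id can index a per-node table with NR_NODES entries. */
bool node_id_in_range(unsigned int id)
{
	return id < NR_NODES;
}
```

With these assumptions, model_amd_node_id(1, 0) yields 16, which does not fit
a table of 8 nodes, while model_contiguous_id(1, 0) yields 4, which does.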
> >
> > In contrast, cpuinfo_topology.logical_die_id always provides continuous
> > die (or node) IDs. Fix this by replacing topology_amd_node_id() with
> > topology_logical_die_id() when decoding UMC ECC errors for Hygon
> > processors.

However, on Hygon processors without CPUID leaf 0x80000026, the
logical_die_id obtained from topology_get_logical_id(apicid, TOPO_DIE_DOMAIN)
is incorrect, because the APIC ID space on those processors carries no die
topology information.

I have sent a separate patch to fix this issue:
https://lore.kernel.org/lkml/20260301141157.241770-1-shiaichun@xxxxxxxxxxxxxx/

Could you please review that patch first?

> >
> > Signed-off-by: Aichun Shi <shiaichun@xxxxxxxxxxxxxx>
> > ---
> > drivers/edac/mce_amd.c | 9 +++++++--
> > 1 file changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/edac/mce_amd.c b/drivers/edac/mce_amd.c
> > index af3c12284a1e..4a23c1d6488e 100644
> > --- a/drivers/edac/mce_amd.c
> > +++ b/drivers/edac/mce_amd.c
> > @@ -746,8 +746,13 @@ static void decode_smca_error(struct mce *m)
> > pr_emerg(HW_ERR "%s Ext. Error Code: %d", smca_get_long_name(bank_type), xec);
> >
> > if ((bank_type == SMCA_UMC || bank_type == SMCA_UMC_V2) &&
> > - xec == 0 && decode_dram_ecc)
> > - decode_dram_ecc(topology_amd_node_id(m->extcpu), m);
> > + xec == 0 && decode_dram_ecc) {
> > + if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON &&
> > + boot_cpu_data.x86 == 0x18)
>
> Is the family check necessary? You did not mention a specific family in
> the commit message. So it seems the intent is to apply to all Hygon
> systems.

You are right, the family check (0x18) is overly restrictive and can be
removed.

> Thanks,
> Yazen

Thanks for your review and valuable comments!

Aichun Shi