Re: [PATCH v24b] RAS: Add a tracepoint for reporting memory controller events

From: Borislav Petkov
Date: Tue May 22 2012 - 09:05:20 EST


On Tue, May 22, 2012 at 07:18:21AM -0300, Mauro Carvalho Chehab wrote:
> On 22-05-2012 06:28, Borislav Petkov wrote:
> > On Tue, May 22, 2012 at 12:04:48AM -0300, Mauro Carvalho Chehab wrote:
> >> +TRACE_EVENT(mc_event,
> >> +
> >> + TP_PROTO(const unsigned int err_type,
> >> + const unsigned int mc_index,
> >> + const char *error_msg,
> >> + const char *label,
> >> + int layer0,
> >> + int layer1,
> >> + int layer2,
> >
> > Those are EDAC-internal layer representation, why are they exported to
> > userspace? Userspace needs only the location and label AFAICT.
>
> Those are not the EDAC internal layer representation. They're the physical
> location of the DIMM or rank.

Ok, you've replaced the location char * with the layers.

> > If you export them to userspace, they need much more meaningful names -
> > layer{0,1,2} mean nothing outside of the kernel.
>
> Ok. Do you have a better naming suggestion?
>
> What about layer0_pos, layer1_pos, layer2_pos?

Actually, I'd like them to be called channel/rank/row or something. With them
numbered, I can't tell which layer is the top layer (channel/branch/slot)
and which is the lowest (rank/csrow/...).

Maybe top_layer, middle_layer, lowest_layer? Or something like that...

> >
> >> + unsigned long pfn,
> >> + unsigned long offset,
> >> + unsigned long grain,
> >
> > Why aren't those a single 'unsigned long address' since they all are
> > computed from it?
>
> We can merge pfn and offset into "unsigned long address".

Just have a single "unsigned long address" field and userspace can pick
out the stuff it needs from it.
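
As a rough sketch of what that could look like on the userspace side
(assuming 4 KiB pages; the EX_* macros and ex_* helper names are made up
for illustration and are not part of any existing API):

```c
#include <stdint.h>

/* Assumption: 4 KiB pages, as on x86; adjust for other page sizes. */
#define EX_PAGE_SHIFT 12
#define EX_PAGE_SIZE  ((uint64_t)1 << EX_PAGE_SHIFT)

/* Hypothetical helper: page frame number of a physical address. */
static inline uint64_t ex_addr_to_pfn(uint64_t address)
{
	return address >> EX_PAGE_SHIFT;
}

/* Hypothetical helper: byte offset of the address within its page. */
static inline uint64_t ex_addr_to_offset(uint64_t address)
{
	return address & (EX_PAGE_SIZE - 1);
}
```

For an address of 0x12345678 this gives pfn 0x12345 and offset 0x678, so
nothing is lost by reporting only the address.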

> With regards to the grain, it is an address mask, written in a "short" way.
> So, grain 32, for example, means:
> ffff:ffff:ffff:ffe0
>
> As the current EDAC API exports it as grain, IMO, it is better to keep it as-is,
> but it won't be hard to do:
> unsigned long mask = ((unsigned long) -1) & ~(grain - 1)
>
> What do you think?

Why are we even exporting grain with each tracepoint invocation? It is
the granularity of the reported error in bytes and, as such, is
statically assigned in each driver. Userspace can certainly figure out
that value in a different way.

But the more important question is: does the grain help us when handling
the error info in userspace?

It tells us that at this physical address with "grain" granularity we
had an error. So?
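
To spell out what the grain gives you: the error lies somewhere in the
grain-sized block containing the reported address. A minimal sketch,
assuming grain is a nonzero power of two (the ex_* helper names are
hypothetical, not existing EDAC functions):

```c
#include <stdint.h>

/* Hypothetical helper; only valid when grain is a nonzero power of two. */
static inline uint64_t ex_grain_to_mask(uint64_t grain)
{
	return ~(grain - 1);
}

/* Start of the grain-sized block that contains address. */
static inline uint64_t ex_grain_base(uint64_t address, uint64_t grain)
{
	return address & ex_grain_to_mask(grain);
}
```

With grain 32 the mask is 0xffffffffffffffe0, and an address of
0x1234567f falls in the block starting at 0x12345660.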

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551