Re: [PATCH v24b] RAS: Add a tracepoint for reporting memorycontroller events

From: Borislav Petkov
Date: Mon May 21 2012 - 12:01:04 EST


On Mon, May 21, 2012 at 12:29:29PM -0300, Mauro Carvalho Chehab wrote:
> Em 19-05-2012 06:26, Borislav Petkov escreveu:
> > On Fri, May 18, 2012 at 11:12:11PM +0200, Borislav Petkov wrote:
> >> On Fri, May 18, 2012 at 07:10:42PM +0000, Luck, Tony wrote:
> >>>> That's why _each_ _driver_ will have its format and the userspace tools
> >>>> parsing it will know about it!
> >>>
> >>> Sounds like a full employment program for parser writers.
> >>>
> >>> There are some interesting fields that should be common to all
> >>> drivers ... so have twenty parsers that can handle:
> >>>
> >>> paddr: 0x1234
> >>> PADDR: 0x1234
> >>> Paddr = 0x1234
> >>> Phys = 1234
> >>> addr: 0x1234
> >>> Address: 0x000000001234
> >>>
> >>> looks like a lot of make-work ... when the OS can standardize in the ABI
> >>> that there is a 64-bit binary value that is the physical address of the
> >>> error (and another 64-bit mask saying which, if any, bits are valid).
>
> The EDAC API works with the address/address mask as 3 values: address page,
> address offset inside the page and grain.
>
> The values for address page/offset/grain and syndrome are mandatory on the EDAC
> error calls. Drivers that can't obtain that fill with zero. EDAC prints those
> values via the current API.
>
> So, passing those values for the tracepoint makes sense.
>
> In order to use a cleaner API, we may add a flags bitfield there, and add a few
> flags to mark if the address (page, offset, grain) is valid and if the syndrome
> is valid.

This is exactly the issue - we don't want to overcomplicate the
tracepoint! Simply output the single address value and userspace can cut
and split and paste it however it wants.

<snip some bullshit which I'm tired of addressing everytime>

> >> For the record, I very much like this reasoning :).
> >
> > On a second thought, filtering out the globally valid fields for all
> > drivers might require a lot of careful driver auditing.
>
> If you want to explode the driver details, then yes, but the address and
> syndrome are part of the EDAC internal ABI.
>
> Most drivers are capable of getting address and syndrome, for CE.
> For UE, syndrome is generally not calculated.
>
> Anyway, on _all_ internal ABI calls, when syndrome or address aren't
> calculated, the drivers pass a value of 0 for them.
>
> So, it is easy to add a logic inside the code that will rise a flag to
> indicate userspace when those fields are filled or not. All the EDAC
> core needs to do is to check if the value is 0 or not.

No, we're not going to do that! Adding a bit about the validity of a bit
is wrong design and is screaming B0RKED.

We're going to have single fields for EDAC-global valid values and leave
the driver-specific stuff lumped in one char * string.

Then, if a driver cannot specify sindrome or whatever, it simply DOESN'T
PUT IT IN THE STRING instead of adding validity bits.

[ â ]

> > Which is basically what I'm suggesting for a couple of days now :-)
> >
> > TP_PROTO(const unsigned int err_type,
> > const unsigned int mc_index,
> > const char *error_msg,
> > const char *label,
> > const char *location,
> > const char *detail)
> >
> > and I'm really not sure about err_type - this is an edac-specific define
> > and it means nothing outside the kernel so its string representation
> > could very well could be merged with error_msg and we can drop the ( ? :
> > ) ugliness in the tracepoint definition itself.
>
> All error management systems use a mandatory field for the error severity.

... I'm pretty sure the defines you're exporting are not the same as
your error management systems require. So some kind of mapping is
still needed. Which shows that you're going to need to massage that
information in userspace after all.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/