Re: [PATCH v22] edac, ras/hw_event.h: use events to handle hw issues

From: Borislav Petkov
Date: Wed May 16 2012 - 09:17:30 EST


On Wed, May 16, 2012 at 08:22:07AM -0300, Mauro Carvalho Chehab wrote:
> >> If you take a look at the trace printk:
> >>
> >> + TP_printk("%s error:%s on memory stick \"%s\" (mc:%d %s %s %s)",
> >> + (__entry->err_type == HW_EVENT_ERR_CORRECTED) ? "Corrected" :
> >> + ((__entry->err_type == HW_EVENT_ERR_FATAL) ?
> >> + "Fatal" : "Uncorrected"),
> >> + __get_str(msg),
> >> + __get_str(label),
> >> + __entry->mc_index,
> >> + __get_str(location),
> >> + __get_str(detail),
> >> + __get_str(driver_detail))
> >>
> >> There are not extra spaces there. The first extra space is probably because
> >> there is an extra space at the label string. This should be easy to fix.
> >>
> >> The other extra space at the end is because amd64 currently doesn't provide
> >> driver_detail information.
> >
> > Remind me again why do we need two strings: detail and driver_detail?
> >
> > Because they could very well be lumped together with a single "%s"
> > format - "(mc:%d %s)" - and be printed.
>
> As already explained, after merging two different fields, they can't be easily
> unmerged.
>
> The main reason for moving from printk to tracepoint is to allow userspace
> programs to get data in binary format, avoiding them to parse a printk message.
>
> So, it is more important to provide the right information at the trace fields
> than the niceness of the printk data.

This doesn't answer my question. My question was: "why can't 'detail'
and 'driver_detail' be a single parameter, i.e. 'detail' and this way
solve both pretty printing and getting binary data problems?

> > And detail will always contain something which is not the empty string,
> > so problem solved.
> >
> >>> Also, according to the output above "amd64_edac" is supposed to be
> >>> [error msg] which is strange.
> >>>
> >>> I believe this comes from this call in f1x_map_sysaddr_to_csrow():
> >>>
> >>> edac_mc_handle_error(HW_EVENT_ERR_CORRECTED, mci,
> >>> page, offset, syndrome,
> >>> csrow, chan, -1,
> >>> EDAC_MOD_STR, "", NULL);
> >>>
> >>> I guess you want to do the following instead:
> >>>
> >>> mcegen.py-2868 [007] .N.. 178.261607: mc_event: amd64_edac: corrected error on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0x34ed)
> >>>
> >>> maybe concatenate EDAC_MOD_STR with the proper string it reports, i.e.
> >>> corrected/uncorrected error?
> >>
> >> The issue here is because amd64_edac (just like a few other drivers) use
> >> its driver name (EDAC_MOD_STR) as the error message, instead of using
> >> something meaningful, like "read error" or "ECC error".
> >
> > No, the issue is here that edac_mc_handle_ce() used to say "CE..."
> > and edac_mc_handle_ue() used to say "UE.. " and yours don't say that
> > anymore. In other words, you need to add the "CE/UE" thing to the string
> > based on the HW_EVENT_ERR_* flag or something to that effect.
>
>
> Huh?
>
> >>> mcegen.py-2868 [007] .N.. 178.261607: mc_event: Corrected error:amd64_edac on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0x34ed )
>
> CE/UE is there. The only change is that, instead of using acronyms, it is now
> saying "Corrected error"/"Uncorrected error", as the idea is to provide something
> that the user will better understand.

Ok, so let's change the string output from the above version to:

"mcegen.py-2868 [007] .N.. 178.261607: mc_event: amd64_edac: Corrected error on memory stick "unknown memory" (mc:0 csrow:3 channel:1 page:0x5bac7 offset:0x388 grain:0 syndrome:0x34ed)"

I.e., "mc_event" [DRIVER] [ERROR TYPE] [DETAIL].

> >>>> Of course, any userspace tools meant to handle errors should not parse
> >>>> the above data. They should, instead, use the binary fields provided by
> >>>> the tracepoint, mapping them directly into their MIBs.
> >>>
> >>> What is a MIB?
> >>
> >> Management Information Base. This is how anyone that works with Element
> >> Management calls the model of information that represents each management
> >> property. It is generally written using ITU-T ASN.1 syntax. Almost all
> >> management software use that.
> >>
> >> [1] http://en.wikipedia.org/wiki/Management_information_base
> >
> > That looks like an ACPI or some other idiotic spec speak, pls remove it.
>
> No, MIB is not an idiotic speak. It is a widely-used term, older than TCP/IP,
> and defined by ITU-T (M.3000 series) and by ISO.
>
> It is used by everybody that knows a little bit about error management,
> as all management systems and protocols (like SNMP) are based on MIB
> concepts. This concept is also used by other international forums, like
> IETF, to define the Information Model to be used to manage the machine
> and/or network resources. For example, the TCP/IP stack management base
> is defined at IETF RFC1213.

More idiotic drivel.

Oh, well, you won't let go of this crap, so at least add the wikipedia
URL to the "MIB" reference so that mere mortals like me who don't know
jack of error management can understand what you're talking about.

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/