Re: [PATCH -v10 0/4] Lock-less list

From: Borislav Petkov
Date: Fri Jan 21 2011 - 13:01:59 EST


On Fri, Jan 21, 2011 at 09:39:34AM -0800, Tim Hockin wrote:
> >> Of course, that's why the upstream EDAC code uses printk too. In fact it
> >> does all sorts of in-kernel decoding to make the printk output more
> >> useful - the /dev/mcelog method of pushing all decoding to user-space
> >> is fundamentally flawed.
>
> EDAC is fundamentally flawed and we don't use it any more. It strips
> off so much information that we can't actually figure out what
> happened to the level we want. We do it in userspace now.

Well, you'd better make sure to tell me what information you need reported
and I'll try to get it fixed :) Currently, we can decode all MCEs in the
kernel, and when an MCE reports a DRAM ECC error, EDAC can give you the
chip select it came from.

We can also get you the bank, row and column from which the error
originates (could be added easily to amd64_edac.c).

[..]

> > It's also very ignorant to assume that the kernel knows everything about the
> > system and is capable of decoding errors to the satisfaction of userland.
> > As Duncan Laurie pointed out (https://lkml.org/lkml/2011/1/11/390) we care
> > about not only the physical address, but which stick and which dimm *chip*
> > on the stick is having problems. In-kernel abstractions break down due to
> > the following:
>
> This. Andi was trying to use DMI tables to decode physical address to
> DIMMs, but I'll tell you this: I have yet to see a platform that has
> THAT MUCH information in the DMI tables and have it be *correct*.

And yes, there's no foolproof, generic way to tell which chip select on
the system points at which DIMM. And excuse me, but I really, really
think that reading i2c devices and deciphering SPD ROM info from them is
still not the optimal solution - it should be easier and more
transparent than that. But guess what, this might change...

> >   * The kernel couldn't possibly know how my i2c buses are set up and how the
> > SPD EEPROMs are related to the physical memory abstraction that the BIOS
> > sets up for me.  I don't know of any standard way to have the BIOS expose
> > this sort of information to the operating system.  This sort of layout
> > changes between motherboard spins quite frequently as well, so good luck
> > mapping it yourself in any generic way.
> >
> >   * The kernel couldn't know how to map SPD JEDEC Manufacturer ID, Model
> > part number and revision to anything useful about the chips themselves.
> >
> >   * The kernel also couldn't know how to communicate with the AMBs in a
> > meaningful way (if present).
> >
> >
> > At the end of the day, the only things I really care about are:
> >
> >   * I don't care if the kernel pre-processes the data it gets from the
> > hardware when there is an error.  For most users, burping something out to
> > the logs in decoded form is generally useful.  It isn't for us.
> >   * Don't ever put the kernel in a position where it will spam the logs and
> > wedge the system -- even if the hardware is wonky.
>
> I'll add to this - sometimes 100 MCEs/second is acceptable. The
> kernel needs to not flake out under that.

Yeah, we got that, you want error reporting to be configurable and not
only over printk - we'll fix it.

> >   * Don't dummy the data such that I can't do the same calculations with
> > better visibility from userland.
>
> This. We do extensive analysis of data in userland.

Yeah, we want to put the MCE register info along with the decoded info.
We don't want to dummy up the data - we want to make it more useful.
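
As a rough illustration of what "raw plus decoded" could look like from
the consuming side, here is a minimal Python sketch. The MceEvent class
and its field names are hypothetical and do not correspond to any existing
kernel or mcelog interface. Only the MCi_STATUS flag bit positions (VAL,
OVER, UC, EN, MISCV, ADDRV, PCC) and the low 16-bit MCA error code follow
the architectural x86 definition. The point is that the decoded view is
derived from the raw registers, which stay available untouched.

```python
from dataclasses import dataclass

# Architectural flag bits of the x86 MCi_STATUS register.
STATUS_FLAGS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}


@dataclass(frozen=True)
class MceEvent:
    """Hypothetical event record: raw register values are kept verbatim."""
    bank: int
    status: int  # raw MCi_STATUS
    addr: int    # raw MCi_ADDR
    misc: int    # raw MCi_MISC

    def flags(self):
        """Return the set of status flag names whose bit is set."""
        return {name for name, bit in STATUS_FLAGS.items()
                if (self.status >> bit) & 1}

    def error_code(self):
        """Return the MCA error code held in status bits 15:0."""
        return self.status & 0xFFFF

    def describe(self):
        """Decoded summary followed by the raw registers for userland."""
        flags = ",".join(sorted(self.flags())) or "none"
        return (f"bank {self.bank}: code=0x{self.error_code():04x} "
                f"flags={flags} "
                f"[status=0x{self.status:016x} addr=0x{self.addr:016x} "
                f"misc=0x{self.misc:016x}]")
```

A consumer that distrusts the decoded part can ignore describe() and redo
its own calculations from status, addr and misc.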

> >   * Don't ever enforce a reactive policy that can't be changed from
> > userland.
> >   * I don't care whether the data comes from netlink, /dev/mcelog,
> > whiz-bang-sysfs uevent, or thingamaboo perfevents doohickie: as long as I
> > get events that are both atomic+consistent and the ABI is maintained.
>
> I've been asking for hardware events forever. I seem to recall a
> proposal from IBM at OLS 2002 or 2003 where this was discussed. I
> wanted it then, and I still want it. But I don't just want MCEs. Why
> can I not use the same channel to get PCI errors or SATA errors or
> EDAC (non-MCE) errors?
>
> I don't care what the channel is, so long as I can rate-limit
> (/dev/mcelog is pretty good at that) events and the events I read
> contain full details about what happened.

Ok, makes sense.
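
To make the rate-limiting requirement concrete, here is a minimal sketch
of a token-bucket limiter in Python. The token-bucket policy is chosen
here purely for illustration; it is not a description of how /dev/mcelog
or any existing channel limits events. The class name and parameters are
hypothetical. Dropped events are counted rather than silently discarded,
so the reader can still tell that something was lost.

```python
class TokenBucket:
    """Hypothetical event limiter: at most `rate` events/s, bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)      # tokens refilled per second
        self.burst = float(burst)    # maximum tokens that can accumulate
        self.tokens = float(burst)   # start with a full bucket
        self.last = None             # timestamp of the previous call
        self.dropped = 0             # events rejected so far

    def allow(self, now):
        """Return True if an event arriving at time `now` (seconds) may pass."""
        if self.last is not None:
            refill = (now - self.last) * self.rate
            self.tokens = min(self.burst, self.tokens + refill)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        self.dropped += 1
        return False
```

With rate=100 and burst=5, a storm of events at the same instant lets the
first five through and counts the rest as dropped until the bucket
refills.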

> > I've CCed Robert who owns our userland bits as he may have something to add.
> >
> > That said, I'd love to have generic NMI-safe data-passing for improved
> > debuggability, regardless of this conflated bickering about RAS
> > infrastructure :)

Thanks for the suggestions, much appreciated.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632