Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events

From: Borislav Petkov
Date: Tue May 29 2012 - 10:52:15 EST


On Tue, May 29, 2012 at 11:02:10AM -0300, Mauro Carvalho Chehab wrote:
> It seems you were unable to read the comments at the function that fills dimm->grain:
>
> /*
> * The dram rank boundary (DRB) reg values are boundary addresses
> * for each DRAM rank with a granularity of 64MB. DRB regs are
> * cumulative; the last one will contain the total memory
> * contained in all ranks.

This looks like a bug:

"The DRAM Rank Boundary Register defines the upper boundary address
of each DRAM rank with a granularity of 32 MB. Each rank has its own
single-byte DRB register. These registers are used to determine which
chip select will be active for a given address."

This is from http://www.intel.com/Assets/PDF/datasheet/306828.pdf which
is 955X but it should be documenting the same thing - DRB.

Now, if I'm reporting an error address and I'm saying "you had an error
at X, but this error is somewhere in the X+64MB region", then I can
simply say which rank it is. And we're doing that already with the
layer-things.
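
To make that concrete, here is a minimal sketch of mapping an address to
a rank from cumulative DRB values. The helper name rank_of_addr() and the
DRB_GRANULARITY macro are made up for illustration and are not taken from
any driver; the 32 MB unit is the one from the datasheet text above.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: DRB unit in bytes, 32 MB as per the datasheet quoted above. */
#define DRB_GRANULARITY (32ULL << 20)

/*
 * Hypothetical helper: drb[i] holds the cumulative upper boundary of
 * rank i in units of DRB_GRANULARITY. Returns the index of the first
 * rank whose upper boundary lies above addr, or -1 if addr is beyond
 * the last boundary. An unpopulated rank repeats the previous boundary
 * and is therefore never selected.
 */
static int rank_of_addr(const uint8_t *drb, size_t nr_ranks, uint64_t addr)
{
	for (size_t i = 0; i < nr_ranks; i++) {
		uint64_t upper = (uint64_t)drb[i] * DRB_GRANULARITY;

		if (addr < upper)
			return (int)i;
	}
	return -1;
}
```

With boundaries { 8, 16, 16, 32 }, an address at 256 MB lands in rank 1
and an address at 512 MB lands in rank 3, since rank 2 is empty.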

[ … ]

> That means that any correlation function used by a stochastic process
> analysis will need to take the grain into account, in order to detect
> if a series of errors are due to a random noise, or if they're due to
> a physical problem at the device.

Dude, stop talking crap and concentrate. On which planet is granularity
of the error 64 MB?

From <Documentation/edac.txt>:

============================================================================
SYSTEM LOGGING

If logging for UEs and CEs are enabled then system logs will have
error notices indicating errors that have been detected:

EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0,
channel 1 "DIMM_B1": amd76x_edac

EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0,
channel 1 "DIMM_B1": amd76x_edac


The structure of the message is:
        the memory controller           (MC0)
        Error type                      (CE)
        memory page                     (0x283)
        offset in the page              (0xce0)
        the byte granularity            (grain 8)
                or resolution of the error
                   ^^^^

and

struct csrow_info {
        unsigned long first_page;       /* first page number in dimm */
        unsigned long last_page;        /* last page number in dimm */
        unsigned long page_mask;        /* used for interleaving -
                                         * 0UL for non intlv
                                         */
        u32 nr_pages;                   /* number of pages in csrow */
        u32 grain;                      /* granularity of reported error in bytes */
                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
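
As an illustration of what "grain in bytes" means for a log line like
the ones quoted above, here is a small user-space sketch that parses the
CE message and computes the start of the grain-sized window containing
the error. The struct edac_ce type, both function names and the 4 KiB
page size are assumptions made for this example; the window arithmetic
also assumes the grain is a power of two.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumption: 4 KiB pages. */
#define PAGE_SHIFT_ASSUMED 12

/* Hypothetical container for the fields of one CE log line. */
struct edac_ce {
	unsigned int mc;
	unsigned long page;
	unsigned long offset;
	unsigned int grain;
	unsigned int syndrome;
};

/* Parse the leading fields of a CE line; returns 0 on success, -1 otherwise. */
static int parse_edac_ce(const char *line, struct edac_ce *e)
{
	int n = sscanf(line,
		       "EDAC MC%u: CE page 0x%lx, offset 0x%lx, grain %u, syndrome 0x%x",
		       &e->mc, &e->page, &e->offset, &e->grain, &e->syndrome);

	return n == 5 ? 0 : -1;
}

/* Start address of the grain-sized window that contains the error. */
static uint64_t grain_window_start(const struct edac_ce *e)
{
	uint64_t addr = ((uint64_t)e->page << PAGE_SHIFT_ASSUMED) | e->offset;

	if (e->grain == 0)
		return addr;
	return addr & ~((uint64_t)e->grain - 1);
}
```

For the first quoted line (page 0x283, offset 0xce0, grain 8) the
address is already 8-byte aligned, so the window is the 8 bytes starting
at 0x283ce0.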

But none of that matters - the only thing that matters is that this
thing is static and doesn't change for the module's lifetime.

So add it as a part of some EDAC initialization printk which we print
once on boot in dmesg and userspace tools can read it. Or to sysfs, if
it makes more sense.

But not in _each_ tracepoint record, filling the buffers with useless info.
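
Roughly, the split being argued for looks like the sketch below: the
static property is reported once, and the per-event record carries only
what actually varies. This is plain user-space C for illustration, not
kernel code; the struct names, function names and message formats are
all hypothetical.

```c
#include <stdio.h>

/* Hypothetical: properties fixed for the lifetime of the module. */
struct mc_static_info {
	unsigned int mc_idx;
	unsigned int grain;	/* bytes */
};

/* Hypothetical: per-event record, deliberately without grain. */
struct mc_event {
	unsigned int mc_idx;
	unsigned long page;
	unsigned long offset;
	unsigned int syndrome;
};

/* Formatted once, e.g. at initialization time. */
static int format_static_info(char *buf, size_t len,
			      const struct mc_static_info *s)
{
	return snprintf(buf, len, "EDAC MC%u: grain %u bytes",
			s->mc_idx, s->grain);
}

/* Formatted for every error event. */
static int format_event(char *buf, size_t len, const struct mc_event *e)
{
	return snprintf(buf, len,
			"EDAC MC%u: CE page 0x%lx, offset 0x%lx, syndrome 0x%x",
			e->mc_idx, e->page, e->offset, e->syndrome);
}
```

A userspace tool would read the grain once from the init line and apply
it to every event it sees afterwards.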

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
GM: Alberto Bozzo
Reg: Dornach, Landkreis Muenchen
HRB Nr. 43632 WEEE Registernr: 129 19551