Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events

From: Mauro Carvalho Chehab
Date: Thu May 31 2012 - 12:14:44 EST


On 31-05-2012 12:14, Borislav Petkov wrote:
> On Thu, May 31, 2012 at 12:01:19PM -0300, Mauro Carvalho Chehab wrote:
>> Grain is an error property, associated with the error address.
>> It is as simple as that. It is not a "change grain frequently" type
>> of thing: each address has its associated grain.
>
> ... which almost never changes:
>
> 5 amd76x_edac.c amd76x_init_csrows 214 dimm->grain = dimm->nr_pages << PAGE_SHIFT;
> 6 cpc925_edac.c cpc925_init_csrows 367 dimm->grain = 32;
> 7 cpc925_edac.c cpc925_init_csrows 371 dimm->grain = 64;
> 8 e752x_edac.c e752x_init_csrows 1119 dimm->grain = 1 << 12;
> 9 e7xxx_edac.c e7xxx_init_csrows 399 dimm->grain = 1 << 12;
> k i3000_edac.c i3000_probe1 416 dimm->grain = I3000_DEAP_GRAIN;
> l i3200_edac.c i3200_probe1 395 dimm->grain = nr_pages << PAGE_SHIFT;
> m i5000_edac.c i5000_init_csrows 1286 dimm->grain = 8;
> n i5100_edac.c i5100_init_csrows 852 dimm->grain = 32;
> o i5400_edac.c i5400_init_dimms 1212 dimm->grain = 8;
> p i7300_edac.c decode_mtr 662 dimm->grain = 8;
> q i7core_edac.c get_dimm_config 637 dimm->grain = 8;
> r i82443bxgx_edac.c i82443bxgx_init_csrows 225 dimm->grain = 1 << 12;
> s i82860_edac.c i82860_init_csrows 180 dimm->grain = 1 << 12;
> t i82875p_edac.c i82875p_init_csrows 388 dimm->grain = 1 << 12;
> v i82975x_edac.c i82975x_init_csrows 430 dimm->grain = 1 << 7;
> w mpc85xx_edac.c mpc85xx_init_csrows 956 dimm->grain = 8;
> x mv64x60_edac.c mv64x60_init_csrows 677 dimm->grain = 8;
> y pasemi_edac.c pasemi_edac_init_csrows 183 dimm->grain = PASEMI_EDAC_ERROR_GRAIN;
> z ppc4xx_edac.c ppc4xx_edac_init_csrows 983 dimm->grain = 1;
> A r82600_edac.c r82600_init_csrows 259 dimm->grain = 1 << 14;
> B sb_edac.c get_dimm_config 597 dimm->grain = 32;
> C tile_edac.c tile_edac_init_csrows 117 dimm->grain = TILE_EDAC_ERROR_GRAIN;
> D x38_edac.c x38_probe1 394 dimm->grain = nr_pages << PAGE_SHIFT;

The grain differs among the drivers, and userspace needs to know it, so an
API is needed.

>
> From all possible EDAC grain assignments above, only 3 are not static.

+ sb_edac
+ i7core_edac

On both, the grain should come from the MCE registers (it is on my TODO list).

>
>> Ok, on _old_ hardware, this used to be constant, but on modern ones,
>> this is associated with the error type, as Tony already explained.
>
> You mean "different" hardware.

I mean _old_ hardware, e.g. non-MCA hardware. On MCA, the MISCV flag
(at least on Intel) changes the address granularity.
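
To illustrate the MISCV point, here is a minimal sketch of deriving a
per-error grain from the MCA registers. It assumes the Intel layout in which
bit 59 of MCi_STATUS is MISCV and bits 5:0 of MCi_MISC hold the least
significant valid bit of the reported address. The helper name mce_grain()
and its fallback parameter are hypothetical and are not taken from any
driver.

```c
#include <stdint.h>

/* Bit 59 of MCi_STATUS: the MCi_MISC register holds valid data. */
#define MCI_STATUS_MISCV (1ULL << 59)

/*
 * Hypothetical helper: return the address grain, in bytes, for one error.
 * If MISCV is clear there is no per-error information, so the caller's
 * static fallback grain is returned instead.
 */
static uint64_t mce_grain(uint64_t status, uint64_t misc, uint64_t fallback)
{
	unsigned int lsb;

	if (!(status & MCI_STATUS_MISCV))
		return fallback;

	/* Bits 5:0 of MCi_MISC: least significant valid address bit. */
	lsb = misc & 0x3f;

	return 1ULL << lsb;
}
```

With MISCV set and an LSB of 6, this gives a 64-byte grain; with MISCV clear,
the static value passed by the caller is used.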

>> Don't create a crappy API, just because you want to save 32 bits.
>> Btw, a "string" grain will spare much more than just 32 bits.
>
> Don't create a bloated API just to fit your purpose because you're
> staring at the world through your glasses.

It is not a bloated API. The error grain should be reported to userspace,
because:
- Not all drivers have the same address granularity, as you've shown
  above;
- No other userspace API provides it;
- The granularity is a property of the per-error address;
- There are well-known cases where the address grain changes and is
  filled in dynamically from the error registers (MCA arch on Intel).

So, the memory error tracepoint is the proper place to store it, as it is
the place where the address and the other memory error information is
reported to userspace.
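
As a sketch of why a consumer wants the grain next to the address: the pair
lets userspace bound the region that may contain the error. This assumes the
grain is a non-zero power of two, which does not hold for every driver (the
nr_pages << PAGE_SHIFT values need not be). The struct and function names
are hypothetical.

```c
#include <stdint.h>

/* Hypothetical result type: inclusive range that may hold the error. */
struct err_range {
	uint64_t first;
	uint64_t last;
};

/*
 * Compute the range covered by an error reported at 'addr' with the
 * given grain. Assumes 'grain' is a non-zero power of two.
 */
static struct err_range grain_to_range(uint64_t addr, uint64_t grain)
{
	struct err_range r;
	uint64_t mask = grain - 1;

	r.first = addr & ~mask;	/* round down to the grain boundary */
	r.last = addr | mask;	/* last byte inside the same grain */

	return r;
}
```

For example, address 0x12345 with a grain of 1 << 12 yields the range
0x12000 to 0x12fff, while a grain of 1 yields the single byte 0x12345.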

Also, converting the grain to a string, as you proposed, would require at
least 26 bytes to store "grain: 0xdeadbeef:deadbeef", while storing it as
a u64 will consume only 8 bytes.
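
The byte counts can be checked with a small sketch. The string literal is
the one quoted above; the helper names are hypothetical and only compare the
two encodings.

```c
#include <stdint.h>
#include <string.h>

/* The string form of a 64-bit grain, as quoted in the paragraph above. */
static const char grain_str[] = "grain: 0xdeadbeef:deadbeef";

/* Number of characters in the string form, excluding the trailing NUL. */
static size_t grain_string_len(void)
{
	return strlen(grain_str);
}

/* Number of bytes needed for the binary form. */
static size_t grain_u64_size(void)
{
	return sizeof(uint64_t);
}
```

The string needs 26 characters before the terminating NUL (27 bytes with
it), against 8 bytes for the u64.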

Regards,
Mauro.