Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events

From: Mauro Carvalho Chehab
Date: Fri Jun 01 2012 - 08:16:03 EST


On 01-06-2012 06:10, Borislav Petkov wrote:
> On Thu, May 31, 2012 at 08:52:21PM +0000, Luck, Tony wrote:
>>> It could be very quiet (i.e., machine runs with no errors) and it could
>>> have bursts where it reports a large number of errors back-to-back
>>> depending on access patterns, DIMM health, temperature, sea level and at
>>> least a bunch more factors.
>>
>> Yes - the normal case is a few errors from stray neutrons ... perhaps
>> a few per month, maybe on a very big system a few per hour. When something
>> breaks, especially if it affects a wide range of memory addresses, then
>> you will see a storm of errors.
>
> IOW, when the sh*t hits the fan :-)
>
>>> So I can imagine buffers filling up suddenly and fast, and userspace
>>> having a hard time consuming them in a timely manner.
>>
>> But I'm wondering what agent is going to be reporting all these
>> errors. Intel has CMCI - so you can get a storm of interrupts
>> which would each generate a trace record ... but we are working
>> on a patch to turn off CMCI if a storm is detected.
>
> Yeah, about that. What are you guys doing about losing CECCs when
> throttling is on? I'm assuming there's no way around it.
>
>> AMD doesn't have CMCI, so errors just report from polling
>
> It does, look at <arch/x86/kernel/cpu/mcheck/mce_amd.c>. That's the error
> thresholding. We were talking about having an APIC interrupt fire at
> _every_ CECC but I don't know/haven't tested how the software would
> behave in such cases where the hw spits out an overwhelming amount of
> errors.
>
>> - and we have a
>> maximum poll rate which is quite low by trace standards (even
>> when multiplied by NR_CPUS).
>>
>> Will EDAC drivers loop over some chipset registers blasting
>> out huge numbers of trace records ... that seems just as bad
>> for system throughput as a CMCI storm. And just as useless.
>
> Why useless?
>
> I don't know but we need to be as slim as possible on the reporting side
> for future use cases like that.
>
> Also, we probably want to proactively do something about such storms
> like offline pages or disable some hardware components so that they
> subside.
>
> Switching to polling mode IMHO only cures the symptom but not the
> underlying cause.
>
>> General principle: If there are very few errors happening then it is
>> important to log every single one of them.
>
> Absolutely.
>
>> If there are so many that we can't keep up, then we must sample at
>> some level, and we might as well do that at generation point.
>
> Yes, and then take action to recover and stop the storm.

In this case, just saving one field won't help. What helps is to group
all similar events into one trace. So, the solution is to add an
error count field and let the EDAC core or the drivers group
similar events.

We can also save some bytes by using u8 instead of "int". We may also
represent the grain as a shift count, reducing it to 8 bits as well:

TP_STRUCT__entry(
	__field( u8, err_type )
	__string( msg, error_msg )
	__string( label, label )
	__field( u16, err_count )
	__field( u8, mc_index )
	__field( u8, top_layer )
	__field( u8, middle_layer )
	__field( u8, lower_layer )
	__field( long, address )
	__field( u8, grain_bits )
	__field( long, syndrome )
	__string( driver_detail, driver_detail )
),

Where
grain = 1 << grain_bits

Regards,
Mauro