Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

From: Borislav Petkov
Date: Thu Sep 17 2020 - 04:40:50 EST

On Thu, Sep 10, 2020 at 03:29:56PM +0000, Shiju Jose wrote:
> Ok. However the functions such as __find_elem() use
> memory specific PFN() and PAGE_SHIFT.

You can add your version find_elem_cpu() or so. You can do this with a
set of function pointers which belong to the different type of storage
the CEC needs, you can do all kinds of fun.

> I will check this. For CPU, the corrected errors count for a short
> time period to be checked. Thus old errors outside this period would
> not be considered and would be cleared. It is not clear to me whether
> in the current CEC, the count for the old errors outside a time period
> would be excluded for the threshold check or removed?

Currently, the CEC decays the errors each time do_spring_cleaning()
runs, by decrementing DECAY_BITS in the PFN record. Those which get
DECAY_BITS of 0, get overwritten when the data structure is full.

You can do something similar by halving the error count or something
more complex like save the error timestamp and eliminate...

You can't know what exactly you wanna do if you don't have a use case
you're trying to address.

> According to the ARM Processor CPER definition the error types
> reported are Cache Error, TLB Error, Bus Error and micro-architectural
> Error.

Bus error sounds like not even originating in the CPU but the CPU only
reporting it. Imagine if that really were the case, and you go disable
the CPU but the error source is still there. You've just disabled the
reporting of the error only and now you don't even know anymore that
you're getting errors.

> Few thoughts on this,
> 1. Not sure will a CPU core would work/perform as normal after disabling
> a functional unit?

You can disable parts of caches, etc, so that you can have a somewhat
functioning CPU until the replacement maintenance can take place.

> 2. Support in the HW to disable a function unit alone may not available.


> 3. If it is require to store and retrieve the error count based on
> functional unit, then CEC will become more complex?

Depends on how it is designed. That's why we're first talking about what
needs to be done exactly before going off and doing something.

> This requirement is the part of the early fault prediction by taking
> action when large number of corrected errors reported on a CPU core
> before it causing serious faults.

And do you know of actual real-life examples where this is really the
case? Do you have any users who report a large error count on ARM CPUs,
originating from the caches and that something like that would really

Because from my x86 CPUs limited experience, the cache arrays are mostly
fine and errors reported there are not something that happens very
frequently so we don't even need to collect and count those.

So is this something which you need to have in order to check a box
somewhere that there is some functionality or is there an actual
real-life use case behind it which a customer has requested?

> We are mainly looking for disable CPU core on large number of L1/L2
> cache corrected errors reported on a CPU core. Can we add atleast
> removing CPU core for the CPU cache corrected errors filtering out
> other error types?

See above.