Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core
From: James Morse
Date: Thu Oct 01 2020 - 13:16:12 EST
Hi guys,
On 17/09/2020 09:40, Borislav Petkov wrote:
> On Thu, Sep 10, 2020 at 03:29:56PM +0000, Shiju Jose wrote:
> You can't know what exactly you wanna do if you don't have a use case
> you're trying to address.
>
>> According to the ARM Processor CPER definition the error types
>> reported are Cache Error, TLB Error, Bus Error and micro-architectural
>> Error.
>
> Bus error sounds like not even originating in the CPU but the CPU only
> reporting it. Imagine if that really were the case, and you go disable
> the CPU but the error source is still there. You've just disabled the
> reporting of the error only and now you don't even know anymore that
> you're getting errors.
>
>> Few thoughts on this,
>> 1. Not sure will a CPU core would work/perform as normal after disabling
>> a functional unit?
>
> You can disable parts of caches, etc, so that you can have a somewhat
> functioning CPU until the replacement maintenance can take place.
This is implementation-specific stuff that only firmware can do...
>> 2. Support in the HW to disable a function unit alone may not available.
>
> Yes.
>
>> 3. If it is require to store and retrieve the error count based on
>> functional unit, then CEC will become more complex?
>
> Depends on how it is designed. That's why we're first talking about what
> needs to be done exactly before going off and doing something.
>
>> This requirement is the part of the early fault prediction by taking
>> action when large number of corrected errors reported on a CPU core
>> before it causing serious faults.
>
> And do you know of actual real-life examples where this is really the
> case? Do you have any users who report a large error count on ARM CPUs,
> originating from the caches and that something like that would really
> help?
>
> Because from my x86 CPUs limited experience, the cache arrays are mostly
> fine and errors reported there are not something that happens very
> frequently so we don't even need to collect and count those.
>
> So is this something which you need to have in order to check a box
> somewhere that there is some functionality or is there an actual
> real-life use case behind it which a customer has requested?
If the corrected-count is available somewhere, can't this policy be made in user-space?
Thanks,
James