Re: [PATCH 1/1] RAS: Add CPU Correctable Error Collector to isolate an erroneous CPU core

From: Borislav Petkov
Date: Wed Sep 09 2020 - 08:21:07 EST


On Tue, Sep 01, 2020 at 04:20:54PM +0000, Shiju Jose wrote:
> CPU CEC derived the infrastructure of the CEC only and the logic
> used in the CEC for CE count storage, CE count calculation and page
> isolation is very unique for the memory pages, which seems cannot be
> reusable for the CPU CEs.

Oh, because it saves the reported error's PFN and you want to save

[CPU num | error count]

?

Well, you can easily change that by extending the existing CEC to have a
different storage format for CPU errors, i.e., use a different ce_array
which gets passed to the functions anyway.

> Also the values set for the parameters such as threshold, time period
> for the memory errors and CPU errors would be different.

And your implementation with sliding windows is so totally different
that it warrants the duplication of the code? I don't think so.

You can use the current CEC to do exactly what you wanna do, with the
decaying and so on.

Because all you wanna do is count the errors a CPU triggered.

However, a CPU can trigger a *lot* of different types of errors.
You're putting them all in the same basket by doing:

else if (guid_equal(sec_type, &CPER_SEC_PROC_ARM))
/* add to CEC */

and only for correctable.

What type of errors get reported in CPER_SEC_PROC_ARM?

If they're all lumped together and if some functional unit generates a
lot of errors, instead of disabling that unit only, you'll go and remove
the whole CPU?

Doesn't make a whole lot of sense to me.

How about you define what exactly you're trying to solve, maybe give an
example of a real issue someone is encountering and you're trying to
address? Because there was never a necessity so far to disable CPUs on
x86 due to correctable errors. Why is that needed on ARM?

> Thus extending cec.c to support CPU CEs would include adding CPU CEC
> specific code for storing error count, isolation etc which I thought
> would result the code less tidy and less readable unless find more
> reusable logic.

Depends on how you design it.

But with what I'm seeing so far, I'm still sceptical this is needed at
all.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette