Re: [RFC PATCH 0/7] RAS/CEC: Extend CEC for errors count check on short time period

From: Borislav Petkov
Date: Fri Oct 02 2020 - 14:02:30 EST


On Fri, Oct 02, 2020 at 06:33:17PM +0100, James Morse wrote:
> > I think adding the CPU error collection to the kernel
> > has the following advantages,
> > 1. The CPU error collection and isolation would not be active if
> > rasdaemon stopped running or was not running on a machine.

Wasn't there this thing called systemd which promised that it would
restart daemons when they fail? And even if it is not there, you can
always set up your own cronjob which checks whether rasdaemon is
running, restarts it if it has died, and mails the admin so they can
check why it died.
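
A minimal sketch of such a check, meant to be run periodically from
cron. It is only an illustration under stated assumptions: the unit name
"rasdaemon", the helper names, and the print-based notification are all
hypothetical placeholders, and a real setup would mail the admin instead
of printing. On a systemd machine, a Restart=on-failure setting in the
service unit would likely make the script unnecessary.

```python
import subprocess
from typing import Callable

# Assumption: the daemon is managed as a systemd unit with this name.
SERVICE = "rasdaemon"


def systemd_is_active(service: str = SERVICE) -> bool:
    """Return True if systemd reports the unit as active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", service])
    return result.returncode == 0


def systemd_restart(service: str = SERVICE) -> bool:
    """Ask systemd to restart the unit; return True on success."""
    result = subprocess.run(["systemctl", "restart", service])
    return result.returncode == 0


def print_notify(message: str) -> None:
    """Placeholder notification; a real setup would mail the admin."""
    print(message)


def check_and_restart(
    is_active: Callable[[], bool],
    restart: Callable[[], bool],
    notify: Callable[[str], None],
) -> str:
    """Check the daemon once and restart it if it is not running.

    Returns "running", "restarted" or "failed". The admin is notified
    whenever the daemon was found dead, so the cause can be looked into.
    """
    if is_active():
        return "running"
    if restart():
        notify(f"{SERVICE} was not running and has been restarted")
        return "restarted"
    notify(f"{SERVICE} was not running and could not be restarted")
    return "failed"


def main() -> None:
    """Entry point for the cron job: one check per invocation."""
    check_and_restart(systemd_is_active, systemd_restart, print_notify)
```

The check logic takes the probe, restart and notify steps as callables,
so it can be exercised without systemd present; main() wires in the
systemctl-based helpers and is what the cron entry would call.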

Everything else I've trimmed but James has put it a lot more eloquently
than me and I cannot agree more with what he says. Doing this in
userspace is better in every aspect you can think of.

The current CEC thing runs in the kernel because it has a completely
different purpose - to limit corrected error reports which turn into
very expensive support calls for errors which were corrected but people
simply don't get that they were corrected. Instead, they throw their
hands in the air and go "OMG, my hardware is failing".

Where those are, as James says:

> These are corrected errors. Nothing has gone wrong.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette