Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

From: Borislav Petkov
Date: Sat Apr 20 2019 - 05:41:37 EST


On Fri, Apr 19, 2019 at 08:04:01AM -0700, Luck, Tony wrote:
> Now there isn't really anything better that CEC can do in
> this situation. It won't help to have a bigger array. Taking
> pages offline wouldn't solve the problem (though if that
> did happen at least it would break the silence).
>
> Same situation for other DRAM failure modes that affect a
> wide range of pages (rank, bank, perhaps row ... though all
> the errors from a single row failure might fit in the CEC array).
>
> Allowing the user to bypass CEC (without a reboot ... cloud folks
> hate to reboot their systems) would allow the sysadmin to see
> what is happening (either via /dev/mcelog, or via EDAC driver).

Err, this all sounds to me like the storm detection code should
*automatically* disable the CEC in such cases, I'd say. Because I
don't see a cloud admin going into the debugfs and turning it off.
Rather, if the detection heuristic we use is smart enough, disabling it
automatically should be a much better serviceability action.

Hmmm?

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.