Re: [PATCH] RAS/CEC: Add debugfs switch to disable at run time

From: Cong Wang
Date: Thu Apr 18 2019 - 19:58:37 EST


On Thu, Apr 18, 2019 at 4:29 PM Borislav Petkov <bp@xxxxxxxxx> wrote:
>
> On Thu, Apr 18, 2019 at 03:51:07PM -0700, Cong Wang wrote:
> > On Thu, Apr 18, 2019 at 3:02 PM Tony Luck <tony.luck@xxxxxxxxx> wrote:
> > >
> > > Useful when running error injection tests that want to
> > > see all of the MCi_(STATUS|ADDR|MISC) data via /dev/mcelog.
> > >
> > > Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
> >
> > We saw the same problem, CONFIG_RAS hijacks all the
> > correctable memory errors, which leaves mcelog "broken"
> > silently. I know it is arguable, but until we can switch from
> > mcelog to rasdaemon, mcelog should still work as before.
>
> It is "arguable" because this is not how the CEC is supposed to be used.

No, it is all about whether we should break users' expectation.

>
> If you want to collect errors with mcelog, you don't use the CEC at all.
> And there's ras=cec_disable for that or you simply don't enable it in
> your .config.
>
> As Tony says in the commit message, the enable should be used only for
> injection tests. Which is where that thing should only be used for -
> debugging the CEC itself.

This doesn't sounds like a valid reason for us to break users'
expectation.

Prior to CONFIG_RAS, mcelog just works fine for users (at least Intel
users). Suddenly after enabling CONFIG_RAS in kernel, mcelog will
no longer receive any correctable memory errors _silently_. What's
more, we don't even have rasdaemon running in our system, so there is
no consumer of RAS CEC, these errors just simply disappear from
users' expected place.

I know CONFIG_RAS is new feature supposed to replace MCELOG,
but they can co-exist in kernel config, which means mcelog should
continue to work as before until it gets fully replaced.

Even the following PoC change could make this situation better,
because with this change when we enable CONFIG_RAS,mcelog
will break _loudly_ rather than just silently, users will notice mcelog
is no longer supported and will look for its alternative choice.

diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index b834ff555188..f2e2b75fffbe 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -1,5 +1,6 @@
menuconfig RAS
bool "Reliability, Availability and Serviceability (RAS) features"
+ depends on !X86_MCELOG_LEGACY
help
Reliability, availability and serviceability (RAS) is a computer
hardware engineering term. Computers designed with higher levels


Just my 2 cents.

Thanks.