Re: [patch 01/10] x86/fpu/signal: Clarify exception handling in restore_fpregs_from_user()

From: Borislav Petkov
Date: Tue Aug 31 2021 - 03:39:07 EST


On Tue, Aug 31, 2021 at 02:34:16AM +0200, Thomas Gleixner wrote:
> what's worse is that even if you have access to such a machine, there is
> no documented way to do proper hardware based error injection.

Oh brother, welcome to my nightmare :) How much time do you have? We
haz cookies.

> The injection mechanism which claims to do hardware error injection in
> arch/x86/kernel/cpu/mce/inject.c is a farce:

No no, that's an *attempt* to have something which at least works on
the arch level, without having other "agents" involved. Just keep on
readin'...

> All it does is to "prepare" the MSRs with some fake error values and
> raising #MC via int 18 afterwards in the hope that the previously
> prepared MSR values are still valid.

What do you mean? Something might swoop in and overwrite them before the
INT? Bah, we can do some locking but it is not worth it.

> Great way to test stuff by setting the MSR to the expected failure
> value and then raising the exception in software.

No no, the great way to do error injection is the ACPI-spec'ed,
firmware-implemented

drivers/acpi/apei/einj.c

Yap, you heard me right, firmware. And when you hear firmware, you can
imagine how it all works in practice... Yeap, exactly.

We even wrote documentation on what to do:

Documentation/firmware-guide/acpi/apei/einj.rst

But but, this is firmware so

- it is f*cking broken in all ways imaginable

- if it works, it doesn't support the error type which you wanna inject

- if it does, enterprise sh*t hw has added value crap which analyzes and
looks at hardware errors first </me rolls eyes, trying to remain serious>
so you might get the error report if you get lucky.

So right now wrt RAS my approach is: don't let it get worse than it
is. Yap, that's called maintainer resignation.

And all those hw vendors can come at me with the fanciest feature ideas
- my reply is: you wanted to do it all in the BIOS. Go do that there
too.

> NHM had a documented mechanism to inject at least ECC failures at the
> hardware level, but with the later memory controllers this ended up in
> the documentation black hole along with all the other undocumented real
> HW injection mechanisms which allow actual testing of this stuff.
>
> The HW injection mechanisms definitely exist, but without documentation
> they are useless. Intel still thinks that the secrecy around that stuff
> is valuable and they can get away with those untestable mechanisms even
> for their endeavours in the safety critical space.

My impression of hw people and error injection is that it's just like
what they do with perf counters: it counts *something*, right? You should
be happy that it does.

So yeah, hw error injection and RAS in general is a stinking pile of
doodoo. If I had known that back then, I would've steered away from it.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette