Re: [PATCH] acpi/ghes: Make ghes_panic_timeout adjustable as a parameter

From: Huang, Ying
Date: Mon Dec 30 2024 - 06:05:21 EST


Borislav Petkov <bp@xxxxxxxxx> writes:

> On Mon, Dec 30, 2024 at 01:54:36PM +0800, Huang, Ying wrote:
>> For example, it may be OK to wait forever for a software error, but it may
>> be better to reboot the system to contain the influence of the hardware
>> error for some hardware errors.
>
> A default panic timeout of 30 seconds for hw errors?! You do realize that 30
> seconds for a machine is an eternity and by that time your hardware error has
> long propagated and corrupted results, right?
>
> So your timeout is not even trying to do what you want.
>
> So unless I'm missing something, this ghes timeout needs to go - if you want
> to "contain the influence" you need to panic *immediately*! And not even that
> would help in some cases - hw has its own protections there so the OS
> panicking is meh. At least on x86, that is.

OK. 30 seconds isn't good enough for hw errors.

Another possible benefit of ghes_panic_timeout is that rebooting,
instead of waiting forever, lets us log and report the hardware errors
earlier.

For example, the hardware errors could be logged in some simple
non-volatile storage (such as EFI variables) during panic or kdump, etc.
Then, after reboot, the new kernel could report the hardware errors in
some way.
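
As a rough illustration of that flow, here is a toy user-space model in
Python. It is only a sketch, not kernel code: the function names, the
record fields, and the JSON file standing in for non-volatile storage
are all hypothetical. It shows just the ordering: save the record on
the panic path, then read and clear it on the next boot.

```python
import json
import os
import tempfile


def save_error_on_panic(store_path, error):
    """Panic path: append one error record to the persistent store."""
    records = []
    if os.path.exists(store_path):
        with open(store_path) as f:
            records = json.load(f)
    records.append(error)
    with open(store_path, "w") as f:
        json.dump(records, f)


def report_errors_after_reboot(store_path):
    """Boot path: return any saved records, then clear the store."""
    if not os.path.exists(store_path):
        return []
    with open(store_path) as f:
        records = json.load(f)
    os.remove(store_path)
    return records


# Hypothetical stand-in for non-volatile storage such as EFI variables.
STORE = os.path.join(tempfile.mkdtemp(), "hw_errors.json")

# "Old kernel": a fatal hardware error is recorded just before reboot.
save_error_on_panic(STORE, {"source": "GHES", "severity": "fatal"})

# "New kernel": the saved record is picked up and reported once.
reported = report_errors_after_reboot(STORE)
print(reported)
```

The point of the model is that the report only happens after the
reboot, so the sooner the machine reboots, the sooner the error is
reported.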

>> So, we introduced another knob for that.
>
> No, that another knob is piling more of the silly ontop.

---
Best Regards,
Huang, Ying