Re: [PATCH RESEND v2] x86/mce: Set PG_hwpoison page flag to avoid the capture kernel panic

From: Zhiquan Li
Date: Mon Oct 09 2023 - 20:38:12 EST



On 2023/10/3 03:06, Ingo Molnar wrote:
> The English in this commit is *atrocious*, both in the changelog and in
> the comments - how on Earth did 'Posion' typo and half a dozen other
> typos and bad grammar survive ~3 iterations and a Reviewed-by tag?? The
> version below fixes up the worst, but I suspect that's not the only
> problem with this patch...

Many thanks for your attention and the fix-ups, Ingo.

I’d like to provide more background on this patch.

Memory errors don’t happen very often, especially fatal ones. However,
in large-scale deployments such as data centers, they still occur. For
some fatal MCE cases, the kernel calls mce_panic() to terminate the
production kernel directly instead of trying to survive via
memory_failure() handling. Unfortunately, the capture kernel will panic
for the same reason if it touches the faulty memory again. As a result,
only an incomplete vmcore is left for sustaining engineers, which makes
it very hard for them to work out what happened.


We considered three solutions and finally chose the last one.

1. When the capture kernel boots up, rescan the MCE banks to check for
fatal errors and set the PG_hwpoison flag for each error page.
We can foresee that this solution is heavy: it needs to find the
struct page of each error page in old memory and set the flag. It
looks like we would have to reinvent the wheel, so we gave it up.

2. Replace copy_to_iter() in __copy_oldmem_page() with
_copy_mc_to_iter(), which is a #MC-safe version.
This solution is lightweight but has the following drawbacks:

1) Such issues are quite rare events; we don’t want to switch to a
#MC-safe copy just to accommodate them, especially if the problem
can be dealt with by MCE handling rather than by touching the Kdump
code.

2) The #MC-safe copy is conditional: whether it can recover from the
#MC depends on whether MCE handling reaches fixup_exception() in
do_machine_check(). However, in the fatal error case, it might
invoke mce_panic() and crash the capture kernel before the error is
fixed up.

3. The solution in this patch overcomes all of the above drawbacks. It
sets the flag just before the production kernel calls panic(), which
introduces no additional overhead in the capture kernel and does not
conflict with other hwpoison-related code in the production kernel.
Furthermore, it leverages existing mechanisms as much as possible, so
the code changes are also lightweight.
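
To make the idea behind solution 3 concrete, here is a small userspace
model in C. It is only a sketch, not kernel code and not the patch
itself: every sim_* name is made up for illustration, and it assumes
that the dump path in the capture kernel skips any page whose
PG_hwpoison flag is set. Touching an unflagged error page stands in for
the second #MC that kills the capture kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SIM_NR_PAGES    8
#define SIM_PG_HWPOISON (1u << 0)   /* stand-in for PG_hwpoison */

/* Hypothetical, simplified stand-in for struct page. */
struct sim_page {
    unsigned int flags;
    bool hw_error;      /* true if reading this page would raise #MC */
};

static struct sim_page sim_mem[SIM_NR_PAGES];

/*
 * Production kernel, fatal path: the page has a hardware error. With
 * set_flag == true the page is also marked poisoned right before the
 * (simulated) panic, which is the approach of solution 3.
 */
static void sim_mce_panic(size_t bad_pfn, bool set_flag)
{
    assert(bad_pfn < SIM_NR_PAGES);
    sim_mem[bad_pfn].hw_error = true;
    if (set_flag)
        sim_mem[bad_pfn].flags |= SIM_PG_HWPOISON;
}

/*
 * Capture kernel: copy old memory page by page. Assumption: pages
 * marked poisoned are skipped. Returns the number of pages copied, or
 * -1 if an unmarked error page was touched (capture kernel "panics").
 */
static int sim_dump_vmcore(void)
{
    int copied = 0;

    for (size_t pfn = 0; pfn < SIM_NR_PAGES; pfn++) {
        if (sim_mem[pfn].flags & SIM_PG_HWPOISON)
            continue;           /* skip the known-bad page */
        if (sim_mem[pfn].hw_error)
            return -1;          /* second #MC, vmcore is incomplete */
        copied++;
    }
    return copied;
}

/* Reset the model, crash on bad_pfn, then try to dump. */
static int sim_run(size_t bad_pfn, bool set_flag)
{
    memset(sim_mem, 0, sizeof(sim_mem));
    sim_mce_panic(bad_pfn, set_flag);
    return sim_dump_vmcore();
}
```

In this model, sim_run(3, false) returns -1 (the dump hits the error
page), while sim_run(3, true) returns SIM_NR_PAGES - 1 (every page
except the poisoned one is saved).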


Verifying the fix is not difficult. The issue can be reproduced with
the "copyout" test case of ras-tools
(https://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git).
It injects a fatal memory error in kernel space via the APEI EINJ
interface (hardware platform support is required) and then touches the
error page to trigger the issue. The patch has been validated with this
tool.

Any ideas are welcome!

Best Regards,
Zhiquan