RE: [PATCH] x86/mce: Add workaround for SKX/CLX/CPX spurious machine checks
From: Luck, Tony
Date: Wed Feb 16 2022 - 13:42:08 EST
> Well, we could try to decode the instructions around rIP when the #MC
> is raised and see what caused the MCE and perhaps pick apart which insn
> caused it, is it accessing behind the buffer boundaries, etc.
Is this a case of "perfect is the enemy of good enough"?
It is a rare scenario (only a pain point for Jue because Google has billions and billions
of cores running this code). You need:
1) An uncorrected error
2) That error must be in the first cache line of a page
3) The kernel must execute page_copy from the page immediately before that page
When all three happen, the kernel crashes because we don't
have a recovery path from kernel page_copy.
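To make the three conditions concrete, here is a rough user-space sketch of the address check. It is not kernel code and not what the patch does. The 4 KiB page size, the 64-byte cache line, and every name in it are assumptions for illustration only, and src_page is assumed to be page-aligned.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration: 4 KiB pages, 64-byte cache lines. */
#define SKETCH_PAGE_SIZE   4096ULL
#define SKETCH_CACHE_LINE  64ULL

/* Condition 2: the poisoned address lies in the first cache line of its page. */
static bool in_first_cache_line(uint64_t err_addr)
{
	return (err_addr & (SKETCH_PAGE_SIZE - 1)) < SKETCH_CACHE_LINE;
}

/* Condition 3: the copy source is the page immediately before the poisoned page. */
static bool src_is_preceding_page(uint64_t src_page, uint64_t err_addr)
{
	uint64_t err_page = err_addr & ~(SKETCH_PAGE_SIZE - 1);

	/* Page 0 has no preceding page, so guard against underflow. */
	return err_page >= SKETCH_PAGE_SIZE &&
	       src_page == err_page - SKETCH_PAGE_SIZE;
}

/* All three conditions together; 'uncorrected' stands in for condition 1. */
static bool spurious_mce_scenario(bool uncorrected, uint64_t src_page,
				  uint64_t err_addr)
{
	return uncorrected &&
	       in_first_cache_line(err_addr) &&
	       src_is_preceding_page(src_page, err_addr);
}
```

With these assumed sizes, a copy from the page at 0x1000 with an uncorrected error at 0x2000 matches the scenario, while an error at 0x2040 (second cache line) does not.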
-Tony