[PATCH v2] x86/mm: Change so poison pages are either unmapped or marked uncacheable

From: Luck, Tony
Date: Wed May 20 2020 - 12:35:53 EST


An interesting thing happened when a guest Linux instance took
a machine check. The VMM unmapped the bad page from guest physical
space and passed the machine check to the guest.

Linux took all the normal actions to offline the page from the process
that was using it. But then guest Linux crashed because it said there
was a second machine check inside the kernel with this stack trace:

do_memory_failure
set_mce_nospec
set_memory_uc
_set_memory_uc
change_page_attr_set_clr
cpa_flush
clflush_cache_range_opt

This was odd, because a CLFLUSH instruction shouldn't raise a machine
check (it isn't consuming the data). Further investigation showed that
the VMM had passed in another machine check because is appeared that the
guest was accessing the bad page.

Fix is to check the scope of the poison by checking the MCi_MISC register.
If the entire page is affected, then unmap the page. If only part of the
page is affected, then mark the page as uncacheable.

This assumes that VMMs will do the logical thing and pass in the "whole
page scope" via the MCi_MISC register (since they unmapped the entire
page).

Reported-by: Jue Wang <juew@xxxxxxxxxx>
Tested-by: Jue Wang <juew@xxxxxxxxxx>
Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
Cc: <stable@xxxxxxxxxxxxxxx>
Signed-off-by: Tony Luck <tony.luck@xxxxxxxxx>
---
Boris wrote:
> Maybe I'm not making myself clear enough here: I'm talking about using
> a *special* MCE signature which says "your mapping has disappeared from
> underneath you." Maybe a bit in MCi_MISC which the hw doesn't use. Or
> some other deprecated bit, whatever.

My reactions (in order)
1) How can I do that for 10 year old h/w?
2) Oh wait, Boris is talking about the VMM faking a signature to the guest.
They can mess with whatever bits they want.
3) Unused/deprecated bits are in short supply. Taking one over and getting
all the VMM vendors to agree to this "standard" is going to be a huge effort.
4) Could we do something inside the existing architecture?
5) MCi_MISC uses low order bits to indicate the "blast radius" of an error
6) Asks Jue "Could you set those MCi_MISC bits to say the whole page is gone?
7) Jue: We already do that!
8) Me: writes this patch

arch/x86/include/asm/set_memory.h | 19 +++++++++++++------
arch/x86/kernel/cpu/mce/core.c | 11 +++++++++--
include/linux/set_memory.h | 2 +-
3 files changed, 23 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/set_memory.h b/arch/x86/include/asm/set_memory.h
index ec2c0a094b5d..5948218f35c5 100644
--- a/arch/x86/include/asm/set_memory.h
+++ b/arch/x86/include/asm/set_memory.h
@@ -86,28 +86,35 @@ int set_direct_map_default_noflush(struct page *page);
extern int kernel_set_to_readonly;

#ifdef CONFIG_X86_64
-static inline int set_mce_nospec(unsigned long pfn)
+/*
+ * Prevent speculative access to the page by either unmapping
+ * it (if we do not require access to any part of the page) or
+ * marking it uncacheable (if we want to try to retrieve data
+ * from non-poisoned lines in the page).
+ */
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
unsigned long decoy_addr;
int rc;

/*
- * Mark the linear address as UC to make sure we don't log more
- * errors because of speculative access to the page.
* We would like to just call:
- * set_memory_uc((unsigned long)pfn_to_kaddr(pfn), 1);
+ * set_memory_XX((unsigned long)pfn_to_kaddr(pfn), 1);
* but doing that would radically increase the odds of a
* speculative access to the poison page because we'd have
* the virtual address of the kernel 1:1 mapping sitting
* around in registers.
* Instead we get tricky. We create a non-canonical address
* that looks just like the one we want, but has bit 63 flipped.
- * This relies on set_memory_uc() properly sanitizing any __pa()
+ * This relies on set_memory_XX() properly sanitizing any __pa()
* results with __PHYSICAL_MASK or PTE_PFN_MASK.
*/
decoy_addr = (pfn << PAGE_SHIFT) + (PAGE_OFFSET ^ BIT(63));

- rc = set_memory_uc(decoy_addr, 1);
+ if (unmap)
+ rc = set_memory_np(decoy_addr, 1);
+ else
+ rc = set_memory_uc(decoy_addr, 1);
if (rc)
pr_warn("Could not invalidate pfn=0x%lx from 1:1 map\n", pfn);
return rc;
diff --git a/arch/x86/kernel/cpu/mce/core.c b/arch/x86/kernel/cpu/mce/core.c
index 54165f3569e8..c1a480a27164 100644
--- a/arch/x86/kernel/cpu/mce/core.c
+++ b/arch/x86/kernel/cpu/mce/core.c
@@ -529,6 +529,13 @@ bool mce_is_memory_error(struct mce *m)
}
EXPORT_SYMBOL_GPL(mce_is_memory_error);

+static bool whole_page(struct mce *m)
+{
+ if (!mca_cfg.ser || !(m->status & MCI_STATUS_MISCV))
+ return true;
+ return MCI_MISC_ADDR_LSB(m->misc) >= PAGE_SHIFT;
+}
+
bool mce_is_correctable(struct mce *m)
{
if (m->cpuvendor == X86_VENDOR_AMD && m->status & MCI_STATUS_DEFERRED)
@@ -600,7 +607,7 @@ static int uc_decode_notifier(struct notifier_block *nb, unsigned long val,

pfn = mce->addr >> PAGE_SHIFT;
if (!memory_failure(pfn, 0))
- set_mce_nospec(pfn);
+ set_mce_nospec(pfn, whole_page(mce));

return NOTIFY_OK;
}
@@ -1098,7 +1105,7 @@ static int do_memory_failure(struct mce *m)
if (ret)
pr_err("Memory error not recovered");
else
- set_mce_nospec(m->addr >> PAGE_SHIFT);
+ set_mce_nospec(m->addr >> PAGE_SHIFT, whole_page(m));
return ret;
}

diff --git a/include/linux/set_memory.h b/include/linux/set_memory.h
index 86281ac7c305..860e0f843c12 100644
--- a/include/linux/set_memory.h
+++ b/include/linux/set_memory.h
@@ -26,7 +26,7 @@ static inline int set_direct_map_default_noflush(struct page *page)
#endif

#ifndef set_mce_nospec
-static inline int set_mce_nospec(unsigned long pfn)
+static inline int set_mce_nospec(unsigned long pfn, bool unmap)
{
return 0;
}
--
2.21.1