Re: [PATCH v2 5/7] x86/mm, tracing: Fix CR2 corruption

From: Linus Torvalds
Date: Sat Jul 06 2019 - 20:08:52 EST


On Sat, Jul 6, 2019 at 3:41 PM Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Sat, Jul 6, 2019 at 3:27 PM Steven Rostedt <rostedt@xxxxxxxxxxx> wrote:
> >
> > We also have to deal with reading vmalloc'd data as that can fault too.
>
> Ahh, that may be a better reason for PeterZ's patches and reading cr2
> very early from asm code than the stack trace case.

Hmm. Another alternative might be to simply just make our vmalloc page
fault handling more robust.

Right now, if we take a vmalloc page fault in an inconvenient spot, it
is fatal because it corrupts the cr2 in the outer context.

However, it doesn't *need* to be fatal. Who cares if the outer context
cr2 gets corrupted? We probably *shouldn't* care - it's an odd and
unusual case, and the outer context could just handle the wrong
vmalloc-address cr2 fine (it's going to be a no-op, since the inner
page fault will have handled it already), return, and then re-fault.

The only reason it's fatal right now is that we care much too deeply about

(a) the error code
(b) the pt_regs state

when we handle vmalloc faults.

So one option is that we simply handle the vmalloc faults _without_
caring about the error code and the pt_regs state, because even if the
error code or the pt_regs implies that the fault comes from user
space, the cr2 value might be due to a vmalloc fault from the inner
kernel page fault that corrupted cr2.

Right now vmalloc faults are already special and need to be handled
without holding any locks etc. We'd just make them even more special,
and say that we might have a vmalloc area fault from any context.

IOW, somethinig like the attached patch would make nesting vmalloc
faults harmless. Sure, we'll do the "vmalloc fault" twice, and return
and re-do the original page fault, but this is a very unusual case, so
from a performance angle we don't really care.

But I guess the "read cr2 early" is fine too..

Linus
arch/x86/mm/fault.c | 32 +++++++++-----------------------
1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 46df4c6aae46..3a03504bc624 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1245,6 +1245,15 @@ static void
do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
unsigned long address)
{
+ /*
+ * The kernel vmalloc area can fault in at any time, and
+ * we should not check the hw error code, since the cr2 value
+ * could be a stale one from a nested vmalloc fault, but the
+ * error code got pushed by hardware.
+ */
+ if (vmalloc_fault(address) >= 0)
+ return;
+
/*
* Protection keys exceptions only happen on user pages. We
* have no user pages in the kernel portion of the address
@@ -1252,29 +1261,6 @@ do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
*/
WARN_ON_ONCE(hw_error_code & X86_PF_PK);

- /*
- * We can fault-in kernel-space virtual memory on-demand. The
- * 'reference' page table is init_mm.pgd.
- *
- * NOTE! We MUST NOT take any locks for this case. We may
- * be in an interrupt or a critical region, and should
- * only copy the information from the master page table,
- * nothing more.
- *
- * Before doing this on-demand faulting, ensure that the
- * fault is not any of the following:
- * 1. A fault on a PTE with a reserved bit set.
- * 2. A fault caused by a user-mode access. (Do not demand-
- * fault kernel memory due to user-mode accesses).
- * 3. A fault caused by a page-level protection violation.
- * (A demand fault would be on a non-present page which
- * would have X86_PF_PROT==0).
- */
- if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
- if (vmalloc_fault(address) >= 0)
- return;
- }
-
/* Was the fault spurious, caused by lazy TLB invalidation? */
if (spurious_kernel_fault(hw_error_code, address))
return;