Re: [RFC][PATCH 2/8] x86/mm: break out kernel address space handling

From: Andy Lutomirski
Date: Fri Sep 07 2018 - 18:21:49 EST




> On Sep 7, 2018, at 12:48 PM, Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx> wrote:
>
>
> From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>
> The page fault handler (__do_page_fault()) basically has two sections:
> one for handling faults in the kernel portion of the address space
> and another for faults in the user portion of the address space.
>
> But, these two parts don't stick out that well. Let's make that
> clearer through code separation and naming. Pull kernel fault
> handling into its own helper, and reflect that naming by renaming
> spurious_fault() -> spurious_kernel_fault().
>
> Also, rewrite the vmalloc handling comment. It was a bit stale and
> glossed over the reserved bit handling.
>
> Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Cc: Sean Christopherson <sean.j.christopherson@xxxxxxxxx>
> Cc: "Peter Zijlstra (Intel)" <peterz@xxxxxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: x86@xxxxxxxxxx
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> ---
>
> b/arch/x86/mm/fault.c | 98 ++++++++++++++++++++++++++++++--------------------
> 1 file changed, 59 insertions(+), 39 deletions(-)
>
> diff -puN arch/x86/mm/fault.c~pkeys-fault-warnings-00 arch/x86/mm/fault.c
> --- a/arch/x86/mm/fault.c~pkeys-fault-warnings-00 2018-09-07 11:21:46.145751902 -0700
> +++ b/arch/x86/mm/fault.c 2018-09-07 11:23:37.643751624 -0700
> @@ -1033,7 +1033,7 @@ mm_fault_error(struct pt_regs *regs, uns
> }
> }
>
> -static int spurious_fault_check(unsigned long error_code, pte_t *pte)
> +static int spurious_kernel_fault_check(unsigned long error_code, pte_t *pte)
> {
> if ((error_code & X86_PF_WRITE) && !pte_write(*pte))
> return 0;
> @@ -1072,7 +1072,7 @@ static int spurious_fault_check(unsigned
> * (Optional Invalidation).
> */
> static noinline int
> -spurious_fault(unsigned long error_code, unsigned long address)
> +spurious_kernel_fault(unsigned long error_code, unsigned long address)
> {
> pgd_t *pgd;
> p4d_t *p4d;
> @@ -1103,27 +1103,27 @@ spurious_fault(unsigned long error_code,
> return 0;
>
> if (p4d_large(*p4d))
> - return spurious_fault_check(error_code, (pte_t *) p4d);
> + return spurious_kernel_fault_check(error_code, (pte_t *) p4d);
>
> pud = pud_offset(p4d, address);
> if (!pud_present(*pud))
> return 0;
>
> if (pud_large(*pud))
> - return spurious_fault_check(error_code, (pte_t *) pud);
> + return spurious_kernel_fault_check(error_code, (pte_t *) pud);
>
> pmd = pmd_offset(pud, address);
> if (!pmd_present(*pmd))
> return 0;
>
> if (pmd_large(*pmd))
> - return spurious_fault_check(error_code, (pte_t *) pmd);
> + return spurious_kernel_fault_check(error_code, (pte_t *) pmd);
>
> pte = pte_offset_kernel(pmd, address);
> if (!pte_present(*pte))
> return 0;
>
> - ret = spurious_fault_check(error_code, pte);
> + ret = spurious_kernel_fault_check(error_code, pte);
> if (!ret)
> return 0;
>
> @@ -1131,12 +1131,12 @@ spurious_fault(unsigned long error_code,
> * Make sure we have permissions in PMD.
> * If not, then there's a bug in the page tables:
> */
> - ret = spurious_fault_check(error_code, (pte_t *) pmd);
> + ret = spurious_kernel_fault_check(error_code, (pte_t *) pmd);
> WARN_ONCE(!ret, "PMD has incorrect permission bits\n");
>
> return ret;
> }
> -NOKPROBE_SYMBOL(spurious_fault);
> +NOKPROBE_SYMBOL(spurious_kernel_fault);
>
> int show_unhandled_signals = 1;
>
> @@ -1203,6 +1203,55 @@ static inline bool smap_violation(int er
> return true;
> }
>
> +static void
> +do_kern_addr_space_fault(struct pt_regs *regs, unsigned long hw_error_code,
> + unsigned long address)
> +{

Can you add a comment above this documenting *when* it's called? Is it all faults, !user_mode faults, or !PF_USER?
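For reference, here's a minimal sketch of the call site I'm assuming, i.e.
__do_page_fault() dispatching purely on the faulting address via
fault_in_kernel_space() (the user-side helper name below is hypothetical,
it's not taken from this series):

	static noinline void
	__do_page_fault(struct pt_regs *regs, unsigned long hw_error_code,
			unsigned long address)
	{
		/*
		 * Faults on kernel addresses go to the new helper,
		 * regardless of X86_PF_USER / user_mode(regs):
		 */
		if (unlikely(fault_in_kernel_space(address))) {
			do_kern_addr_space_fault(regs, hw_error_code, address);
			return;
		}

		/* Everything else is a fault on a user address: */
		do_user_addr_space_fault(regs, hw_error_code, address);
	}

If that's the intent, saying so explicitly above do_kern_addr_space_fault()
would answer the question for future readers.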

> + /*
> + * We can fault-in kernel-space virtual memory on-demand. The
> + * 'reference' page table is init_mm.pgd.
> + *
> + * NOTE! We MUST NOT take any locks for this case. We may
> + * be in an interrupt or a critical region, and should
> + * only copy the information from the master page table,
> + * nothing more.
> + *
> + * Before doing this on-demand faulting, ensure that the
> + * fault is not any of the following:
> + * 1. A fault on a PTE with a reserved bit set.
> + * 2. A fault caused by a user-mode access. (Do not demand-
> + * fault kernel memory due to user-mode accesses).
> + * 3. A fault caused by a page-level protection violation.
> + * (A demand fault would be on a non-present page which
> + * would have X86_PF_PROT==0).
> + */
> + if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
> + if (vmalloc_fault(address) >= 0)
> + return;
> + }
> +
> + /* Was the fault spurious, caused by lazy TLB invalidation? */
> + if (spurious_kernel_fault(hw_error_code, address))
> + return;
> +
> + /* kprobes don't want to hook the spurious faults: */
> + if (kprobes_fault(regs))
> + return;
> +
> + /*
> + * This is a "bad" fault in the kernel address space. There
> + * is no reasonable explanation for it. We will either kill
> + * the process for making a bad access, or oops the kernel.
> + */

Or call an extable handler?

Maybe the wording should be less scary, e.g. "this fault is a genuine error. Send a signal, call an exception handler, or oops, as appropriate."
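To be concrete about the extable part: I mean the fixup_exception() check
that no_context() does before oopsing, roughly (paraphrased, not the exact
mainline code):

	/* Inside no_context(), before deciding to oops: */
	if (fixup_exception(regs, X86_TRAP_PF)) {
		/*
		 * A whitelisted kernel access (e.g. a uaccess helper with an
		 * _ASM_EXTABLE entry) faulted; the fixup rewrote regs->ip,
		 * so the fault is handled rather than fatal.
		 */
		return;
	}
	/* No fixup entry -> genuine kernel bug, proceed to oops. */

So even for a "bad" kernel-address fault, the outcome can still be a
recoverable fixup rather than a kill or an oops.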