Re: [PATCH 5/5] x86/mm/init: remove freed kernel image areas from alias mapping

From: Hugh Dickins
Date: Wed Aug 01 2018 - 19:17:07 EST


On Wed, 1 Aug 2018, Dave Hansen wrote:
>
> From: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
>
> The kernel image is mapped into two places in the virtual address
> space (addresses without KASLR, of course):
>
> 1. The kernel direct map (0xffff880000000000)
> 2. The "high kernel map" (0xffffffff81000000)
>
> We actually execute out of #2. If we get the address of a kernel
> symbol, it points to #2, but almost all physical-to-virtual
> translations point to #1.
>
> Parts of the "high kernel map" alias are mapped in the userspace
> page tables with the Global bit for performance reasons. The
> parts that we map to userspace do not (er, should not) have
> secrets.
>
> This is fine, except that some areas in the kernel image that
> are adjacent to the non-secret-containing areas are unused holes.
> We free these holes back into the normal page allocator and
> reuse them as normal kernel memory. The memory will, of course,
> get *used* via the normal map, but the alias mapping is kept.
>
> This otherwise unused alias mapping of the holes will, by default,
> keep the Global bit, be mapped out to userspace, and be
> vulnerable to Meltdown.
>
> Remove the alias mapping of these pages entirely. This is likely
> to fracture the 2M page mapping the kernel image near these areas,
> but this should affect a minority of the area.
>
> This unmapping behavior is currently dependent on PTI being in
> place. Going forward, we should at least consider doing this for
> all configurations. Having an extra read-write alias for memory
> is not exactly ideal for debugging things like random memory
> corruption and this does undercut features like DEBUG_PAGEALLOC
> or future work like eXclusive Page Frame Ownership (XPFO).
>
> Before this patch:
>
> current_kernel:---[ High Kernel Mapping ]---
> current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
> current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
> current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
> current_kernel-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
> current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
> current_kernel-0xffffffff82c00000-0xffffffff82e00000 2M RW NX pte
> current_kernel-0xffffffff82e00000-0xffffffff83200000 4M RW PSE NX pmd
> current_kernel-0xffffffff83200000-0xffffffffa0000000 462M pmd
>
> current_user:---[ High Kernel Mapping ]---
> current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
> current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
> current_user-0xffffffff81e11000-0xffffffff82000000 1980K RW NX pte
> current_user-0xffffffff82000000-0xffffffff82600000 6M ro PSE GLB NX pmd
> current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
>
>
> After this patch:
>
> current_kernel:---[ High Kernel Mapping ]---
> current_kernel-0xffffffff80000000-0xffffffff81000000 16M pmd
> current_kernel-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> current_kernel-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
> current_kernel-0xffffffff81e11000-0xffffffff82000000 1980K pte
> current_kernel-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd
> current_kernel-0xffffffff82400000-0xffffffff82488000 544K ro NX pte
> current_kernel-0xffffffff82488000-0xffffffff82600000 1504K pte
> current_kernel-0xffffffff82600000-0xffffffff82c00000 6M RW PSE NX pmd
> current_kernel-0xffffffff82c00000-0xffffffff82c0d000 52K RW NX pte
> current_kernel-0xffffffff82c0d000-0xffffffff82dc0000 1740K pte
>
> current_user:---[ High Kernel Mapping ]---
> current_user-0xffffffff80000000-0xffffffff81000000 16M pmd
> current_user-0xffffffff81000000-0xffffffff81e00000 14M ro PSE GLB x pmd
> current_user-0xffffffff81e00000-0xffffffff81e11000 68K ro GLB x pte
> current_user-0xffffffff81e11000-0xffffffff82000000 1980K pte
> current_user-0xffffffff82000000-0xffffffff82400000 4M ro PSE GLB NX pmd
> current_user-0xffffffff82400000-0xffffffff82488000 544K ro NX pte
> current_user-0xffffffff82488000-0xffffffff82600000 1504K pte
> current_user-0xffffffff82600000-0xffffffffa0000000 474M pmd
>
> Signed-off-by: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
> Cc: Kees Cook <keescook@xxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Ingo Molnar <mingo@xxxxxxxxxx>
> Cc: Andrea Arcangeli <aarcange@xxxxxxxxxx>
> Cc: Juergen Gross <jgross@xxxxxxxx>
> Cc: Josh Poimboeuf <jpoimboe@xxxxxxxxxx>
> Cc: Greg Kroah-Hartman <gregkh@xxxxxxxxxxxxxxxxxxx>
> Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
> Cc: Hugh Dickins <hughd@xxxxxxxxxx>
> Cc: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxx>
> Cc: Borislav Petkov <bp@xxxxxxxxx>
> Cc: Andy Lutomirski <luto@xxxxxxxxxx>
> Cc: Andi Kleen <ak@xxxxxxxxxxxxxxx>
> ---
>
> b/arch/x86/mm/init.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff -puN arch/x86/mm/init.c~x86-unmap-freed-areas-from-kernel-image arch/x86/mm/init.c
> --- a/arch/x86/mm/init.c~x86-unmap-freed-areas-from-kernel-image 2018-07-30 09:53:14.862915689 -0700
> +++ b/arch/x86/mm/init.c 2018-07-30 09:53:14.866915689 -0700
> @@ -778,8 +778,26 @@ void free_init_pages(char *what, unsigne
>   */
>  void free_kernel_image_pages(void *begin, void *end)
>  {
> -	free_init_pages("unused kernel image",
> -			(unsigned long)begin, (unsigned long)end);
> +	unsigned long begin_ul = (unsigned long)begin;
> +	unsigned long end_ul = (unsigned long)end;
> +	unsigned long len_pages = (end_ul - begin_ul) >> PAGE_SHIFT;
> +
> +
> +	free_init_pages("unused kernel image", begin_ul, end_ul);
> +
> +	/*
> +	 * PTI maps some of the kernel into userspace.  For
> +	 * performance, this includes some kernel areas that
> +	 * do not contain secrets.  Those areas might be
> +	 * adjacent to the parts of the kernel image being
> +	 * freed, which may contain secrets.  Remove the
> +	 * "high kernel image mapping" for these freed areas,
> +	 * ensuring they are not even potentially vulnerable
> +	 * to Meltdown regardless of the specific optimizations
> +	 * PTI is currently using.
> +	 */
> +	if (cpu_feature_enabled(X86_FEATURE_PTI))
> +		set_memory_np(begin_ul, len_pages);
>  }
>
> void __ref free_initmem(void)
> _

Ironically, that set_memory_np() is giving me a problem.

I don't see it when booting the 8GB laptop normally, but when booting
with "mem=1G", I get a not-present fault when ext4_iget() is trying to
do its business in starting init. But it boots fine with "mem=1G nopti".

I get the feeling that set_memory_np() is marking those freed pages
as not-present in the direct map, so they're no longer usable at all.

I can jot down some console messages if you need, but hope I've said
enough for you to see it immediately, and just say whoops, forget 5/5?

Hugh