Re: [PATCH 4/3 v2] x86/mm/doc: Enhance the x86-64 virtual memory layout descriptions

From: Andy Lutomirski
Date: Sat Oct 06 2018 - 18:17:51 EST


On Sat, Oct 6, 2018 at 10:03 AM Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
>
> There's one PTI-related layout asymmetry I noticed between 4-level and 5-level kernels:
>
> 47-bit:
> > + |
> > + | Kernel-space virtual memory, shared between all processes:
> > +____________________________________________________________|___________________________________________________________
> > + | | | |
> > + ffff800000000000 | -128 TB | ffff87ffffffffff | 8 TB | ... guard hole, also reserved for hypervisor
> > + ffff880000000000 | -120 TB | ffffc7ffffffffff | 64 TB | direct mapping of all physical memory (page_offset_base)
> > + ffffc80000000000 | -56 TB | ffffc8ffffffffff | 1 TB | ... unused hole
> > + ffffc90000000000 | -55 TB | ffffe8ffffffffff | 32 TB | vmalloc/ioremap space (vmalloc_base)
> > + ffffe90000000000 | -23 TB | ffffe9ffffffffff | 1 TB | ... unused hole
> > + ffffea0000000000 | -22 TB | ffffeaffffffffff | 1 TB | virtual memory map (vmemmap_base)
> > + ffffeb0000000000 | -21 TB | ffffebffffffffff | 1 TB | ... unused hole
> > + ffffec0000000000 | -20 TB | fffffbffffffffff | 16 TB | KASAN shadow memory
> > + fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
> > + | | | | vaddr_end for KASLR
> > + fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
> > + fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | LDT remap for PTI
> > + ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
> > +__________________|____________|__________________|_________|____________________________________________________________
> > + |
>
> 56-bit:
> > + |
> > + | Kernel-space virtual memory, shared between all processes:
> > +____________________________________________________________|___________________________________________________________
> > + | | | |
> > + ff00000000000000 | -64 PB | ff0fffffffffffff | 4 PB | ... guard hole, also reserved for hypervisor
> > + ff10000000000000 | -60 PB | ff8fffffffffffff | 32 PB | direct mapping of all physical memory (page_offset_base)
> > + ff90000000000000 | -28 PB | ff9fffffffffffff | 4 PB | LDT remap for PTI
> > + ffa0000000000000 | -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
> > + ffd2000000000000 | -11.5 PB | ffd3ffffffffffff | 0.5 PB | ... unused hole
> > + ffd4000000000000 | -11 PB | ffd5ffffffffffff | 0.5 PB | virtual memory map (vmemmap_base)
> > + ffd6000000000000 | -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole
> > + ffdf000000000000 | -8.25 PB | fffffdffffffffff | ~8 PB | KASAN shadow memory
> > + fffffc0000000000 | -4 TB | fffffdffffffffff | 2 TB | ... unused hole
> > + | | | | vaddr_end for KASLR
> > + fffffe0000000000 | -2 TB | fffffe7fffffffff | 0.5 TB | cpu_entry_area mapping
> > + fffffe8000000000 | -1.5 TB | fffffeffffffffff | 0.5 TB | ... unused hole
> > + ffffff0000000000 | -1 TB | ffffff7fffffffff | 0.5 TB | %esp fixup stacks
>
> The two layouts are very similar beyond the shift in the offset and the region sizes, except for
> one big asymmetry: the placement of the LDT remap for PTI.
>
> Is there any fundamental reason why the LDT area is mapped into a 4 petabyte (!) area on 56-bit
> kernels, instead of being at the -1.5 TB offset like on 47-bit kernels?
>
> The only reason I can see for this is that it's currently coded at the PGD level only:
>
> static void map_ldt_struct_to_user(struct mm_struct *mm)
> {
>         pgd_t *pgd = pgd_offset(mm, LDT_BASE_ADDR);
>
>         if (static_cpu_has(X86_FEATURE_PTI) && !mm->context.ldt)
>                 set_pgd(kernel_to_user_pgdp(pgd), *pgd);
> }
>
> ( BTW., the 4 petabyte size of the area is misleading: a 5-level PGD entry covers 256 TB of
> virtual memory, i.e. 0.25 PB, not 4 PB. So in reality we have a 0.25 PB area there, used up
> by the LDT mapping in a single PGD entry, plus a 3.75 PB hole after that. )
>
> ... but unless I'm missing something it's not really fundamental for it to be at the PGD level
> - it could be two levels lower as well, and it could move back to the same place where it is on
> the 47-bit kernel.
>
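
Your arithmetic checks out, by the way. As a quick sanity check of that
parenthetical, here's a throwaway user-space snippet (nothing taken from the
kernel headers, just the 5-level shift values) that gives the same numbers:
one PGD entry covers 256 TB, so the 4 PB slot spans 16 PGD entries of which
the LDT uses exactly one:

#include <stdio.h>

int main(void)
{
        /*
         * With 5-level paging the PGD indexes virtual address bits 48..56,
         * so a single PGD entry covers 2^48 bytes.
         */
        unsigned long long pgd_entry = 1ULL << 48;      /* 256 TB */
        unsigned long long ldt_slot  = 4ULL << 50;      /* the 4 PB slot */

        printf("one PGD entry covers %llu TB\n", pgd_entry >> 40);
        printf("the 4 PB slot spans %llu PGD entries\n", ldt_slot / pgd_entry);
        return 0;
}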

The subtlety is that, if it's mapped below the PGD level, there end up
being some page tables that are private to each LDT-using mm but that
map things other than the LDT.  Those tables cover the same address
range as the corresponding tables in init_mm, and if the init_mm tables
change after the LDT mapping is set up, the changes won't propagate
into the private copies.

So it probably could be made to work, but it would take some extra care.
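
To make the "extra care" a bit more concrete, here is a rough, untested
sketch of what a below-PGD-level variant might have to do on a 5-level
kernel if the LDT range shared its PGD entry with other kernel mappings.
The function name is made up, and locking, error unwinding and freeing are
all hand-waved; it only exists to show where the private, snapshot-style
tables come from:

/* Illustrative only -- not proposed code. */
static int map_ldt_below_pgd_level(struct mm_struct *mm)
{
        pgd_t *k_pgd = pgd_offset(mm, LDT_BASE_ADDR);
        pgd_t *u_pgd = kernel_to_user_pgdp(k_pgd);
        pgd_t *init_u_pgd = kernel_to_user_pgdp(pgd_offset_k(LDT_BASE_ADDR));
        p4d_t *u_p4d;

        /*
         * The LDT mapping is per-mm, so the user-side PGD entry can no
         * longer point at a p4d page shared by every mm: we need a p4d
         * page that is private to this mm.
         */
        u_p4d = (p4d_t *)get_zeroed_page(GFP_KERNEL);
        if (!u_p4d)
                return -ENOMEM;

        /*
         * Everything else that happens to live under the same PGD entry
         * (cpu_entry_area, espfix, ...) has to be copied into the private
         * table.  This copy is a snapshot: if init_mm's user tables for
         * this range change later, nothing propagates the change here.
         */
        memcpy(u_p4d, (void *)pgd_page_vaddr(*init_u_pgd),
               PTRS_PER_P4D * sizeof(p4d_t));

        /* Add the per-mm LDT mapping itself ... */
        set_p4d(u_p4d + p4d_index(LDT_BASE_ADDR),
                *p4d_offset(k_pgd, LDT_BASE_ADDR));

        /* ... and hang the private p4d page off this mm's user PGD. */
        set_pgd(u_pgd, __pgd(_KERNPG_TABLE | __pa(u_p4d)));

        return 0;
}

Keeping those private copies in sync with init_mm (or proving they can
never change after the LDT is set up) is the part that needs the extra
care; leaving the mapping at the PGD level sidesteps it entirely, at the
cost of burning a whole PGD entry.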