Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
From: Andy Lutomirski
Date: Wed Nov 01 2017 - 03:59:31 EST
On Tue, Oct 31, 2017 at 4:44 PM, Dave Hansen
<dave.hansen@xxxxxxxxxxxxxxx> wrote:
> On 10/31/2017 04:27 PM, Linus Torvalds wrote:
>> Inconveniently, the people you cc'd on the actual patches did *not*
>> get cc'd with this 00/23 cover letter email.
>
> Urg, sorry about that.
>
>> (a) is this on top of Andy's entry cleanups?
>>
>> If not, that probably needs to be sorted out.
>
> It is not. However, I did a version on top of his earlier cleanups, so
> I know this can be easily ported on top of them. It didn't make a major
> difference in the number of places that KAISER had to patch, unfortunately.
>
>> (b) the TLB global bit really is nastily done. You basically disable
>> _PAGE_GLOBAL entirely.
>>
>> I can see how/why that would make things simpler, but it's almost
>> certainly the wrong approach. The small subset of kernel pages that
>> are always mapped should definitely retain the global bit, so that you
>> don't always take a TLB miss on those! Those are probably some of the
>> most latency-critical pages, since there's generally no prefetching
>> for the kernel entry code or for things like IDT/GDT accesses..
>>
>> So even if you don't want to have global pages for normal kernel
>> entries, you don't want to just make _PAGE_GLOBAL be defined as zero.
>> You'd want to just use _PAGE_GLOBAL conditionally.
>>
>> Hmm?
>
> That's a good point. Shouldn't be hard to implement at all. We'll just
> need to take _PAGE_GLOBAL out of the default _KERNPG_TABLE definition, I
> think.
>
>> (c) am I reading the code correctly, and the shadow page tables are
>> *completely* duplicated?
>>
>> That seems insane. Why isn't only the top level shadowed, and
>> then lower levels are shared between the shadowed and the "kernel"
>> page tables?
>
> There are obviously two PGDs. The userspace half of the PGD is an exact
> copy so all the lower levels are shared. You can see this bit in the
> memcpy that we do in clone_pgd_range().
>
> For the kernel half, we don't share any of the lower levels. That's
> mostly because the stuff that we're mapping into the user/shadow copy is
> only 4k aligned and (probably) never >2MB, so there's really no
> opportunity to share.
>
I think we should map exactly two kernel PGD entries: one for the fixmap
and one for the special shared stuff. Those entries should be mapped
identically in the user tables, so everything below them is shared. We
can eventually (or immediately) get rid of the fixmap entry, too, by
moving the IDT and GDT and making a special user fixmap table for the
vsyscall page.
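
To make the idea concrete, here is a rough userspace model of it, not kernel code. It assumes x86-64 4-level paging (512 top-level entries, with the upper 256 covering the kernel half). The names build_user_pgd() and count_kernel_entries(), and the slot numbers SHARED_PGD_SLOT and FIXMAP_PGD_SLOT, are hypothetical and picked only for illustration; the real slots depend on the kernel's address-space layout.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PTRS_PER_PGD        512
#define KERNEL_PGD_BOUNDARY 256  /* entries 256..511 cover the kernel half */

/* Hypothetical slot numbers, for illustration only. */
#define SHARED_PGD_SLOT 300      /* "special shared stuff" (entry code etc.) */
#define FIXMAP_PGD_SLOT 511      /* fixmap */

typedef uint64_t pgd_entry_t;    /* modeled as a raw 64-bit entry */

/*
 * Build the user/shadow top-level table from the kernel one.
 * Copying a top-level entry verbatim means both tables point at the
 * same lower-level tables, so nothing below the PGD is duplicated.
 */
static void build_user_pgd(pgd_entry_t *user_pgd, const pgd_entry_t *kernel_pgd)
{
    assert(user_pgd != NULL && kernel_pgd != NULL);

    /* Start with nothing mapped. */
    memset(user_pgd, 0, PTRS_PER_PGD * sizeof(pgd_entry_t));

    /* Userspace half: exact copy of the kernel table's entries. */
    memcpy(user_pgd, kernel_pgd, KERNEL_PGD_BOUNDARY * sizeof(pgd_entry_t));

    /* Kernel half: exactly two entries, identical in both tables. */
    user_pgd[SHARED_PGD_SLOT] = kernel_pgd[SHARED_PGD_SLOT];
    user_pgd[FIXMAP_PGD_SLOT] = kernel_pgd[FIXMAP_PGD_SLOT];
}

/* Count populated entries in the kernel half of a top-level table. */
static int count_kernel_entries(const pgd_entry_t *pgd)
{
    int n = 0;

    for (int i = KERNEL_PGD_BOUNDARY; i < PTRS_PER_PGD; i++)
        if (pgd[i] != 0)
            n++;
    return n;
}
```

With a fully populated kernel table as input, count_kernel_entries() on the resulting user table gives 2, and every other kernel-half slot stays zero (unmapped) in the user copy.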
--Andy