Re: [tip:efi/core] x86/mm/pat: Use _PAGE_GLOBAL bit for EFI page table mappings

From: Andy Lutomirski
Date: Wed Feb 24 2016 - 14:49:52 EST

On Wed, Feb 24, 2016 at 11:33 AM, Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx> wrote:
> On Wed, 24 Feb, at 08:36:33AM, Andy Lutomirski wrote:
>> On Wed, Feb 24, 2016 at 8:20 AM, Borislav Petkov <bp@xxxxxxxxx> wrote:
>> > On Wed, Feb 24, 2016 at 02:10:46PM +0000, Matt Fleming wrote:
>> >> >> > Normally, the only pages with _PAGE_GLOBAL set are those that are in
>> >> > the normal kernel mappings (swapper_pg_dir and normal mm_struct pgds).
>> >> > By allowing _PAGE_GLOBAL to be set in EFI mappings, you're breaking
>> >> > that convention, which forces you to use extra-expensive
>> >> > __flush_tlb_all calls in efi_call_virt.
>> >
>> > Hold on, do you mean the __flush_tlb_all() in the CONFIG_EFI_MIXED code?
>> >
>> > That's mixed mode. I think you mean the FLUSH_TLB_ALL in efi_call.
>> > That's EFI on 64-bit but that is mandated by the spec, AFAIR.
>> I mean the one in efi_call_virt. Why would the spec mandate a TLB
>> flush at all? EFI runtime services have no business touching the
>> paging structures directly. Heck, the 32-bit ones don't even know the
>> *format* of the paging structures.
> Right, and it would necessitate copying out arguments because the
> firmware won't understand where/how the kernel has mapped things.
> No firmware is going to be doing that.

Just so I understand correctly: could we get away with putting the EFI
virtual runtime mappings at positive (user) addresses for 64-bit UEFI,
or is there some reason that we need the high bit set?

If we could use positive addresses, then we could use the existing
use_mm infrastructure directly with no funny business at all except to
the extent that we might need to use unusual APIs to set up the VMAs
(if we use real VMAs) in the first place. (We could cheat and
allocate a single monstrous VM_MIXEDMAP or VM_PFNMAP vma with a .fault
handler that always fails.) If we have to use negative addresses,
then we'll always be stuck with a funny pgd, but we could still
probably use use_mm instead of manually fiddling with cr3.
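On x86-64 with 4-level paging, "positive" and "negative" addresses here correspond to the two canonical halves of the 48-bit virtual address space. User addresses have bit 63 clear. Kernel addresses are sign-extended, so their high bits are set and they read as negative when treated as signed. The following is a small user-space sketch of that classification, assuming 48 virtual address bits (5-level paging widens this to 57). `VA_BITS`, `is_canonical` and `is_user_half` are hypothetical helpers for illustration, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 4-level paging, i.e. 48 virtual address bits (57 with LA57). */
#define VA_BITS 48
static_assert(VA_BITS == 48 || VA_BITS == 57, "x86-64 uses 48- or 57-bit VAs");

/* Canonical: bits 63..(VA_BITS-1) are either all zeros or all ones. */
bool is_canonical(uint64_t addr)
{
	uint64_t top = addr >> (VA_BITS - 1);

	return top == 0 || top == (UINT64_MAX >> (VA_BITS - 1));
}

/* "Positive" (user-half) address: canonical with bit 63 clear. */
bool is_user_half(uint64_t addr)
{
	return is_canonical(addr) && (addr >> 63) == 0;
}
```

Under these assumptions, 0x00007fffffffffff is the highest user-half address. Everything from 0x0000800000000000 through 0xffff7fffffffffff is non-canonical and faults if used.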

Some day I want to experiment with calling runtime services at CPL 3,
too :) We'd want to add some infrastructure to permit kernel threads
to run through the entry/exit code as if they were user processes, but
there's nothing conceptually wrong with that. We already allow kernel
threads to call execve and "return" to real user mode, so it's not
much of a stretch. The main issue would be dealing with signal
handling and such -- we'd want to report faults back to the kernel
thread's CPL3-invocation thunks rather than delivering a signal at CPL

Hmm, now it's time to muse about how the interface would work. If we
kept it in line with existing practice, we'd add an API to make a
special kernel thread with an attached user context. To enter CPL3,
you'd return from the thread's main function. When user mode was done
(fault or syscall), new hooks in the entry code (similar to seccomp
and the die notifier stuff) would re-enter the main function with some
arguments indicating what happened.

If we wanted to make it a bit easier to use, we'd have to allocate an
extra kernel stack, and we could have:

void invoke_cpl3(struct cpl3_context *ctx);

where ctx contains memory for an extra stack as well as a bunch of
data indicating the reason that it returned.

The latter is harder to implement but probably much easier to use.
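The shape of the invoke_cpl3() interface can be illustrated with a user-space analogy. It is only an analogy and should be read that way. It uses POSIX ucontext (getcontext/makecontext/swapcontext, available in glibc) to run a function on a caller-supplied extra stack and then hand back a record of why control returned. No privilege change happens. cpl3_exit() simply stands in for the entry-code hooks that would catch a syscall or fault. The fields of struct cpl3_context, the exit-reason enum, cpl3_exit(), cpl3_context_init()/cpl3_context_destroy() and the example callees are all hypothetical names, not an existing kernel API.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <ucontext.h>

/* Hypothetical: why control came back from the "CPL 3" side. */
enum cpl3_exit_reason { CPL3_EXIT_RETURNED, CPL3_EXIT_SYSCALL, CPL3_EXIT_FAULT };

/* Hypothetical context: the extra stack plus the exit report. */
struct cpl3_context {
	ucontext_t caller;	/* resumes inside invoke_cpl3() */
	ucontext_t callee;	/* runs fn on the extra stack */
	char *stack;		/* the extra stack */
	size_t stack_size;
	void (*fn)(struct cpl3_context *ctx);
	enum cpl3_exit_reason reason;
	long value;		/* e.g. syscall number or fault address */
};

/* makecontext() cannot portably pass a pointer, so stash it here. */
static struct cpl3_context *current_ctx;

static void cpl3_trampoline(void)
{
	current_ctx->fn(current_ctx);
	current_ctx->reason = CPL3_EXIT_RETURNED;
	/* Falling off the end resumes uc_link, i.e. ctx->caller. */
}

/*
 * Stand-in for a syscall or fault: record why, then switch back to the
 * caller.  This sketch never resumes the callee afterwards.
 */
void cpl3_exit(struct cpl3_context *ctx, enum cpl3_exit_reason r, long v)
{
	ctx->reason = r;
	ctx->value = v;
	swapcontext(&ctx->callee, &ctx->caller);
}

int cpl3_context_init(struct cpl3_context *ctx,
		      void (*fn)(struct cpl3_context *), size_t stack_size)
{
	ctx->stack = malloc(stack_size);
	ctx->stack_size = stack_size;
	ctx->fn = fn;
	ctx->reason = CPL3_EXIT_RETURNED;
	ctx->value = 0;
	return ctx->stack ? 0 : -1;
}

void cpl3_context_destroy(struct cpl3_context *ctx)
{
	free(ctx->stack);
	ctx->stack = NULL;
}

/* Run ctx->fn on the extra stack; returns once it exits or returns. */
void invoke_cpl3(struct cpl3_context *ctx)
{
	assert(ctx->stack != NULL);
	getcontext(&ctx->callee);
	ctx->callee.uc_stack.ss_sp = ctx->stack;
	ctx->callee.uc_stack.ss_size = ctx->stack_size;
	ctx->callee.uc_link = &ctx->caller;
	makecontext(&ctx->callee, cpl3_trampoline, 0);
	current_ctx = ctx;
	swapcontext(&ctx->caller, &ctx->callee);
}

/* Example callee: "makes a syscall" and never returns normally. */
void example_service(struct cpl3_context *ctx)
{
	cpl3_exit(ctx, CPL3_EXIT_SYSCALL, 42);
}

/* Example callee: simply returns. */
void example_noop(struct cpl3_context *ctx)
{
	(void)ctx;
}
```

From the caller's point of view, invoke_cpl3() behaves like an ordinary function call. Afterwards, ctx->reason and ctx->value say whether the callee returned or "trapped". This is the ease-of-use advantage, paid for by the extra stack and the context switching underneath.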

If anyone wants to work on this, ping me and I'll help and do a bunch of review.


Andy Lutomirski
AMA Capital Management, LLC