[PATCH V5 0/3] Use efi_rts_wq to invoke EFI Runtime Services

From: Sai Praneeth Prakhya
Date: Mon May 28 2018 - 22:21:04 EST


From: Sai Praneeth <sai.praneeth.prakhya@xxxxxxxxx>

Problem statement:
------------------
Presently, efi_runtime_services() silently switches %cr3 from swapper_pgd
to efi_pgd. As a consequence, kernel code that runs in efi_pgd (e.g.,
perf code via an NMI) will have incorrect user space mappings[1]. This
could lead to otherwise unexpected access errors and, worse, unauthorized
access to firmware code and data.

Detailed discussion of problem statement:
-----------------------------------------
This switch is not propagated to other kernel subsystems, so they will
wrongly assume that swapper_pgd is still in use. This can lead to the
following issues:

1. If kernel code tries to access user space addresses while in efi_pgd,
it could lead to unauthorized accesses to firmware code/data.
(e.g., <__>/copy_from_user_nmi()).
[This could also be disastrous if the frame pointer happens to point at
MMIO in the EFI runtime mappings] - Mark Rutland.

An example of a subsystem that could touch user space while in efi_pgd is
perf. Assume that we are in efi_pgd and a user uses perf to profile some
user data. Depending on the address the user is trying to profile, two
things could happen:
1. If the mappings are absent, perf fails to profile.
2. If efi_pgd does have mappings for the requested address (these
mappings are erroneous), perf profiles firmware code/data. If the
address is MMIO'ed, perf could potentially change some device state.

The culprit in both cases is the EFI subsystem swapping out the pgd, not
perf, because the EFI subsystem has broken the *general assumption* that
all other subsystems rely on - "user space might be valid and nobody has
switched %cr3".

Solutions:
----------
There are two ways to fix this issue:
1. Educate *all* the subsystems that could potentially access user space
while in efi_pgd about the pgd change.
On x86, AFAIK, it could happen only when someone touches user space
from the back of an NMI (a quick audit of <__>/copy_from_user_nmi
showed perf and oprofile). On arm, it could happen from multiple
places as arm runs efi_runtime_services() with interrupts enabled (ARM
folks, please comment on this as I might be wrong), whereas x86 runs
efi_runtime_services() with interrupts disabled.

I think this solution isn't holistic because
a. Any other subsystem might well do the same, if not now, in the future.
b. Also, this solution looks simple on x86, but I am not sure the same
is true for arm (ARM folks, please comment on this as I might be wrong).
c. This solution looks like a workaround rather than addressing the issue.

2. Run efi_runtime_services() in kthread context.
This makes sense because efi_pgd doesn't have user space and kthread
by definition means that user space is not valid. Any kernel code that
tries to touch user space while in kthread is buggy in itself. If so,
it should be an easy fix in the other subsystem. This also takes us one
step closer to Andy's long-awaited proposal: running EFI at CPL 3.

What does this patch set do?
----------------------------
Introduce efi_rts_wq (EFI runtime services work queue).
When a user process requests the kernel to execute any efi_runtime_service(),
kernel queues the work to efi_rts_wq, a kthread comes along, switches to
efi_pgd and executes efi_runtime_service() in kthread context. IOW, this
patch set adds support to the EFI subsystem to handle all calls to
efi_runtime_services() using a work queue (which in turn uses kthread).
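
To make the hand-off described above concrete, here is a minimal user
space sketch. It is not the code from the patches: it uses pthreads
instead of a kernel workqueue, spawns one worker per request for brevity
(efi_rts_wq is a single queue), and every name in it (rts_work_model,
rts_worker_model, rts_queue_work_model, fake_service) is hypothetical.
It only shows the shape of the idea: the caller queues a request, a
separate worker context executes it, and the caller sleeps until the
worker signals completion.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one queued request; not the kernel's struct. */
struct rts_work_model {
	long (*fn)(long arg);	/* stands in for the firmware call */
	long arg;
	long status;		/* result written by the worker */
	bool done;		/* plays the role of a completion */
	pthread_mutex_t lock;
	pthread_cond_t cond;
};

/* Worker side: plays the role of the kthread behind efi_rts_wq. */
static void *rts_worker_model(void *data)
{
	struct rts_work_model *w = data;
	/* In the kernel, the switch to efi_pgd would surround this call. */
	long status = w->fn(w->arg);

	pthread_mutex_lock(&w->lock);
	w->status = status;
	w->done = true;
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);
	return NULL;
}

/* Caller side: hand off the request, then sleep until it is done. */
static long rts_queue_work_model(long (*fn)(long), long arg)
{
	struct rts_work_model w = { .fn = fn, .arg = arg };
	pthread_t tid;
	long status = -1;	/* returned if the hand-off fails */

	assert(fn != NULL);
	pthread_mutex_init(&w.lock, NULL);
	pthread_cond_init(&w.cond, NULL);

	if (pthread_create(&tid, NULL, rts_worker_model, &w) == 0) {
		pthread_mutex_lock(&w.lock);
		while (!w.done)
			pthread_cond_wait(&w.cond, &w.lock);
		status = w.status;
		pthread_mutex_unlock(&w.lock);
		pthread_join(tid, NULL);
	}

	pthread_cond_destroy(&w.cond);
	pthread_mutex_destroy(&w.lock);
	return status;
}

/* Hypothetical "runtime service", used only for illustration. */
static long fake_service(long arg)
{
	return arg + 1;
}
```

Calling rts_queue_work_model(fake_service, 41) returns 42 once the
worker has run. The caller never executes fake_service() itself, which
is the property the patch set relies on. Build with -pthread.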

How does running efi_runtime_services() in kthread fix the above issues?
------------------------------------------------------------------------
If we run efi_runtime_services() in kthread context and perf checks for
it, perf can handle both of the above scenarios correctly by aborting
the profiling. Not only perf, but any subsystem that tries to touch user
space should first check for kthread context and, if so, abort.

Q. If we still need to check for kthread context in other subsystems that
access user space, what does this patch set fix?
A. This patch set makes sure that the EFI subsystem is not at fault.
Without this patch set the blame is upon the EFI subsystem, because it's
the one that changed the pgd and hasn't communicated this change to
everyone, and hence broke the general assumption. Running
efi_runtime_services() in kthread means explicitly communicating that
user space is invalid; it is then the responsibility of the other
subsystems to make sure that they are running in the right context.
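
As an illustration of the kind of check meant here, below is a small
user space model, not kernel code. struct task_model, struct mm_model,
user_access_allowed() and sample_user_word() are hypothetical names, and
PF_KTHREAD is redefined locally as a model value (the kernel has its own
definition in include/linux/sched.h). The sketch assumes a subsystem
decides whether user space is valid by looking at a kthread flag and at
whether the task has an mm at all.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Model value only; the kernel defines its own PF_KTHREAD. */
#define PF_KTHREAD 0x00200000

struct mm_model { int unused; };

/* Hypothetical, heavily reduced stand-in for a task descriptor. */
struct task_model {
	unsigned int flags;
	struct mm_model *mm;	/* NULL when there is no user address space */
};

/* User space is only meaningful for a non-kthread task that has an mm. */
static int user_access_allowed(const struct task_model *t)
{
	assert(t != NULL);
	return !(t->flags & PF_KTHREAD) && t->mm != NULL;
}

/*
 * Hypothetical sampler: copies one word from a "user" address, but
 * aborts with -EFAULT when running in kthread context.
 */
static int sample_user_word(const struct task_model *t,
			    const long *uaddr, long *out)
{
	if (!user_access_allowed(t))
		return -EFAULT;

	*out = *uaddr;
	return 0;
}
```

With a task that has PF_KTHREAD set, sample_user_word() returns -EFAULT
and leaves *out untouched; with an ordinary task that has an mm it
copies the word and returns 0.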

Testing:
--------
Tested using LUV (Linux UEFI Validation) for x86_64, x86_32 and arm64
(qemu only). I would appreciate it if someone could test the patches on
real ARM/ARM64 machines.
LUV: https://01.org/linux-uefi-validation

Credits:
--------
Thanks to Ricardo, Dan, Miguel, Mark, Peter Z and Ard for reviews and
suggestions. Thanks to Boris and Andy for helping me think through what
I am addressing with this patch set.

Please feel free to pour in your comments and concerns.

Note:
-----
Patches are based on Linus's kernel v4.17-rc7

[1] Backup: Detailing efi_pgd:
------------------------------
efi_pgd has mappings for EFI Runtime Code/Data (on x86, plus EFI Boot time
Code/Data) regions. Due to the nature of these mappings, they fall
in user space address ranges and they differ from the mappings in
swapper_pgd.

[On arm64, the EFI mappings are in the VA range usually used for user
space. The two halves of the address space are managed by separate
tables, TTBR0 and TTBR1. We always map the kernel in TTBR1, and we map
user space or EFI runtime mappings in TTBR0.] - Mark Rutland

Changes from V4 to V5:
----------------------
1. As suggested by Ard, don't use efi_rts_wq for non-blocking variants.
Non-blocking variants are not supposed to block, and using a workqueue
does exactly the opposite, hence refrain from using it.
2. Use non-blocking variants in efi_delete_dummy_variable(). Use of
blocking variants means that we have to call efi_delete_dummy_variable()
after efi_rts_wq has been created.
3. Remove in_atomic() check in set_variable<>() and query_variable_info<>().
Any caller wishing to use set_variable() and query_variable_info() in
atomic context should use their non-blocking variants.
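
The difference between the two kinds of variants can be sketched in user
space as follows. This is a model under assumptions, not the code in
runtime-wrappers.c: the patches only state that non-blocking variants
bypass efi_rts_wq, while the try-lock behaviour and the names used here
(rts_lock_model, set_variable_model(), set_variable_nonblocking_model(),
MODEL_NOT_READY) are assumptions made for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define MODEL_SUCCESS	0L
#define MODEL_NOT_READY	1L	/* hypothetical "try again later" status */

/* Serializes all modeled firmware calls. */
static pthread_mutex_t rts_lock_model = PTHREAD_MUTEX_INITIALIZER;
static int calls_made;		/* counts modeled firmware calls */

/* Stands in for the actual firmware call. */
static long do_service_model(void)
{
	calls_made++;
	return MODEL_SUCCESS;
}

/* Blocking variant: may sleep waiting for the lock. */
static long set_variable_model(void)
{
	long status;
	int rc;

	pthread_mutex_lock(&rts_lock_model);
	status = do_service_model();
	rc = pthread_mutex_unlock(&rts_lock_model);
	assert(rc == 0);
	(void)rc;
	return status;
}

/* Non-blocking variant: never waits, reports failure if the lock is busy. */
static long set_variable_nonblocking_model(void)
{
	long status;
	int rc;

	if (pthread_mutex_trylock(&rts_lock_model) != 0)
		return MODEL_NOT_READY;

	status = do_service_model();
	rc = pthread_mutex_unlock(&rts_lock_model);
	assert(rc == 0);
	(void)rc;
	return status;
}
```

A caller that cannot sleep would use the non-blocking model and handle
MODEL_NOT_READY itself; the firmware call is simply not made when the
lock is busy. Build with -pthread.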

Changes from V3 to V4:
----------------------
1. As suggested by Peter, use completions instead of flush_work() as the
former is cheaper.
2. Call efi_delete_dummy_variable() from efisubsys_init(). Sorry Ard, I
wasn't able to find a better alternative to keep this change local to
arch/x86.

Changes from V2 to V3:
----------------------
1. Rewrite the cover letter to clearly state the problem: what we are
fixing and what we are not fixing.
2. Make efi_delete_dummy_variable() change local to x86.
3. Avoid using BUG(), instead, print error message and exit gracefully.
4. Move struct efi_runtime_work to runtime-wrappers.c file.
5. Give enum a name (efi_rts_ids) and use it in efi_runtime_work.
6. Add Naresh (maintainer of LUV for ARM) and Miguel to the CC list.
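
Items 3 and 5 above can be pictured with the following user space
sketch. It is only a model: the patch set names its enum efi_rts_ids,
but the enum members, status values and call_rts_model() below are
hypothetical. The point is that a named enum selects the service and an
unknown ID produces an error message and an error status rather than
BUG().

```c
#include <assert.h>
#include <stdio.h>

/* Model IDs only; they do not mirror the members of efi_rts_ids. */
enum rts_id_model {
	MODEL_GET_TIME,
	MODEL_SET_TIME,
};

#define MODEL_SUCCESS	0L
#define MODEL_NOT_FOUND	1L	/* hypothetical error status */

static long model_clock = 100;	/* fake state behind the two services */

/* Dispatch one request by ID; unknown IDs fail gracefully. */
static long call_rts_model(int id, long *arg)
{
	assert(arg != NULL);

	switch (id) {
	case MODEL_GET_TIME:
		*arg = model_clock;
		return MODEL_SUCCESS;
	case MODEL_SET_TIME:
		model_clock = *arg;
		return MODEL_SUCCESS;
	default:
		/* Report the problem and return an error, no BUG(). */
		fprintf(stderr, "model: unknown runtime service id %d\n", id);
		return MODEL_NOT_FOUND;
	}
}
```

Passing an ID outside the enum makes call_rts_model() print a message
and return MODEL_NOT_FOUND without touching *arg, so the caller can
propagate the failure.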

Changes from V1 to V2:
----------------------
1. Remove unnecessary include of asm/efi.h file - Fixes build error on
ia64, reported by 0-day
2. Use enum to identify efi_runtime_services()
3. Use alloc_ordered_workqueue() to create efi_rts_wq as
create_workqueue() is scheduled for deprecation.
4. Make efi_call_rts() static, as it has no callers outside
runtime-wrappers.c
5. Use BUG(), when we are unable to queue work or unable to identify
requested efi_runtime_service() - Because these two situations should
*never* happen.

Sai Praneeth (3):
  x86/efi: Make efi_delete_dummy_variable() use
    set_variable_nonblocking() instead of set_variable()
  efi: Create efi_rts_wq and efi_queue_work() to invoke all
    efi_runtime_services()
  efi: Use efi_rts_wq to invoke EFI Runtime Services

arch/x86/platform/efi/quirks.c | 11 +-
drivers/firmware/efi/efi.c | 14 ++
drivers/firmware/efi/runtime-wrappers.c | 218 +++++++++++++++++++++++++++++---
include/linux/efi.h | 3 +
4 files changed, 224 insertions(+), 22 deletions(-)

Signed-off-by: Sai Praneeth Prakhya <sai.praneeth.prakhya@xxxxxxxxx>
Suggested-by: Andy Lutomirski <luto@xxxxxxxxxx>
Cc: Lee Chun-Yi <jlee@xxxxxxxx>
Cc: Borislav Petkov <bp@xxxxxxxxx>
Cc: Tony Luck <tony.luck@xxxxxxxxx>
Cc: Will Deacon <will.deacon@xxxxxxx>
Cc: Dave Hansen <dave.hansen@xxxxxxxxx>
Cc: Mark Rutland <mark.rutland@xxxxxxx>
Cc: Bhupesh Sharma <bhsharma@xxxxxxxxxx>
Cc: Naresh Bhat <naresh.bhat@xxxxxxxxxx>
Cc: Ricardo Neri <ricardo.neri@xxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Ravi Shankar <ravi.v.shankar@xxxxxxxxx>
Cc: Matt Fleming <matt@xxxxxxxxxxxxxxxxxxx>
Cc: Dan Williams <dan.j.williams@xxxxxxxxx>
Cc: Ard Biesheuvel <ard.biesheuvel@xxxxxxxxxx>
Cc: Miguel Ojeda <miguel.ojeda.sandonis@xxxxxxxxx>

--
2.7.4