[PATCH 00/26] Address Space Isolation (ASI) 2024
From: Brendan Jackman
Date: Fri Jul 12 2024 - 13:01:05 EST
Overview
========
This RFC demonstrates an implementation of Address Space Isolation
(ASI), similar to Junaid Shahid’s proposal from 2022 [1].
Until now, mitigating hardware vulnerabilities has required one or both
of:
- Highly custom mitigations being developed under pressure for every
specific exploit,
- Prohibitive performance penalties.
ASI is an attempt to improve both of these points by providing a single
technique that mitigates a very broad class of vulnerabilities while
still achieving a tolerable performance overhead.
The basic idea is to run the kernel in a “restricted address space”,
where any page that could contain “sensitive” data is unmapped. When the
kernel needs to access such data, a page fault occurs, in which we
switch back to the normal (“unrestricted”) address space and perform
vulnerability mitigations. Before returning to potentially malicious
code (VM guest/userspace) we transition back into the restricted address
space and get a chance to perform additional mitigations. Thus, we only
pay the cost of security mitigations for kernel entries (such as VM
Exit) that actually access sensitive data. If we can arrange for these
accesses to be infrequent, it becomes viable to perform aggressive
mitigations on address space transitions. For example, in this RFC we
attempt to obliterate indirect branch predictor training, without
needing to concern ourselves too much with microarchitectural details of
specific exploits. My talk at LSF/MM/BPF this year [2] has some
additional conceptual introduction with diagrams etc, plus some more
detailed discussion of the strategic pros and cons of ASI. Junaid’s RFC
cover letter [1] has some additional discussion too, which I won’t
rehash in detail.
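As a rough illustration of the transition scheme described above, here
is a minimal userspace model in C. It is only a sketch: every name in it
(asi_model, model_enter(), model_exit(), model_access()) is hypothetical
and is not the API added by this series, and the "mitigation" is reduced
to a counter. It shows that the exit cost is paid only when a sensitive
page is actually touched while in the restricted address space.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the series' real names. */
enum model_space { MODEL_UNRESTRICTED, MODEL_RESTRICTED };

struct asi_model {
	enum model_space space;
	unsigned int exits;       /* restricted -> unrestricted transitions */
	unsigned int mitigations; /* how often a mitigation step was run */
};

/* Enter the restricted space before running untrusted code (e.g. a guest). */
static void model_enter(struct asi_model *m)
{
	if (m->space == MODEL_RESTRICTED)
		return;
	m->mitigations++; /* chance to mitigate before untrusted code runs */
	m->space = MODEL_RESTRICTED;
}

/* Leave the restricted space; this is where the expensive work happens. */
static void model_exit(struct asi_model *m)
{
	assert(m->space == MODEL_RESTRICTED); /* precondition of this model */
	m->mitigations++;
	m->exits++;
	m->space = MODEL_UNRESTRICTED;
}

/*
 * The kernel touches a page. Sensitive pages are unmapped in the
 * restricted space, so the access "faults" and forces an exit.
 * Nonsensitive pages are mapped in both spaces and cost nothing extra.
 */
static void model_access(struct asi_model *m, bool page_is_sensitive)
{
	if (page_is_sensitive && m->space == MODEL_RESTRICTED)
		model_exit(m);
}
```

In this model, a sequence of model_enter(), nonsensitive accesses, and
another model_enter() leaves the exit counter at zero; only a sensitive
access in between increments it.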
Like Junaid’s RFC, this only implements ASI for protecting against
malicious KVM guests; this is a somewhat simpler use-case to start with.
However, ASI is written as a framework so that we can later use it to
sandbox bare metal processes too. Work has begun on prototyping this but
we don’t have a working implementation yet.
Rough structure of this series:
- 01-14: Establish ASI infrastructure, e.g. for manipulating pagetables,
performing address space transitions.
- 15-19: Map data into the restricted address space.
- 20-23: Finalize a functionally correct ASI for KVM.
- 24-26: Switch it on and demonstrate actual vuln mitigation.
What’s new in this RFC?
=======================
Since Junaid’s initial efforts, Google has steadily invested more and
more deeply towards ASI as a keystone of hardware security. This RFC is
basically the same system that Junaid presented, but I’ve done my best
to shrink it as much as possible. So, this is really just enough to
demonstrate ASI working end-to-end.
The most radical simplifications are the removal of “local nonsensitive”
memory (see [1] for explanation) and the removal of all of the
TLB-flushing smarts. Those will be implemented later as an enhancement.
What’s needed to make this a PATCH?
===================================
.:: Major problems
Aside from general missing features and performance issues there are two
major problems with this patchset:
1. It adds a page flag.
2. It creates artificial OOM conditions.
See “mm: asi: Map non-user buddy allocations as nonsensitive” for
details of both problems.
I hope to solve these with a more intrusive but less hacky integration
into the buddy allocator. This was discussed at LSF/MM/BPF [2], so I
won’t go into detail here; I just failed to get a prototype ready in
time for this RFC. I’ll need to have one ready before I can reasonably
ask to merge anything. It remains an open question whether we can find a
way to merge a minimal ASI without that complex integration, and without
creating technical debt such as a page flag.
.:: Configuration
As well as the above, I think it needs a cleaner idea of how ASI should
be configured. In this RFC, it’s enabled by setting asi=on on the kernel
command-line, and has barely any interaction with bugs.c. ASI does not
trivially fit into the existing configuration mechanism:
a. Existing mitigations are generally configured per-vuln, while ASI
is not a per-vuln mitigation.
b. ASI will never be strictly equivalent to any other mitigation
configuration (because it deliberately drops protection for at least
some memory), so making it the default represents a moderately bold
policy decision.
ASI also warrants configuration beyond on/off. Because it provides a way
to avoid paying the mitigation cost most of the time, in my opinion ASI
is best used in a mode that mitigates exploits beyond those that are
currently known to be possible on a given platform. For
example, in this RFC we attempt to obliterate _all_ indirect branch
predictor training before leaving the restricted address space, even on
platforms where no practical exploit is known to necessitate this. But I
expect many users to reject this philosophy, and the kernel ought to
support a different policy.
Input on this topic would be appreciated - even if it feels like
bikeshedding, I think it’s likely to provoke more interesting discussion
as a side effect. Otherwise I’ll just come up with _something_ and we
can discuss more at [PATCH] time. Perhaps a simple starting point would
be “mitigations=asi”.
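To make the configuration question concrete, here is a small userspace
sketch in C of how a command line could be scanned for either spelling.
It is illustrative only: model_cmdline_enables_asi() and tok_eq() are
hypothetical helpers, not kernel code, and “mitigations=asi” is merely
the starting point floated above, not an existing option. The real
kernel would hook into its own parameter-parsing machinery instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* True if the token [tok, tok + len) is exactly the string s. */
static bool tok_eq(const char *tok, size_t len, const char *s)
{
	return strlen(s) == len && strncmp(tok, s, len) == 0;
}

/*
 * Hypothetical model of the policy: ASI is enabled if the
 * space-separated command line contains "asi=on" (as in this RFC) or
 * "mitigations=asi" (only a suggestion in this cover letter).
 */
static bool model_cmdline_enables_asi(const char *cmdline)
{
	const char *p = cmdline;

	assert(cmdline != NULL);

	while (*p) {
		size_t len;

		p += strspn(p, " ");   /* skip separators */
		len = strcspn(p, " "); /* length of the next token */
		if (tok_eq(p, len, "asi=on") ||
		    tok_eq(p, len, "mitigations=asi"))
			return true;
		p += len;
	}
	return false;
}
```

Matching whole tokens means that a parameter such as “asi=online” is
not mistaken for “asi=on”. A real design would also have to decide how
this interacts with the existing per-vulnerability options, which this
sketch ignores.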
.:: Minor issues
- KVM’s rseq_test fails with asi=on. I think this is “just” a
performance problem: KVM’s rseq logic is known to trigger ASI
transitions unless additional optimisations are in place, and those
will be explored in a later series.
- fill_return_buffer() causes an “unreachable instruction” objtool
warning. I haven’t investigated this.
- Some BUGs that should probably not crash the kernel.
What is “sensitive memory”?
===========================
ASI is fundamentally creating a new security boundary. So, where does
the boundary go? In other words, what gets mapped into the restricted
address space?
This is determined at allocation time. In this RFC, there is a new
__GFP_SENSITIVE flag (currently only supported for buddy allocations,
not slab), and everything else is considered non-sensitive. This
default-nonsensitive approach is known as a “denylist” model. By simply
adding __GFP_SENSITIVE to GFP_USER, we can already deliver significant
protection from real-world attacks, while already being within reach of
pretty high performance results (more on this later).
However, it’s obviously not the case that all data worth leaking is
always in GFP_USER pages. There are two ways to respond to this problem:
1. Expand the denylist, i.e. try to set __GFP_SENSITIVE for all memory
that can contain secrets.
2. Switch to an “allowlist” model where sensitive is the default. Then
our job would instead be to set __GFP_NONSENSITIVE wherever we can
determine it’s safe and worthwhile for performance.
Option 2 clearly puts us in a stronger security posture, but it has the
major disadvantage of risking unpredictable performance impacts: since
ASI transitions are costly, a random system change that causes new pages
to start being touched by the kernel is much more likely to create
sudden, hard-to-diagnose performance degradations. This makes switching
ASI on in production a much scarier proposition.
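The difference between the two models can be sketched in a few lines of
C. This is a userspace illustration under stated assumptions: the
MODEL_* bit values are made up and do not reflect the kernel’s real GFP
encoding, and the two predicate functions are hypothetical. Only the
flag names __GFP_SENSITIVE and __GFP_NONSENSITIVE, and the idea of
adding the former to GFP_USER, come from the text above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit values only; not the kernel's real gfp_t layout. */
#define MODEL_GFP_SENSITIVE    0x1u
#define MODEL_GFP_NONSENSITIVE 0x2u
#define MODEL_GFP_KERNEL       0x0u
/* As described above: user allocations carry the sensitive tag. */
#define MODEL_GFP_USER         (0x4u | MODEL_GFP_SENSITIVE)

static_assert((MODEL_GFP_SENSITIVE & MODEL_GFP_NONSENSITIVE) == 0,
	      "the two tags must be distinct bits");

/* Denylist: mapped in the restricted space unless tagged sensitive. */
static bool denylist_maps_in_restricted(unsigned int gfp)
{
	return !(gfp & MODEL_GFP_SENSITIVE);
}

/*
 * Allowlist: unmapped unless explicitly tagged nonsensitive. In this
 * sketch an explicit sensitive tag wins if both are set.
 */
static bool allowlist_maps_in_restricted(unsigned int gfp)
{
	if (gfp & MODEL_GFP_SENSITIVE)
		return false;
	return (gfp & MODEL_GFP_NONSENSITIVE) != 0;
}
```

Under the denylist predicate an untagged allocation stays mapped, so
forgetting an annotation costs security. Under the allowlist predicate
the same untagged allocation is unmapped, so forgetting an annotation
costs performance instead.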
Opinions at LSF/MM/BPF were surprisingly relaxed about this topic. So if
possible I’d like to prefer option 1, and focus on getting Linux as soon
as possible to a version of ASI that’s viable to run in production, and
from there iterate towards stronger security guarantees. However,
discussion is welcome.
Performance
===========
I’m a little embarrassed that I don’t have performance data with this
RFC. Progress on getting this data has been painful, so I decided to
just get discussion started on the implementation, and I hope to follow
up soon with data. Since the initial patchset I’ll be proposing to merge
will be minimal (something similar in scope to this RFC), we should
expect it to perform badly. So, I’ll need to put together a
forward-looking branch that includes that patchset plus additional
features from future patchsets, so that we can prove that good
performance is achievable longer-term.
Google’s internal version of ASI shows less than 5% degradation on all
end-to-end performance metrics, and less than 1% is common. However, for
some workloads this has required more advanced optimisations than those
I expect to post in the initial upstream branch, so we can expect a
worse degradation in some cases.
The branch that I published for LSF/MM/BPF [2] (not radically different
from this RFC) showed comparable performance to Safe RET for a single-VM
Redis benchmark (<5%), although this was not a rigorous analysis. See
[5] for a graph showing that ASI performs dramatically better than a
comparable blanket mitigation (IBPB on VM Exit).
I’m planning to try and run either the VM-supported workloads from
mmtests [3], or some set of workloads from PerfKit Benchmarker [4],
whichever turns out to be easiest. I’ll compare ASI against
mitigations=off and one or two example configurations for existing
mitigations. Let me know if you have any specific requests/suggestions
for workloads or baseline-comparisons.
What’s next?
============
This cover letter is getting rather long, but briefly here are some work
items that need to be done for a “complete ASI”, but which I’d like to
defer until infrastructure is already in place in-tree:
- More sensitivity annotations, which will require more allocator
integrations
- More advanced/flexible mitigations in address space transitions
- Support for sandboxing bare-metal processes
- Avoid address space transitions by expanding the scope of what can be
run in the restricted address space (e.g. context-switching between
tasks in the same mm, returning to userspace)
- Deferring TLB flushing and using PCID properly
- Preventing cross-SMT attacks by halting sibling hyperthreads
- Non-x86 support (this isn’t prototyped at all, requires research,
probably a much longer-term topic).
Acknowledgements
================
Thanks to Alexandre Chartre for the initial implementation that inspired
Junaid’s RFC.
Of course thanks to Junaid Shahid and Ofir Weisse for their fantastic
work on the 2022 RFC and Google’s initial internal implementation.
Reiji Watanabe, Yosry Ahmed and Patrick Bellasi are also major
contributors to this effort from Google (you’ll see them attributed in
commit messages too).
Further thanks to Alexandra Sandulescu and Matteo Rizzo who have
provided security expertise for Google’s deployment. Alexandra is also
working on reliable easy-to-run exploit PoCs (as kernel selftests) which
have helped us to gain confidence that ASI actually mitigates
vulnerabilities.
References
==========
[1] Junaid’s RFC:
https://lore.kernel.org/all/20220223052223.1202152-1-junaids@xxxxxxxxxx/
[2] LSF/MM/BPF: https://www.youtube.com/watch?v=DxaN6X_fdlI
LWN coverage: https://lwn.net/Articles/974390/
Code: http://github.com/googleprodkernel/linux-kvm/tree/asi-lsfmmbpf-24
[3] mmtests: https://github.com/gormanm/mmtests
[4] PerfKit Benchmarker: https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
[5] Performance data at LSF/MM/BPF (timestamp link):
https://youtu.be/DxaN6X_fdlI?t=557
To: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
To: Ingo Molnar <mingo@xxxxxxxxxx>
To: Borislav Petkov <bp@xxxxxxxxx>
To: Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>
To: H. Peter Anvin <hpa@xxxxxxxxx>
To: Andy Lutomirski <luto@xxxxxxxxxx>
To: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
To: Sean Christopherson <seanjc@xxxxxxxxxx>
To: Paolo Bonzini <pbonzini@xxxxxxxxxx>
To: Alexandre Chartre <alexandre.chartre@xxxxxxxxxx>
To: Liran Alon <liran.alon@xxxxxxxxxx>
To: Jan Setje-Eilers <jan.setjeeilers@xxxxxxxxxx>
To: Catalin Marinas <catalin.marinas@xxxxxxx>
To: Will Deacon <will@xxxxxxxxxx>
To: Mark Rutland <mark.rutland@xxxxxxx>
To: Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>
To: Mel Gorman <mgorman@xxxxxxx>
To: Lorenzo Stoakes <lstoakes@xxxxxxxxx>
To: David Hildenbrand <david@xxxxxxxxxx>
To: Vlastimil Babka <vbabka@xxxxxxx>
To: Michal Hocko <mhocko@xxxxxxxxxx>
To: Khalid Aziz <khalid.aziz@xxxxxxxxxx>
To: Juri Lelli <juri.lelli@xxxxxxxxxx>
To: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
To: Dietmar Eggemann <dietmar.eggemann@xxxxxxx>
To: Steven Rostedt <rostedt@xxxxxxxxxxx>
To: Valentin Schneider <vschneid@xxxxxxxxxx>
To: Paul Turner <pjt@xxxxxxxxxx>
To: Reiji Watanabe <reijiw@xxxxxxxxxx>
To: Junaid Shahid <junaids@xxxxxxxxxx>
To: Ofir Weisse <oweisse@xxxxxxxxxx>
To: Yosry Ahmed <yosryahmed@xxxxxxxxxx>
To: Patrick Bellasi <derkling@xxxxxxxxxx>
To: KP Singh <kpsingh@xxxxxxxxxx>
To: Alexandra Sandulescu <aesa@xxxxxxxxxx>
To: Matteo Rizzo <matteorizzo@xxxxxxxxxx>
To: Jann Horn <jannh@xxxxxxxxxx>
Cc: x86@xxxxxxxxxx
Cc: linux-kernel@xxxxxxxxxxxxxxx
Cc: linux-mm@xxxxxxxxx
Cc: kvm@xxxxxxxxxxxxxxx
Signed-off-by: Brendan Jackman <jackmanb@xxxxxxxxxx>
---
Brendan Jackman (15):
x86: Create CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
objtool: let some noinstr functions make indirect calls
mm: asi: Add infrastructure for boot-time enablement
mm: asi: ASI support in interrupts/exceptions
mm: asi: Avoid warning from NMI userspace accesses in ASI context
mm: Add __PAGEFLAG_FALSE
mm: asi: Map non-user buddy allocations as nonsensitive
mm: asi: Map kernel text and static data as nonsensitive
mm: asi: Map vmalloc/vmap data as nonsesnitive
KVM: x86: asi: Restricted address space for VM execution
KVM: x86: asi: Stabilize CR3 when potentially accessing with ASI
mm: asi: Stabilize CR3 in switch_mm_irqs_off()
mm: asi: Make TLB flushing correct under ASI
mm: asi: Stop ignoring asi=on cmdline flag
KVM: x86: asi: Add some mitigations on address space transitions
Junaid Shahid (8):
mm: asi: Make some utility functions noinstr compatible
mm: asi: Introduce ASI core API
mm: asi: Switch to unrestricted address space before a context switch
mm: asi: Use separate PCIDs for restricted address spaces
mm: asi: Make __get_current_cr3_fast() ASI-aware
mm: asi: ASI page table allocation functions
mm: asi: Functions to map/unmap a memory range into ASI page tables
mm: asi: Add basic infrastructure for global non-sensitive mappings
Ofir Weisse (1):
mm: asi: asi_exit() on PF, skip handling if address is accessible
Reiji Watanabe (1):
mm: asi: Map dynamic percpu memory as nonsensitive
Yosry Ahmed (1):
percpu: clean up all mappings when pcpu_map_pages() fails
arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/Kbuild | 1 +
arch/csky/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/loongarch/include/asm/Kbuild | 1 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 1 +
arch/x86/Kconfig | 27 ++
arch/x86/include/asm/asi.h | 267 +++++++++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/idtentry.h | 50 ++-
arch/x86/include/asm/kvm_host.h | 5 +
arch/x86/include/asm/nospec-branch.h | 2 +
arch/x86/include/asm/processor.h | 15 +-
arch/x86/include/asm/special_insns.h | 8 +-
arch/x86/include/asm/tlbflush.h | 5 +
arch/x86/kernel/process.c | 2 +
arch/x86/kernel/traps.c | 22 +
arch/x86/kvm/svm/svm.c | 2 +
arch/x86/kvm/vmx/nested.c | 8 +
arch/x86/kvm/vmx/vmx.c | 124 +++--
arch/x86/kvm/x86.c | 60 ++-
arch/x86/lib/retpoline.S | 7 +
arch/x86/mm/Makefile | 1 +
arch/x86/mm/asi.c | 748 +++++++++++++++++++++++++++++++
arch/x86/mm/fault.c | 119 ++++-
arch/x86/mm/init.c | 5 +-
arch/x86/mm/init_64.c | 25 +-
arch/x86/mm/mm_internal.h | 3 +
arch/x86/mm/tlb.c | 136 +++++-
arch/xtensa/include/asm/Kbuild | 1 +
include/asm-generic/asi.h | 84 ++++
include/asm-generic/vmlinux.lds.h | 11 +
include/linux/compiler_types.h | 8 +
include/linux/gfp_types.h | 15 +-
include/linux/mm_types.h | 7 +
include/linux/page-flags.h | 16 +
include/linux/pgtable.h | 3 +
include/trace/events/mmflags.h | 12 +-
kernel/fork.c | 3 +
kernel/sched/core.c | 3 +
mm/init-mm.c | 4 +
mm/internal.h | 2 +
mm/page_alloc.c | 143 +++++-
mm/percpu-vm.c | 52 ++-
mm/percpu.c | 4 +-
mm/vmalloc.c | 61 ++-
tools/objtool/check.c | 14 +
tools/perf/builtin-kmem.c | 1 +
62 files changed, 1977 insertions(+), 136 deletions(-)
---
base-commit: a38297e3fb012ddfa7ce0321a7e5a8daeb1872b6
change-id: 20240524-asi-rfc-24-2ea47c41352d
Best regards,
--
Brendan Jackman <jackmanb@xxxxxxxxxx>