[PATCH RFC v2 00/29] Address Space Isolation (ASI)

From: Brendan Jackman
Date: Fri Jan 10 2025 - 13:41:24 EST


ASI is a technique to mitigate a broad class of CPU vulnerabilities
by unmapping sensitive data from the kernel address space. If no data
is mapped that needs protecting, this class of exploits cannot leak
that data and so the kernel can skip expensive mitigation actions.
For a more detailed overview, see the v1 RFC (which was wrongly
labeled as a PATCH) [0].

This new iteration adds support for protecting against bare-metal
processes as well as KVM guests. The basic principle is unchanged.

.:: Multi-class ASI

So far ASI has been a KVM-only solution, although I've been claiming
that in principle it can be extended to also sandbox userspace.
Dave Hansen's most important feedback at LPC [1] was that he wanted
some evidence to support this claim. If it can be shown that ASI is
just as powerful for bare-metal as for KVM, it's much more likely to
actually offer an escape path from maintaining and reactively
developing per-exploit mitigations.

v1 already supported a notion of "ASI classes", with the only class
being KVM. This RFC introduces a second class for userspace. Each
process has a separate restricted address space ("domain") for each
class.

In v1, the only possible ASI transitions were between the KVM
restricted address space, and the unrestricted address space. Now
that there are multiple classes, it's possible to transition directly
between two restricted address spaces.

(Could we dodge this complexity by just transitioning via the
unrestricted address space? Yes, but experience from Google's
internal deployment suggests there's a significant benefit in
avoiding an asi_exit() when switching between userspace and KVM,
despite all the optimizations that exist to avoid that switching).
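
As a rough illustration of the difference, here is a toy userspace
model in C of per-process domains and the two ways of moving between
them. Every name in it (model_enter() and friends) is made up for this
sketch and is not the API from this series. It only counts
transitions; it does not touch page tables or CR3.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical class IDs: one per kind of sandboxee. */
enum model_class { MODEL_CLASS_KVM, MODEL_CLASS_USERSPACE, MODEL_NUM_CLASSES };

/* A "domain": one restricted address space, belonging to one class. */
struct model_domain {
	enum model_class class_id;
};

/* Each process has one domain per class. */
struct model_process {
	struct model_domain domains[MODEL_NUM_CLASSES];
};

/* Per-CPU view: which domain is active (NULL means unrestricted). */
struct model_cpu {
	const struct model_domain *current;
	unsigned int enters;	/* switches into a restricted address space */
	unsigned int exits;	/* switches back to the unrestricted one */
};

static void model_process_init(struct model_process *p)
{
	for (int c = 0; c < MODEL_NUM_CLASSES; c++)
		p->domains[c].class_id = (enum model_class)c;
}

/* Return to the unrestricted address space, if not already there. */
static void model_exit(struct model_cpu *cpu)
{
	if (cpu->current) {
		cpu->current = NULL;
		cpu->exits++;
	}
}

/* Direct transition: works from unrestricted or from another domain. */
static void model_enter(struct model_cpu *cpu,
			const struct model_domain *target)
{
	assert(target != NULL);
	if (cpu->current == target)
		return;
	cpu->current = target;
	cpu->enters++;
}

/* The simpler alternative: always bounce through unrestricted. */
static void model_enter_via_unrestricted(struct model_cpu *cpu,
					 const struct model_domain *target)
{
	model_exit(cpu);
	model_enter(cpu, target);
}
```

In this model, going from the userspace domain to the KVM domain with
model_enter() costs no exit, while model_enter_via_unrestricted() pays
one exit for the same switch. That extra exit is the cost the direct
transition is meant to avoid.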

Compared to v1, this version has a new mechanism to determine
what mitigation actions are required when switching between address
spaces. ASI classes provide a "taint policy" which describes what
uarch state their sandboxee might leave behind, and what uarch state
needs to be purged before their sandboxee can safely be run. The ASI
core takes care of doing the actual flushes.

This enables a reasonably advanced model of what flushes are needed
when; for example the kernel is now able to model "when transitioning
from a VMM to its KVM guest there is no point in flushing speculative
control flow state, but if we _later_ exit to the unrestricted
address space we do need to flush it". It's quite possible this is
actually more advanced than what is needed so suggestions are
welcome.
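
The tainting idea can be sketched as a small userspace model in C.
This is only an illustration under simplifying assumptions: the two
taint bits, the struct layout and all model_* names are hypothetical
and are not taken from the patches, and the unrestricted address space
is treated as if it had a policy of its own to keep the code short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t model_taints_t;

/* Hypothetical taint bits; a real policy would likely be finer-grained. */
#define MODEL_TAINT_CONTROL (1u << 0) /* untrusted code may have trained predictors */
#define MODEL_TAINT_DATA    (1u << 1) /* sensitive data may linger in uarch buffers */

/* What a class declares about its sandboxee. */
struct model_taint_policy {
	model_taints_t leaves_behind;	/* taints set once the sandboxee has run */
	model_taints_t must_be_clear;	/* taints to purge before the sandboxee runs */
};

struct model_cpu_taints {
	model_taints_t taints;
	unsigned int control_flushes;	/* stand-in for e.g. IBPB */
	unsigned int data_flushes;	/* stand-in for e.g. an L1D flush */
};

/* The "core": flush only what is both pending and forbidden by the target. */
static void model_transition(struct model_cpu_taints *cpu,
			     const struct model_taint_policy *target)
{
	model_taints_t to_purge;

	assert(cpu != NULL && target != NULL);
	to_purge = cpu->taints & target->must_be_clear;

	if (to_purge & MODEL_TAINT_CONTROL)
		cpu->control_flushes++;
	if (to_purge & MODEL_TAINT_DATA)
		cpu->data_flushes++;

	cpu->taints &= ~to_purge;
	cpu->taints |= target->leaves_behind;
}

/* A guest may poison control-flow state; it must not see stale kernel data. */
static const struct model_taint_policy model_guest_policy = {
	.leaves_behind = MODEL_TAINT_CONTROL,
	.must_be_clear = MODEL_TAINT_DATA,
};

/* Unrestricted code touches secrets; it must not run on poisoned predictors. */
static const struct model_taint_policy model_unrestricted_policy = {
	.leaves_behind = MODEL_TAINT_DATA,
	.must_be_clear = MODEL_TAINT_CONTROL,
};
```

Starting from a CPU that only carries the control taint, a transition
to the guest policy flushes nothing, and the control flush is deferred
until a later transition to the unrestricted policy. This mirrors the
VMM-to-guest example above.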

.:: Performance issues: bogus mitigation costs

Although this implementation of ASI is pretty generous in what it
considers "nonsensitive", there remain unnecessary performance costs
that need to be addressed. For example:

- The entire page cache is removed from the direct map. Traditional
  file operations will hit an asi_exit(), paying a pointless cost to
  protect data from a process that obviously has the right to read
  that data.
- Anything that accesses guest or user memory via the direct map
  instead of the user address space will hit an asi_exit().
- Pages being zeroed in the page allocator will hit an asi_exit().

Most of these issues existed in v1 too, but now that ASI sandboxes
userspace processes, the page-cache issue becomes very significant.
For FIO 4k read (I suppose this workload is maximally sensitive to
this issue) I saw a 70% degradation in throughput, with a Sapphire
Rapids machine hard-coded to perform IBPB and RSB-stuffing on
asi_exit().

Given a result like that I haven't gone into more detailed analysis.
Note also that I ran with an unrealistic mitigation policy. The
results would be quite different with platform-appropriate flushes,
but they would presumably lead to the same conclusion.

There are some interesting discussions to be had about tackling that
problem (e.g. reintroducing "local-nonsensitivity" from Junaid's 2022
ASI implementation [2], or creating ephemeral CPU-local mappings),
but for this RFC I prefer to focus on deciding if the overall
framework makes sense.

.:: Next steps

Aside from lack of userspace support, all the other issues listed in
RFCv1 remain. I'll also need a proof-of-concept solution for the
page-cache issue before we can credibly claim to be reaching a
[PATCH], but before that I want to develop a more complete page_alloc
integration. I plan to propose a topic about that at LSF/MM/BPF.

Anyway, despite the further research needed on my side I think
there's still useful stuff to discuss here. For example:

- Does the "tainting" model make intuitive sense? Is there a simpler
  way to achieve something similar?

- The taints offer a model for different parts of the kernel to
  communicate with each other about what mitigations they've taken
  care of. For example, KVM could clear ASI taints if its existing
  conditional-L1D-flush logic fires. Does it make sense to take
  advantage of this? (I think yes). How does this influence the
  design of the bugs.c kernel arguments?

- Suggestions on how to map file pages into processes that can read
  them, while minimizing TLB management pain.

Finally, a more extensive branch can be found at [3]. It has some tests
and some of the lower-hanging fruit for optimising performance of KVM
guests.

[0] RFC v1:
https://lore.kernel.org/linux-mm/20240712-asi-rfc-24-v1-0-144b319a40d8@xxxxxxxxxx/

[1] LPC session: https://lpc.events/event/18/contributions/1761/

[2] Junaid’s RFC:
https://lore.kernel.org/all/20220223052223.1202152-1-junaids@xxxxxxxxxx/

[3] GitHub branch:
https://github.com/googleprodkernel/linux-kvm/tree/asi-rfcv2-preview

Signed-off-by: Brendan Jackman <jackmanb@xxxxxxxxxx>

Ingo Molnar <mingo@xxxxxxxxxx>, Borislav Petkov <bp@xxxxxxxxx>,
Dave Hansen <dave.hansen@xxxxxxxxxxxxxxx>,
"H. Peter Anvin" <hpa@xxxxxxxxx>,
Andy Lutomirski <luto@xxxxxxxxxx>,
Peter Zijlstra <peterz@xxxxxxxxxxxxx>,
Sean Christopherson <seanjc@xxxxxxxxxx>,
Paolo Bonzini <pbonzini@xxxxxxxxxx>,
Alexandre Chartre <alexandre.chartre@xxxxxxxxxx>,
Liran Alon <liran.alon@xxxxxxxxxx>,
Jan Setje-Eilers <jan.setjeeilers@xxxxxxxxxx>,
Catalin Marinas <catalin.marinas@xxxxxxx>,
Will Deacon <will@xxxxxxxxxx>,
Mark Rutland <mark.rutland@xxxxxxx>,
Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx>,
Mel Gorman <mgorman@xxxxxxx>,
Lorenzo Stoakes <lstoakes@xxxxxxxxx>,
David Hildenbrand <david@xxxxxxxxxx>,
Vlastimil Babka <vbabka@xxxxxxx>,
Michal Hocko <mhocko@xxxxxxxxxx>,
Khalid Aziz <khalid.aziz@xxxxxxxxxx>,
Juri Lelli <juri.lelli@xxxxxxxxxx>,
Vincent Guittot <vincent.guittot@xxxxxxxxxx>,
Dietmar Eggemann <dietmar.eggemann@xxxxxxx>,
Steven Rostedt <rostedt@xxxxxxxxxxx>,
Valentin Schneider <vschneid@xxxxxxxxxx>,
Paul Turner <pjt@xxxxxxxxxx>, Reiji Watanabe <reijiw@xxxxxxxxxx>,
Junaid Shahid <junaids@xxxxxxxxxx>,
Ofir Weisse <oweisse@xxxxxxxxxx>,
Yosry Ahmed <yosryahmed@xxxxxxxxxx>,
Patrick Bellasi <derkling@xxxxxxxxxx>,
KP Singh <kpsingh@xxxxxxxxxx>,
Alexandra Sandulescu <aesa@xxxxxxxxxx>,
Matteo Rizzo <matteorizzo@xxxxxxxxxx>,
Jann Horn <jannh@xxxxxxxxxx>,
kvm@xxxxxxxxxxxxxxx, Brendan Jackman <jackmanb@xxxxxxxxxx>,
Dennis Zhou <dennis@xxxxxxxxxx>

---
Changes in v2:
- Added support for sandboxing userspace processes.
- Link to v1: https://lore.kernel.org/r/20240712-asi-rfc-24-v1-0-144b319a40d8@xxxxxxxxxx

---
Brendan Jackman (21):
      mm: asi: Make some utility functions noinstr compatible
      x86: Create CONFIG_MITIGATION_ADDRESS_SPACE_ISOLATION
      mm: asi: Introduce ASI core API
      mm: asi: Add infrastructure for boot-time enablement
      mm: asi: ASI support in interrupts/exceptions
      mm: asi: Avoid warning from NMI userspace accesses in ASI context
      mm: Add __PAGEFLAG_FALSE
      mm: asi: Map non-user buddy allocations as nonsensitive
      [TEMP WORKAROUND] mm: asi: Workaround missing partial-unmap support
      mm: asi: Map kernel text and static data as nonsensitive
      mm: asi: Map vmalloc/vmap data as nonsensitive
      mm: asi: Stabilize CR3 in switch_mm_irqs_off()
      mm: asi: Make TLB flushing correct under ASI
      KVM: x86: asi: Restricted address space for VM execution
      mm: asi: exit ASI before accessing CR3 from C code where appropriate
      mm: asi: Add infrastructure for mapping userspace addresses
      mm: asi: Restricted execution fore bare-metal processes
      x86: Create library for flushing L1D for L1TF
      mm: asi: Add some mitigations on address space transitions
      x86/pti: Disable PTI when ASI is on
      mm: asi: Stop ignoring asi=on cmdline flag

Junaid Shahid (4):
      mm: asi: Make __get_current_cr3_fast() ASI-aware
      mm: asi: ASI page table allocation functions
      mm: asi: Functions to map/unmap a memory range into ASI page tables
      mm: asi: Add basic infrastructure for global non-sensitive mappings

Ofir Weisse (1):
      mm: asi: asi_exit() on PF, skip handling if address is accessible

Reiji Watanabe (1):
      mm: asi: Map dynamic percpu memory as nonsensitive

Yosry Ahmed (2):
      mm: asi: Use separate PCIDs for restricted address spaces
      mm: asi: exit ASI before suspend-like operations

arch/alpha/include/asm/Kbuild | 1 +
arch/arc/include/asm/Kbuild | 1 +
arch/arm/include/asm/Kbuild | 1 +
arch/arm64/include/asm/Kbuild | 1 +
arch/csky/include/asm/Kbuild | 1 +
arch/hexagon/include/asm/Kbuild | 1 +
arch/loongarch/include/asm/Kbuild | 3 +
arch/m68k/include/asm/Kbuild | 1 +
arch/microblaze/include/asm/Kbuild | 1 +
arch/mips/include/asm/Kbuild | 1 +
arch/nios2/include/asm/Kbuild | 1 +
arch/openrisc/include/asm/Kbuild | 1 +
arch/parisc/include/asm/Kbuild | 1 +
arch/powerpc/include/asm/Kbuild | 1 +
arch/riscv/include/asm/Kbuild | 1 +
arch/s390/include/asm/Kbuild | 1 +
arch/sh/include/asm/Kbuild | 1 +
arch/sparc/include/asm/Kbuild | 1 +
arch/um/include/asm/Kbuild | 2 +-
arch/x86/Kconfig | 27 +
arch/x86/boot/compressed/ident_map_64.c | 10 +
arch/x86/boot/compressed/pgtable_64.c | 11 +
arch/x86/include/asm/asi.h | 306 +++++++++
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/disabled-features.h | 8 +-
arch/x86/include/asm/idtentry.h | 50 +-
arch/x86/include/asm/kvm_host.h | 3 +
arch/x86/include/asm/l1tf.h | 11 +
arch/x86/include/asm/nospec-branch.h | 2 +
arch/x86/include/asm/pgalloc.h | 6 +
arch/x86/include/asm/pgtable_64.h | 4 +
arch/x86/include/asm/processor-flags.h | 24 +
arch/x86/include/asm/processor.h | 20 +-
arch/x86/include/asm/pti.h | 6 +-
arch/x86/include/asm/special_insns.h | 45 +-
arch/x86/include/asm/tlbflush.h | 6 +
arch/x86/kernel/process.c | 2 +
arch/x86/kernel/process_32.c | 2 +-
arch/x86/kernel/process_64.c | 2 +-
arch/x86/kernel/traps.c | 22 +
arch/x86/kvm/Kconfig | 1 +
arch/x86/kvm/svm/svm.c | 2 +
arch/x86/kvm/vmx/nested.c | 6 +
arch/x86/kvm/vmx/vmx.c | 113 ++--
arch/x86/kvm/x86.c | 81 ++-
arch/x86/lib/Makefile | 1 +
arch/x86/lib/l1tf.c | 96 +++
arch/x86/lib/retpoline.S | 10 +
arch/x86/mm/Makefile | 1 +
arch/x86/mm/asi.c | 1039 ++++++++++++++++++++++++++++++
arch/x86/mm/fault.c | 124 +++-
arch/x86/mm/init.c | 7 +-
arch/x86/mm/init_64.c | 25 +-
arch/x86/mm/mm_internal.h | 3 +
arch/x86/mm/pti.c | 14 +-
arch/x86/mm/tlb.c | 167 ++++-
arch/x86/virt/svm/sev.c | 2 +-
arch/xtensa/include/asm/Kbuild | 1 +
drivers/firmware/efi/libstub/x86-5lvl.c | 2 +-
include/asm-generic/asi.h | 113 ++++
include/asm-generic/vmlinux.lds.h | 11 +
include/linux/entry-common.h | 11 +
include/linux/gfp.h | 5 +
include/linux/gfp_types.h | 15 +-
include/linux/mm_types.h | 7 +
include/linux/page-flags.h | 18 +
include/linux/pgtable.h | 3 +
include/trace/events/mmflags.h | 12 +-
init/main.c | 2 +
kernel/entry/common.c | 1 +
kernel/fork.c | 5 +
kernel/sched/core.c | 9 +
mm/init-mm.c | 4 +
mm/internal.h | 2 +
mm/mm_init.c | 1 +
mm/page_alloc.c | 160 ++++-
mm/percpu-vm.c | 50 +-
mm/percpu.c | 4 +-
mm/vmalloc.c | 53 +-
tools/perf/builtin-kmem.c | 1 +
80 files changed, 2582 insertions(+), 190 deletions(-)
---
base-commit: ebd6ea9c6976c64ed5af3e6dce672616447e8e62
change-id: 20241115-asi-rfc-v2-5d9bbb441186

Best regards,
--
Brendan Jackman <jackmanb@xxxxxxxxxx>