Re: [RFC v2 00/27] Kernel Address Space Isolation

From: Alexander Graf
Date: Sun Jul 14 2019 - 14:17:46 EST

On 12.07.19 16:36, Andy Lutomirski wrote:
On Fri, Jul 12, 2019 at 6:45 AM Alexandre Chartre
<alexandre.chartre@xxxxxxxxxx> wrote:


On 7/12/19 2:50 PM, Peter Zijlstra wrote:
On Fri, Jul 12, 2019 at 01:56:44PM +0200, Alexandre Chartre wrote:

I think that's precisely what makes ASI and PTI different and independent.
PTI is just about switching between userland and kernel page tables, while
ASI is about switching page tables inside the kernel. You can have ASI without
having PTI. You can also use ASI for kernel threads, i.e. for code that isn't
triggered from userland and so doesn't involve PTI.
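
For illustration, switching page tables inside the kernel could look roughly
like this (a minimal sketch only; struct asi, its fields, and the helper names
are assumptions for this example, not the RFC's actual interface):

/* Kernel-context sketch: ASI switches CR3 between two *kernel* page
 * tables, whereas PTI switches between the user and kernel page tables
 * on every kernel entry/exit. */
#include <asm/special_insns.h>  /* write_cr3() */
#include <asm/pgtable_types.h>  /* pgd_t */
#include <asm/page.h>           /* __pa() */

struct asi {
        pgd_t *restricted_pgd;  /* maps only non-sensitive kernel data */
        pgd_t *full_pgd;        /* the normal, full kernel mapping */
};

static void asi_enter(struct asi *asi)
{
        /* Run the critical section with the reduced kernel mapping. */
        write_cr3(__pa(asi->restricted_pgd));
}

static void asi_exit(struct asi *asi)
{
        /* On a fault in the restricted space (or an explicit exit),
         * fall back to the full kernel mapping and continue. */
        write_cr3(__pa(asi->full_pgd));
}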

PTI is not mapping kernel space to avoid speculation crap (Meltdown).
ASI is not mapping part of kernel space to avoid (different) speculation crap (MDS).

See how very similar they are?


Furthermore, to recover SMT for userspace (under MDS) we not only need
core-scheduling but core-scheduling per address space. And ASI was
specifically designed to help mitigate the trainwreck just described.

By explicitly exposing a (hopefully harmless) part of the kernel to MDS,
we reduce the part that needs core-scheduling and thus reduce the rate at
which the SMT siblings need to sync up/schedule.

But looking at it that way, it makes no sense to retain 3 address
spaces, namely:

user / kernel-exposed / kernel-private.

Specifically, it makes no sense to expose part of the kernel through MDS
but not through Meltdown. Therefore we can merge the user and kernel-exposed
address spaces.

The goal of ASI is to provide a reduced address space which excludes sensitive
data. A user process (for example a database daemon, a web server, or a VMM
like QEMU) will likely have sensitive data mapped in its user address space.
Such data shouldn't be mapped with ASI because it can potentially leak to the
sibling hyperthread. For example, if a hyperthread is running a VM then the
VM could potentially access sensitive user data if it is mapped with ASI on
the sibling hyperthread.

So I've proposed the following slightly hackish thing:

Add a mechanism (call it /dev/xpfo). When you open /dev/xpfo and
fallocate it to some size, you allocate that amount of memory and kick
it out of the kernel direct map. (And pay the IPI cost unless there
were already cached non-direct-mapped pages ready.) Then you map
*that* into your VMs. Now, for a dedicated VM host, you map *all* the
VM private memory from /dev/xpfo. Pretend it's SEV if you want to
determine which pages can be set up like this.

Does this get enough of the benefit at a negligible fraction of the
code complexity cost? (This plus core scheduling, anyway.)
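
To make the proposal concrete, a minimal sketch of how a VMM might consume
such a device (note /dev/xpfo does not exist; the device name and its
open/fallocate/mmap semantics are assumptions taken from the description
above, with error handling kept minimal):

#define _GNU_SOURCE             /* for fallocate() */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define GUEST_MEM_SIZE  (1UL << 30)     /* 1 GiB of VM-private memory */

int main(void)
{
        void *guest_mem;
        int fd;

        fd = open("/dev/xpfo", O_RDWR);
        if (fd < 0) {
                perror("open /dev/xpfo");
                return 1;
        }

        /* The fallocate() is what allocates the memory and kicks it
         * out of the kernel direct map (paying the IPI cost up front
         * unless non-direct-mapped pages were already cached). */
        if (fallocate(fd, 0, 0, GUEST_MEM_SIZE) < 0) {
                perror("fallocate");
                return 1;
        }

        /* Map the direct-map-free memory into the VMM; this region is
         * what would be handed to the guest (e.g. via a KVM memory
         * region) as its private memory. */
        guest_mem = mmap(NULL, GUEST_MEM_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (guest_mem == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        /* ... run the VM ... */

        munmap(guest_mem, GUEST_MEM_SIZE);
        close(fd);
        return 0;
}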

The problem with that approach is that you lose the ability to run legacy workloads that do not support an SEV-like model of "guest-owned" and "host-visible" pages, but instead assume they can DMA anywhere.

Without that, your host will have visibility into guest pages via user space (QEMU) mappings, which in turn live in the kernel direct map and so can be exposed via a Spectre gadget to a malicious guest.

Also, please keep in mind that even register state of other VMs may be a secret that we do not want to leak into other guests.


Alex