Re: [PATCH 0/6] Memory Mapping (VMA) protection using PKU - set 1

From: Stephen Röttger
Date: Fri May 19 2023 - 07:14:19 EST


On Thu, May 18, 2023 at 11:04 PM Dave Hansen <dave.hansen@xxxxxxxxx> wrote:
>
> On 5/18/23 13:20, Jeff Xu wrote:>> Here's my concern about this whole
> thing: it's headed down a rabbit hole
> >> which is *highly* specialized both in the apps that will use it and the
> >> attacks it will mitigate. It probably *requires* turning off a bunch of
> >> syscalls (like io_uring) that folks kinda like in general.
> >>
> > ChromeOS currently disabled io_uring, but it is not required to do so.
> > io_uring supports the IORING_OP_MADVICE operation, which calls the
> > do_madvise() function. This means that io_uring will have the same
> > pkey checks as the madvice() system call. From that perspective, we
> > will fully support io_uring for this feature.
>
> io_uring fundamentally doesn't have the same checks. The kernel side
> work can be done from an asynchronous kernel thread. That kernel thread
> doesn't have a meaningful PKRU value. The register has a value, but
> it's not really related to the userspace threads that are sending it
> requests.
>
> >> We're balancing that highly specialized mitigation with a feature that
> >> add new ABI, touches core memory management code and signal handling.
> >>
> > The ABI change uses the existing flag field in pkey_alloc() which is
> > reserved. The implementation is backward compatible with all existing
> > pkey usages in both kernel and user space. Or do you have other
> > concerns about ABI in mind ?
>
> I'm not worried about the past, I'm worried any time we add a new ABI
> since we need to support it forever.
>
> > Yes, you are right about the risk of touching core mm code. To
> > minimize the risk, I try to control the scope of the change (it is
> > about 3 lines in mprotect, more in munmap but really just 3 effective
> > lines from syscall entry). I added new self-tests in mm to make sure
> > it doesn't regress in api behavior. I run those tests before and after
> > my kernel code change to make sure the behavior remains the same, I
> > tested it on 5.15 and 6.1 and 6.4-rc1. Actually, the testing
> > discovered a behavior change for mprotect() between 6.1 and 6.4 (not
> > from this patch, there are refactoring works going on in mm) see this
> > thread [1]
> > I hope those steps will help to mitigate the risk.
> >
> > Agreed on signaling handling is a tough part: what do you think about
> > the approach (modifying PKRU from saved stack after XSAVE), is there a
> > blocker ?
>
> Yes, signal entry and sigreturn are not necessarily symmetric so you
> can't really have a stack.
>
> >> On the x86 side, PKRU is a painfully special snowflake. It's exposed in
> >> the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel.
> >> This would be making it an even more special snowflake because it would
> >
> > I admit I'm quite ignorant on XSAVE to understand the above
> > statement, and how that is related. Could you explain it to me please
> > ? And what is in your mind that might improve the situation ?
>
> In a nutshell: XSAVE components are classified as either user or
> supervisor. User components can be modified from userspace and
> supervisor ones only from the kernel. In general, user components don't
> affect the kernel; the kernel doesn't care what is in ZMM11 (an
> XSAVE-managed register). That lets us do fun stuff like be lazy about
> when ZMM11 is saved/restored. Being lazy is good because it give us
> things like faster context switches and KVM VMEXIT handling.
>
> PKRU is a user component, but it affects the kernel when the kernel does
> copy_to/from_user() and friends. That means that the kernel can't do
> any "fun stuff" with PKRU. As soon as userspace provides a new value,
> the kernel needs to start respecting it. That makes PKRU a very special
> snowflake.
>
> So, even though PKRU can be managed by XSAVE, it isn't. It isn't kept
> in the kernel XSAVE buffer. But it *IS* in the signal stack XSAVE
> buffer. You *can* save/restore it with the other XSAVE components with
> ptrace(). The user<->kernel ABI pretends that PKRU is XSAVE managed
> even though it is not.
>
> All of this is special-cased. There's a ton of code to handle this
> mess. It's _complicated_. I haven't even started talking about how
> this interacts with KVM and guests.
>
> How could we improve it? A time machine would help to either change the
> architecture or have Linux ignore the fact that XSAVE knows anything
> about PKRU.
>
> So, the bar is pretty high for things that want to further muck with
> PKRU. Add signal and sigaltstack in particular into the fray, and we've
> got a recipe for disaster. sigaltstack and XSAVE don't really get along
> very well. https://lwn.net/Articles/862541/
>
> >> need new altstack ABI and handling.
> >>
> > I thought adding protected memory support to signaling handling is an
> > independent project with its own weight. As Jann Horn points out in
> > [2]: "we could prevent the attacker from corrupting the signal
> > context if we can protect the signal stack with a pkey." However,
> > the kernel will send SIGSEGV when the stack is protected by PKEY, so
> > there is a benefit to make this work. (Maybe Jann can share some more
> > thoughts on the benefits)
> >
> > And I believe we could do this in a way with minimum ABI change, as below:
> > - allocate PKEY with a new flag (PKEY_ALTSTACK)
> > - at sigaltstack() call, detect the memory is PKEY_ALTSTACK protected,
> > (similar as what mprotect does in this patch) and save it along with
> > stack address/size.
> > - at signaling handling, use the saved info to fill in PKRU.
> > The ABI change is similar to PKEY_ENFORCE_API, and there is no
> > backward compatibility issue.
> >
> > Will these mentioned help our case ? What do you think ?
>
> To be honest, no.
>
> What you've laid out here is the tip of the complexity iceberg. There
> are a lot of pieces of the kernel that are not yet factored in.
>
> Let's also remember: protection keys is *NOT* a security feature. It's
> arguable that pkeys is a square peg trying to go into a round security hole.

While they're not a security feature, they're pretty close to providing us with
exactly what we need: per-thread memory permissions that we can use for
in-process isolation.
We've spent quite some effort up front thinking about potential attacks and
we're confident we can build something that will pose a meaningful boundary.

> On the x86 side, PKRU is a painfully special snowflake. It's exposed in
> the "XSAVE" ABIs, but not actually managed *with* XSAVE in the kernel.
> This would be making it an even more special snowflake because it would
> need new altstack ABI and handling.

Most of the complexity in the signal handling proposal seems to come from the
saving/restoring pkru before/after the signal handler execution. However, this
is just nice to have. We just need the kernel to allow us to register
pkey-tagged memory as a sigaltstack, i.e. it shouldn't crash when trying to
write the register state to the stack. Everything else, we can do in userland.

> It probably *requires* turning off a bunch of
> syscalls (like io_uring) that folks kinda like in general.

Kind of. This approach only works in combination with an effort in userland to
restrict the syscalls. Though that doesn't mean you have to turn them off,
there's also the option of adding validation before it.
The same applies to the memory management syscalls in this patchset. We can add
validation for these in userland, but we're hoping to do it in kernel instead
for the reasons I mentioned before (e.g. they're very common and it's much
easier to validate in the kernel). Also subjectively it seems like a
nice property
if the pkey protections would not just apply to the memory contents, but also
apply to the metadata.

Attachment: smime.p7s
Description: S/MIME Cryptographic Signature