Re: [PATCH 22/35] x86/mm: Prevent VM_WRITE shadow stacks

From: Edgecombe, Rick P
Date: Fri Feb 11 2022 - 20:44:36 EST


On Fri, 2022-02-11 at 14:19 -0800, Dave Hansen wrote:
> On 1/30/22 13:18, Rick Edgecombe wrote:
> > Shadow stack accesses are writes from handle_mm_fault()
> > perspective. So to
> > generate the correct PTE, maybe_mkwrite() will rely on the presence
> > of
> > VM_SHADOW_STACK or VM_WRITE in the vma.
> >
> > In future patches, when VM_SHADOW_STACK is actually creatable by
> > userspace, a problem could happen if a user calls
> > mprotect( , , PROT_WRITE) on VM_SHADOW_STACK shadow stack memory.
> > The code
> > would then be confused in the event of shadow stack accesses, and
> > create a
> > writable PTE for a shadow stack access. Then the process would
> > fault in a
> > loop.
> >
> > Prevent this from happening by blocking this kind of memory
> > (VM_WRITE and
> > VM_SHADOW_STACK) from being created, instead of complicating the
> > fault
> > handler logic to handle it.
> >
> > Add an x86 arch_validate_flags() implementation to handle the
> > check.
> > Rename the uapi/asm/mman.h header guard to be able to use it for
> > arch/x86/include/asm/mman.h where the arch_validate_flags() will
> > be.
>
> It would be great if this also said:
>
> There is an existing arch_validate_flags() hook for mmap() and
> mprotect() which allows architectures to reject unwanted
> ->vm_flags combinations. Add an implementation for x86.
>
> That's somewhat implied from what is there already, but making it
> more
> clear would be nice. There's a much higher bar to add a new arch
> hook
> than to just implement an existing one.

Ok, makes sense.

>
>
> > diff --git a/arch/x86/include/asm/mman.h
> > b/arch/x86/include/asm/mman.h
> > new file mode 100644
> > index 000000000000..b44fe31deb3a
> > --- /dev/null
> > +++ b/arch/x86/include/asm/mman.h
> > @@ -0,0 +1,21 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _ASM_X86_MMAN_H
> > +#define _ASM_X86_MMAN_H
> > +
> > +#include <linux/mm.h>
> > +#include <uapi/asm/mman.h>
> > +
> > +#ifdef CONFIG_X86_SHADOW_STACK
> > +static inline bool arch_validate_flags(unsigned long vm_flags)
> > +{
> > + if ((vm_flags & VM_SHADOW_STACK) && (vm_flags & VM_WRITE))
> > + return false;
> > +
> > + return true;
> > +}
>
> The design decision here seems to be that VM_SHADOW_STACK is itself a
> pseudo-VM_WRITE flag. Like you said: "Shadow stack accesses are
> writes
> from handle_mm_fault()".
>
> Very early on, this series seems to have made the decision that
> shadow
> stacks are writable and need lots of write handling behavior, *BUT*
> shouldn't have VM_WRITE set. As a whole, that seems odd.
>
> The alternative would be *requiring* VM_WRITE and VM_SHADOW_STACK be
> set
> together. I guess the downside is that pte_mkwrite() would need to
> be
> made to work on shadow stack PTEs.
>
> That particular design decision was never discussed. I think it has
> a
> really big impact on the rest of the series. What do you think? Was
> it
> a good idea? Or would the alternative be more complicated than what
> you
> have now?

First of all, thanks again for the deep review of the MM piece. I'm
still pondering the overall problem, which is why I haven't responded
to those yet.

I had originally thought that the MM changes were a bit hard to follow.
I was also somewhat amazed at how naturally normal COW worked. I was
wondering where the big COW stuff would be happening. In the way that
COW was sort of tucked away, overloading writability seemed sort of
aligned. But the names are very confusing, and this patch probably
should have been a hint that there are problems design wise.

For writability, especially with WRSS, I do think it's a bit unnatural
to think of shadow stack memory as anything but writable. Especially
when it comes to COW. But shadow stack accesses are not always writes,
incssp for example. The code will create shadow stack memory for shadow
stack access loads, which of course isn't writing anything, but is
required to make the instruction work. So it calls mkwrite(), which is
weird. But... it does need to leave it in a state that is kind of
writable, so makes a little sense I guess.

I was wondering if maybe the mm code can't be fully sensible for shadow
stacks without creating maybe_mkshstk() and adding it everywhere in a
whole new fault path. Then you have reads, writes and shadow stack
accesses that each have their own logic. It might require so many
additions that better names and comments are preferable. I don't know
though, still trying to come up with a good opinion.