Re: [PATCH v6 33/41] x86/shstk: Introduce map_shadow_stack syscall

From: Edgecombe, Rick P
Date: Wed Feb 22 2023 - 20:11:59 EST


On Wed, 2023-02-22 at 16:03 -0800, Deepak Gupta wrote:
> On Sat, Feb 18, 2023 at 01:14:25PM -0800, Rick Edgecombe wrote:
> > When operating with shadow stacks enabled, the kernel will
> > automatically
> > allocate shadow stacks for new threads, however in some cases
> > userspace
> > will need additional shadow stacks. The main example of this is the
> > ucontext family of functions, which require userspace allocating
> > and
> > pivoting to userspace managed stacks.
> >
> > Unlike most other user memory permissions, shadow stacks need to be
> > provisioned with special data in order to be useful. They need to
> > be setup
> > with a restore token so that userspace can pivot to them via the
> > RSTORSSP
> > instruction. But, the security design of shadow stack's is that
> > they
> > should not be written to except in limited circumstances. This
> > presents a
> > problem for userspace, as to how userspace can provision this
> > special
> > data, without allowing for the shadow stack to be generally
> > writable.
> >
> > Previously, a new PROT_SHADOW_STACK was attempted, which could be
> > mprotect()ed from RW permissions after the data was provisioned.
> > This was
> > found to not be secure enough, as other thread's could write to the
> > shadow stack during the writable window.
> >
> > The kernel can use a special instruction, WRUSS, to write directly
> > to
> > userspace shadow stacks. So the solution can be that memory can be
> > mapped
> > as shadow stack permissions from the beginning (never generally
> > writable
> > in userspace), and the kernel itself can write the restore token.
> >
> > First, a new madvise() flag was explored, which could operate on
> > the
> > PROT_SHADOW_STACK memory. This had a couple downsides:
> > 1. Extra checks were needed in mprotect() to prevent writable
> > memory from
> > ever becoming PROT_SHADOW_STACK.
> > 2. Extra checks/vma state were needed in the new madvise() to
> > prevent
> > restore tokens being written into the middle of pre-used shadow
> > stacks.
> > It is ideal to prevent restore tokens being added at arbitrary
> > locations, so the check was to make sure the shadow stack had
> > never been
> > written to.
> > 3. It stood out from the rest of the madvise flags, as more of
> > direct
> > action than a hint at future desired behavior.
> >
> > So rather than repurpose two existing syscalls (mmap, madvise) that
> > don't
> > quite fit, just implement a new map_shadow_stack syscall to allow
> > userspace to map and setup new shadow stacks in one step. While
> > ucontext
> > is the primary motivator, userspace may have other unforeseen
> > reasons to
> > setup it's own shadow stacks using the WRSS instruction. Towards
> > this
> > provide a flag so that stacks can be optionally setup securely for
> > the
> > common case of ucontext without enabling WRSS. Or potentially have
> > the
> > kernel set up the shadow stack in some new way.
>
> Was following ever attempted?
>
> void *shstk = mmap(0, size, PROT_SHADOWSTACK, ...);
> - limit PROT_SHADOWSTACK protection flag to only mmap (and thus
> mprotect can't
> convert memory from shadow stack to non-shadow stack type or vice
> versa)
> - limit PROT_SHADOWSTACK protection flag to anonymous memory only.
> - top level mmap handler to put a token at the base using WRUSS if
> prot == PROT_SHADOWSTACK
>
> You essentially would get shadow stack manufacturing with existing
> (single) syscall.
> Acting a bit selfish here, this allows other architectures as well to
> re-use this and
> do their own implementation of mapping and placing the token at the
> base.

Yes, I looked at it. You end up with a pile of checks and hooks added
to mmap() and various other places as you outline. We also now have the
MAP_ABOVE4G limitation for x86 shadow stack that would need checking
for too. It's not exactly a clean fit. Then, callers would have to pass
special x86 flags in anyway.

It doesn't seem like the complexity of the checks is worth saving the
tiny syscall. Is there some reason why riscv can't use the same syscall
stub? It doesn't need to live forever in x86 code. Not sure what the
savings are for riscv of the mmap+checks approach are either...

I did wonder if there could be some sort of more general syscall for
mapping and provisioning special security-type memory. But we probably
need a few more non-shadow stack examples to get an idea of what that
would look like.