Re: [PATCH RFT v8 0/9] fork: Support shadow stacks in clone3()
From: Jann Horn
Date: Fri Aug 16 2024 - 11:53:09 EST
On Thu, Aug 8, 2024 at 10:16 AM Mark Brown <broonie@xxxxxxxxxx> wrote:
> Since clone3() is readily extensible let's add support for specifying a
> shadow stack when creating a new thread or process in a similar manner
> to how the normal stack is specified, keeping the current implicit
> allocation behaviour if one is not specified either with clone3() or
> through the use of clone(). The user must provide a shadow stack
> address and size, this must point to memory mapped for use as a shadow
> stack by map_shadow_stack() with a shadow stack token at the top of the
> stack.
As a heads-up so you don't get surprised by this in the future:
Because clone3() does not pass the flags in a register like clone()
does, it is not available in places like docker containers that use
the default Docker seccomp policy
(https://github.com/moby/moby/blob/master/profiles/seccomp/default.json).
Docker uses seccomp to filter clone() arguments (to prevent stuff like
namespace creation), and that's not possible with clone3(), so
clone3() is blocked.
The same thing applies to things like sandboxed renderer processes of
web browsers - they want to block anything other than creating normal
threads, so they use seccomp to block stuff like namespace creation
and creating new processes.
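As an illustration of the limitation described above, here is a minimal sketch of a classic-BPF seccomp filter in the same spirit. It is not Docker's or any browser's actual policy. It assumes x86_64 (where the clone() flags are the first argument) and a little-endian layout, and the names filter, NS_FLAGS, install_filter and run_filter are made up for this example. The filter can test the clone() flags because they arrive in a register and so show up in seccomp_data.args[0]. For clone3() that slot only holds a pointer to struct clone_args, which the filter cannot dereference, so the sketch refuses clone3() outright with ENOSYS.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/sched.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>

#ifndef __NR_clone3
#define __NR_clone3 435
#endif

/* Namespace-creating flags that the sketch refuses in clone(). */
#define NS_FLAGS (CLONE_NEWNS | CLONE_NEWCGROUP | CLONE_NEWUTS | \
                  CLONE_NEWIPC | CLONE_NEWUSER | CLONE_NEWPID | CLONE_NEWNET)

static struct sock_filter filter[] = {
    /* [0] Only x86_64 is handled; kill on any other architecture. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, arch)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, AUDIT_ARCH_X86_64, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL_PROCESS),
    /* [3] clone3(): the flags live in user memory, so refuse the call. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_clone3, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | ENOSYS),
    /* [6] Anything other than clone() is allowed by this sketch. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_clone, 0, 3),
    /* [7] clone(): load the low 32 bits of args[0], i.e. the flags. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, args[0])),
    BPF_JUMP(BPF_JMP | BPF_JSET | BPF_K, NS_FLAGS, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | EPERM),
    /* [10] */
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
};

/* Install the filter on the calling thread (irreversible). */
int install_filter(void)
{
    struct sock_fprog prog = {
        .len = sizeof(filter) / sizeof(filter[0]),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}

/* Tiny evaluator for the few cBPF opcodes used above, so the filter's
 * verdict can be inspected in userspace without installing it. */
uint32_t run_filter(const struct seccomp_data *sd)
{
    const size_t n = sizeof(filter) / sizeof(filter[0]);
    uint32_t acc = 0;

    for (size_t pc = 0; pc < n; pc++) {
        const struct sock_filter *f = &filter[pc];

        switch (f->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            memcpy(&acc, (const char *)sd + f->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            pc += (acc == f->k) ? f->jt : f->jf;
            break;
        case BPF_JMP | BPF_JSET | BPF_K:
            pc += (acc & f->k) ? f->jt : f->jf;
            break;
        case BPF_RET | BPF_K:
            return f->k;
        default:
            assert(0 && "opcode not handled by this sketch");
        }
    }
    return SECCOMP_RET_KILL_PROCESS;
}
```

Feeding run_filter() a seccomp_data with nr set to __NR_clone and CLONE_NEWUSER in args[0] yields SECCOMP_RET_ERRNO | EPERM, while a plain thread-creation flag set is allowed. For __NR_clone3 the verdict is the same whatever args[0] contains, which is the whole problem: a feature reachable only through clone3() is unavailable under such a policy.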
I briefly mentioned this here during clone3 development, though I
probably should have been more explicit about how it would be
beneficial for clone3 to pass flags in a register:
<https://lore.kernel.org/all/CAG48ez3q=BeNcuVTKBN79kJui4vC6nw0Bfq6xc-i0neheT17TA@xxxxxxxxxxxxxx/>
So if you want your feature to be available in such contexts, you'll
probably have to either add a new syscall clone4() that passes the
flags in a register; or do the plumbing work required to make it
possible to seccomp-filter things other than register contexts (by
invoking seccomp again from the clone3 handler with some kinda
pseudo-syscall?); or change the signature of the existing syscall (but
that would require something like using the high bit of the size to
signal that there's a flags argument in another register, which is
probably more ugly than just adding a new syscall).