Re: [PATCH 1/3] syscall_user_dispatch: Allow allowed range wrap-around
From: Dmitry Vyukov
Date: Tue Feb 18 2025 - 12:34:57 EST
On Tue, 18 Feb 2025 at 17:58, Gregory Price <gourry@xxxxxxxxxx> wrote:
>
> On Tue, Feb 18, 2025 at 05:04:34PM +0100, Dmitry Vyukov wrote:
> > There are two possible scenarios for syscall filtering:
> > - having a trusted/allowed range of PCs, and intercepting everything else
> > - or the opposite: a single untrusted/intercepted range and allowing
> > everything else
> > The current implementation only allows the former use case due to
> > the allowed range wrap-around check. Allow the latter use case as
> > well by removing the wrap-around check.
> > The latter use case is relevant for any kind of sandboxing scenario,
> > or monitoring behavior of a single library. If a program wants to
> > intercept syscalls for PC range [START, END) then it needs to call:
> > prctl(..., END, -(END-START), ...);
>
> I don't necessarily disagree with the idea, but this sounds like using
> the wrong tool for the job. The purpose of SUD was for emulating
> foreign OS system calls of entire programs - not a single library.
>
> The point being that it's very difficult to sandbox an individual
> library when you can't ensure it won't allocate resources outside the
> monitored bounds (this would be very difficult to guarantee, at least).
>
> If the intent is to load and re-use a single foreign-OS library, this
> change seems to be the question of "why not allow multiple ranges?",
> and you'd be on your way to reimplementing seccomp or BPF.
The problem with seccomp BPF is that the filter is inherited across
fork/exec, so it can't be combined with SIGSYS and a fine-grained custom
user-space policy. USER_DISPATCH is much more flexible in this regard.

Re allocating resources outside of the monitored bounds: this is exactly
what syscall filtering is for, right :)
If we install a filter on a library/sandbox, we can control it and
prevent it from allocating any more executable pages outside of the
range.

The motivation is sandboxing of libraries loaded within a known fixed
address range, while non-sandboxed code can live on both sides of the
sandboxed range (say, a non-PIE binary at lower addresses, and libc at
higher addresses).
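
To make the prctl(..., END, -(END-START), ...) arithmetic concrete, here
is a minimal user-space sketch of how a wrapped allowed range leaves only
[START, END) intercepted. The struct and function names are hypothetical.
The allowed check is modeled as an unsigned "pc - offset < len"
comparison, which is an assumption about how the dispatch path tests the
PC, not code taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type: the (offset, len) pair handed to prctl(). */
struct sud_range {
	uintptr_t offset;
	uintptr_t len;
};

/*
 * Build the allowed range that leaves only [start, end) intercepted.
 * The allowed range starts at end and wraps around the top of the
 * address space back up to start, so len is -(end - start) in
 * unsigned arithmetic.
 */
static struct sud_range sud_intercept_only(uintptr_t start, uintptr_t end)
{
	struct sud_range r;

	assert(start < end);
	r.offset = end;
	r.len = (uintptr_t)0 - (end - start);
	return r;
}

/*
 * Assumed model of the dispatch-time test: a PC is allowed (not
 * intercepted) when pc - offset < len, computed with unsigned
 * wrap-around.
 */
static bool sud_pc_allowed(struct sud_range r, uintptr_t pc)
{
	return (uintptr_t)(pc - r.offset) < r.len;
}
```

With START = 0x1000 and END = 0x2000 this gives offset = 0x2000 and
len = -0x1000, so PCs 0x1000..0x1fff are intercepted and everything on
either side of that range is allowed.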