Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

From: Sargun Dhillon
Date: Fri Jun 05 2020 - 02:06:53 EST

On Fri, May 29, 2020 at 11:01 PM Gabriel Krisman Bertazi
<krisman@xxxxxxxxxxxxx> wrote:
> Modern Windows applications are executing system call instructions
> directly from the application's code without going through the WinAPI.
> This breaks Wine emulation, because it doesn't have a chance to
> intercept and emulate these syscalls before they are submitted to Linux.
> In addition, we cannot simply trap every system call of the application
> to userspace using PTRACE_SYSEMU, because performance would suffer,
> since our main use case is to run Windows games over Linux. Therefore,
> we need some in-kernel filtering to decide whether the syscall was
> issued by the wine code or by the windows application.
> The filtering cannot really be done based solely on the syscall number,
> because those could collide with existing Linux syscalls. Instead, our
> proposed solution is to trap syscalls based on the userspace memory
> region that triggered the syscall, as wine is responsible for the
> Windows code allocations and it can apply correct memory protections to
> those areas.
> Therefore, this patch reuses the seccomp infrastructure to trap
> system calls, but introduces a new mode to trap based on a vma attribute
> that describes whether the userspace memory region is allowed to execute
> syscalls or not. The protection is defined at mmap/mprotect time with a
> new protection flag PROT_NOSYSCALL. This setting only takes effect if
> the new SECCOMP_MODE_MEMMAP is enabled through seccomp().
> It goes without saying that this is in no way a security mechanism
> despite being built on top of seccomp, since an evil application can
> always jump to a whitelisted memory region and run the syscall. This
> is not a concern for Wine games. Nevertheless, we reuse seccomp as a
> way to avoid adding a new mechanism to essentially do the same job of
> filtering system calls.
> We experimented with dynamically generating BPF filters for whitelisted
> memory regions and using SECCOMP_MODE_FILTER, but there are a few
> reasons why it isn't enough nor a good idea for our use case:
> 1. We cannot set the filters at program initialization time and forget
> about it, since there is no way of knowing which modules will be loaded,
> whether native and windows. Filter would need a way to be updated
> frequently during game execution.
> 2. We cannot predict which Linux libraries will issue syscalls directly.
> Most of the time, whitelisting libc and a few other libraries is enough,
> but there are no guarantees other Linux libraries won't issue syscalls
> directly and break the execution. Adding every linux library that is
> loaded also has a large performance cost due to the large resulting
> filter.
> 3. As I mentioned before, performance is critical. In our testing with
> just a single memory segment blacklisted/whitelisted, the minimum size
> of a bpf filter would be 4 instructions. In that scenario,
> SECCOMP_MODE_FILTER added an average overhead of 10% to the execution
> time of sysinfo(2) in comparison to seccomp disabled, while the impact
> of SECCOMP_MODE_MEMMAP was averaged around 1.5%.
> Indeed, points 1 and 2 could be worked around with some userspace work
> and improved SECCOMP_MODE_FILTER support, but at a high performance and
> some stability cost, to obtain the semantics we want. Still, the
> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
> enough that I believe it should be considered as an upstream solution.
> Sending as an RFC for now to get the discussion started. In particular:

I have a totally different question. I am experimenting with a patchset
which is designed to help with the "extended syscall" case (as Kees
calls it). Effectively, syscalls like openat2, where the syscall
arguments are passed as a (potentially mixed size) structure, need to
be able to be inspected through user notif. We can kind-of deal with
this with other syscalls with mechanisms like pidfd_getfd, addfd, and
potentially being able to (re)set the registers prior to actual
invocation of the syscall. Unfortunately, you cannot do the same trick
with user memory, because it opens you up to a time-of-check,
time-of-use attack, since the kernel copies the syscall arguments from
the invoking program again.
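
To make the race concrete, here is a minimal userspace sketch of the
double fetch. It is only a model, not kernel code: struct how_args is a
local stand-in whose layout mirrors struct open_how, and
supervisor_allows(), fetch() and double_fetch_differs() are hypothetical
names invented for this illustration. The "tamper" step stands in for
another thread of the invoking program rewriting the structure between
the manager's read and the kernel's own copy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local stand-in for an extensible argument struct (mirrors struct open_how). */
struct how_args {
	uint64_t flags;
	uint64_t mode;
	uint64_t resolve;
};

/* Hypothetical policy: the manager only approves flags == 0. */
static int supervisor_allows(const struct how_args *snapshot)
{
	return snapshot->flags == 0;
}

/* Each fetch models an independent copy out of user memory. */
static struct how_args fetch(const struct how_args *user_mem)
{
	struct how_args copy;

	memcpy(&copy, user_mem, sizeof(copy));
	return copy;
}

/*
 * Returns -1 if the manager rejects the call, 1 if the copy the "kernel"
 * acts on differs from the copy that was checked, 0 otherwise.
 */
static int double_fetch_differs(struct how_args *user_mem, int tamper)
{
	struct how_args checked, used;

	assert(user_mem != NULL);

	checked = fetch(user_mem);		/* time of check */
	if (!supervisor_allows(&checked))
		return -1;

	if (tamper)
		user_mem->flags = 0x2;		/* models a racing writer */

	used = fetch(user_mem);			/* time of use */
	return memcmp(&checked, &used, sizeof(used)) != 0;
}
```

With tamper set to 0 the two copies match; with tamper set to 1 the
manager has approved flags == 0 while the second copy sees 0x2, which is
the window the approach below tries to close.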

One of the things I've been experimenting with is using tricks like
userfaultfd / mprotect to try to deal with this. I think that I might
have to add some capability to the kernel to actually deal with this.
In general, the approach is:
1. Syscall is invoked, and wakes up the manager
2. The manager gets the arguments, and a handle (either the ID, or an
   FD). It then uses this ID to read memory. Either something like
   process_vm_readv, an ioctl, or read.
3. When the kernel reads these arguments, it splits the VMA for the
   address the pointer lies in, and sets up access() with a special
   mapping that checks if the page has been tampered with by userspace
   in the read ranges between the manager read and the writes. We can
   either SIGBUS or stall writes to the range if we want to make things
   "simple", or we can mess with uaccess bits and EPERM if the kernel
   tries to read that memory.
4. When the syscall returns, or the kernel writes to that area, we
   reset the mapping.
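
For step 2, the process_vm_readv variant can be sketched with what
exists today. This is an assumption-laden illustration rather than the
proposed kernel capability: read_remote() is a hypothetical helper name,
and in a real manager the pid and the remote address would presumably
come from the seccomp notification (the pid and data.args[] fields of
struct seccomp_notif). On its own this read does nothing to stop the
target from rewriting the memory afterwards, which is why steps 3 and 4
are needed.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Copy len bytes from remote_addr in process pid into buf.
 * Returns 0 on a complete read, -1 with errno set otherwise.
 */
static int read_remote(pid_t pid, uintptr_t remote_addr, void *buf, size_t len)
{
	struct iovec local = { .iov_base = buf, .iov_len = len };
	struct iovec remote = { .iov_base = (void *)remote_addr, .iov_len = len };
	ssize_t n;

	n = process_vm_readv(pid, &local, 1, &remote, 1, 0);
	if (n < 0)
		return -1;		/* errno set by process_vm_readv */
	if ((size_t)n != len) {
		errno = EFAULT;		/* partial read: treat as a failure */
		return -1;
	}
	return 0;
}
```

A manager would call something like
read_remote(notif_pid, notif_arg_addr, &how, sizeof(how)) after
receiving the notification, where notif_pid and notif_arg_addr are
placeholder names for the values taken from it.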

I'm wondering if you're dynamically generating these special mappings
with protection, and how many of them you're generating. How often are
you generating them? What kind of performance cost do you see in normal
programs?