Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

From: hpa
Date: Mon Jun 01 2020 - 13:49:07 EST

On June 1, 2020 6:59:26 AM PDT, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> On Jun 1, 2020, at 2:23 AM, Billy Laws <blaws05@xxxxxxxxx> wrote:
>>
>>> On May 30, 2020, at 5:26 PM, Gabriel Krisman Bertazi
><krisman@xxxxxxxxxxxxx> wrote:
>>> Andy Lutomirski <luto@xxxxxxxxxxxxxx> writes:
>>>>>>> On May 29, 2020, at 11:00 PM, Gabriel Krisman Bertazi
><krisman@xxxxxxxxxxxxx> wrote:
>>>>>> Modern Windows applications are executing system call
>>>>>> directly from the application's code without going through the
>>>>>> This breaks Wine emulation, because it doesn't have a chance to
>>>>>> intercept and emulate these syscalls before they are submitted to
>>>>>> In addition, we cannot simply trap every system call of the
>>>>>> to userspace using PTRACE_SYSEMU, because performance would
>>>>>> since our main use case is to run Windows games over Linux.
>>>>>> we need some in-kernel filtering to decide whether the syscall
>>>>>> issued by the wine code or by the windows application.
>>>> Do you really need in-kernel filtering? What if you could have
>>>> efficient userspace filtering instead? That is, set something up
>>>> that all syscalls, except those from a special address, are
>>>> to CALL thunk where the thunk is configured per task. Then the
>>>> can do whatever emulation is needed.
>>> Hi,
>>> I suggested something similar to my customer, by using
>>> libsyscall-intercept. The idea would be overwriting the syscall
>>> instruction with a call to the entry point. I'm not a specialist on
>>> specifics of Windows games, (cc'ed Paul Gofman, who can provide more
>>> details on that side), but as far as I understand, the reason why
>>> is not feasible is that the anti-cheat protection in games will
>>> execution if the binary region was modified either on-disk or
>>> Is there some mechanism to do that without modifying the
>> Hi,
>> I work on an emulator for the Nintendo Switch that uses a similar
>> in our testing it works very well and is much more performant than
>> To work around DRM reading the memory contents I think mprotect could
>> be used, after patching the syscall a copy of the original code could
>> kept somewhere in memory and the patched region mapped --X.
>> With this, any time the DRM attempts to read to the patched region
>> perform integrity checks it will cause a segfault and a branch to the
>> signal handler. This handler can then return the contents of the
>> unpatched region to satisfy those checks.
>> Are memory contents checked by DRM solutions too often for this to be
>> performant?
>A bigger issue is that hardware support for --X is quite spotty. There
>is no x86 CPU that can do it cleanly in a bare metal setup, and client
>CPUs that can do it at all without hypervisor help may be nonexistent.
>I don't know if the ARM situation is much better.
>> --
>> Billy Laws
>>>> Getting the details and especially the interaction with any seccomp
>>>> filters that may be installed right could be tricky, but the
>>>> should be decent, at least on non-PTI systems.
>>>> (If we go this route, I suspect that the correct interaction with
>>>> seccomp is that this type of redirection takes precedence over
>>>> and seccomp filters are not invoked for redirected syscalls. After
>>>> a redirected syscall is, functionally, not a syscall at all.)
>>> --
>>> Gabriel Krisman Bertazi

Running these things in a minimal VM container would allow this kind of filtering/trapping to be done in the VMM, too. I don't know how many layers deep you invoke native Linux libraries, and so whether the option would exist to use out-of-range system call numbers for the native Linux system calls?
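
To make the out-of-range numbering idea concrete, here is a rough sketch of how a dispatcher (in a VMM or elsewhere) might tell the two kinds of calls apart. Everything in it is an assumption made for illustration: NATIVE_SYSCALL_BASE, struct routed_call and route_syscall() are hypothetical names, the base value is arbitrary, and nothing here is an existing kernel, seccomp or Wine interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical tag value, chosen arbitrarily for this sketch. Native
 * Linux code would issue its system calls as (real number + base), so
 * any number below the base must come from the Windows application.
 */
#define NATIVE_SYSCALL_BASE 0x01000000u

struct routed_call {
	bool native;	/* true: forward to Linux, false: emulate */
	uint32_t nr;	/* system call number with the tag removed */
};

/* Classify one system call number seen by the dispatcher. */
static struct routed_call route_syscall(uint32_t nr)
{
	struct routed_call r;

	if (nr >= NATIVE_SYSCALL_BASE) {
		/* Tagged: strip the base and pass the call through. */
		r.native = true;
		r.nr = nr - NATIVE_SYSCALL_BASE;
	} else {
		/* Untagged: issued by the application, needs emulation. */
		r.native = false;
		r.nr = nr;
	}
	return r;
}

/* Small self-check showing the intended behaviour. */
static void route_syscall_selftest(void)
{
	struct routed_call a = route_syscall(NATIVE_SYSCALL_BASE + 39u);
	struct routed_call b = route_syscall(60u);

	assert(a.native && a.nr == 39u);
	assert(!b.native && b.nr == 60u);
}
```

The sketch only covers the classification step. Whether every native call site can realistically be made to use the tagged numbers is exactly the open question above.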
Sent from my Android device with K-9 Mail. Please excuse my brevity.