Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

From: Paul Gofman
Date: Sun May 31 2020 - 08:39:44 EST


Hi!

thanks for looking into this.

On 5/31/20 08:56, Gabriel Krisman Bertazi wrote:
>
>> Is it possible to disassemble and instrument the Windows code to insert
>> breakpoints (or emulation calls) at all the Windows syscall points?
> Hi Kees,
>
> I considered instrumenting the syscall instructions with calls to some
> wrapper, but I was told that modifying the game in memory or on disk
> will trigger all sorts of anti-cheating mechanisms (my main use case are
> windows games).

Yes, this is the case. Besides, before instrumenting, we would need some
way to find those syscalls in the highly obfuscated dynamically
generated code, the whole purpose of which is to prevent disassembling,
debugging and finding things like that in it.

Ultimately, even for the cases when it would be technically feasible I
don't think Wine could ever go for modifying the program's code (unless
of course this is the part of what the program is doing itself using
winapi calls). Wine is trying to implement the API as close to Windows
as possible so the DRM can work with Wine, modifying the program's code
is something different.

>
>>> [...]
>>> * Why not SECCOMP_MODE_FILTER?
>>>
>>> We experimented with dynamically generating BPF filters for whitelisted
>>> memory regions and using SECCOMP_MODE_FILTER, but there are a few
>>> reasons why it isn't enough nor a good idea for our use case:
>>>
>>> 1. We cannot set the filters at program initialization time and forget
>>> about it, since there is no way of knowing which modules will be loaded,
>>> whether native and windows. Filter would need a way to be updated
>>> frequently during game execution.
>>>
>>> 2. We cannot predict which Linux libraries will issue syscalls directly.
>>> Most of the time, whitelisting libc and a few other libraries is enough,
>>> but there are no guarantees other Linux libraries won't issue syscalls
>>> directly and break the execution. Adding every linux library that is
>>> loaded also has a large performance cost due to the large resulting
>>> filter.
>> Just so I can understand the expected use: given the dynamic nature of
>> the library loading, how would Wine be marking the VMAs?
> Paul (cc'ed) is the wine expert, but my understanding is that memory
> allocation and initial program load of the emulated binary will go
> through wine. It does the allocation and mark the vma accordingly
> before returning the allocated range to the windows application.
Yes, exactly. Pretty much any memory allocation which Wine does needs
syscalls (if those are ever encountered later during executing code from
those areas) to be trapped by Wine and passed to Wine's implementation
of the corresponding Windows API function. Linux native libraries
loading and memory allocations performed by them go outside of Wine control.
>
>>> Indeed, points 1 and 2 could be worked around with some userspace work
>>> and improved SECCOMP_MODE_FILTER support, but at a high performance and
>>> some stability cost, to obtain the semantics we want. Still, the
>>> performance would suffer, and SECCOMP_MODE_MEMMAP is non intrusive
>>> enough that I believe it should be considered as an upstream solution.
>> It looks like you're using SECCOMP_RET_TRAP for this? Signal handling
>> can be pretty slow. Did you try SECCOMP_RET_USER_NOTIF?

We are not much concerned with the overhead of the trapped syscall in
our use case, those are very rare. What we are concerned with is the
performance impact on the normal Linux syscalls when the syscall
trapping is enabled. When I was measuring that impact I've got the same
10% overhead for the untrapped syscalls (that is, always hitting
SECCOMP_RET_ALLOW case) with the filters Gabriel mentioned.


>>
>>> +
>>> + if (!(vma->vm_flags & VM_NOSYSCALL))
>>> + return 0;
>>> +
>>> + syscall_rollback(current, task_pt_regs(current));
>>> + seccomp_send_sigsys(this_syscall, 0);
>>> +
>>> + seccomp_log(this_syscall, SIGSYS, SECCOMP_RET_TRAP, true);
>>> +
>>> + return -1;
>>> +}
>> This really just looks like an ip_address filter, but I get what you
>> mean about stacking filters, etc. This may finally be the day we turn to
>> eBPF in seccomp, since that would give you access to using a map lookup
>> on ip_address, and the map could be updated externally (removing the
>> need for the madvise() changes).
> And 64-bit comparisons :)
>
> that would be a good solution, we'd still need to look at some details,
> like disabling/updating filters at runtime when some new library is
> loaded, but since we can update externally, I think it covers it.
I am afraid that for a general case the filter add / update / removal
should be done not just when a new library is loaded or unloaded but
pretty much any time new (executable) memory region is allocated or
deallocated, or PROT_EXEC should be changed on allocated pages . But I
am not sure I've got enough details yet on the suggested approach here
and might be missing important details. I guess maybe we could discuss
the details separately with Gabriel first.