Re: Kernel prctl feature for syscall interception and emulation

From: Gabriel Krisman Bertazi
Date: Thu Nov 19 2020 - 12:33:00 EST


Rich Felker <dalias@xxxxxxxx> writes:

> On Thu, Nov 19, 2020 at 11:15:46AM -0500, Gabriel Krisman Bertazi wrote:
>> Rich Felker <dalias@xxxxxxxx> writes:
>>
>> > On Wed, Nov 18, 2020 at 01:57:26PM -0500, Gabriel Krisman Bertazi via Libc-alpha wrote:
>>
>> [...]
>>
>> >
>> > SIGSYS (or signal handling in general) is not the right way to do
>> > this. It has all the same problems that came up in seccomp filtering
>> > with SIGSYS, and which were solved by user_notif mode (running the
>> > interception in a separate thread rather than an async context
>> > interrupting the syscall. In fact I wouldn't be surprised if what you
>> > want can already be done with reasonable efficiency using seccomp
>> > user_notif.
>>
>> Hi Rich,
>>
>> User_notif was raised in the kernel discussion and we had experimented
>> with it, but the latency of user_notif is even worse than what we can do
>> right now with other seccomp actions.
>
> Is there a compelling argument that the latency matters here? What
> syscalls are windows binaries making like this? Is there a reason you
> can't do something like intercepting the syscall with seccomp the
> first time it happens, then rewriting the code not to use a direct
> syscall on future invocations?

We can't do any code rewriting without tripping DRM protections and
anti-cheating mechanisms.

I should correct myself here. While it is true that user_notif is
slower than other seccomp actions, this is not a problem in itself. The
frequency of syscalls that need to be emulated is much smaller than
regular syscalls, and the performance problem actually appears due to
the filtering. I should investigate user_notif more, but I don't oppose
SUD doing user_notif instead of SIGSYS. I will raise that with Wine
developers and the kernel community.

>> Regarding SIGSYS, the x86 maintainer suggested redirecting the syscall
>> return to a userspace thunk, but the understanding among Wine developers
>> is that SIGSYS is enough for their emulation needs.
>
> It might work for Wine needs, if Wine can guarantee it will never be
> running code with signals blocked and some other constraints, but then
> you end up with a mechanism that's designed just for Wine and that
> will have gratuitous reasons it's not usable elsewhere. That does not
> seem appropriate for inclusion in kernel.
>
>> > The default-intercept and excepting libc code segment is also bogus,
>> > and will break stuff, including vdso syscall mechanism on i386 and any
>> > code outside libc that makes its own syscalls from asm. If you need to
>> > tag regions to control interception, it should be tagging the emulated
>> > Windows guest code, which is bounded and you have full control over,
>> > rather than the host code, which is unbounded and includes any
>> > libraries that get linked indirectly by Wine.
>>
>> The vdso trampoline, for the architectures that have it, is solved by
>> the kernel implementation, who makes sure that region is allowed.
>
> I guess that works but it's ugly and assumes particular policy goals
> matching Wine's rather than being a general mechanism.
>
>> The Linux code is not bounded, but the dispatcher region main goal is to
>> support trampolines outside of the vdso case. The correct userspace
>> implementation requires flipping the selector on any Windows/Linux code
>> boundary cross, exactly because other libraries can issue syscalls
>> directly. The fact that libc is not the only one issuing syscalls is
>> the exact reason we need something more complex than a few seccomp
>> filters.
>
> I don't think this is correct. Rather than listing all the host
> library code ranges to allow, you just list all the guest Windows code
> ranges to intercept. Wine knows them by virtue of being the loader for
> them. This all seems really easy to do with seccomp with a very small
> filter.

The Windows code is not completely loaded at initialization time. It
also has dynamic libraries loaded later. yes, wine knows the memory
regions, but there is no guarantee there is a small number of segments
or that the full picture is known at any given moment.

>> > But I'm skeptical that doing any new kernel-side logic for tagging is
>> > needed. Seccomp already lets you filter on instruction pointer so you
>> > can install filters that will trigger user_notif just for guest code,
>> > then let you execute the emulation in the watcher thread and skip the
>> > actual syscall in the watched thread.
>>
>> As I mentioned, we can check IP in seccomp and write filters. But this
>> has two problems:
>>
>> 1) Performance. seccomp filters use cBPF which means 32bit comparisons,
>> no maps and a very limited instruction set. We need to generate
>> boundary checks for each memory segment. The filter becomes very large
>> very quickly and becomes a observable bottleneck.
>
> This sounds like you're doing something wrong. Range checking is O(log
> n) and n cannot be large enough to make log n significant. If you do
> it with a linear search rather than binary then of course it's slow.

And SUD is O(1). The filtering overhead is the big point here. The
seccomp kselftests benchmark shows a 32% overhead introduced by seccomp
for a simple getpid syscall. With a second filter (not a second
verification on the same filter), the overhead goes to 47%. SUD shows
an overhead of 13.4% over the same syscall.

I understand two filters is very different than 1 filter with more vmas,
but since we cannot remove filters, we'd need to add more filters to
make it more strict.

>> 2) Seccomp filters cannot be removed. And we'd need to update them
>> frequently.
>
> What are the updating requirements?

As far as I understand (I'm not a wine developer), they need to remove
and modify filters. Given seccomp is a security feature, It would be a
hard sell to support these operations. We discussed this on the kernel
list.

> I'm not sure if Windows code is properly PIC or not, but if it is,
> then you just do your own address assignment in a single huge range
> (first allocated with PROT_NONE, then MAP_FIXED over top of it) so
> that a single static range check suffices.

I'm Cc'ing some wine developers who can assist with this point.

--
Gabriel Krisman Bertazi