Re: [PATCH RFC] seccomp: Implement syscall isolation based on memory areas

From: Robert O'Callahan
Date: Thu Jun 25 2020 - 21:08:50 EST


On Fri, Jun 26, 2020 at 11:49 AM Gabriel Krisman Bertazi
<krisman@xxxxxxxxxxxxx> wrote:
> We couldn't patch Windows code because of the aforementioned DRM and
> anti-cheat mechanisms, but I suppose this limitation doesn't apply to
> Wine/native code, and if this assumption is correct, this approach could
> work.
>
> One complexity might be the consistent model for the syscall live
> patching. I don't know how much of the problem is diminished from the
> original userspace live-patching problem, but I believe at least part of
> it applies. And fencing every thread to patch would kill performance.
> Also, we cannot just patch everything at the beginning. How does rr
> handle that?

That's a good point. rr only allows one tracee thread to run at a time
for other reasons, so when we consider patching a syscall instruction,
we inspect all thread states to see if the patch would interfere with
any other thread, and skip patching it in that unlikely case. (We'll
try to patch it again next time that instruction is executed.) Wine
would need some other solution, but indeed that could be a
showstopper.

> Another problem is that we will want to support i386 and other
> architectures. For int 0x80, it is trickier to encode a branch to
> another region, given the limited instruction space, and the patching
> might not be possible in hot paths.

This is no worse than for x86-64 `syscall`, which is also two bytes.
We have pretty much always been able to patch the frequently executed
syscalls by replacing both the syscall instruction and an instruction
before or after the syscall with a five-byte jump, and folding the
extra instruction into the stub.

>I did port libsyscall-intercept to
> x86-32 once and I could correctly patch glibc, but it's not guaranteed
> that an updated libc or something else won't break it.

We haven't found this to be much of a problem in rr. From time to time
we have to add new patch patterns. The good news is that if things
change so a syscall can't be patched, the result is degraded
performance, not functional breakage.

> I'm not sure the benefit of not needing enhanced kernel support
> justifies the complexity and performance cost required to make this work
> reliably, in particular since the semantics for a kernel implementation
> that we are discussing doesn't seem overly intrusive and might have
> other applications like in the generic filter Andy mentioned.

That's fair --- our solution is complex. (But worth it --- for us,
it's valuable that rr works on quite old Linux kernels.)

As for performance, it performs well for us. I think we'd prefer our
current approach to Andy's hypothetical PR_SET_SYSCALL_THUNK because
the latter would have higher overhead (two trips into the kernel per
syscall). We definitely don't want to require CAP_SYS_ADMIN so that
rules out any eBPF-based alternative too. I would love to see a
low-overhead unprivileged syscall interception mechanism that would
obsolete our patching approach --- preferably one that's also
stackable so rr can record and replay processes that use the new
mechanism --- but I don't think any of the proposals here are that,
yet, unfortunately.

Rob
--
Su ot deraeppa sah dna Rehtaf eht htiw saw hcihw, efil lanrete eht uoy
ot mialcorp ew dna, ti ot yfitset dna ti nees evah ew; deraeppa efil
eht. Efil fo Drow eht gninrecnoc mialcorp ew siht - dehcuot evah sdnah
ruo dna ta dekool evah ew hcihw, seye ruo htiw nees evah ew hcihw,
draeh evah ew hcihw, gninnigeb eht morf saw hcihw taht.