Re: [PATCH v2] x86/vsyscall: allow seccomp filter in vsyscall=emulate

From: Will Drewry
Date: Sat Jul 14 2012 - 00:06:08 EST


On Fri, Jul 13, 2012 at 7:48 PM, Will Drewry <wad@xxxxxxxxxxxx> wrote:
> On Fri, Jul 13, 2012 at 6:00 PM, Andrew Lutomirski <luto@xxxxxxx> wrote:
>> On Fri, Jul 13, 2012 at 10:06 AM, Will Drewry <wad@xxxxxxxxxxxx> wrote:
>>> If a seccomp filter program is installed, older static binaries and
>>> distributions with older libc implementations (glibc 2.13 and earlier)
>>> that rely on vsyscall use will be terminated regardless of the filter
>>> program policy when executing time, gettimeofday, or getcpu. This is
>>> only the case when vsyscall emulation is in use (vsyscall=emulate is the
>>> default).
>>>
>>> This patch emulates system call entry inside a vsyscall=emulate by
>>> populating regs->ax and regs->orig_ax with the system call number prior
>>> to calling into seccomp such that all seccomp-dependencies function
>>> normally. Additionally, system call return behavior is emulated in line
>>> with other vsyscall entrypoints for the trace/trap cases.
>>>
>>> Reported-by: Owen Kibel <qmewlo@xxxxxxxxx>
>>> Signed-off-by: Will Drewry <wad@xxxxxxxxxxxx>
>>>
>>> v2: - fixed ip and sp on SECCOMP_RET_TRAP/TRACE (thanks to luto@xxxxxxx)
>>
>>> @@ -253,6 +273,12 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>>>
>>> current_thread_info()->sig_on_uaccess_error = prev_sig_on_uaccess_error;
>>>
>>> + if (skip) {
>>> + if ((long)regs->ax <= 0L) /* seccomp errno emulation */
>>> + goto do_ret;
>>> + goto done; /* seccomp trace/trap */
>>> + }
>>> +
>>> if (ret == -EFAULT) {
>>> /* Bad news -- userspace fed a bad pointer to a vsyscall. */
>>> warn_bad_vsyscall(KERN_INFO, regs,
>>> @@ -271,10 +297,11 @@ bool emulate_vsyscall(struct pt_regs *regs, unsigned long address)
>>>
>>> regs->ax = ret;
>>>
>>> +do_ret:
>>> /* Emulate a ret instruction. */
>>> regs->ip = caller;
>>> regs->sp += 8;
>>> -
>>> +done:
>>> return true;
>>>
>>> sigsegv:
>>> --
>>> 1.7.9.5
>>>
>>
>> This has the same odd property as the sigsegv path that the faulting
>> instruction will appear to be the mov, not the syscall. That seems to
>> be okay, though -- various pieces of code that try to restart the segv
>> are okay with that.
>
> Yeah - I would otherwise do
> regs->ip += 9;
> but I wanted to match the code that was therefor SIGSEGV. If regs->ip
> += 9 _just_ for the SIGSYS case is fine, then I'll make that change
> shortly. Since any code that sees the vsyscall address should be wise
> enough to avoid it, perhaps that's why the SIGSEGV hasn't had a
> problem so far.

I dashed this off without more thought. It's best to leave it as is
because any return to the emulated page will cause a vsyscall fault
event.

>> Is there any code that assumes that changing rax (i.e. the syscall
>> number) and restarting a syscall after SIGSYS will invoke the new
>> syscall? (The RET_TRACE path might be similar -- does the
>> ptrace_event(PTRACE_EVENT_SECCOMP, data) in seccomp.c give a debugger
>> a chance to synchronously cancel or change the syscall?
>
> Unfortunately, it does in normal interception. I don't see any way out
> of that quirk with vsyscall=emulate. As is without seccomp,
> vsyscall=emulate doesn't allow ptrace interception (or syscall
> auditing for that matter) while vsyscall=native does. So the option
> here is to document the quirky interaction in
> Documentation/prctl/seccomp_filter.txt. In particular, if the tracer
> sees either (time|gettimeofday|getcpu) and rip in the vsyscall page,
> it will know it can't rewrite or bypass the call. Is there a better
> option?
>
> Given that, I will include a tweak to the documentation to indicate
> that behavior so that userspace authors of BPF programs that use
> SECCOMP_RET_TRACE will be aware of the behavior.
>
>> If those issues aren't problems, then:
>>
>> Reviewed-by: Andy Lutomirski <luto@xxxxxxxxxxxxxx>
>>
>> (If the syscall number needs to change after the fact in the
>> SECCOMP_RET_TRAP case, it'll be a mess.)
>
> Nah - traps are delivered like the forced sigsegv path.
>
> I'll spin a v3 soon including the documentation tweak and the ip
> offset to match vsyscall=native behavior (regs->ip += 9 _just_ for the
> skip case). Of course, any better ideas for the trace-case will be
> more than welcome, but it seems to me to be an acceptable tradeoff - I
> hope others agree.
>
> I'll make the changes and then put it through its paces to see if any
> other little idiosyncrasies emerge.

I've written up a documentation patch to accompany this one. It
reflects one more change I've made in a v3 of the patch, but it is
optional. I've added support for SECCOMP_RET_TRACE to still
skip/emulate the system call if it desires. In v2 it can't. Either
way is fine in practice, but I'd need to change the accompanying
documentation.

thanks again!
will
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/