Re: [Regression v4.2 ?] 32-bit seccomp-BPF returned errno values wrong in VM?

From: Denys Vlasenko
Date: Thu Aug 13 2015 - 17:35:50 EST

On 08/13/2015 08:47 PM, Kees Cook wrote:
> On Thu, Aug 13, 2015 at 10:39 AM, David Drysdale <drysdale@xxxxxxxxxx> wrote:
>> On Thu, Aug 13, 2015 at 6:15 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>>> On Thu, Aug 13, 2015 at 9:28 AM, David Drysdale <drysdale@xxxxxxxxxx> wrote:
>>>> On Thu, Aug 13, 2015 at 4:17 PM, Denys Vlasenko <dvlasenk@xxxxxxxxxx> wrote:
>>>>> On 08/13/2015 10:30 AM, David Drysdale wrote:
>>>>>> Hi folks,
>>>>>> I've got an odd regression with the v4.2 rc kernel, and I wondered if anyone
>>>>>> else could reproduce it.
>>>>>> The problem occurs with a seccomp-bpf filter program that's set up to return
>>>>>> an errno value -- an errno of 1 is always returned instead of what's in the
>>>>>> filter, plus other oddities (selftest output below).
>>>>>> The problem seems to need a combination of circumstances to occur:
>>>>>> - The seccomp-bpf userspace program needs to be 32-bit, running against a
>>>>>> 64-bit kernel -- I'm testing with seccomp_bpf from
>>>>>> tools/testing/selftests/seccomp/, built via 'CFLAGS=-m32 make'.
>>>>> Does it work correctly when built as 64-bit program?
>>>> Yep, 64-bit works fine (both at v4.2-rc6 and at commit 3f5159).
>>>>>> - The kernel needs to be running as a VM guest -- it occurs inside my
>>>>>> VMware Fusion host, but not if I run on bare metal. Kees tells me he
>>>>>> cannot repro with a kvm guest though.
>>>>>> Bisecting indicates that the commit that induces the problem is
>>>>>> 3f5159a9221f19b0, "x86/asm/entry/32: Update -ENOSYS handling to match the
>>>>>> 64-bit logic", included in all the v4.2-rc* candidates.
>>>>>> Apologies if I've just got something odd with my local setup, but the
>>>>>> bisection was unequivocal enough that I thought it worth reporting...
>>>>>> Thanks,
>>>>>> David
>>>>>> seccomp_bpf failure outputs:
>>>> [snip]
>>>>> End result should be:
>>>>> pt_regs->ax = -E2BIG (via syscall_set_return_value())
>>>>> pt_regs->orig_ax = -1 ("skip syscall")
>>>>> and syscall_trace_enter_phase1() usually returns with 0,
>>>>> meaning "re-execute syscall at once, no phase2 needed".
>>>>> This, in turn, is called from .S files, and when it returns there,
>>>>> execution loops back to syscall dispatch.
>>>>> Because of orig_ax = -1, syscall dispatch should skip calling syscall.
>>>>> So -E2BIG should survive and be returned...
>>>> So I was just about to send:
>>>> That makes sense, and given that exactly the same 32-bit binary
>>>> runs fine on a different machine, there's presumably something up
>>>> with my local setup. The failing machine is a VMware guest, but
>>>> maybe that's not the relevant interaction -- particularly if no-one
>>>> else can repro.
>>>> But then I noticed some odd audit entries in the main log:
>>>> Aug 13 16:52:56 ubuntu kernel: [ 20.687249] audit: type=1326
>>>> audit(1439481176.034:62): auid=4294967295 uid=1000 gid=1000
>>>> ses=4294967295 pid=2621 comm=""
>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=9 arch=40000003 syscall=172
>>>> compat=1 ip=0xf773cc90 code=0x0
>>>> Aug 13 16:52:56 ubuntu kernel: [ 20.691157] audit: type=1326
>>>> audit(1439481176.038:63): auid=4294967295 uid=1000 gid=1000
>>>> ses=4294967295 pid=2631 comm=""
>>>> exe="/home/dmd/secccomp_bpf.kees.m32" sig=31 arch=40000003 syscall=20
>>>> compat=1 ip=0xf773cc90 code=0x10000000
>>>> ...
>>>> I didn't think I had any audit stuff turned on, and indeed:
>>>> # auditctl -l
>>>> No rules
>>>> But as soon as I'd run that auditctl command, the 32-bit
>>>> seccomp_bpf binary started running fine!
>>>> So now I'm confused, and I can no longer reproduce the
>>>> problem. Which probably means this was a false alarm, in
>>>> which case, my apologies.
>>> You might have triggered TIF_AUDIT or whatever it's called, which
>>> causes a whole different path through the asm tangle, so you might
>>> really have a problem.
>>> Try auditctl -a task,never. If that doesn't change anything, try
>>> rebooting the guest.
>> Aha, that seems to re-instate the problem -- with that auditctl setup
>> I get the 32-bit seccomp failures on two different machines (one VM,
>> one bare). So can anyone else repro?
>> I guess the relevant steps are thus:
>> - sudo auditctl -a task,never
>> - cd tools/testing/selftests/seccomp
>> - CFLAGS=-m32 make clean run_tests
> That was it! I can reproduce this now on kvm (after adding the auditctl rule).
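
For anyone who wants to poke at this without the full selftest, here is a
minimal sketch of the kind of filter involved: one syscall is made to fail
with a chosen errno via SECCOMP_RET_ERRNO, everything else is allowed. The
helper names errno_action() and install_errno_filter() are made up for this
mail, not taken from the selftest, and the filter skips the arch check that
a production filter would want.

```c
#include <errno.h>
#include <stddef.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>

/* Filter return value that makes a syscall fail with `err`.
 * (Hypothetical helper, named for illustration only.) */
unsigned int errno_action(int err)
{
    return SECCOMP_RET_ERRNO | ((unsigned int)err & SECCOMP_RET_DATA);
}

/* Install a filter: syscall `nr` fails with `err`, all others are allowed.
 * Returns 0 on success, -1 with errno set on failure.
 * Assumption: no arch check, to keep the sketch short. */
int install_errno_filter(int nr, int err)
{
    struct sock_filter insns[] = {
        /* A = seccomp_data.nr */
        BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
        /* if (A == nr) fall through to the errno return, else skip it */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, (unsigned int)nr, 0, 1),
        BPF_STMT(BPF_RET | BPF_K, errno_action(err)),
        BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(insns) / sizeof(insns[0])),
        .filter = insns,
    };

    /* Needed so an unprivileged task may install a filter. */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}
```

Built with -m32, a call such as install_errno_filter(__NR_getppid, E2BIG)
followed by getppid() should give -1 with errno == E2BIG. The report in this
thread is that errno comes back as 1 instead.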

I suspect this change:

 .macro auditsys_entry_common
        movl %ebx,%esi                  /* 2nd arg: 1st syscall arg */
        movl %eax,%edi                  /* 1st arg: syscall number */
        call __audit_syscall_entry
-       movl RAX(%rsp),%eax             /* reload syscall number */
-       cmpq $(IA32_NR_syscalls-1),%rax
-       ja ia32_badsys
+       movl ORIG_RAX(%rsp),%eax        /* reload syscall number */
        movl %ebx,%edi                  /* reload 1st syscall arg */
        movl RCX(%rsp),%esi             /* reload 2nd syscall arg */
        movl RDX(%rsp),%edx             /* reload 3rd syscall arg */

Before the patch, we reloaded the syscall number from pt_regs->ax.

After the patch, pt_regs->ax no longer holds the syscall number on entry;
it contains -ENOSYS instead. Hence the change shown above, which
reloads the syscall number from pt_regs->orig_ax.

Well. This still should work... in fact it is "more correct"
than it was before...
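
To spell out the difference between the two reloads, here is a toy userspace
model. struct toy_regs and the function names are invented for illustration;
this is not kernel code, it only mirrors the two slots discussed above.

```c
#include <errno.h>

/* Toy stand-in for the two pt_regs slots under discussion. */
struct toy_regs {
    long ax;       /* return-value slot */
    long orig_ax;  /* syscall number saved at entry */
};

/* Entry state after the patch: ax is preloaded with -ENOSYS,
 * orig_ax carries the syscall number. */
struct toy_regs toy_entry(long nr)
{
    struct toy_regs r = { .ax = -ENOSYS, .orig_ax = nr };
    return r;
}

/* Old reload, modelling "movl RAX(%rsp),%eax". */
long reload_old(const struct toy_regs *r)
{
    return r->ax;
}

/* New reload, modelling "movl ORIG_RAX(%rsp),%eax". */
long reload_new(const struct toy_regs *r)
{
    return r->orig_ax;
}
```

With the post-patch entry state, reload_old() would hand -ENOSYS to the
dispatcher as if it were a syscall number, while reload_new() returns the
number the task actually asked for.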

The 64-bit code has no call to __audit_syscall_entry; it uses the
syscall_trace_enter_phase1/phase2 mechanism instead of the
"only audit" shortcut. If the bug is here (though I don't see it),
that would explain why the 64-bit binary works.
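
For reference, the outcome expected from the seccomp path, as quoted earlier
in this thread (ax = -E2BIG, orig_ax = -1, dispatch skips the call), can be
sketched as another toy model. TOY_NR_SYSCALLS, struct toy_frame and both
functions are hypothetical names, and the "handler" is a stand-in that just
returns 0.

```c
#include <errno.h>

#define TOY_NR_SYSCALLS 4   /* hypothetical table size */

struct toy_frame {
    long ax;       /* return-value slot */
    long orig_ax;  /* syscall number, or -1 for "skip syscall" */
};

/* Seccomp decides to fail the call: set the return value and mark
 * the syscall as skipped. */
void toy_seccomp_errno(struct toy_frame *f, int err)
{
    f->ax = -err;
    f->orig_ax = -1;
}

/* Dispatch: an out-of-range number (including -1, which becomes a huge
 * unsigned value) leaves ax untouched, so the errno survives. */
long toy_dispatch(struct toy_frame *f)
{
    if ((unsigned long)f->orig_ax < TOY_NR_SYSCALLS)
        f->ax = 0;   /* stand-in for calling the real handler */
    return f->ax;
}
```

In this model a frame that went through toy_seccomp_errno(&f, E2BIG) comes
out of toy_dispatch() with -E2BIG, which is the behaviour the 32-bit audit
path apparently fails to preserve.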

Now, how do we reach this bit of code?

jnz sysenter_tracesys
jz sysenter_auditsys
auditsys_entry_common <== OUR MACRO
movl %ebp,%r9d /* reload 6th syscall arg */
jmp sysenter_dispatch

jnz cstar_tracesys
jz cstar_auditsys
movl %r9d,R9(%rsp) /* register to be clobbered by call */
auditsys_entry_common <== OUR MACRO
movl R9(%rsp),%r9d /* reload 6th syscall arg */
jmp cstar_dispatch
