Re: [PATCH v4 0/5] x86: two-phase syscall tracing and seccomp fastpath
From: Andy Lutomirski
Date: Wed Jul 30 2014 - 13:25:38 EST
On Wed, Jul 30, 2014 at 9:59 AM, Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> On Tue, Jul 29, 2014 at 04:30:58PM -0700, Andy Lutomirski wrote:
>> On Tue, Jul 29, 2014 at 1:54 PM, Andy Lutomirski <luto@xxxxxxxxxxxxxx> wrote:
>> > On Jul 29, 2014 12:22 PM, "Oleg Nesterov" <oleg@xxxxxxxxxx> wrote:
>> >>
>> >> Andy, to avoid the confusion: I am not trying to review these changes.
>> >> As you probably know my understanding of asm code in entry.S is very
>> >> limited.
>> >>
>> >> Just a couple of questions to ensure I understand this correctly.
>> >>
>> >> On 07/28, Andy Lutomirski wrote:
>> >> >
>> >> > This is both a cleanup and a speedup. It reduces overhead due to
>> >> > installing a trivial seccomp filter by 87%. The speedup comes from
>> >> > avoiding the full syscall tracing mechanism for filters that don't
>> >> > return SECCOMP_RET_TRACE.
>> >>
>> >> And only after I look at 5/5 I _seem_ to actually understand where
>> >> this speedup comes from.
>> >>
>> >> So. Currently the tracesys: path always leads to "iret" after the
>> >> syscall; with this change we can avoid that if phase_1() returns zero,
>> >> correct?
>> >>
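For reference, here is the flow being described, in rough C. This is only a
sketch: the prototypes are inferred from the asm quoted further down, not
copied from the patch, and trace_entry_flow() is a made-up name.

	long syscall_trace_enter_phase1(struct pt_regs *regs, u32 arch);
	long syscall_trace_enter_phase2(struct pt_regs *regs, u32 arch,
					unsigned long phase1_result);

	static void trace_entry_flow(struct pt_regs *regs)
	{
		long ret = syscall_trace_enter_phase1(regs, AUDIT_ARCH_X86_64);

		if (!ret) {
			/* Phase 1 handled everything -- e.g. a seccomp
			 * filter that did not return SECCOMP_RET_TRACE --
			 * so the asm jumps straight back to the fast path
			 * and returns via sysret, not iret. */
			return;
		}

		/* ptrace, audit, or SECCOMP_RET_TRACE work remains: save
		 * the full frame and run the traditional slow path. */
		syscall_trace_enter_phase2(regs, AUDIT_ARCH_X86_64, ret);
	}
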
>> >> And, this also removes the special TIF_SYSCALL_AUDIT-only case in entry.S,
>> >> cool.
>> >>
>> >> I am wondering if we can do something similar with do_notify_resume() ?
>> >>
>> >>
>> >> Stupid question. To simplify, let's forget that syscall_trace_enter()
>> >> already returns a value. Can't we simplify the asm code if we do not
>> >> export two functions, but instead make syscall_trace_enter() return
>> >> "bool slow_path_is_needed", so that "tracesys:" could do:
>> >>
>> >> // pseudo code
>> >>
>> >> tracesys:
>> >>         SAVE_REST
>> >>         FIXUP_TOP_OF_STACK
>> >>
>> >>         call syscall_trace_enter
>> >>
>> >>         if (!slow_path_is_needed) {
>> >>                 addq $REST_SKIP, %rsp
>> >>                 jmp system_call_fastpath
>> >>         }
>> >>
>> >> ...
>> >>
>> >> ?
>> >>
>> >> Once again, I am just curious; I am not actually suggesting that we
>> >> consider this option.
>> >
>> > We could, but this would lose a decent amount of the speedup. I could
>> > try it and benchmark it, but I'm guessing that the save and restore is
>> > kind of expensive. This will make audit slower than it currently is,
>> > which may also annoy some people. (Not me.)
>> >
>> > I'm also not convinced that it would be much simpler. My code is currently:
>> >
>> > tracesys:
>> >         leaq -REST_SKIP(%rsp), %rdi
>> >         movq $AUDIT_ARCH_X86_64, %rsi
>> >         call syscall_trace_enter_phase1
>> >         test %rax, %rax
>> >         jnz tracesys_phase2       /* if needed, run the slow path */
>> >         LOAD_ARGS 0               /* else restore clobbered regs */
>> >         jmp system_call_fastpath  /* and return to the fast path */
>> >
>> > tracesys_phase2:
>> >         SAVE_REST
>> >         FIXUP_TOP_OF_STACK %rdi
>> >         movq %rsp, %rdi
>> >         movq $AUDIT_ARCH_X86_64, %rsi
>> >         movq %rax, %rdx
>> >         call syscall_trace_enter_phase2
>> >
>> >         LOAD_ARGS ARGOFFSET, 1
>> >         RESTORE_REST
>> >
>> > ... slow path here ...
>> >
>> > It would end up looking more like (totally untested):
>> >
>> > tracesys:
>> >         SAVE_REST
>> >         FIXUP_TOP_OF_STACK %rdi
>> >         movq %rsp, %rdi
>> >         movq $AUDIT_ARCH_X86_64, %rsi
>> >         call syscall_trace_enter
>> >         LOAD_ARGS
>> >         RESTORE_REST
>> >         test [whatever condition]
>> >         j[cond] system_call_fastpath
>> >
>> > ... slow path here ...
>> >
>> > So it's a bit simpler. Oddly, the ia32entry code doesn't have this
>> > multiple syscall path distinction.
>> >
>> > SAVE_REST is 6 movq instructions and a subq. FIXUP_TOP_OF_STACK is 7
>> > movqs (and 8 if I ever get my way). RESTORE_TOP_OF_STACK is 4.
>> > RESTORE_REST is 6 movqs and an addq. So we're talking about avoiding
>> > 21 movqs, an addq, and a subq. That may be significant. (And I
>> > suspect that the difference is much larger on platforms like arm64,
>> > but that's a separate issue.)
>>
>> To put some more options on the table: there's an argument to be made
>> that the whole fast-path/slow-path split isn't worth it. We could
>> unconditionally set up a full frame for all syscalls. This means:
>>
>> No FIXUP_TOP_OF_STACK. Instead, the system_call entry sets up RSP,
>> SS, CS, RCX, and EFLAGS right away. That's five stores for all
>> syscalls instead of two loads and five stores for syscalls that need
>> it. But it also gets rid of RESTORE_TOP_OF_STACK completely.
>>
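Concretely, assuming the usual x86-64 syscall convention (the syscall
instruction leaves the return RIP in %rcx and RFLAGS in %r11), those five
stores would look roughly like the following pseudo-C; user_rsp, rcx, and
r11 are made-up names for values the entry code already has in hand:

	regs->ss    = __USER_DS;   /* constant */
	regs->sp    = user_rsp;    /* old RSP, saved at syscall entry */
	regs->flags = r11;         /* syscall stashed RFLAGS in %r11 */
	regs->cs    = __USER_CS;   /* constant */
	regs->ip    = rcx;         /* syscall stashed the return RIP in %rcx */
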
>> No more bugs involving C code that assumes a full stack frame when no
>> such frame exists inside syscall code.
>>
>> We could possibly remove a whole bunch of duplicated code.
>>
>> The upshot would be simpler code, faster slow-path syscalls, and
>> slower fast-path syscalls (but probably not much slower). On the
>> other hand, there's zero chance that this would be ready for 3.17.
>
> Indeed.
>
> If we ever take that direction (i.e. remove the partial-frame
> optimization), there is a common pattern, found in many archs on return
> from syscalls, exceptions, or interrupts, that could be more or less
> consolidated as:
>
> void sysret_check(struct pt_regs *regs)
> {
>         while (current_thread_info()->flags & _TIF_ALLWORK_MASK) {
>                 int resched = need_resched();
>
>                 local_irq_enable();
>                 if (resched)
>                         schedule();
>                 else
>                         do_notify_resume(regs);
>                 local_irq_disable();
>         }
> }
>
> But well, probably the syscall fastpath is still worth it.
And yet x86_64 has this code implemented in assembly even in the
slowpath. Go figure.
--Andy