Re: [PATCH 3/3] context_tracking,x86: remove extraneous irq disable & enable from context tracking on syscall entry
From: Andy Lutomirski
Date: Fri May 01 2015 - 12:03:53 EST
On Fri, May 1, 2015 at 8:59 AM, Ingo Molnar <mingo@xxxxxxxxxx> wrote:
>
> * Rik van Riel <riel@xxxxxxxxxx> wrote:
>
>> > I.e. what's the baseline we are talking about?
>>
>> It's an astounding difference. This is not a kernel without
>> nohz_full, just a CPU without nohz_full running the same kernel I
>> tested with yesterday:
>>
>>                  run time    system time
>> vanilla          5.49s       2.08s
>> __acct patch     5.21s       1.92s
>> both patches     4.88s       1.71s
>> CPU w/o nohz     3.12s       1.63s   <-- your numbers, mostly
>>
>> What is even more interesting is that the majority of the time
>> difference seems to come from _user_ time, which has gone down from
>> around 3.4 seconds in the vanilla kernel to around 1.5 seconds on
>> the CPU without nohz_full enabled...
>>
>> At syscall entry time, the nohz_full context tracking code is very
>> straightforward. We check thread_info->flags &
>> _TIF_WORK_SYSCALL_ENTRY, and call syscall_trace_enter_phase1, which
>> handles USER -> KERNEL context transition.
>>
>> Syscall exit time is a convoluted mess. Both do_notify_resume and
>> syscall_trace_leave call exit_user() on entry and enter_user() on
>> exit, leaving the time spent looping around between int_with_check
>> and syscall_return: in entry_64.S accounted as user time.
>>
>> I sent an email about this last night, it may be useful to add a
>> third test & function call point to the syscall return code, where
>> we can call user_enter() just ONCE, and remove the other context
>> tracking calls from that loop.
>
> So what I'm wondering about is the big picture:
>
> - This is crazy big overhead in something as fundamental as system
> calls!
>
> - We don't even have the excuse of the syscall auditing code, which
> kind of has to run for every syscall if it wants to do its job!
>
> - [ The 'precise vtime' stuff that is driven from syscall entry/exit
> is crazy, and I hope not enabled in any distro. ]
>
> - So why are we doing this in every syscall time at all?
>
> Basically the whole point of user-context tracking is to be able to
> flush pending RCU callbacks. But that's crazy, we can sure defer a few
> kfree()s on this CPU, even indefinitely!
>
> If some other CPU does a sync_rcu(), then it can very well pluck those
> callbacks from this super low latency CPU's RCU lists (with due care)
> and go and free stuff itself ... There's no need to disturb this CPU
> for that!
>
> If user-space does not do anything kernel-ish then there won't be any
> new RCU callbacks piled up, so it's not like it's a resource leak
> issue either.
>
> So what's the point? Why not remove this big source of overhead
> altogether?

The last time I asked, the impression I got was that we needed two things:

1. We can't pluck things from the RCU list without knowing whether the
CPU is in an RCU read-side critical section, and we can't know that
unless we have regular grace periods or we know that the CPU is idle.
To make the CPU detectably idle, we need to set a bit somewhere.

2. To suppress the timer tick, we need to get some timing for, um,
the scheduler? I wasn't really sure about this one.

Could we reduce the overhead by making the IN_USER vs IN_KERNEL
indication be a single bit and, worst case, an rdtsc and maybe a
subtraction? We could probably get away with banning full nohz on
non-invariant tsc systems.
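
A rough userspace sketch of that single-bit idea, purely illustrative and
not kernel code: every name here (struct ctx_track, ctx_enter_kernel,
ctx_enter_user) is hypothetical, and the "now" argument stands in for an
rdtsc reading on a machine with an invariant TSC. Each transition is one
flag write, one subtraction, and one add.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-CPU state; none of these names exist in the kernel. */
struct ctx_track {
	bool in_user;          /* the single IN_USER vs IN_KERNEL bit */
	uint64_t last_stamp;   /* counter value at the last transition */
	uint64_t user_time;    /* counter ticks accumulated in user mode */
	uint64_t kernel_time;  /* counter ticks accumulated in kernel mode */
};

/* USER -> KERNEL: charge the elapsed ticks to user time, clear the bit. */
static inline void ctx_enter_kernel(struct ctx_track *t, uint64_t now)
{
	/* Assumes a monotonic counter, i.e. an invariant TSC. */
	assert(t->in_user && now >= t->last_stamp);
	t->user_time += now - t->last_stamp;
	t->last_stamp = now;
	t->in_user = false;
}

/* KERNEL -> USER: charge the elapsed ticks to kernel time, set the bit. */
static inline void ctx_enter_user(struct ctx_track *t, uint64_t now)
{
	assert(!t->in_user && now >= t->last_stamp);
	t->kernel_time += now - t->last_stamp;
	t->last_stamp = now;
	t->in_user = true;
}
```

A remote CPU would only need to read in_user to decide whether this CPU
can be treated as idle from RCU's point of view. The sketch ignores
memory ordering and the IRQ question entirely.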

(I do understand why it would be tricky to transition from IN_USER to
IN_KERNEL with IRQs on. Solvable, maybe, but tricky.)

--Andy