Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint

From: Mathieu Desnoyers
Date: Fri Jun 05 2015 - 08:32:21 EST


----- On Jun 5, 2015, at 2:09 PM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:

> On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
>> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
>> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
>> remote wakeups. This commit appeared upstream in Linux v3.0.
>>
>> Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup"
>> tracepoint. Analyzing wakup latencies depends on getting the wakeup
>> chain right: which process is the waker, which is the wakee. Moving this
>> instrumention outside of the waker context prevents trace analysis tools
>> from getting the waker pid, either through "current" in the tracepoint
>> probe, or by deducing it using other scheduler events based on the CPU
>> executing the tracepoint.
>>
>> Another side-effect of moving this instrumentation to the scheduler ipi
>> is that the delay during which the wakeup is sitting in the pending
>> queue is not accounted for when calculating wakeup latency.
>>
>> Therefore, move the sched_wakeup instrumentation back to the waker
>> context to fix those two shortcomings.
>
> What do you consider wakeup-latency? I don't see how moving the
> tracepoint into the caller will magically account the queue time.

That would be the delay between timestamp@wakeup (in the waker) to
timestamp@sched_switch_in (in the wakee).

ttwu_pending is going through a lockless list populated before sending
the IPI, and iterated on by the IPI handler. Therefore moving the tracepoint
before the enqueue will account spend in the queue in the wakeup latency
time.

>
>> +++ b/kernel/sched/core.c
>> @@ -1457,7 +1457,6 @@ static void
>> ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>> {
>> check_preempt_curr(rq, p, wake_flags);
>> - trace_sched_wakeup(p, true);
>>
>> p->state = TASK_RUNNING;
>> #ifdef CONFIG_SMP
>> @@ -1505,6 +1504,7 @@ static int ttwu_remote(struct task_struct *p, int
>> wake_flags)
>> if (task_on_rq_queued(p)) {
>> /* check_preempt_curr() may use rq clock */
>> update_rq_clock(rq);
>> + trace_sched_wakeup(p, true);
>> ttwu_do_wakeup(rq, p, wake_flags);
>> ret = 1;
>> }
>> @@ -1619,6 +1619,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
>> {
>> struct rq *rq = cpu_rq(cpu);
>>
>> + trace_sched_wakeup(p, true);
>> #if defined(CONFIG_SMP)
>> if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
>> sched_clock_cpu(cpu); /* sync clocks x-cpu */
>
> You only need one site in try_to_wake_up(), put it right after
> success=1.

There are code paths further down the chain that check whether the task
is not fully dequeued from the runqueue (e.g. after grabbing the rq lock
in ttwu_remote()) and skips sched_wakeup event in those cases. Do we care
about not tracing an event in those cases, or would tracing it (as per your
suggestion) be preferable ?

Also, moving the tracepoint to try_to_wake_up() would not be enough. We would
also need to instrument try_to_wake_up_local().

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/