Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint

From: Mathieu Desnoyers
Date: Fri Jun 05 2015 - 09:23:20 EST


----- On Jun 5, 2015, at 2:51 PM, null wrote:

> On Fri, 2015-06-05 at 12:32 +0000, Mathieu Desnoyers wrote:
>> ----- On Jun 5, 2015, at 2:09 PM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
>>
>> > On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
>> >> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
>> >> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
>> >> remote wakeups. This commit appeared upstream in Linux v3.0.
>> >>
>> >> Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup"
>> >> tracepoint. Analyzing wakup latencies depends on getting the wakeup
>> >> chain right: which process is the waker, which is the wakee. Moving this
>> >> instrumention outside of the waker context prevents trace analysis tools
>> >> from getting the waker pid, either through "current" in the tracepoint
>> >> probe, or by deducing it using other scheduler events based on the CPU
>> >> executing the tracepoint.
>> >>
>> >> Another side-effect of moving this instrumentation to the scheduler ipi
>> >> is that the delay during which the wakeup is sitting in the pending
>> >> queue is not accounted for when calculating wakeup latency.
>> >>
>> >> Therefore, move the sched_wakeup instrumentation back to the waker
>> >> context to fix those two shortcomings.
>> >
>> > What do you consider wakeup-latency? I don't see how moving the
>> > tracepoint into the caller will magically account the queue time.
>>
>> That would be the delay between timestamp@wakeup (in the waker) to
>> timestamp@sched_switch_in (in the wakee).
>
> That's scheduling latency. Its how long it takes a runnable task to
> become running. Wakeup latency would be how long it takes from 'event'
> to the task being runnable.
>
> The unfortunate situation here is that placing that tracepoint in the
> wakee -- which makes sense -- makes measuring either of the above
> latencies impossible because it does not capture the becoming RUNNABLE
> part.
>
>> ttwu_pending is going through a lockless list populated before sending
>> the IPI, and iterated on by the IPI handler. Therefore moving the tracepoint
>> before the enqueue will account spend in the queue in the wakeup latency
>> time.
>
> Which is wrong from all points of view latencies.

OK, so considering the definition naming feedback you provided, we
may need a 3 tracepoint if we want to calculate both wakeup latency
and scheduling latency (naming ofc open to discussion):

sched_wakeup: when try_to_wake_up{,_local} is called in the waker.
sched_activate_task: when the wakee is marked runnable.
sched_switch: when scheduling actually happens.

We can then calculate wakeup latency as

time@sched_activate - time@sched_wakeup

And scheduling latency as

time@sched_switch - time@sched_activate

In the case of critical path analysis, we don't care about this
level of granularity. What we care about is is sum of the two,
which we can express as:

time@sched_switch - time@sched_wakeup

Is there an officially blessed name for this ?

>
>> >
>> >> +++ b/kernel/sched/core.c
>> >> @@ -1457,7 +1457,6 @@ static void
>> >> ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
>> >> {
>> >> check_preempt_curr(rq, p, wake_flags);
>> >> - trace_sched_wakeup(p, true);
>> >>
>> >> p->state = TASK_RUNNING;
>> >> #ifdef CONFIG_SMP
>> >> @@ -1505,6 +1504,7 @@ static int ttwu_remote(struct task_struct *p, int
>> >> wake_flags)
>> >> if (task_on_rq_queued(p)) {
>> >> /* check_preempt_curr() may use rq clock */
>> >> update_rq_clock(rq);
>> >> + trace_sched_wakeup(p, true);
>> >> ttwu_do_wakeup(rq, p, wake_flags);
>> >> ret = 1;
>> >> }
>> >> @@ -1619,6 +1619,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
>> >> {
>> >> struct rq *rq = cpu_rq(cpu);
>> >>
>> >> + trace_sched_wakeup(p, true);
>> >> #if defined(CONFIG_SMP)
>> >> if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
>> >> sched_clock_cpu(cpu); /* sync clocks x-cpu */
>> >
>> > You only need one site in try_to_wake_up(), put it right after
>> > success=1.
>>
>> There are code paths further down the chain that check whether the task
>> is not fully dequeued from the runqueue (e.g. after grabbing the rq lock
>> in ttwu_remote()) and skips sched_wakeup event in those cases. Do we care
>> about not tracing an event in those cases, or would tracing it (as per your
>> suggestion) be preferable ?
>
> Its an actual wakeup, we change task->state, therefore one should trace
> it.

Works for me!

>
>>
>> Also, moving the tracepoint to try_to_wake_up() would not be enough. We would
>> also need to instrument try_to_wake_up_local().
>
> Sure, I just objected to having 2 tracepoint in the 'normal' wakeup
> path.

Agreed. We may need to see if we want to add a "sched_activate_task" tracepoint
though.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/