Re: [RFC PATCH] sched: Fix sched_wakeup tracepoint

From: Peter Zijlstra
Date: Fri Jun 05 2015 - 08:51:48 EST


On Fri, 2015-06-05 at 12:32 +0000, Mathieu Desnoyers wrote:
> ----- On Jun 5, 2015, at 2:09 PM, Peter Zijlstra peterz@xxxxxxxxxxxxx wrote:
>
> > On Fri, Jun 05, 2015 at 01:41:49PM +0200, Mathieu Desnoyers wrote:
> >> Commit 317f394160e9 "sched: Move the second half of ttwu() to the remote cpu"
> >> moves ttwu_do_wakeup() to an IPI handler context on the remote CPU for
> >> remote wakeups. This commit appeared upstream in Linux v3.0.
> >>
> >> Unfortunately, ttwu_do_wakeup() happens to contain the "sched_wakeup"
> >> tracepoint. Analyzing wakup latencies depends on getting the wakeup
> >> chain right: which process is the waker, which is the wakee. Moving this
> >> instrumention outside of the waker context prevents trace analysis tools
> >> from getting the waker pid, either through "current" in the tracepoint
> >> probe, or by deducing it using other scheduler events based on the CPU
> >> executing the tracepoint.
> >>
> >> Another side-effect of moving this instrumentation to the scheduler ipi
> >> is that the delay during which the wakeup is sitting in the pending
> >> queue is not accounted for when calculating wakeup latency.
> >>
> >> Therefore, move the sched_wakeup instrumentation back to the waker
> >> context to fix those two shortcomings.
> >
> > What do you consider wakeup-latency? I don't see how moving the
> > tracepoint into the caller will magically account the queue time.
>
> That would be the delay between timestamp@wakeup (in the waker) to
> timestamp@sched_switch_in (in the wakee).

That's scheduling latency. Its how long it takes a runnable task to
become running. Wakeup latency would be how long it takes from 'event'
to the task being runnable.

The unfortunate situation here is that placing that tracepoint in the
wakee -- which makes sense -- makes measuring either of the above
latencies impossible because it does not capture the becoming RUNNABLE
part.

> ttwu_pending is going through a lockless list populated before sending
> the IPI, and iterated on by the IPI handler. Therefore moving the tracepoint
> before the enqueue will account spend in the queue in the wakeup latency
> time.

Which is wrong from all points of view latencies.

> >
> >> +++ b/kernel/sched/core.c
> >> @@ -1457,7 +1457,6 @@ static void
> >> ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags)
> >> {
> >> check_preempt_curr(rq, p, wake_flags);
> >> - trace_sched_wakeup(p, true);
> >>
> >> p->state = TASK_RUNNING;
> >> #ifdef CONFIG_SMP
> >> @@ -1505,6 +1504,7 @@ static int ttwu_remote(struct task_struct *p, int
> >> wake_flags)
> >> if (task_on_rq_queued(p)) {
> >> /* check_preempt_curr() may use rq clock */
> >> update_rq_clock(rq);
> >> + trace_sched_wakeup(p, true);
> >> ttwu_do_wakeup(rq, p, wake_flags);
> >> ret = 1;
> >> }
> >> @@ -1619,6 +1619,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
> >> {
> >> struct rq *rq = cpu_rq(cpu);
> >>
> >> + trace_sched_wakeup(p, true);
> >> #if defined(CONFIG_SMP)
> >> if (sched_feat(TTWU_QUEUE) && !cpus_share_cache(smp_processor_id(), cpu)) {
> >> sched_clock_cpu(cpu); /* sync clocks x-cpu */
> >
> > You only need one site in try_to_wake_up(), put it right after
> > success=1.
>
> There are code paths further down the chain that check whether the task
> is not fully dequeued from the runqueue (e.g. after grabbing the rq lock
> in ttwu_remote()) and skips sched_wakeup event in those cases. Do we care
> about not tracing an event in those cases, or would tracing it (as per your
> suggestion) be preferable ?

Its an actual wakeup, we change task->state, therefore one should trace
it.

>
> Also, moving the tracepoint to try_to_wake_up() would not be enough. We would
> also need to instrument try_to_wake_up_local().

Sure, I just objected to having 2 tracepoint in the 'normal' wakeup
path.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/