[PATCH] perf: Timehist account sch delay for scheduled out running

From: Fernand Sieber
Date: Thu Jun 13 2024 - 14:59:34 EST


When using perf timehist, sch delay is only computed for a waking task,
not for a pre empted task. This patches addresses this problem.

Example of `perf timehist` report before the patch for `stress` task
competing with each other.

First column is wait time, second column sch delay, third column
runtime.
```
1.492060 [0000] s stress[81] 1.999 0.000 2.000 R next: stress[83]
1.494060 [0000] s stress[83] 2.000 0.000 2.000 R next: stress[81]
1.496060 [0000] s stress[81] 2.000 0.000 2.000 R next: stress[83]
1.498060 [0000] s stress[83] 2.000 0.000 1.999 R next: stress[81]
```

After the patch, it looks like this (note that all wait time is now sch
delay instead of zero):
```
1.492060 [0000] s stress[81] 1.999 1.999 2.000 R next: stress[83]
1.494060 [0000] s stress[83] 2.000 2.000 2.000 R next: stress[81]
1.496060 [0000] s stress[81] 2.000 2.000 2.000 R next: stress[83]
1.498060 [0000] s stress[83] 2.000 2.000 1.999 R next: stress[81]
```

In timehist:
* wait time represents the duration waiting for any system resource
* sch delay represents the duration waiting for cpu system resources

This is based on perf comments (dt_wait = wait time, dt_delay = sch
delay):
```
/*
* Explanation of delta-time stats:
*
* t = time of current schedule out event
* tprev = time of previous sched out event
* also time of schedule-in event for current task
* last_time = time of last sched change event for current task
* (i.e, time process was last scheduled out)
* ready_to_run = time of wakeup for current task
*
* -----|------------|------------|------------|------
* last ready tprev t
* time to run
*
* |-------- dt_wait --------|
* |- dt_delay -|-- dt_run --|
*
* dt_run = run time of current task
* dt_wait = time between last schedule out event for task and tprev
* represents time spent off the cpu
* dt_delay = time between wakeup and schedule-in of task
*/
```

The problem with the current logic is that last time is only set when
waking a task. Therefore it is not set for a pre empted task. To fix
this, we set last time to the current sample time if a scheduled out
task (on the switch tracepoint) is in state running.

Signed-off-by: Fernand Sieber <sieberf@xxxxxxxxxx>
Cc: Peter Zijlstra <peterz@xxxxxxxxxxxxx>
Cc: Ingo Molnar <mingo@xxxxxxxxxx>
Cc: Arnaldo Carvalho de Melo <acme@xxxxxxxxxx>
Cc: Alexander Shishkin <alexander.shishkin@xxxxxxxxxxxxxxx>
---
tools/perf/builtin-sched.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/tools/perf/builtin-sched.c b/tools/perf/builtin-sched.c
index 5977c49ae2c7..7422c930abaf 100644
--- a/tools/perf/builtin-sched.c
+++ b/tools/perf/builtin-sched.c
@@ -2659,7 +2659,10 @@ static int timehist_sched_change_event(struct perf_tool *tool,
tr->last_state = state;

/* sched out event for task so reset ready to run time */
- tr->ready_to_run = 0;
+ if (state == 'R')
+ tr->ready_to_run = t;
+ else
+ tr->ready_to_run = 0;
}

evsel__save_time(evsel, sample->time, sample->cpu);
--
2.40.1