Re: Scheduling latency outliers observed with perf sched record

From: Namhyung Kim
Date: Wed Mar 19 2025 - 18:41:17 EST


Hello,

On Fri, Mar 14, 2025 at 11:59:23AM +0000, Daniel Dao wrote:
> Hi,
>
> At Cloudflare, with both kernel 6.6.82 and 6.12.17, we frequently
> observe scheduling latency outliers with `perf sched record`. For example:
>
> ------------------------------------------------------------------------------------------------------------------------------------
>  Task                     |  Runtime ms  | Count | Avg delay ms     | Max delay ms     | Max delay start            | Max delay end
> ------------------------------------------------------------------------------------------------------------------------------------
>  Pingora HTTP Pr:3133982  |  0.080 ms    |     1 | avg: 881.614 ms  | max: 881.614 ms  | max start: 312881.567210 s | max end: 312882.448824 s
>
> The scheduling timeline of the outlier is as follows:
>
> time            cpu     task name                        wait time  sch delay  run time  state
>                         [tid/pid]                           (msec)     (msec)    (msec)
> --------------- ------  -------------------------------  ---------  ---------  --------- -----
>   312881.567210 [0058]  Pingora HTTP Pr[3045283/3045241  awakened: Pingora HTTP Pr[3133982/3045241
>   312882.448807 [0015]  Pingora HTTP Pr[3045278/3045241  awakened: Pingora HTTP Pr[3133982/3045241
>   312882.448861 [0118]  Pingora HTTP Pr[3133982/3045241  awakened: Pingora HTTP Pr[3045278/3045241
>   312882.448911 [0118]  Pingora HTTP Pr[3133982/3045241      0.000    881.614      0.087  S
>
> According to `perf sched record`, the task received several wakeups
> before it finally executed on CPU 118. The timeline on CPU 118 is as
> follows:
>
>   312882.448824 [0118]  <idle>                               0.024      0.024      0.416  I
>   312882.448861 [0118]  Pingora HTTP Pr[3133982/3045241  awakened: Pingora HTTP Pr[3045278/3045241
>   312882.448911 [0118]  Pingora HTTP Pr[3133982/3045241      0.000    881.614      0.087  S
>
> This is very strange and seems impossible: the task had been woken up
> earlier but was not able to run, even though the CPU had idle time.
>
> In production, we usually do not use `perf sched record`; instead we
> use an eBPF program similar to
> https://github.com/bpftrace/bpftrace/blob/master/tools/runqlat.bt
>
> tracepoint:sched:sched_wakeup, tracepoint:sched:sched_wakeup_new {
>     @qtime[args.pid] = nsecs;
> }
>
> tracepoint:sched:sched_switch {
>     if (args.prev_state == TASK_RUNNING) {
>         @qtime[args.prev_pid] = nsecs;
>     }
>
>     $ns = @qtime[args.next_pid];
>     if ($ns) {
>         @usecs = hist((nsecs - $ns) / 1000);
>     }
>     delete(@qtime, args.next_pid);
> }
>
> We do not observe any outliers with this program. We figured out that
> the difference from perf is in how supposedly repeated wakeups are
> handled: the eBPF script takes the last wakeup, while `perf sched
> latency` takes the first one. Once we adjusted our bpftrace script to
> take the first wakeup, we observed the same outliers.
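The first-vs-last-wakeup distinction can be sketched with a small, self-contained simulation (Python rather than bpftrace so it runs anywhere; the event stream and function name are made up for illustration, with timestamps loosely modeled on the trace above):

```python
# Hypothetical event stream in nanoseconds: a wakeup, a repeated wakeup
# much later, then the actual switch-in on the CPU.
EVENTS = [
    ("wakeup", 0),
    ("wakeup", 881_000_000),
    ("switch_in", 881_614_000),
]

def sched_delay(events, policy):
    """Return the wakeup-to-run delay in ns.

    policy="last"  -> every wakeup overwrites qtime (runqlat-style script)
    policy="first" -> only the first wakeup sets qtime (perf sched latency)
    """
    qtime = None
    for kind, t in events:
        if kind == "wakeup":
            # "last" always overwrites; "first" keeps the oldest timestamp.
            if policy == "last" or qtime is None:
                qtime = t
        elif kind == "switch_in":
            return t - qtime
    return None

print(sched_delay(EVENTS, "first"))  # 881614000 ns: the ~881 ms outlier
print(sched_delay(EVENTS, "last"))   # 614000 ns: no outlier
```

With the same event stream, only the choice of which wakeup timestamp to keep decides whether the ~881 ms outlier appears.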
>
> Looking into this further, we adjusted the same bpftrace script to
> also print `sum_exec_runtime`, to make sure that the process didn't
> actually run, i.e. that we did not miss any sched_switch events. The
> adjusted script is as follows:
>
> #define TASK_RUNNING 0
> #define THRESHOLD 100
>
> rt:sched_wakeup, rt:sched_wakeup_new
> {
>     $task = (struct task_struct *)arg0;
>     $ns = @qtime[$task->pid];
>     $sum_exec_runtime = @sum_exec_runtime[$task->pid];
>     if ($ns > 0) {
>         $ms = ((nsecs - $ns) / 1000000);
>         if ($ms > THRESHOLD) {
>             printf("pid=%d t=%ld delay_ms=%d sum_exec_runtime=%ld prev_sum_exec_runtime=%ld cached_sum_exec_runtime=%ld\n",
>                 $task->pid, nsecs, $ms,
>                 $task->se.sum_exec_runtime, $task->se.prev_sum_exec_runtime,
>                 $sum_exec_runtime);
>         }
>         if ($task->se.sum_exec_runtime == $sum_exec_runtime) {
>             // repeated wakeup: the task has not run since the first
>             // wakeup and we have not missed any sched_switch events,
>             // so do not count it
>             return;
>         }
>     }
>     @qtime[$task->pid] = nsecs;
>     @sum_exec_runtime[$task->pid] = $task->se.sum_exec_runtime;
> }
>
> rt:sched_switch
> {
>     $prev = (struct task_struct *)arg1;
>     $next = (struct task_struct *)arg2;
>     $prev_state = arg3;
>     if ($prev_state == TASK_RUNNING) {
>         @qtime[$prev->pid] = nsecs;
>         @sum_exec_runtime[$prev->pid] = $prev->se.sum_exec_runtime;
>     }
>
>     $ns = @qtime[$next->pid];
>     if ($ns) {
>         @usecs = hist((nsecs - $ns) / 1000);
>     }
>     delete(@qtime[$next->pid]);
>     delete(@sum_exec_runtime[$next->pid]);
> }
>
> The script gives us the following data:
>
> pid=2197474 t=298353700427405 delay_ms=1157 sum_exec_runtime=1663205188
>     prev_sum_exec_runtime=1663167058 cached_sum_exec_runtime=1663167058
> pid=2196692 t=298354187452899 delay_ms=1644 sum_exec_runtime=1630867065
>     prev_sum_exec_runtime=1630824755 cached_sum_exec_runtime=1630824755
> pid=3290499 t=298354606215376 delay_ms=1482 sum_exec_runtime=26316767196
>     prev_sum_exec_runtime=26316654306 cached_sum_exec_runtime=26316654306
>
> We can see that the script detected a repeated wakeup (i.e. it did not
> observe a sched_switch in between), yet the task's exec_runtime has
> changed.
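The decision rule the adjusted script relies on can be stated compactly (a Python sketch; the function name is ours, and the runtime values below are taken from the first record above):

```python
def classify_wakeup(cached_runtime, current_runtime):
    """Classify a wakeup for a task whose qtime entry already exists.

    cached_runtime:  se.sum_exec_runtime recorded at the previous wakeup
    current_runtime: se.sum_exec_runtime observed now
    """
    if current_runtime == cached_runtime:
        # The task never got on-CPU since the first wakeup: a benign
        # repeated wakeup, so keeping the first qtime is correct.
        return "repeated-wakeup"
    # The task did run in between, but no sched_switch was observed for
    # it: the tracer must have lost at least one event.
    return "missed-sched_switch"

# Values from the pid=2197474 record above:
print(classify_wakeup(1663167058, 1663205188))  # missed-sched_switch
```

A changed sum_exec_runtime with no intervening sched_switch is exactly the contradiction the report describes: the counter proves the task ran, so the switch event must have been dropped somewhere.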
>
> Am I understanding correctly that this means we must have missed a
> sched_switch tracepoint event? We didn't see any miss counters for our
> BPF program reported by bpftool. The `perf sched record` command also
> didn't report any missing events, yet I think it must have missed some
> of these sched_switch events as well.
>
> Is there anything else that we are missing here ?

perf can miss some events when the ring buffer is full. There should be
a warning about that. You can also check whether the data has LOST
records by running this command:

$ perf report --stats | grep LOST

Thanks,
Namhyung