Re: [RFC PATCH 4/4] perf: add timehist command

From: Ingo Molnar
Date: Thu Jun 20 2013 - 06:47:08 EST



* David Ahern <dsahern@xxxxxxxxx> wrote:

> perf timehist provides an analysis of scheduling event data. Right now
> it uses the context-switch softwar event; it needs to be updated to use
> the scheduling tracepoints along with analysis enhancements those
> tracepoints provide. For now, the context switch event provides a
> well-tested start point.
>
> This command has been extremely helpful debugging systems with heavy
> CPU contention.
>
> Use case:
> 1. collect context switch data for the system (or cpu) along with
> the stack trace:
> perf record -e cs -c 1 -ag -- <workload>
>
> 2. run the timehist command to show the time between schedule-in
> times (ie., how long did the task go between access to the CPU)
> as well as run times (ie., how long did the task run once scheduled)
> and the stack trace of the task when scheduled out:
>
> perf timehist
>
> time cpu task [tid/pid] b/n time run time
> ------------- ---- ----------------- --------- ---------
> 227959.352628 [02] make[17133] 0.004877 0.000285 do_wait sys_wait4 ...
> 227959.352762 [03] sh[17136] 0.000000 0.000135 __cond_resched _cond_resched ...
> 227959.352777 [03] migration/3[23] 0.003163 0.000014 smpboot_thread_fn kthread ...
> 227959.352801 [02] <idle> 0.000285 0.000173 cpu_idle start_secondary
> 227959.353552 [02] sh[17136] 0.000038 0.000750 do_exit do_group_exit ...
> 227959.353564 [04] <idle> 0.000014 0.001113 cpu_idle start_secondary
> 227959.353722 [04] make[17133] 0.000936 0.000157 do_exit do_group_exit ...
> 227959.353761 [06] <idle> 0.001210 0.021919 cpu_idle start_secondary
> 227959.353900 [06] make[17127] 0.001310 0.000139 wait_for_completion_killable do_fork ...
> 227959.353921 [03] <idle> 0.000150 0.001143 cpu_idle start_secondary
> 227959.353945 [03] make[17137] 0.000000 0.000023 __cond_resched _cond_resched ...
> 227959.353962 [03] migration/3[23] 0.001167 0.000017 smpboot_thread_fn kthread ret_from_fork
> 227959.353990 [02] <idle> 0.000750 0.000438 cpu_idle start_secondary
> 227959.354034 [03] <idle> 0.000041 0.000071 cpu_idle start_secondary
> 227959.354044 [03] rcu_sched[10] 0.002931 0.000010 rcu_gp_kthread kthread ret_from_fork
> 227959.354089 [06] <idle> 0.000139 0.000189 cpu_idle start_secondary
> 227959.354096 [06] rcu_sched[10] 0.000044 0.000006 rcu_gp_kthread kthread ret_from_fork
> 227959.354110 [06] make[17127] 0.000195 0.000014 pipe_wait pipe_read ...
> 227959.354382 [00] <idle> 0.000013 0.003307 cpu_idle rest_init ...
> 227959.354452 [00] qemu-kvm[25781/25777] 0.000000 0.000069 kvm_vcpu_block kvm_arch_vcpu_ioctl_run ...
> 227959.354498 [04] <idle> 0.000157 0.000776 cpu_idle start_secondary
>
> Additional options:
> i. --cpu-visual - useful for multi-core. Adds a field to highlight
> visually which cpu had an event
>
> ii. --summary - show a run time summary of how long each task ran over
> the time interval analyzed
>
> Runtime summary (times are in seconds)
> comm parent number run-time min-run max-run avg-run stddev(%)
> ...
> qemu-kvm[25777] -1 8 0.000398 0.000037 0.000076 0.000049 8.99
> qemu-kvm[25781/25777] -1 22 0.001532 0.000040 0.000193 0.000069 10.39
> qemu-kvm[25782/25777] -1 21 0.001536 0.000038 0.000255 0.000073 14.13
> sshd[28339] -1 12 0.001337 0.000073 0.000129 0.000111 4.69
> ...
>
> Terminated tasks:
> sh[17126] 17125 2 0.000778
> make[17125] 17124 3 0.002007
> sh[17129] 17128 2 0.000945
> sh[17130] 17128 2 0.001600
> make[17128] 17127 4 0.002272
> ...
>
> iii. --tree - show parent-child summary with runtimes:
>
> Parent-child relationships
> --------------------------
> ...
> perf[17124] 0.002964
> make[17125] 0.002007
> sh[17126] 0.000778
> make[17127] 0.101272
> make[17128] 0.002272
> sh[17129] 0.000945
> sh[17130] 0.001600
> make[17131] 0.001888
> make[17132] 0.000715
> make[17133] 0.002273
> sh[17134] 0.002230
> gcc[17135] 0.002527
> sh[17136] 0.000886
> make[17137] 0.002408
> sh[17138] 0.002420
> gcc[17139] 0.010634
> gcc[17140] 0.006208
>
> This command has been a work in progress for years and many more options
> and features will be added in time.

Looks very useful!

This is the kind of analysis I thought 'perf sched' might one day end up
doing.

Do you plan to submit it non-RFC, to be merged?

Your tool could take over the somewhat stale 'perf sched' command as well
btw., while possibly carrying over some of the perf sched features like
'replay'.

'perf sched record' enough events to support the ones you need:

> perf record -e cs -c 1 -ag -- <workload>

plus:

'perf sched sum[mary]' # show the summary
'perf sched tree' # could show the tree.
'perf sched hist[ogram]' # would show the histogram

i.e. feel free to take over 'perf sched' to suit your 'timehist' workflow
and logic. As the main (and somewhat overworked) author of 'perf sched' I
have no problem with you taking it over, at all.

You could also add a 'perf timehist' alias to the new 'perf sched'
command, to keep your original (and well established?) naming.

Thanks,

Ingo
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/