Re: [Question] How does perf still record the stack of a specified pid even when that process is interrupted and CPU is scheduled to other process

From: Rabin Vincent
Date: Tue May 05 2015 - 17:53:49 EST


On Sat, Apr 25, 2015 at 09:53:57AM -0600, David Ahern wrote:
> On 4/25/15 8:05 AM, Yunlong Song wrote:
> >But this only shows the system call like strace, but we want the call
> >stack of kernel functions in fact.
> >
> We haven't added the callchain option yet; on the to-do list.
>
> perf trace record -g -- iozone ...
> perf trace -i perf.data -s
> --> summary of system calls, max/min/average times
>
> perf trace -i perf.data --duration 10.0 -T
> --> note the timestamp where the write took a "long" time
>
> perf script
> --> search down to *around* the time of interest; you want the syscall
> entry; timestamp is for exit

Now if I understood the use case right, what Yulong Song wants to know
is what the iozone process is doing in the kernel (i.e. the stacktrace of why
exactly it goes to sleep / what it's waiting for) during these
sys_writes which take a long time.

The commands above will identify the sys_write which takes time but only
provide the stacktrace at the entry and exit of the syscall, but this do
not show why the process blocked or what it did inside the system call.

So a way to get what is required for the use case would be to make the
following changes to the above sequence:

(1) include the sched:* events when perf trace record is run

(2) around the time of interest, look at the kernel stack st the sched:switch
events between the entry and the exit. This will show what the process was
waiting for when it when it blocked. The stacktraces at the
stat_runtime events in the process may also be useful to understand what
was going on.

Example:

$ perf trace record -g -e sched:* -- dd if=/dev/zero of=/x bs=10M count=100 conv=fsync
$ perf trace -i perf.data -s

dd (147), 364 events, 94.3%, 0.000 msec

syscall calls min avg max stddev
--------------- -------- --------- --------- --------- ------
write 63 266.019 316.896 963.413 4.69%
...

$ perf trace -i perf.data --duration 960 -T
91916.354 (963.413 ms): dd/147 write(arg0: 1, arg1: 139729327423488, arg2: 10485760, arg3: 582, arg4: 100, arg5: 72340172838076673) = 10485760

$ perf script
...

dd 147 [002] 90.952941: raw_syscalls:sys_enter: NR 1 (1, 7f1544ed0000, a00000, 246, 64, 101010101010101)
211a00 syscall_trace_enter_phase2 ([kernel.kallsyms])
7d79c3 tracesys_phase2 ([kernel.kallsyms])
ed20 [unknown] (/lib/libpthread-2.16.so)

(... trimmed everything except the longest block: ...)

dd 147 [002] 91.204723: sched:sched_switch: prev_comm=dd prev_pid=147 prev_prio=120 prev_state=D ==> next_comm=swapper/2 next_pid=0 next_prio=120
7d1121 __schedule ([kernel.kallsyms])
7d1817 schedule ([kernel.kallsyms])
4578a5 jbd2_log_wait_commit ([kernel.kallsyms])
4546ff jbd2_log_do_checkpoint ([kernel.kallsyms])
45489f __jbd2_log_wait_for_space ([kernel.kallsyms])
44c2d8 start_this_handle ([kernel.kallsyms])
44c7b6 jbd2__journal_start ([kernel.kallsyms])
42fc3b __ext4_journal_start_sb ([kernel.kallsyms])
4012e9 ext4_write_begin ([kernel.kallsyms])
33795f generic_perform_write ([kernel.kallsyms])
338b68 __generic_file_write_iter ([kernel.kallsyms])
3f4371 ext4_file_write_iter ([kernel.kallsyms])
381ae1 __vfs_write ([kernel.kallsyms])
381f07 vfs_write ([kernel.kallsyms])
3825dd sys_write ([kernel.kallsyms])
7d7a1a tracesys_phase2 ([kernel.kallsyms])
ed20 [unknown] (/lib/libpthread-2.16.so)

swapper 0 [002] 91.875573: sched:sched_wakeup: comm=dd pid=147 prio=120 success=1 target_cpu=002
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/