Re: [PATCH RFC] perf lock: Add new event "lock_acquired_waittime"for contention analysis

From: Hitoshi Mitake
Date: Sun Aug 01 2010 - 01:36:51 EST

On 07/29/10 01:39, Frederic Weisbecker wrote:
> On Sun, Jul 25, 2010 at 01:55:29AM +0900, Hitoshi Mitake wrote:
>> Recording every lock events produces too large overhead
>> both in CPU time and memory space.
>> This overhead is a big problem which should be solved,
>> but locks are important factor today.
>> So I'd like to add a new lock event "lock_acquired_waittime"
>> for contention analysis. lock_acquired() calculates time consumed
>> for waiting the locks with held_lock->waittime_stamp set by
>> If this value is 0, no contention occured. If this value is not 0,
>> there is occurence of contention.
>> So I inserted the branch based on this value into lock_acquired().
>> New lock_acquired() calls trace_lock_acquired() when waittime is 0,
>> and calls trace_lock_acquired_waittime(), function for the new event,
>> when waittime is not 0.
>> The traditional event, lock_acquired, is too high frequency event.
>> But the event lock_acquired_waittime is not so high frequency,
>> so overhead produced by tracing this event is not so large.
>> This is an example:
>> | mitake@X201i:~/linux/.../tools/perf% ./perf bench sched messaging
>> | # Running sched/messaging benchmark...
>> | # 20 sender and receiver processes per group
>> | # 10 groups == 400 processes run
>> |
>> | Total time: 1.712 [sec]
>> | mitake@X201i:~/linux/.../tools/perf% sudo ./perf record -R -f -m
1024 -c 1 -e lock:lock_acquired_waittime:r ./perf bench sched messaging
>> | # Running sched/messaging benchmark...
>> | # 20 sender and receiver processes per group
>> | # 10 groups == 400 processes run
>> |
>> | Total time: 1.754 [sec]
>> | [ perf record: Woken up 1 times to write data ]
>> | [ perf record: Captured and wrote 0.958 MB (~41837
samples) ]
>> "perf bench sched messaging" is a kernel bound program, but this result
>> shows that overhead is, at least on my Core i3 laptop, small enough
for discarding
>> especially from CPU time perspective.
>> This new event is not enough for analyzing complete lock behaviour.
>> But measuring quantity of time consumed for lock contention will be
>> So I think this patch is useful for real world lock analysis.
>> I'd like to write in new analyzer which use this event in Python.
>> Cc: Ingo Molnar<mingo@xxxxxxx>
>> Cc: Peter Zijlstra<a.p.zijlstra@xxxxxxxxx>
>> Cc: Paul Mackerras<paulus@xxxxxxxxx>
>> Cc: Arnaldo Carvalho de Melo<acme@xxxxxxxxxx>
>> Cc: Jens Axboe<jens.axboe@xxxxxxxxxx>
>> Cc: Jason Baron<jbaron@xxxxxxxxxx>
>> Cc: Xiao Guangrong<xiaoguangrong@xxxxxxxxxxxxxx>
>> Signed-off-by: Hitoshi Mitake<mitake@xxxxxxxxxxxxxxxxxxxxx>
>> ---
> But those who only want to trace the time waited can just trace
> lock_acquire and lock_acquired event. It seems it would be enough
> to reduce the lock tracing overhead a bit.
> Also the problem with:
> if (cond)
> trace_x()
> else
> trace_y()
> It adds a new condition, only used by tracing, that will always be
> checked, whenever the tracing is on or off. So we lose a cool feature
> of tracepoint here: any tracing code is hidden behind a boolean wall,
> so that the tracing off path is optimized with a single branch check:
> static inline void trace_##name(proto) \
> { \
> if (unlikely(__tracepoint_##name.state)) \
> __DO_TRACE(&__tracepoint_##name, \
> TP_PROTO(data_proto), \
> TP_ARGS(data_args)); \
> } \
> When tracing is off, ie: __tracepoint_##name.state == 0, the only
> extra code executed is the check of __tracepoint_##name.state.
> With your new tracepoint, you've added another branch that is
> always checked (unless !CONFIG_TRACEPOINTS, due to compiler
> optimizations).

Sorry, this branch is not a good choice.
But for my purpose, this tracing enable/disable feature is not enough.
Because I wanted to suppress the frequency of lock event for contention analysis,
so branch for judging occurence of contention had to be added.

> But well, that's not very important. Lockdep is already debugging
> code not enabled in production kernel (at least not supposed to).

I want to make lockdep production kernel feature with dynamic patching.
Current lock event tracing and in kernel lock statistics (/proc/lock_stat)
depend on lockdep. Overhead of these features must be devided into individual.

# BTW, I completely misunderstood the principle of lockdep,
# I might understood it recently :( So I can work on it now.

> The more important point is: do we really need this? Have you observed
> lost events with only lock_acquire + lock_acquired events?
> You can check lost events with: perf trace -d
> You can also tune the size of the buffers with the -m option in
> perf record. The default size of perf record is not enough for lock
> events.

The purpose of this patch is suppressing the frequency of events,
lost of events is out of scope of this patch.

> Optimizing the size of the records is definetly something we want. But
> I'm not sure this is the good way.
> Something like tuning the fields we want in a record seems to me a
> more generic and wide tracing optimization, a gain that would be
> propagated in every tracepoints and not only a specific and precise
> tracing workload on specific tracepoints.
> Look at this example:
> fweisbec@nowhere:~$ cat /debug/tracing/events/sched/sched_switch/format
> name: sched_switch
> ID: 29
> format:
> field:unsigned short common_type; offset:0; size:2; signed:0;
> field:unsigned char common_flags; offset:2; size:1; signed:0;
> field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
> field:int common_pid; offset:4; size:4; signed:1;
> field:int common_lock_depth; offset:8; size:4; signed:1;
> field:char prev_comm[TASK_COMM_LEN]; offset:12; size:16; signed:1;
> field:pid_t prev_pid; offset:28; size:4; signed:1;
> field:int prev_prio; offset:32; size:4; signed:1;
> field:long prev_state; offset:40; size:8; signed:1;
> field:char next_comm[TASK_COMM_LEN]; offset:48; size:16; signed:1;
> field:pid_t next_pid; offset:64; size:4; signed:1;
> field:int next_prio; offset:68; size:4; signed:1;
> print fmt: "prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s ==>
next_comm=%s next_pid=%d next_prio=%d",
> REC->prev_comm, REC->prev_pid, REC->prev_prio, REC->prev_state ?
__print_flags(REC->prev_state, "|", { 1, "S"} , {
> 2, "D" }, { 4, "T" }, { 8, "t" }, { 16, "Z" }, { 32, "X" }, { 64, "x"
}, { 128, "W" }) : "R", REC->next_comm,
> REC->next_pid, REC->next_prio
> In typical scheduler tracing workloads, we don't need:
> - common_flags (irq flags)
> - common_preempt_count (if preempt_count != 0 on sched switch, it's
a bug)
> - common_pid (prev_pid == common_pid, may be prev_pid should be
dropped in fact)
> - common_lock_depth (not used yet in perf tools, and the BKL is dying)
> - prio (not used yet in perf sched)
> - comm (we already have some comm resolution in perf)
> I just talking about some tracing workloads, typically perf sched
won't use the
> aboves. But other kinds of instrumentation may need the prio for example.
> I let you imagine how much space we can save if we can bring some
choice on the
> fields we want, depending on what the user wants and so.

Yeah, this will contribute to reducing overhead of tracing.

> If someone is willing to work on that, it would be definitely welcome.
> I think we could use a kind of mask of fields we want.
> Note I think that's easy for common_* fields as they are always the same
> and filled from tracing_generic_entry_update(), for the others it doesn't
> looks easy, we would need to add more macros inside TP_fast_assign().
> But having the support for common_* fields would be already a good
> thing.

After I finish current pending works like lockdep enabling/disabling,
and anyone don't do it, I'd like to work on it.
Because tracing overhead is really critical problem for lock events.
I want to try any method for reducing overhead.
# But I don't have enough power, this will be very slow :(
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at