Re: [PATCH v0] bpf: BPF based latency tracing

From: Daniel Wagner
Date: Fri Jun 19 2015 - 02:07:53 EST


On 06/18/2015 07:06 PM, Alexei Starovoitov wrote:
> On 6/18/15 4:40 AM, Daniel Wagner wrote:
>> BPF offers another way to generate latency histograms. We attach
>> kprobes at trace_preempt_off and trace_preempt_on and calculate the
>> time it takes from seeing the off transition to seeing the on
>> transition.
>>
>> The first array is used to store the start time stamp. The key is the
>> CPU id. The second array stores the log2(time diff). We need to use
>> static allocation here (arrays, not hash tables): the kprobes
>> hooking into trace_preempt_on|off must not call into any dynamic
>> memory allocation or free path, since we have to avoid getting
>> called recursively. Besides that, it reduces jitter in the measurement.
>>
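For reference, both maps are plain BPF arrays. A minimal sketch of the
definitions (my_map is the name used in the code further down; my_lat,
MAX_CPU and MAX_ENTRIES are placeholder names, not taken from the patch):

struct bpf_map_def SEC("maps") my_map = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(u64),
	.max_entries = MAX_CPU,			/* one start timestamp per CPU */
};

struct bpf_map_def SEC("maps") my_lat = {
	.type = BPF_MAP_TYPE_ARRAY,
	.key_size = sizeof(int),
	.value_size = sizeof(long),
	.max_entries = MAX_CPU * MAX_ENTRIES,	/* log2 buckets, per CPU */
};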
>> CPU 0
>> latency : count distribution
>> 1 -> 1 : 0 | |
>> 2 -> 3 : 0 | |
>> 4 -> 7 : 0 | |
>> 8 -> 15 : 0 | |
>> 16 -> 31 : 0 | |
>> 32 -> 63 : 0 | |
>> 64 -> 127 : 0 | |
>> 128 -> 255 : 0 | |
>> 256 -> 511 : 0 | |
>> 512 -> 1023 : 0 | |
>> 1024 -> 2047 : 0 | |
>> 2048 -> 4095 : 166723 |*************************************** |
>> 4096 -> 8191 : 19870 |*** |
>> 8192 -> 16383 : 6324 | |
>> 16384 -> 32767 : 1098 | |
>
> nice useful sample indeed!
> The numbers are non-JITed, right?

You are right. I forgot to enable the JIT compiler.
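For reference, the JIT can be toggled at run time via the sysctl knob:

  echo 1 > /proc/sys/net/core/bpf_jit_enable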

> JIT should reduce the measurement cost 2-3x, but preempt_on/off
> latency probably will stay in 2k range.

Here is an example output with JIT enabled after a few seconds.

latency : count distribution
1 -> 1 : 0 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 0 | |
32 -> 63 : 0 | |
64 -> 127 : 0 | |
128 -> 255 : 0 | |
256 -> 511 : 0 | |
512 -> 1023 : 0 | |
1024 -> 2047 : 0 | |
2048 -> 4095 : 11595 |*************************************** |
4096 -> 8191 : 3086 |********* |
8192 -> 16383 : 392 | |
16384 -> 32767 : 114 | |
32768 -> 65535 : 16 | |
65536 -> 131071 : 1 | |
131072 -> 262143 : 0 | |
262144 -> 524287 : 0 | |
524288 -> 1048575 : 34 | |

As you predicted, it won't go below the 2k range.

>> I am not sure if it is really worth spending more time getting
>> the hash tables working for the trace_preempt_[on|off] kprobes.
>> There are so many things which could go wrong, so going with
>> a static version seems like the right choice to me.
>
> agree. for this use case arrays are the better choice anyway.
> But I'll keep working on getting hash tables working even
> under these extreme conditions. bpf should always be rock solid.
>
> I'm only a bit suspicious of kprobes, since we have:
> NOKPROBE_SYMBOL(preempt_count_sub)
> but trace_preempt_on(), called by preempt_count_sub(),
> doesn't have this mark...

The original commit indicates that anything called from
preempt_disable() should also be marked as NOKPROBE_SYMBOL:

commit 43627582799db317e966ecb0002c2c3c9805ec0f
Author: Srinivasa Ds <srinivasa@xxxxxxxxxx> Sun Feb 24 00:24:04 2008
Committer: Linus Torvalds <torvalds@xxxxxxxxxxxxxxxxxxxxxxxxxx> Sun Feb 24 02:13:24 2008
Original File: kernel/sched.c

kprobes: refuse kprobe insertion on add/sub_preempt_counter()

Kprobes makes use of preempt_disable(), preempt_enable_noresched() and these
functions in turn call add/sub_preempt_count(). So we need to refuse users
from inserting probes into these functions.

This patch disallows users from probing add/sub_preempt_count().


Obviously, marking trace_preempt_[on|off] as NOKPROBE_SYMBOL as well
would render this patch useless.
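In other words, the consistent follow-up on the kernel side would be
something like (a sketch only, not part of this patch):

NOKPROBE_SYMBOL(trace_preempt_on);
NOKPROBE_SYMBOL(trace_preempt_off);

which would make register_kprobe() refuse to attach to exactly the two
functions this sample hooks.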

>> +SEC("kprobe/trace_preempt_off")

BTW, is there a reason for not supporting built-in
tracepoints/events? It looks like it is only an artificial
limitation of bpf_helpers.

>> +int bpf_prog1(struct pt_regs *ctx)
>> +{
>> + int cpu = bpf_get_smp_processor_id();
>> + u64 *ts = bpf_map_lookup_elem(&my_map, &cpu);
>> +
>> + if (ts)
>> + *ts = bpf_ktime_get_ns();
>
> btw, I'm planning to add native per-cpu maps which will
> speed things up further and reduce measurement overhead.

Funny I was about to suggest something like this :)
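Until then the program has to fold the CPU id into the array index and
use an atomic add on the shared counter. The on-side probe would look
roughly like this (a sketch; log2l() is an assumed small helper, my_lat
and MAX_ENTRIES come from the map sketch above):

SEC("kprobe/trace_preempt_on")
int bpf_prog2(struct pt_regs *ctx)
{
	int cpu = bpf_get_smp_processor_id();
	u64 *ts = bpf_map_lookup_elem(&my_map, &cpu);

	if (ts) {
		u64 delta = bpf_ktime_get_ns() - *ts;
		unsigned int idx = log2l(delta);
		int key;
		long *val;

		/* clamp overly large deltas into the last bucket */
		if (idx >= MAX_ENTRIES)
			idx = MAX_ENTRIES - 1;

		key = cpu * MAX_ENTRIES + idx;
		val = bpf_map_lookup_elem(&my_lat, &key);
		if (val)
			/* the atomic a native per-cpu map would avoid */
			__sync_fetch_and_add(val, 1);
	}
	return 0;
}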

> I think you can retarget this patch to net-next and send
> it to netdev. It's not too late for this merge window.

I'll rebase it to net-next.

cheers,
daniel
