Re: [RFC PATCH v4 10/29] bpf tools: Collect map definitions from 'maps' section

From: Alexei Starovoitov
Date: Thu May 28 2015 - 23:35:31 EST


On Thu, May 28, 2015 at 03:14:44PM +0800, Wangnan (F) wrote:
>
>
> On 2015/5/28 14:09, Alexei Starovoitov wrote:
> >On Thu, May 28, 2015 at 11:09:50AM +0800, Wangnan (F) wrote:
> >>However this breaks a rule in the current design: the opening phase doesn't
> >>talk to the kernel with sys_bpf() at all. All related stuff is done in the
> >>loading phase. This principle ensures that every system, whether or not it
> >>supports sys_bpf(), can read an eBPF object without failure.
> >I see, so you want 'parse elf' and 'create maps + load programs'
> >to be separate phases?
> >Fair enough. Then please add a call to release the information
> >collected from elf after program loading is done.
> >relocations and other things are not needed at that point.
>
> What about adding a flag to bpf_object__load() to tell it
> whether to clean up the resources it has taken? For example:
>
> int bpf_object__load(struct bpf_object *obj, bool clean);
>
> then we can further wrap it by a macro:
>
> #define bpf_object__load_clean(o) bpf_object__load(o, true)
>
> If 'clean' is true, resources will be freed after loading, and the same
> object cannot be loaded again after unload. By doing this we can
> avoid adding a new function.

imo that would be an ugly API. You only want to do that to have
one less library API function? I think it's cleaner to let user of
the library call it when necessary.
Or do cleaning unconditionally. I don't see a use case when the
same set of maps and programs would need to be loaded twice into the kernel.
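
A minimal sketch of the lifecycle discussed above, assuming a separate release
call rather than a flag on load. Every name here (sketch_object,
sketch_object__release_elf, and so on) is hypothetical and is not the actual
library API; the kernel interaction is only simulated by a flag.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the library object; fields are illustrative. */
struct sketch_object {
    void *elf_info; /* parse-time data such as relocations */
    bool loaded;    /* set once maps and programs would be in the kernel */
};

/* Opening phase: parse only, never talks to the kernel. */
struct sketch_object *sketch_object__open(void)
{
    struct sketch_object *obj = calloc(1, sizeof(*obj));

    if (!obj)
        return NULL;
    obj->elf_info = malloc(64); /* placeholder for parsed ELF data */
    if (!obj->elf_info) {
        free(obj);
        return NULL;
    }
    return obj;
}

/* Loading phase: needs the parse-time data, so it fails after release. */
int sketch_object__load(struct sketch_object *obj)
{
    if (!obj->elf_info)
        return -EINVAL;
    obj->loaded = true; /* real code would create maps and load programs */
    return 0;
}

/* Explicit cleanup the library user calls once loading is done. */
void sketch_object__release_elf(struct sketch_object *obj)
{
    free(obj->elf_info);
    obj->elf_info = NULL;
}

void sketch_object__close(struct sketch_object *obj)
{
    if (!obj)
        return;
    sketch_object__release_elf(obj);
    free(obj);
}
```

With this shape the caller decides when the ELF data goes away: open, load,
release, and any later load attempt on the same object returns -EINVAL.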

> >>Moreover, we are planning to introduce hardware PMU to eBPF in a way similar
> >>to maps, to give eBPF programs the ability to access hardware PMU counters.
> >>I haven't
> >that's very interesting. Please share more info when you can :)
> >If I understood it right, you want in-kernel bpf to do aggregation
> >and filtering of pmu counters ?
> >And computing a number of cache misses between two kprobe events?
> >I can see how I can use that to measure not only time
> >taken by syscall, but number of cache misses occurred due
> >to syscall. Sounds very useful!
>
> I'm glad to see you are also interested in it.
>
> Of course, filtering and aggregation based on PMU counters will be useful,
> but this is only our first goal.
>
> You know there are many useful PMUs provided by x86 and ARM64. Many people
> ask me if there is a way to record absolute PMU counter values when sampling,
> so they can measure IPC changes, cache miss rate, page faults and so on.
> Currently 'perf stat' is able to read PMU counters, but the cost is
> relatively high.
>
> For me, enabling eBPF programs to read PMU counters is the first thing that
> needs to be done.
> The other thing is enabling eBPF programs to bring some information into the
> perf sample.
>
> Here is an example to show my idea.
>
> I have a program which:
>
> int main()
> {
>     while (1) {
>         read(...);
>         /* do A */
>         write(...);
>         /* do B */
>     }
> }
>
> Then, by using the following script:
>
> SEC("enter=sys_write $outdata:u64")
> int enter_sys_write(...) {
>     u64 cycles_cnt = bpf_read_pmu(&cycles_pmu);
>     bpf_store_value(cycles_cnt);
>     return 1;
> }
>
> SEC("enter=sys_read $outdata:u64")
> int enter_sys_read(...) {
>     u64 cycles_cnt = bpf_read_pmu(&cycles_pmu);
>     bpf_store_value(cycles_cnt);
>     return 1;
> }
>
> With 'perf script', we can check the cycle counter at each point, and then
> compute the number of cycles between any two sampling points. This way we
> can compute how many cycles are taken by A and B. If the instruction counter
> is also recorded, we will know the IPC of A and B.

Agree. That's useful. That's exactly what I meant by
"compute a number of cache misses between two kprobe events".
The overhead is less when bpf program computes the cycle and instruction
delta, computes IPC and passes only final IPC numbers to the user space.
It can even average IPC over time.
For some very frequent events it can read cycle_cnt on sys_entry_read,
then read it on sys_exit_read, compute delta and average it into the map.
User space can read the map every second or every 10 seconds and print
a nice graph.
As far as 'bpf_store_value' goes... I was thinking to expose perf ring_buffer
to bpf programs, so that program can stream any data to perf that receives
it via mmap. Then you don't need this '$outdata' hack.
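
The delta-and-average scheme described above can be sketched in plain
user-space C. This is only an illustration of the arithmetic, not BPF code:
the struct stands in for a map value, and all names (delta_stats,
record_entry, record_exit, ipc_x1000) are hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-event accumulator, standing in for a BPF map value. */
struct delta_stats {
    uint64_t start;   /* counter value captured at entry */
    uint64_t total;   /* sum of all deltas seen so far */
    uint64_t samples; /* number of completed entry/exit pairs */
};

/* Called at the entry probe with the current counter reading. */
static void record_entry(struct delta_stats *s, uint64_t counter)
{
    s->start = counter;
}

/* Called at the exit probe; accumulates the delta since entry. */
static void record_exit(struct delta_stats *s, uint64_t counter)
{
    /* Unsigned subtraction stays correct across a counter wraparound. */
    s->total += counter - s->start;
    s->samples++;
}

/* What user space would compute when it polls the map. */
static uint64_t average_delta(const struct delta_stats *s)
{
    return s->samples ? s->total / s->samples : 0;
}

/* IPC scaled by 1000, since eBPF programs cannot use floating point. */
static uint64_t ipc_x1000(uint64_t insns, uint64_t cycles)
{
    return cycles ? insns * 1000 / cycles : 0;
}
```

Only the accumulated total and sample count need to cross into user space,
which is where the lower overhead compared with emitting one sample per event
would come from.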

> The above is still a casual idea. Currently I am focusing on bringing eBPF to
> perf. This should be the base for all the other interesting stuff.

Yes. I think this first step is almost ready.
Looking forward to your next set of patches.
Thanks!
