[EXPERIENCE] My experience on using perf record BPF filter on a real usecase

From: Wangnan (F)
Date: Thu Jun 04 2015 - 06:21:18 EST

Next message: Christoffer Dall: "Re: [PATCH v5 08/12] KVM: arm64: re-factor hyp.S debug register code"
Previous message: David Jander: "[RFC PATCH 2/2] mmc: core.c: Add comment to clarify special cases of ERASE/TRIM"
In reply to: Arnaldo Carvalho de Melo: "Re: [GIT PULL 0/6] perf/core improvements and fixes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi all,

I'd like to share my exprience on using 'perf record' BPF filter in a
real usecase to show the power and shortcome in my patch series:

https://lkml.kernel.org/r/1433144296-74992-1-git-send-email-wangnan0@xxxxxxxxxx

and other works on eBPF.

My usecase shows that such filter is useful. Also, I hope it can help us
to find way to further improve it.

My task is to find the reason why iozone test result is bad on some
specific cases. The development environment is a x86_64 server, the target
machine is a smartphone with Android. By previous analysis I have
already got some useful information:

1. iozone computes bandwidth by averaging time of each sys_write.

2. In our case, 1% sys_write takes 75% of total time, so what I need
to do now should be finding the reason why those sys_write take so
long.

3. By sampling call stack on sched:sched_switch, I find that those
sys_write calls lock_page() and blocks on it.

I decide to use BPF filter to find the other side of this locking
contention. The idea is simple:

1. For all calls of lock_page(), probe at entry and exit points of
it. Measure the execution time of the lock_page() call. If it takes
too long (longer than 0.1 second) then there should have a lock
contention. Take the sample at exit point.

2. For all calls of unlock_page(), if the page is acquiring by other
on at least 0.1 second before, take a sample at this point.

Currently making the above idea work is possible but not very
straightforward. One problem I can identify is:

Doesn't like ftrace, there is no way for eBPF program to access call
stack information. Without extra information, eBPF programs are
unable to match lock_page events and corresponding lock_page%return
events. Currently the only way for passing information between
programs are maps. To simulate call stack matching, I create a
BPF_FUNC_git_tid() which returns current->pid, and a
proc_locking_page_map map which records the acquired page and time
of calling lock_page.

Another problem is: at the entry of lock_page() and
unlock_page(), for fetching the page pointer I have to directly
use 'ctx->regs[0]' (I am on aarch64). Which is not protable.

The final program I used is attached at the bottom of this email. It
costs more than 100 lines of code. I have to do some debugging to
ensure it works correctly on a virtual machine.

It is compiled using:

# $CLANG ${INCLUDE} -D__KERNEL__ -Wno-unused-value -Wno-pointer-sign -O2 \
-emit-llvm -c lock_page.c -o - | $LLC -march=bpf -filetype=obj -o \
lock_page.o

Then the lock_page.o is transfered onto target system.

I loaded it using following command:

# perf record -e syscalls:sys_enter_write -e syscalls:sys_exit_write \
-e lock_page.o -a iozone ...

Here is another inconvenience. Currently I only concern on write
syscall issued by iozone. However, without '-a' I'm unable to collect
information of the locker. If I want to filter sys_{enter,exit}_write
belong to iozone out using eBPF, I need to implement another function
like BPF_FUNC_git_comm. Another method is to use perf '--filter' after
the two events. However it looks strange to use two filter mechanisms
together. This time I choose to do filtering offline using perf script.

The result is resonable. Finaly I found the two side of lock contention.
It shows the way to improve. I'm sorry I can't share the call stack in
this list.

One inconvenience in this stage is: the information is
printed into ring buffer while the samples are stored into perf.data.
By analysing perf.data without ftrace ring buffer I don't know how long
the lock_page() cost becasue I don't sample at the entry of
lock_page().

The final part is the BPF program I used. I think there should have
better way to do it. If any know how to make it shorter please let me
know.

Thank you.

/* ------------- START OF BPF PROGRAM ------------- */
/* __lock_page pass to unlock_page, key is pid */
struct proc_locking_page {
unsigned long page;
unsigned long time;
};

struct bpf_map_def SEC("maps") proc_locking_page_map = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(unsigned long),
.value_size = sizeof(struct proc_locking_page),
.max_entries = 1000000,
};

/* from page get pid */
struct page_being_locked_by_proc {
unsigned long tid;
unsigned long time;
};

struct bpf_map_def SEC("maps") page_being_locked_by_proc_map = {
.type = BPF_MAP_TYPE_HASH,
.key_size = sizeof(unsigned long),
.value_size = sizeof(struct page_being_locked_by_proc),
.max_entries = 1000000,
};

SEC("lock_page=__lock_page")
int lock_page_recorder(struct pt_regs *ctx)
{
unsigned long tid = bpf_get_tid();
unsigned long page = ctx->regs[0];
unsigned long curr_ns = bpf_ktime_get_ns();

struct proc_locking_page locking_page;
struct page_being_locked_by_proc being_locked;

locking_page.page = page;
locking_page.time = curr_ns;
being_locked.tid = tid;
being_locked.time = curr_ns;

bpf_map_update_elem(&proc_locking_page_map, &tid, &locking_page, BPF_ANY);
bpf_map_update_elem(&page_being_locked_by_proc_map, &page, &being_locked, BPF_ANY);
return 0;
}

SEC("lock_page_ret=__lock_page%return")
int lock_page_return_recorder(struct pt_regs *ctx)
{
unsigned long tid = bpf_get_tid();
unsigned long curr_ns = bpf_ktime_get_ns();
unsigned long page;
unsigned long diff_time;
struct proc_locking_page *p_locking_page;

p_locking_page = bpf_map_lookup_elem(&proc_locking_page_map, &tid);

/* BAD!! */
if (!p_locking_page)
return 0;

page = p_locking_page->page;
diff_time = curr_ns - p_locking_page->time;
bpf_map_delete_elem(&proc_locking_page_map, &tid);
bpf_map_delete_elem(&page_being_locked_by_proc_map, &page);

if (diff_time > 10000000) {
char fmt[] = "tid %d get page %lx using %d ns\n";
bpf_trace_printk(fmt, sizeof(fmt), tid, page, diff_time);
return 1;
}

return 0;
}

SEC("unlock_page=unlock_page")
int unlock_page_recorder(struct pt_regs *ctx)
{
unsigned long tid = bpf_get_tid();
unsigned long page = ctx->regs[0];
unsigned long time = bpf_ktime_get_ns();
unsigned long diff_time;
struct page_being_locked_by_proc *p_being_locked;
char fmt[] = "%d vs %d, %d ns\n";

p_being_locked = bpf_map_lookup_elem(&page_being_locked_by_proc_map, &page);
if (!p_being_locked)
return 0;
diff_time = time - p_being_locked->time;
if (diff_time > 10000000) {
bpf_trace_printk(fmt, sizeof(fmt), tid, p_being_locked->tid, diff_time);
return 1;
}
return 0;
}
/* ------------- END OF BPF PROGRAM ------------- */

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Christoffer Dall: "Re: [PATCH v5 08/12] KVM: arm64: re-factor hyp.S debug register code"
Previous message: David Jander: "[RFC PATCH 2/2] mmc: core.c: Add comment to clarify special cases of ERASE/TRIM"
In reply to: Arnaldo Carvalho de Melo: "Re: [GIT PULL 0/6] perf/core improvements and fixes"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]