Re: [PATCH] bpf: Separate bpf_local_storage_lookup() fast and slow paths

From: Marco Elver
Date: Wed Feb 07 2024 - 04:56:40 EST


On Tue, Feb 06, 2024 at 05:22PM -0800, Martin KaFai Lau wrote:
> On 2/6/24 9:04 AM, Marco Elver wrote:
> > On Mon, Feb 05, 2024 at 03:24PM -0800, Martin KaFai Lau wrote:
> > [...]
> > > > Or can you suggest different functions to hook to for the recursion test?
> > >
> > > I'd prefer not to add another tracepoint for the selftest.
> >
> > Ok - I also checked: even though it should be a no-op, it wasn't
> > (the compiler generated worse code).
>
> I am interested in how the tracepoint generates worse code. Can you share
> some details?

My guess is that it adds enough code that some inlinable functions are no
longer inlined, specifically __bpf_task_storage_get().

> >
> > > The test in "SEC("fentry/bpf_local_storage_lookup")" is testing that the
> > > initial bpf_local_storage_lookup() should work and that the immediately
> > > recursed bpf_task_storage_delete() will fail.
> > >
> > > Depends on how the new slow-path function will look in v2. The test can
> > > probably be made to go through the slow path, e.g. by creating a lot of task
> > > storage maps before triggering the lookup.
[...]
> > Could you suggest how we can fix up the tests? I'm a little stuck
> > because there's not much left that we can hook into.
>
> I don't see a solution either if only the cache insertion code path is in a
> traceable function.
>
> The prog->active counter has already been covered in another test. This test
> is mostly only covering the lookup => delete recursion case, and the code path
> is contained within the bpf local storage logic. Future code review should be
> able to cover it. I would make an exception here and remove this test case,
> considering anything (e.g. a tracepoint) we do here is likely to make it
> worse. (more on the test removal below).
>
> >
> > Thanks,
> > -- Marco
> >
> > ------ >8 ------
> >
> > From: Marco Elver <elver@xxxxxxxxxx>
> > Date: Tue, 30 Jan 2024 17:57:45 +0100
> > Subject: [PATCH v2] bpf: Allow compiler to inline most of
> > bpf_local_storage_lookup()
> >
> > In various performance profiles of kernels with BPF programs attached,
> > bpf_local_storage_lookup() appears as a significant portion of CPU
> > cycles spent. To enable the compiler to generate more optimal code, turn
> > bpf_local_storage_lookup() into a static inline function, where only the
> > cache insertion code path is outlined (the call instruction can be elided
> > entirely if cacheit_lockit is a constant expression).
>
> Can you share more on why only putting the cache insertion code into a
> function improves the larger-number-of-maps case? In the benchmark,
> cacheit_lockit should always be true and __bpf_local_storage_insert_cache()
> should always be called.

Keeping bpf_local_storage_lookup() smaller (even if just by outlining the
cache insertion) makes a difference because it allows the compiler to
generate more optimal code; specifically, we avoid duplicating the setup
of calls to _raw_spin_lock/unlock. E.g. __bpf_task_storage_get() is no
longer inlined if bpf_local_storage_lookup() becomes too large (i.e. when
everything, incl. the cache insertion, is up for inlining).
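
To make that concrete, here is a rough sketch of the shape the v2 split
takes (not the exact patch hunks; helper names such as SDATA(),
bpf_rcu_lock_held() and selem_linked_to_storage() are the ones used by
the existing bpf_local_storage code):

/* Rough sketch: bpf_local_storage_lookup() becomes a static inline in
 * include/linux/bpf_local_storage.h, with only the cache insertion left
 * out of line in kernel/bpf/bpf_local_storage.c.
 */
void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
				      struct bpf_local_storage_map *smap,
				      struct bpf_local_storage_elem *selem);

static inline struct bpf_local_storage_data *
bpf_local_storage_lookup(struct bpf_local_storage *local_storage,
			 struct bpf_local_storage_map *smap,
			 bool cacheit_lockit)
{
	struct bpf_local_storage_data *sdata;
	struct bpf_local_storage_elem *selem;

	/* Fast path (cache hit) */
	sdata = rcu_dereference_check(local_storage->cache[smap->cache_idx],
				      bpf_rcu_lock_held());
	if (sdata && rcu_access_pointer(sdata->smap) == smap)
		return sdata;

	/* Slow path (cache miss) */
	hlist_for_each_entry_rcu(selem, &local_storage->list, snode,
				 rcu_read_lock_trace_held())
		if (rcu_access_pointer(SDATA(selem)->smap) == smap)
			break;

	if (!selem)
		return NULL;

	/* The call below can be elided entirely if cacheit_lockit is a
	 * constant false expression at the call site.
	 */
	if (cacheit_lockit)
		__bpf_local_storage_insert_cache(local_storage, smap, selem);

	return SDATA(selem);
}

The intent is that callers like __bpf_task_storage_get() can then fully
inline the fast and slow lookup paths, while the spinlock-protected
cache insertion stays behind a single call.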

Also, on x86 preempt builds, spin_lock/unlock aren't inlinable, so we
have to pay the price of 2 calls regardless: previously these were the
calls to _raw_spin_lock_irqsave and _raw_spin_unlock_irqrestore. However,
with the version of __bpf_local_storage_insert_cache in my patch, the
call to _raw_spin_unlock_irqrestore is in tail position, which allows the
compiler to perform TCO, i.e. we still only pay the price of 2 calls: one
to __bpf_local_storage_insert_cache and one to _raw_spin_lock_irqsave
(there is no call to _raw_spin_unlock_irqrestore, which can just be
jumped to):

<__bpf_local_storage_insert_cache>:
endbr64
nopl 0x0(%rax,%rax,1)
push %r15
push %r14
push %r12
push %rbx
mov %rdx,%rbx
mov %rsi,%r12
mov %rdi,%r15
lea 0xa8(%rdi),%r14
mov %r14,%rdi
call ffffffff82323650 <_raw_spin_lock_irqsave>
cmpq $0x0,0x18(%rbx)
je ffffffff8127ea80 <__bpf_local_storage_insert_cache+0x40>
add $0x40,%rbx
movzwl 0x10e(%r12),%ecx
mov %rbx,(%r15,%rcx,8)
mov %r14,%rdi
mov %rax,%rsi
pop %rbx
pop %r12
pop %r14
pop %r15
jmp ffffffff823237d0 <_raw_spin_unlock_irqrestore> <--- TCO
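
For reference, a C-level sketch of __bpf_local_storage_insert_cache()
consistent with the disassembly above (the exact v2 signature may
differ; field and helper names follow the existing bpf_local_storage
code):

void __bpf_local_storage_insert_cache(struct bpf_local_storage *local_storage,
				      struct bpf_local_storage_map *smap,
				      struct bpf_local_storage_elem *selem)
{
	unsigned long flags;

	/* The spinlock avoids racing with a parallel delete, which could
	 * otherwise publish an already-deleted sdata to the cache.
	 */
	raw_spin_lock_irqsave(&local_storage->lock, flags);
	if (selem_linked_to_storage(selem))
		rcu_assign_pointer(local_storage->cache[smap->cache_idx],
				   SDATA(selem));
	/* Last call in the function: the compiler can emit it as the tail
	 * jump to _raw_spin_unlock_irqrestore seen above.
	 */
	raw_spin_unlock_irqrestore(&local_storage->lock, flags);
}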

I also compared a version where _everything_ is inlined vs. the one with
__bpf_local_storage_insert_cache outlined: inlining everything nullifies
any performance improvement and is significantly worse than keeping
__bpf_local_storage_insert_cache outlined.

[...]
> > -SEC("fentry/bpf_local_storage_lookup")
> > +SEC("fentry/??????????????????????????") > int BPF_PROG(on_lookup)
>
> Remove this BPF_PROG.
>
> > {
> > struct task_struct *task = bpf_get_current_task_btf();
> > diff --git a/tools/testing/selftests/bpf/progs/task_ls_recursion.c b/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > index 4542dc683b44..d73b33a4c153 100644
> > --- a/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > +++ b/tools/testing/selftests/bpf/progs/task_ls_recursion.c
> > @@ -27,7 +27,7 @@ struct {
> > __type(value, long);
> > } map_b SEC(".maps");
> > -SEC("fentry/bpf_local_storage_lookup")
> > +SEC("fentry/??????????????????????????")
>
> Same here. The checks related to on_lookup in
> prog_tests/task_local_storage.c need to be removed also.
>
> > int BPF_PROG(on_lookup)
> > {
> > struct task_struct *task = bpf_get_current_task_btf();
>

Thanks,
-- Marco