Re: [RFC bpf-next 0/2] sharing bpf runtime stats with /dev/bpf_stats
From: Song Liu
Date: Fri Mar 13 2020 - 23:48:19 EST
> On Mar 13, 2020, at 7:43 PM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Fri, Mar 13, 2020 at 05:35:16PM -0700, Song Liu wrote:
>> Motivation (copied from 2/2):
>>
>> ======================= 8< =======================
>> Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
>> Typical userspace tools use kernel.bpf_stats_enabled as follows:
>>
>> 1. Enable kernel.bpf_stats_enabled;
>> 2. Check program run_time_ns;
>> 3. Sleep for the monitoring period;
>> 4. Check program run_time_ns again, calculate the difference;
>> 5. Disable kernel.bpf_stats_enabled.
>>
>> The problem with this approach is that only one userspace tool can toggle
>> this sysctl. If multiple tools toggle the sysctl at the same time, the
>> measurement may be inaccurate.
>>
>> To fix this problem while keeping backward compatibility, introduce
>> /dev/bpf_stats. sysctl kernel.bpf_stats_enabled will only change the
>> lowest bit of the static key. /dev/bpf_stats, on the other hand, adds 2
>> to the static key for each open fd. The runtime stats are enabled when
>> kernel.bpf_stats_enabled == 1 or there is an open fd to /dev/bpf_stats.
>>
>> With /dev/bpf_stats, a userspace tool would have the following flow:
>>
>> 1. Open a fd to /dev/bpf_stats;
>> 2. Check program run_time_ns;
>> 3. Sleep for the monitoring period;
>> 4. Check program run_time_ns again, calculate the difference;
>> 5. Close the fd.
>> ======================= 8< =======================
>>
>> 1/2 adds a few new APIs to jump_label.
>> 2/2 adds /dev/bpf_stats and adjusts the kernel.bpf_stats_enabled handler.
>>
>> Please share your comments.
>
> Conceptually makes sense to me. A few comments:
> 1. I don't understand why +2 logic is necessary.
> Just do +1 for every FD and change proc_do_static_key() from doing
> explicit enable/disable to do +1/-1 as well on transition from 0->1 and 1->0.
> The handler would need to check that 1->1 and 0->0 is a nop.
With the +2/-2 logic, we use the lowest bit of the counter to remember
the value of the sysctl. Otherwise, we cannot tell whether a write of 1
is a 0->1 transition or a 1->1 transition.
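
A minimal userspace sketch of this encoding, in Python rather than kernel C.
StatsKey and its methods are hypothetical names for illustration only, not
the jump_label API from 1/2:

```python
class StatsKey:
    """Userspace model of the static-key counter described above.

    Bit 0 mirrors the sysctl value; each open fd contributes 2.
    All names are illustrative assumptions, not kernel APIs.
    """

    def __init__(self):
        self.count = 0

    def sysctl_write(self, value):
        # Bit 0 remembers the previous sysctl value, so 0->0 and 1->1
        # writes change nothing.
        old = self.count & 1
        new = 1 if value else 0
        self.count += new - old

    def fd_open(self):
        self.count += 2

    def fd_close(self):
        self.count -= 2

    def enabled(self):
        # Stats are on while the sysctl bit is set or any fd is open.
        return self.count > 0


key = StatsKey()
key.sysctl_write(1)
key.sysctl_write(1)   # 1->1: detected via bit 0, counter unchanged
key.fd_open()
key.fd_open()
print(key.count, key.enabled())   # 5 True
key.sysctl_write(0)
key.fd_close()
key.fd_close()
print(key.count, key.enabled())   # 0 False
```

Without bit 0, the handler would need some other place to store the last
sysctl value in order to treat a repeated write as a nop.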
>
> 2. /dev is kinda awkward. Maybe introduce a new bpf command that returns an fd?
Yeah, I also feel /dev is awkward. An fd from a bpf command sounds great.
>
> 3. Instead of 1 and 2 tweak sysctl to do ++/-- unconditionally?
> Like repeated sysctl kernel.bpf_stats_enabled=1 will keep incrementing it
> and would need equal amount of sysctl kernel.bpf_stats_enabled=0 to get
> it back to zero where it will stay zero even if users keep spamming
> sysctl kernel.bpf_stats_enabled=0.
> This way current services that use sysctl will keep working as-is.
> Multiple services that currently collide on sysctl will magically start
> working without any changes to them. It is still backwards compatible.
I think this is not fully backwards compatible. With the current logic, the
following sequence ends with stats disabled:

  sysctl kernel.bpf_stats_enabled=1
  sysctl kernel.bpf_stats_enabled=1
  sysctl kernel.bpf_stats_enabled=0

The same sequence will not disable stats with the ++/-- sysctl.
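
To make the difference concrete, here is a small Python model of the two
behaviours. Both functions are illustrative assumptions about the semantics
discussed in this thread, not kernel code:

```python
def boolean_sysctl(writes):
    """Current behaviour: the last written value wins."""
    enabled = False
    for value in writes:
        enabled = bool(value)
    return enabled


def counting_sysctl(writes):
    """Proposed ++/-- behaviour: 1 increments, 0 decrements, floor at zero."""
    count = 0
    for value in writes:
        if value:
            count += 1
        elif count > 0:
            count -= 1
    return count > 0


sequence = [1, 1, 0]
print(boolean_sysctl(sequence))   # False: stats end up disabled
print(counting_sysctl(sequence))  # True: one increment is still outstanding
```

A tool that writes 1 twice and 0 once would leave stats enabled under the
counting scheme, which is the behaviour change I am worried about.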
Thanks,
Song