Re: [RFC bpf-next 0/2] sharing bpf runtime stats with /dev/bpf_stats
From: Song Liu
Date: Mon Mar 16 2020 - 01:52:52 EST
> On Mar 14, 2020, at 8:57 AM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>
> On Sat, Mar 14, 2020 at 03:47:50AM +0000, Song Liu wrote:
>>
>>
>>> On Mar 13, 2020, at 7:43 PM, Alexei Starovoitov <alexei.starovoitov@xxxxxxxxx> wrote:
>>>
>>> On Fri, Mar 13, 2020 at 05:35:16PM -0700, Song Liu wrote:
>>>> Motivation (copied from 2/2):
>>>>
>>>> ======================= 8< =======================
>>>> Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
>>>> Typical userspace tools use kernel.bpf_stats_enabled as follows:
>>>>
>>>> 1. Enable kernel.bpf_stats_enabled;
>>>> 2. Check program run_time_ns;
>>>> 3. Sleep for the monitoring period;
>>>> 4. Check program run_time_ns again, calculate the difference;
>>>> 5. Disable kernel.bpf_stats_enabled.
>>>>
>>>> The problem with this approach is that only one userspace tool can toggle
>>>> this sysctl. If multiple tools toggle the sysctl at the same time, the
>>>> measurement may be inaccurate.
>>>>
>>>> To fix this problem while keep backward compatibility, introduce
>>>> /dev/bpf_stats. sysctl kernel.bpf_stats_enabled will only change the
>>>> lowest bit of the static key. /dev/bpf_stats, on the other hand, adds 2
>>>> to the static key for each open fd. The runtime stats is enabled when
>>>> kernel.bpf_stats_enabled == 1 or there is open fd to /dev/bpf_stats.
>>>>
>>>> With /dev/bpf_stats, user space tool would have the following flow:
>>>>
>>>> 1. Open a fd to /dev/bpf_stats;
>>>> 2. Check program run_time_ns;
>>>> 3. Sleep for the monitoring period;
>>>> 4. Check program run_time_ns again, calculate the difference;
>>>> 5. Close the fd.
>>>> ======================= 8< =======================
>>>>
>>>> 1/2 adds a few new API to jump_label.
>>>> 2/2 adds the /dev/bpf_stats and adjust kernel.bpf_stats_enabled handler.
>>>>
>>>> Please share your comments.
>>>
>>> Conceptually makes sense to me. Few comments:
>>> 1. I don't understand why +2 logic is necessary.
>>> Just do +1 for every FD and change proc_do_static_key() from doing
>>> explicit enable/disable to do +1/-1 as well on transition from 0->1 and 1->0.
>>> The handler would need to check that 1->1 and 0->0 is a nop.
>>
>> With the +2/-2 logic, we use the lowest bit of the counter to remember
>> the value of the sysctl. Otherwise, we cannot tell whether we are making
>> 0->1 transition or 1->1 transition.
>
> that can be another static int var in the handler.
> and no need for patch 1.
Good idea!
>
>>>
>>> 2. /dev is kinda awkward. May be introduce a new bpf command that returns fd?
>>
>> Yeah, I also feel /dev is awkward. fd from bpf command sounds great.
>>
>>>
>>> 3. Instead of 1 and 2 tweak sysctl to do ++/-- unconditionally?
>>> Like repeated sysctl kernel.bpf_stats_enabled=1 will keep incrementing it
>>> and would need equal amount of sysctl kernel.bpf_stats_enabled=0 to get
>>> it back to zero where it will stay zero even if users keep spamming
>>> sysctl kernel.bpf_stats_enabled=0.
>>> This way current services that use sysctl will keep working as-is.
>>> Multiple services that currently collide on sysctl will magically start
>>> working without any changes to them. It is still backwards compatible.
>>
>> I think this is not fully backwards compatible. With current logic, the
>> following sequence disables stats eventually.
>>
>> sysctl kernel.bpf_stats_enabled=1
>> sysctl kernel.bpf_stats_enabled=1
>> sysctl kernel.bpf_stats_enabled=0
>>
>> The same sequence will not disable stats with the ++/-- sysctl.
>
> sure, but if a process holding an fd 'sysctl kernel.bpf_stats_enabled=0'
> won't disable stats either. So it's also not backwards compatible. imo it's a
> change in behavior whichever way, but either approach doesn't break user space.
> An advantage of not doing an fd is that some user that really wants to have
> stats disabled for performance benchmarking can do
> 'sysctl kernel.bpf_stats_enabled=0' few times and the stats will be off.
> We can also make 'sysctl kernel.bpf_stats_enabled' to return current counter,
> so humans can see how many daemons are doing stats collection at that very
> moment.
> We can also do both new fd via bpf syscall and ++/-- via sysctl, but imo
> ++/-- via sysctl is enough to address the issue of multiple stats collecting
> daemons. The patch would be small enough that we can push it via bpf tree
> and into older kernels as arguable 'fix'.
For multiple tools to share run_time_ns stats, I think we need to address two
issues:
1. run_time_ns is turned off when the tool is monitoring;
2. run_time_ns is left on when no one is using it.
On the the other hand, when the tool is not monitoring run_time_ns, it should
not care whether the kernel is counting it.
Currently, we have both problem #1 and #2. ++/-- sysctl will help solve #1, but
won't help #2, e.g., when the tool crashes. fd solution also solves #2.
Also, I think sysctl should not do ++/--? Most (all?) sysctl just does "set the
value", no?
Eventually, we should ask all tools to use the fd interface, and deprecate the
sysctl. For backward compatibility, we can have one legacy tool (using sysctl)
and multiple new tools (using fd interface) share run_time_ns. I feel this is
almost fully backward compatible, because when the tool is not monitoring it
should not care whether run_time_ns is on.
Thanks,
Song