Re: [PATCH bpf v2] bpf: preallocate a perf_sample_data per event fd

From: Matt Mullins
Date: Mon Jun 03 2019 - 23:23:00 EST


On Tue, 2019-06-04 at 02:43 +0200, Daniel Borkmann wrote:
> On 06/04/2019 01:54 AM, Alexei Starovoitov wrote:
> > On Mon, Jun 3, 2019 at 4:48 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:
> > > On 06/04/2019 01:27 AM, Alexei Starovoitov wrote:
> > > > On Mon, Jun 3, 2019 at 3:59 PM Matt Mullins <mmullins@xxxxxx> wrote:
> > > > >
> > > > > If these are invariably non-nested, I can easily keep bpf_misc_sd when
> > > > > I resubmit. There was no technical reason other than keeping the two
> > > > > codepaths as similar as possible.
> > > > >
> > > > > What resource gives you worry about doing this for the networking
> > > > > codepath?
> > > >
> > > > my preference would be to keep tracing and networking the same.
> > > > there is already minimal nesting in networking and probably we see
> > > > more when reuseport progs will start running from xdp and clsbpf
> > > >
> > > > > > Aside from that it's also really bad to miss events like this as exporting
> > > > > > through rb is critical. Why can't you have a per-CPU counter that selects a
> > > > > > sample data context based on nesting level in tracing? (I don't see a discussion
> > > > > > of this in your commit message.)
> > > > >
> > > > > This change would only drop messages if the same perf_event is
> > > > > attempted to be used recursively (i.e. the same CPU on the same
> > > > > PERF_EVENT_ARRAY map, as I haven't observed anything use index !=
> > > > > BPF_F_CURRENT_CPU in testing).
> > > > >
> > > > > I'll try to accomplish the same with a percpu nesting level and
> > > > > allocating 2 or 3 perf_sample_data per cpu. I think that'll solve the
> > > > > same problem -- a local patch keeping track of the nesting level is how
> > > > > I got the above stack trace, too.
> > > >
> > > > I don't think counter approach works. The amount of nesting is unknown.
> > > > imo the approach taken in this patch is good.
> > > > I don't see any issue when event_outputs will be dropped for valid progs.
> > > > Only when user called the helper incorrectly without BPF_F_CURRENT_CPU.
> > > > But that's an error anyway.
> > >
> > > My main worry with this xchg() trick is that we'll miss to export crucial
> > > data with the EBUSY bailing out especially given nesting could increase in
> > > future as you state, so users might have a hard time debugging this kind of
> > > issue if they share the same perf event map among these programs, and no
> > > option to get to this data otherwise. Supporting nesting up to a certain
> > > level would still be better than a lost event which is also not reported
> > > through the usual way aka perf rb.

Tracing can already be lossy: trace_call_bpf() silently simply doesn't
call the prog and instead returns zero if bpf_prog_active != 1.

> >
> > I simply don't see this 'miss to export data' in all but contrived conditions.
> > Say two progs share the same perf event array.
> > One prog calls event_output and while rb logic is working
> > another prog needs to start executing and use the same event array
>
> Correct.
>
> > slot. Today it's only possible for tracing prog combined with networking,
> > but having two progs use the same event output array is pretty much
> > a user bug. Just like not passing BPF_F_CURRENT_CPU.
>
> I don't see the user bug part, why should that be a user bug? It's the same
> as if we would say that sharing a BPF hash map between networking programs
> attached to different hooks or networking and tracing would be a user bug
> which it is not. One concrete example would be cilium monitor where we
> currently expose skb trace and drop events a well as debug data through
> the same rb. This should be usable from any type that has perf_event_output
> helper enabled (e.g. XDP and tc/BPF) w/o requiring to walk yet another per
> cpu mmap rb from user space.

Neither of these solutions would affect the behavior of sharing the
perf array between networking programs -- since they're never called in
a nested fashion, then you'll never hit the "xchg() returned NULL" at
all.

That said, I think I can logically limit nesting in tracing to 3
levels:
- a kprobe or raw tp or perf event,
- another one of the above that irq context happens to run, and
- if we're really unlucky, the same in nmi
at most one of which can be a kprobe or perf event.

There is also a comment

/*
* bpf_raw_tp_regs are separate from bpf_pt_regs used from skb/xdp
* to avoid potential recursive reuse issue when/if tracepoints are added
* inside bpf_*_event_output, bpf_get_stackid and/or bpf_get_stack
*/

that suggests that one day bpf_perf_event_output might grow a static
tracepoint. However, if the program attached to such a hypothetical
tracepoint were to call bpf_perf_event_output, that would infinitely
recurse ... it seems fine to let that case return -EBUSY as well. It
does make me wonder if I should do the same nesting for the pt_regs.

I've now got an experiment running with the counter approach, and the
workload that I hit the original crash with seems to be fine with 2
layers' worth -- though if we decide that's the one I should move
forward with, I'll probably bump it to a third to be safe for an NMI.