Re: [PATCH v3 2/4] bpf: Add bpf_user_ringbuf_drain() helper

From: Andrii Nakryiko
Date: Fri Sep 09 2022 - 18:45:40 EST


On Tue, Aug 30, 2022 at 6:28 AM David Vernet <void@xxxxxxxxxxxxx> wrote:
>
> On Wed, Aug 24, 2022 at 02:22:44PM -0700, Andrii Nakryiko wrote:
> > > +/* Maximum number of user-producer ringbuffer samples that can be drained in
> > > + * a call to bpf_user_ringbuf_drain().
> > > + */
> > > +#define BPF_MAX_USER_RINGBUF_SAMPLES BIT(17)
> >
> > nit: I don't think using BIT() is appropriate here. 128 * 1024 would
> > be better, IMO. This is not inherently required to be a single bit
> > constant.
>
> No problem, updated.
>
> > > +
> > > static inline u32 bpf_map_flags_to_cap(struct bpf_map *map)
> > > {
> > > u32 access_flags = map->map_flags & (BPF_F_RDONLY_PROG | BPF_F_WRONLY_PROG);
> > > @@ -2411,6 +2417,7 @@ extern const struct bpf_func_proto bpf_loop_proto;
> > > extern const struct bpf_func_proto bpf_copy_from_user_task_proto;
> > > extern const struct bpf_func_proto bpf_set_retval_proto;
> > > extern const struct bpf_func_proto bpf_get_retval_proto;
> > > +extern const struct bpf_func_proto bpf_user_ringbuf_drain_proto;
> > >

[...]

> > > +
> > > +static void __bpf_user_ringbuf_sample_release(struct bpf_ringbuf *rb, size_t size, u64 flags)
> > > +{
> > > + u64 producer_pos, consumer_pos;
> > > +
> > > + /* Synchronizes with smp_store_release() in user-space producer. */
> > > + producer_pos = smp_load_acquire(&rb->producer_pos);
> > > +
> > > + /* Using smp_load_acquire() is unnecessary here, as the busy-bit
> > > + * prevents another task from writing to consumer_pos after it was read
> > > + * by this task with smp_load_acquire() in __bpf_user_ringbuf_peek().
> > > + */
> > > + consumer_pos = rb->consumer_pos;
> > > + /* Synchronizes with smp_load_acquire() in user-space producer. */
> > > + smp_store_release(&rb->consumer_pos, consumer_pos + size + BPF_RINGBUF_HDR_SZ);
> > > +
> > > + /* Prevent the clearing of the busy-bit from being reordered before the
> > > + * storing of the updated rb->consumer_pos value.
> > > + */
> > > + smp_mb__before_atomic();
> > > + atomic_set(&rb->busy, 0);
> > > +
> > > + if (!(flags & BPF_RB_NO_WAKEUP)) {
> > > + /* As a heuristic, if the previously consumed sample caused the
> > > + * ringbuffer to no longer be full, send an event notification
> > > + * to any user-space producer that is epoll-waiting.
> > > + */
> > > + if (producer_pos - consumer_pos == ringbuf_total_data_sz(rb))
> >
> > I'm a bit confused here. This will be true only if user-space producer
> > filled out entire ringbuf data *exactly* to the last byte with a
> > single record. Or am I misunderstanding this?
>
> I think you're misunderstanding. This will indeed only be true if the ring
> buffer was full (to the last byte as you said) before the last sample was
> consumed, but it doesn't have to have been filled with a single record.
> We're just checking that producer_pos - consumer_pos is the total size of
> the ring buffer, but there can be many samples between consumer_pos and
> producer_pos for that to be the case.

You are right, never mind the single-sample part. But I don't think
that's the important part (it was just something that surprised me and
made everything seem even less realistic).

>
> > If my understanding is correct, how is this a realistic use case and
> > how does this heuristic help at all?
>
> Though I think you may have misunderstood the heuristic, some more
> explanation is probably warranted nonetheless. This heuristic being useful
> relies on two assumptions:
>
> 1. It will be common for user-space to publish statically sized samples.
>
> I think this one is pretty unambiguously true, especially considering that
> BPF_MAP_TYPE_RINGBUF was put to great use with statically sized samples for
> quite some time. I'm open to hearing why that might not be the case.

True, the majority of use cases for BPF ringbuf were fixed-sized, thanks
to the convenience of the reserve/commit API. But the data structure itself
allows variable-sized samples and there are use cases doing this, plus
with dynptr it's now easier to do variable-sized samples efficiently. So
special-casing for fixed-sized samples seems a bit off, especially
considering #2.

>
> 2. The size of the ring buffer is a multiple of the size of a sample.
>
> This one I think is a bit less clear. Users can always size the ring buffer
> to make sure this will be the case, but whether or not that will be
> commonly done is another story.

So I'm almost certain this won't be the case. I don't think anyone is
going to be tracking the exact size of the sample's struct (and it will
most probably change with time). Sizing the ringbuf to be both a
power-of-2 multiple of page_size *and* a multiple of sizeof(struct
my_ringbuf_sample) is something I don't see anyone doing.
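
To make the sizing argument concrete, here is a rough user-space sketch
(not kernel code) of when fixed-size samples can make
producer_pos - consumer_pos land exactly on the ring size. The helper
names are made up for illustration. The 8-byte header is assumed to match
BPF_RINGBUF_HDR_SZ, and rounding each record up to 8 bytes is assumed
from how BPF_MAP_TYPE_RINGBUF lays out records.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: mirrors BPF_RINGBUF_HDR_SZ (8 bytes). */
#define SKETCH_HDR_SZ 8u

/* Hypothetical helper: bytes one sample occupies in the ring, i.e.
 * header plus payload, rounded up to a multiple of 8 (assumed layout).
 */
static uint64_t sketch_record_size(uint64_t sample_sz)
{
	return (sample_sz + SKETCH_HDR_SZ + 7u) & ~(uint64_t)7u;
}

/* Hypothetical helper: true if a whole number of equally sized records
 * adds up to exactly ring_sz bytes, which is the only way the
 * "buffer was completely full" check can fire with fixed-size samples.
 */
static bool sketch_can_be_exactly_full(uint64_t ring_sz, uint64_t sample_sz)
{
	return ring_sz % sketch_record_size(sample_sz) == 0;
}
```

With a 4096-byte ring, a 56-byte sample (64-byte record) can fill the
ring exactly, but a 64-byte sample (72-byte record) never can. Since the
ring size is a power of 2, the record size has to be one as well, so
adding a single field to the struct is enough to break the property.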

>
> I'm fine with removing this heuristic for now if it's unclear that it's
> serving a common use-case. We can always add it back in later if we want
> to.

Yes, this looks quite out of place with a bunch of optimistic but
unrealistic assumptions. Doing one notification after drain will be
fine for now, IMO.

>
> > > + irq_work_queue(&rb->work);
> > > +
> > > + }
> > > +}
> > > +
> > > +BPF_CALL_4(bpf_user_ringbuf_drain, struct bpf_map *, map,
> > > + void *, callback_fn, void *, callback_ctx, u64, flags)
> > > +{
> > > + struct bpf_ringbuf *rb;
> > > + long num_samples = 0, ret = 0;
> > > + bpf_callback_t callback = (bpf_callback_t)callback_fn;
> > > + u64 wakeup_flags = BPF_RB_NO_WAKEUP;
> > > +
> > > + if (unlikely(flags & ~wakeup_flags))
> >

[...]