Re: [PATCH v3 3/4] bpf: Add libbpf logic for user-space ring buffer
From: Andrii Nakryiko
Date: Fri Sep 09 2022 - 18:51:04 EST
On Tue, Aug 30, 2022 at 6:42 AM David Vernet <void@xxxxxxxxxxxxx> wrote:
>
> On Wed, Aug 24, 2022 at 02:58:31PM -0700, Andrii Nakryiko wrote:
>
> [...]
>
> > > +LIBBPF_API struct user_ring_buffer *
> > > +user_ring_buffer__new(int map_fd, const struct user_ring_buffer_opts *opts);
> > > +LIBBPF_API void *user_ring_buffer__reserve(struct user_ring_buffer *rb,
> > > + __u32 size);
> > > +
> > > +LIBBPF_API void *user_ring_buffer__reserve_blocking(struct user_ring_buffer *rb,
> > > + __u32 size,
> > > + int timeout_ms);
> > > +LIBBPF_API void user_ring_buffer__submit(struct user_ring_buffer *rb,
> > > + void *sample);
> > > +LIBBPF_API void user_ring_buffer__discard(struct user_ring_buffer *rb,
> > > + void *sample);
> > > +LIBBPF_API void user_ring_buffer__free(struct user_ring_buffer *rb);
> > > +
[...]
> > > +void *user_ring_buffer__reserve_blocking(struct user_ring_buffer *rb, __u32 size, int timeout_ms)
> > > +{
> > > + int ms_elapsed = 0, err;
> > > + struct timespec start;
> > > +
> > > + if (timeout_ms < 0 && timeout_ms != -1)
> > > + return errno = EINVAL, NULL;
> > > +
> > > + if (timeout_ms != -1) {
> > > + err = clock_gettime(CLOCK_MONOTONIC, &start);
> > > + if (err)
> > > + return NULL;
> > > + }
> > > +
> > > + do {
> > > + int cnt, ms_remaining = timeout_ms - ms_elapsed;
> >
> > let's max(0, timeout_ms - ms_elapsed) to avoid negative ms_remaining
> > in some edge timing cases
>
> We actually want to have a negative ms_remaining if timeout_ms is -1. -1
> in epoll_wait() specifies an infinite timeout. If we were to round up to
> 0, it wouldn't block at all.
Then I think it's better to special-case timeout_ms == -1. My worry
here, as I mentioned, is an edge case in timing where ms_elapsed ends up
bigger than timeout_ms, so ms_remaining goes below 0 and we stay blocked
for a long time.

So I think it's best to pass `timeout_ms < 0 ? -1 : ms_remaining` and
still do the max(). But I haven't checked v5 yet, so if you've already
addressed this, it's fine.
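
To make the suggestion above concrete, here is a minimal sketch of the timeout computation. The helper name epoll_timeout_ms() is hypothetical and not part of the patch; it only assumes the semantics discussed in this thread, where -1 means "block indefinitely" and any finite remaining time is clamped at 0.

```c
/* Hypothetical helper (not in the patch): computes the value to pass as
 * the timeout argument of epoll_wait().
 *
 * - timeout_ms < 0 is treated as "no timeout" and maps to -1, which
 *   epoll_wait() interprets as blocking indefinitely.
 * - Otherwise the remaining time is clamped at 0, so overshooting the
 *   deadline yields a non-blocking poll instead of a negative timeout.
 */
static int epoll_timeout_ms(int timeout_ms, int ms_elapsed)
{
	int ms_remaining;

	if (timeout_ms < 0)
		return -1;

	ms_remaining = timeout_ms - ms_elapsed;
	return ms_remaining > 0 ? ms_remaining : 0;
}
```

With this shape, a caller that has overshot a 100 ms deadline by 50 ms gets 0 (poll once and return) rather than -50.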
>
> > > + void *sample;
> > > + struct timespec curr;
> > > +
> > > + sample = user_ring_buffer__reserve(rb, size);
> > > + if (sample)
> > > + return sample;
> > > + else if (errno != ENODATA)
> > > + return NULL;
> > > +
> > > + /* The kernel guarantees at least one event notification
> > > + * delivery whenever at least one sample is drained from the
> > > + * ringbuffer in an invocation to bpf_ringbuf_drain(). Other
> > > + * additional events may be delivered at any time, but only one
> > > + * event is guaranteed per bpf_ringbuf_drain() invocation,
> > > + * provided that a sample is drained, and the BPF program did
> > > + * not pass BPF_RB_NO_WAKEUP to bpf_ringbuf_drain().
> > > + */
> > > + cnt = epoll_wait(rb->epoll_fd, &rb->event, 1, ms_remaining);
> > > + if (cnt < 0)
> > > + return NULL;
> > > +
> > > + if (timeout_ms == -1)
> > > + continue;
> > > +
> > > + err = clock_gettime(CLOCK_MONOTONIC, &curr);
> > > + if (err)
> > > + return NULL;
> > > +
> > > + ms_elapsed = ms_elapsed_timespec(&start, &curr);
> > > + } while (ms_elapsed <= timeout_ms);
> >
> > let's simplify all the time keeping to use nanosecond timestamps and
> > only convert to ms when calling epoll_wait()? Then you can just have a
> > tiny helper to convert timespec to nanosecond ts ((u64)ts.tv_sec *
> > 1000000000 + ts.tv_nsec) and compare u64s directly. WDYT?
>
> Sounds like an improvement to me!
>
> Thanks,
> David
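
For reference, the nanosecond-based time keeping agreed on above could look roughly like the sketch below. The names ts_to_ns() and ns_remaining_to_ms() and the two constants are hypothetical and not taken from the patch; rounding up to whole milliseconds is an assumption made here so that epoll_wait() does not return just before the deadline.

```c
#include <stdint.h>
#include <time.h>

/* Illustrative constants, defined locally for this sketch. */
#define SKETCH_NSEC_PER_SEC  1000000000ULL
#define SKETCH_NSEC_PER_MSEC 1000000ULL

/* Hypothetical helper: converts a timespec to a nanosecond timestamp. */
static uint64_t ts_to_ns(const struct timespec *ts)
{
	return (uint64_t)ts->tv_sec * SKETCH_NSEC_PER_SEC + (uint64_t)ts->tv_nsec;
}

/* Hypothetical helper: milliseconds left until deadline_ns, rounded up
 * and clamped at 0 once the deadline has passed. The result is meant to
 * be passed to epoll_wait() for a finite timeout.
 */
static int ns_remaining_to_ms(uint64_t now_ns, uint64_t deadline_ns)
{
	if (now_ns >= deadline_ns)
		return 0;

	return (int)((deadline_ns - now_ns + SKETCH_NSEC_PER_MSEC - 1) /
		     SKETCH_NSEC_PER_MSEC);
}
```

The loop would then compute the deadline once as the start timestamp plus timeout_ms * 1000000 and compare u64 values directly, converting to milliseconds only at the epoll_wait() call.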