Re: [PATCH 1/3] ptr_ring: batch ring zeroing

From: Michael S. Tsirkin
Date: Tue May 09 2017 - 09:33:29 EST


On Sat, Apr 08, 2017 at 02:14:08PM +0200, Jesper Dangaard Brouer wrote:
> On Fri, 7 Apr 2017 08:49:57 +0300
> "Michael S. Tsirkin" <mst@xxxxxxxxxx> wrote:
>
> > A known weakness in ptr_ring design is that it does not handle well the
> > situation when ring is almost full: as entries are consumed they are
> > immediately used again by the producer, so consumer and producer are
> > writing to a shared cache line.
> >
> > To fix this, add batching to consume calls: as entries are
> > consumed do not write NULL into the ring until we get
> > a multiple (in current implementation 2x) of cache lines
> > away from the producer. At that point, write them all out.
> >
> > We do the write out in the reverse order to keep
> > producer from sharing cache with consumer for as long
> > as possible.
> >
> > Writeout also triggers when ring wraps around - there's
> > no special reason to do this but it helps keep the code
> > a bit simpler.
> >
> > What should we do if getting away from producer by 2 cache lines
> > would mean we are keeping the ring moe than half empty?
> > Maybe we should reduce the batching in this case,
> > current patch simply reduces the batching.
> >
> > Notes:
> > - it is no longer true that a call to consume guarantees
> > that the following call to produce will succeed.
> > No users seem to assume that.
> > - batching can also in theory reduce the signalling rate:
> > users that would previously send interrups to the producer
> > to wake it up after consuming each entry would now only
> > need to do this once in a batch.
> > Doing this would be easy by returning a flag to the caller.
> > No users seem to do signalling on consume yet so this was not
> > implemented yet.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
> > ---
> >
> > Jason, I am curious whether the following gives you some of
> > the performance boost that you see with vhost batching
> > patches. Is vhost batching on top still helpful?
> >
> > include/linux/ptr_ring.h | 63 +++++++++++++++++++++++++++++++++++++++++-------
> > 1 file changed, 54 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/linux/ptr_ring.h b/include/linux/ptr_ring.h
> > index 6c70444..6b2e0dd 100644
> > --- a/include/linux/ptr_ring.h
> > +++ b/include/linux/ptr_ring.h
> > @@ -34,11 +34,13 @@
> > struct ptr_ring {
> > int producer ____cacheline_aligned_in_smp;
> > spinlock_t producer_lock;
> > - int consumer ____cacheline_aligned_in_smp;
> > + int consumer_head ____cacheline_aligned_in_smp; /* next valid entry */
> > + int consumer_tail; /* next entry to invalidate */
> > spinlock_t consumer_lock;
> > /* Shared consumer/producer data */
> > /* Read-only by both the producer and the consumer */
> > int size ____cacheline_aligned_in_smp; /* max entries in queue */
> > + int batch; /* number of entries to consume in a batch */
> > void **queue;
> > };
> >
> > @@ -170,7 +172,7 @@ static inline int ptr_ring_produce_bh(struct ptr_ring *r, void *ptr)
> > static inline void *__ptr_ring_peek(struct ptr_ring *r)
> > {
> > if (likely(r->size))
> > - return r->queue[r->consumer];
> > + return r->queue[r->consumer_head];
> > return NULL;
> > }
> >
> > @@ -231,9 +233,38 @@ static inline bool ptr_ring_empty_bh(struct ptr_ring *r)
> > /* Must only be called after __ptr_ring_peek returned !NULL */
> > static inline void __ptr_ring_discard_one(struct ptr_ring *r)
> > {
> > - r->queue[r->consumer++] = NULL;
> > - if (unlikely(r->consumer >= r->size))
> > - r->consumer = 0;
> > + /* Fundamentally, what we want to do is update consumer
> > + * index and zero out the entry so producer can reuse it.
> > + * Doing it naively at each consume would be as simple as:
> > + * r->queue[r->consumer++] = NULL;
> > + * if (unlikely(r->consumer >= r->size))
> > + * r->consumer = 0;
> > + * but that is suboptimal when the ring is full as producer is writing
> > + * out new entries in the same cache line. Defer these updates until a
> > + * batch of entries has been consumed.
> > + */
> > + int head = r->consumer_head++;
> > +
> > + /* Once we have processed enough entries invalidate them in
> > + * the ring all at once so producer can reuse their space in the ring.
> > + * We also do this when we reach end of the ring - not mandatory
> > + * but helps keep the implementation simple.
> > + */
> > + if (unlikely(r->consumer_head - r->consumer_tail >= r->batch ||
> > + r->consumer_head >= r->size)) {
> > + /* Zero out entries in the reverse order: this way we touch the
> > + * cache line that producer might currently be reading the last;
> > + * producer won't make progress and touch other cache lines
> > + * besides the first one until we write out all entries.
> > + */
> > + while (likely(head >= r->consumer_tail))
> > + r->queue[head--] = NULL;
> > + r->consumer_tail = r->consumer_head;
> > + }
> > + if (unlikely(r->consumer_head >= r->size)) {
> > + r->consumer_head = 0;
> > + r->consumer_tail = 0;
> > + }
> > }
>
> I love this idea. Reviewed and discussed the idea in-person with MST
> during netdevconf[1] at this laptop. I promised I will also run it
> through my micro-benchmarking[2] once I return home (hint ptr_ring gets
> used in network stack as skb_array).

I'm merging this through my tree. Any objections?

> Reviewed-by: Jesper Dangaard Brouer <brouer@xxxxxxxxxx>
>
> [1] http://netdevconf.org/2.1/
> [2] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/lib/skb_array_bench01.c
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer