Re: [PATCH 1/3] ptr_ring: batch ring zeroing

From: Jason Wang
Date: Mon Apr 17 2017 - 22:17:38 EST




On 2017/04/15 05:00, Michael S. Tsirkin wrote:
On Fri, Apr 14, 2017 at 03:52:23PM +0800, Jason Wang wrote:

On 2017/04/12 16:03, Jason Wang wrote:

On 2017/04/07 13:49, Michael S. Tsirkin wrote:
A known weakness in ptr_ring design is that it does not handle well the
situation when ring is almost full: as entries are consumed they are
immediately used again by the producer, so consumer and producer are
writing to a shared cache line.

To fix this, add batching to consume calls: as entries are
consumed do not write NULL into the ring until we get
a multiple (in current implementation 2x) of cache lines
away from the producer. At that point, write them all out.

We do the write out in the reverse order to keep
producer from sharing cache with consumer for as long
as possible.

Writeout also triggers when ring wraps around - there's
no special reason to do this but it helps keep the code
a bit simpler.
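
Concretely, the discard path could look roughly like this (a minimal
sketch; __ptr_ring_discard_one and the consumer_head/consumer_tail/batch
fields follow the description above, but details may differ from the
actual patch):

/* A minimal sketch of the batched discard: instead of writing NULL
 * at every consume, delay the writes and flush a whole batch at once.
 */
static inline void __ptr_ring_discard_one(struct ptr_ring *r)
{
	int consumer_head = r->consumer_head;
	int head = consumer_head++;

	/* Write NULLs out only once we are a full batch away from the
	 * last flushed entry, or when the ring wraps around.
	 */
	if (unlikely(consumer_head - r->consumer_tail >= r->batch ||
		     consumer_head >= r->size)) {
		/* Zero in reverse order: the entry nearest the producer
		 * is written last, so we share its cache line for as
		 * short a time as possible.
		 */
		while (likely(head >= r->consumer_tail))
			r->queue[head--] = NULL;
		r->consumer_tail = consumer_head;
	}
	if (unlikely(consumer_head >= r->size)) {
		consumer_head = 0;
		r->consumer_tail = 0;
	}
	r->consumer_head = consumer_head;
}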

What should we do if getting away from the producer by 2 cache
lines would mean we are keeping the ring more than half empty?
Maybe that case deserves smarter handling; the current patch
simply reduces the batching.

Notes:
- it is no longer true that a call to consume guarantees
that the following call to produce will succeed.
No users seem to assume that.
- batching can also, in theory, reduce the signalling rate:
users that would previously send interrupts to the producer
to wake it up after consuming each entry would now only
need to do this once per batch.
Doing this would be easy by returning a flag to the caller.
No users seem to do signalling on consume yet, so this has
not been implemented; a sketch of the idea follows the notes.
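
For illustration, the flag idea in the last note could look something
like this (a hypothetical helper, not part of this patch; it reuses the
consumer-side helpers from ptr_ring.h):

/* Hypothetical: consume one entry and report whether a batch was
 * flushed, so a caller that signals the producer can do so once per
 * batch rather than once per entry.
 */
static inline void *ptr_ring_consume_flag(struct ptr_ring *r,
					  bool *flushed)
{
	void *ptr;
	int tail;

	spin_lock(&r->consumer_lock);
	tail = r->consumer_tail;
	ptr = __ptr_ring_peek(r);
	if (ptr)
		__ptr_ring_discard_one(r);
	/* consumer_tail only advances when entries are written out */
	*flushed = r->consumer_tail != tail;
	spin_unlock(&r->consumer_lock);

	return ptr;
}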

Signed-off-by: Michael S. Tsirkin <mst@xxxxxxxxxx>
---

Jason, I am curious whether the following gives you some of
the performance boost that you see with vhost batching
patches. Is vhost batching on top still helpful?
The patch looks good to me; I will test the vhost batching patches on top.

Thanks
Still helpful:

before this patch: 1.84Mpps
with this patch: 2.00Mpps
with batch dequeuing: 2.30Mpps

Acked-by: Jason Wang <jasowang@xxxxxxxxxx>

Thanks
Fascinating. How do we explain the gain with batch dequeue?

I counted the drop rate (pktgen on tun, counting tun TX) and maybe it can explain things more or less:

                       TX xmit   TX dropped   TX total  Dropped
Before this patch:     1.80Mpps  0.23Mpps     2.04Mpps  11%
After this patch:      1.95Mpps  0.33Mpps     2.28Mpps  14%
With batch dequeuing:  2.24Mpps  0.01Mpps     2.26Mpps  ~0%

With this patch, we more or less remove the cache contention by blocking the producer. With batch dequeuing, the ring is not full in 99% of the cases, which probably means the producer is not blocked for most of the time.
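
(For context, batch dequeuing here means pulling several entries per
lock acquisition, roughly as below; this is a sketch of the idea, not
the exact vhost patch.)

/* Sketch: dequeue up to n entries under a single lock acquisition.
 * The consumer drains the ring in bursts, so the ring rarely stays
 * full and the producer is rarely blocked.
 */
static inline int ptr_ring_consume_batched(struct ptr_ring *r,
					   void **array, int n)
{
	void *ptr;
	int i;

	spin_lock(&r->consumer_lock);
	for (i = 0; i < n; i++) {
		ptr = __ptr_ring_peek(r);
		if (!ptr)
			break;
		__ptr_ring_discard_one(r);
		array[i] = ptr;
	}
	spin_unlock(&r->consumer_lock);

	return i;	/* number of entries dequeued */
}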

Is it just the lock overhead?

I removed the spinlocks for peeking and dequeuing on top of this patch. TX pps increased from ~2.00Mpps to ~2.08Mpps, so it was not only the lock overhead.

Can you pls try to replace
the lock with a simple non-fair atomic and see what happens?


Not sure I get the idea; we already hit the spinlock fast path for sure, which is just a cmpxchg().
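
(For reference, a "simple non-fair atomic" lock might be sketched as
below; this is illustrative only, and its uncontended acquire is the
same single cmpxchg as the spinlock fast path.)

/* Illustrative non-fair lock: a bare test-and-set spin with no
 * ticketing or queueing. Uncontended acquire is one cmpxchg.
 */
static atomic_t ring_lock = ATOMIC_INIT(0);

static inline void ring_lock_acquire(void)
{
	while (atomic_cmpxchg(&ring_lock, 0, 1) != 0)
		cpu_relax();	/* spin; no fairness guarantee */
}

static inline void ring_lock_release(void)
{
	atomic_set_release(&ring_lock, 0);	/* release the lock */
}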

Thanks