Re: [PATCH RFC v8 02/11] vhost: use batched get_vq_desc version

From: Jason Wang
Date: Mon Jul 20 2020 - 22:55:30 EST



On 2020/7/20 7:16 PM, Eugenio Pérez wrote:
On Mon, Jul 20, 2020 at 11:27 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
On Thu, Jul 16, 2020 at 07:16:27PM +0200, Eugenio Perez Martin wrote:
On Fri, Jul 10, 2020 at 7:58 AM Michael S. Tsirkin <mst@xxxxxxxxxx> wrote:
On Fri, Jul 10, 2020 at 07:39:26AM +0200, Eugenio Perez Martin wrote:
How about playing with the batch size? Make it a mod parameter instead
of the hard-coded 64, and measure for all values 1 to 64 ...
Right, according to the test result, 64 seems to be too aggressive in
the case of TX.

Got it, thanks both!
In particular I wonder whether with batch size 1
we get same performance as without batching
(would indicate 64 is too aggressive)
or not (would indicate one of the code changes
affects performance in an unexpected way).

--
MST

Hi!

Varying batch_size as drivers/vhost/net.c:VHOST_NET_BATCH,
sorry this is not what I meant.

I mean something like this:


diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 0b509be8d7b1..b94680e5721d 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1279,6 +1279,10 @@ static void handle_rx_net(struct vhost_work *work)
handle_rx(net);
}

+static int batch_num = 0;
+module_param(batch_num, int, 0644);
+MODULE_PARM_DESC(batch_num, "Number of batched descriptors (offset from 64)");
+
static int vhost_net_open(struct inode *inode, struct file *f)
{
struct vhost_net *n;
@@ -1333,7 +1337,7 @@ static int vhost_net_open(struct inode *inode, struct file *f)
vhost_net_buf_init(&n->vqs[i].rxq);
}
vhost_dev_init(dev, vqs, VHOST_NET_VQ_MAX,
- UIO_MAXIOV + VHOST_NET_BATCH,
+ UIO_MAXIOV + VHOST_NET_BATCH + batch_num,
VHOST_NET_PKT_WEIGHT, VHOST_NET_WEIGHT, true,
NULL);


Then you can try tweaking batching by playing with the module parameter, without recompiling.
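
For reference, here is an equivalent self-contained sketch of that module parameter as it might look inside drivers/vhost/net.c (my own illustration, not part of the posted diff; vhost_net_iov_limit() is a hypothetical helper), with the variable defined before module_param() uses it and the offset clamped so the effective batch can never drop below 1:

/* Illustrative sketch only, not from the posted patch: batch_num offsets
 * VHOST_NET_BATCH, so negative values shrink the batch. Assumes the
 * drivers/vhost/net.c context (VHOST_NET_BATCH, UIO_MAXIOV, linux/module.h).
 */
static int batch_num;
module_param(batch_num, int, 0644);
MODULE_PARM_DESC(batch_num, "Offset added to VHOST_NET_BATCH (may be negative)");

/* Hypothetical helper computing the iov limit passed to vhost_dev_init(). */
static int vhost_net_iov_limit(void)
{
	/* Clamp so extreme parameter values cannot yield a batch < 1. */
	int batch = clamp_t(int, VHOST_NET_BATCH + batch_num, 1, 1024);

	return UIO_MAXIOV + batch;
}

Note that since vhost_dev_init() runs from vhost_net_open(), changing the parameter at runtime (it is 0644) only affects devices opened afterwards.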


VHOST_NET_BATCH affects lots of other things.

Ok, got it. Since they were aligned from the start, I thought it was a good idea to keep them in sync.

and testing
the pps as the previous mail says. This means that we have either only
vhost_net batching (in the base testing, as before applying this
patch) or both batching sizes the same.

I've checked that the vhost process (and pktgen) also run at 100% CPU.

For tx: Batching always decreases performance, in all cases. Not
sure why bufapi made things better last time.

Batching gives improvements up to 64 bufs; I see pps increments, but only around 1%.

For rx: Batching always improves performance. It seems that with small
batches bufapi decreases performance, but beyond 64 bufapi is
much better. The bufapi version keeps improving until I set a batch size
of 1024. So I guess it is super good to have a bunch of buffers ready
to receive into.

Since with this test I cannot disable event_idx or things like that,
what would be the next step for testing?

Thanks!

--
Results:
# Buf size: 1,16,32,64,128,256,512 (the Batch + Bufapi row also includes 1024)

# Tx
# ===
# Base
2293304.308,3396057.769,3540860.615,3636056.077,3332950.846,3694276.154,3689820
# Batch
2286723.857,3307191.643,3400346.571,3452527.786,3460766.857,3431042.5,3440722.286
# Batch + Bufapi
2257970.769,3151268.385,3260150.538,3379383.846,3424028.846,3433384.308,3385635.231,3406554.538

# Rx
# ==
# pktgen results (pps)
1223275,1668868,1728794,1769261,1808574,1837252,1846436
1456924,1797901,1831234,1868746,1877508,1931598,1936402
1368923,1719716,1794373,1865170,1884803,1916021,1975160

# Testpmd pps results
1222698.143,1670604,1731040.6,1769218,1811206,1839308.75,1848478.75
1450140.5,1799985.75,1834089.75,1871290,1880005.5,1934147.25,1939034
1370621,1721858,1796287.75,1866618.5,1885466.5,1918670.75,1976173.5,1988760.75,1978316

pktgen was run again for rx with 1024 and 2048 buf sizes, giving
1988760.75 and 1978316 pps. Testpmd behaves the same way.
I don't really understand what this data means.
Which number of descs is batched for each run?

Sorry, I should have explained it better. I will expand here, but feel free to skip it since we are going to discard the
data anyway, or to propose a better way to present it.

It's a CSV with the values I've obtained, in pps, from pktgen and testpmd. This way it is easy to plot them.

Maybe it's easier as tables, if mail readers/Gmail do not misalign them.

# Tx
# ===
Base: With the previous code, without integrating any patch. testpmd is in txonly mode, and the tap interface XDP_DROPs everything.
We vary VHOST_NET_BATCH (1, 16, 32, ...). As Jason put it in a previous mail:

TX: testpmd(txonly) -> virtio-user -> vhost_net -> XDP_DROP on TAP
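
(For context, the "XDP_DROP on TAP" sink above can be as simple as the following minimal XDP program; this is my own illustrative sketch, not necessarily the exact program used in the test.)

/* Minimal XDP program that drops every frame, so the TAP device acts as a
 * pure sink for the TX benchmark. Illustrative only. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
	return XDP_DROP;
}

char _license[] SEC("license") = "GPL";

(It would be attached with something like "ip link set dev <tap> xdp obj xdp_drop.o sec xdp"; the object file name is hypothetical.)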


          1 |          16 |          32 |          64 |         128 |         256 |         512
------------+-------------+-------------+-------------+-------------+-------------+------------
2293304.308 | 3396057.769 | 3540860.615 | 3636056.077 | 3332950.846 | 3694276.154 |     3689820

If we add the batching part of the series, but not the bufapi:

          1 |          16 |          32 |          64 |         128 |         256 |         512
------------+-------------+-------------+-------------+-------------+-------------+------------
2286723.857 | 3307191.643 | 3400346.571 | 3452527.786 | 3460766.857 |   3431042.5 | 3440722.286

And if we add the bufapi part, i.e., all the series:

          1 |          16 |          32 |          64 |         128 |         256 |         512 |        1024
------------+-------------+-------------+-------------+-------------+-------------+-------------+------------
2257970.769 | 3151268.385 | 3260150.538 | 3379383.846 | 3424028.846 | 3433384.308 | 3385635.231 | 3406554.538

For easier treatment, all in the same table:

          1 |          16 |          32 |          64 |         128 |         256 |         512 |        1024
------------+-------------+-------------+-------------+-------------+-------------+-------------+------------
2293304.308 | 3396057.769 | 3540860.615 | 3636056.077 | 3332950.846 | 3694276.154 |     3689820 |
2286723.857 | 3307191.643 | 3400346.571 | 3452527.786 | 3460766.857 |   3431042.5 | 3440722.286 |
2257970.769 | 3151268.385 | 3260150.538 | 3379383.846 | 3424028.846 | 3433384.308 | 3385635.231 | 3406554.538
# Rx
# ==
The rx tests are done with pktgen injecting packets into the tap interface, and testpmd in rxonly forward mode. Again, each
column is a different value of VHOST_NET_BATCH, and each row is base, +batching, and +buf_api:

# pktgen results (pps)
(Didn't record extreme cases like >512 bufs batching)

      1|      16|      32|      64|     128|     256|     512
-------+--------+--------+--------+--------+--------+--------
1223275| 1668868| 1728794| 1769261| 1808574| 1837252| 1846436
1456924| 1797901| 1831234| 1868746| 1877508| 1931598| 1936402
1368923| 1719716| 1794373| 1865170| 1884803| 1916021| 1975160

# Testpmd pps results
          1 |          16 |          32 |          64 |         128 |         256 |         512 |        1024 |        2048
------------+-------------+-------------+-------------+-------------+-------------+-------------+-------------+------------
1222698.143 |     1670604 |   1731040.6 |     1769218 |     1811206 |  1839308.75 |  1848478.75 |
  1450140.5 |  1799985.75 |  1834089.75 |     1871290 |   1880005.5 |  1934147.25 |     1939034 |
    1370621 |     1721858 |  1796287.75 |   1866618.5 |   1885466.5 |  1918670.75 |   1976173.5 |  1988760.75 |     1978316

The last extreme cases (>512 bufs batched) were recorded just for the bufapi case.

Does that make sense now?

Thanks!


I wonder why we saw such a huge difference between TX and RX pps. Have you used samples/pktgen/XXX for doing the tests? Maybe you can paste the perf record results for the pktgen thread + vhost thread.

Thanks