Re: [PATCH] io_uring: reduce latency by reissueing the operation

From: Olivier Langlois
Date: Thu Jun 10 2021 - 23:55:42 EST


On Thu, 2021-06-10 at 20:32 +0100, Pavel Begunkov wrote:
> On 6/10/21 6:56 PM, Olivier Langlois wrote:
> >
> >
> > Can you think of other numbers that would be useful to know to
> > evaluate
> > the patch performance?
>
> If throughput + latency (avg + several nines) are better (or any
> other measurable improvement), it's a good enough argument to me,
> but not sure what test case you're looking at. Single threaded?
> Does it saturate your CPU?
>
I don't know what are the ideal answers to your 2 last questions ;-)

I have several possible configurations to my application.

The most complex one is a 2 threads setup each having their own
io_uring instance where one of the threads is managing 50-85 TCP
connections over which JSON stream encapsulated in the WebSocket
protocol are received.

That more complex setup is also using IORING_SETUP_ATTACH_WQ to share
the sqpoll thread between the 2 instances.

In that more complex config, the sqpoll thread is running at 85-95% of
its dedicated CPU.

For the patch performance testing I did use the simplest config:
Single thread, 1 TCP connection, no sqpoll.

To process an incoming 5Mbps stream, it takes about 5% of the CPU.

Here is the testing methodology:
add 2 fields in struct io_rw:
u64 startTs;
u8 readType;

startTs is set with ktime_get_raw_fast_ns() in io_read_prep()

readType is set to:
0: data ready
1: use fast poll (we ignore those)
2: reissue
3: async

readType is used to know what type of measurement it is at the
recording point.

end time is measured at 3 recording point:
1. In __io_queue_sqe() when io_issue_sqe() returns 0
2. In __io_queue_sqe() after io_queue_async_work() call
3. In io_wq_submit_work() after the while loop.

So I took 4 measurements.

1. The time it takes to process a read request when data is already
available
2. The time it takes to process by calling twice io_issue_sqe() after
vfs_poll() indicated that data was available
3. The time it takes to execute io_queue_async_work()
4. The time it takes to complete a read request asynchronously

Before presenting the results, I want to mention that 2.25% of the
total number of my read requests ends up in the situation where the
read() syscall did return EAGAIN but data became available by the time
vfs_poll gets called.

My expectations were that reissuing a sqe could be on par or a bit more
expensive than placing it on io-wq for async processing and that would
put the patch in some gray zone with pros and cons in terms of
performance.

The reality is instead super nice (numbers in nSec):

ready data (baseline)
avg 3657.94182918628
min 580
max 20098
stddev 1213.15975908162

reissue completion
average 7882.67567567568
min 2316
max 28811
stddev 1982.79172973284

insert io-wq time
average 8983.82276995305
min 3324
max 87816
stddev 2551.60056552038

async time completion
average 24670.4758861127
min 10758
max 102612
stddev 3483.92416873804

Conclusion:
On average reissuing the sqe with the patch code is 1.1uSec faster and
in the worse case scenario 59uSec faster than placing the request on
io-wq

On average completion time by reissuing the sqe with the patch code is
16.79uSec faster and in the worse case scenario 73.8uSec faster than
async completion.

One important detail to mention about the async completion time, in the
testing the ONLY way that a request can end up being completed async is
if vfs_poll() reports that the file is ready. Otherwise, the request
ends up being processed with io_uring fast poll feature.

So there does not seem to have any downside to the patch. TBH, at the
initial patch submission, I only did use my intuition to evaluate the
merit of my patch but, thx to your healthy skepticism, Pavel, this did
force me to actually measure the patch and it appears to incontestably
improve performance for both the reissued SQE and also all the other
sqes found in a batch submission.

Hopefully, the results will please you as much as they have for me!

Greetings,
Olivier