Re: [PATCH 3/3] xen/block: add multi-page ring support
From: Marcus Granado
Date: Tue Jun 23 2015 - 08:51:24 EST
On 22/06/15 02:20, Bob Liu wrote:
On 06/09/2015 10:07 PM, Roger Pau Monné wrote:
On 09/06/15 at 15:39, Konrad Rzeszutek Wilk wrote:
...
Roger, I put them (patches) on devel/for-jens-4.2 on
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen.git
I think these two patches:
drivers: xen-blkback: delay pending_req allocation to connect_ring
xen/block: add multi-page ring support
are the only ones that haven't been Acked by you (or maybe they
have and I missed the Ack?)
Hello,
I was waiting to Ack those because the XenServer storage performance
folks found out that these patches cause a performance regression on
some of their tests. I'm adding them to the conversation so they can
provide more details about the issues they found, and whether we should
hold off on pushing these patches or not.
Hey,
Are there any updates? What's the performance regression problem?
Hi,
We spent the last 2 weeks finishing measurements on the multipage
ring v5 patches under a range of conditions.
The measurements were obtained under the following conditions:
- using blkback as the dom0 backend with a back-ported multipage ring v5
applied to our dom0 kernel 3.10.
- using a recent Ubuntu 15.04 kernel 3.19 with v5 frontend applied to be
used as guest
- using a Micron RealSSD P320h as the underlying local storage on a Dell
PowerEdge R720 with 2 Xeon E5-2643 v2 cpus.
- fio 2.2.7-22-g36870 as the generator of synthetic loads in the guest.
We used direct_io to skip caching in the guest and ran fio for 60s for a
number of block sizes ranging from 512 bytes to 4MiB. We also tried pure
random and pure sequential reads. Random reads were used to counteract
read-ahead prefetching at the underlying storage.
We noticed that using large (>16) queue depths in fio would saturate
individual vcpus in the guest, so to better utilise the cpu resources in
the guest, we chose to (a) fix the queue depth to 4 for each fio thread,
(b) increase the guest vcpus to 16 and (c) vary the number of fio
threads from 1 to 64.
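As a rough sketch of the sweep described above, the following Python builds
one fio command line per thread count. The device path /dev/xvdb and the
libaio ioengine are assumptions made for illustration; this message does not
state either. The thread counts are the ones reported for the 4KiB runs.

```python
import shlex

# Hypothetical values, not stated in the original message.
DEVICE = "/dev/xvdb"   # assumed guest block device under test
IOENGINE = "libaio"    # assumed asynchronous ioengine

IO_DEPTH = 4                      # fixed queue depth per fio thread
THREADS = [1, 4, 8, 16, 32, 64]   # fio thread counts from the 4KiB runs


def fio_command(threads, block_size="4k", pattern="randread"):
    """Return one fio command line for a 60s direct-I/O run."""
    args = [
        "fio",
        "--name=ring-test",
        f"--filename={DEVICE}",
        f"--ioengine={IOENGINE}",
        "--direct=1",              # bypass the guest page cache
        f"--rw={pattern}",
        f"--bs={block_size}",
        f"--iodepth={IO_DEPTH}",
        f"--numjobs={threads}",
        "--runtime=60",
        "--time_based",
        "--group_reporting",
    ]
    return " ".join(shlex.quote(a) for a in args)


def sweep():
    """Return the command lines for every thread count in THREADS."""
    return [fio_command(n) for n in THREADS]


if __name__ == "__main__":
    for cmd in sweep():
        print(cmd)
```

Passing a different block_size or pattern (for example "read" for
sequential reads) would cover the other cases mentioned here.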
We were interested in observing storage iops and throughput for
different values of in-flight requests (= io depth * fio threads)
generated by the guest. Our expectation was that iops and throughput
with single-page and multi-page rings would be the same up to 32
in-flight requests (the number of requests that fit in a single-page
ring), and then the single-page ring case would flat-line with >32
in-flight requests, whereas the multi-page ring case would continue to
show improvements until hitting some other bottleneck. The effect should
be more visible when using requests with smaller block sizes because the
measurements are less likely to be affected by memory copy delays or
large data transfer delays to storage.
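The expectation above can be written down as a small Python sketch. It
assumes that ring capacity scales linearly at 32 requests per ring page,
which is the figure used in this message; the function names are
hypothetical and only illustrate the arithmetic.

```python
REQUESTS_PER_RING_PAGE = 32  # figure quoted here for a single-page ring


def in_flight(fio_threads, io_depth):
    """In-flight requests generated by the guest (= io depth * fio threads)."""
    return fio_threads * io_depth


def ring_capacity(ring_pages):
    """Assumed capacity: 32 requests for every ring page."""
    return ring_pages * REQUESTS_PER_RING_PAGE


def exceeds_ring(fio_threads, io_depth, ring_pages):
    """True when the guest generates more requests than the ring can hold."""
    return in_flight(fio_threads, io_depth) > ring_capacity(ring_pages)


if __name__ == "__main__":
    for threads in (1, 4, 8, 16, 32, 64):
        print(threads, in_flight(threads, 4),
              exceeds_ring(threads, 4, 1), exceeds_ring(threads, 4, 8))
```

Under that assumption, a single-page ring is exceeded from 16 threads x 4
io depth onwards, while an 8-page ring is never exceeded in this sweep.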
These are the results we got for the conditions above with 4KiB blocks
and random reads:
fio_threads  io_depth  in_flight  1-page_IOPS  8-page_IOPS
          1         4          4          19K          19K
          4         4         16          89K          89K
          8         4         32         149K         149K
         16         4         64         131K         198K
         32         4        128         127K         208K
         64         4        256         132K         209K
We believe that this data shows that there's a clear improvement when
using multi-page rings when there are more than 32 in-flight requests.
We observed similar improvements when writing, and across all small
block sizes. For block sizes >=16KiB, the results were similar between
single- and multi-page rings, and we attribute that to bottlenecks when
transferring large amounts of data that are not present with smaller
block sizes.
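To put a number on the improvement, here is a short Python sketch that
computes the relative IOPS gain of 8-page over 1-page rings from the 4KiB
random-read figures above. The values are copied from the table (in
thousands of IOPS); the helper names are hypothetical.

```python
# (fio_threads, io_depth, 1-page kIOPS, 8-page kIOPS), from the 4KiB table.
RESULTS = [
    (1, 4, 19, 19),
    (4, 4, 89, 89),
    (8, 4, 149, 149),
    (16, 4, 131, 198),
    (32, 4, 127, 208),
    (64, 4, 132, 209),
]


def gain_percent(one_page, eight_page):
    """Relative gain of the 8-page ring, rounded to a whole percent."""
    return round(100.0 * (eight_page - one_page) / one_page)


def summarize(results=RESULTS):
    """Return (in_flight, gain in percent) for each row."""
    return [(threads * depth, gain_percent(one, eight))
            for threads, depth, one, eight in results]


if __name__ == "__main__":
    for in_flight, gain in summarize():
        print(f"in_flight={in_flight:4d}  gain={gain:3d}%")
```

With these figures the gain is zero up to 32 in-flight requests and
roughly 51% to 64% beyond that.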
Another reason for using random reads in the synthetic fio tests above
is that we noticed that when sequential reads are used there were some
anomalies that we believe would affect a fair comparison:
(A)- in some situations with sequential read, we observed a decreasing
number of merges in the guest (according to 'iostat -x -m 1') with small
block sizes <=4KiB when increasing the number of ring pages. There were
no merges whenever in_flight < ring_pages * 32. With larger in_flight
requests (>=128) -- visible with both 8 fio_threads x 32 io_depth and 32
fio_threads x 8 io_depth -- storage throughput with 1 page was around
25% better than with 8 pages. This is the regression that Roger was
talking about previously in this discussion. It seems related to the
merges of requests occurring much more frequently with 1 page than with
8 pages. During the measurements, the average request queue size
reported by iostat was always similar to the number of requests in the ring.
I would appreciate potential explanations of why the guest kernel would
behave like that. We believe that this regression is a corner-case that
would be difficult to spot in a real-world load, where random reads are
interspersed with sequential reads of many different block sizes and io
depths, and we only spotted it because our synthetic fio load used a
wide range of parameters with sequential reads. It may also be
specific to the way that Linux handles this situation.
(B)- in other situations with sequential read (block sizes between 8KiB
and 128KiB), we observed the storage throughput with 1 page was around
50% worse than with 8 pages. Again, this seems related to the existence
of merges with 1 page but not with 8 pages, and I would appreciate
potential explanations.
For sequential reads, arguably the performance difference spotted in (A)
is counterbalanced by the performance difference in (B), and they
cancel each other out if all block sizes are considered together. For
random reads, 8-page rings were similar or superior to 1-page rings in
all tested conditions.
All considered, we believe that the multi-page ring patches improve the
storage performance (apart from case (A)) and therefore should be good
to merge.
Marcus