Re: Slow file transfer speeds with CFQ IO scheduler in some cases
From: Vladislav Bolkhovitin
Date: Tue Apr 21 2009 - 14:19:57 EST
Wu Fengguang, on 03/23/2009 04:42 AM wrote:
Here are the conclusions from tests:
1. Making all IO threads work in the same IO context with CFQ (vanilla
RA and default RA size) brings nearly 100% link utilization on single
stream reads (100MB/s), while deadline gives about 50% (50MB/s). I.e.,
there is a 100% improvement of CFQ over deadline. With 2 read streams
CFQ has an even bigger advantage: >400% (23MB/s vs 5MB/s).
The ideal 2-stream throughput should be >60MB/s, so I guess there is
still room for improvement over CFQ's 23MB/s?
Yes, plenty. But, I think, not in CFQ but in readahead. With 4096K RA
we were able to get ~40MB/s; see the previous e-mail and below.
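(For reference, the 4096K RA used in those tests can be set on the server's
backstorage device either via sysfs or via blockdev; the device name below is
only an example:
# assuming the exported backstorage device on the server is /dev/sdb
echo 4096 > /sys/block/sdb/queue/read_ahead_kb
# or, equivalently, in 512-byte sectors:
blockdev --setra 8192 /dev/sdb
)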
The one fact I cannot understand is that SCST seems to be breaking up the
client-side 64K reads into server-side 4K reads (above the readahead layer).
But I remember you told me that SCST doesn't do NFS rsize style split-ups.
Is this a bug? The 4K read size is too small to be CPU/network friendly...
Where are the split-up and re-assembly done? On the client side or
internally in the server?
This is on the client's side. See the target's log in the attachment.
Here is a summary of the command data sizes that came to the server for "dd
if=/dev/sdb of=/dev/null bs=64K count=200" run on the client:
4K 11
8K 0
16K 0
32K 0
64K 0
128K 81
256K 8
512K 0
1024K 0
2048K 0
4096K 0
There are way too many 4K requests. Apparently, the request submission
path isn't optimal.
Actually, this is another question I wanted to raise from the very
beginning.
6. Unexpected result. In the case when all IO threads work in the same IO
context with CFQ, increasing the RA size *decreases* throughput. I think this
is because RA requests are performed as single big READ requests, while the
requests coming from remote clients are much smaller in size (up to
256K), so while the data read by RA are being transferred to the remote client
at 100MB/s, the backstorage media rotates a bit further, and the next
read request has to wait out the rotational latency (~0.1ms on 7200RPM). This
conforms well with (3) above, where context RA has a 40% advantage over
vanilla RA at the default RA size, but a much smaller one at larger RA sizes.
Maybe. But the readahead IOs (as shown by the trace) are _async_ ones...
That doesn't matter, because a new request from the client won't come
until all the data for the previous one have been transferred to it. And that
transfer happens at a very *finite* speed.
Bottom line IMHO conclusions:
1. Context RA should be considered, after additional examination, to
replace the current RA algorithm in the kernel
That's my plan to push context RA to mainline. And thank you very much
for providing and testing out a real world application for it!
You're welcome!
2. It would be better to increase default RA size to 1024K
Increasing the default RA size is a long-standing wish. However, I have a
vague feeling that it would be better to first make the lower layers
smarter about max_sectors_kb-granularity request splitting and batching.
Can you elaborate more on that, please?
*AND* one of the following:
3.1. All RA requests should be split into smaller requests of up to
256K each, which should not be merged with any other requests
Are you referring to max_sectors_kb?
Yes
What's your max_sectors_kb and nr_requests? Something like
grep -r . /sys/block/sda/queue/
Defaults: 512 and 128, respectively.
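(For anyone wanting to check these on their own setup, the values above can be
read directly from sysfs; the device name is only an example:
cat /sys/block/sdb/queue/max_sectors_kb   # 512 by default here
cat /sys/block/sdb/queue/nr_requests      # 128 by default here
cat /sys/block/sdb/queue/read_ahead_kb    # default readahead, in KB
)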
OR
3.2. New RA requests should be sent before the previous one has completed,
so that the storage device doesn't rotate so far that a full rotation is
needed to serve the next request.
Linus has an mmap readahead cleanup patch that can do this. It
basically replaces a {find_lock_page(); readahead();} sequence with
{find_get_page(); readahead(); lock_page();}.
I'll try to push that patch into mainline.
Good!
I like suggestion 3.1 a lot more, since it should be simple to implement
and has the following two positive side effects:
1. It would minimize the negative effect of a larger RA size on I/O
latency, by allowing CFQ to switch to requests that have been waiting too
long, when necessary.
2. It would allow better request pipelining, which is very important for
minimizing uplink latency for synchronous requests (i.e. with only one IO
request at a time, the next request being issued only when the previous one
has completed).
You can see at http://www.3ware.com/kb/article.aspx?id=11050 that 3ware
recommends, for maximum performance, setting max_sectors_kb as low as *64K*
with a 16MB RA. This maximizes command pipelining, and the suggestion really
works, improving throughput by 50-100%!
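(For illustration only, that recommendation would translate to something like
the following; the device name is assumed, the values are from the 3ware
article:
echo 64 > /sys/block/sdb/queue/max_sectors_kb   # cap individual requests at 64K
blockdev --setra 32768 /dev/sdb                 # 32768 * 512B = 16MB readahead
)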
Seems I should elaborate more on this. The case when the client is remote is
fundamentally different from the case when the client is local, for which
Linux is currently optimized. When the client is local, data are delivered to
it from the page cache at a virtually infinite speed. But when the client is
remote, data are delivered to it from the server's cache at a *finite* speed.
In our case this speed is about the same as the speed of reading data into the
cache from the storage. This has the following consequences:
1. Data for any READ request are first transferred from the storage to
the cache, then from the cache to the client. If those transfers are
done purely sequentially, without overlapping, i.e. without any
readahead, the resulting throughput T can be found from the equation: 1/T =
1/Tlocal + 1/Tremote, where Tlocal and Tremote are the throughputs of the
local (i.e. from the storage) and remote links. In the case when Tlocal ~=
Tremote, T ~= Tremote/2. Quite an unexpected result, right? ;)
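(A quick worked example with assumed numbers: with Tlocal = Tremote = 100MB/s,
the purely sequential case gives
$ echo '1/(1/100 + 1/100)' | bc -l
50.00000000000000000000
i.e. 50MB/s, which is right in line with the ~51MB/s single-stream baseline
below.)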
2. If the data transfers on the local and remote links aren't coordinated,
it is possible that only one link is transferring data at any given time. From
(1) above you can calculate that the percentage of this "idle" time equals the
percentage of lost throughput. I.e., to get the maximum throughput, both links
should transfer data as simultaneously as possible. In our case, when Tlocal ~=
Tremote, both links should be busy all the time. Moreover, it is
possible that the local transfer has finished, but during the remote
transfer the storage media rotated too far, so the next request
will have to wait for a full rotation to complete (i.e. several ms of
lost bandwidth).
Thus, to get the maximum possible throughput, we need to maximize the
simultaneous load on both the local and remote links. This can be done using
the well-known pipelining technique: the client should still read the
same amount of data at once, but those reads should be split into smaller
chunks, e.g. 64K at a time. This approach may look like it goes against the
"conventional wisdom" that a bigger request means bigger
throughput, but in fact it doesn't, because the same (big) amount of
data is still read at a time; a larger number of smaller requests simply puts
a more simultaneous load on both links participating in the data transfer. In
fact, even if the client is local, in most cases there is a second data
transfer link: it is inside the storage. This is especially true for RAID
controllers. Guess why 3ware recommends setting max_sectors_kb to 64K
and increasing RA in the link above? ;)
Of course, max_sectors_kb should be decreased only for smart devices,
which allow more than one outstanding request at a time, i.e. for all modern
SCSI/SAS/SATA/iSCSI/FC/etc. drives.
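(As a side note, for SCSI-class devices the supported queue depth can usually
be checked via sysfs; the path assumes a disk named sdb:
cat /sys/block/sdb/device/queue_depth
)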
There is an objection against having too many outstanding requests at a
time: latency. But, since the overall size of all requests remains
unchanged, this objection isn't relevant to this proposal. There is the
same latency-related objection against increasing RA, but with many small
individual RA requests it isn't relevant either.
We did some measurements to support this proposal. They were done
only with the deadline scheduler to make the picture clearer, and with
context RA. The tests were the same as before.
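(A sketch of how the per-device scheduler is selected for such a run, assuming
the backstorage device is /dev/sdb:
echo deadline > /sys/block/sdb/queue/scheduler
cat /sys/block/sdb/queue/scheduler    # the active scheduler is shown in brackets
)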
--- Baseline, all default:
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 51.1 MB/s
b) 51.4 MB/s
c) 51.1 MB/s
Run at the same time:
# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 4.7 MB/s
b) 4.6 MB/s
c) 4.8 MB/s
--- Client - all default, on the server max_sectors_kb set to 64K:
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
- 100 MB/s
- 100 MB/s
- 102 MB/s
# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
- 5.2 MB/s
- 5.3 MB/s
- 4.2 MB/s
That's a 100% and 8% improvement, respectively, compared to the baseline.
From the previous e-mail you can see that with 4096K RA
# while true; do dd if=/dev/sdc of=/dev/null bs=64K; done
# dd if=/dev/sdb of=/dev/null bs=64K count=80000
a) 39.9 MB/s
b) 39.5 MB/s
c) 38.4 MB/s
I.e., there is a 760% improvement over the baseline.
Thus, I believe that for all devices supporting queue depths >1,
max_sectors_kb should be set by default to 64K (or maybe to 128K, but
not more), and the default RA increased to at least 1M, better 2-4M.
(Can I wish a CONFIG_PRINTK_TIME=y next time? :-)
Sure
Thanks,
Vlad
Attachment:
req_split.log.bz2
Description: application/bzip