On Fri, Feb 13, 2009 at 11:08:25PM +0300, Vladislav Bolkhovitin wrote:
> Wu Fengguang, on 02/13/2009 04:57 AM wrote:
> > On Thu, Feb 12, 2009 at 09:35:18PM +0300, Vladislav Bolkhovitin wrote:
>
> Sorry for such a huge delay. There were many other activities I had to
> do before + I had to be sure I didn't miss anything.
>
> > Vladislav, thank you for the benchmarks! I'm very interested in
> > optimizing your workload and figuring out what happens underneath.
> >
> > Are the client and server two standalone boxes connected by GBE?
>
> Yes, they are directly connected using GbE.
>
> We didn't use NFS; we used SCST (http://scst.sourceforge.net) with the
> iSCSI-SCST target driver. Its architecture is similar to NFS: N threads
> (N = 5 in this case) handle IO coming over the wire from remote
> initiators (clients) using the iSCSI protocol. In addition, SCST has a
> patch called export_alloc_io_context (see
> http://lkml.org/lkml/2008/12/10/282), which allows the IO threads to
> queue IO using a single IO context, so we can see whether context RA
> can replace grouping the IO threads into a single IO context.
>
> Unfortunately, the results are negative. We found neither any advantage
> of context RA over the current RA implementation, nor any possibility
> for context RA to replace grouping the IO threads into a single IO
> context.
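
A side note on the single-IO-context trick: it mainly matters when the
member disks run the CFQ scheduler, which keeps a request queue per IO
context. The active scheduler on the target can be confirmed like this
(a sketch only; the device names are illustrative):

        cat /sys/block/sda/queue/scheduler   # active scheduler shown in brackets, e.g. [cfq]
        cat /sys/block/sdb/queue/scheduler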

> Setup on the target (server) was the following: 2 SATA drives grouped
> into an md RAID-0 array with an average local read throughput of
> ~120MB/s ("dd if=/dev/md0 of=/dev/null bs=1M count=20000" outputs
> "20971520000 bytes (21 GB) copied, 177.742 s, 118 MB/s"). The md device
> was split into 3 partitions: the first partition was 10% of the space
> at the beginning of the device, the last partition was 10% of the space
> at the end of the device, and the middle one was the rest of the space
> between them. The first and the last partitions were then exported to
> the initiator (client), where they showed up as /dev/sdb and /dev/sdc
> correspondingly.
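
In other words, the layout corresponds roughly to the following (a
sketch only; the device names, tools and exact commands here are
assumptions, not taken from the report above):

        mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda /dev/sdb
        parted /dev/md0 mklabel gpt
        parted /dev/md0 mkpart primary 0% 10%      # exported, /dev/sdb on the client
        parted /dev/md0 mkpart primary 10% 90%     # unused middle
        parted /dev/md0 mkpart primary 90% 100%    # exported, /dev/sdc on the client
        dd if=/dev/md0 of=/dev/null bs=1M count=20000   # local sequential read baseline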

> > When you set readahead sizes in the benchmarks, you are setting them
> > in the server side? I.e. "linux-4dtq" is the SCST server?
>
> Yes, it's the server. On the client all the parameters were left at
> their defaults.
>
> > What's the client side readahead size?
>
> The default, i.e. 128K.
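
For completeness, the readahead size of a block device can be inspected
and changed either with blockdev or through sysfs (the device name below
is only an example):

        blockdev --getra /dev/sdb                       # in 512-byte sectors; 256 = 128K
        echo 512 > /sys/block/sdb/queue/read_ahead_kb   # bump to 512K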

> > It would help a lot to debug readahead if you can provide the
> > server side readahead stats and trace log for the worst case.
> > This will automatically answer the above questions as well as
> > disclose the micro-behavior of readahead:
> >
> >         mount -t debugfs none /sys/kernel/debug
> >
> >         echo > /sys/kernel/debug/readahead/stats  # reset counters
> >         # do benchmark
> >         cat /sys/kernel/debug/readahead/stats
> >
> >         echo 1 > /sys/kernel/debug/readahead/trace_enable
> >         # do micro-benchmark, i.e. run the same benchmark for a short time
> >         echo 0 > /sys/kernel/debug/readahead/trace_enable
> >         dmesg
> >
> > The above readahead trace should help find out how the client side
> > sequential reads convert into server side random reads, and how we
> > can prevent that.
>
> We will do it as soon as we have a free window on that system.
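
While waiting for the readahead stats, the server side request pattern
can also be eyeballed with blktrace or iostat (just a sketch, assuming
the tools are installed; the member-disk name is illustrative):

        blktrace -d /dev/sda -o - | blkparse -i -   # per-request trace on a member disk
        iostat -x 1                                 # coarser view: r/s and avgrq-sz per disk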

Thank you. For NFS, the client side read/readahead requests will be
split into units of rsize, which will be served by a pool of nfsd
threads concurrently and possibly out of order. Does SCST work the same
way? If so, what's the rsize value for your SCST benchmarks?
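
For reference, on an NFS client the rsize in effect can be pinned at
mount time and read back from /proc/mounts (an illustrative example; the
export and mount point below are made up):

        mount -t nfs -o rsize=32768 server:/export /mnt
        grep /mnt /proc/mounts    # the nfs entry lists the negotiated rsize/wsize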