Re: AW: Slow I/O on USB media after commit f664a3cc17b7d0a2bc3b3ab96181e1029b0ec0e6

From: Ming Lei
Date: Thu Dec 26 2019 - 03:37:38 EST


On Wed, Dec 25, 2019 at 10:30:57PM -0500, Theodore Y. Ts'o wrote:
> On Thu, Dec 26, 2019 at 10:27:02AM +0800, Ming Lei wrote:
> > Maybe we need to be careful for HDD., since the request count in scheduler
> > queue is double of in-flight request count, and in theory NCQ should only
> > cover all in-flight 32 requests. I will find a sata HDD., and see if
> > performance drop can be observed in the similar 'cp' test.
>
> Please try to measure it, but I'd be really surprised if it's
> significant with with modern HDD's.

Just find one machine with AHCI SATA, and run the following xfs
overwrite test:

#!/bin/bash
DIR=$1
echo 3 > /proc/sys/vm/drop_caches
fio --readwrite=write --filesize=5g --overwrite=1 --filename=$DIR/fiofile \
--runtime=60s --time_based --ioengine=psync --direct=0 --bs=4k
--iodepth=128 --numjobs=2 --group_reporting=1 --name=overwrite

FS is xfs, and disk is LVM over AHCI SATA with NCQ(depth 32), because the
machine is picked up from RH beaker, and it is the only disk in the box.

#lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 931.5G 0 disk
ââsda1 8:1 0 1G 0 part /boot
ââsda2 8:2 0 930.5G 0 part
âârhel_hpe--ml10gen9--01-root 253:0 0 50G 0 lvm /
âârhel_hpe--ml10gen9--01-swap 253:1 0 3.9G 0 lvm [SWAP]
âârhel_hpe--ml10gen9--01-home 253:2 0 876.6G 0 lvm /home


kernel: 3a7ea2c483a53fc("scsi: provide mq_ops->busy() hook") which is
the previous commit of f664a3cc17b7 ("scsi: kill off the legacy IO path").

|scsi_mod.use_blk_mq=N |scsi_mod.use_blk_mq=Y |
-----------------------------------------------------------
throughput: |244MB/s |169MB/s |
-----------------------------------------------------------

Similar result can be observed on v5.4 kernel(184MB/s) with same test
steps.


> That because they typically have
> a queue depth of 16, and a max_sectors_kb of 32767 (e.g., just under
> 32 MiB). Sort seeks are typically 1-2 ms, with full stroke seeks
> 8-10ms. Typical sequential write speeds on a 7200 RPM drive is
> 125-150 MiB/s. So suppose every other request sent to the HDD is from
> the other request stream. The disk will chose the 8 requests from its
> queue that are contiguous, and so it will be writing around 256 MiB,
> which will take 2-3 seconds. If it then needs to spend between 1 and
> 10 ms seeking to another location of the disk, before it writes the
> next 256 MiB, the worst case overhead of that seek is 10ms / 2s, or
> 0.5%. That may very well be within your measurements' error bars.

Looks you assume that disk seeking just happens once when writing around
256MB. This assumption may not be true, given all data can be in page
cache before writing. So when two tasks are submitting IOs concurrently,
IOs from each single task is sequential, and NCQ may order the current batch
submitted from the two streams. However disk seeking may still be needed
for the next batch handled by NCQ.

> And of course, note that in real life, we are very *often* writing to
> multiple files in parallel, for example, during a "make -j16" while
> building the kernel. Writing a single large file is certainly
> something people do (but even there people who are burning a 4G DVD
> rip are often browsing the web while they are waiting for it to
> complete, and the browser will be writing cache files, etc.). So
> whether or not this is something where we should be stressing over
> this specific workload is going to be quite debateable.

Thanks,
Ming