Re: [patch] Revert "block: remove artifical max_hw_sectors cap"

From: Jeff Moyer
Date: Wed Jul 29 2015 - 12:52:48 EST


Christoph Hellwig <hch@xxxxxxxxxxxxx> writes:

> On Mon, Jul 20, 2015 at 03:17:07PM -0400, Jeff Moyer wrote:
>> For SAN storage, we've seen initial write and re-write performance drop
>> 25-50% across all I/O sizes. On locally attached storage, we've seen
>> regressions of 40% for all I/O types, but only for I/O sizes larger than
>> 1MB.
>
> Workload, and hardware please. And only mainline numbers, not some old
> hacked vendor kernel, please.

I've attached a simple fio config that reproduces the problem. It just
does sequential, O_DIRECT write I/O with I/O sizes of 1M, 2M and 4M. So
far I've tested it on an HP HSV400 and an IBM XIV SAN array connected
via a qlogic adapter, a nearline sata drive and a WD Red (NAS) sata disk
connected via an intel ich9r sata controller. The kernel I tested was
4.2.0-rc3, and the testing was done across 3 different hosts (just
because I don't have all the hardware connected to a single box). I did
10 runs using max_sectors_kb set to 1024, and 10 runs with it set to
32767. Results compare the averages of those 10 runs. In no cases did
I see a performance gain. In two cases, there is a performance hit.

In addition to my testing, our performance teams have seen performance
regressions running iozone on fibre channel-attached HP MSA1000 storage,
as well as on an SSD hidden behind a megaraid controller. I was not
able to get the exact details on the SSD. iozone configurations can be
provided, but I think I've nailed the underlying problem with this test
case.

But, don't take my word for it. Run the fio script on your own
hardware. All you have to do is echo a couple of values into
/sys/block/sdX/queue/max_sectors_kb to test, no kernel rebuilding
required.
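
For anyone who would rather script the toggle than echo values by hand,
a minimal Python sketch follows. The function names and the sysfs_root
parameter are hypothetical, added only so the helper can be exercised
against a scratch directory. It assumes root privileges and that the
requested value does not exceed the device's max_hw_sectors_kb, which
the kernel is expected to reject.

```python
from pathlib import Path


def max_sectors_path(dev: str, sysfs_root: str = "/sys/block") -> Path:
    """Build the path to a device's max_sectors_kb tunable.

    `dev` is a block device name such as "sdX"; `sysfs_root` is a
    hypothetical override used only for testing.
    """
    return Path(sysfs_root) / dev / "queue" / "max_sectors_kb"


def set_max_sectors_kb(dev: str, value_kb: int,
                       sysfs_root: str = "/sys/block") -> int:
    """Write a new max_sectors_kb value and return the previous one."""
    path = max_sectors_path(dev, sysfs_root)
    previous = int(path.read_text().strip())
    path.write_text(f"{value_kb}\n")
    return previous
```

Calling set_max_sectors_kb("sdX", 1024) before one set of runs and
set_max_sectors_kb("sdX", 32767) before the other mirrors the two
settings compared here; the returned value lets the original setting be
restored afterwards.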

In the tables below, concentrate on the BW/IOPS numbers under the WRITE
column. Negative numbers indicate that max_sectors_kb of 32767 shows a
performance regression of the indicated percentage when compared with a
setting of 1024.
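
The script that produced the %diff figures is not included, so the
following is only a sketch of one plausible way to compute them: take
the mean of the runs at each setting and express the change relative to
the 1024 baseline. The function name and the sample values in the usage
note are hypothetical, not measured data.

```python
from statistics import mean


def percent_diff(baseline_runs, candidate_runs):
    """Percentage change of the candidate mean relative to the baseline mean.

    baseline_runs:  per-run values with max_sectors_kb=1024
    candidate_runs: per-run values with max_sectors_kb=32767
    A negative result means the candidate setting is lower than baseline.
    """
    base = mean(baseline_runs)
    cand = mean(candidate_runs)
    if base == 0:
        raise ValueError("baseline mean is zero; percentage is undefined")
    # Multiply before dividing to keep simple cases exact in floating point.
    return (cand - base) * 100.0 / base
```

With made-up bandwidth samples, percent_diff([100, 100], [86, 86])
gives -14.0, which would be reported as a 14% regression for the larger
setting.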

Christoph, did you have some hardware where a higher max_sectors_kb
improved performance?

Cheers,
Jeff

Vendor identification: HP
Product identification: HSV400
%diff
               READ             WRITE                 CPU
Job Name    BW  IOPS  msec    BW  IOPS  msec     usr     sys     csw
1M           0     0     0     0     0     0    0.00    0.00    0.00
2M           0     0     0   -14   -14    16  -15.75  -15.73    0.00
4M           0     0     0   -17   -17    20  -21.20  -16.23    0.00

Vendor identification: IBM
Product identification: 2810XIV
%diff
               READ             WRITE                 CPU
Job Name    BW  IOPS  msec    BW  IOPS  msec     usr     sys     csw
1M           0     0     0     0     0     0    0.00   23.12    0.00
2M           0     0     0     0     0     0  -10.18    0.00    0.00
4M           0     0     0     0     0     0   -6.08    0.00    0.00

Vendor identification: ATA
Product identification: WDC WD5001FXYZ-0
%diff
               READ             WRITE                 CPU
Job Name    BW  IOPS  msec    BW  IOPS  msec     usr     sys     csw
1M           0     0     0     0     0     0    0.00    0.00    0.00
2M           0     0     0     0     0     0    0.00    0.00    0.00
4M           0     0     0   -30   -30    51  -57.32  -38.11    0.00

Vendor identification: ATA
Product identification: WDC WD10EFRX-68P
%diff
               READ             WRITE                 CPU
Job Name    BW  IOPS  msec    BW  IOPS  msec     usr     sys     csw
1M           0     0     0     0     0     0  -23.73    6.82    0.00
2M           0     0     0     0     0     0   27.16   -7.91    0.00
4M           0     0     0     0     0     0    0.00    0.00    0.00



[global]
ioengine=sync
direct=1
filename=/dev/DEVNAME
size=1G
rw=write

[1M]
stonewall
blocksize=1M

[2M]
stonewall
blocksize=2M

[4M]
stonewall
blocksize=4M