Re: Small writes being split with fdatasync based on non-aligned partition ending

From: Jens Rosenboom
Date: Thu Feb 11 2016 - 04:54:53 EST


2016-02-11 4:48 GMT+01:00 Sitsofe Wheeler <sitsofe@xxxxxxxxx>:
> Trying to cc the GNU parted and linux-block mailing lists.
>
> On 9 February 2016 at 13:02, Jens Rosenboom <j.rosenboom@xxxxxxxx> wrote:
>> While trying to reproduce some performance issues I have been seeing
>> with Ceph, I have come across a strange behaviour which is seemingly
>> affected only by the end point (and thereby the size) of a partition
>> being an odd number of sectors. Since all documentation about
>> alignment only refers to the starting point of the partition, this was
>> pretty surprising and I would like to know whether this is expected
>> behaviour or maybe a kernel issue.
>>
>> The command I am using is pretty simple:
>>
>> fio --rw=randwrite --size=1G --fdatasync=1 --bs=4k --filename=/dev/sdb2 --runtime=10 --name=test
>>
>> The difference shows itself when the partition is created either by
>> sgdisk or by parted:
>>
>> sgdisk --new=2:6000M: /dev/sdb
>>
>> parted -s /dev/sdb mkpart osd-device-1-block 6291456000B 100%
>>
>> The difference in the partition table looks like this:
>>
>> < 2 6291456000B 1600320962559B 1594029506560B osd-device-1-block
>> ---
>> > 2 6291456000B 1600321297919B 1594029841920B osd-device-1-block
>
> Looks like parted took you at your word when you asked for your
> partition at 100%. Just out of curiosity if you try and make the same
> partition interactively with parted do you get any warnings after
> making and after running align-check ?

No warnings and everything fine for align-check. I found out that I
can get the same effect if I step the partition ending manually in
parted in 1s increments. The sequence of write sizes (in sectors) is
8, 1, 2, 1, 4, 1, 2, 1, 8, ..., which corresponds to the size (unit s)
of the resulting partition mod 8.
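
To make the pattern concrete, here is a minimal Python sketch that
reproduces the sequence. It assumes 512-byte sectors and models the
write size as the greatest common divisor of the partition size (in
sectors) and 8. That rule is only inferred from the measurements
above, not confirmed from the kernel source, and the function name is
made up for illustration.

```python
from math import gcd

SECTOR_BYTES = 512      # assumption: 512-byte logical sectors
SECTORS_PER_4K = 8      # a 4k write spans 8 such sectors


def observed_write_sectors(partition_sectors: int) -> int:
    """Write size in sectors, modelled as gcd(partition size, 8).

    Empirical model fitted to the observed sequence; it is not
    taken from the kernel source.
    """
    return gcd(partition_sectors, SECTORS_PER_4K)


if __name__ == "__main__":
    # 1594029506560 B / 512 B, a size that is a multiple of 8 sectors
    base = 1594029506560 // SECTOR_BYTES
    # Grow the partition in 1s steps, as done manually in parted
    print([observed_write_sectors(base + i) for i in range(9)])

    # The two partition sizes from the table above, in bytes
    for size_bytes in (1594029506560, 1594029841920):
        sectors = size_bytes // SECTOR_BYTES
        print(size_bytes, sectors % 8, observed_write_sectors(sectors))
```

The first print gives the sequence quoted above; the loop applies the
same model to the two sizes from the partition table.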

>> So this is really only the end of the partition that is different.
>> However, in the first case, the 4k writes all get broken up into 512b
>> writes somewhere in the kernel, as can be seen with btrace:
>>
>> 8,16 3 36 0.000102666 8184 A WS 12353985 + 1 <- (8,18) 65985
>> 8,16 3 37 0.000102739 8184 Q WS 12353985 + 1 [fio]
>> 8,16 3 38 0.000102875 8184 M WS 12353985 + 1 [fio]
>> 8,16 3 39 0.000103038 8184 A WS 12353986 + 1 <- (8,18) 65986
>> 8,16 3 40 0.000103109 8184 Q WS 12353986 + 1 [fio]
>> 8,16 3 41 0.000103196 8184 M WS 12353986 + 1 [fio]
>> 8,16 3 42 0.000103335 8184 A WS 12353987 + 1 <- (8,18) 65987
>> 8,16 3 43 0.000103403 8184 Q WS 12353987 + 1 [fio]
>> 8,16 3 44 0.000103489 8184 M WS 12353987 + 1 [fio]
>> 8,16 3 45 0.000103609 8184 A WS 12353988 + 1 <- (8,18) 65988
>> 8,16 3 46 0.000103678 8184 Q WS 12353988 + 1 [fio]
>> 8,16 3 47 0.000103767 8184 M WS 12353988 + 1 [fio]
>> 8,16 3 48 0.000103879 8184 A WS 12353989 + 1 <- (8,18) 65989
>> 8,16 3 49 0.000103947 8184 Q WS 12353989 + 1 [fio]
>> 8,16 3 50 0.000104035 8184 M WS 12353989 + 1 [fio]
>> 8,16 3 51 0.000104150 8184 A WS 12353990 + 1 <- (8,18) 65990
>> 8,16 3 52 0.000104219 8184 Q WS 12353990 + 1 [fio]
>> 8,16 3 53 0.000104307 8184 M WS 12353990 + 1 [fio]
>> 8,16 3 54 0.000104452 8184 A WS 12353991 + 1 <- (8,18) 65991
>> 8,16 3 55 0.000104520 8184 Q WS 12353991 + 1 [fio]
>> 8,16 3 56 0.000104609 8184 M WS 12353991 + 1 [fio]
>> 8,16 3 57 0.000104885 8184 I WS 12353984 + 8 [fio]
>>
>> whereas in the second case, I'm getting the expected 4k writes:
>>
>> 8,16 6 42 1266874889.659842036 8409 A WS 12340232 + 8 <- (8,18) 52232
>> 8,16 6 43 1266874889.659842167 8409 Q WS 12340232 + 8 [fio]
>> 8,16 6 44 1266874889.659842393 8409 G WS 12340232 + 8 [fio]
>
> This is weird because --size=1G should mean that fio is "seeing" an
> aligned end. Does direct=1 with a sequential job of iodepth=1 show the
> problem too?

IIUC fio uses the size only to decide where to write. It opens the
block device and passes the resulting fd to the fdatasync call, so
the kernel does not know what size fio thinks the device has. In
fact, the effect is the same without the size=1G option; I used it
only to make sure that the writes do not go anywhere near the badly
aligned partition ending.
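
As a rough illustration of that I/O pattern, the following Python
sketch does buffered 4k writes and calls fdatasync on the fd after
each one. It is not fio's implementation: the function name is
hypothetical, offsets are sequential rather than random, and the path
is whatever you pass in. Pointing it at a block device overwrites
data there, so a scratch file is the safer target.

```python
import os


def write_blocks_with_fdatasync(path: str, block_size: int = 4096,
                                count: int = 4) -> int:
    """Do `count` buffered writes of `block_size` bytes, syncing each one.

    Returns the total number of bytes written.
    """
    payload = b"\0" * block_size
    written = 0
    # No O_DIRECT here: writes go through the page cache first.
    fd = os.open(path, os.O_WRONLY)
    try:
        for i in range(count):
            written += os.pwrite(fd, payload, i * block_size)
            # Only the fd is handed to the kernel; no size is passed along.
            os.fdatasync(fd)
    finally:
        os.close(fd)
    return written
```

Running this against the partition while btrace is active should make
it possible to check whether the split writes also show up without fio.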

direct=1 kills the effect, i.e. all writes are 4k in size again.
Astonishingly though, sequential writes are also affected, i.e.
changing to rw=write in my sample above behaves the same as randwrite.

>> The above examples are from running with an SSD, where the small
>> writes get merged together again before hitting the block device,
>> which is still pretty o.k. performance wise. But when I run the same
>> test on some NVMe device, the writes do not get merged, instead the
>> performance drops to less than 10% of what I get in the second case.
>
> Perhaps the ioscheduler doesn't have the opportunity with the NVMe device...

Yes, there is no scheduler available in this case:

$ cat /sys/block/nvme0n1/queue/scheduler
none

This is just to show that the argument "Don't bother, the writes get
merged back together anyway" doesn't hold true in all cases.
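
For checking several devices at once, a small Python sketch along
these lines can read the same sysfs file. The helper names are
hypothetical; the parser assumes the usual format, in which the
active scheduler is shown in square brackets and a lone entry such as
"none" is the active one.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def parse_scheduler(text: str) -> Tuple[List[str], Optional[str]]:
    """Return (available schedulers, active scheduler) from sysfs text."""
    names = text.split()
    available = [n.strip("[]") for n in names]
    active = None
    for n in names:
        if n.startswith("[") and n.endswith("]"):
            active = n[1:-1]
    # A single unbracketed entry (e.g. "none") is taken as the active one.
    if active is None and len(available) == 1:
        active = available[0]
    return available, active


def read_scheduler(device: str) -> Tuple[List[str], Optional[str]]:
    """Read /sys/block/<device>/queue/scheduler, e.g. device='nvme0n1'."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())
```

For the output shown above, parse_scheduler("none\n") reports "none"
as both the only available and the active entry.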

>> If this is indeed expected behaviour from the kernel pov, it might
>> need some better documentation and probably sgdisk should also be
>> enhanced to align the end of the partition as well. FWIW, this happens
>> on a stock 4.4.0 kernel as well as recent Ubuntu and CentOS kernels.
>
> Do you mean parted?

No, as I am currently assuming that the issue is caused by some effect
happening inside the kernel during the fdatasync call, there was the
idea that only certain kernels might be affected. But I don't have a
clue yet how far back I would have to go in order to find a kernel
that behaves differently.