Re: Small writes being split with fdatasync based on non-aligned partition ending

From: Sitsofe Wheeler
Date: Fri Feb 12 2016 - 01:59:22 EST


CC'ing Jens Axboe.

On 11 February 2016 at 09:54, Jens Rosenboom <j.rosenboom@xxxxxxxx> wrote:
> 2016-02-11 4:48 GMT+01:00 Sitsofe Wheeler <sitsofe@xxxxxxxxx>:
>> Trying to cc the GNU parted and linux-block mailing lists.
>>
>> On 9 February 2016 at 13:02, Jens Rosenboom <j.rosenboom@xxxxxxxx> wrote:
>>> While trying to reproduce some performance issues I have been seeing
>>> with Ceph, I have come across a strange behaviour which is seemingly
>>> affected only by the end point (and thereby the size) of a partition
>>> being an odd number of sectors. Since all documentation about
>>> alignment only refers to the starting point of the partition, this was
>>> pretty surprising and I would like to know whether this is expected
>>> behaviour or maybe a kernel issue.
>>>
>>> The command I am using is pretty simple:
>>>
>>> fio --rw=randwrite --size=1G --fdatasync=1 --bs=4k
>>> --filename=/dev/sdb2 --runtime=10 --name=test
>>>
>>> The difference shows itself when the partition is created either by
>>> sgdisk or by parted:
>>>
>>> sgdisk --new=2:6000M: /dev/sdb
>>>
>>> parted -s /dev/sdb mkpart osd-device-1-block 6291456000B 100%
>>>
>>> The difference in the partition table looks like this:
>>>
>>> < 2 6291456000B 1600320962559B 1594029506560B osd-device-1-block
>>> ---
>>>> 2 6291456000B 1600321297919B 1594029841920B osd-device-1-block
>>
>> Looks like parted took you at your word when you asked for your
>> partition to end at 100%. Just out of curiosity, if you make the same
>> partition interactively with parted, do you get any warnings after
>> creating it or after running align-check?
>
> No warnings and everything fine for align-check. I found out that I
> can get the same effect if I step the partition ending manually in
> parted in 1-sector increments. The sequence of write sizes is 8, 1, 2, 1, 4,
> 1, 2, 1, 8, ... which corresponds to the size (in sectors) of the
> resulting partition mod 8.

OK. Could you add the output of
grep . /sys/block/nvme0n1/queue/*size
sgdisk -D /dev/sdb
and could you also post the whole partition table? Does sgdisk create
a similarly problematic ending if you use
sgdisk --new=2:0 /dev/sdb
? It seems strange that the end of the disk (and thus of a 100% sized
partition) wouldn't end on a multiple of 4k...
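
Also, a hunch: the 8, 1, 2, 1, 4, 1, 2, 1 sequence you found is just
the largest power of two (capped at 8) that divides the partition size
in sectors, which makes me wonder whether the kernel derives the block
device inode's soft block size from the partition size. Assuming
util-linux's blockdev is available, you could compare the two layouts
with:

blockdev --getsz /dev/sdb2   # partition size in 512 byte sectors
blockdev --getbsz /dev/sdb2  # soft block size the kernel picked

If the splitting tracks the soft block size, the odd-sized partition
should report 512 for the second command and the other layout 4096.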

>>> So this is really only the end of the partition that is different.
>>> However, in the first case, the 4k writes all get broken up into 512b
>>> writes somewhere in the kernel, as can be seen with btrace:
>>>
>>> 8,16 3 36 0.000102666 8184 A WS 12353985 + 1 <- (8,18) 65985
>>> 8,16 3 37 0.000102739 8184 Q WS 12353985 + 1 [fio]
>>> 8,16 3 38 0.000102875 8184 M WS 12353985 + 1 [fio]
>>> 8,16 3 39 0.000103038 8184 A WS 12353986 + 1 <- (8,18) 65986
>>> 8,16 3 40 0.000103109 8184 Q WS 12353986 + 1 [fio]
>>> 8,16 3 41 0.000103196 8184 M WS 12353986 + 1 [fio]
>>> 8,16 3 42 0.000103335 8184 A WS 12353987 + 1 <- (8,18) 65987
>>> 8,16 3 43 0.000103403 8184 Q WS 12353987 + 1 [fio]
>>> 8,16 3 44 0.000103489 8184 M WS 12353987 + 1 [fio]
>>> 8,16 3 45 0.000103609 8184 A WS 12353988 + 1 <- (8,18) 65988
>>> 8,16 3 46 0.000103678 8184 Q WS 12353988 + 1 [fio]
>>> 8,16 3 47 0.000103767 8184 M WS 12353988 + 1 [fio]
>>> 8,16 3 48 0.000103879 8184 A WS 12353989 + 1 <- (8,18) 65989
>>> 8,16 3 49 0.000103947 8184 Q WS 12353989 + 1 [fio]
>>> 8,16 3 50 0.000104035 8184 M WS 12353989 + 1 [fio]
>>> 8,16 3 51 0.000104150 8184 A WS 12353990 + 1 <- (8,18) 65990
>>> 8,16 3 52 0.000104219 8184 Q WS 12353990 + 1 [fio]
>>> 8,16 3 53 0.000104307 8184 M WS 12353990 + 1 [fio]
>>> 8,16 3 54 0.000104452 8184 A WS 12353991 + 1 <- (8,18) 65991
>>> 8,16 3 55 0.000104520 8184 Q WS 12353991 + 1 [fio]
>>> 8,16 3 56 0.000104609 8184 M WS 12353991 + 1 [fio]
>>> 8,16 3 57 0.000104885 8184 I WS 12353984 + 8 [fio]
>>>
>>> whereas in the second case, I'm getting the expected 4k writes:
>>>
>>> 8,16 6 42 1266874889.659842036 8409 A WS 12340232 + 8 <- (8,18) 52232
>>> 8,16 6 43 1266874889.659842167 8409 Q WS 12340232 + 8 [fio]
>>> 8,16 6 44 1266874889.659842393 8409 G WS 12340232 + 8 [fio]
>>
>> This is weird because --size=1G should mean that fio is "seeing" an
>> aligned end. Does direct=1 with a sequential job of iodepth=1 show the
>> problem too?
>
> IIUC fio uses the size only to decide where to write; it opens the
> block device and passes the resulting fd to the fdatasync call, so
> the kernel does not know what size fio thinks the device has. In
> fact, the effect is the same without the size=1G option; I used it
> just to make sure that the writes do not go anywhere near the badly
> aligned partition ending.
>
> direct=1 kills the effect, i.e. all writes are 4k-sized again.
> Astonishingly though, sequential writes are also affected, i.e.
> changing to rw=write in my sample above behaves the same as randwrite.

Do you get this style of behaviour without fdatasync (or with larger
values of fdatasync) too?
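
For example (fdatasync=32 is just an arbitrary larger interval, and
end_fsync syncs only once when the job finishes):

fio --rw=randwrite --size=1G --bs=4k --filename=/dev/sdb2 \
    --runtime=10 --fdatasync=32 --name=test
fio --rw=randwrite --size=1G --bs=4k --filename=/dev/sdb2 \
    --runtime=10 --end_fsync=1 --name=test

If the 512b splitting only appears when every write is followed by its
own fdatasync, that would point at the sync-triggered writeback path
rather than at how the dirty pages are created.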

>>> The above examples are from running with an SSD, where the small
>>> writes get merged together again before hitting the block device,
>>> which is still pretty OK performance-wise. But when I run the same
>>> test on an NVMe device, the writes do not get merged; instead the
>>> performance drops to less than 10% of what I get in the second case.
>>
>> Perhaps the I/O scheduler doesn't get the opportunity to merge on the
>> NVMe device...
>
> Yes, there is no scheduler available in this case:
>
> $ cat /sys/block/nvme0n1/queue/scheduler
> none
>
> This is just to show that the argument "Don't bother, the writes get
> merged back together anyway" doesn't hold true in all cases.
>
>>> If this is indeed expected behaviour from the kernel's pov, it might
>>> need some better documentation, and sgdisk should probably be enhanced
>>> to align the end of the partition as well. FWIW, this happens on a
>>> stock 4.4.0 kernel as well as on recent Ubuntu and CentOS kernels.
>>
>> Do you mean parted?
>
> No, as I am currently assuming that the issue is caused by some effect
> happening inside the kernel during the fdatasync call, there was the
> idea that only certain kernels might be affected. But I don't have a
> clue yet how far back I would have to go in order to find a kernel
> that behaves differently.
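
If the ending does turn out to be the trigger, an interim workaround
might be to give parted an explicit end instead of 100%, so that the
partition size stays a multiple of 8 sectors. Something along these
lines, where the end sector is only an example value (pick the highest
end below the last usable sector that keeps the size 8-sector
aligned):

parted -s /dev/sdb unit s mkpart osd-device-1-block 12288000 3125627391

blockdev --getsz /dev/sdb2 should then report a size divisible by 8.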

--
Sitsofe | http://sucs.org/~sits/