Re: [PATCH] block: BFQ default for single queue devices

From: Bryan Gurney
Date: Wed Oct 03 2018 - 13:35:00 EST


On Wed, Oct 3, 2018 at 11:53 AM, Paolo Valente <paolo.valente@xxxxxxxxxx> wrote:
>
>
>> On 3 Oct 2018, at 10:28, Linus Walleij <linus.walleij@xxxxxxxxxx> wrote:
>>
>> On Wed, Oct 3, 2018 at 9:42 AM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>>
>>> There is another class of outliers: host-managed SMR disks (SATA and SCSI,
>>> definitely single hw queue). For these, using mq-deadline is mandatory in many
>>> cases in order to guarantee sequential write command delivery to the device
>>> driver. Having the default changed to bfq, which as far as I know is not SMR
>>> friendly (can sequential writes within a single zone be reordered?) is asking
>>> for trouble (unaligned write errors showing up).
>>
>> Ah, that is interesting.
>>
>> Which device driver files are we talking about here, specifically?
>> I'd like to take a look.
>>
>> I guess what you say is not that you are looking for the deadline
>> scheduling per se (as in deadline scheduling is nice), what you want is
>> the zone locking semantics in that scheduler, is that right?
>>
>> I.e. this business:
>> blk_queue_is_zoned(q)
>> blk_req_zone_write_lock(rq);
>> blk_req_zone_write_unlock(rq);
>> and mq-deadline solves this with a spinlock.
>>
>> I will augment the patch to enforce mq-deadline
>> if blk_queue_is_zoned(q) is true, as it is clear that
>> any device with that characteristic must use mq-deadline.
>>
>> Paolo might be interested in looking into whether BFQ could
>> also handle zoned devices in the future, I have no idea of how
>> hard that would be.
>>
>
> Absolutely, as I already wrote in my reply to Damien.
>
> In the meantime, Linus, augmenting your patch as you propose seems
> a clean and effective solution to me.
>
> Thanks,
> Paolo
>
>> The zoned business seems a bit fragile. Should it even be
>> allowed to select any other scheduler than deadline on these
>> devices? Presenting all compiled in schedulers in
>> /sys/block/device/queue/scheduler sounds like just giving
>> sysadmins too much rope.
>>
>> Yours,
>> Linus Walleij
>

Right now, users of host-managed SMR drives should be using "deadline"
or "mq-deadline", to avoid out-of-order writes in sequential-only
zones.
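
As a rough illustration of that check, the Python sketch below reads the
queue/zoned and queue/scheduler sysfs attributes of each block device and
reports host-managed devices whose active scheduler is neither "deadline"
nor "mq-deadline". The function names and the sysfs_root parameter are made
up for this example; only the two sysfs attributes are taken as given.

```python
import os

# Schedulers named in this thread as safe for host-managed zoned devices.
ZONE_SAFE = {"deadline", "mq-deadline"}


def active_scheduler(text):
    """Return the scheduler shown in [brackets], e.g. 'noop deadline [cfq]' -> 'cfq'."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    # Some kernels print a lone name (such as 'none') without brackets.
    return text.strip()


def read_attr(path):
    """Read a sysfs attribute, returning None if it is missing or unreadable."""
    try:
        with open(path) as f:
            return f.read().strip()
    except OSError:
        return None


def misconfigured_zoned_devices(sysfs_root="/sys/block"):
    """List (device, scheduler) pairs for host-managed devices on an unsafe scheduler."""
    if not os.path.isdir(sysfs_root):
        return []
    bad = []
    for dev in sorted(os.listdir(sysfs_root)):
        queue = os.path.join(sysfs_root, dev, "queue")
        zoned = read_attr(os.path.join(queue, "zoned"))
        sched = read_attr(os.path.join(queue, "scheduler"))
        if zoned != "host-managed" or sched is None:
            continue
        current = active_scheduler(sched)
        if current not in ZONE_SAFE:
            bad.append((dev, current))
    return bad


if __name__ == "__main__":
    for dev, sched in misconfigured_zoned_devices():
        print(f"{dev}: host-managed zoned device is using '{sched}'")
```

Passing a different sysfs_root makes it possible to try the function against
a fake directory tree instead of a live system.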

I'm running into a situation right now on a test system (Fedora 28,
4.18.7 kernel) where I copied test data onto an F2FS filesystem, but I
accidentally forgot to add my "udev rule" file:

# cat /etc/udev/rules.d/99-zoned-block-devices.rules
ACTION=="add|change", KERNEL=="sd[a-z]", ATTRS{queue/zoned}=="host-managed", ATTR{queue/scheduler}="deadline"

...and now, I see these messages when that specific SMR drive is mounted:

kernel: F2FS-fs (sdc): IO Block Size: 4 KB
kernel: F2FS-fs (sdc): Found nat_bits in checkpoint
kernel: F2FS-fs (sdc): Mounted with checkpoint version = 212216ab
kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
kernel: mpt3sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
kernel: scsi_io_completion: 20 callbacks suppressed
kernel: sd 7:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
kernel: sd 7:0:0:0: [sdb] tag#0 Sense Key : Aborted Command [current]
kernel: sd 7:0:0:0: [sdb] tag#0 Add. Sense: No additional sense information
kernel: sd 7:0:0:0: [sdb] tag#0 CDB: Write(16) 8a 00 00 00 00 00 3d d4 ec 99 00 00 00 80 00 00
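
The CDB in the last log line can be decoded to see which blocks the failed
write targeted. This is a small sketch, assuming the standard 16-byte
WRITE(16) layout (opcode 0x8a, 8-byte big-endian LBA in bytes 2-9, 4-byte
transfer length in bytes 10-13); decode_write16 is a name made up for this
example.

```python
import struct


def decode_write16(cdb_hex):
    """Return (lba, nblocks) from a WRITE(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16:
        raise ValueError("WRITE(16) CDB must be 16 bytes long")
    if cdb[0] != 0x8A:
        raise ValueError("not a WRITE(16) opcode")
    # Bytes 2-9: logical block address; bytes 10-13: transfer length in blocks.
    lba, nblocks = struct.unpack_from(">QI", cdb, 2)
    return lba, nblocks


if __name__ == "__main__":
    lba, nblocks = decode_write16("8a 00 00 00 00 00 3d d4 ec 99 00 00 00 80 00 00")
    print(f"LBA {lba:#x}, {nblocks} blocks")
```

For the CDB above this gives LBA 0x3dd4ec99 and a transfer length of 128
blocks.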

I was also running into problems with creating new directories on this
F2FS filesystem. However, "fsck.f2fs" reports no problems. So at
this point, I created a new F2FS filesystem on a second SMR drive, and
am currently copying the data from the "bad" F2FS filesystem to the
"good" one.

I wouldn't call zoned block devices "fragile"; they simply have I/O
rules that didn't previously exist: all writes to sequential-only
zones must be sequential. And one of the things that schedulers do is
reorder writes. After 4.16, sd stopped being the "gatekeeper" that
ensured sequential writes, and the only "zoned-aware" schedulers were
deadline and mq-deadline. Since my test system defaulted to "cfq", I
ran into problems.
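
To make the sequential-write rule concrete, here is a toy Python model of a
sequential-only zone with a write pointer. It is only a sketch of the rule
described above, not how the drive or the kernel implements it; the class
and function names are invented for this example.

```python
class UnalignedWriteError(Exception):
    """Raised when a write does not start at the zone's write pointer."""


class SequentialZone:
    def __init__(self, start, length):
        self.start = start
        self.length = length
        self.write_pointer = start  # next LBA the zone will accept

    def write(self, lba, nblocks):
        if lba != self.write_pointer:
            raise UnalignedWriteError(
                f"write at LBA {lba}, write pointer is at {self.write_pointer}"
            )
        if lba + nblocks > self.start + self.length:
            raise ValueError("write crosses the end of the zone")
        self.write_pointer += nblocks


def issue_all(zone, requests):
    """Issue (lba, nblocks) requests in the given order; return per-request success."""
    results = []
    for lba, nblocks in requests:
        try:
            zone.write(lba, nblocks)
            results.append(True)
        except UnalignedWriteError:
            results.append(False)
    return results


if __name__ == "__main__":
    in_order = [(0, 128), (128, 128)]
    reordered = [(128, 128), (0, 128)]  # what a non-zoned-aware scheduler may do
    print(issue_all(SequentialZone(0, 256), in_order))
    print(issue_all(SequentialZone(0, 256), reordered))
```

With the same two requests, the in-order submission is accepted in full,
while the reordered one has its first write rejected because it does not
start at the write pointer.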

So I welcome any changes that make it impossible for the user to
"accidentally use the wrong scheduler".

At least this time, I didn't "brick" my test system's BIOS, like I did
back in May of this year [1].


Thanks,

Bryan


[1] https://www.spinics.net/lists/linux-block/msg26798.html