Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS
From: Kanchan Joshi
Date: Mon Sep 07 2020 - 03:01:46 EST
On Wed, Aug 19, 2020 at 4:47 PM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>
> On 2020/08/19 19:32, Kanchan Joshi wrote:
> > On Wed, Aug 19, 2020 at 3:08 PM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
> >>
> >> On 2020/08/19 18:27, Kanchan Joshi wrote:
> >>> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig <hch@xxxxxx> wrote:
> >>>>
> >>>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
> >>>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
> >>>>
> >>>> No, it is not.
> >>>
> >>> Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
> >>> I see that null-block zoned and SCSI-ZBC both set this requirement. I
> >>> wonder how it became different for NVMe.
> >>
> >> It is not required for an NVMe ZNS drive that has zone append native support.
> >> zonefs and upcoming btrfs do not use regular writes, removing the requirement
> >> for zone write locking.
> >
> > I understand that if a particular user (zonefs, btrfs etc) is not
> > sending regular-write and sending append instead, write-lock is not
> > required.
> > But if that particular user or some other user (say F2FS) sends
> > regular write(s), write-lock is needed.
>
> And that can be trivially enabled by setting the drive elevator to mq-deadline.
>
> > Above block-layer, both the opcodes REQ_OP_WRITE and
> > REQ_OP_ZONE_APPEND are available to be used by users. And I thought
> > write-lock is taken or not is a per-opcode thing and not per-user (FS,
> > MD/DM, user-space etc.), is not that correct? And MQ-deadline can
> > cater to both the opcodes, while other schedulers cannot serve
> > REQ_OP_WRITE well for zoned-device.
>
> mq-deadline ignores zone append commands. No zone lock is taken for these. In
> scsi, the emulation takes the zone lock before transforming the zone append into
> a regular write. That locking is consistent with the mq-scheduler level locking
> since the same lock bitmap is used. So if the user only issues zone append
> writes, mq-deadline is not needed and there is no reasons to force its use by
> setting ELEVATOR_F_ZBD_SEQ_WRITE. E.g. the user may want to use kyber...
Right, got your point.
> >> In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be set only
> >> and only if the drive does not have native zone append support.
> >
> > Sure I can keep it that way, once I get it right. If it is really not
> > required for native-append drive, it should not be here at the place
> > where I added.
> >
> >> And even in that
> >> case, since for an emulated zone append the zone write lock is taken and
> >> released by the emulation driver itself, ELEVATOR_F_ZBD_SEQ_WRITE is required
> >> only if the user will also be issuing regular writes at high QD. And that is
> >> trivially controllable by the user by simply setting the drive elevator to
> >> mq-deadline. Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is not needed.
> >
> > Are we saying applications should switch schedulers based on the write
> > QD (use any-scheduler for QD1 and mq-deadline for QD-N).
> > Even if it does that, it does not know what other applications would
> > be doing. That seems hard-to-get-right and possible only in a
> > tightly-controlled environment.
>
> Even for SMR, the user is free to set the elevator to none, which disables zone
> write locking. Issuing writes correctly then becomes the responsibility of the
> application. This can be useful for settings that for instance use NCQ I/O
> priorities, which give better results when "none" is used.
Was it not a problem that even if the application is sending writes
correctly, scheduler may not preserve the order.
And even when none is being used, re-queue can happen which may lead
to different ordering.
> As far as I know, zoned drives are always used in tightly controlled
> environments. Problems like "does not know what other applications would be
> doing" are non-existent. Setting up the drive correctly for the use case at hand
> is a sysadmin/server setup problem, based on *the* application (singular)
> requirements.
Fine.
But what about the null-block-zone which sets MQ-deadline but does not
actually use write-lock to avoid race among multiple appends on a
zone.
Does that deserve a fix?
--
Kanchan