Re: [PATCH 1/2] nvme: set io-scheduler requirement for ZNS

From: Damien Le Moal
Date: Mon Sep 07 2020 - 08:57:28 EST


On 2020/09/07 20:54, Kanchan Joshi wrote:
> On Mon, Sep 7, 2020 at 5:07 PM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>>
>> On 2020/09/07 20:24, Kanchan Joshi wrote:
>>> On Mon, Sep 7, 2020 at 1:52 PM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>>>>
>>>> On 2020/09/07 16:01, Kanchan Joshi wrote:
>>>>>> Even for SMR, the user is free to set the elevator to none, which disables zone
>>>>>> write locking. Issuing writes correctly then becomes the responsibility of the
>>>>>> application. This can be useful for settings that for instance use NCQ I/O
>>>>>> priorities, which give better results when "none" is used.
>>>>>
>>>>> Was it not a problem that, even if the application sends writes
>>>>> correctly, the scheduler may not preserve their order?
>>>>> And even when "none" is used, requeueing can happen, which may lead
>>>>> to a different ordering.
>>>>
>>>> "Issuing writes correctly" means doing small writes, one per zone at most. In
>>>> that case, it does not matter if the block layer reorders writes. Per zone, they
>>>> will still be sequential.
>>>>
>>>>>> As far as I know, zoned drives are always used in tightly controlled
>>>>>> environments. Problems like "does not know what other applications would be
>>>>>> doing" are non-existent. Setting up the drive correctly for the use case at hand
>>>>>> is a sysadmin/server setup problem, based on *the* application (singular)
>>>>>> requirements.
>>>>>
>>>>> Fine.
>>>>> But what about null_blk zoned mode, which sets mq-deadline but does not
>>>>> actually use the zone write lock to avoid races among multiple appends
>>>>> on a zone?
>>>>> Does that deserve a fix?
>>>>
>>>> In nullblk, commands are executed under a spinlock. So there is no concurrency
>>>> problem. The spinlock serializes the execution of all commands. null_blk zone
>>>> append emulation thus does not need to take the scheduler level zone write lock
>>>> like scsi does.
>>>
>>> I do not see a spinlock for that. There is "nullb->lock", but its
>>> scope is limited to the memory-backed handling.
>>> For concurrent zone appends on a zone, multiple threads may set the
>>> "same" write pointer in their incoming requests.
>>> Are you referring to another spinlock that can prevent the same wp from
>>> being returned to multiple threads?
>>
>> Checking again, it looks like you are correct. nullb->lock is indeed only used
>> for processing read/write with memory backing turned on.
>> We either need to extend the use of that spinlock, or add one to protect the
>> zone array when doing zoned commands and when checking reads/writes against a
>> zone wp.
>> Care to send a patch? I can send one too.
>
> Sure, I can send.
> Do you think it is not OK to use the zone write lock (same as the SCSI
> emulation) instead of introducing a new spinlock?

The zone write lock will not protect against reads or zone management commands
executed concurrently with writes. Only concurrent writes to the same zone are
serialized by the scheduler zone write locking, and that locking may not be used
at all if the user sets the scheduler to none. A lock providing exclusive access
to the zone array, for any access to or change of it, is needed.
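
For what it's worth, here is a minimal standalone sketch of the kind of
serialization this implies (userspace, pthreads, all names invented for the
example, not the actual null_blk code): every read-modify-write of a zone's wp
happens under a single lock, so concurrent appenders can never be handed the
same write pointer.

/*
 * Standalone illustration only: serialize zone-append write-pointer
 * updates with a lock so concurrent appenders never observe the same wp.
 * All names here are made up for the example.
 */
#include <pthread.h>
#include <stdio.h>

#define ZONE_START	0x10000UL
#define NR_APPENDERS	4
#define APPENDS_PER_THR	1000
#define APPEND_SECTORS	8UL

struct fake_zone {
	pthread_mutex_t lock;	/* protects wp (and would protect zone state) */
	unsigned long wp;
};

static struct fake_zone zone = {
	.lock = PTHREAD_MUTEX_INITIALIZER,
	.wp = ZONE_START,
};

/* Emulated zone append: return the sector the data was written at. */
static unsigned long zone_append(struct fake_zone *z, unsigned long sectors)
{
	unsigned long sector;

	pthread_mutex_lock(&z->lock);
	/*
	 * Without the lock, two threads could read the same wp here and
	 * both report the same append location.
	 */
	sector = z->wp;
	z->wp += sectors;
	pthread_mutex_unlock(&z->lock);

	return sector;
}

static void *appender(void *arg)
{
	int i;

	for (i = 0; i < APPENDS_PER_THR; i++)
		zone_append(&zone, APPEND_SECTORS);

	return NULL;
}

int main(void)
{
	pthread_t thr[NR_APPENDERS];
	unsigned long expected;
	int i;

	for (i = 0; i < NR_APPENDERS; i++)
		pthread_create(&thr[i], NULL, appender, NULL);
	for (i = 0; i < NR_APPENDERS; i++)
		pthread_join(thr[i], NULL);

	expected = ZONE_START +
		   NR_APPENDERS * APPENDS_PER_THR * APPEND_SECTORS;
	printf("wp = 0x%lx (expected 0x%lx)\n", zone.wp, expected);

	return 0;
}

Applied to null_blk, the same idea would mean taking a driver-level lock around
the zone array for the zone append/write emulation and for zone management
commands, independently of whichever scheduler the user selects.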


--
Damien Le Moal
Western Digital Research