Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]

From: Damien Le Moal
Date: Wed Sep 21 2022 - 19:56:09 EST

Next message: Mike Kravetz: "Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault"
Previous message: Tejun Heo: "Re: 10% regression in qperf tcp latency after introducing commit "4a61bf7f9b18 random: defer fast pool mixing to worker""
In reply to: Mike Snitzer: "Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]"
Next in thread: Pankaj Raghav: "Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 9/22/22 02:27, Mike Snitzer wrote:
> On Tue, Sep 20 2022 at 5:11P -0400,
> Pankaj Raghav <p.raghav@xxxxxxxxxxx> wrote:
>
>> - Background and Motivation:
>>
>> The zone storage implementation in Linux, introduced since v4.10, first
>> targetted SMR drives which have a power of 2 (po2) zone size alignment
>> requirement. The po2 zone size was further imposed implicitly by the
>> block layer's blk_queue_chunk_sectors(), used to prevent IO merging
>> across chunks beyond the specified size, since v3.16 through commit
>> 762380ad9322 ("block: add notion of a chunk size for request merging").
>> But this same general block layer po2 requirement for blk_queue_chunk_sectors()
>> was removed on v5.10 through commit 07d098e6bbad ("block: allow 'chunk_sectors'
>> to be non-power-of-2").
>>
>> NAND, which is the media used in newer zoned storage devices, does not
>> naturally align to po2. In these devices, zone capacity(cap) is not the
>> same as the po2 zone size. When the zone cap != zone size, then unmapped
>> LBAs are introduced to cover the space between the zone cap and zone size.
>> po2 requirement does not make sense for these type of zone storage devices.
>> This patch series aims to remove these unmapped LBAs for zoned devices when
>> zone cap is npo2. This is done by relaxing the po2 zone size constraint
>> in the kernel and allowing zoned device with npo2 zone sizes if zone cap
>> == zone size.
>>
>> Removing the po2 requirement from zone storage should be possible
>> now provided that no userspace regression and no performance regressions are
>> introduced. Stop-gap patches have been already merged into f2fs-tools to
>> proactively not allow npo2 zone sizes until proper support is added [1].
>>
>> There were two efforts previously to add support to npo2 devices: 1) via
>> device level emulation [2] but that was rejected with a final conclusion
>> to add support for non po2 zoned device in the complete stack[3] 2)
>> adding support to the complete stack by removing the constraint in the
>> block layer and NVMe layer with support to btrfs, zonefs, etc which was
>> rejected with a conclusion to add a dm target for FS support [0]
>> to reduce the regression impact.
>>
>> This series adds support to npo2 zoned devices in the block and nvme
>> layer and a new **dm target** is added: dm-po2zoned-target. This new
>> target will be initially used for filesystems such as btrfs and
>> f2fs until native npo2 zone support is added.
>
> As this patchset nears the point of being "ready for merge" and DM's
> "zoned" oriented targets are multiplying, I need to understand: where
> are we collectively going? How long are we expecting to support the
> "stop-gap zoned storage" layers we've constructed?
>
> I know https://zonedstorage.io/docs/introduction exists... but it
> _seems_ stale given the emergence of ZNS and new permutations of zoned
> hardware. Maybe that isn't quite fair (it does cover A LOT!) but I'm
> still left wanting (e.g. "bring it all home for me!")...
>
> Damien, as the most "zoned storage" oriented engineer I know, can you
> please kick things off by shedding light on where Linux is now, and
> where it's going, for "zoned storage"?

Let me first start with what we have seen so far with deployments in the
field.

The largest user base for zoned storage is for now hyperscalers (cloud
services) deploying SMR disks. E.g. Dropbox has many times publicized its
use of SMR HDDs. ZNS is fairly new, and while it is being actively
evaluated by many, there are not yet any large scale deployments that I am
aware of.

Most of the large scale SMR users today mainly use the zoned storage
drives directly, without a file system, similarly to their use of regular
block devices. Some erasure coded object store sits on top of the zoned
drives and manage them. The interface used for that has now switched to
using the kernel API, from libzbc pass-through in the early days of SMR
support. With the inclusion of zonefs in kernel 5.6, many are now
switching to using that instead of directly accessing the block device
file. zonefs makes the application development somewhat easier (there is
no need for issuing zone management ioctls) and can also result in
applications that can actually run almost as-is on top of regular block
devices with a file system. That is a very interesting property,
especially in development phase for the user.

Beside these large scale SMR deployments, there are also many smaller
users. For these cases, dm-zoned seemed to be used a lot. In particular,
the Chia cryptocurrency boom (now fading ?) did generate a fair amount of
new SMR users relying on dm-zoned. With btrfs zoned storage support
maturing, dm-zoned is not as needed as it used to though. SMR drives can
be used directly under btrfs and I certainly am always recommending this
approach over dm-zoned+ext4 or dm-zoned+xfs as performance is much better
for write intensive workloads.

For Linux kernel overall, zoned storage is in a very good shape for raw
block device use and zonefs use. Production deployments we are seeing are
a proof of that. Currently, my team effort is mostly focused on btrfs and
zonefs and increasing zoned storage use cases.

1) For btrfs, Johannes and Naohiro are working on stabilizing support for
ZNS (we still have some issues with the management of active zones) and
implementing de-clustered parity RAID support so that zoned drives can be
used in RAID 0, 1, 10, 5, 6 and erasure coded volumes. This will address
use cases such as home NAS boxes, backup servers, small file servers,
video applications (e.g. video surveillance) etc. Essentially, any
application with large storage capacity needs that is not a distributed
setup. There are many.

2) For zonefs, I have some to-do items lined up to improve performance
(better read IO tail latency) and further improve ease of use (e.g. remove
the O_DIRECT write constraint).

3) At the block device level, we are also working on adding zoned block
device specifications to virtio and implementing that support in qemu and
the kernel. Patches are floating around now but not yet merged. This
addresses the use of zoned storage in VM environments through virtio
interface instead of directly attaching devices to guests.

> To give some additional context to help me when you answer: I'm left
> wondering what, if any, role dm-zoned has to play moving forward given
> ZNS is "the future" (and yeah "the future" is now but...)? E.g.: Does
> it make sense to stack dm-zoned ontop of dm-po2zoned!?

That is a lot to unfold in a small paragraph :)

First of all, I would rephrase "ZNS is the future" into "ZNS is a very
interesting alternative to generic NVMe SSDs". The reason being that HDD
are not dead, far from it. They still are way cheaper than SSDs in $/TB :)
So ZNS is not really in competition with SMR HDDs jere. The 2 are
complementary, exactly like regular SSDs are complementary to regular HDDs.

dm-zoned serves some use cases for SMR HDDs (see above) but does not
address ZNS (more on this below). And given that all SMR HDD on the market
today have a zone size that is a power of 2 number of LBAs (256MB zone
size is by far the most common), dm-po2zoned is not required at all for SMR.

Pankaj patch series is all about supporting ZNS devices that have a zone
size that is not a power of 2 number of LBAs as some vendors want to
produce such drives. There is no such move happening in the SMR world as
all users are happy with the current zone sizes which match the kernel
support (which currently requires power-of-2 number of LBAs for the zone
size).

I do not think we have yet reached a consensus on if we really want to
accept any zone size for zoned storage. I personally am not a big fan of
removing the existing constraint as that makes the code somewhat heavier
(multiplication & divisions instead of bit shifts) without introducing any
benefit to the user that I can see (or agree with). And there is also a
risk of forcing onto the users to redesign/change their code to support
different devices in the same system. That is never nice to fragment
support like this for the same device class. This is why several people,
including me, requested something like dm-po2zoned, to avoid breaking user
applications if non-power-of-2 zone size drives support is merged. Better
than nothing for sure, but not ideal either. That is only my opinion.
There are different opinions out there.

> Yet more context: When I'm asked to add full-blown support for
> dm-zoned to RHEL my gut is "please no, why!?". And if we really
> should add dm-zoned is dm-po2zoned now also a requirement (to support
> non-power-of-2 ZNS devices in our never-ending engineering of "zoned
> storage" compatibility stop-gaps)?

Support for dm-zoned in RHEL really depends on if your customers need it.
Having SMR and ZNS block device (CONFIG_BLK_DEV_ZONED) and zonefs support
enabled would already cover a lot of use cases on their own, at least the
ones we see in the field today.

Going forward, we expect more use cases to rely on btrfs rather than
dm-zoned or any equivalent DM target for ZNS. And that can also include
non power of 2 zone size drives as btrfs should normally be able to handle
such devices, if the support for them is merged. But we are not there yet
with btrfs support, hence dm-po2zoned.

But again, that all depends on if Pankaj patch series is accepted, that
is, on everybody accepting that we lift the power-of-2 zone size constraint.
> In addition, it was my understanding that WDC had yet another zoned DM
> target called "dm-zap" that is for ZNS based devices... It's all a bit
> messy in my head (that's on me for not keeping up, but I think we need
> a recap!)

Since the ZNS specification does not define conventional zones, dm-zoned
cannot be used as a standalone DM target (read: single block device) with
NVMe zoned block devices. Furthermore, due to its block mapping scheme,
dm-zoned does not support devices with zones that have a capacity lower
than the zone size. So ZNS is really a big *no* for dm-zoned. dm-zap is a
prototype and in a nutshell is the equivalent of dm-zoned for ZNS. dm-zap
can deal with the smaller zone capacity and does not require conventional
zones. We are not trying to push for dm-zap to be merged for now as we are
still evaluating its potential use cases. We also have a different but
functionally equivalent approach implemented as a block device driver that
we are evaluating internally.

Given the above mentioned usage pattern we have seen so far for zoned
storage, it is not yet clear if something like dm-zap for ZNS is needed
beside some niche use cases.

> So please help me, and others, become more informed as quickly as
> possible! ;)

I hope the above helps. If you want me to develop further any of the
points above, feel free to let me know.

> ps. I'm asking all this in the open on various Linux mailing lists
> because it doesn't seem right to request a concall to inform only
> me... I think others may need similar "zoned storage" help.

All good with me :)

--
Damien Le Moal
Western Digital Research

Next message: Mike Kravetz: "Re: [PATCH] mm: hugetlb: fix UAF in hugetlb_handle_userfault"
Previous message: Tejun Heo: "Re: 10% regression in qperf tcp latency after introducing commit "4a61bf7f9b18 random: defer fast pool mixing to worker""
In reply to: Mike Snitzer: "Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]"
Next in thread: Pankaj Raghav: "Re: Please further explain Linux's "zoned storage" roadmap [was: Re: [PATCH v14 00/13] support zoned block devices with non-power-of-2 zone sizes]"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]