Re: [PATCH 0/2] New zonefs file system

From: Damien Le Moal
Date: Mon Dec 16 2019 - 19:05:30 EST


On 2019/12/16 17:19, Enrico Weigelt, metux IT consult wrote:
> On 12.12.19 19:38, Damien Le Moal wrote:
>
> Hi,
>
>> zonefs is a very simple file system exposing each zone of a zoned block
>> device as a file. Unlike a regular file system with zoned block device
>> support (e.g. f2fs or the on-going btrfs effort), zonefs does not hide
>> the sequential write constraint of zoned block devices to the user.
>
> Just curious: what's the exact definition of "zoned" here ?
> Something like partitions ?

As Carlos commented already, a zoned block device is a Linux abstraction
used to handle SMR HDDs (Shingled Magnetic Recording). These disks
expose an LBA range that is divided into zones that can only be written
sequentially for host-managed models. Other models such as host-aware or
drive-managed allow random writes to all zones at the cost of potentially
serious performance degradation due to disk internal garbage collection
of zones (similar to how an SSD handles erase blocks).

While today zoned block devices exist on the market only in the form of
SMR disks, NVMe SSDs will also soon be available with the completion of
the Zoned Namespace specifications.

Zoning of block devices has several advantages: higher capacities for
HDDs and more predictable and lower IO latencies for SSDs (almost no
internal GC/wear leveling needed). But taking full advantage of these
devices requires software changes on the host due to the sequential write
constraint imposed by the device interface.

> Can these files then also serve as block devices for other filesystems ?
> Just a funny idea: could we handle partitions by a file system ?
>
> Even more funny idea: give file systems block device ops, so they can
> be directly used as such (w/o explicitly using loopdev) ;-)

This is outside the scope of this thread, so let's not start a
discussion about this here. Start a new thread !

>> Files representing sequential write zones of the device must be written
>> sequentially starting from the end of the file (append only writes).
>
> So, these files can only be accessed like a tape ?

Writes must be sequential within a zone but reads can be random to any
written LBA.
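
To make the constraint concrete, here is a toy in-memory model of one
sequential zone. It is only a sketch for illustration, not kernel or
device code: the SequentialZone class and its method names are made up
for this example, and a real device tracks the write pointer itself.

```python
class SequentialZone:
    """Toy model of a single sequential-write-required zone (hypothetical)."""

    def __init__(self, start_lba, num_blocks):
        self.start_lba = start_lba
        self.num_blocks = num_blocks
        self.wp = start_lba      # write pointer: the only LBA that may be written next
        self.blocks = {}         # LBA -> payload, stands in for the medium

    def write(self, lba, payload):
        """Accept a write only if it lands exactly on the write pointer."""
        if self.wp >= self.start_lba + self.num_blocks:
            raise OSError("zone is full")
        if lba != self.wp:
            raise OSError(f"unaligned write: lba={lba}, write pointer={self.wp}")
        self.blocks[lba] = payload
        self.wp += 1

    def read(self, lba):
        """Reads may target any already written LBA, in any order."""
        if not (self.start_lba <= lba < self.wp):
            raise OSError(f"lba={lba} has not been written")
        return self.blocks[lba]

    def reset(self):
        """Rewind the write pointer, discarding the zone content."""
        self.blocks.clear()
        self.wp = self.start_lba
```

Writing LBA 100 then 101 succeeds and both can be read back in any
order, while a write to LBA 103 before 102 is rejected.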

> Assuming you're working ontop of standard block devices anyways (instead
> of tape-like media ;-)) - why introducing such a limitation ?

See above: the limitation is physical, imposed by the device, so that
different improvements can be achieved depending on the storage medium
being used (increased capacity, lower latencies, lower over-provisioning,
etc).

>
>> zonefs is not a POSIX compliant file system. Its goal is to simplify
>> the implementation of zoned block devices support in applications by
>> replacing raw block device file accesses with a richer file based API,
>> avoiding relying on direct block device file ioctls which may
>> be more obscure to developers.
>
> ioctls ?
>
> Last time I checked, block devices could be easily accessed via plain
> file ops (read, write, seek, ...). You can basically treat them just
> like big files of fixed size.

I was not clear, my apologies. I am referring here to the zoned block
device related ioctls defined in include/uapi/linux/blkzoned.h. These
ioctls allow an application to manage the device zones (obtain zone
information, erase zones, etc). These ioctls trigger issuing zone
related commands to the device. These commands are defined by the ZBC
and ZAC standards for SCSI and ATA, and NVMe Zoned Namespace in the very
near future.
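
As a rough idea of what using these ioctls directly looks like from a
language other than C, here is a minimal Python sketch. The structure
layouts and request numbers are re-declared by hand from my reading of
blkzoned.h and should be treated as assumptions to check against the
header; the helper names (report_zones, reset_zone, parse_report) are
made up for this example.

```python
import fcntl
import struct

# Assumed layouts, mirroring include/uapi/linux/blkzoned.h (verify against the header).
REPORT_HDR = struct.Struct("=QI4x")       # struct blk_zone_report: sector, nr_zones, 4 more bytes
ZONE = struct.Struct("=QQQBBBB36x")       # struct blk_zone: start, len, wp, type, cond, non_seq, reset
ZONE_RANGE = struct.Struct("=QQ")         # struct blk_zone_range: sector, nr_sectors

_IOC_WRITE, _IOC_READ = 1, 2

def _ioc(direction, nr, size):
    # Generic Linux _IOC encoding (x86, arm64); some architectures differ.
    return (direction << 30) | (size << 16) | (0x12 << 8) | nr

BLKREPORTZONE = _ioc(_IOC_READ | _IOC_WRITE, 130, REPORT_HDR.size)
BLKRESETZONE = _ioc(_IOC_WRITE, 131, ZONE_RANGE.size)

ZONE_TYPES = {1: "conventional", 2: "seq-write-required", 3: "seq-write-preferred"}

def parse_report(buf):
    """Decode a zone report buffer into a list of dicts (units: 512B sectors)."""
    _, nr_zones = REPORT_HDR.unpack_from(buf, 0)
    zones = []
    for i in range(nr_zones):
        start, length, wp, ztype, cond, _, _ = ZONE.unpack_from(
            buf, REPORT_HDR.size + i * ZONE.size)
        zones.append({"start": start, "len": length, "wp": wp,
                      "type": ZONE_TYPES.get(ztype, "unknown"), "cond": cond})
    return zones

def report_zones(fd, start_sector=0, max_zones=16):
    """Ask the kernel for up to max_zones zone descriptors from start_sector."""
    buf = bytearray(REPORT_HDR.size + max_zones * ZONE.size)
    REPORT_HDR.pack_into(buf, 0, start_sector, max_zones)
    fcntl.ioctl(fd, BLKREPORTZONE, buf)   # the kernel fills buf in place
    return parse_report(buf)

def reset_zone(fd, start_sector, nr_sectors):
    """Rewind the write pointer of the zone(s) covering the given sector range."""
    fcntl.ioctl(fd, BLKRESETZONE, ZONE_RANGE.pack(start_sector, nr_sectors))
```

Even in this small form the application has to carry structure layouts
and request encodings around, which is the kind of detail a file based
interface avoids.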

>> One example of this approach is the
>> implementation of LSM (log-structured merge) tree structures (such as
>> used in RocksDB and LevelDB)
>
> The same LevelDB as used eg. in Chrome browser, which destroys itself
> every time a little temporary problem (eg. disk full) occurs ?
> If that's the usecase I'd rather use a simple in-memory table instead
> and enough swap, as leveldb isn't reliable enough for persistent
> data anyways :p

The intent of my comment was not to advocate for or discuss the merits
of any particular KV implementation. I was only pointing out that zonefs
does not come in a void and that we do have use cases for it and did the
work on some user space software to validate it. LevelDB and RocksDB are
the 2 LSM-tree based KV stores we worked on as they are very popular and
widely used.

>> on zoned block devices by allowing SSTables
>> to be stored in a zone file similarly to a regular file system rather
>> than as a range of sectors of a zoned device. The introduction of the
>> higher level construct "one file is one zone" can help reducing the
>> amount of changes needed in the application while at the same time
>> allowing the use of zoned block devices with various programming
>> languages other than C.
>
> Why not just simply use files on a suited filesystem (w/ low block io
> overhead) or LVM volumes ?

Using a file system compliant with zoned block device constraints such as
f2fs or btrfs (on-going work) is certainly a valid approach. However,
this may not be the optimal one if the application being used has a
mostly sequential write behavior. LSM-tree based KV stores fall into
this category: SSTables are large (several MB) and always written
sequentially. There are no random writes, which facilitates directly
supporting zoned block devices without the need for a file system which
would add a GC background process and degrade performance. As mentioned
in the cover letter, the goal of zonefs is to facilitate the
implementation of this support compared to a pure raw block device use.
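
As an illustration of the "one file is one zone" idea, storing an
SSTable could look roughly like the sketch below. The function names
are made up for this example, and it is written against plain file
calls so it can be tried on a regular file as a stand-in; on an actual
zonefs mount additional constraints (for example direct I/O and block
alignment) may apply and are not handled here.

```python
import os

def append_sstable(zone_file, data):
    """Append data at the current end of a zone file; return the offset used."""
    fd = os.open(zone_file, os.O_WRONLY)
    try:
        # Writes must start from the end of the file (append only writes).
        offset = os.fstat(fd).st_size
        written = os.pwrite(fd, data, offset)
        if written != len(data):
            raise OSError("short write while appending to zone file")
        return offset
    finally:
        os.close(fd)

def read_sstable(zone_file, offset, length):
    """Read back a previously written range; reads may be random."""
    fd = os.open(zone_file, os.O_RDONLY)
    try:
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

The application only has to remember (file, offset, length) for each
table instead of a sector range on the raw device.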

>
>
> --mtx
>


--
Damien Le Moal
Western Digital Research