Re: [PATCH v7 1/2] fs: New zonefs file system

From: Hannes Reinecke
Date: Wed Jan 15 2020 - 04:36:12 EST


On 1/15/20 7:28 AM, Damien Le Moal wrote:
> zonefs is a very simple file system exposing each zone of a zoned block
> device as a file. Unlike a regular file system with zoned block device
> support (e.g. f2fs), zonefs does not hide the sequential write
> constraint of zoned block devices from the user. Files representing
> sequential write zones of the device must be written sequentially
> starting from the end of the file (append only writes).
>
> As such, zonefs is in essence closer to a raw block device access
> interface than to a full featured POSIX file system. The goal of zonefs
> is to simplify the implementation of zoned block device support in
> applications by replacing raw block device file accesses with a richer
> file API, avoiding relying on direct block device file ioctls which may
> be more obscure to developers. One example of this approach is the
> implementation of LSM (log-structured merge) tree structures (such as
> used in RocksDB and LevelDB) on zoned block devices by allowing SSTables
> to be stored in a zone file similarly to a regular file system rather
> than as a range of sectors of a zoned device. The introduction of the
> higher level construct "one file is one zone" can help reduce the
> amount of changes needed in the application as well as introducing
> support for different application programming languages.
>
> Zonefs on-disk metadata is reduced to an immutable super block to
> persistently store a magic number and optional feature flags and
> values. On mount, zonefs uses blkdev_report_zones() to obtain the device
> zone configuration and populates the mount point with a static file tree
> solely based on this information. E.g. file sizes come from the device
> zone type and write pointer offset managed by the device itself.
>
> The zone files created on mount have the following characteristics.
> 1) Files representing zones of the same type are grouped together
> under a common sub-directory:
> * For conventional zones, the sub-directory "cnv" is used.
> * For sequential write zones, the sub-directory "seq" is used.
> These two directories are the only directories that exist in zonefs.
> Users cannot create other directories and cannot rename nor delete
> the "cnv" and "seq" sub-directories.
> 2) The name of zone files is the number of the file within the zone
> type sub-directory, in order of increasing zone start sector.
> 3) The size of conventional zone files is fixed to the device zone size.
> Conventional zone files cannot be truncated.
> 4) The size of sequential zone files represents the file's zone write
> pointer position relative to the zone start sector. Truncating these
> files is allowed only down to 0, in which case, the zone is reset to
> rewind the zone write pointer position to the start of the zone, or
> up to the zone size, in which case the file's zone is transitioned
> to the FULL state (finish zone operation).
> 5) Read and write operations beyond the file zone size are not
> allowed. Any access exceeding the zone size fails with the -EFBIG
> error.
> 6) Creating, deleting, renaming or modifying any attribute of files and
> sub-directories is not allowed.
> 7) There are no restrictions on the type of read and write operations
> that can be issued to conventional zone files. Buffered, direct and
> mmap read & write operations are accepted. For sequential zone files,
> there are no restrictions on read operations, but all write
> operations must be direct IO append writes. mmap write of sequential
> files is not allowed.
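The rules for sequential zone files (items 4, 5 and 7 above) can be modelled with a small in-memory sketch. This is a toy model for reasoning about the semantics, not the kernel implementation. The class name `ZoneFileModel` is hypothetical, and the EINVAL and EPERM error codes are assumptions: the description above only names -EFBIG.

```python
import errno


class ZoneFileModel:
    """Toy in-memory model of one sequential zone file (hypothetical helper)."""

    def __init__(self, zone_size):
        self.zone_size = zone_size  # maximum file size in bytes
        self.size = 0               # file size, i.e. write pointer offset

    def write(self, offset, length):
        # Item 5: no access beyond the zone size (-EFBIG per the description).
        if offset + length > self.zone_size:
            raise OSError(errno.EFBIG, "write exceeds the zone size")
        # Item 7: writes must append at the current end of the file.
        # EINVAL is an assumption made for this sketch.
        if offset != self.size:
            raise OSError(errno.EINVAL, "write is not at the end of the file")
        self.size += length
        return length

    def truncate(self, new_size):
        # Item 4: only 0 (zone reset) or the zone size (zone finish).
        # EPERM is an assumption made for this sketch.
        if new_size not in (0, self.zone_size):
            raise OSError(errno.EPERM, "only 0 or the zone size is allowed")
        self.size = new_size
```

In this model, truncating to the zone size leaves no room for a further append, which mirrors the FULL state described in item 4.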
>
> Several optional features of zonefs can be enabled at format time.
> * Conventional zone aggregation: ranges of contiguous conventional
> zones can be aggregated into a single larger file instead of the
> default one file per zone.
> * File ownership: The owner UID and GID of zone files are by default 0
> (root) but can be changed to any valid UID/GID.
> * File access permissions: the default 640 access permissions can be
> changed.
>
> The mkzonefs tool is used to format zoned block devices for use with
> zonefs. This tool is available on GitHub at:
>
> git@xxxxxxxxxx:damien-lemoal/zonefs-tools.git.
>
> zonefs-tools also includes a test suite which can be run against any
> zoned block device, including a null_blk block device created in zoned
> mode.
>
> Example: the following formats a 15TB host-managed SMR HDD with 256 MB
> zones with the conventional zones aggregation feature enabled.
>
> $ sudo mkzonefs -o aggr_cnv /dev/sdX
> $ sudo mount -t zonefs /dev/sdX /mnt
> $ ls -l /mnt/
> total 0
> dr-xr-xr-x 2 root root 1 Nov 25 13:23 cnv
> dr-xr-xr-x 2 root root 55356 Nov 25 13:23 seq
>
> The size of each zone file sub-directory indicates the number of files
> existing for that zone type. In this example, there is only one
> conventional zone file (all conventional zones are aggregated under a
> single file).
>
> $ ls -l /mnt/cnv
> total 137101312
> -rw-r----- 1 root root 140391743488 Nov 25 13:23 0
>
> This aggregated conventional zone file can be used as a regular file.
>
> $ sudo mkfs.ext4 /mnt/cnv/0
> $ sudo mount -o loop /mnt/cnv/0 /data
>
> The "seq" sub-directory grouping files for sequential write zones has
> in this example 55356 zones.
>
> $ ls -lv /mnt/seq
> total 14511243264
> -rw-r----- 1 root root 0 Nov 25 13:23 0
> -rw-r----- 1 root root 0 Nov 25 13:23 1
> -rw-r----- 1 root root 0 Nov 25 13:23 2
> ...
> -rw-r----- 1 root root 0 Nov 25 13:23 55354
> -rw-r----- 1 root root 0 Nov 25 13:23 55355
>
> For sequential write zone files, the file size changes as data is
> appended at the end of the file, similarly to any regular file system.
>
> $ dd if=/dev/zero of=/mnt/seq/0 bs=4K count=1 conv=notrunc oflag=direct
> 1+0 records in
> 1+0 records out
> 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.000452219 s, 9.1 MB/s
>
> $ ls -l /mnt/seq/0
> -rw-r----- 1 root root 4096 Nov 25 13:23 /mnt/seq/0
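An application can issue the same kind of direct IO append without dd. The following is a minimal sketch, assuming Linux (for `os.O_DIRECT`) and a 4096-byte write granularity as in the dd command above. The helper names `pad_to_block` and `append_direct` are hypothetical, and the zero padding is only appropriate if the application tolerates trailing zeros.

```python
import mmap
import os


def pad_to_block(data, block_size):
    """Return data zero-padded up to a multiple of block_size bytes."""
    remainder = len(data) % block_size
    if remainder:
        data += b"\0" * (block_size - remainder)
    return data


def append_direct(path, data, block_size=4096):
    """Append data at the end of a zone file with direct IO.

    Returns the number of bytes written, which may be less than requested.
    """
    payload = pad_to_block(bytes(data), block_size)
    if not payload:
        return 0
    fd = os.open(path, os.O_WRONLY | os.O_DIRECT)
    try:
        # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
        with mmap.mmap(-1, len(payload)) as buf:
            buf.write(payload)
            # The file size is the write pointer offset, so append there.
            offset = os.fstat(fd).st_size
            return os.pwrite(fd, buf, offset)
    finally:
        os.close(fd)
```

For example, `append_direct("/mnt/seq/0", b"hello")` would write one zero-padded 4096-byte block at the current end of the file, assuming the mount point from the example above.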
>
> The written file can be truncated to the zone size, preventing any
> further write operation.
>
> $ truncate -s 268435456 /mnt/seq/0
> $ ls -l /mnt/seq/0
> -rw-r----- 1 root root 268435456 Nov 25 13:49 /mnt/seq/0
>
> Truncation to 0 size allows freeing the file zone storage space and
> restarting append-writes to the file.
>
> $ truncate -s 0 /mnt/seq/0
> $ ls -l /mnt/seq/0
> -rw-r----- 1 root root 0 Nov 25 13:49 /mnt/seq/0
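The two truncate commands map directly onto the truncate system call, so a program needs no ioctl for either operation. A minimal sketch follows; the helper names `finish_zone` and `reset_zone` are hypothetical. On a regular file system these calls only change the file size. The zone finish and zone reset behaviour is what the description above states for zonefs sequential files.

```python
import os


def finish_zone(path, zone_size):
    """Truncate up to the zone size (on zonefs: zone becomes FULL)."""
    os.truncate(path, zone_size)


def reset_zone(path):
    """Truncate down to 0 (on zonefs: write pointer is rewound)."""
    os.truncate(path, 0)
```

The `zone_size` argument is passed in by the caller here; any other non-zero size is not allowed for sequential zone files according to item 4 of the list above.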
>
> Since files are statically mapped to zones on the disk, the number of
> blocks of a file as reported by stat() and fstat() indicates the size
> of the file zone.
>
> $ stat /mnt/seq/0
> File: /mnt/seq/0
> Size: 0 Blocks: 524288 IO Block: 4096 regular empty file
> Device: 870h/2160d Inode: 50431 Links: 1
> Access: (0640/-rw-r-----) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2019-11-25 13:23:57.048971997 +0900
> Modify: 2019-11-25 13:52:25.553805765 +0900
> Change: 2019-11-25 13:52:25.553805765 +0900
> Birth: -
>
> The number of blocks of the file ("Blocks") in units of 512B blocks
> gives the maximum file size of 524288 * 512 B = 256 MB, corresponding
> to the device zone size in this example. Of note is that the "IO block"
> field always indicates the minimum IO size for writes and corresponds
> to the device physical sector size.
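The same calculation can be done programmatically from the stat fields. The sketch below assumes that `st_blocks` is in 512-byte units, as stated above. The names `ZoneUsage`, `zone_usage` and `zone_usage_of` are hypothetical.

```python
import os
from collections import namedtuple

ZoneUsage = namedtuple("ZoneUsage", "zone_size written remaining")


def zone_usage(st_blocks, st_size):
    """Derive zone size, bytes written and bytes remaining from stat fields."""
    zone_size = st_blocks * 512  # st_blocks counts 512 B blocks
    return ZoneUsage(zone_size, st_size, zone_size - st_size)


def zone_usage_of(path):
    """Convenience wrapper that stats a zone file first."""
    st = os.stat(path)
    return zone_usage(st.st_blocks, st.st_size)


usage = zone_usage(524288, 0)  # values from the stat output above
print(usage.zone_size)         # 268435456, i.e. 256 MiB
```

For a sequential zone file, `remaining` is the amount of data that can still be appended before the zone is full.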
>
> This code contains contributions from:
> * Johannes Thumshirn <jthumshirn@xxxxxxx>,
> * Darrick J. Wong <darrick.wong@xxxxxxxxxx>,
> * Christoph Hellwig <hch@xxxxxx>,
> * Chaitanya Kulkarni <chaitanya.kulkarni@xxxxxxx> and
> * Ting Yao <tingyao@xxxxxxxxxxx>.
>
> Signed-off-by: Damien Le Moal <damien.lemoal@xxxxxxx>
> Reviewed-by: Darrick J. Wong <darrick.wong@xxxxxxxxxx>
> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@xxxxxxx>
> ---
> MAINTAINERS | 9 +
> fs/Kconfig | 1 +
> fs/Makefile | 1 +
> fs/zonefs/Kconfig | 9 +
> fs/zonefs/Makefile | 4 +
> fs/zonefs/super.c | 1177 ++++++++++++++++++++++++++++++++++++
> fs/zonefs/zonefs.h | 175 ++++++
> include/uapi/linux/magic.h | 1 +
> 8 files changed, 1377 insertions(+)
> create mode 100644 fs/zonefs/Kconfig
> create mode 100644 fs/zonefs/Makefile
> create mode 100644 fs/zonefs/super.c
> create mode 100644 fs/zonefs/zonefs.h
>

Reviewed-by: Hannes Reinecke <hare@xxxxxxx>

Cheers,

Hannes
--
Dr. Hannes Reinecke Teamlead Storage & Networking
hare@xxxxxxx +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer