Re: [PATCH v2 1/2] fs: New zonefs file system

From: Damien Le Moal
Date: Sun Dec 22 2019 - 20:33:37 EST


On 2019/12/21 7:38, Darrick J. Wong wrote:
> On Fri, Dec 20, 2019 at 03:55:27PM +0900, Damien Le Moal wrote:
[...]
>> +static int zonefs_inode_setattr(struct dentry *dentry, struct iattr *iattr)
>> +{
>> + struct inode *inode = d_inode(dentry);
>> + int ret;
>> +
>> + ret = setattr_prepare(dentry, iattr);
>> + if (ret)
>> + return ret;
>> +
>> + if ((iattr->ia_valid & ATTR_UID &&
>> + !uid_eq(iattr->ia_uid, inode->i_uid)) ||
>> + (iattr->ia_valid & ATTR_GID &&
>> + !gid_eq(iattr->ia_gid, inode->i_gid))) {
>> + ret = dquot_transfer(inode, iattr);
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + if (iattr->ia_valid & ATTR_SIZE) {
>> + /* The size of conventional zone files cannot be changed */
>> + if (ZONEFS_I(inode)->i_ztype == ZONEFS_ZTYPE_CNV)
>> + return -EPERM;
>> +
>> + ret = zonefs_seq_file_truncate(inode, iattr->ia_size);
>> + if (ret)
>> + return ret;
>> + }
>
> /me wonders if you need to filter out ATTR_MODE changes here, at least
> so you can't make the zone file for a readonly zone writable?

Good point. Will add that to V3.
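
For reference, something as small as the following sketch at the top of
zonefs_inode_setattr() would cover it, assuming read-only and offline zone
inodes get marked immutable at creation time (which is not what v2 does):

	/*
	 * Sketch only: reject attribute changes on immutable inodes, e.g.
	 * files backed by read-only or offline zones, so that a mode change
	 * cannot make the zone file of a read-only zone writable.
	 */
	if (IS_IMMUTABLE(inode))
		return -EPERM;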

> I also wonder, does an O_TRUNC open reset the zone's write pointer to
> zero?

Yes, it does. That does not differ from regular FS behavior, and it is
consistent with the fact that a truncate(0) does exactly the same thing.
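
For what it is worth, a simplified sketch of that path, assuming the block
layer's blkdev_zone_mgmt() helper (this is a sketch of the idea, not the
actual v2 implementation; error handling and serialization against in-flight
IOs are omitted):

/*
 * Sketch only: truncating a sequential zone file to 0 resets the zone
 * write pointer, which is why O_TRUNC and truncate(0) end up doing
 * exactly the same thing.
 */
static int zonefs_seq_file_truncate(struct inode *inode, loff_t isize)
{
	struct zonefs_inode_info *zi = ZONEFS_I(inode);
	int ret;

	/* Only truncation down to 0 makes sense for a sequential zone */
	if (isize != 0)
		return -EPERM;

	ret = blkdev_zone_mgmt(inode->i_sb->s_bdev, REQ_OP_ZONE_RESET,
			       zi->i_zsector,
			       zi->i_max_size >> SECTOR_SHIFT,
			       GFP_NOFS);
	if (ret)
		return ret;

	zi->i_wpoffset = 0;
	truncate_setsize(inode, 0);

	return 0;
}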

[...]
>> +static const struct vm_operations_struct zonefs_file_vm_ops = {
>> + .fault = zonefs_filemap_fault,
>> + .map_pages = filemap_map_pages,
>> + .page_mkwrite = zonefs_filemap_page_mkwrite,
>> +};
>> +
>> +static int zonefs_file_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> + /*
>> + * Conventional zone files can be mmap-ed READ/WRITE.
>> + * For sequential zone files, only readonly mappings are possible.
>
> Hmm, but the code below looks like it allows private writable mmapings
> of sequential zones?

It is my understanding that changes made to pages of a MAP_PRIVATE
mapping are not written back to the underlying file, so a PROT_WRITE,
MAP_PRIVATE mapping is essentially equivalent to a read-only mapping from
the FS point of view. Am I missing something?

I am not sure it makes any sense at all to allow private writable
mappings, but if my assumption is correct, I do not see any reason to
prevent them either.
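
For context, the kind of check being discussed is as small as the following
sketch, which refuses only shared writable mappings of sequential zone files
(whether private writable mappings should also be refused is exactly the
question above):

static int zonefs_file_mmap(struct file *file, struct vm_area_struct *vma)
{
	/*
	 * Conventional zone files can be mmap-ed read/write. For sequential
	 * zone files, refuse shared writable mappings: dirtying pages of a
	 * shared mapping would require writeback at arbitrary offsets, which
	 * sequential zones cannot do. MAP_PRIVATE writable mappings are left
	 * alone since their changes are never written back to the file.
	 */
	if (ZONEFS_I(file_inode(file))->i_ztype == ZONEFS_ZTYPE_SEQ &&
	    (vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
		return -EINVAL;

	file_accessed(file);
	vma->vm_ops = &zonefs_file_vm_ops;

	return 0;
}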

[...]
>> +static const struct iomap_dio_ops zonefs_dio_ops = {
>> + .end_io = zonefs_file_dio_write_end,
>> +};
>> +
>> +static ssize_t zonefs_file_dio_write(struct kiocb *iocb, struct iov_iter *from)
>> +{
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> + struct zonefs_sb_info *sbi = ZONEFS_SB(inode->i_sb);
>> + struct zonefs_inode_info *zi = ZONEFS_I(inode);
>> + size_t count;
>> + ssize_t ret;
>> +
>> + if (iocb->ki_flags & IOCB_NOWAIT) {
>> + if (!inode_trylock(inode))
>> + return -EAGAIN;
>> + } else {
>> + inode_lock(inode);
>> + }
>> +
>> + ret = generic_write_checks(iocb, from);
>> + if (ret <= 0)
>> + goto out;
>> +
>> + iov_iter_truncate(from, zi->i_max_size - iocb->ki_pos);
>> + count = iov_iter_count(from);
>> +
>> + /*
>> + * Direct writes must be aligned to the block size, that is, the device
>> + * physical sector size, to avoid errors when writing sequential zones
>> + * on 512e devices (512B logical sector, 4KB physical sectors).
>> + */
>> + if ((iocb->ki_pos | count) & sbi->s_blocksize_mask) {
>> + ret = -EINVAL;
>> + goto out;
>> + }
>> +
>> + /*
>> + * Enforce sequential writes (append only) in sequential zones.
>> + */
>
> I wonder, shouldn't zonefs require users to open sequential zones with
> O_APPEND? I don't see anything in here that would suggest that it does,
> though maybe I missed something.

Yes, I thought about this too but decided against it for several reasons:
1) Requiring O_APPEND breaks some shell command line tools such as
"truncate", which makes scripting (including tests) harder.
2) Without enforcing O_APPEND, an application doing pwrite() or AIOs to
an incorrect offset will see an error instead of potential file data
corruption (due to the application bug, not the FS). See the sketch
below.
3) Since a sequential zone file size is updated only on completion of
direct IOs, O_APPEND would generate an incorrect offset for AIOs at a
queue depth greater than 1.

Thoughts?
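
To illustrate point 2, the enforcement in zonefs_file_dio_write() amounts to
comparing the write position with the file's cached zone write pointer
offset, along these lines (a sketch of the logic, locking omitted):

	/*
	 * Enforce sequential writes (append only) in sequential zones:
	 * a write submitted at any offset other than the current zone
	 * write pointer fails with -EINVAL instead of silently corrupting
	 * data, regardless of whether the file was opened with O_APPEND.
	 */
	if (zi->i_ztype == ZONEFS_ZTYPE_SEQ &&
	    iocb->ki_pos != zi->i_wpoffset) {
		ret = -EINVAL;
		goto out;
	}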

[...]
>> +static ssize_t zonefs_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> +{
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> + /*
>> + * Check that the write operation does not go beyond the zone size.
>> + */
>> + if (iocb->ki_pos >= ZONEFS_I(inode)->i_max_size)
>> + return -EFBIG;
>> +
>> + if (iocb->ki_flags & IOCB_DIRECT)
>> + return zonefs_file_dio_write(iocb, from);
>> +
>> + return zonefs_file_buffered_write(iocb, from);
>> +}
>> +
>> +static const struct file_operations zonefs_file_operations = {
>> + .open = generic_file_open,
>
> Hmm, ok, so there isn't any explicit O_APPEND requirement, even though
> it looks like the filesystem enforces one.

Yes, on purpose. See above for the reasons.

[...]
>> +static void zonefs_init_file_inode(struct inode *inode, struct blk_zone *zone)
>> +{
>> + struct super_block *sb = inode->i_sb;
>> + struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>> + struct zonefs_inode_info *zi = ZONEFS_I(inode);
>> + umode_t perm = sbi->s_perm;
>> +
>> + zi->i_ztype = zonefs_zone_type(zone);
>> + zi->i_zsector = zone->start;
>> +
>> + switch (zone->cond) {
>> + case BLK_ZONE_COND_OFFLINE:
>> + /*
>> + * Disable all accesses and set the file size to 0 for
>> + * offline zones.
>> + */
>> + zi->i_wpoffset = 0;
>> + zi->i_max_size = 0;
>> + perm = 0;
>> + break;
>> + case BLK_ZONE_COND_READONLY:
>> + /* Do not allow writes in read-only zones*/
>> + perm &= ~(0222); /* S_IWUGO */
>> + /* Fallthrough */
>
> You might want to set S_IMMUTABLE in i_flags here, since (I assume)
> readonly zones are never, ever, going to be modifable in any way?

Good point. Will do.

> In which case, zonefs probably shouldn't let people run 'chmod a+w' on a
> readonly zone. Either that or disallow mode changes via
> zonefs_inode_setattr.

Yes, will do.
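
Something along these lines in zonefs_init_file_inode() should cover both
points (a sketch of the intended v3 change, not the v2 code):

	switch (zone->cond) {
	case BLK_ZONE_COND_OFFLINE:
		/* Offline zones: no access at all, size 0, immutable */
		zi->i_wpoffset = 0;
		zi->i_max_size = 0;
		perm = 0;
		inode->i_flags |= S_IMMUTABLE;
		break;
	case BLK_ZONE_COND_READONLY:
		/*
		 * Read-only zones: clear the write bits and mark the inode
		 * immutable so that a later chmod/setattr cannot re-enable
		 * writes (see the IS_IMMUTABLE check discussed earlier).
		 */
		perm &= ~0222;
		inode->i_flags |= S_IMMUTABLE;
		break;
	default:
		break;
	}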

[...]
>> +static int zonefs_create_zgroup(struct zonefs_zone_data *zd,
>> + enum zonefs_ztype type)
>> +{
>> + struct super_block *sb = zd->sb;
>> + struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>> + struct blk_zone *zone, *next, *end;
>> + char name[ZONEFS_NAME_MAX];
>> + struct dentry *dir;
>> + unsigned int n = 0;
>> +
>> + /* If the group is empty, there is nothing to do */
>> + if (!zd->nr_zones[type])
>> + return 0;
>> +
>> + dir = zonefs_create_inode(sb->s_root, zgroups_name[type], NULL);
>> + if (!dir)
>> + return -ENOMEM;
>> +
>> + /*
>> + * The first zone contains the super block: skip it.
>> + */
>> + end = zd->zones + blkdev_nr_zones(sb->s_bdev->bd_disk);
>> + for (zone = &zd->zones[1]; zone < end; zone = next) {
>> +
>> + next = zone + 1;
>> + if (zonefs_zone_type(zone) != type)
>> + continue;
>> +
>> + /*
>> + * For conventional zones, contiguous zones can be aggregated
>> + * together to form larger files.
>> + * Note that this overwrites the length of the first zone of
>> + * the set of contiguous zones aggregated together.
>> + * Only zones with the same condition can be agreggated so that
>> + * offline zones are excluded and readonly zones are aggregated
>> + * together into a read only file.
>> + */
>> + if (type == ZONEFS_ZTYPE_CNV &&
>> + sbi->s_features & ZONEFS_F_AGGRCNV) {
>
> This probably needs parentheses around the flag check, e.g.
>
> if (type == ZONEFS_ZTYPE_CNV &&
> (sbi->s_features & ZONEFS_F_AGGRCNV)) {

gcc does not complain, but I agree: it is cleaner, and older gcc versions
will probably be happier too :)

[...]
>> +
>> +static int zonefs_get_zone_info(struct zonefs_zone_data *zd)
>> +{
>> + struct block_device *bdev = zd->sb->s_bdev;
>> + int ret;
>> +
>> + zd->zones = kvcalloc(blkdev_nr_zones(bdev->bd_disk),
>> + sizeof(struct blk_zone), GFP_KERNEL);
>
> Hmm, so one 64-byte blk_zone structure for each zone on the disk?
>
> I have a 14TB SMR disk with ~459,000x 32M zones on it. That's going to
> require a contiguous 30MB memory allocation to hold all the zone
> information. Even your 15T drive from the commit message will need a
> contiguous 3.8MB memory allocation for all the zone info.
>
> I wonder if each zone should really be allocated separately and then
> indexed with an xarray or something like that to reduce the chance of
> failure when memory is fragmented or tight.
>
> That could be subsequent work though, since in the meantime that just
> makes zonefs mounts more likely to run out of memory and fail. I
> suppose you don't hang on to the huge allocation for very long.

No, this memory allocation is only needed at mount time. It is dropped as
soon as all the zone file inodes are created. Furthermore, this allocation
uses kvcalloc(), not kmalloc(), so there is no requirement for physically
contiguous memory. It is only an array of structures and is not used to do
the IOs for the zone report itself.

I debated trying to optimize (that is, reducing the temporary memory used
at mount time) by processing zones in small chunks instead of all zones in
one go, but I went with this simple, rather brutal, approach to keep the
code simple. This can be rewritten and optimized at any time if problems
appear.
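
If this ever becomes a real problem, one possible shape for the chunked
approach is to report zones in fixed-size batches and create the inodes
directly from the report callback, so that no big zone array is needed at
all. A rough sketch, where zonefs_process_zone_cb() is a hypothetical
callback that would do the per-zone inode creation:

#define ZONEFS_REPORT_BATCH	4096	/* zones per report, arbitrary */

static int zonefs_process_zones(struct super_block *sb)
{
	struct block_device *bdev = sb->s_bdev;
	sector_t zone_sectors = bdev_zone_sectors(bdev);
	sector_t capacity = get_capacity(bdev->bd_disk);
	sector_t sector = 0;
	int ret;

	while (sector < capacity) {
		ret = blkdev_report_zones(bdev, sector, ZONEFS_REPORT_BATCH,
					  zonefs_process_zone_cb, sb);
		if (ret < 0)
			return ret;
		if (!ret)
			break;

		/* Advance past the zones reported in this batch */
		sector += (sector_t)ret * zone_sectors;
	}

	return 0;
}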

>
>> + if (!zd->zones)
>> + return -ENOMEM;
>> +
>> + /* Get zones information */
>> + ret = blkdev_report_zones(bdev, 0, BLK_ALL_ZONES,
>> + zonefs_get_zone_info_cb, zd);
>> + if (ret < 0) {
>> + zonefs_err(zd->sb, "Zone report failed %d\n", ret);
>> + return ret;
>> + }
>> +
>> + if (ret != blkdev_nr_zones(bdev->bd_disk)) {
>> + zonefs_err(zd->sb, "Invalid zone report (%d/%u zones)\n",
>> + ret, blkdev_nr_zones(bdev->bd_disk));
>> + return -EIO;
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static inline void zonefs_cleanup_zone_info(struct zonefs_zone_data *zd)
>> +{
>> + kvfree(zd->zones);
>> +}
>> +
>> +/*
>> + * Read super block information from the device.
>> + */
>> +static int zonefs_read_super(struct super_block *sb)
>> +{
>> + struct zonefs_sb_info *sbi = ZONEFS_SB(sb);
>> + struct zonefs_super *super;
>> + u32 crc, stored_crc;
>> + struct page *page;
>> + struct bio_vec bio_vec;
>> + struct bio bio;
>> + int ret;
>> +
>> + page = alloc_page(GFP_KERNEL);
>> + if (!page)
>> + return -ENOMEM;
>> +
>> + bio_init(&bio, &bio_vec, 1);
>> + bio.bi_iter.bi_sector = 0;
>> + bio_set_dev(&bio, sb->s_bdev);
>> + bio_set_op_attrs(&bio, REQ_OP_READ, 0);
>> + bio_add_page(&bio, page, PAGE_SIZE, 0);
>> +
>> + ret = submit_bio_wait(&bio);
>> + if (ret)
>> + goto out;
>> +
>> + super = page_address(page);
>> +
>> + stored_crc = super->s_crc;
>> + super->s_crc = 0;
>> + crc = crc32_le(ZONEFS_MAGIC, (unsigned char *)super,
>> + sizeof(struct zonefs_super));
>
> Unusual; usually crc32 computations are seeded with ~0U, but <shrug>.

No strong opinion on this one. I will change to ~0U to follow the
general convention.
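
Concretely, the verification would then look something like this sketch
(only the seed changes):

	stored_crc = super->s_crc;
	super->s_crc = 0;
	crc = crc32_le(~0U, (unsigned char *)super,
		       sizeof(struct zonefs_super));
	if (crc != stored_crc) {
		zonefs_err(sb, "Invalid checksum (expected 0x%08x, got 0x%08x)\n",
			   crc, stored_crc);
		ret = -EIO;
		goto out;
	}

with ZONEFS_MAGIC then checked explicitly against the on-disk magic field
rather than being folded into the CRC seed.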

> Anyway, this looks to be in decent shape now, modulo other comments.

Thank you for your comments. Sending a V3.



--
Damien Le Moal
Western Digital Research