Re: [rfc][patch 3/4] fs: new truncate sequence

From: Boaz Harrosh
Date: Wed Jul 08 2009 - 12:09:56 EST


On 07/08/2009 03:34 PM, Nick Piggin wrote:
> On Wed, Jul 08, 2009 at 06:47:01AM -0400, Christoph Hellwig wrote:
>> On Wed, Jul 08, 2009 at 08:32:25AM +0200, Nick Piggin wrote:
>>> Thanks for the patch, I think I will fold it in to the series. I
>>> think we probably do need to call simple_setsize in inode_setattr
>>> though (unless you propose to eventually convert every filesystem
>>> to define a .setattr). This would also require eg. your ext2
>>> conversion to strip ATTR_SIZE before passing through to inode_setattr.
>> Yes, we should eventually make .setattr mandatory. Doing a default
>> action when a method lacks tends to cause more issues than it solves.
>>
>> I'm happy to help in doing that part of the conversion (and also other
>> bits)
>
> OK well here is what I have now for 3/4 and 4/4. Basically just
> folded your patch on top, changed ordering of some checks, have
> fs clear ATTR_SIZE before calling inode_setattr, add a .new_truncate
> field to check against rather than .truncate, and provide a default
> ATTR_SIZE handler in inode_setattr (simple_setsize).
>
> ---
> Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
> setattr > vmtruncate > truncate, have filesystems call their truncate sequence
> from ->setattr if filesystem specific operations are required. vmtruncate is
> deprecated, and truncate_pagecache and inode_newsize_ok helpers introduced
> previously should be used.
>
> simple_setsize is also introduced to perform the equivalent of vmtruncate.
> simple_setsize gets called by inode_setattr when ATTR_SIZE is passed. So
> filesystems implementing their own truncate code in setattr then calling
> through to inode_setattr should clear ATTR_SIZE.
>
> A new attribute is introduced into inode_operations structure; .new_truncate
> is a temporary hack to distinguish filesystems that implement the new
> truncate system. These guys cannot trim off block past i_size via vmtruncate,
> so instead they must handle it in fs code. This gives better opportunity to
> catch errors etc anyway. .new_truncate and .truncate will go away once all
> filesystems are converted.
>
> Big problem with the previous calling sequence: the filesystem is not called
> until i_size has already changed. This means it is not allowed to fail the
> call, and also it does not know what the previous i_size was. Also, generic
> code calling vmtruncate to truncate allocated blocks in case of error had
> no good way to return a meaningful error (or, for example, atomically handle
> block deallocation).
>
> Signed-off-by: Nick Piggin <npiggin@xxxxxxx>
> ---
> Documentation/filesystems/vfs.txt | 7 ++++++-
> fs/attr.c | 7 ++++++-
> fs/buffer.c | 12 +++++++++---
> fs/direct-io.c | 7 ++++---
> fs/libfs.c | 17 +++++++++++++++++
> include/linux/fs.h | 2 ++
> mm/truncate.c | 6 ++----
> 7 files changed, 46 insertions(+), 12 deletions(-)
>
> Index: linux-2.6/fs/libfs.c
> ===================================================================
> --- linux-2.6.orig/fs/libfs.c
> +++ linux-2.6/fs/libfs.c
> @@ -329,6 +329,22 @@ int simple_rename(struct inode *old_dir,
> return 0;
> }
>
> +int simple_setsize(struct inode *inode, loff_t newsize)
> +{
> + loff_t oldsize;
> + int error;
> +
> + error = inode_newsize_ok(inode, newsize);
> + if (error)
> + return error;
> +
> + oldsize = inode->i_size;
> + i_size_write(inode, newsize);
> + truncate_pagecache(inode, oldsize, newsize);
> +
> + return error;
> +}
> +
> int simple_readpage(struct file *file, struct page *page)
> {
> clear_highpage(page);
> @@ -840,6 +856,7 @@ EXPORT_SYMBOL(generic_read_dir);
> EXPORT_SYMBOL(get_sb_pseudo);
> EXPORT_SYMBOL(simple_write_begin);
> EXPORT_SYMBOL(simple_write_end);
> +EXPORT_SYMBOL(simple_setsize);
> EXPORT_SYMBOL(simple_dir_inode_operations);
> EXPORT_SYMBOL(simple_dir_operations);
> EXPORT_SYMBOL(simple_empty);
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -1527,6 +1527,7 @@ struct inode_operations {
> void * (*follow_link) (struct dentry *, struct nameidata *);
> void (*put_link) (struct dentry *, struct nameidata *, void *);
> void (*truncate) (struct inode *);
> + int new_truncate; /* nasty hack to transition to new truncate code */
> int (*permission) (struct inode *, int);
> int (*setattr) (struct dentry *, struct iattr *);
> int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
> @@ -2332,6 +2333,7 @@ extern int simple_link(struct dentry *,
> extern int simple_unlink(struct inode *, struct dentry *);
> extern int simple_rmdir(struct inode *, struct dentry *);
> extern int simple_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
> +extern int simple_setsize(struct inode *inode, loff_t newsize);
> extern int simple_sync_file(struct file *, struct dentry *, int);
> extern int simple_empty(struct dentry *);
> extern int simple_readpage(struct file *file, struct page *page);
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -1992,9 +1992,14 @@ int block_write_begin(struct file *file,
> * prepare_write() may have instantiated a few blocks
> * outside i_size. Trim these off again. Don't need
> * i_size_read because we hold i_mutex.
> + *
> + * Filesystems which define i_op->new_truncate must
> + * handle this themselves. Eventually this will go
> + * away because everyone will be converted.
> */
> if (pos + len > inode->i_size)
> - vmtruncate(inode, inode->i_size);
> + if (!inode->i_op->new_truncate)
> + vmtruncate(inode, inode->i_size);
> }
> }
>
> @@ -2371,7 +2376,7 @@ int block_commit_write(struct page *page
> *
> * We are not allowed to take the i_mutex here so we have to play games to
> * protect against truncate races as the page could now be beyond EOF. Because
> - * vmtruncate() writes the inode size before removing pages, once we have the
> + * truncate writes the inode size before removing pages, once we have the
> * page lock we can determine safely if the page is beyond EOF. If it is not
> * beyond EOF, then the page is guaranteed safe against truncation until we
> * unlock the page.
> @@ -2595,7 +2600,8 @@ out_release:
> *pagep = NULL;
>
> if (pos + len > inode->i_size)
> - vmtruncate(inode, inode->i_size);
> + if (!inode->i_op->new_truncate)
> + vmtruncate(inode, inode->i_size);
>
> return ret;
> }
> Index: linux-2.6/fs/direct-io.c
> ===================================================================
> --- linux-2.6.orig/fs/direct-io.c
> +++ linux-2.6/fs/direct-io.c
> @@ -1210,14 +1210,15 @@ __blockdev_direct_IO(int rw, struct kioc
> /*
> * In case of error extending write may have instantiated a few
> * blocks outside i_size. Trim these off again for DIO_LOCKING.
> - * NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this by
> - * it's own meaner.
> + * NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this in
> + * their own manner.
> */
> if (unlikely(retval < 0 && (rw & WRITE))) {
> loff_t isize = i_size_read(inode);
>
> if (end > isize && dio_lock_type == DIO_LOCKING)
> - vmtruncate(inode, isize);
> + if (!inode->i_op->new_truncate)
> + vmtruncate(inode, isize);
> }
>
> if (rw == READ && dio_lock_type == DIO_LOCKING)
> Index: linux-2.6/fs/attr.c
> ===================================================================
> --- linux-2.6.orig/fs/attr.c
> +++ linux-2.6/fs/attr.c
> @@ -112,7 +112,12 @@ int inode_setattr(struct inode * inode,
>
> if (ia_valid & ATTR_SIZE &&
> attr->ia_size != i_size_read(inode)) {
> - int error = vmtruncate(inode, attr->ia_size);
> + int error;
> +
> + if (inode->i_op->new_truncate)
> + error = simple_setsize(inode, attr->ia_size);

I don't understand this branch.
If a filesystem has been converted to set "i_op->new_truncate=true"
then it must have been converted to intersect ->setattr and has set
the i_size (And needs to clear ATTR_SIZE, why?)

All other cases of systems not converted, or systems that do not have
->truncate will fall to the "else" part.

before the removal of i_op->new_truncate you will need to do something
with the systems that do not have ->truncate which will be a
.setattr = simple_setattr or something

So I don't understand this conditional

> + else
> + error = vmtruncate(inode, attr->ia_size);
> if (error)
> return error;
> }
> Index: linux-2.6/Documentation/filesystems/vfs.txt
> ===================================================================
> --- linux-2.6.orig/Documentation/filesystems/vfs.txt
> +++ linux-2.6/Documentation/filesystems/vfs.txt
> @@ -401,11 +401,16 @@ otherwise noted.
> started might not be in the page cache at the end of the
> walk).
>
> - truncate: called by the VFS to change the size of a file. The
> + truncate: Deprecated. This will not be called if ->setsize is defined.
> + Called by the VFS to change the size of a file. The
> i_size field of the inode is set to the desired size by the
> VFS before this method is called. This method is called by
> the truncate(2) system call and related functionality.
>
> + Note: ->truncate and vmtruncate are deprecated. Do not add new
> + instances/calls of these. Filesystems shoud be converted to do their
> + truncate sequence via ->setattr().
> +
> permission: called by the VFS to check for access rights on a POSIX-like
> filesystem.
>
> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c
> +++ linux-2.6/mm/truncate.c
> @@ -513,12 +513,10 @@ int vmtruncate(struct inode * inode, lof
> loff_t oldsize;
> int error;
>
> - error = inode_newsize_ok(inode, offset);
> + error = simple_setsize(inode, offset);
> if (error)
> return error;
> - oldsize = inode->i_size;
> - i_size_write(inode, offset);
> - truncate_pagecache(inode, oldsize, offset);
> +
> if (inode->i_op->truncate)
> inode->i_op->truncate(inode);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/