Re: [rfc][patch 1/3] fs: new truncate sequence

From: Boaz Harrosh
Date: Tue Jul 07 2009 - 07:45:20 EST


On 07/06/2009 07:54 PM, Nick Piggin wrote:
> Hi,
>
> Here is a next iteration of the truncate change. 1st patch introduces
> all the mechanism. 2nd patch makes use of some new helper functions
> but otherwise is uninteresting. Last patch converts a couple of filesystems
> just to see how it looks (more complete conversion may actually try to check
> and handle errors when truncating blocks in the fs, but that's outside the
> scope of this series).
>
> I am also working on removing page lock requirement from extending i_size
> during write(2) syscall as we discussed which could allow solution for
> partial-page page_mkwrite problems. I've got some code here but it's got
> some remaining fsx failure I'm tracking down... Anyway I think the truncate
> reorganisation should be a good change regardless (I need it for fsblock
> too).
>
> --
>
> Introduce a new truncate calling sequence into fs/mm subsystems. Rather than
> setattr > vmtruncate > truncate, add a new inode operation ->ftruncate, called
> from inode_setattr when called with ATTR_SIZE. The filesystem will be
> responsible for updating i_size and truncating pagecache.
>
> Generic code which previously called vmtruncate (in order to truncate blocks
> that may have been instantiated past i_size) no longer calls vmtruncate in the
> case that the inode has an ->ftruncate attribute. In that case it is the
> responsibility of the caller to trim off blocks appropriately in case of
> error.
>
> New helper functions, inode_truncate_ok and truncate_pagecache are broken out
> of vmtruncate, allowing the filesystem more flexibility in ordering of
> operations. simple_ftruncate is implemented for filesystems which have no
> need for a ->truncate operation under the old scheme.
>
> Big problem with the previous calling sequence: the filesystem is not called
> until i_size has already changed. This means it is not allowed to fail the
> call, and also it does not know what the previous i_size was. Also, generic
> code calling vmtruncate to truncate allocated blocks in case of error had
> no good way to return a meaningful error (or, for example, atomically handle
> block deallocation).
>
> ---
> Documentation/filesystems/Locking | 7 +--
> Documentation/vm/locking | 2 -
> fs/attr.c | 55 ++++++++++++++++++++++++---
> fs/buffer.c | 9 +++-
> fs/direct-io.c | 6 +--
> fs/libfs.c | 18 +++++++++
> include/linux/fs.h | 4 ++
> include/linux/mm.h | 2 +
> mm/filemap.c | 2 -
> mm/memory.c | 62 +------------------------------
> mm/mremap.c | 4 +-
> mm/nommu.c | 40 --------------------
> mm/truncate.c | 76 ++++++++++++++++++++++++++++++++++++++
> 13 files changed, 168 insertions(+), 119 deletions(-)
>
> Index: linux-2.6/fs/attr.c
> ===================================================================
> --- linux-2.6.orig/fs/attr.c
> +++ linux-2.6/fs/attr.c
> @@ -60,18 +60,61 @@ fine:
> error:
> return retval;
> }
> -
> EXPORT_SYMBOL(inode_change_ok);
>
> +int inode_truncate_ok(struct inode *inode, loff_t offset)
> +{
> + if (inode->i_size < offset) {
> + unsigned long limit;
> +
> + limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
> + if (limit != RLIM_INFINITY && offset > limit)
> + goto out_sig;
> + if (offset > inode->i_sb->s_maxbytes)
> + goto out_big;
> + } else {
> + /*
> + * truncation of in-use swapfiles is disallowed - it would
> + * cause subsequent swapout to scribble on the now-freed
> + * blocks.
> + */
> + if (IS_SWAPFILE(inode))
> + return -ETXTBSY;
> + }
> +
> + return 0;
> +out_sig:
> + send_sig(SIGXFSZ, current, 0);
> +out_big:
> + return -EFBIG;
> +}
> +EXPORT_SYMBOL(inode_truncate_ok);
> +
> int inode_setattr(struct inode * inode, struct iattr * attr)
> {
> unsigned int ia_valid = attr->ia_valid;
>
> - if (ia_valid & ATTR_SIZE &&
> - attr->ia_size != i_size_read(inode)) {
> - int error = vmtruncate(inode, attr->ia_size);
> - if (error)
> - return error;
> + if (ia_valid & ATTR_SIZE) {
> + loff_t offset = attr->ia_size;
> +
> + if (offset != inode->i_size) {

Deleted code used to i_size_read() here.
Is there a good documentation of when / what gets locked
in regarding to inode->i_size? And when can it be accessed
directly and when through helpers?

> + int error;
> +
> + if (inode->i_op->ftruncate) {
> + struct file *filp = NULL;
> + int open = 0;
> +
> + if (ia_valid & ATTR_FILE)
> + filp = attr->ia_file;
> + if (ia_valid & ATTR_OPEN)
> + open = 1;
> + error = inode->i_op->ftruncate(filp, open,
> + inode, offset);
> + } else
> + error = vmtruncate(inode, offset);
> + if (error)
> + return error;
> + }
> }
>
> if (ia_valid & ATTR_UID)
> Index: linux-2.6/fs/libfs.c
> ===================================================================
> --- linux-2.6.orig/fs/libfs.c
> +++ linux-2.6/fs/libfs.c
> @@ -329,6 +329,23 @@ int simple_rename(struct inode *old_dir,
> return 0;
> }
>
> +int simple_ftruncate(struct file *file, int open,
> + struct inode *inode, loff_t offset)
> +{
> + loff_t oldsize;
> + int error;
> +
> + error = inode_truncate_ok(inode, offset);
> + if (error)
> + return error;
> +
> + oldsize = inode->i_size;
> + i_size_write(inode, offset);
> + truncate_pagecache(inode, oldsize, offset);
> +
> + return error;
> +}
> +
> int simple_readpage(struct file *file, struct page *page)
> {
> clear_highpage(page);
> @@ -840,6 +857,7 @@ EXPORT_SYMBOL(generic_read_dir);
> EXPORT_SYMBOL(get_sb_pseudo);
> EXPORT_SYMBOL(simple_write_begin);
> EXPORT_SYMBOL(simple_write_end);
> +EXPORT_SYMBOL(simple_ftruncate);
> EXPORT_SYMBOL(simple_dir_inode_operations);
> EXPORT_SYMBOL(simple_dir_operations);
> EXPORT_SYMBOL(simple_empty);
> Index: linux-2.6/include/linux/fs.h
> ===================================================================
> --- linux-2.6.orig/include/linux/fs.h
> +++ linux-2.6/include/linux/fs.h
> @@ -1527,6 +1527,7 @@ struct inode_operations {
> void * (*follow_link) (struct dentry *, struct nameidata *);
> void (*put_link) (struct dentry *, struct nameidata *, void *);
> void (*truncate) (struct inode *);
> + int (*ftruncate) (struct file *, int, struct inode *, loff_t);
> int (*permission) (struct inode *, int);
> int (*setattr) (struct dentry *, struct iattr *);
> int (*getattr) (struct vfsmount *mnt, struct dentry *, struct kstat *);
> @@ -2332,6 +2333,8 @@ extern int simple_link(struct dentry *,
> extern int simple_unlink(struct inode *, struct dentry *);
> extern int simple_rmdir(struct inode *, struct dentry *);
> extern int simple_rename(struct inode *, struct dentry *, struct inode *, struct dentry *);
> +extern int simple_ftruncate(struct file *file, int open,
> + struct inode *inode, loff_t offset);
> extern int simple_sync_file(struct file *, struct dentry *, int);
> extern int simple_empty(struct dentry *);
> extern int simple_readpage(struct file *file, struct page *page);
> @@ -2367,6 +2370,7 @@ extern int buffer_migrate_page(struct ad
> #endif
>
> extern int inode_change_ok(struct inode *, struct iattr *);
> +extern int inode_truncate_ok(struct inode *, loff_t offset);
> extern int __must_check inode_setattr(struct inode *, struct iattr *);
>
> extern void file_update_time(struct file *file);
> Index: linux-2.6/include/linux/mm.h
> ===================================================================
> --- linux-2.6.orig/include/linux/mm.h
> +++ linux-2.6/include/linux/mm.h
> @@ -805,7 +805,9 @@ static inline void unmap_shared_mapping_
> unmap_mapping_range(mapping, holebegin, holelen, 0);
> }
>
> +extern void truncate_pagecache(struct inode * inode, loff_t old, loff_t new);
> extern int vmtruncate(struct inode * inode, loff_t offset);
> +extern int truncate_blocks(struct inode *inode, loff_t offset);
> extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
>
> #ifdef CONFIG_MMU
> Index: linux-2.6/mm/memory.c
> ===================================================================
> --- linux-2.6.orig/mm/memory.c
> +++ linux-2.6/mm/memory.c
> @@ -282,7 +282,8 @@ void free_pgtables(struct mmu_gather *tl
> unsigned long addr = vma->vm_start;
>
> /*
> - * Hide vma from rmap and vmtruncate before freeing pgtables
> + * Hide vma from rmap and truncate_pagecache before freeing
> + * pgtables
> */
> anon_vma_unlink(vma);
> unlink_file_vma(vma);
> @@ -2358,7 +2359,7 @@ restart:
> * @mapping: the address space containing mmaps to be unmapped.
> * @holebegin: byte in first page to unmap, relative to the start of
> * the underlying file. This will be rounded down to a PAGE_SIZE
> - * boundary. Note that this is different from vmtruncate(), which
> + * boundary. Note that this is different from truncate_pagecache(), which
> * must keep the partial page. In contrast, we must get rid of
> * partial pages.
> * @holelen: size of prospective hole in bytes. This will be rounded
> @@ -2409,63 +2410,6 @@ void unmap_mapping_range(struct address_
> }
> EXPORT_SYMBOL(unmap_mapping_range);
>
> -/**
> - * vmtruncate - unmap mappings "freed" by truncate() syscall
> - * @inode: inode of the file used
> - * @offset: file offset to start truncating
> - *
> - * NOTE! We have to be ready to update the memory sharing
> - * between the file and the memory map for a potential last
> - * incomplete page. Ugly, but necessary.
> - */
> -int vmtruncate(struct inode * inode, loff_t offset)
> -{
> - if (inode->i_size < offset) {
> - unsigned long limit;
> -
> - limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
> - if (limit != RLIM_INFINITY && offset > limit)
> - goto out_sig;
> - if (offset > inode->i_sb->s_maxbytes)
> - goto out_big;
> - i_size_write(inode, offset);
> - } else {
> - struct address_space *mapping = inode->i_mapping;
> -
> - /*
> - * truncation of in-use swapfiles is disallowed - it would
> - * cause subsequent swapout to scribble on the now-freed
> - * blocks.
> - */
> - if (IS_SWAPFILE(inode))
> - return -ETXTBSY;
> - i_size_write(inode, offset);
> -
> - /*
> - * unmap_mapping_range is called twice, first simply for
> - * efficiency so that truncate_inode_pages does fewer
> - * single-page unmaps. However after this first call, and
> - * before truncate_inode_pages finishes, it is possible for
> - * private pages to be COWed, which remain after
> - * truncate_inode_pages finishes, hence the second
> - * unmap_mapping_range call must be made for correctness.
> - */
> - unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
> - truncate_inode_pages(mapping, offset);
> - unmap_mapping_range(mapping, offset + PAGE_SIZE - 1, 0, 1);
> - }
> -
> - if (inode->i_op->truncate)
> - inode->i_op->truncate(inode);
> - return 0;
> -
> -out_sig:
> - send_sig(SIGXFSZ, current, 0);
> -out_big:
> - return -EFBIG;
> -}
> -EXPORT_SYMBOL(vmtruncate);
> -
> int vmtruncate_range(struct inode *inode, loff_t offset, loff_t end)
> {
> struct address_space *mapping = inode->i_mapping;
> Index: linux-2.6/Documentation/filesystems/Locking
> ===================================================================
> --- linux-2.6.orig/Documentation/filesystems/Locking
> +++ linux-2.6/Documentation/filesystems/Locking
> @@ -78,10 +78,9 @@ removexattr: yes
> victim.
> cross-directory ->rename() has (per-superblock) ->s_vfs_rename_sem.
> ->truncate() is never called directly - it's a callback, not a
> -method. It's called by vmtruncate() - library function normally used by
> -->setattr(). Locking information above applies to that call (i.e. is
> -inherited from ->setattr() - vmtruncate() is used when ATTR_SIZE had been
> -passed).
> +method. It's called by the default inode_setattr() library function normally
> +used by ->setattr() when ATTR_SIZE has been passed. Locking information above
> +applies to that call (i.e. is inherited from ->setattr()).
>
> See Documentation/filesystems/directory-locking for more detailed discussion
> of the locking scheme for directory operations.
> Index: linux-2.6/mm/nommu.c
> ===================================================================
> --- linux-2.6.orig/mm/nommu.c
> +++ linux-2.6/mm/nommu.c
> @@ -86,46 +86,6 @@ struct vm_operations_struct generic_file
> };
>
> /*
> - * Handle all mappings that got truncated by a "truncate()"
> - * system call.
> - *
> - * NOTE! We have to be ready to update the memory sharing
> - * between the file and the memory map for a potential last
> - * incomplete page. Ugly, but necessary.
> - */
> -int vmtruncate(struct inode *inode, loff_t offset)
> -{
> - struct address_space *mapping = inode->i_mapping;
> - unsigned long limit;
> -
> - if (inode->i_size < offset)
> - goto do_expand;
> - i_size_write(inode, offset);
> -
> - truncate_inode_pages(mapping, offset);
> - goto out_truncate;
> -
> -do_expand:
> - limit = current->signal->rlim[RLIMIT_FSIZE].rlim_cur;
> - if (limit != RLIM_INFINITY && offset > limit)
> - goto out_sig;
> - if (offset > inode->i_sb->s_maxbytes)
> - goto out;
> - i_size_write(inode, offset);
> -
> -out_truncate:
> - if (inode->i_op->truncate)
> - inode->i_op->truncate(inode);
> - return 0;
> -out_sig:
> - send_sig(SIGXFSZ, current, 0);
> -out:
> - return -EFBIG;
> -}
> -
> -EXPORT_SYMBOL(vmtruncate);
> -
> -/*
> * Return the total memory allocated for this pointer, not
> * just what the caller asked for.
> *
> Index: linux-2.6/Documentation/vm/locking
> ===================================================================
> --- linux-2.6.orig/Documentation/vm/locking
> +++ linux-2.6/Documentation/vm/locking
> @@ -80,7 +80,7 @@ Note: PTL can also be used to guarantee
> mm start up ... this is a loose form of stability on mm_users. For
> example, it is used in copy_mm to protect against a racing tlb_gather_mmu
> single address space optimization, so that the zap_page_range (from
> -vmtruncate) does not lose sending ipi's to cloned threads that might
> +truncate) does not lose sending ipi's to cloned threads that might
> be spawned underneath it and go to user mode to drag in pte's into tlbs.
>
> swap_lock
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -59,7 +59,7 @@
> /*
> * Lock ordering:
> *
> - * ->i_mmap_lock (vmtruncate)
> + * ->i_mmap_lock (truncate_pagecache)
> * ->private_lock (__free_pte->__set_page_dirty_buffers)
> * ->swap_lock (exclusive_swap_page, others)
> * ->mapping->tree_lock
> Index: linux-2.6/mm/mremap.c
> ===================================================================
> --- linux-2.6.orig/mm/mremap.c
> +++ linux-2.6/mm/mremap.c
> @@ -85,8 +85,8 @@ static void move_ptes(struct vm_area_str
> if (vma->vm_file) {
> /*
> * Subtle point from Rajesh Venkatasubramanian: before
> - * moving file-based ptes, we must lock vmtruncate out,
> - * since it might clean the dst vma before the src vma,
> + * moving file-based ptes, we must lock truncate_pagecache
> + * out, since it might clean the dst vma before the src vma,
> * and we propagate stale pages into the dst afterward.
> */
> mapping = vma->vm_file->f_mapping;
> Index: linux-2.6/mm/truncate.c
> ===================================================================
> --- linux-2.6.orig/mm/truncate.c
> +++ linux-2.6/mm/truncate.c
> @@ -465,3 +465,79 @@ int invalidate_inode_pages2(struct addre
> return invalidate_inode_pages2_range(mapping, 0, -1);
> }
> EXPORT_SYMBOL_GPL(invalidate_inode_pages2);
> +
> +/**
> + * truncate_pagecache - unmap mappings "freed" by truncate() syscall
> + * @inode: inode
> + * @old: old file offset
> + * @new: new file offset
> + *
> + * inode's new i_size must already be written before truncate_pagecache
> + * is called.
> + */
> +void truncate_pagecache(struct inode * inode, loff_t old, loff_t new)
> +{
> + VM_BUG_ON(inode->i_size != new);
> +
> + if (new < old) {
> + struct address_space *mapping = inode->i_mapping;
> +
> +#ifdef CONFIG_MMU
> + /*
> + * unmap_mapping_range is called twice, first simply for
> + * efficiency so that truncate_inode_pages does fewer
> + * single-page unmaps. However after this first call, and
> + * before truncate_inode_pages finishes, it is possible for
> + * private pages to be COWed, which remain after
> + * truncate_inode_pages finishes, hence the second
> + * unmap_mapping_range call must be made for correctness.
> + */
> + unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
> + truncate_inode_pages(mapping, new);
> + unmap_mapping_range(mapping, new + PAGE_SIZE - 1, 0, 1);
> +#else
> + truncate_inode_pages(mapping, new);
> +#endif
> + }
> +}
> +EXPORT_SYMBOL(truncate_pagecache);
> +
> +/**
> + * vmtruncate - unmap mappings "freed" by truncate() syscall
> + * @inode: inode of the file used
> + * @offset: file offset to start truncating
> + *
> + * NOTE! We have to be ready to update the memory sharing
> + * between the file and the memory map for a potential last
> + * incomplete page. Ugly, but necessary.
> + *
> + * This function is deprecated and truncate_pagecache should be
> + * used instead.
> + */
> +int vmtruncate(struct inode * inode, loff_t offset)
> +{
> + loff_t oldsize;
> + int error;
> +
> + error = inode_truncate_ok(inode, offset);
> + if (error)
> + return error;
> + oldsize = inode->i_size;
> + i_size_write(inode, offset);
> + truncate_pagecache(inode, oldsize, offset);
> +

Here too it can be:

+ error = simple_ftruncate(NULL, 0, inode, offset);
+ if (error)
+ return error;

> + if (inode->i_op->truncate)
> + inode->i_op->truncate(inode);
> +
> + return error;
> +}
> +EXPORT_SYMBOL(vmtruncate);
> +
> +int truncate_blocks(struct inode *inode, loff_t offset)
> +{
> + if (inode->i_op->ftruncate) /* these guys handle it themselves */
> + return 0;
> +
> + return vmtruncate(inode, offset);
> +}
> +EXPORT_SYMBOL(truncate_blocks);
> Index: linux-2.6/fs/buffer.c
> ===================================================================
> --- linux-2.6.orig/fs/buffer.c
> +++ linux-2.6/fs/buffer.c
> @@ -1992,9 +1992,12 @@ int block_write_begin(struct file *file,
> * prepare_write() may have instantiated a few blocks
> * outside i_size. Trim these off again. Don't need
> * i_size_read because we hold i_mutex.
> + *
> + * Filesystems which define ->ftruncate must handle
> + * this themselves.
> */
> if (pos + len > inode->i_size)
> - vmtruncate(inode, inode->i_size);
> + truncate_blocks(inode, inode->i_size);
> }
> }
>
> @@ -2377,7 +2380,7 @@ int block_commit_write(struct page *page
> *
> * We are not allowed to take the i_mutex here so we have to play games to
> * protect against truncate races as the page could now be beyond EOF. Because
> - * vmtruncate() writes the inode size before removing pages, once we have the
> + * truncate writes the inode size before removing pages, once we have the
> * page lock we can determine safely if the page is beyond EOF. If it is not
> * beyond EOF, then the page is guaranteed safe against truncation until we
> * unlock the page.
> @@ -2601,7 +2604,7 @@ out_release:
> *pagep = NULL;
>
> if (pos + len > inode->i_size)
> - vmtruncate(inode, inode->i_size);
> + truncate_blocks(inode, inode->i_size);
>
> return ret;
> }
> Index: linux-2.6/fs/direct-io.c
> ===================================================================
> --- linux-2.6.orig/fs/direct-io.c
> +++ linux-2.6/fs/direct-io.c
> @@ -1210,14 +1210,14 @@ __blockdev_direct_IO(int rw, struct kioc
> /*
> * In case of error extending write may have instantiated a few
> * blocks outside i_size. Trim these off again for DIO_LOCKING.
> - * NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this by
> - * it's own meaner.
> + * NOTE: DIO_NO_LOCK/DIO_OWN_LOCK callers have to handle this in
> + * their own manner.
> */
> if (unlikely(retval < 0 && (rw & WRITE))) {
> loff_t isize = i_size_read(inode);
>
> if (end > isize && dio_lock_type == DIO_LOCKING)
> - vmtruncate(inode, isize);
> + truncate_blocks(inode, isize);
> }
>
> if (rw == READ && dio_lock_type == DIO_LOCKING)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at http://vger.kernel.org/majordomo-info.html

Boaz
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/