Re: [GIT PULL] xfs: shared data extents support for 4.9-rc1

From: Darrick J. Wong
Date: Wed Oct 12 2016 - 12:51:07 EST


On Wed, Oct 12, 2016 at 11:18:49PM +1100, Dave Chinner wrote:
> Hi Linus,
>
> This is the second part of the XFS updates for this merge cycle.
> This pullreq contains the new shared data extents feature for XFS,
> and can be found at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git tags/xfs-reflink-for-linus-4.9-rc1
>
> The full pull request output is below.
>
> Given the complexity and size of this change I am expecting - like
> the addition of reverse mapping last cycle - that there will be some
> follow-up bug fixes and cleanups around the -rc3 stage for issues
> that I'm sure will show up once the code hits a wider userbase.
>
> What it is:
>
> At the most basic level we are simply adding shared data extents to
> XFS - i.e. a single extent on disk can now have multiple owners. To
> do this we have to add new on-disk features to both track the shared
> extents and the number of times they've been shared. This is done by
> the new "refcount" btree that sits in every allocation group. When
> we share or unshare an extent, this tree gets updated.
>
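
The share/unshare bookkeeping described above can be pictured with a
toy model. This is only a sketch: the class ToyRefcounts and its
methods are made up for illustration, it counts owners per block in a
Python Counter, and it says nothing about the real per-AG btree or its
record format.

```python
from collections import Counter


class ToyRefcounts:
    """Toy per-block owner counter; NOT the on-disk XFS refcount btree."""

    def __init__(self):
        self.owners = Counter()  # block number -> number of owners

    def share(self, start, length):
        """Add one owner to every block in [start, start + length)."""
        for blk in range(start, start + length):
            self.owners[blk] += 1

    def unshare(self, start, length):
        """Drop one owner per block and return the blocks that became free."""
        freed = []
        for blk in range(start, start + length):
            if self.owners[blk] <= 0:
                raise ValueError(f"block {blk} has no owner to drop")
            self.owners[blk] -= 1
            if self.owners[blk] == 0:
                del self.owners[blk]
                freed.append(blk)
        return freed
```

Two files mapping overlapping ranges leave the overlap with a count of
two; only when the last owner goes away does a block become free.
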
> Along with this new tree, the reverse mapping tree needs to be
> updated to track each owner of a shared extent. This also needs to
> be updated on every share/unshare operation. These interactions at
> extent allocation and freeing time have complex ordering and
> recovery constraints, so there's a significant amount of new
> intent-based transaction code to ensure that operations are
> performed atomically from both the runtime and integrity/crash
> recovery perspectives.
>
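
The intent idea can be sketched roughly as follows. Everything here
(ToyIntentLog, recover, the record tuples) is hypothetical and far
simpler than the real log items. It also assumes that replaying an
operation from its intent is safe, i.e. that none of its effects became
durable before the crash.

```python
class ToyIntentLog:
    """Append-only log of ("intent" | "done", op_id, payload) records."""

    def __init__(self):
        self.records = []

    def log_intent(self, op_id, payload):
        # Written before the multi-step update starts.
        self.records.append(("intent", op_id, payload))

    def log_done(self, op_id):
        # Written once every step of the update has been applied.
        self.records.append(("done", op_id, None))

    def pending(self):
        """Return (op_id, payload) for intents with no matching done record."""
        done = {op_id for kind, op_id, _ in self.records if kind == "done"}
        return [(op_id, payload)
                for kind, op_id, payload in self.records
                if kind == "intent" and op_id not in done]


def recover(log, apply):
    """Replay every unfinished intent in log order, then mark it done."""
    for op_id, payload in log.pending():
        apply(payload)
        log.log_done(op_id)
```

After a simulated crash, only the operations that never logged "done"
are handed to apply, so a half-finished share/unshare is either
completed or never visible.
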
> We also need to break sharing when writes hit a shared extent - this
> is where the new copy-on-write implementation comes in. We allocate
> new storage and copy the original data along with the overwrite data
> into the new location. We only do this for data as we don't share
> metadata at all - each inode has its own metadata that tracks the
> shared data extents, the extents undergoing CoW and its own private
> extents.
>
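
A minimal sketch of the break-sharing-on-write behaviour, under loud
assumptions: ToyDisk, ToyFile and the 16-byte BLOCK are invented for
the example, writes are limited to a single block, and there is no
delayed allocation or CoW fork here. It only shows "allocate new
storage, copy the original data plus the overwrite, leave the other
owner untouched".

```python
BLOCK = 16  # toy block size in bytes (assumption; real blocks are far larger)


class ToyDisk:
    """Physical blocks plus a per-block owner count."""

    def __init__(self):
        self.blocks = {}     # physical block -> bytes
        self.refcount = {}   # physical block -> number of owners
        self.next_free = 0

    def alloc(self, data):
        blk = self.next_free
        self.next_free += 1
        self.blocks[blk] = data
        self.refcount[blk] = 1
        return blk


class ToyFile:
    """Per-file metadata: a private logical -> physical block map."""

    def __init__(self, disk):
        self.disk = disk
        self.map = {}

    def reflink_from(self, other):
        """Share all of other's blocks; assumes this file is still empty."""
        for lblk, pblk in other.map.items():
            self.map[lblk] = pblk
            self.disk.refcount[pblk] += 1

    def read(self, lblk):
        return self.disk.blocks[self.map[lblk]]

    def write(self, lblk, offset, payload):
        if offset < 0 or offset + len(payload) > BLOCK:
            raise ValueError("write must fit inside one block")
        pblk = self.map.get(lblk)
        old = self.disk.blocks[pblk] if pblk is not None else bytes(BLOCK)
        new = old[:offset] + payload + old[offset + len(payload):]
        if pblk is not None and self.disk.refcount[pblk] == 1:
            self.disk.blocks[pblk] = new      # private block: overwrite in place
            return
        if pblk is not None:
            self.disk.refcount[pblk] -= 1     # shared block: drop our reference
        self.map[lblk] = self.disk.alloc(new)  # old data + overwrite, new home
```

Only the data blocks are shared; each ToyFile keeps its own map, which
mirrors the point that metadata is never shared.
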
> Of course, being XFS, nothing is simple - we use delayed allocation
> for CoW similar to how we use it for normal writes. ENOSPC is a
> significant issue here - we build on the reservation code added
> in 4.8-rc1 with the reverse mapping feature to ensure we don't get
> spurious ENOSPC issues part way through a CoW operation. These
> mechanisms also help minimise fragmentation due to repeated CoW
> operations. To further reduce fragmentation overhead, we've also
> introduced a CoW extent size hint, which indicates how large a
> region we should allocate when we execute a CoW operation.
>
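
One way to picture the CoW extent size hint: widen the range being
written out to hint-sized boundaries and allocate that whole region at
once. The function below is a guess at the arithmetic, not the
allocator's actual logic; cow_alloc_range is a made-up name, and the
default of 32 blocks is taken from the "set a default CoW extent size
of 32 blocks" patch in the shortlog.

```python
def cow_alloc_range(write_start, write_len, hint=32):
    """Widen [write_start, write_start + write_len) to hint-aligned bounds.

    All values are in filesystem blocks. Returns (start, length) of the
    region a CoW allocation would cover under this rounding assumption.
    """
    if write_len <= 0 or hint <= 0:
        raise ValueError("write_len and hint must be positive")
    start = (write_start // hint) * hint                 # round down
    end = -(-(write_start + write_len) // hint) * hint   # round up
    return start, end - start
```

A two-block overwrite at block 33 would then reserve blocks 32..63 in
one go, so later small overwrites nearby land in the same contiguous
region instead of each carving out its own fragment.
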
> With all this functionality in place, we can hook up
> .copy_file_range, .clone_file_range and .dedupe_file_range and we
> gain all the capabilities of reflink and other vfs provided
> functionality that enables manipulation of shared extents. We also
> added a fallocate mode that explicitly unshares a range of a file,
> which we implemented as an explicit CoW of all the shared extents in
> a file.
>
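
From userspace, a whole-file clone is normally requested with the
FICLONE ioctl from <linux/fs.h>. A hedged sketch of a caller follows:
clone_or_copy is a made-up helper, the ioctl number is written out by
hand rather than taken from a header, and any OSError (for example on a
filesystem without reflink support) is treated as "fall back to a plain
copy", which may be too forgiving for real tools.

```python
import fcntl
import shutil

# _IOW(0x94, 9, int) as defined for FICLONE in <linux/fs.h>; spelled out
# here because older Python versions do not export it.
FICLONE = 0x40049409


def clone_or_copy(src_path, dst_path):
    """Reflink src into dst if possible, else copy the bytes.

    Returns True when the clone ioctl succeeded, False on fallback.
    """
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        try:
            # Ask the filesystem to make dst share all of src's extents.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
            return True
        except OSError:
            # No reflink here (or cross-filesystem, etc.): copy the data.
            shutil.copyfileobj(src, dst)
            return False
```

Either way the destination ends up with the same contents; the return
value only tells the caller whether the blocks are shared.
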
> As such, it's a huge chunk of new functionality with new on-disk
> format features and internal infrastructure. It warns at mount time
> as an experimental feature and that it may eat data (as we do with
> all new on-disk features until they stabilise). We have not
> released userspace support for it yet - userspace support currently
> requires download from Darrick's xfsprogs repo and build from
> source, so the access to this feature is really developer/tester
> only at this point. Initial userspace support will be released at
> the same time the kernel with this code in it is released.

Userland support is in this branch:
https://github.com/djwong/xfsprogs/tree/for-dave-for-4.9-15

There will undoubtedly be more of these since Dave will libxfs-apply
the kernel patches into for-next after the merge window closes, after
which I'll rebase the tool patches against that.

> The new code causes 5-6 new failures with xfstests - these aren't
> serious functional failures but things like the output of tests changing
> slightly due to perturbations in layouts, space usage, etc. OTOH,
> we've added 150+ new tests to xfstests that specifically exercise
> this new functionality so it's got far better test coverage than any
> functionality we've previously added to XFS.

https://github.com/djwong/xfstests/tree/djwong-devel
has fixes to some of the tests, if you dare. :)

I'll resync with upstream the next time I see a xfstests.git update.
(Merge window is open, so I don't anticipate that until next week.)

> Darrick has done a pretty amazing job getting us to this stage, and
> special mention also needs to go to Christoph (review, testing,
> improvements and bug fixes) and Brian (caught several intricate
> bugs during review) for the effort they've also put in.

Yes, my hearty thanks to Dave, Christoph, and Brian for their support!

--D

>
> Thanks,
>
> -Dave.
>
> ----------
> The following changes since commit 155cd433b516506df065866f3d974661f6473572:
>
> Merge branch 'xfs-4.9-log-recovery-fixes' into for-next (2016-10-03 09:56:28 +1100)
>
> are available in the git repository at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git tags/xfs-reflink-for-linus-4.9-rc1
>
> for you to fetch changes up to feac470e3642e8956ac9b7f14224e6b301b9219d:
>
> xfs: convert COW blocks to real blocks before unwritten extent conversion (2016-10-11 09:03:19 +1100)
>
> ----------------------------------------------------------------
> xfs: reflink update for 4.9-rc1
>
>  < XFS has gained super CoW powers! >
>   ----------------------------------
>          \   ^__^
>           \  (oo)\_______
>              (__)\       )\/\
>                  ||----w |
>                  ||     ||
>
> Included in this update:
> - unshare range (FALLOC_FL_UNSHARE) support for fallocate
> - copy-on-write extent size hints (FS_XFLAG_COWEXTSIZE) for fsxattr interface
> - shared extent support for XFS
> - copy-on-write support for shared extents
> - copy_file_range support
> - clone_file_range support (implements reflink)
> - dedupe_file_range support
> - defrag support for reverse mapping enabled filesystems
>
> ----------------------------------------------------------------
> Christoph Hellwig (1):
> xfs: convert COW blocks to real blocks before unwritten extent conversion
>
> Darrick J. Wong (70):
> vfs: support FS_XFLAG_COWEXTSIZE and get/set of CoW extent size hint
> vfs: add a FALLOC_FL_UNSHARE mode to fallocate to unshare a range of blocks
> xfs: return an error when an inline directory is too small
> xfs: define tracepoints for refcount btree activities
> xfs: introduce refcount btree definitions
> xfs: refcount btree add more reserved blocks
> xfs: define the on-disk refcount btree format
> xfs: add refcount btree support to growfs
> xfs: account for the refcount btree in the alloc/free log reservation
> xfs: add refcount btree operations
> xfs: create refcount update intent log items
> xfs: log refcount intent items
> xfs: adjust refcount of an extent of blocks in refcount btree
> xfs: connect refcount adjust functions to upper layers
> xfs: adjust refcount when unmapping file blocks
> xfs: add refcount btree block detection to log recovery
> xfs: reserve AG space for the refcount btree root
> xfs: introduce reflink utility functions
> xfs: create bmbt update intent log items
> xfs: log bmap intent items
> xfs: map an inode's offset to an exact physical block
> xfs: pass bmapi flags through to bmap_del_extent
> xfs: implement deferred bmbt map/unmap operations
> xfs: when replaying bmap operations, don't let unlinked inodes get reaped
> xfs: return work remaining at the end of a bunmapi operation
> xfs: define tracepoints for reflink activities
> xfs: add reflink feature flag to geometry
> xfs: don't allow reflinked dir/dev/fifo/socket/pipe files
> xfs: introduce the CoW fork
> xfs: support bmapping delalloc extents in the CoW fork
> xfs: create delalloc extents in CoW fork
> xfs: support allocating delayed extents in CoW fork
> xfs: allocate delayed extents in CoW fork
> xfs: support removing extents from CoW fork
> xfs: move mappings from cow fork to data fork after copy-write
> xfs: report shared extent mappings to userspace correctly
> xfs: implement CoW for directio writes
> xfs: cancel CoW reservations and clear inode reflink flag when freeing blocks
> xfs: cancel pending CoW reservations when destroying inodes
> xfs: store in-progress CoW allocations in the refcount btree
> xfs: reflink extents from one file to another
> xfs: add clone file and clone range vfs functions
> xfs: add dedupe range vfs function
> xfs: teach get_bmapx about shared extents and the CoW fork
> xfs: swap inode reflink flags when swapping inode extents
> xfs: unshare a range of blocks via fallocate
> xfs: create a separate cow extent size hint for the allocator
> xfs: preallocate blocks for worst-case btree expansion
> xfs: don't allow reflink when the AG is low on space
> xfs: try other AGs to allocate a BMBT block
> xfs: garbage collect old cowextsz reservations
> xfs: increase log reservations for reflink
> xfs: add shared rmap map/unmap/convert log item types
> xfs: use interval query for rmap alloc operations on shared files
> xfs: convert unwritten status of reverse mappings for shared files
> xfs: set a default CoW extent size of 32 blocks
> xfs: check for invalid inode reflink flags
> xfs: don't mix reflink and DAX mode for now
> xfs: simulate per-AG reservations being critically low
> xfs: recognize the reflink feature bit
> xfs: various swapext cleanups
> xfs: refactor swapext code
> xfs: implement swapext for rmap filesystems
> xfs: check inode reflink flag before calling reflink functions
> xfs: reduce stack usage of _reflink_clear_inode_flag
> xfs: remove isize check from unshare operation
> xfs: fix label inaccuracies
> xfs: fix error initialization
> xfs: clear reflink flag if setting realtime flag
> xfs: rework refcount cow recovery error handling
>
> fs/open.c | 5 +
> fs/xfs/Makefile | 7 +
> fs/xfs/libxfs/xfs_ag_resv.c | 15 +-
> fs/xfs/libxfs/xfs_alloc.c | 23 +
> fs/xfs/libxfs/xfs_bmap.c | 575 +++++++++++-
> fs/xfs/libxfs/xfs_bmap.h | 67 +-
> fs/xfs/libxfs/xfs_bmap_btree.c | 18 +
> fs/xfs/libxfs/xfs_btree.c | 8 +-
> fs/xfs/libxfs/xfs_btree.h | 16 +
> fs/xfs/libxfs/xfs_defer.h | 2 +
> fs/xfs/libxfs/xfs_format.h | 97 +-
> fs/xfs/libxfs/xfs_fs.h | 10 +-
> fs/xfs/libxfs/xfs_inode_buf.c | 24 +-
> fs/xfs/libxfs/xfs_inode_buf.h | 1 +
> fs/xfs/libxfs/xfs_inode_fork.c | 70 +-
> fs/xfs/libxfs/xfs_inode_fork.h | 28 +-
> fs/xfs/libxfs/xfs_log_format.h | 118 ++-
> fs/xfs/libxfs/xfs_refcount.c | 1698 ++++++++++++++++++++++++++++++++++++
> fs/xfs/libxfs/xfs_refcount.h | 70 ++
> fs/xfs/libxfs/xfs_refcount_btree.c | 451 ++++++++++
> fs/xfs/libxfs/xfs_refcount_btree.h | 74 ++
> fs/xfs/libxfs/xfs_rmap.c | 1120 +++++++++++++++++++++---
> fs/xfs/libxfs/xfs_rmap.h | 7 +
> fs/xfs/libxfs/xfs_rmap_btree.c | 82 +-
> fs/xfs/libxfs/xfs_rmap_btree.h | 7 +
> fs/xfs/libxfs/xfs_sb.c | 9 +
> fs/xfs/libxfs/xfs_shared.h | 2 +
> fs/xfs/libxfs/xfs_trans_resv.c | 23 +-
> fs/xfs/libxfs/xfs_trans_resv.h | 3 +
> fs/xfs/libxfs/xfs_trans_space.h | 9 +
> fs/xfs/libxfs/xfs_types.h | 3 +-
> fs/xfs/xfs_aops.c | 222 ++++-
> fs/xfs/xfs_aops.h | 4 +-
> fs/xfs/xfs_bmap_item.c | 508 +++++++++++
> fs/xfs/xfs_bmap_item.h | 98 +++
> fs/xfs/xfs_bmap_util.c | 589 ++++++++++---
> fs/xfs/xfs_dir2_readdir.c | 3 +-
> fs/xfs/xfs_error.h | 10 +-
> fs/xfs/xfs_file.c | 221 ++++-
> fs/xfs/xfs_fsops.c | 107 ++-
> fs/xfs/xfs_fsops.h | 3 +
> fs/xfs/xfs_globals.c | 5 +-
> fs/xfs/xfs_icache.c | 243 +++++-
> fs/xfs/xfs_icache.h | 7 +
> fs/xfs/xfs_inode.c | 51 ++
> fs/xfs/xfs_inode.h | 19 +
> fs/xfs/xfs_inode_item.c | 2 +-
> fs/xfs/xfs_ioctl.c | 75 +-
> fs/xfs/xfs_iomap.c | 35 +-
> fs/xfs/xfs_iomap.h | 3 +-
> fs/xfs/xfs_iops.c | 1 +
> fs/xfs/xfs_itable.c | 8 +-
> fs/xfs/xfs_linux.h | 1 +
> fs/xfs/xfs_log_recover.c | 357 ++++++++
> fs/xfs/xfs_mount.c | 32 +
> fs/xfs/xfs_mount.h | 8 +
> fs/xfs/xfs_ondisk.h | 3 +
> fs/xfs/xfs_pnfs.c | 7 +
> fs/xfs/xfs_refcount_item.c | 539 ++++++++++++
> fs/xfs/xfs_refcount_item.h | 101 +++
> fs/xfs/xfs_reflink.c | 1688 +++++++++++++++++++++++++++++++++++
> fs/xfs/xfs_reflink.h | 58 ++
> fs/xfs/xfs_rmap_item.c | 12 +
> fs/xfs/xfs_stats.c | 1 +
> fs/xfs/xfs_stats.h | 18 +-
> fs/xfs/xfs_super.c | 87 ++
> fs/xfs/xfs_sysctl.c | 9 +
> fs/xfs/xfs_sysctl.h | 1 +
> fs/xfs/xfs_trace.h | 742 +++++++++++++++-
> fs/xfs/xfs_trans.h | 29 +
> fs/xfs/xfs_trans_bmap.c | 249 ++++++
> fs/xfs/xfs_trans_refcount.c | 264 ++++++
> fs/xfs/xfs_trans_rmap.c | 9 +
> include/linux/falloc.h | 3 +-
> include/uapi/linux/falloc.h | 18 +
> include/uapi/linux/fs.h | 4 +-
> 76 files changed, 10683 insertions(+), 413 deletions(-)
> create mode 100644 fs/xfs/libxfs/xfs_refcount.c
> create mode 100644 fs/xfs/libxfs/xfs_refcount.h
> create mode 100644 fs/xfs/libxfs/xfs_refcount_btree.c
> create mode 100644 fs/xfs/libxfs/xfs_refcount_btree.h
> create mode 100644 fs/xfs/xfs_bmap_item.c
> create mode 100644 fs/xfs/xfs_bmap_item.h
> create mode 100644 fs/xfs/xfs_refcount_item.c
> create mode 100644 fs/xfs/xfs_refcount_item.h
> create mode 100644 fs/xfs/xfs_reflink.c
> create mode 100644 fs/xfs/xfs_reflink.h
> create mode 100644 fs/xfs/xfs_trans_bmap.c
> create mode 100644 fs/xfs/xfs_trans_refcount.c
> --
> Dave Chinner
> david@xxxxxxxxxxxxx