[PATCH 0/2] xfs: make cluster size tunnable for sparse allocation

From: Tianxiang Peng
Date: Mon Dec 16 2024 - 08:06:26 EST


This patch series makes the inode cluster size a tunable parameter in
mkfs.xfs when sparse allocation is enabled, and also makes xfs use
the inode cluster size read from the superblock directly rather than
recalculating and verifying it itself.

Under extreme fragmentation, even sparse inode allocation may fail
with the current default inode cluster size, i.e. 8192 bytes. Such
situations may arise from PUNCH_HOLE fallocate calls, which are used
by some applications, for example MySQL InnoDB page compression. On
an xfs filesystem with a 4K block size, MySQL may write out a 16K
buffer with direct I/O (which immediately triggers block allocation)
and then try to compress the 16K buffer to <4K. If the compression
succeeds, MySQL will punch out the latter 12K, leaving only the first
4K allocated:
after write 16k buffer: OOOO
after punch latter 12K: OXXX
where O means a page with a block allocated and X means a page without.
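
The write-then-punch pattern above can be sketched in C as follows.
This is only an illustration, not MySQL's actual code: the names
hole_range, tail_hole() and punch_tail() are hypothetical, and the
sketch assumes the compressed payload is kept at the start of the page
and rounded up to whole blocks. The fallocate(2) call with
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE is the standard Linux
interface for punching a hole.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>

/* Hypothetical helper type: the byte range that can be punched out of
 * one page after its payload has been compressed. */
struct hole_range {
	uint64_t offset;	/* start of the hole within the file */
	uint64_t length;	/* 0 if nothing can be punched */
};

/* Compute the tail of a page that is no longer needed once the payload
 * shrinks to compressed_len bytes. The kept part is rounded up to a
 * whole number of filesystem blocks. */
static struct hole_range tail_hole(uint64_t page_off, uint64_t page_len,
				   uint64_t compressed_len,
				   uint64_t block_size)
{
	struct hole_range r;
	uint64_t kept;

	assert(block_size > 0 && compressed_len <= page_len);

	kept = (compressed_len + block_size - 1) / block_size * block_size;
	if (kept > page_len)
		kept = page_len;

	r.offset = page_off + kept;
	r.length = page_len - kept;
	return r;
}

/* Punch the range out of the file without changing the file size.
 * Returns 0 on success, -1 on error with errno set. */
static int punch_tail(int fd, struct hole_range r)
{
	if (r.length == 0)
		return 0;
	return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
			 (off_t)r.offset, (off_t)r.length);
}
```

For a 16K page at offset 0 that compresses to less than 4K on a 4K
block size filesystem, tail_hole() yields offset 4096 and length 12288,
which matches the OXXX layout above.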

This feature saves disk space (the 12K freed by punching can be used
by others), but it also makes the filesystem much more fragmented.
Since xfs has no automatic defragmentation mechanism, in the most
extreme cases only runs of 1-3 physically contiguous blocks will
eventually be available.

For data block allocation, such fragmentation is not a problem, as
physical contiguity is not always required. But inode chunk
allocation does require it. Even for sparse allocation, some amount
of physical contiguity must still be guaranteed. Currently this
amount is calculated from a scaled inode cluster size. In xfs, inodes
are manipulated (e.g. read in, logged, written back) in clusters, and
the size of such a cluster is the inode cluster size. The sparse
allocation unit is currently calculated from that:
(inode size / MIN_INODE_SIZE) * inode cluster size
-> sparse allocation alignment
-> sparse allocation unit
For example, under the default mkfs configuration (i.e. crc and sparse
allocation enabled, 4K block size), the inode size is 512 bytes (2
times MIN_INODE_SIZE = 256 bytes), so the sparse allocation unit will
be 2 * the current inode cluster size (8192 bytes) = 16384 bytes, that
is 4 blocks. As mentioned above, under extreme fragmentation the
filesystem may be full of runs of 1-3 physically contiguous blocks but
never have a run of 4, so even sparse allocation will fail. If we know
an application will easily create such fragmentation, then we had
better have a way to loosen the sparse allocation requirement manually.
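
The arithmetic in this example can be written out as a small C sketch.
The function names below are hypothetical and are not the helpers used
in the kernel or in mkfs; MIN_INODE_SIZE is the 256-byte value quoted
above, and the sketch assumes the resulting unit is a whole number of
blocks.

```c
#include <assert.h>
#include <stdint.h>

#define MIN_INODE_SIZE	256u	/* bytes, as quoted in the text above */

/* Sparse allocation unit in bytes:
 * (inode size / MIN_INODE_SIZE) * inode cluster size */
static uint32_t sparse_alloc_unit_bytes(uint32_t inode_size,
					uint32_t cluster_size)
{
	assert(inode_size >= MIN_INODE_SIZE);
	assert(inode_size % MIN_INODE_SIZE == 0);

	return (inode_size / MIN_INODE_SIZE) * cluster_size;
}

/* The same unit expressed in filesystem blocks. Assumption of this
 * sketch: the unit in bytes is a multiple of the block size. */
static uint32_t sparse_alloc_unit_blocks(uint32_t inode_size,
					 uint32_t cluster_size,
					 uint32_t block_size)
{
	assert(block_size > 0);

	return sparse_alloc_unit_bytes(inode_size, cluster_size) /
	       block_size;
}
```

With the defaults described above, sparse_alloc_unit_blocks(512, 8192,
4096) gives 4 blocks. By the same formula, a purely illustrative
cluster size of 2048 bytes would give a unit of a single block; which
values mkfs would actually accept is not covered by this sketch.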

This patch series achieves that by making the source of the sparse
allocation unit, the inode cluster size, a tunable parameter. When
sparse allocation is enabled, that size becomes tunable in mkfs. As
xfs itself currently recalculates and verifies the related value,
change xfs to use the value provided by the superblock directly.

Tianxiang Peng (2):
xfs: calculate cluster_size_raw from sb when sparse alloc enabled
mkfs: make cluster size tunnable when sparse alloc enabled

fs/xfs/libxfs/xfs_ialloc.c | 35 ++++++++++++++++++++++-------------
fs/xfs/xfs_mount.c | 12 ++++++------
mkfs/xfs_mkfs.c | 34 +++++++++++++++++++++++++++++-----
3 files changed, 57 insertions(+), 24 deletions(-)

--
2.43.5