Re: [PATCH v2 04/14] fs: xfs: Make file data allocations observe the 'forcealign' flag

From: John Garry
Date: Tue Mar 05 2024 - 10:24:04 EST


On 05/03/2024 00:44, Dave Chinner wrote:
On Mon, Mar 04, 2024 at 01:04:18PM +0000, John Garry wrote:
From: "Darrick J. Wong" <djwong@xxxxxxxxxx>

The existing extsize hint code already did the work of expanding file
range mapping requests so that the range is aligned to the hint value.
Now add the code we need to guarantee that the space allocations are
also always aligned.

XXX: still need to check all this with reflink

Signed-off-by: "Darrick J. Wong" <djwong@xxxxxxxxxx>
Co-developed-by: John Garry <john.g.garry@xxxxxxxxxx>
Signed-off-by: John Garry <john.g.garry@xxxxxxxxxx>
---
fs/xfs/libxfs/xfs_bmap.c | 22 +++++++++++++++++-----
fs/xfs/xfs_iomap.c | 4 +++-
2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
index 60d100134280..8dee60795cf4 100644
--- a/fs/xfs/libxfs/xfs_bmap.c
+++ b/fs/xfs/libxfs/xfs_bmap.c
@@ -3343,6 +3343,19 @@ xfs_bmap_compute_alignments(
align = xfs_get_cowextsz_hint(ap->ip);
else if (ap->datatype & XFS_ALLOC_USERDATA)
align = xfs_get_extsz_hint(ap->ip);
+
+ /*
+ * xfs_get_cowextsz_hint() returns extsz_hint for when forcealign is
+ * set as forcealign and cowextsz_hint are mutually exclusive
+ */
+ if (xfs_inode_forcealign(ap->ip) && align) {
+ args->alignment = align;
+ if (stripe_align % align)
+ stripe_align = align;

This kinda does the right thing, but it strikes me as the wrong
thing to be doing - it seems to conflates two different physical
alignment properties in potentially bad ways, and we should never
get this far into the allocator to discover these issues.

Stripe alignment is alignment to physical disks in a RAID array.
Inode forced alignment is file offset alignment to specificly
defined LBA address ranges. The stripe unit/width is set at mkfs
time, the inode extent size hint at runtime.

These can only work together in specific situations:

1. max sized RWF_ATOMIC IO must be the same or smaller size
than the stripe unit. Alternatively: stripe unit must be
an integer (power of 2?) multiple of max atomic IO size.

Sure, it would not make sense to have max sized RWF_ATOMIC IO > stripe alignment.


IOWs, stripe unit vs atomic IO constraints must be
resolved in a compatible manner at mkfs time. If they
can't be resolved (e.g. non-power of 2 stripe unit) then
atomic IO cannot be done on the filesystem at all. Stripe
unit constraints always win over atomic IO.

We can could do that. Indeed, I thought our xfsprogs was doing that, but we have had a few versions now for forcealign ...


2. min sized RWF_ATOMIC IO must be an integer divider of
stripe unit so that small atomic IOs are always correctly
aligned to the stripe unit.

That's a given from atomic write rules and point 1.


3. Inode forced alignment constraints set at runtime (i.e.
extsize hints) must fit within the above stripe unit vs
min/max atomic IO requirements.
> i.e. extent size hint will typically need to an integer
multiple of stripe unit size which will always be
compatible with stripe unit and atomic IO size and
alignments...

I think that any extsize hint for forcealign needs to comply with stripe unit, same as point 1.


IOWs, these are static geometry constraints and so should be checked
and rejected at the point where alignments are specified (i.e.
mkfs, mount and ioctls). Then the allocator can simply assume that
forced inode alignments are always stripe alignment compatible and
we don't need separate handling of two possibly incompatible
alignments.

ok, makes sense.

Please note in case missed, I am mandating extsize hint for forcealign needs to be a power-of-2. It just makes life easier for all the sub-extent zero'ing later on.

Also we need to enforce that the AG count to be compliant with the extsize hint for forcealign; but since the extsize hint for forcealign needs to be compliant with stripe unit, above, and stripe unit would be compliant wth AG count (right?), then this would be a given.


Regardless, I think the code as it stands won't work correctly when
a stripe unit is not set.

The correct functioning of this code appears to be dependent on
stripe_align being set >= the extent size hint. If stripe align is
not set, then what does (0 % align) return? If my checks are
correct, this will return 0,

Correct

and so the above code will end up with
stripe_align = 0.

Now, consider that args->alignment is used to signal to the
allocator what the -start alignment- of the allocation is supposed
to be, whilst args->prod/args->mod are used to tell the allocator
what the tail adjustment is supposed to be.

Stripe alignment wants start alignment, extent size hints want tail
adjustment and force aligned allocation wants both start alignment
and tail adjustment.

Right


However, the allocator overrides these depending on what it is
doing. e.g. exact block EOF allocation turns off start alignment but
has to ensure space is reserved for start alignment if it fails.
This reserves space for stripe_align in args->minalignslop, but if
force alignment has a start alignment requirement larger than stripe
unit alignment, this code fails to reserve the necessary alignment
slop for the force alignment code.

If we aren't doing exact block EOF allocation, then we do stripe
unit alignment. Again, if this fails and the forced alignment is
larger than stripe alignment, this code does not reserve space for
the larger alignment in args->minalignslop, so it will also do the
wrong when we fall back to an args->alignment > 1 allocation.

Failing to correctly account for minalignslop is known to cause
accounting problems when the AG is very near empty, and the usual
result of that is cancelling of a dirty allocation transaction and a
forced shutdown. So this is something we need to get right.

For sure


More below....

+ } else {
+ args->alignment = 1;
+ }

Just initialise the allocation args structure with a value of 1 like
we already do?

It was being done in this way to have just a single place where the value is initialised. It can easily be kept as is.


+
if (align) {
if (xfs_bmap_extsize_align(mp, &ap->got, &ap->prev, align, 0,
ap->eof, 0, ap->conv, &ap->offset,
@@ -3438,7 +3451,6 @@ xfs_bmap_exact_minlen_extent_alloc(
args.minlen = args.maxlen = ap->minlen;
args.total = ap->total;
- args.alignment = 1;
args.minalignslop = 0;

This likely breaks the debug code. This isn't intended for testing
atomic write capability, it's for exercising low space allocation
algorithms with worst case fragmentation patterns. This is only
called when error injection is turned on to trigger this path, so we
shouldn't need to support forced IO alignment here.

ok, it can be left untouched.


args.minleft = ap->minleft;
@@ -3484,6 +3496,7 @@ xfs_bmap_btalloc_at_eof(
{
struct xfs_mount *mp = args->mp;
struct xfs_perag *caller_pag = args->pag;
+ int orig_alignment = args->alignment;
int error;
/*
@@ -3558,10 +3571,10 @@ xfs_bmap_btalloc_at_eof(
/*
* Allocation failed, so turn return the allocation args to their
- * original non-aligned state so the caller can proceed on allocation
- * failure as if this function was never called.
+ * original state so the caller can proceed on allocation failure as
+ * if this function was never called.
*/
- args->alignment = 1;
+ args->alignment = orig_alignment;
return 0;
}

As I said above, we can't set an alignment of > 1 here if we haven't
accounted for that alignment in args->minalignslop above. This leads
to unexpected ENOSPC conditions and filesystem shutdowns.

I suspect what we need to do is get rid of the separate stripe_align
variable altogether and always just set args->alignment to what we
need the extent start alignment to be, regardless of whether it is
from stripe alignment or forced alignment.

ok, it sounds a bit simpler at least


Then the code in xfs_bmap_btalloc_at_eof() doesn't need to know what
'stripe_align' is - the exact EOF block allocation can simply save
and restore the args->alignment value and use it for minalignslop
calculations for the initial exact block allocation.

Then, if xfs_bmap_btalloc_at_eof() fails and xfs_inode_forcealign()
is true, we can abort allocation immediately, and not bother to fall
back on further aligned/unaligned attempts that will also fail or do
the wrong them.

ok


Similarly, if we aren't doing EOF allocation, having args->alignment
set means it will do the right thing for the first allocation
attempt. Again, if that fails, we can check if
xfs_inode_forcealign() is true and fail the aligned allocation
instead of running the low space algorithm. This now makes it clear
that we're failing the allocation because of the forced alignment
requirement, and now the low space allocation code can explicitly
turn off start alignment as it isn't required...

are you saying that low-space allocator can set args->alignment = 1 to be explicit?



diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 18c8f168b153..70fe873951f3 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -181,7 +181,9 @@ xfs_eof_alignment(
* If mounted with the "-o swalloc" option the alignment is
* increased from the strip unit size to the stripe width.
*/
- if (mp->m_swidth && xfs_has_swalloc(mp))
+ if (xfs_inode_forcealign(ip))

I actually thought that I dropped this chunk as it was causing alignment issues. I need to check that again.

+ align = xfs_get_extsz_hint(ip);
+ else if (mp->m_swidth && xfs_has_swalloc(mp))
align = mp->m_swidth;
else if (mp->m_dalign)
align = mp->m_dalign;

This doesn't work for files with a current size of less than
"align". The next line of code does:

if (align && XFS_ISIZE(ip) < XFS_FSB_TO_B(mp, align))
align = 0;

IOWs, the first aligned write allocation will occur with a file size
of zero, and that will result in this function returning 0. i.e.
speculative post EOF delalloc doesn't turn on and align the EOF
until writes up to offset == align have already been completed.

Maybe it was working as I have the change to keep EOF aligned for forcealign. I'll check that.


Essentially, this code wants to be:

if (xfs_inode_forcealign(ip))
return xfs_get_extsz_hint(ip);

So that any write to the a force aligned inode will always trigger
extent size hint EOF alignment.


ok, sounds reasonable.

Thanks,
John