Re: [PATCH v6 11/12] xfs: add xfs_compute_atomic_write_unit_max()

From: Dave Chinner
Date: Wed Apr 09 2025 - 01:30:53 EST


On Tue, Apr 08, 2025 at 05:41:56PM -0700, Darrick J. Wong wrote:
> On Wed, Apr 09, 2025 at 08:47:09AM +1000, Dave Chinner wrote:
> > On Tue, Apr 08, 2025 at 10:42:08AM +0000, John Garry wrote:
> > > Now that CoW-based atomic writes are supported, update the max size of an
> > > atomic write for the data device.
> > >
> > > The limit of a CoW-based atomic write is determined by the number of
> > > log items which can fit into a single transaction.
> >
> > I still think this is the wrong way to define the maximum
> > size of a COW-based atomic write because it is going to change from
> > filesystem to filesystem and that variability in supported maximum
> > length will be exposed to userspace...
> >
> > i.e. Maximum supported atomic write size really should be defined as
> > a well documented fixed size (e.g. 16MB). Then the transaction
> > reservations sizes needed to perform that conversion can be
> > calculated directly from that maximum size and optimised directly
> > for the conversion operation that atomic writes need to perform.
>
> I'll get to this below...

I'll paste it here so it's all in one context.

> This is why I don't agree with adding a static 16MB limit -- we clearly
> don't need it to emulate current hardware, which can commit up to 64k
> atomically. Future hardware can increase that by 64x and we'll still be
> ok with using the existing tr_write transaction type.
>
> By contrast, adding a 16MB limit would result in a much larger minimum
> log size. If we add that to struct xfs_trans_resv for all filesystems
> then we run the risk of some ancient filesystem with a 12M log
> suddenly failing to mount on a new kernel.
>
> I don't see the point.

You've got stuck on the example size of 16MB I gave, not on the
actual reason I gave that example.

That is: we should not be exposing an unpredictable,
filesystem-geometry-dependent maximum atomic write size to userspace
through the API. We need to define a maximum size that we support,
one that we can clearly document and guarantee will work on all
filesystem geometries.

What size that is needs to be discussed - all you've done is
demonstrate that 16MB would require a larger minimum log size for a
small set of historic/legacy filesystem configs.

I'm actually quite fine with adding new reservations that change
minimum required log sizes - this isn't a show-stopper in any way.

Atomic writes are optional functionality, so if the log size is too
small for the atomic write transaction requirements, then we don't
enable the atomic write functionality on that filesystem. Hence the
minimum required log size for the filesystem -does not change- and
the filesystem mounts as normal.

i.e. the user hasn't lost anything on these tiny log filesystems
when the kernel is upgraded to support software atomic writes.
All that happens is that the legacy filesystem will never support
atomic writes....
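
In mount-time pseudo-code, something like this (a sketch only - the
helper and flag names here are invented, not taken from this patch
set):

	/*
	 * Sketch: gate the optional feature on the existing log size
	 * instead of raising the filesystem's minimum log size.
	 */
	if (xfs_calc_atomic_write_reservation(mp) >
	    XFS_FSB_TO_B(mp, mp->m_sb.sb_logblocks))
		xfs_notice(mp,
	"log too small for atomic writes, not enabling them");
	else
		xfs_set_atomic_writes_enabled(mp);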

Such a mount time check allows us to design the atomic write
functionality around a fixed upper size bound that we can guarantee
will work correctly. i.e. the size of the transaction reservation
needed for it is irrelevant because we guarantee to only run that
transaction on filesystems with logs large enough to support it.

Having a consistent maximum atomic write size makes it easier for
userspace to create logic and algorithms around, and it makes it
much easier for us to do boundary condition testing as we don't have
to reverse engineer the expected boundaries from filesystem geometry
in test code. Fixed sizes make API verification and testing a whole
lot simpler.
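
e.g. with a fixed documented maximum, a test can assert the
advertised limit directly instead of recomputing it from geometry.
A userspace sketch using the statx atomic write fields (expected_max
stands in for whatever value we end up documenting):

	/*
	 * Needs _GNU_SOURCE, <assert.h>, <fcntl.h>, <sys/stat.h> and
	 * a kernel that supports STATX_WRITE_ATOMIC.
	 */
	static void check_awu_max(int fd, unsigned int expected_max)
	{
		struct statx stx = { };

		if (statx(fd, "", AT_EMPTY_PATH, STATX_WRITE_ATOMIC,
				&stx) == 0 &&
		    (stx.stx_attributes & STATX_ATTR_WRITE_ATOMIC))
			assert(stx.stx_atomic_write_unit_max ==
					expected_max);
	}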

And, finally, if everything is sized from a single API constant, it
makes it easy to modify the supported size in future. The 64MB
minimum log size gives us lots of headroom to increase supported
atomic write sizes, so if userspace finds that it really needs
larger sizes than what we initially support, it's trivial to change
both the kernel and the test code to support a larger size...
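
i.e. one documented constant that everything else derives from,
along the lines of:

	/* Illustrative name and value - the value is the bit that
	 * needs discussion. */
	#define XFS_ATOMIC_WRITE_MAX_BYTES	(16 * 1024 * 1024)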

> > > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > > index b2dd0c0bf509..42b2b7540507 100644
> > > --- a/fs/xfs/xfs_super.c
> > > +++ b/fs/xfs/xfs_super.c
> > > @@ -615,6 +615,28 @@ xfs_init_mount_workqueues(
> > >  	return -ENOMEM;
> > >  }
> > >
> > > +unsigned int
> > > +xfs_atomic_write_logitems(
> > > +	struct xfs_mount	*mp)
> > > +{
> > > +	unsigned int		efi = xfs_efi_item_overhead(1);
> > > +	unsigned int		rui = xfs_rui_item_overhead(1);
> > > +	unsigned int		cui = xfs_cui_item_overhead(1);
> > > +	unsigned int		bui = xfs_bui_item_overhead(1);
> > > +	unsigned int		logres = M_RES(mp)->tr_write.tr_logres;
> > > +
> > > +	/*
> > > +	 * Maximum overhead to complete an atomic write ioend in software:
> > > +	 * remove data fork extent + remove cow fork extent +
> > > +	 * map extent into data fork
> > > +	 */
> > > +	unsigned int		atomic_logitems =
> > > +		(bui + cui + rui + efi) + (cui + rui) + (bui + rui);
> >
> > This seems wrong. Unmap from the data fork only logs a (bui + cui)
> > pair, we don't log a RUI or an EFI until the transaction that
> > processes the BUI or CUI actually frees an extent from the BMBT
> > or removes a block from the refcount btree.
>
> This is tricky -- the first transaction in the chain creates a BUI and a
> CUI and that's all it needs.
>
> Then we roll to finish the BUI. The second transaction needs space for
> the BUD, an RUI, and enough space to relog the CUI (== CUI + CUD).
>
> Then we roll again to finish the RUI. This third transaction needs
> space for the RUD and space to relog the CUI.
>
> Roll again, fourth transaction needs space for the CUD and possibly a
> new EFI.
>
> Roll again, fifth transaction needs space for an EFD.

Yes, that is exactly the point I was making.
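
i.e. spelled out per transaction roll (intent item overheads only;
dquots and the per-step block reservations elided):

	tx 1:	bui + cui		start of the chain
	tx 2:	bud + rui + cui + cud	finish BUI, log RUI, relog CUI
	tx 3:	rud + cui + cud		finish RUI, relog CUI
	tx 4:	cud + efi		finish CUI, maybe log an EFI
	tx 5:	efd			finish EFI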

> > > +
> > > +	/* atomic write limits are always a power-of-2 */
> > > +	return rounddown_pow_of_two(logres / (2 * atomic_logitems));
> >
> > What is the magic 2 in that division?
>
> That's handwaving the lack of a computation involving
> xfs_allocfree_block_count. A better version would be to figure out the
> log space needed:
>
> /* Overhead to finish one step of each intent item type */
> const unsigned int f1 = libxfs_calc_finish_efi_reservation(mp, 1);
> const unsigned int f2 = libxfs_calc_finish_rui_reservation(mp, 1);

Yup, those should be a single call to xfs_calc_buf_res(xfs_allocfree_block_count())

> const unsigned int f3 = libxfs_calc_finish_cui_reservation(mp, 1);

Similarly, xfs_calc_refcountbt_reservation().

> const unsigned int f4 = libxfs_calc_finish_bui_reservation(mp, 1);

xfs_calc_write_reservation() but for a single extent instead of 2.
Also, I think the bui finish needs to take into account dquots, too.

> /* We only finish one item per transaction in a chain */
> const unsigned int step_size = max(f4, max3(f1, f2, f3));

And that is xfs_calc_itruncate_reservation(), except it reserves
space for 1 extent to be processed instead of 4.

FWIW, I'd suggest that these helpers make for a better way of
calculating high level reservations such as
xfs_calc_write_reservation() and xfs_calc_itruncate_reservation()
because those functions currently don't take into account how
intents are relogged during those operations.
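
With per_intent being the per-block intent item overhead (presumably
the efi/rui/cui/bui sizes from the patch hunk above), the
relationship your xfs_db hack below encodes is simply:

	logres     = max_blocks * per_intent + step_size
	max_blocks = (logres - step_size) / per_intent

where max_blocks is the atomic write size in filesystem blocks.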

> and then you have what you need to figure out the logres needed to
> support a given awu_max, or vice versa:
>
> 	if (desired_max) {
> 		dbprintf(
> 	"desired_max: %u\nstep_size: %u\nper_intent: %u\nlogres: %u\n",
> 			desired_max, step_size, per_intent,
> 			(desired_max * per_intent) + step_size);
> 	} else if (logres) {
> 		dbprintf(
> 	"logres: %u\nstep_size: %u\nper_intent: %u\nmax_awu: %u\n",
> 			logres, step_size, per_intent,
> 			logres >= step_size ? (logres - step_size) / per_intent : 0);
> 	}
>
> I hacked this into xfs_db so that I could compute a variety of
> scenarios. Let's pretend I have a 10T filesystem with 4k fsblocks and
> the default configuration.
>
> # xfs_db /dev/mapper/moo -c logres -c "untorn_max -b $(( (16 * 1048576) / 4096 ))" -c "untorn_max -l 244216"
> type 0 logres 244216 logcount 5 flags 0x4
> type 1 logres 428928 logcount 5 flags 0x4
> <snip>
> minlogsize logres 648576 logcount 8
>
> To emulate a 16MB untorn write, you'd need a logres of:
>
> desired_max: 4096
> step_size: 107520
> per_intent: 208
> logres: 959488
>
> 959488 > 648576, which would alter the minlogsize calculation. I know
> we banned tiny filesystems a few years ago, but there's still a risk in
> increasing it.

Yeah, it's a bit under a megabyte of reservation space. That's no
big deal, as log reservations get much larger than this as block
size and log stripe units are increased.
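
To spell the arithmetic out for the 16MB example above:

	logres = 4096 blocks * 208 + 107520 = 959488 bytes (~937KiB)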

<snip the rest, they all show roughly the same thing>

Ok, these numbers pretty much prove my point - that a fixed max
atomic write size somewhere around 16MB isn't going to be a problem
for any filesystem that uses the historic small fs default log size
(10MB) or larger.

Let's avoid the internal XFS minlogsize problems by disabling the
atomic write functionality on the increasingly rare legacy
filesystems where the log is too small. That allows us to design and
implement sane userspace API bounds and guarantees for atomic writes
and not expose internal filesystem constraints and inconsistencies
to applications...

-Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx