Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)
From: Dave Chinner
Date: Sun Oct 08 2023 - 20:16:11 EST
On Thu, Oct 05, 2023 at 10:35:33PM -0700, Hugh Dickins wrote:
> On Thu, 5 Oct 2023, Dave Chinner wrote:
> > On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> > > Percpu counter's compare and add are separate functions: without locking
> > > around them (which would defeat their purpose), it has been possible to
> > > overflow the intended limit. Imagine all the other CPUs fallocating
> > > tmpfs huge pages to the limit, in between this CPU's compare and its add.
> > >
> > > I have not seen reports of that happening; but tmpfs's recent addition
> > > of dquot_alloc_block_nodirty() in between the compare and the add makes
> > > it even more likely, and I'd be uncomfortable to leave it unfixed.
> > >
> > > Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
> > >
> > > I believe this implementation is correct, and slightly more efficient
> > > than the combination of compare and add (taking the lock once rather
> > > than twice when nearing full - the last 128MiB of a tmpfs volume on a
> > > machine with 128 CPUs and 4KiB pages); but it does beg for a better
> > > design - when nearing full, there is no new batching, but the costly
> > > percpu counter sum across CPUs still has to be done, while locked.
> > >
> > > Follow __percpu_counter_sum()'s example, including cpu_dying_mask as
> > > well as cpu_online_mask: but shouldn't __percpu_counter_compare() and
> > > __percpu_counter_limited_add() then be adding a num_dying_cpus() to
> > > num_online_cpus(), when they calculate the maximum which could be held
> > > across CPUs? But the times when it matters would be vanishingly rare.
> > >
> > > Signed-off-by: Hugh Dickins <hughd@xxxxxxxxxx>
> > > Cc: Tim Chen <tim.c.chen@xxxxxxxxx>
> > > Cc: Dave Chinner <dchinner@xxxxxxxxxx>
> > > Cc: Darrick J. Wong <djwong@xxxxxxxxxx>
> > > ---
> > > Tim, Dave, Darrick: I didn't want to waste your time on patches 1-7,
> > > which are just internal to shmem, and do not affect this patch (which
> > > applies to v6.6-rc and linux-next as is): but want to run this by you.
> >
> > Hmmmm. IIUC, this only works for addition that approaches the limit
> > from below?
>
> That's certainly how I was thinking about it, and what I need for tmpfs.
> Precisely what its limitations (haha) are, I'll have to take care to
> spell out.
>
> (IIRC - it's a while since I wrote it - it can be used for subtraction,
> but goes the very slow way when it could go the fast way - uncompared
> percpu_counter_sub() is much better for that. You might be proposing that
> a tweak could adjust it to going the fast way when coming down from the
> "limit", but going the slow way as it approaches 0 - that would be neat,
> but I've not yet looked into whether it's feasibly done.)
>
> >
> > So if we are approaching the limit from above (i.e. add of a
> > negative amount, limit is zero) then this code doesn't work the same
> > as the open-coded compare+add operation would?
>
> To it and to me, a limit of 0 means nothing positive can be added
> (and it immediately returns false for that case); and adding anything
> negative would be an error since the positive would not have been allowed.
>
> Would a negative limit have any use?
I don't have any use for it, but the XFS case is decrementing free
space to determine if ENOSPC has been hit. It's the opposite
implementation to shmem, which increments used space to determine if
ENOSPC is hit.
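i.e. something like this, in sketch form (not the real code on either
side, just the two shapes; percpu_counter_limited_add() is the helper
proposed in this patch, the other names are made up):

#include <linux/percpu_counter.h>

/*
 * shmem shape: count used blocks upward, refuse to cross @limit.
 * The proposed helper does the compare and the add under one lock
 * when nearing the limit, so concurrent adders cannot overshoot.
 */
static bool shmem_shape_alloc(struct percpu_counter *used, s64 limit,
			      s64 blocks)
{
	return percpu_counter_limited_add(used, limit, blocks);
}

/*
 * XFS shape: count free blocks downward, with zero as the floor
 * rather than @limit as the ceiling.  Open-coded compare+add like
 * this has the same race as the upward case described in the patch.
 */
static bool xfs_shape_alloc(struct percpu_counter *freecount, s64 blocks)
{
	if (percpu_counter_compare(freecount, blocks) < 0)
		return false;			/* not enough free space */
	percpu_counter_sub(freecount, blocks);	/* can race past zero */
	return true;
}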
> It's definitely not allowing all the possibilities that you could arrange
> with a separate compare and add; whether it's ruling out some useful
> possibilities to which it can easily be generalized, I'm not sure.
>
> Well worth a look - but it'll be easier for me to break it than get
> it right, so I might just stick to adding some comments.
>
> I might find that actually I prefer your way round: getting slower
> as it approaches 0, without any need for specifying a limit?? That the
> tmpfs case pushed it in this direction, when it's better reversed? Or
> that might be an embarrassing delusion which I'll regret having mentioned.
I think there are cases for both approaching an upper limit from
below and approaching a lower limit from above. Both are the same
"compare and add" algorithm, just with minor logic differences...
> > Hence I think this looks like a "add if result is less than"
> > operation, which is distinct from then "add if result is greater
> > than" operation that we use this same pattern for in XFS and ext4.
> > Perhaps a better name is in order?
>
> The name still seems good to me, but a comment above it on its
> assumptions/limitations well worth adding.
>
> I didn't find a percpu_counter_compare() in ext4, and haven't got
Go search for EXT4_FREECLUSTERS_WATERMARK....
> far yet with understanding the XFS ones: tomorrow...
XFS detects being near ENOSPC and changes the batch update size so
that, near the ENOSPC threshold, the percpu counter always aggregates
to the global sum on every modification. i.e. it becomes more
accurate (but slower) near ENOSPC. Then if the result of the
subtraction ends up being less than zero, it takes a lock (i.e. goes
even slower!), undoes the subtraction that took it below zero, and
determines whether it can dip into the reserve pool or whether ENOSPC
should be reported.
Some of that could be optimised, but we need that external "lock and
undo" mechanism to manage the reserve pool space atomically at
ENOSPC...
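In sketch form it looks something like this - heavily simplified, not
the actual XFS code, and the names and the reserve pool handling are
placeholders:

static int xfs_shape_dec_free(struct percpu_counter *freecount,
			      s64 lowspace, s64 delta,
			      spinlock_t *reserve_lock)
{
	s32 batch = percpu_counter_batch;

	/*
	 * Near ENOSPC, drop the batch to 1 so every modification
	 * folds the per-cpu deltas into the global count: more
	 * accurate, but slower, exactly when accuracy matters.
	 */
	if (__percpu_counter_compare(freecount, lowspace,
				     percpu_counter_batch) < 0)
		batch = 1;

	percpu_counter_add_batch(freecount, -delta, batch);
	if (__percpu_counter_compare(freecount, 0, batch) >= 0)
		return 0;		/* common case: still >= 0 */

	/*
	 * Went below zero: take the external lock, undo the
	 * subtraction, and decide whether the reserve pool can
	 * cover it (elided here).
	 */
	spin_lock(reserve_lock);
	percpu_counter_add(freecount, delta);
	/* ... dip into the reserve pool, or fail ... */
	spin_unlock(reserve_lock);
	return -ENOSPC;
}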
Cheers,
Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx