Re: [PATCH 8/8] shmem,percpu_counter: add _limited_add(fbc, limit, amount)

From: Mateusz Guzik
Date: Sat May 25 2024 - 02:00:42 EST


On Fri, Sep 29, 2023 at 08:42:45PM -0700, Hugh Dickins wrote:
> Percpu counter's compare and add are separate functions: without locking
> around them (which would defeat their purpose), it has been possible to
> overflow the intended limit. Imagine all the other CPUs fallocating
> tmpfs huge pages to the limit, in between this CPU's compare and its add.
>
> I have not seen reports of that happening; but tmpfs's recent addition
> of dquot_alloc_block_nodirty() in between the compare and the add makes
> it even more likely, and I'd be uncomfortable to leave it unfixed.
>
> Introduce percpu_counter_limited_add(fbc, limit, amount) to prevent it.
>

I think the posted (and by now landed) code is racy.

I had seen there was a follow up patch which further augmented the
routine, but it did not alter the issue below so I'm responding to this
thread.

> +/*
> + * Compare counter, and add amount if the total is within limit.
> + * Return true if amount was added, false if it would exceed limit.
> + */
> +bool __percpu_counter_limited_add(struct percpu_counter *fbc,
> + s64 limit, s64 amount, s32 batch)
> +{
> + s64 count;
> + s64 unknown;
> + unsigned long flags;
> + bool good;
> +
> + if (amount > limit)
> + return false;
> +
> + local_irq_save(flags);
> + unknown = batch * num_online_cpus();
> + count = __this_cpu_read(*fbc->counters);
> +
> + /* Skip taking the lock when safe */
> + if (abs(count + amount) <= batch &&
> + fbc->count + unknown <= limit) {
> + this_cpu_add(*fbc->counters, amount);
> + local_irq_restore(flags);
> + return true;
> + }
> +

Note the value of fbc->count is *not* stabilized.

> + raw_spin_lock(&fbc->lock);
> + count = fbc->count + amount;
> +
> + /* Skip percpu_counter_sum() when safe */
> + if (count + unknown > limit) {
> + s32 *pcount;
> + int cpu;
> +
> + for_each_cpu_or(cpu, cpu_online_mask, cpu_dying_mask) {
> + pcount = per_cpu_ptr(fbc->counters, cpu);
> + count += *pcount;
> + }
> + }
> +
> + good = count <= limit;
> + if (good) {
> + count = __this_cpu_read(*fbc->counters);
> + fbc->count += count + amount;
> + __this_cpu_sub(*fbc->counters, count);
> + }
> +
> + raw_spin_unlock(&fbc->lock);
> + local_irq_restore(flags);
> + return good;
> +}
> +

Consider 2 cpus executing the func at the same time, where one is
updating fbc->counter in the slow path while the other is testing it in
the fast path.

cpu0 cpu1
if (abs(count + amount) <= batch &&
fbc->count + unknown <= limit)
/* gets past the per-cpu traversal */
/* at this point cpu0 decided to bump fbc->count, while cpu1 decided to
* bump the local counter */
this_cpu_add(*fbc->counters, amount);
fbc->count += count + amount;

Suppose fbc->count update puts it close enough to the limit that an
addition from cpu1 would put the entire thing over said limit.

Since the 2 operations are not synchronized cpu1 can observe fbc->count
prior to the bump and update it's local counter, resulting in
aforementioned overflow.

Am I misreading something here? Is this prevented?

To my grep the only user is quotas in shmem and I wonder if that use is
even justified. I am aware of performance realities of atomics. However,
it very well may be that whatever codepaths which exercise the counter
are suffering multicore issues elsewhere, making an atomic (in a
dedicated cacheline) a perfectly sane choice for the foreseeable future.
Should this be true there would be 0 rush working out a fixed variant of
the routine.