Re: [PATCH] percpu_counter: return precise count from __percpu_counter_compare()

From: Waiman Long
Date: Thu Oct 08 2015 - 12:06:55 EST


On 10/07/2015 09:02 PM, Dave Chinner wrote:
On Wed, Oct 07, 2015 at 04:20:10PM -0700, Tejun Heo wrote:
Hello, Dave.

On Thu, Oct 08, 2015 at 10:04:42AM +1100, Dave Chinner wrote:
...
As it is, the update race you pointed out is easy to solve with
__this_cpu_cmpxchg rather than _this_cpu_sub (similar to mod_state()
in the MM percpu counter stats code, perhaps).
percpu cmpxchg is no different from sub or any other operations
regarding cross-CPU synchronization. They're safe iff the operations
are on the local CPU. They have to be made atomics if they need to be
manipulated from remote CPUs.
Again, another trivially solvable problem, but still irrelevant
because we don't have the data that tells us whether changing the
counter behaviour solves the problem....

That said, while we can't manipulate the percpu counters directly, we
can add a separate global counter to cache sum result from the
previous run which gets automatically invalidated when any percpu
counter overflows.

That should give better and in case of
back-to-back invocations pretty good precision compared to just
returning the global overflow counter. Interface-wise, that'd be a
lot easier to deal with although I have no idea whether it'd fit this
particular use case or whether this use case even exists.
No, it doesn't help - it's effectively what Waiman's original patch
did by returning the count from the initial comparison and using
that for ENOSPC detection instead of doing a second comparison...

FWIW, XFS has done an expensive per-cpu counter sum in this ENOSPC
situation since 2006, but in 2007 ENOSPC was wrapped in a mutex to
prevent spinlock contention on the aggregated global counter:

commit 20b642858b6bb413976ff13ae6a35cc596967bab
Author: David Chinner<dgc@xxxxxxx>
Date: Sat Feb 10 18:35:09 2007 +1100

[XFS] Reduction global superblock lock contention near ENOSPC.

The existing per-cpu superblock counter code uses the global superblock
spin lock when we approach ENOSPC for global synchronisation. On larger
machines than this code was originally tested on this can still get
catastrophic spinlock contention due increasing rebalance frequency near
ENOSPC.

By introducing a sleeping lock that is used to serialise balances and
modifications near ENOSPC we prevent contention from needlessly from
wasting the CPU time of potentially hundreds of CPUs.

To reduce the number of balances occuring, we separate the need rebalance
case from the slow allocate case. Now, a counter running dry will trigger
a rebalance during which counters are disabled. Any thread that sees a
disabled counter enters a different path where it waits on the new mutex.
When it gets the new mutex, it checks if the counter is disabled. If the
counter is disabled, then we _know_ that we have to use the global counter
and lock and it is safe to do so immediately. Otherwise, we drop the mutex
and go back to trying the per-cpu counters which we know were re-enabled.

SGI-PV: 952227
SGI-Modid: xfs-linux-melb:xfs-kern:27612a

Signed-off-by: David Chinner<dgc@xxxxxxx>
Signed-off-by: Lachlan McIlroy<lachlan@xxxxxxx>
Signed-off-by: Tim Shimmin<tes@xxxxxxx>

This is effectively the same symptoms that what we are seeing with
the new "lockless" generic percpu counteri algorithm, which is why
I'm trying to find out if it an issue with the counter
implementation before I do anything else...

FWIW, the first comparison doesn't need to be that precise as it
just changes the batch passed to percpu_counter_add() to get the
value folded back into the global counter immediately near ENOSPC.
This is done so percpu_counter_read() becomes more accurate as
ENOSPC is approached as that is used for monitoring and reporting
(e.g. via vfsstat). If we want to avoid a counter sum, then this
is the comparison we will need to modify in XFS.

That is what I have advocated in the in the inlined patch that I sent you in a previous mail. That patch modified the first comparison, but leave the 2nd comparison intact. We will still see bad performance near ENOSPC, but it will be better than before.

However, the second comparison needs to be precise as that's the one
that does the ENOSPC detection. That sum needs to be done after the
counter add that "uses" the space and so there is no avoiding having
an expensive counter sum as we near ENOSPC....

Cheers,

Dave.

Cheers,
Longman
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/