Re: [PATCH 23/32] Generic dynamic per cpu refcounting
From: Kent Overstreet
Date: Mon Jan 07 2013 - 18:47:29 EST
On Thu, Jan 03, 2013 at 02:48:39PM -0800, Andrew Morton wrote:
> On Wed, 26 Dec 2012 18:00:02 -0800
> Kent Overstreet <koverstreet@xxxxxxxxxx> wrote:
> > This implements a refcount with similar semantics to
> > atomic_get()/atomic_dec_and_test(), that starts out as just an atomic_t
> > but dynamically switches to per cpu refcounting when the rate of
> > gets/puts becomes too high.
> > It also implements two stage shutdown, as we need it to tear down the
> > percpu counts. Before dropping the initial refcount, you must call
> > percpu_ref_kill(); this puts the refcount in "shutting down mode" and
> > switches back to a single atomic refcount with the appropriate barriers
> > (synchronize_rcu()).
> > It's also legal to call percpu_ref_kill() multiple times - it only
> > returns true once, so callers don't have to reimplement shutdown
> > synchronization.
> > For the sake of simplicity/efficiency, the heuristic is pretty simple -
> > it just switches to percpu refcounting if there are more than x gets
> > in one second (completely arbitrarily, 4096).
> > It'd be more correct to count the number of cache misses or something
> > else more profile driven, but doing so would require accessing the
> > shared ref twice per get - by just counting the number of gets(), we can
> > stick that counter in the high bits of the refcount and increment both
> > with a single atomic64_add(). But I expect this'll be good enough in
> > practice.
> I still don't "get" why this code exists. It is spectacularly,
> stunningly undocumented and if someone were to ask me "under what
> circumstances should I use percpu-refcount", I would not be able to
> help them.
Yeah... that was unfinished. Here's a patch on top of your git tree with
the missing documentation:
Author: Kent Overstreet <koverstreet@xxxxxxxxxx>
Date: Mon Jan 7 15:05:58 2013 -0800
Signed-off-by: Kent Overstreet <koverstreet@xxxxxxxxxx>
diff --git a/include/linux/percpu-refcount.h b/include/linux/percpu-refcount.h
index 1268010..1654a5b 100644
@@ -1,3 +1,71 @@
+ * Dynamic percpu refcounts:
+ * (C) 2012 Google, Inc.
+ * Author: Kent Overstreet <koverstreet@xxxxxxxxxx>
+ * This implements a refcount with similar semantics to atomic_t - atomic_inc(),
+ * atomic_dec_and_test() - but potentially percpu.
+ * There's one important difference between percpu refs and normal atomic_t
+ * refcounts; you have to keep track of your initial refcount, and then when you
+ * start shutting down you call percpu_ref_kill() _before_ dropping the initial
+ * refcount.
+ * Before you call percpu_ref_kill(), percpu_ref_put() does not check for the
+ * refcount hitting 0 - it can't, if it was in percpu mode. percpu_ref_kill()
+ * puts the ref back in single atomic_t mode, collecting the per cpu refs and
+ * issuing the appropriate barriers, and then marks the ref as shutting down so
+ * that percpu_ref_put() will check for the ref hitting 0. After it returns,
+ * it's safe to drop the initial ref.
+ * BACKGROUND:
+ * Percpu refcounts are quite useful for performance, but if we blindly
+ * converted all refcounts to percpu counters we'd waste quite a bit of memory
+ * think about all the refcounts embedded in kobjects, files, etc. most of which
+ * aren't used much.
+ * These start out as simple atomic counters - a little bigger than a bare
+ * atomic_t, 16 bytes instead of 4 - but if we exceed some arbitrary number of
+ * gets in one second, we then switch to percpu counters.
+ * This heuristic isn't perfect because it'll fire if the refcount was only
+ * being used on one cpu; ideally we'd be able to count the number of cache
+ * misses on percpu_ref_get() or something similar, but that'd make the non
+ * percpu path significantly heavier/more complex. We can count the number of
+ * gets() without any extra atomic instructions, on arches that support
+ * atomic64_t - simply by changing the atomic_inc() to atomic_add_return().
+ * USAGE:
+ * See fs/aio.c for some example usage; it's used there for struct kioctx, which
+ * is created when userspaces calls io_setup(), and destroyed when userspace
+ * calls io_destroy() or the process exits.
+ * In the aio code, kill_ioctx() is called when we wish to destroy a kioctx; it
+ * calls percpu_ref_kill(), then hlist_del_rcu() and sychronize_rcu() to remove
+ * the kioctx from the proccess's list of kioctxs - after that, there can't be
+ * any new users of the kioctx (from lookup_ioctx()) and it's then safe to drop
+ * the initial ref with percpu_ref_put().
+ * Code that does a two stage shutdown like this often needs some kind of
+ * explicit synchronization to ensure the initial refcount can only be dropped
+ * once - percpu_ref_kill() does this for you, it returns true once and false if
+ * someone else already called it. The aio code uses it this way, but it's not
+ * necessary if the code has some other mechanism to synchronize teardown.
+ * As mentioned previously, we decide when to convert a ref to percpu counters
+ * in percpu_ref_get(). However, since percpu_ref_get() will often be called
+ * with rcu_read_lock() held, it's not done there - percpu_ref_get() returns
+ * true if the ref should be converted to percpu counters.
+ * The caller should then call percpu_ref_alloc() after dropping
+ * rcu_read_lock(); if there is an uncommonly used codepath where it's
+ * inconvenient to call percpu_ref_alloc() after get(), it may be safely skipped
+ * and percpu_ref_get() will return true again the next time the counter wraps
+ * around.
@@ -16,11 +84,29 @@ int percpu_ref_put(struct percpu_ref *ref);
int percpu_ref_kill(struct percpu_ref *ref);
int percpu_ref_dead(struct percpu_ref *ref);
+ * percpu_ref_get - increment a dynamic percpu refcount
+ * Increments @ref and possibly converts it to percpu counters. Must be called
+ * with rcu_read_lock() held, and may potentially drop/reacquire rcu_read_lock()
+ * to allocate percpu counters - if sleeping/allocation isn't safe for some
+ * other reason (e.g. a spinlock), see percpu_ref_get_noalloc().
+ * Analagous to atomic_inc().
static inline void percpu_ref_get(struct percpu_ref *ref)
+ * percpu_ref_get_noalloc - increment a dynamic percpu refcount
+ * Increments @ref, to be used when it's not safe to allocate percpu counters.
+ * Must be called with rcu_read_lock() held.
+ * Analagous to atomic_inc().
static inline void percpu_ref_get_noalloc(struct percpu_ref *ref)
diff --git a/lib/percpu-refcount.c b/lib/percpu-refcount.c
index e2d8d12..e018f01 100644
@@ -5,6 +5,51 @@
+ * A percpu refcount can be in 4 different modes. The state is tracked in the
+ * low two bits of percpu_ref->pcpu_count:
+ * PCPU_REF_NONE - the initial state, no percpu counters allocated.
+ * PCPU_REF_PTR - using percpu counters for the refcount.
+ * PCPU_REF_DYING - we're shutting down so get()/put() should use the embedded
+ * atomic counter, but we're not finished updating the atomic counter from the
+ * percpu counters - this means that percpu_ref_put() can't check for the ref
+ * hitting 0 yet.
+ * PCPU_REF_DEAD - we've finished the teardown sequence, percpu_ref_put() should
+ * now check for the ref hitting 0.
+ * In PCPU_REF_NONE mode, we need to count the number of times percpu_ref_get()
+ * is called; this is done with the high bits of the raw atomic counter. We also
+ * track the time, in jiffies, when the get count last wrapped - this is done
+ * with the remaining bits of percpu_ref->percpu_count.
+ * So, when percpu_ref_get() is called it increments the get count and checks if
+ * it wrapped; if it did, it checks if the last time it wrapped was less than
+ * one second ago; if so, we want to allocate percpu counters.
+ * PCPU_COUNT_BITS determines the threshold where we convert to percpu: of the
+ * raw 64 bit counter, we use PCPU_COUNT_BITS for the refcount, and the
+ * remaining (high) bits to count the number of times percpu_ref_get() has been
+ * called. It's currently (completely arbitrarily) 16384 times in one second.
+ * Percpu mode (PCPU_REF_PTR):
+ * In percpu mode all we do on get and put is increment or decrement the cpu
+ * local counter, which is a 32 bit unsigned int.
+ * Note that all the gets() could be happening on one cpu, and all the puts() on
+ * another - the individual cpu counters can wrap (potentially many times).
+ * But this is fine because we don't need to check for the ref hitting 0 in
+ * percpu mode; before we set the state to PCPU_REF_DEAD we simply sum up all
+ * the percpu counters and add them to the atomic counter. Since addition and
+ * subtraction in modular arithmatic is still associative, the result will be
+ * correct.
#define PCPU_COUNT_BITS 50
#define PCPU_COUNT_MASK ((1LL << PCPU_COUNT_BITS) - 1)
@@ -18,6 +63,12 @@
#define REF_STATUS(count) ((unsigned long) count & PCPU_STATUS_MASK)
+ * percpu_ref_init - initialize a dynamic percpu refcount
+ * Initializes the refcount in single atomic counter mode with a refcount of 1;
+ * analagous to atomic_set(ref, 1).
void percpu_ref_init(struct percpu_ref *ref)
unsigned long now = jiffies;
@@ -78,6 +129,13 @@ void __percpu_ref_get(struct percpu_ref *ref, bool alloc)
+ * percpu_ref_put - decrement a dynamic percpu refcount
+ * Returns true if the result is 0, otherwise false; only checks for the ref
+ * hitting 0 after percpu_ref_kill() has been called. Analagous to
+ * atomic_dec_and_test().
int percpu_ref_put(struct percpu_ref *ref)
unsigned __percpu *pcpu_count;
@@ -109,6 +167,17 @@ int percpu_ref_put(struct percpu_ref *ref)
+ * percpu_ref_kill - prepare a dynamic percpu refcount for teardown
+ * Must be called before dropping the initial ref, so that percpu_ref_put()
+ * knows to check for the refcount hitting 0. If the refcount was in percpu
+ * mode, converts it back to single atomic counter mode.
+ * Returns true the first time called on @ref and false if @ref is already
+ * shutting down, so it may be used by the caller for synchronizing other parts
+ * of a two stage shutdown.
int percpu_ref_kill(struct percpu_ref *ref)
unsigned __percpu *old, *new, *pcpu_count = ref->pcpu_count;
@@ -156,6 +225,11 @@ int percpu_ref_kill(struct percpu_ref *ref)
+ * percpu_ref_dead - check if a dynamic percpu refcount is shutting down
+ * Returns true if percpu_ref_kill() has been called on @ref, false otherwise.
int percpu_ref_dead(struct percpu_ref *ref)
unsigned status = REF_STATUS(ref->pcpu_count);
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/