Re: [PATCH] netfilter: use per-cpu spinlock rather than RCU (v3)

From: Paul E. McKenney
Date: Thu Apr 16 2009 - 21:28:33 EST


On Thu, Apr 16, 2009 at 04:49:55PM -0700, Paul E. McKenney wrote:
> On Thu, Apr 16, 2009 at 03:33:54PM -0700, David Miller wrote:
> > From: Patrick McHardy <kaber@xxxxxxxxx>
> > Date: Thu, 16 Apr 2009 15:11:31 +0200
> >
> > > Linus Torvalds wrote:
> > >> On Wed, 15 Apr 2009, Stephen Hemminger wrote:
> > >>> The counters are the bigger problem, otherwise we could just free
> > >>> table info via rcu. Do we really have to support replace where the
> > >>> counter values coming out to user space are always exactly accurate,
> > >>> or is it allowed to replace a rule and maybe lose some counter ticks
> > >>> (worst case NCPU-1)?
> > >> Why not just read the counters from the old one at RCU free time (they
> > >> are guaranteed to be stable at that point, since we're all done with
> > >> those entries), and apply them at that point to the current setup?
> > >
> > > We need the counters immediately to copy them to userspace, so waiting
> > > for an asynchronous RCU free is not going to work.
> >
> > It just occurred to me that since all netfilter packet handling
> > goes through one place, we could have a sort-of "netfilter RCU"
> > to solve this problem.
>
> OK, I am putting one together...
>
> It will be needed sooner or later, though I suspect per-CPU locking
> would work fine in this case.

And here is a crude first cut. Untested, probably does not even compile.

Straight conversion of Mathieu Desnoyers's user-space RCU implementation
at git://lttng.org/userspace-rcu.git to the kernel (and yes, I did help
a little, but he must bear the bulk of the guilt). I picked on srcu.h
and srcu.c out of sheer laziness. User-space testing gives deep
sub-microsecond grace-period latencies, so it should be fast enough, at
least if you don't mind two smp_call_function() invocations per grace
period and spinning on each instance of a per-CPU variable.
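
To make the intended usage concrete, here is a sketch of how a caller
would pair the primitives. Everything in it other than
rcu_read_lock_fgp(), rcu_read_unlock_fgp(), and synchronize_rcu_fgp()
(the table type, cur_table, my_process()) is invented for illustration:

struct my_table *cur_table;     /* hypothetical shared table pointer */

void reader_path(struct sk_buff *skb)
{
        struct my_table *t;

        rcu_read_lock_fgp();
        t = rcu_dereference(cur_table);
        my_process(t, skb);     /* hypothetical per-packet work */
        rcu_read_unlock_fgp();
}

void replace_table(struct my_table *newt)
{
        struct my_table *oldt = cur_table;

        rcu_assign_pointer(cur_table, newt);
        synchronize_rcu_fgp();  /* wait for pre-existing readers */
        /*
         * oldt is now quiescent: its counters are stable and can be
         * copied to user space immediately, then oldt freed.
         */
}

The point being that the counters become stable synchronously, rather
than in an asynchronous RCU callback, which was Patrick's objection
above.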

Again, I believe per-CPU locking should work fine for the netfilter
counters, but I guess "friends don't let friends use hashed locks".
(I would not know for sure, never having used them myself, except of
course to protect hash tables.)
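
(For reference, what I mean by per-CPU locking is something like the
following sketch, all names invented here: the packet path takes only
its own CPU's lock, so it stays cheap and uncontended, while the
counter-snapshot path takes each CPU's lock in turn to get exact values.)

static DEFINE_PER_CPU(spinlock_t, counter_lock); /* each spin_lock_init()ed at boot, not shown */

void count_packet(void)                 /* per-packet fast path */
{
        spinlock_t *lock;

        local_bh_disable();
        lock = &__get_cpu_var(counter_lock);
        spin_lock(lock);
        /* ... bump this CPU's counters ... */
        spin_unlock(lock);
        local_bh_enable();
}

void snapshot_counters(void)            /* slow path for the user-space copy */
{
        int cpu;

        for_each_possible_cpu(cpu) {
                spin_lock_bh(&per_cpu(counter_lock, cpu));
                /* ... fold that CPU's counters into the snapshot ... */
                spin_unlock_bh(&per_cpu(counter_lock, cpu));
        }
}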

Most definitely -not- for inclusion at this point. Next step is to hack
up the relevant rcutorture code and watch it explode on contact. ;-)

Signed-off-by: Paul E. McKenney <paulmck@xxxxxxxxxxxxxxxxxx>
---

include/linux/srcu.h |   30 ++++++++++++++++++++++++
kernel/srcu.c        |   63 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 93 insertions(+)

diff --git a/include/linux/srcu.h b/include/linux/srcu.h
index aca0eee..4577cd0 100644
--- a/include/linux/srcu.h
+++ b/include/linux/srcu.h
@@ -50,4 +50,34 @@ void srcu_read_unlock(struct srcu_struct *sp, int idx) __releases(sp);
void synchronize_srcu(struct srcu_struct *sp);
long srcu_batches_completed(struct srcu_struct *sp);

+/* Single bit for grace-period index, low-order bits are nesting counter. */
+#define RCU_FGP_COUNT 1UL
+#define RCU_FGP_PARITY (1UL << (sizeof(long) << 2))
+#define RCU_FGP_NEST_MASK (RCU_FGP_PARITY - 1)
+
+extern long rcu_fgp_ctr;
+DECLARE_PER_CPU(long, rcu_fgp_active_readers);
+
+static inline void rcu_read_lock_fgp(void)
+{
+        long tmp;
+        long *uarp;
+
+        preempt_disable();
+        uarp = &__get_cpu_var(rcu_fgp_active_readers);
+        tmp = *uarp;
+        if (likely(!(tmp & RCU_FGP_NEST_MASK)))
+                *uarp = rcu_fgp_ctr;            /* Outermost rcu_read_lock_fgp(). */
+        else
+                *uarp = tmp + RCU_FGP_COUNT;    /* Nested rcu_read_lock_fgp(). */
+        barrier();
+}
+
+static inline void rcu_read_unlock_fgp(void)
+{
+        barrier();
+        __get_cpu_var(rcu_fgp_active_readers)--;
+        preempt_enable();
+}
+
#endif
diff --git a/kernel/srcu.c b/kernel/srcu.c
index b0aeeaf..a5dc464 100644
--- a/kernel/srcu.c
+++ b/kernel/srcu.c
@@ -255,3 +255,66 @@ EXPORT_SYMBOL_GPL(srcu_read_lock);
EXPORT_SYMBOL_GPL(srcu_read_unlock);
EXPORT_SYMBOL_GPL(synchronize_srcu);
EXPORT_SYMBOL_GPL(srcu_batches_completed);
+
+DEFINE_MUTEX(rcu_fgp_mutex);
+long rcu_fgp_ctr = RCU_FGP_COUNT;
+DEFINE_PER_CPU(long, rcu_fgp_active_readers);
+
+/*
+ * Determine if the specified counter value indicates that we need to
+ * wait on the corresponding CPU to exit an rcu_fgp read-side critical
+ * section. Return non-zero if so.
+ *
+ * Assumes that rcu_fgp_mutex is held, and thus that rcu_fgp_ctr is
+ * unchanging.
+ */
+static inline int rcu_old_fgp_ongoing(long *value)
+{
+        long v = ACCESS_ONCE(*value);
+
+        return (v & RCU_FGP_NEST_MASK) &&
+               ((v ^ rcu_fgp_ctr) & RCU_FGP_PARITY);
+}
+
+static void rcu_fgp_wait_for_quiescent_state(void)
+{
+        int cpu;
+        long *uarp;
+
+        for_each_online_cpu(cpu) {
+                uarp = &per_cpu(rcu_fgp_active_readers, cpu);
+                while (rcu_old_fgp_ongoing(uarp))
+                        cpu_relax();
+        }
+}
+
+static void rcu_fgp_do_mb(void *unused)
+{
+        smp_mb();       /* Each CPU does a memory barrier. */
+}
+
+void synchronize_rcu_fgp(void)
+{
+        mutex_lock(&rcu_fgp_mutex);
+
+        /* CPUs must see earlier change before parity flip. */
+        smp_call_function(rcu_fgp_do_mb, NULL, 1);
+
+        /*
+         * We must flip twice to correctly handle tasks that stall
+         * in rcu_read_lock_fgp() between the time that they fetch
+         * rcu_fgp_ctr and the time that they store to their CPU's
+         * rcu_fgp_active_readers. No matter when they resume
+         * execution, we will wait for them to get to the corresponding
+         * rcu_read_unlock_fgp().
+         */
+        ACCESS_ONCE(rcu_fgp_ctr) ^= RCU_FGP_PARITY;     /* flip parity 0 -> 1 */
+        rcu_fgp_wait_for_quiescent_state();             /* wait for old readers */
+        ACCESS_ONCE(rcu_fgp_ctr) ^= RCU_FGP_PARITY;     /* flip parity 1 -> 0 */
+        rcu_fgp_wait_for_quiescent_state();             /* wait for old readers */
+
+        /* Prevent CPUs from reordering out of prior RCU critical sections. */
+        smp_call_function(rcu_fgp_do_mb, NULL, 1);
+
+        mutex_unlock(&rcu_fgp_mutex);
+}
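
As a sanity check on the encoding above (my annotation, not part of the
patch): on a 64-bit build RCU_FGP_PARITY is 1UL << 32, so the low 32 bits
hold the nesting count and bit 32 holds the grace-period parity. The test
in rcu_old_fgp_ongoing() should therefore ignore idle CPUs and readers
that entered after the flip, and catch only readers still holding the old
parity. A quick user-space check of just that arithmetic:

#include <assert.h>
#include <stdio.h>

#define RCU_FGP_COUNT           1UL
#define RCU_FGP_PARITY          (1UL << (sizeof(long) << 2))
#define RCU_FGP_NEST_MASK       (RCU_FGP_PARITY - 1)

static long rcu_fgp_ctr = RCU_FGP_COUNT;        /* updater's current value */

static int ongoing(long v)      /* same test as rcu_old_fgp_ongoing() */
{
        return (v & RCU_FGP_NEST_MASK) &&
               ((v ^ rcu_fgp_ctr) & RCU_FGP_PARITY);
}

int main(void)
{
        assert(!ongoing(0));                            /* idle CPU: nest count zero */
        assert(!ongoing(rcu_fgp_ctr));                  /* reader with current parity */
        assert(ongoing(rcu_fgp_ctr ^ RCU_FGP_PARITY));  /* old-parity reader blocks GP */
        printf("encoding checks pass\n");
        return 0;
}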