Re: [PATCH] An RCU for SMP with a single CPU garbage collector
From: Paul E. McKenney
Date: Sun Mar 13 2011 - 01:09:53 EST
On Sat, Mar 12, 2011 at 08:25:04PM -0500, Joe Korty wrote:
> On Sat, Mar 12, 2011 at 09:36:53AM -0500, Paul E. McKenney wrote:
> > Hello, Joe,
> >
> > My biggest question is "what does JRCU do that Frederic's patchset
> > does not do?" I am not seeing it at the moment. Given that Frederic's
> > patchset integrates into RCU, thus providing the full RCU API, I really
> > need a good answer to consider JRCU.
>
> Well, it's tiny, it's fast, and it does exactly one thing
> and does that really well. If a user doesn't need that
> one thing they shouldn't use JRCU. But mostly it is an
> exciting thought-experiment on another interesting way to
> do RCU. Who knows, maybe it will end up being useful
> beyond what it was aimed at.
Fair enough! From my perspective, I am likely to learn something from
watching you work on it, and it is reasonably likely that you will
come up with something useful for the existing RCU implementations.
And who knows what you might come up with?
> > For one, what sort of validation have you done?
> >
> > Thanx, Paul
>
> Not much, I'm writing the code and sending it out for
> comment. And it is currently missing many of the tweaks
> needed to make it a production RCU.
Ah, OK. I strongly recommend rcutorture.
> >> +struct rcu_data {
> >> + u8 wait; /* goes false when this cpu consents to
> >> + * the retirement of the current batch */
> >> + u8 which; /* selects the current callback list */
> >> + struct rcu_list cblist[2]; /* current & previous callback lists */
> >> +} ____cacheline_aligned_in_smp;
> >> +
> >> +static struct rcu_data rcu_data[NR_CPUS];
> >
> > Why not DEFINE_PER_CPU(struct rcu_data, rcu_data)?
>
> All part of being lockless. I didn't want to have to tie
> into cpu onlining and offlining and wanted to eliminate
> sprinkling special tests and/or online locks throughout
> the code. Also, note the single for_each_present_cpu(cpu)
> statement in JRCU ... this loops over all offline cpus and
> gradually expires any residuals they have left behind.
OK, but the per-CPU variables do not come and go with the CPUs.
So I am still not seeing how the array is helping compared to
per-CPU variables.
> >> +/*
> >> + * Return our CPU id or zero if we are too early in the boot process to
> >> + * know what that is. For RCU to work correctly, a cpu named '0' must
> >> + * eventually be present (but need not ever be online).
> >> + */
> >> +static inline int rcu_cpu(void)
> >> +{
> >> + return current_thread_info()->cpu;
> >
> > OK, I'll bite... Why not smp_processor_id()?
>
> Until recently, it was :) but it was a multiline thing,
> with 'if' stmts and such, to handle early boot conditions
> when smp_processor_id() isn't valid.
>
> JRCU, perhaps quixotically, tries to do something
> meaningful all the way back to the first microsecond of
> existence, when the CPU is switched from 16 to 32 bit mode.
> In that early epoch, things like 'cpus' and 'interrupts'
> and 'tasks' don't quite yet exist in the form we are used
> to for them.
OK.
> > And what to do about the architectures that put the CPU number somewhere
> > else?
>
> I confess I keep forgetting to look at the other 21 or
> so architectures; I had thought they all had ->cpu.
> I'll look into it and, at least for those, reintroduce the
> old smp_processor_id() expression.
But those that have ->cpu can safely use smp_processor_id(), right?
> >> +void rcu_barrier(void)
> >> +{
> >> + struct rcu_synchronize rcu;
> >> +
> >> + if (!rcu_scheduler_active)
> >> + return;
> >> +
> >> + init_completion(&rcu.completion);
> >> + call_rcu(&rcu.head, wakeme_after_rcu);
> >> + wait_for_completion(&rcu.completion);
> >> + atomic_inc(&rcu_stats.nbarriers);
> >> +
> >> +}
> >> +EXPORT_SYMBOL_GPL(rcu_barrier);
> >
> > The rcu_barrier() function must wait on all RCU callbacks, regardless of
> > which CPU they are queued on. This is important when unloading modules
> > that use call_rcu(). In contrast, the above looks to me like it waits
> > only on the current CPU's callbacks.
>
> Oops. I'll come up with an alternate mechanism. Thanks for finding this.
NP. ;-)
> > So, what am I missing?
>
> Nothing. You were right :)
>
> >> + /*
> >> + * Swap current and previous lists. Other cpus must not see this
> >> + * out-of-order w.r.t. the just-completed plist init, hence the above
> >> + * smp_wmb().
> >> + */
> >> + rd->which++;
> >
> > You do seem to have interrupts disabled when sampling ->which, but
> > this is not safe for cross-CPU accesses to ->which, right? The other
> > CPU might queue onto the wrong element. This would mean that you
> > would not be guaranteed a full 50ms delay from quiescent state to
> > corresponding RCU callback invocation.
> >
> > Or am I missing something subtle here?
>
> JRCU expects updates to the old queue to continue for a
> while, it only requires that they end and a trailing wmb
> be fully executed before the next sampling period ends.
OK. I remain skeptical, but mainly because similar setups have proven
buggy in the past.
> >> + /*
> >> + * End the current RCU batch and start a new one.
> >> + */
> >> + for_each_present_cpu(cpu) {
> >> + rd = &rcu_data[cpu];
> >
> > And here we get the cross-CPU accesses that I was worried about above.
>
> Yep. This is one of the trio of reasons why JRCU is for
> small SMP systems. It's the tradeoff I made to move the
> entire RCU load off onto one CPU. If that is not important
> (and it won't be to any but specialized systems), one
> is expected to use TREE_RCU.
OK, so here is one of the things that you are trying to do with JRCU that
the current in-kernel RCU implementations do not do, namely, cause RCU
callbacks to be consistently invoked on some other CPU. If I remember
correctly, Frederic's changes do not support this either, because his
workloads run almost entirely in user space, and are therefore not
expected to generate RCU callbacks.
Oddly enough, user-space RCU -does- support this notion. ;-)
> The other two of the trio of reasons: doing kfree's on the
> 'wrong' cpu puts the freed buffer in the 'wrong' per-cpu
> free queue, and putting all the load on one cpu means
> that cpu could hit 100% cpu utilization just doing rcu
> callbacks, on systems that have thousands of cpus and
> the I/O fabrics necessary to keep those cpus busy.
That would be one reason I have been reluctant to implement callback
offloading in advance of a definite need!
Though I must admit that I am unsure how JRCU's call_rcu() implementation
does this safely -- it looks to me like it is only excluding irqs,
not other CPUs.
Thanx, Paul