Re: Kernel RCU: shrink the size of the struct rcu_head

From: Paul E. McKenney
Date: Fri Oct 23 2009 - 12:36:20 EST


On Fri, Oct 23, 2009 at 08:29:38AM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney (paulmck@xxxxxxxxxxxxxxxxxx) wrote:
> > On Wed, Oct 21, 2009 at 10:53:15AM -0400, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney (paulmck@xxxxxxxxxxxxxxxxxx) wrote:
> > > > On Sun, Oct 18, 2009 at 07:29:18PM -0400, Mathieu Desnoyers wrote:
> > > > > Hi Paul,
> > > > >
> > > > > I noticed that you already discussed the possibility of shrinking the
> > > > > struct rcu_head by removing the function pointer.
> > > > > (http://kernel.org/pub/linux/kernel/people/paulmck/rcutodo.html)
> > > > >
> > > > > The ideas brought in so far require having per-callback lists, which
> > > > > involves a bit of management overhead and don't permit keeping the
> > > > > call_rcu() in cpu order.
> > > >
> > > > But please note that this is on the "Possibly Dubious Changes" list. ;-)
> > > >
> > > > > You might want to look into the Userspace RCU urcu-defer.c
> > > > > implementation, where I perform pointer encoding to compact the usual
> > > > > case, expected to be the same callback passed as parameter multiple
> > > > > times in a row to call_rcu(). This is very typical with multiple free()
> > > > > calls for different data structures next to each other.
> > > > >
> > > > > This typically keeps the size of the information to encode per callback
> > > > > down to a minimum: the size of a single pointer. It would be good to
> > > > > trace the kernel usage of call_rcu() to see if my assumption holds.
> > > > >
> > > > > I just thought I should tell you before you start looking at this
> > > > > issue further.
> > > >
> > > > So the idea is to maintain a per-CPU queue of function pointers, but
> > > > with the pointers on this queue encoded to save space, correct?
> > >
> > > Yes, exactly.
> >
> > OK, I will add something to this effect on my rcutodo page.
> >
> > > > If I
> > > > understand correctly, the user-level rcu-defer implementation relies on
> > > > the following:
> > > >
> > > > 1. It is illegal to call _rcu_defer_queue() within an RCU read-side
> > > > critical section (due to the call to rcu_defer_barrier_thread()
> > > > which in turn calls synchronize_rcu(). This is necessary to
> > > > handle queue overflow. (Which appears to be why you introduce
> > > > a new API, as it is legal to invoke call_rcu() from within an
> > > > RCU read-side critical section.)
> > >
> > > When dealing with queue overflow, I figured we have 4 alternatives.
> > > Either:
> > >
> > > 1, 2, 3) We proceed to execution of {the single, all, thread local}
> > > callback(s) on the spot after a synchronize_rcu().
> > >
> > > 4) We expand the queue by allocating more memory.
> > >
> > > The idea of pointer encoding to save space could be used with any of 1,
> > > 2, 3, or 4. As you say, call_rcu() requires (4), because it tolerates
> > > being called from an rcu read-side C.S.. 1, 2, 3 are incompatible with
> > > read-side C.S. context because they require to use synchronize_rcu()
> > > within the C.S., which would deadlock on its calling context.
> > >
> > > Now, there is a rationale for the choice of (3) in my urcu-defer
> > > implementation:
> > >
> > > * It's how I can deal with memory full (-ENOMEM) without letting the
> > > system die with exit(). How does the kernel call_rcu() deal with this
> > > currently ? BUG_ON, WARN_ON ?
> >
> > It doesn't have to do anything -- the caller of call_rcu() is responsible
> > for allocating any required memory. So call_rcu() never allocates memory
> > and thus never needs to worry about a memory-allocation failure.
>
> I see. So it puts the burden of memory allocation onto the creation of
> the original object we are dealing with. I understand why it's done like
> that now: it does not add another memory allocation failure point.
>
> > > * It acts as a rate limiter for urcu_defer_queue(). Basically, if a
> > > thread starts enqueuing callbacks too fast, it will eventually fill its
> > > queue and have to empty it itself. AFAIK, It's not possible to do that
> > > if you allow call_rcu() to be called from read-side C.S..
> >
> > Yep! ;-)
>
> From userland point of view, it would be good to provide an API that
> allows rate-limiting, especially to deal with DoS types of attacks. I
> I just renamed rcu_defer_queue() to defer_rcu() and removed the define
> for call_rcu(), because this last mapping introduced an API with the
> same name as the kernel API, but with different parameters.
>
> So the idea is to have:
>
> * defer_rcu(fct, p): fixed-sized queue rcu work deferral. Cannot be
> called from rcu read-side C.S..
> * defer_rcu_ratelimit(fct, p, rl): same as above, but with added rate
> limiter callback.
>
> and, eventually, to add back a standard call_rcu(), which can be called
> from within RCU read-side C.S., but which provides no rate limiting
> whatsoever.

These sound to me like interesting experiments, but I strongly
suggest publishing them as "experimental, may change" or whatever.
One thing to keep in mind is that the rate-limiting should not be used
in real-time applications. Alternatively, you could allow defer_rcu()
and defer_rcu_ratelimit() to return failure if the callback could not
be queued immediately. Not that figuring out what to do in the failure
case is particularly attractive...

Of course, the memory savings will depend on the workload. The
defer_rcu*() primitives eliminate the need for a struct rcu_head in
each data structure, but requires that the per-CPU/task callback queues
be present.

> > > I could even extend rcu_defer_queue() to take a second rate-limiter
> > > callback, which would check if the thread went over some threshold and
> > > give a more precise limit (e.g. amount of memory to be freed) on the
> > > rate than the "4096 callbacks in flight max", which have been chosen by
> > > benchmarks, but is a bit arbitrary in terms of overall callback effect.
> > >
> > > How important is it to permit enqueuing callbacks from within rcu
> > > read-side C.S. in terms of real-life usage ? If it is really that
> > > important to fill this use-case, then I could have a mode for call_rcu()
> > > that expands the RCU callback queue upon overflow. But as I argue above,
> > > I really prefer the control we have with a fixed-sized queue.
> >
> > There are occurrences of this in the Linux kernel. In theory, you could
> > always hang onto the object until leaving the outermost RCU read-side
> > critical section, but in practice this is not always consistent with
> > good software-engineering practice.
> >
> > One use case is when you have an RCU-protected list, each element of
> > which has an RCU-protected list hanging off it. In this case, you might
> > scan the upper-level list under RCU protection, but during the scan you
> > might need to remove elements from the lower-level list and pass them
> > to call_rcu().
> >
> > So it really needs to be legal for call_rcu() to be invoked from within
> > an RCU read-side critical section.
>
> I agree that not being able to use call_rcu() from a read-side C.S.
> could really be a problem here.
>
> This is why I think aving two interfaces, one permitting calling
> call_rcu() from within C.S., but requiring the added struct rcu_head,
> and the other using per-thread queues with maximum size, rate limiting,
> but which cannot be used in read-side C.S. seems like a good tradeoff.

The defer_rcu() interfaces might work out quite well, but it would be
good to see how they do in real life. ;-)

> > > > 2. It is OK to wait for a grace period when a thread calls
> > > > rcu_defer_unregister_thread() while exiting. In the kernel,
> > > > this is roughly equivalent to the CPU_DYING notifier, which
> > > > cannot block, thus cannot wait for a grace period.
> > > >
> > > > I could imagine copying the per-CPU buffer somewhere, though
> > > > my experience with the RCU/CPU-hotplug interface does not
> > > > encourage me in this direction. ;-)
> > >
> > > As you say, we don't _have_ to empty the queue before putting a
> > > thread/cpu offline. We could simply copy the unplugged cpu queue to an
> > > orphan queue, as you currently do in your implementation. I agree that
> > > it would be more suitable to the cpu hotplug CPU_DYING execution
> > > context, due to its inherent trickiness.
> >
> > Especially if you want something like rcu_barrier() to continue working.
> >
> > Hmmm... Can user applications unload dynamically linked libraries? ;-)
>
> Yes. With dlclose(). But I expect library developers to call
> rcu_defer_barrier() in their destructor if they have queued any work.
> (/me sneaks a README update in the git tree to that effect) ;)

;-)

Thanx, Paul

> Thanks,
>
> Mathieu
>
> >
> > Thanx, Paul
> >
> > > Thanks,
> > >
> > > Mathieu
> > >
> > > >
> > > > Thanx, Paul
> > >
> > > --
> > > Mathieu Desnoyers
> > > OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>
> --
> Mathieu Desnoyers
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/