Re: [PATCH RESEND v2 1/1] percpu_rw_semaphore: reimplement to not block the readers unnecessarily

From: Mikulas Patocka
Date: Thu Nov 08 2012 - 18:41:25 EST




On Thu, 8 Nov 2012, Paul E. McKenney wrote:

> On Thu, Nov 08, 2012 at 12:07:00PM -0800, Andrew Morton wrote:
> > On Thu, 8 Nov 2012 14:48:49 +0100
> > Oleg Nesterov <oleg@xxxxxxxxxx> wrote:
> >
> > > Currently the writer does msleep() plus synchronize_sched() 3 times
> > > to acquire/release the semaphore, and during this time the readers
> > > are blocked completely. Even if the "write" section was not actually
> > > started or if it was already finished.
> > >
> > > With this patch down_write/up_write does synchronize_sched() twice
> > > and down_read/up_read are still possible during this time, just they
> > > use the slow path.
> > >
> > > percpu_down_write() first forces the readers to use rw_semaphore and
> > > increment the "slow" counter to take the lock for reading, then it
> > > takes that rw_semaphore for writing and blocks the readers.
> > >
> > > Also. With this patch the code relies on the documented behaviour of
> > > synchronize_sched(), it doesn't try to pair synchronize_sched() with
> > > barrier.
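
(For orientation: the write-side sequence described above condenses to
roughly the sketch below. It is reconstructed from the description and
the hunks quoted later in this mail, with the long comments trimmed;
see the actual patch for the details.)

	void percpu_down_write(struct percpu_rw_semaphore *brw)
	{
		/* make update_fast_ctr() fail: readers take the slow path */
		mutex_lock(&brw->writer_mutex);
		/* wait until no CPU can still be inside the old fast path */
		synchronize_sched();
		/* fast_read_ctr is now stable; fold it into the slow counter */
		atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
		/* block the new (slow-path) readers */
		down_write(&brw->rw_sem);
		/* wait until the active readers drain */
		wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
	}

	void percpu_up_write(struct percpu_rw_semaphore *brw)
	{
		/* re-admit the readers, slow path only for now */
		up_write(&brw->rw_sem);
		/*
		 * second synchronize_sched(): make the writer's updates
		 * visible before any reader re-enters the fast path
		 */
		synchronize_sched();
		mutex_unlock(&brw->writer_mutex);
	}
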
> > >
> > > ...
> > >
> > > include/linux/percpu-rwsem.h | 83 +++++------------------------
> > > lib/Makefile | 2 +-
> > > lib/percpu-rwsem.c | 123 ++++++++++++++++++++++++++++++++++++++++++
> >
> > The patch also uninlines everything.
> >
> > And it didn't export the resulting symbols to modules, so it isn't an
> > equivalent. We can export things later if needed, I guess.
> >
> > It adds percpu-rwsem.o to lib-y, so the CONFIG_BLOCK=n kernel will
> > avoid including the code altogether, methinks?
> >
> > >
> > > ...
> > >
> > > --- /dev/null
> > > +++ b/lib/percpu-rwsem.c
> > > @@ -0,0 +1,123 @@
> >
> > That was nice and terse ;)
> >
> > > +#include <linux/percpu-rwsem.h>
> > > +#include <linux/rcupdate.h>
> > > +#include <linux/sched.h>
> >
> > This list is nowhere near sufficient to support this file's
> > requirements. atomic.h, percpu.h, rwsem.h, wait.h, errno.h and plenty
> > more. IOW, if it compiles, it was sheer luck.
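
Concretely, something like this at the top of lib/percpu-rwsem.c would
spell out the dependencies, in addition to the three includes above
(Andrew's list, plus mutex.h for the writer_mutex):

	#include <linux/atomic.h>
	#include <linux/errno.h>
	#include <linux/mutex.h>
	#include <linux/percpu.h>
	#include <linux/rwsem.h>
	#include <linux/wait.h>
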
> >
> > > +int percpu_init_rwsem(struct percpu_rw_semaphore *brw)
> > > +{
> > > + brw->fast_read_ctr = alloc_percpu(int);
> > > + if (unlikely(!brw->fast_read_ctr))
> > > + return -ENOMEM;
> > > +
> > > + mutex_init(&brw->writer_mutex);
> > > + init_rwsem(&brw->rw_sem);
> > > + atomic_set(&brw->slow_read_ctr, 0);
> > > + init_waitqueue_head(&brw->write_waitq);
> > > + return 0;
> > > +}
> > > +
> > > +void percpu_free_rwsem(struct percpu_rw_semaphore *brw)
> > > +{
> > > + free_percpu(brw->fast_read_ctr);
> > > + brw->fast_read_ctr = NULL; /* catch use after free bugs */
> > > +}
> > > +
> > > +static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
> > > +{
> > > + bool success = false;
> > > +
> > > + preempt_disable();
> > > + if (likely(!mutex_is_locked(&brw->writer_mutex))) {
> > > + __this_cpu_add(*brw->fast_read_ctr, val);
> > > + success = true;
> > > + }
> > > + preempt_enable();
> > > +
> > > + return success;
> > > +}
> > > +
> > > +/*
> > > + * Like the normal down_read() this is not recursive, the writer can
> > > + * come after the first percpu_down_read() and create the deadlock.
> > > + */
> > > +void percpu_down_read(struct percpu_rw_semaphore *brw)
> > > +{
> > > + if (likely(update_fast_ctr(brw, +1)))
> > > + return;
> > > +
> > > + down_read(&brw->rw_sem);
> > > + atomic_inc(&brw->slow_read_ctr);
> > > + up_read(&brw->rw_sem);
> > > +}
> > > +
> > > +void percpu_up_read(struct percpu_rw_semaphore *brw)
> > > +{
> > > + if (likely(update_fast_ctr(brw, -1)))
> > > + return;
> > > +
> > > + /* false-positive is possible but harmless */
> > > + if (atomic_dec_and_test(&brw->slow_read_ctr))
> > > + wake_up_all(&brw->write_waitq);
> > > +}
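
(For context, a user of this API looks something like the sketch below;
the names are made up for illustration, and my_brw is assumed to have
been set up with percpu_init_rwsem(). The one real user so far is in
the block layer, hence the CONFIG_BLOCK discussion above.)

	static struct percpu_rw_semaphore my_brw;

	void my_reader(void)
	{
		percpu_down_read(&my_brw);	/* frequent, usually fast path */
		/* read-side critical section */
		percpu_up_read(&my_brw);
	}

	void my_writer(void)
	{
		percpu_down_write(&my_brw);	/* rare and expensive */
		/* exclusive write-side critical section */
		percpu_up_write(&my_brw);
	}
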
> > > +
> > > +static int clear_fast_ctr(struct percpu_rw_semaphore *brw)
> > > +{
> > > + unsigned int sum = 0;
> > > + int cpu;
> > > +
> > > + for_each_possible_cpu(cpu) {
> > > + sum += per_cpu(*brw->fast_read_ctr, cpu);
> > > + per_cpu(*brw->fast_read_ctr, cpu) = 0;
> > > + }
> > > +
> > > + return sum;
> > > +}
> > > +
> > > +/*
> > > + * A writer takes ->writer_mutex to exclude other writers and to force the
> > > + * readers to switch to the slow mode, note the mutex_is_locked() check in
> > > + * update_fast_ctr().
> > > + *
> > > + * After that the readers can only inc/dec the slow ->slow_read_ctr counter,
> > > + * ->fast_read_ctr is stable. Once the writer moves its sum into the slow
> > > + * counter it represents the number of active readers.
> > > + *
> > > + * Finally the writer takes ->rw_sem for writing and blocks the new readers,
> > > + * then waits until the slow counter becomes zero.
> > > + */
> >
> > Some overview of how fast/slow_read_ctr are supposed to work would be
> > useful. This comment seems to assume that the reader already knew
> > that.
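
Something along these lines might do (it only restates what the code in
this patch already implements):

	/*
	 * Reader accounting in short:
	 *
	 * - No writer: readers only touch their own CPU's fast_read_ctr
	 *   (+1 in down_read, -1 in up_read); nobody ever sums it while
	 *   it is changing.
	 *
	 * - Writer active: ->writer_mutex is held, update_fast_ctr()
	 *   fails, and readers fall back to ->rw_sem plus the shared
	 *   atomic ->slow_read_ctr.
	 *
	 * - Handoff: after synchronize_sched() no reader can still be
	 *   inside the preempt-disabled fast path, so fast_read_ctr is
	 *   stable. Once its sum is folded into slow_read_ctr, that
	 *   counter equals the number of readers still inside their
	 *   critical sections, and the writer waits for it to hit zero.
	 */
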
> >
> > > +void percpu_down_write(struct percpu_rw_semaphore *brw)
> > > +{
> > > + /* also blocks update_fast_ctr() which checks mutex_is_locked() */
> > > + mutex_lock(&brw->writer_mutex);
> > > +
> > > + /*
> > > + * 1. Ensures mutex_is_locked() is visible to any down_read/up_read
> > > + * so that update_fast_ctr() can't succeed.
> > > + *
> > > + * 2. Ensures we see the result of every previous this_cpu_add() in
> > > + * update_fast_ctr().
> > > + *
> > > + * 3. Ensures that if any reader has exited its critical section via
> > > + * fast-path, it executes a full memory barrier before we return.
> > > + */
> > > + synchronize_sched();
> >
> > Here's where I get horridly confused. Your patch completely deRCUifies
> > this code, yes? Yet here we're using an RCU primitive. And we seem to
> > be using it not as an RCU primitive but as a handy thing which happens
> > to have desirable side-effects. But the implementation of
> > synchronize_sched() differs considerably according to which rcu
> > flavor-of-the-minute you're using.
>
> The trick is that the preempt_disable() call in update_fast_ctr()
> acts as an RCU read-side critical section WRT synchronize_sched().
>
> The algorithm would work given rcu_read_lock()/rcu_read_unlock() and
> synchronize_rcu() in place of preempt_disable()/preempt_enable() and
> synchronize_sched(). The real-time guys would prefer the change
> to rcu_read_lock()/rcu_read_unlock() and synchronize_rcu(), now that
> you mention it.
>
> Oleg, Mikulas, any reason not to move to rcu_read_lock()/rcu_read_unlock()
> and synchronize_rcu()?
>
> Thanx, Paul
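
For concreteness, the change Paul suggests would only touch the fast
path (plus the two synchronize_sched() calls becoming
synchronize_rcu()); a sketch:

	static bool update_fast_ctr(struct percpu_rw_semaphore *brw, unsigned int val)
	{
		bool success = false;

		rcu_read_lock();	/* pairs with synchronize_rcu() in the writer */
		if (likely(!mutex_is_locked(&brw->writer_mutex))) {
			__this_cpu_add(*brw->fast_read_ctr, val);
			success = true;
		}
		rcu_read_unlock();

		return success;
	}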

preempt_disable/preempt_enable are faster than
rcu_read_lock/rcu_read_unlock on preemptible kernels.

Regarding real-time response: the region covered by
preempt_disable/preempt_enable contains only a few instructions (one
mutex_is_locked() test and one increment of a percpu variable), so it is
no threat to real-time response. There are plenty of longer regions in
the kernel that run with interrupts or preemption disabled.

Mikulas