Re: [PATCH 1/2] brw_mutex: big read-write mutex

From: Mikulas Patocka
Date: Fri Oct 19 2012 - 11:33:22 EST

On Fri, 19 Oct 2012, Peter Zijlstra wrote:

> On Thu, 2012-10-18 at 15:28 -0400, Mikulas Patocka wrote:
> >
> > On Thu, 18 Oct 2012, Oleg Nesterov wrote:
> >
> > > Ooooh. And I just noticed include/linux/percpu-rwsem.h which does
> > > something similar. Certainly it was not in my tree when I started
> > > this patch... percpu_down_write() doesn't allow multiple writers,
> > > but the main problem it uses msleep(1). It should not, I think.
> >
> > synchronize_rcu() can sleep for hundred milliseconds, so msleep(1) is not
> > a big problem.
> That code is beyond ugly though.. it should really not have been merged.
> There's absolutely no reason for it to use RCU except to make it more

So if you can do an alternative implementation without RCU, show it. The
goal is - there should be no LOCK instructions on the read path and as
few barriers as possible.

> complicated. And as Oleg pointed out that msleep() is very ill
> considered.
> The very worst part of it seems to be that nobody who's usually involved
> with locking primitives was ever consulted (Linus, PaulMck, Oleg, Ingo,
> tglx, dhowells and me). It doesn't even have lockdep annotations :/
> So the only reason you appear to use RCU is because you don't actually
> have a sane way to wait for count==0. And I'm contesting rcu_sync() is
> sane here -- for the very simple reason you still need while (count)
> loop right after it.
> So it appears you want an totally reader biased, sleepable rw-lock like
> thing?


> So did you consider keeping the inc/dec on the same per-cpu variable?
> Yes this adds a potential remote access to dec and requires you to use
> atomics, but I would not be surprised if the inc/dec were mostly on the
> same cpu most of the times -- which might be plenty fast for what you
> want.

Yes, I tried this approach - it involves doing LOCK instruction on read
lock, remembering the cpu and doing another LOCK instruction on read
unlock (which will hopefully be on the same CPU, so no cacheline bouncing
happens in the common case). It was slower than the approach without any
LOCK instructions (43.3 seconds seconds for the implementation with
per-cpu LOCKed access, 42.7 seconds for this implementation without atomic
instruction; the benchmark involved doing 512-byte direct-io reads and
writes on a ramdisk with 8 processes on 8-core machine).

> If you've got coherent per-cpu counts, you can better do the
> waitqueue/wake condition for write_down.

synchronize_rcu() is way slower than msleep(1) - so I don't see a reason
why should it be complicated to avoid msleep(1).

> It might also make sense to do away with the mutex, there's no point in
> serializing the wakeups in the p->locked case of down_read.

The advantage of a mutex is that it is already protected against
starvation. If I replace the mutex with a wait queue and retry, there is
no starvation protection.

> Furthermore,
> p->locked seems a complete duplicate of the mutex state, so removing the
> mutex also removes that duplication.

We could replace if (p->locked) with if (mutex_is_locked(p->mtx))

> Also, that CONFIG_x86 thing.. *shudder*...

To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at