Re: [PATCH] locking/percpu_rwsem: Rewrite to not use rwsem

From: Peter Zijlstra
Date: Tue Oct 29 2019 - 15:06:40 EST


On Wed, Aug 07, 2019 at 03:45:34PM +0100, Will Deacon wrote:
> Hi Peter,
>
> I've mostly been spared the joys of pcpu rwsem, but I took a look anyway.
> Comments of questionable quality below.

Thanks for having a look, and sorry for being tardy.

> On Mon, Aug 05, 2019 at 04:02:41PM +0200, Peter Zijlstra wrote:

> > +/*
> > + * Called with preemption disabled, and returns with preemption disabled,
> > + * except when it fails the trylock.
> > + */
> > +bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
> > {
> > /*
> > * Due to having preemption disabled the decrement happens on
> > * the same CPU as the increment, avoiding the
> > * increment-on-one-CPU-and-decrement-on-another problem.
> > *
> > - * If the reader misses the writer's assignment of readers_block, then
> > - * the writer is guaranteed to see the reader's increment.
> > + * If the reader misses the writer's assignment of sem->block, then the
> > + * writer is guaranteed to see the reader's increment.
> > *
> > * Conversely, any readers that increment their sem->read_count after
> > - * the writer looks are guaranteed to see the readers_block value,
> > - * which in turn means that they are guaranteed to immediately
> > - * decrement their sem->read_count, so that it doesn't matter that the
> > - * writer missed them.
> > + * the writer looks are guaranteed to see the sem->block value, which
> > + * in turn means that they are guaranteed to immediately decrement
> > + * their sem->read_count, so that it doesn't matter that the writer
> > + * missed them.
> > */
> >
> > +again:
> > smp_mb(); /* A matches D */
> >
> > /*
> > - * If !readers_block the critical section starts here, matched by the
> > + * If !sem->block the critical section starts here, matched by the
> > * release in percpu_up_write().
> > */
> > - if (likely(!smp_load_acquire(&sem->readers_block)))
> > - return 1;
> > + if (likely(!atomic_read_acquire(&sem->block)))
> > + return true;
> >
> > /*
> > * Per the above comment; we still have preemption disabled and
> > * will thus decrement on the same CPU as we incremented.
> > */
> > - __percpu_up_read(sem);
> > + __percpu_up_read(sem); /* implies preempt_enable() */
>
> Irritatingly, it also implies an smp_mb() which I don't think we need here.

Hurm.. yes, I think you're right. We have:

	__this_cpu_inc()
	smp_mb()
	if (!atomic_read_acquire(&sem->block))
		return true;
	__percpu_up_read();

Between that smp_mb() and the ACQUIRE there is no way the
__this_cpu_dec() can get hoisted; that is, unless we rely on that
stronger transitivity guarantee (I forgot what we ended up calling it,
and I can never find anything in a hurry in memory-barriers.txt).

That said, with Oleg's suggestion the chance of getting here is much
reduced, and when we do get here, the cost of the actual wakeups is much
higher than that of the memory barrier, so I'm inclined to leave things
as they are.
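
If we did care, the undo on this path could skip that barrier entirely;
roughly something like this (hypothetical sketch, not what the patch
does; the rcuwait wakeup follows the shape of __percpu_up_read()):

	/*
	 * Hypothetical barrier-free undo: the smp_mb()/ACQUIRE pair
	 * above already pins the decrement, so the smp_mb() implied by
	 * __percpu_up_read() buys us nothing on this path.
	 */
	__this_cpu_dec(*sem->read_count);
	rcuwait_wake_up(&sem->writer);	/* prod the writer to re-check */
	preempt_enable();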

> > if (try)
> > - return 0;
> > + return false;
> >
> > - /*
> > - * We either call schedule() in the wait, or we'll fall through
> > - * and reschedule on the preempt_enable() in percpu_down_read().
> > - */
> > - preempt_enable_no_resched();
> > + wait_event(sem->waiters, !atomic_read_acquire(&sem->block));
>
> Why do you need acquire semantics here? Is the control dependency to the
> increment not enough?

Uuuhhh... let me think about that. Clearly we want ACQUIRE for the
'return true' case, but I'm not sure why I used it here.
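
If the control dependency does turn out to be enough, the wait would
become something like:

	/*
	 * Possibly sufficient: the real ordering is (re)established by
	 * the smp_mb() at again: once we retry, so a plain read may do
	 * for the wait condition itself.
	 */
	wait_event(sem->waiters, !atomic_read(&sem->block));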

> Talking of control dependencies, could we replace the smp_mb() in
> readers_active_check() with smp_acquire__after_ctrl_dep()? In fact, perhaps
> we could remove it altogether, given that our writes will be ordered by
> the dependency and I don't think we care about ordering our reads wrt
> previous readers. Hmm. Either way, clearly not for this patch.

I think we can, but we've never had the need to optimize the slow path
here. The design of this thing has always been that if you hit the
slow path, you're doing it wrong.
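
For reference, I think your suggestion would look something like this
(readers_active_check() shape per the patch, from memory):

	static bool readers_active_check(struct percpu_rw_semaphore *sem)
	{
		if (per_cpu_sum(*sem->read_count) != 0)
			return false;

		/*
		 * Upgrade the control dependency of the branch above to
		 * an ACQUIRE, instead of paying for a full barrier.
		 */
		smp_acquire__after_ctrl_dep();	/* was: smp_mb(); C matches B */

		return true;
	}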

That said, I think cgroups uses a variant of percpu-rwsem that wrecks
rss (the rcu_sync state) on purpose and therefore always takes the slow
paths. So it might actually be worth it.
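
That is (from memory; see cgroup_init() in kernel/cgroup/cgroup.c):

	/*
	 * The latency of synchronize_rcu() is too high for cgroups;
	 * avoid it at the cost of forcing all readers onto the slow path.
	 */
	rcu_sync_enter_start(&cgroup_threadgroup_rwsem.rss);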