Re: [PATCH] locking/percpu_rwsem: Rewrite to not use rwsem

From: Will Deacon
Date: Wed Aug 07 2019 - 10:45:42 EST


Hi Peter,

I've mostly been spared the joys of pcpu rwsem, but I took a look anyway.
Comments of questionable quality below.

On Mon, Aug 05, 2019 at 04:02:41PM +0200, Peter Zijlstra wrote:
> The filesystem freezer uses percpu_rwsem in a way that is effectively
> write_non_owner() and achieves this with a few horrible hacks that
> rely on the rwsem (!percpu) implementation.
>
> When -RT re-implements rwsem this house of cards comes undone.
>
> Re-implement percpu_rwsem without relying on rwsem.
>
> Reported-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx>
> Reported-by: Juri Lelli <juri.lelli@xxxxxxxxxx>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@xxxxxxxxxxxxx>
> Tested-by: Juri Lelli <juri.lelli@xxxxxxxxxx>
> Cc: Clark Williams <williams@xxxxxxxxxx>
> Cc: Daniel Bristot de Oliveira <bristot@xxxxxxxxxx>
> Cc: Thomas Gleixner <tglx@xxxxxxxxxxxxx>
> Cc: Waiman Long <longman@xxxxxxxxxx>
> Cc: Davidlohr Bueso <dave@xxxxxxxxxxxx>
> Cc: Oleg Nesterov <oleg@xxxxxxxxxx>
> Cc: jack@xxxxxxxx
> ---
> include/linux/percpu-rwsem.h | 72 +++++++++++++-------------
> include/linux/wait.h | 3 +
> kernel/cpu.c | 4 -
> kernel/locking/percpu-rwsem.c | 116 +++++++++++++++++++++++++-----------------
> 4 files changed, 112 insertions(+), 83 deletions(-)

[...]

> +/*
> + * Called with preemption disabled, and returns with preemption disabled,
> + * except when it fails the trylock.
> + */
> +bool __percpu_down_read(struct percpu_rw_semaphore *sem, bool try)
> {
> /*
> * Due to having preemption disabled the decrement happens on
> * the same CPU as the increment, avoiding the
> * increment-on-one-CPU-and-decrement-on-another problem.
> *
> - * If the reader misses the writer's assignment of readers_block, then
> - * the writer is guaranteed to see the reader's increment.
> + * If the reader misses the writer's assignment of sem->block, then the
> + * writer is guaranteed to see the reader's increment.
> *
> * Conversely, any readers that increment their sem->read_count after
> - * the writer looks are guaranteed to see the readers_block value,
> - * which in turn means that they are guaranteed to immediately
> - * decrement their sem->read_count, so that it doesn't matter that the
> - * writer missed them.
> + * the writer looks are guaranteed to see the sem->block value, which
> + * in turn means that they are guaranteed to immediately decrement
> + * their sem->read_count, so that it doesn't matter that the writer
> + * missed them.
> */
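
As an aside, since it took me a minute to page this in: the comment is
describing the usual store-buffering pattern. Roughly like the sketch
below -- the helper names and the writer side are mine, reconstructed
from the old code, so take the details with a pinch of salt:

    /* Reader fast path: caller has preemption disabled. */
    static bool reader_sees_no_writer(struct percpu_rw_semaphore *sem)
    {
            __this_cpu_inc(*sem->read_count);          /* (1) */
            smp_mb();                                   /* A matches D */
            return !atomic_read(&sem->block);           /* (2) */
    }

    /* Writer: block new readers, then look for existing ones. */
    static bool writer_sees_no_readers(struct percpu_rw_semaphore *sem)
    {
            atomic_set(&sem->block, 1);                 /* (3) */
            smp_mb();                                   /* D matches A */
            return per_cpu_sum(*sem->read_count) == 0;  /* (4) */
    }

The A/D pairing forbids the outcome where (2) misses the write at (3)
*and* (4) misses the increment at (1); at least one side must observe
the other, so a reader and a writer can never both proceed.
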
>
> +again:
> smp_mb(); /* A matches D */
>
> /*
> - * If !readers_block the critical section starts here, matched by the
> + * If !sem->block the critical section starts here, matched by the
> * release in percpu_up_write().
> */
> - if (likely(!smp_load_acquire(&sem->readers_block)))
> - return 1;
> + if (likely(!atomic_read_acquire(&sem->block)))
> + return true;
>
> /*
> * Per the above comment; we still have preemption disabled and
> * will thus decrement on the same CPU as we incremented.
> */
> - __percpu_up_read(sem);
> + __percpu_up_read(sem); /* implies preempt_enable() */

Irritatingly, it also implies an smp_mb(), which I don't think we need
here: we failed the fast path, so there is no reader critical section
for the barrier to order before the decrement.
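
For reference, I'm assuming __percpu_up_read() keeps roughly its
current shape, with the preempt_enable() folded in (the wakeup target
is a guess on my part):

    void __percpu_up_read(struct percpu_rw_semaphore *sem)
    {
            smp_mb(); /* B matches C */
            /*
             * On the normal unlock path this orders the reader's
             * critical section before the decrement; on the failure
             * path above there is no critical section to order.
             */
            __this_cpu_dec(*sem->read_count);

            /* Prod the writer to re-evaluate readers_active_check() */
            wake_up(&sem->waiters);

            preempt_enable();
    }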

> if (try)
> - return 0;
> + return false;
>
> - /*
> - * We either call schedule() in the wait, or we'll fall through
> - * and reschedule on the preempt_enable() in percpu_down_read().
> - */
> - preempt_enable_no_resched();
> + wait_event(sem->waiters, !atomic_read_acquire(&sem->block));

Why do you need acquire semantics here? Is the control dependency to the
increment not enough?
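
That is, I'd have expected a plain (relaxed) read to be sufficient;
something like this sketch, assuming the tail of the function just
re-increments and jumps back to the full barrier at 'again':

    /*
     * The control dependency from the read of sem->block to the
     * __this_cpu_inc() orders the store, and we pass through the
     * smp_mb() at 'again' before re-checking anyway.
     */
    wait_event(sem->waiters, !atomic_read(&sem->block));
    preempt_disable();
    __this_cpu_inc(*sem->read_count);
    goto again;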

Talking of control dependencies, could we replace the smp_mb() in
readers_active_check() with smp_acquire__after_ctrl_dep()? In fact, perhaps
we could remove it altogether, given that our writes will be ordered by
the dependency and I don't think we care about ordering our reads wrt
previous readers. Hmm. Either way, clearly not for this patch.
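
Concretely, I'm thinking of something like this, assuming
readers_active_check() keeps its current shape:

    static bool readers_active_check(struct percpu_rw_semaphore *sem)
    {
            if (per_cpu_sum(*sem->read_count) != 0)
                    return false;

            /*
             * Our subsequent stores are already ordered after the sum
             * by the control dependency; the acquire only adds
             * ordering for our reads, which arguably we don't need
             * wrt previous readers.
             */
            smp_acquire__after_ctrl_dep();

            return true;
    }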

Anyway, general shape of the patch looks good to me.

Will