Re: [PATCH] locking/rwsem: Remove reader optimistic lock stealing

From: Peng Wang

Date: Thu May 21 2026 - 23:34:00 EST

On Thu, May 21, 2026 at 10:08:58PM -0400, Waiman Long wrote:
> On 5/21/26 5:59 AM, Peng Wang wrote:
> > Reader optimistic lock stealing, introduced by commit 1a728dff855a
> > ("locking/rwsem: Enable reader optimistic lock stealing") and made more
> > aggressive by commit 617f3ef95177 ("locking/rwsem: Remove reader
> > optimistic spinning"), allows a reader entering the slowpath to bypass
> > the wait queue and acquire the lock directly when WRITER_LOCKED and
> > HANDOFF bits are not set.
> >
> > This causes severe writer starvation in workloads where readers hold
> > the lock for extended periods, such as Direct I/O operations which
> > hold inode->i_rwsem for the entire duration of iomap_dio_rw(). A
> > common example is log-structured storage where one thread appends via
> > DIO writes while another thread tails the log via DIO reads -- a
> > pattern seen in database redo-log replay and shared-storage
> > replication.
>
> It is generally assume that reader lock critical section is shorter than
> that of writer. In this particular case, does the reader critical section
> run longer than the writer's one?

Hi Longman,
Thanks for the quick reply.

In this workload both reader and writer hold i_rwsem across iomap_dio_rw(),
so the critical sections are not short, and they cover the full disk I/O latency
about hundreds of microseconds on NVMe

>
> Reader lock stealing should only happen if the previous lock owner is a
> writer. So readers and writer should at most alternately own the lock if
> there are many readers waiting. Of course, if a reader own the lock, it will
> wake up the remaining readers in the wait queue.

Because the steal (an atomic operation, ~50ns) is orders of magnitude faster than
the writer wakeup path (context switch, ~10-100us), the new reader wins this race
every time the lock becomes free.
This repeats with each subsequent reader arrival until the HANDOFF timeout.

>
> >
> > The problem occurs because:
> >
> > 1. A reader entering the slowpath (due to RWSEM_FLAG_WAITERS being set)
> > can still steal the lock as the steal condition only checks
> > WRITER_LOCKED and HANDOFF, not WAITERS. This is inconsistent with
> > the fast path which already blocks readers when WAITERS is set (via
> > RWSEM_READ_FAILED_MASK).
> >
> > 2. In the window between the last reader releasing the lock and the
> > waiting writer being scheduled (~10-100us), a new reader can steal
> > the lock in ~50-100ns. This race is structurally inevitable due to
> > the 1000x speed difference between atomic operations vs context
> > switching.
> >
> > 3. Each stolen read lock is held for the full DIO duration (potentially
> > milliseconds), and the pattern repeats until the 4ms HANDOFF timeout.
> > This effectively taxes every write operation with a ~4ms penalty.
> >
> > Performance impact measured with a DIO mixed read/write workload
> > (1 writer + 1 reader, O_DIRECT, ext4):
> >
> > NVMe SSD: Before After
> > Write-only baseline: 397 MB/s 397 MB/s (no change)
> > Mixed write throughput: 11 MB/s 350 MB/s (+31x)
> > Mixed write latency: 880 us 23 us (-38x)
> > Mixed read throughput: 95 MB/s 95 MB/s (no change)
>
> Does the reader wait for something else and taking other locks after
> acquiring the read lock? Can you point me of the reader critical section for
> this particular workload?

The reader does not wait on anything else or take other locks after acquiring i_rwsem.
The critical section is as below:

ext4_dio_read_iter() (fs/ext4/file.c:70)
inode_lock_shared(inode);
ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL, 0, NULL, 0);
inode_unlock_shared(inode);

The lock is held across iomap_dio_rw() which submits the bio and waits for I/O completion synchronously.
The hold time is essentially the device latency for a single DIO read.

>
> Cheers,
> Longman
>
> >
> > Fixes: 617f3ef95177 ("locking/rwsem: Remove reader optimistic spinning")
> > Signed-off-by: Peng Wang <peng_wang@xxxxxxxxxxxxxxxxx>
> > ---
> > kernel/locking/lock_events_list.h | 1 -
> > kernel/locking/rwsem.c | 33 ---------------------------------
> > 2 files changed, 34 deletions(-)
> >
> > diff --git a/kernel/locking/lock_events_list.h b/kernel/locking/lock_events_list.h
> > index 97fb6f3f840a..35b45576bee4 100644
> > --- a/kernel/locking/lock_events_list.h
> > +++ b/kernel/locking/lock_events_list.h
> > @@ -65,7 +65,6 @@ LOCK_EVENT(rwsem_opt_lock) /* # of opt-acquired write locks */
> > LOCK_EVENT(rwsem_opt_fail) /* # of failed optspins */
> > LOCK_EVENT(rwsem_opt_nospin) /* # of disabled optspins */
> > LOCK_EVENT(rwsem_rlock) /* # of read locks acquired */
> > -LOCK_EVENT(rwsem_rlock_steal) /* # of read locks by lock stealing */
> > LOCK_EVENT(rwsem_rlock_fast) /* # of fast read locks acquired */
> > LOCK_EVENT(rwsem_rlock_fail) /* # of failed read lock acquisitions */
> > LOCK_EVENT(rwsem_rlock_handoff) /* # of read lock handoffs */
> > diff --git a/kernel/locking/rwsem.c b/kernel/locking/rwsem.c
> > index bda5577339c0..40b141c5765f 100644
> > --- a/kernel/locking/rwsem.c
> > +++ b/kernel/locking/rwsem.c
> > @@ -1017,42 +1017,9 @@ static struct rw_semaphore __sched *
> > rwsem_down_read_slowpath(struct rw_semaphore *sem, long count, unsigned int state)
> > {
> > long adjustment = -RWSEM_READER_BIAS;
> > - long rcnt = (count >> RWSEM_READER_SHIFT);
> > struct rwsem_waiter waiter, *first;
> > DEFINE_WAKE_Q(wake_q);
> >
> > - /*
> > - * To prevent a constant stream of readers from starving a sleeping
> > - * writer, don't attempt optimistic lock stealing if the lock is
> > - * very likely owned by readers.
> > - */
> > - if ((atomic_long_read(&sem->owner) & RWSEM_READER_OWNED) &&
> > - (rcnt > 1) && !(count & RWSEM_WRITER_LOCKED))
> > - goto queue;
> > -
> > - /*
> > - * Reader optimistic lock stealing.
> > - */
> > - if (!(count & (RWSEM_WRITER_LOCKED | RWSEM_FLAG_HANDOFF))) {
> > - rwsem_set_reader_owned(sem);
> > - lockevent_inc(rwsem_rlock_steal);
> > -
> > - /*
> > - * Wake up other readers in the wait queue if it is
> > - * the first reader.
> > - */
> > - if ((rcnt == 1) && (count & RWSEM_FLAG_WAITERS)) {
> > - raw_spin_lock_irq(&sem->wait_lock);
> > - if (sem->first_waiter)
> > - rwsem_mark_wake(sem, RWSEM_WAKE_READ_OWNED,
> > - &wake_q);
> > - raw_spin_unlock_irq(&sem->wait_lock);
> > - wake_up_q(&wake_q);
> > - }
> > - return sem;
> > - }
> > -
> > -queue:
> > waiter.task = current;
> > waiter.type = RWSEM_WAITING_FOR_READ;
> > waiter.timeout = jiffies + RWSEM_WAIT_TIMEOUT;
> > --
> > 2.39.3
> >
>