Re: [PATCH] f2fs: move f2fs to use reader-unfair rwsems

From: Jaegeuk Kim
Date: Tue Jan 11 2022 - 19:25:13 EST


On 01/11, Waiman Long wrote:
> On 1/11/22 12:07, Jaegeuk Kim wrote:
> > On 01/11, Waiman Long wrote:
> > > On 1/11/22 01:53, Tim Murray wrote:
> > > > On Mon, Jan 10, 2022 at 8:15 PM Waiman Long <longman@xxxxxxxxxx> wrote:
> > > > > That is not how rwsem works. A reader which fails to get the lock
> > > > > because it is write-locked will remove its reader count before going to
> > > > > sleep. So the reader count will be zero eventually. Of course, there is
> > > > > a short period of time where the reader count will be non-zero until the
> > > > > reader removes its own reader count. So if a new writer comes in at that
> > > > > time, it will fail its initial trylock and probably go to optimistic
> > > > > spinning mode. If the writer that owns the lock release it at the right
> > > > > moment, the reader may acquire the read lock.
> > > > Thanks for the correction, that makes sense. I haven't spent too much
> > > > time on rwsem internals and I'm not confident about when flags are set
> > > > and cleared in sem->count; is there a case where sem->count after
> > > > up_write() could be nonzero?
> > > >
> > > > An example from one trace:
> > > >
> > > > 1. Low-priority userspace thread 4764 is blocked in f2fs_unlink,
> > > > probably at f2fs_lock_op, which is a wrapper around
> > > > down_read(cp_rwsem).
> > > > 2. f2fs-ckpt runs at t=0ms and wakes thread 4764, making it runnable.
> > > > 3. At t=1ms, f2fs-ckpt enters uninterruptible sleep and blocks at
> > > > rwsem_down_write_slowpath per sched_blocked_reason.
> > > > 4. At t=26ms, thread 4764 runs for the first time since being made
> > > > runnable. Within 40us, thread 4764 unblocks f2fs-ckpt and makes it
> > > > runnable.
> > > >
> > > > Since thread 4764 is awakened by f2fs-ckpt but never runs before it
> > > > unblocks f2fs-ckpt in down_write_slowpath(), the only idea I had is
> > > > that cp_rwsem->count is nonzero after f2fs-ckpt's up_write() in step 2
> > > > (maybe because of rwsem_mark_wake()?).
> > > >
> > > > > I do have a question about the number of readers in such a case compared
> > > > > with the number of writers. Are there a large number of low priority
> > > > > hanging around? What is an average read lock hold time?
> > > > >
> > > > > Blocking for 9.7s for a write lock is quite excessive and we need to
> > > > > figure out how this happen.,
> > > > Just to be 100% clear, it's not a single 9.7s stall, it's many smaller
> > > > stalls of 10-500+ms in f2fs-ckpt that add up to 9.7s over that range.
> > > >
> > > > f2fs is not my area of expertise, but my understanding is that
> > > > cp_rwsem in f2fs has many (potentially unbounded) readers and a single
> > > > writer. Arbitrary userspace work (fsync, creating/deleting/truncating
> > > > files, atomic writes) may grab the read lock, but assuming the
> > > > merge_checkpoint option is enabled, only f2fs-ckpt will ever grab the
> > > > write lock during normal operation. However, in this particular
> > > > example, it looks like there may have been 5-10 threads blocked on
> > > > f2fs-ckpt that were awakened alongside thread 4764 in step 2.
> > > >
> > > > I'll defer to the f2fs experts on the average duration that the read
> > > > lock is held.
> > > Thanks for the explanation.
> > >
> > > Another question that I have is whether the test result is based on the
> > > latest upstream kernel or earlier kernel version. We used to allow reader
> > > optimistic spinning which was then removed in later kernel. Reader
> > > optimistic spinning may further increase writer wait time.
> > It's on 5.10 kernel having all the upstream f2fs patches, and yes, we wanted
> > to get higher priority on writer over many readers since the writer, checkpoint,
> > is the most latency-critical operation that can block all the other filesystem
> > operations.
>
> v5.10 kernel still have reader optimistic spinning enabled in rwsem which
> may have worsen the writer wait time. Could you try with a more up-to-date
> kernel or backport the relevant rwsem patches into your test kernel to see
> how much it can help?

We're using https://android.googlesource.com/kernel/common/+/refs/heads/android12-5.10.
By any chance, may I ask which upstream patches we need to backport?

>
>
> > > Anyway, AFAICS, this patch keeps readers out of the rwsem wait queue and so
> > > only writers can go into it. We can make an unfair rwsem to give preference
> > > to writers in the wait queue and wake up readers only if there is no more
> > > writers in the wait queue or even in the optimistic spinning queue. That
> > > should achieve the same effect as this patch.
> > Can we get a patch for the variant to test a bit? Meanwhile, I think we can
> > merge this patch to add a wraper first and switches to it later?
>
> Give me a week or so and I can make a RFC patch to support unfair rwsem for
> you to try out. You do need to try it on the latest kernel, though.

Thank you so much. My thought flow is applying this in f2fs for all the old
kernels shipped in Android devices. Then, we can try to backport upstream
rwsem patches and yours to switch finally. Let me know if you have any concern.

>
> Cheers,
> longman