Re: [PATCH] f2fs: move f2fs to use reader-unfair rwsems

From: Waiman Long
Date: Tue Jan 11 2022 - 10:04:44 EST



On 1/11/22 01:53, Tim Murray wrote:
> On Mon, Jan 10, 2022 at 8:15 PM Waiman Long <longman@xxxxxxxxxx> wrote:
>> That is not how rwsem works. A reader that fails to get the lock
>> because it is write-locked will remove its reader count before going to
>> sleep, so the reader count will eventually be zero. Of course, there is
>> a short period of time during which the reader count is non-zero, until
>> the reader removes its own reader count. So if a new writer comes in at
>> that time, it will fail its initial trylock and probably go into
>> optimistic spinning mode. If the writer that owns the lock releases it
>> at the right moment, the reader may acquire the read lock.
> Thanks for the correction, that makes sense. I haven't spent much
> time on rwsem internals and I'm not confident about when flags are set
> and cleared in sem->count; is there a case where sem->count after
> up_write() could be nonzero?
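Yes. Besides the reader count, sem->count also carries the writer-locked,
waiters-present and handoff bits; up_write() only subtracts the writer bit,
so a leftover waiters bit (and any reader count granted at wakeup, more on
that below) keeps it non-zero. Here is a rough userspace model of the
layout and of the reader backout described above; this is a simplified
sketch, not the actual kernel/locking/rwsem.c code:

#include <stdatomic.h>
#include <stdbool.h>

/*
 * Rough model of sem->count in recent upstream kernels:
 *   bit 0    - writer locked
 *   bit 1    - waiters present
 *   bit 2    - lock handoff
 *   bits 8+  - reader count (each reader adds READER_BIAS)
 */
#define WRITER_LOCKED   (1UL << 0)
#define FLAG_WAITERS    (1UL << 1)
#define FLAG_HANDOFF    (1UL << 2)
#define READER_BIAS     (1UL << 8)

struct sem_model { _Atomic unsigned long count; };

/*
 * Reader: optimistically add a reader count, then back it out if the
 * lock turns out to be write-locked. Between the add and the sub the
 * reader count is transiently non-zero, which is the window where a
 * new writer's trylock fails.
 */
bool try_read_lock(struct sem_model *sem)
{
        unsigned long c;

        c = atomic_fetch_add(&sem->count, READER_BIAS) + READER_BIAS;
        if (c & WRITER_LOCKED) {
                atomic_fetch_sub(&sem->count, READER_BIAS);
                return false;   /* the real code queues and sleeps here */
        }
        return true;
}

/*
 * Writer: the trylock only succeeds when the count is exactly zero,
 * so even a transient reader count (or a leftover waiters bit after
 * up_write()) defeats it.
 */
bool try_write_lock(struct sem_model *sem)
{
        unsigned long expected = 0;

        return atomic_compare_exchange_strong(&sem->count, &expected,
                                              WRITER_LOCKED);
}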

> An example from one trace:
>
> 1. Low-priority userspace thread 4764 is blocked in f2fs_unlink,
> probably at f2fs_lock_op, which is a wrapper around
> down_read(cp_rwsem) (see the sketch after this list).
> 2. f2fs-ckpt runs at t=0ms and wakes thread 4764, making it runnable.
> 3. At t=1ms, f2fs-ckpt enters uninterruptible sleep and blocks at
> rwsem_down_write_slowpath per sched_blocked_reason.
> 4. At t=26ms, thread 4764 runs for the first time since being made
> runnable. Within 40us, thread 4764 unblocks f2fs-ckpt and makes it
> runnable.
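
For reference, the wrapper in step 1 is, if I am reading fs/f2fs/f2fs.h
correctly, just the read side of cp_rwsem (slightly simplified here):

/* Simplified from fs/f2fs/f2fs.h (before this patch): most metadata
 * operations bracket their work with these, so fsync, unlink, truncate
 * and friends all take cp_rwsem for read. */
static inline void f2fs_lock_op(struct f2fs_sb_info *sbi)
{
        down_read(&sbi->cp_rwsem);
}

static inline void f2fs_unlock_op(struct f2fs_sb_info *sbi)
{
        up_read(&sbi->cp_rwsem);
}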

> Since thread 4764 is awakened by f2fs-ckpt but never runs before it
> unblocks f2fs-ckpt in down_write_slowpath(), the only idea I had is
> that cp_rwsem->count is nonzero after f2fs-ckpt's up_write() in step 2
> (maybe because of rwsem_mark_wake()?).
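
That guess is likely right. When up_write() finds waiting readers, the
waker grants the read lock on their behalf: rwsem_mark_wake() adds one
reader count per to-be-woken reader before any of them gets to run.
Reusing the model above, the relevant step looks roughly like this
(simplified; handoff and accounting details omitted):

/*
 * Sketch of the reader-grant step in the wakeup path (simplified from
 * rwsem_mark_wake() in kernel/locking/rwsem.c): the reader counts are
 * added by the waker, so sem->count is already non-zero right after
 * up_write(), before any woken reader has actually run. A writer that
 * calls down_write() in this window must block, exactly as f2fs-ckpt
 * does in step 3 above.
 */
void mark_wake_readers(struct sem_model *sem, unsigned long nr_readers)
{
        atomic_fetch_add(&sem->count, nr_readers * READER_BIAS);

        /* ... then wake_up_process() each of the nr_readers sleepers ... */
}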

>> I do have a question about the number of readers in such a case
>> compared with the number of writers. Are there a large number of
>> low-priority readers hanging around? What is the average read-lock
>> hold time?
>>
>> Blocking for 9.7s for a write lock is quite excessive, and we need to
>> figure out how this happens.
> Just to be 100% clear, it's not a single 9.7s stall, it's many smaller
> stalls of 10-500+ms in f2fs-ckpt that add up to 9.7s over that range.

> f2fs is not my area of expertise, but my understanding is that
> cp_rwsem in f2fs has many (potentially unbounded) readers and a single
> writer. Arbitrary userspace work (fsync, creating/deleting/truncating
> files, atomic writes) may grab the read lock, but assuming the
> merge_checkpoint option is enabled, only f2fs-ckpt will ever grab the
> write lock during normal operation. However, in this particular
> example, it looks like there may have been 5-10 threads blocked on
> f2fs-ckpt that were awakened alongside thread 4764 in step 2.
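
And the write side, again slightly simplified from fs/f2fs/f2fs.h, is
the checkpoint path's exclusive lock:

/* Simplified from fs/f2fs/f2fs.h: the checkpoint path
 * (block_operations()) takes the write side to stop all
 * f2fs_lock_op() users while the checkpoint is written out. */
static inline void f2fs_lock_all(struct f2fs_sb_info *sbi)
{
        down_write(&sbi->cp_rwsem);
}

static inline void f2fs_unlock_all(struct f2fs_sb_info *sbi)
{
        up_write(&sbi->cp_rwsem);
}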

> I'll defer to the f2fs experts on the average duration that the read
> lock is held.

Thanks for the explanation.

Another question that I have is whether the test result is based on the latest upstream kernel or an earlier kernel version. We used to allow reader optimistic spinning, which was removed in later kernels; reader optimistic spinning may further increase writer wait time.

Anyway, AFAICS, this patch keeps readers out of the rwsem wait queue, so only writers can go into it. We could instead make an unfair rwsem that gives preference to writers in the wait queue, waking readers only when there are no more writers in the wait queue or spinning optimistically. That should achieve the same effect as this patch.
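
In pseudo-C, the wakeup policy I have in mind would look something like
the sketch below; wake_first_writer() and wake_all_readers() are
hypothetical helpers, not existing kernel APIs:

/*
 * Sketch of a writer-preference wakeup policy for an unfair rwsem:
 * hand the lock to a queued writer if one exists (or let a spinning
 * writer take it), and only wake readers when no writer is waiting
 * anywhere. Readers can starve by design, which is the same trade-off
 * the reader-unfair patch makes.
 */
struct unfair_rwsem_waiters {
        struct list_head writers;       /* queued down_write() callers */
        struct list_head readers;       /* queued down_read() callers */
        bool spinning_writer;           /* a writer in optimistic spin */
};

static void unfair_wakeup(struct unfair_rwsem_waiters *q)
{
        if (!list_empty(&q->writers)) {
                wake_first_writer(q);   /* hypothetical helper */
                return;
        }
        if (q->spinning_writer)
                return;                 /* let the spinner grab the lock */

        wake_all_readers(q);            /* hypothetical helper */
}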

Cheers,
Longman