Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

From: van der Linden, Frank
Date: Wed Aug 22 2018 - 12:45:00 EST

On 8/22/18 7:27 AM, Jens Axboe wrote:
> On 8/22/18 6:54 AM, Holger Hoffstätte wrote:
>> On 08/22/18 06:10, Jens Axboe wrote:
>>> [...]
>>> If you have time, please look at the 3 patches I posted earlier today.
>>> Those are for mainline, so should be OK :-)
>> I'm just playing along at home but with those 3 I get repeatable
>> hangs & writeback not starting at all, but curiously *only* on my btrfs
>> device; for inexplicable reasons some other devices with ext4/xfs flush
>> properly. Yes, that surprised me too, but it's repeatable.
>> Now this may or may not have something to do with some of my in-testing
>> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
>> is golden again. Not eager to repeat since it hangs sync & requires a
>> hard reboot.. :(
>> Just thought you'd like to know.
> Thanks, that's very useful info! I'll see if I can reproduce that.
I think it might be useful to kind of give a dump of what we discussed
before this patch was sent, there was a little more than was in the

We saw hangs and heavy lock contention in the wbt code under a specific
workload, on XFS. Crash dump analysis showed the following issues:

1) wbt_done uses wake_up_all, which causes a thundering herd
2) __wbt_wait sets up a wait queue with the auto remove wake function
(via DEFINE_WAIT), which caused two problems:
* combined with the use of wake_up_all, the wait queue would
essentially be randomly reordered for tasks that did not get to run
* the waitqueue_active check in may_queue was not valid with the auto
remove function, which could lead incoming tasks with requests to
overtake existing requests

1) was fixed by using a plain wake_up
2) was fixed by keeping tasks on the queue until they could break out of
the wait loop in __wbt_wait

The random reordering, causing task starvation in __wbt_wait, was the
main problem. Simply not using the auto remove wait function, e.g.
*only* changing DEFINE_WAIT(wait) to DEFINE_WAIT_FUNC(wait,
default_wake_function), fixed the hang / task starvation issue in our
tests. But there was still more lock contention than there should be, so
we also changed wake_up_all to wake_up.

It might be useful to run your tests with only the DEFINE_WAIT change I
describe above added to the original code to see if that still has any
problems. That would give a good datapoint whether any remaining issues
are due to missed wakeups or not.

There is the issue of making forward progress, or at least making it
fast enough. With the changes as they stand now, you could come up with
a scenario where the throttling limit is hit, but then is raised. Since
there might still be a wait queue, you could end up putting each
incoming task to sleep, even though it's not needed.

One way to guarantee that the wait queue clears up as fast as possible,
without resorting to wakeup_all, is to use wakeup_nr, where the number
of tasks to wake up is (limit - inflight).

Also, to avoid having tasks going back to sleep in the loop, you could
do what you already proposed - always just sleep at most once, and
unconditionally proceed after waking up. To avoid incoming tasks
overtaking the ones that are being woken up, you could have wbt_done
increment inflight, effectively reserving a spot for the tasks that are
about to be woken up.

Another thing I thought about was recording the number of waiters in the
wait queue, and modify the check from (inflight < limit) to (inflight <
(limit - nwaiters)), and no longer use any waitqueue_active checks.

The condition checks are of course complicated by the fact that
condition manipulation is not always done under the same lock (e.g.
wbt_wait can be called with a NULL lock).

So, these are just some of the things to consider here - maybe there's
nothing in there that you hadn't already considered, but I thought it'd
be useful to summarize them.

Thanks for looking in to this!