Re: [PATCH] blk-wbt: Avoid lock contention and thundering herd issue in wbt_wait

From: Jens Axboe
Date: Wed Aug 22 2018 - 17:06:05 EST

On 8/22/18 2:26 PM, Anchal Agarwal wrote:
> On Wed, Aug 22, 2018 at 11:30:40AM -0600, Jens Axboe wrote:
>> On 8/22/18 10:42 AM, van der Linden, Frank wrote:
>>> On 8/22/18 7:27 AM, Jens Axboe wrote:
>>>> On 8/22/18 6:54 AM, Holger Hoffst??tte wrote:
>>>>> On 08/22/18 06:10, Jens Axboe wrote:
>>>>>> [...]
>>>>>> If you have time, please look at the 3 patches I posted earlier today.
>>>>>> Those are for mainline, so should be OK :-)
>>>>> I'm just playing along at home but with those 3 I get repeatable
>>>>> hangs & writeback not starting at all, but curiously *only* on my btrfs
>>>>> device; for inexplicable reasons some other devices with ext4/xfs flush
>>>>> properly. Yes, that surprised me too, but it's repeatable.
>>>>> Now this may or may not have something to do with some of my in-testing
>>>>> patches for btrfs itself, but if I remove those 3 wbt fixes, everything
>>>>> is golden again. Not eager to repeat since it hangs sync & requires a
>>>>> hard reboot.. :(
>>>>> Just thought you'd like to know.
>>>> Thanks, that's very useful info! I'll see if I can reproduce that.
>>> I think it might be useful to kind of give a dump of what we discussed
>>> before this patch was sent, there was a little more than was in the
>>> description.
>>> We saw hangs and heavy lock contention in the wbt code under a specific
>>> workload, on XFS. Crash dump analysis showed the following issues:
>>> 1) wbt_done uses wake_up_all, which causes a thundering herd
>>> 2) __wbt_wait sets up a wait queue with the auto remove wake function
>>> (via DEFINE_WAIT), which caused two problems:
>>> * combined with the use of wake_up_all, the wait queue would
>>> essentially be randomly reordered for tasks that did not get to run
>>> * the waitqueue_active check in may_queue was not valid with the auto
>>> remove function, which could lead incoming tasks with requests to
>>> overtake existing requests
>>> 1) was fixed by using a plain wake_up
>>> 2) was fixed by keeping tasks on the queue until they could break out of
>>> the wait loop in __wbt_wait
>>> The random reordering, causing task starvation in __wbt_wait, was the
>>> main problem. Simply not using the auto remove wait function, e.g.
>>> *only* changing DEFINE_WAIT(wait) to DEFINE_WAIT_FUNC(wait,
>>> default_wake_function), fixed the hang / task starvation issue in our
>>> tests. But there was still more lock contention than there should be, so
>>> we also changed wake_up_all to wake_up.
>>> It might be useful to run your tests with only the DEFINE_WAIT change I
>>> describe above added to the original code to see if that still has any
>>> problems. That would give a good datapoint whether any remaining issues
>>> are due to missed wakeups or not.
>>> There is the issue of making forward progress, or at least making it
>>> fast enough. With the changes as they stand now, you could come up with
>>> a scenario where the throttling limit is hit, but then is raised. Since
>>> there might still be a wait queue, you could end up putting each
>>> incoming task to sleep, even though it's not needed.
>>> One way to guarantee that the wait queue clears up as fast as possible,
>>> without resorting to wakeup_all, is to use wakeup_nr, where the number
>>> of tasks to wake up is (limit - inflight).
>>> Also, to avoid having tasks going back to sleep in the loop, you could
>>> do what you already proposed - always just sleep at most once, and
>>> unconditionally proceed after waking up. To avoid incoming tasks
>>> overtaking the ones that are being woken up, you could have wbt_done
>>> increment inflight, effectively reserving a spot for the tasks that are
>>> about to be woken up.
>>> Another thing I thought about was recording the number of waiters in the
>>> wait queue, and modify the check from (inflight < limit) to (inflight <
>>> (limit - nwaiters)), and no longer use any waitqueue_active checks.
>>> The condition checks are of course complicated by the fact that
>>> condition manipulation is not always done under the same lock (e.g.
>>> wbt_wait can be called with a NULL lock).
>>> So, these are just some of the things to consider here - maybe there's
>>> nothing in there that you hadn't already considered, but I thought it'd
>>> be useful to summarize them.
>>> Thanks for looking in to this!
>> It turned out to be an unrelated problem with rq reordering in blk-mq,
>> mainline doesn't have it.
>> So I think the above change is safe and fine, but we definitely still
>> want the extra change of NOT allowing a queue token for the initial loop
>> inside __wbt_wait() for when we have current sleepers on the queue.
>> Without that, the initial check in __wbt_wait() is not useful at all.
>> --
>> Jens Axboe
> Hi Jens,
> I tested your patches in my environment and they look good. There is no sudden increase in
> lock contention either. Thanks for catching the corner case though.

Thanks for testing. Can I add your tested-by to the 3 patches?

Jens Axboe