Re: [PATCH 10/12] writeback: only allow one inflight and pending full flush
From: Jens Axboe
Date: Thu Sep 28 2017 - 20:16:08 EST
On 09/28/2017 11:41 PM, Andrew Morton wrote:
> On Wed, 27 Sep 2017 14:13:57 -0600 Jens Axboe <axboe@xxxxxxxxx> wrote:
>
>> When someone calls wakeup_flusher_threads() or
>> wakeup_flusher_threads_bdi(), they schedule writeback of all dirty
>> pages in the system (or on that bdi). If we are tight on memory, we
>> can get tons of these queued from kswapd/vmscan. This causes (at
>> least) two problems:
>>
>> 1) We consume a ton of memory just allocating writeback work items.
>> We've seen as much as 600 million of these writeback work items
>> pending. That's a lot of memory to pointlessly hold hostage,
>> while the box is under memory pressure.
>>
>> 2) We spend so much time processing these work items, that we
>> introduce a softlockup in writeback processing. This is because
>> each of the writeback work items don't end up doing any work (it's
>> hard when you have millions of identical ones coming in to the
>> flush machinery), so we just sit in a tight loop pulling work
>> items and deleting/freeing them.
>>
>> Fix this by adding a 'start_all' bit to the writeback structure, and
>> set that when someone attempts to flush all dirty pages. The bit is
>> cleared when we start writeback on that work item. If the bit is
>> already set when we attempt to queue !nr_pages writeback, then we
>> simply ignore it.
>>
>> This provides us one full flush in flight, with one pending as well,
>> and makes for more efficient handling of this type of writeback.
>>
>> ...
>>
>> @@ -953,12 +954,27 @@ static void wb_start_writeback(struct bdi_writeback *wb, bool range_cyclic,
>> return;
>>
>> /*
>> + * All callers of this function want to start writeback of all
>> + * dirty pages. Places like vmscan can call this at a very
>> + * high frequency, causing pointless allocations of tons of
>> + * work items and keeping the flusher threads busy retrieving
>> + * that work. Ensure that we only allow one of them pending and
>> + * inflight at the time. It doesn't matter if we race a little
>> + * bit on this, so use the faster separate test/set bit variants.
>> + */
>> + if (test_bit(WB_start_all, &wb->state))
>> + return;
>> +
>> + set_bit(WB_start_all, &wb->state);
>
> test_and_set_bit()?
Like Linus says, this is done purposely. I've even included a bit about
it in the comment above, though maybe it's not clear enough. I've used
this trick in blk-mq quite a bit as well, and for high frequency calls,
it can make a substantial difference not to redirty that cache line if
you can avoid it.
If you do care about atomicity, this works really well too:
if (test_bit(bit, addr) || test_and_set_bit(bit, addr))
...
just to avoid the locked operation. Also see this commit:
commit 7fcbbaf18392f0b17c95e2f033c8ccf87eecde1d
Author: Jens Axboe <axboe@xxxxxx>
Date: Thu May 22 11:54:16 2014 -0700
mm/filemap.c: avoid always dirtying mapping->flags on O_DIRECT
where there are some actual numbers on a specific case.
For the case at hand, we don't even need to do the test_and_set
case, since we don't care about a small race there.
--
Jens Axboe