On Tue, Mar 22, 2016 at 04:03:28PM -0600, Jens Axboe wrote:
> On 03/22/2016 03:51 PM, Dave Chinner wrote:
> > On Tue, Mar 22, 2016 at 11:55:14AM -0600, Jens Axboe wrote:
> > > This patchset isn't so much a final solution as it is a demonstration
> > > of what I believe is a huge issue. Since the dawn of time, our
> > > background buffered writeback has sucked. When we do background
> > > buffered writeback, it should have little impact on foreground
> > > activity. That's the definition of background activity... But for as
> > > long as I can remember, heavy buffered writers have not behaved like
> > > that.
> > Of course not. The IO scheduler is supposed to determine how we
> > meter out bulk vs latency sensitive IO that is queued. That's what
> > all the things like anticipatory scheduling for read requests were
> > supposed to address....
> >
> > I'm guessing you're seeing problems like this because blk-mq has no
> > IO scheduler infrastructure and so no way of prioritising,
> > scheduling and/or throttling different types of IO? Would that be
> > accurate?
> It's not just that, but obviously the IO scheduler would be one
> place to throttle it. This, in a way, is a way of scheduling the
> writeback writes better. But most of the reports I get on writeback
> sucking are not using scsi/blk-mq; they end up being "classic" on
> things like deadline.

Deadline doesn't have anticipatory read scheduling, right?

Really, I'm just trying to understand why this isn't being added as
part of the IO scheduler infrastructure, but is instead adding
another layer of non-optional IO scheduling to the block layer...

> > > The read starts out fine, but goes to shit when we start background
> > > flushing. The reader experiences latency spikes in the seconds range.
> > > With this set of patches applied, the situation looks like this instead:
> > >
> > > --io---- -system-- ------cpu-----
> > > bi bo in cs us sy id wa st
> > > 33544 0 8650 17204 0 1 97 2 0
> > > 42488 0 10856 21756 0 0 97 3 0
> > > 42032 0 10719 21384 0 0 97 3 0
> > > 42544 12 10838 21631 0 0 97 3 0
> > > 42620 0 10982 21727 0 3 95 3 0
> > > 46392 0 11923 23597 0 3 94 3 0
> > > 36268 512000 9907 20044 0 3 91 5 0
> > > 31572 696324 8840 18248 0 1 91 7 0
> > > 30748 626692 8617 17636 0 2 91 6 0
> > > 31016 618504 8679 17736 0 3 91 6 0
> > > 30612 648196 8625 17624 0 3 91 6 0
> > > 30992 650296 8738 17859 0 3 91 6 0
> > > 30680 604075 8614 17605 0 3 92 6 0
> > > 30592 595040 8572 17564 0 2 92 6 0
> > > 31836 539656 8819 17962 0 2 92 5 0
> > And now it runs at ~600MB/s, slowing down the rate at which memory
> > is cleaned by 60%.

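(As a quick sanity check on that ~600MB/s figure, one can average the bo column — blocks written out, in KiB/s — over the rows in the vmstat sample above where background flushing is active:)

```python
# Average the "bo" (blocks out, KiB/s) column from the vmstat rows
# above where background flushing is running, to check the ~600MB/s
# figure quoted for the patched kernel.
bo = [512000, 696324, 626692, 618504, 648196, 650296, 604075, 595040, 539656]
avg_mib_s = sum(bo) / len(bo) / 1024
print(f"{avg_mib_s:.0f} MiB/s")  # prints "596 MiB/s", i.e. roughly 600MB/s
```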
> Which is the point, correct... If we're not anywhere near being
> tight on memory AND nobody is waiting for this IO, then by
> definition, the foreground activity is the important one. For the
> case used here, that's the application doing reads.
> > Unless, of course, we are in a situation where there is also large
> > memory demand, and we need to clean memory fast....
> >
> > Given that background writeback is relied on by memory reclaim to
> > clean memory faster than the LRUs are cycled, I suspect this is
> > going to have a big impact on low memory behaviour and balance,
> > which will then feed into IO breakdown problems caused by writeback
> > being driven from the LRUs rather than the flusher threads.....
> You're missing the part where the intent is to only throttle it
> heavily when it's pure background writeback. Of course, if we are
> low on memory and doing reclaim, we should get much closer to device
> bandwidth.

A demonstration, please. I didn't see anything in the code that
treats low memory conditions differently - the reclaim path just uses
do_try_to_free_pages() to trigger background writeback to run and
clean pages, so I'm interested to see exactly how that works out...

> If I run the above dd without the reader running, I'm already at 90%
> of the device bandwidth - not quite all the way there, since I still
> want to quickly be able to inject reads (or other IO) without having
> to wait for the queues to purge thousands of requests.

So, essentially, the model is to run background writeback at "near
starvation" queue depths, which works fine when the system is mostly
idle and we can dispatch more IO immediately. My concern with this
model is that under heavy IO and CPU load, writeback dispatch often
has significant delays (e.g. for allocation, etc). This is when we
need deeper queue depths to maintain throughput across dispatch
delays.

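To put rough numbers on that concern (a toy model with illustrative figures of my own, not anything measured from the patches): a queue of depth N holding requests with service time s can keep the device busy for N * s while dispatch is stalled, so that product bounds the stall the device can ride out without going idle.

```python
# Toy model (illustrative numbers only): a queue of depth N whose
# requests each take service_ms to complete gives the device
# N * service_ms of buffered work, which is the longest dispatch
# stall it can absorb before going idle and losing throughput.
def stall_budget_ms(depth, service_ms):
    return depth * service_ms

# Assuming ~1 ms per request purely for illustration:
print(stall_budget_ms(8, 1.0))    # near-starvation depth: 8.0 ms of slack
print(stall_budget_ms(128, 1.0))  # deep queue: 128.0 ms of slack
```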
Many production workloads don't care about read latency, but do care
about bulk page cache throughput. Such workloads are going to be
adversely affected by a fundamental block layer IO dispatch model
change like this. This is why we have the pluggable IO schedulers in
the first place - one size does not fit all.
Hence I'm thinking that this should not be applied to all block
devices as this patch does, but should instead be part of the IO
scheduling infrastructure we already have (and need for blk-mq).