[no subject]
From: Jens Axboe
Date: Wed Mar 23 2016 - 11:27:32 EST
This patchset isn't so much a final solution as it is a demonstration
of what I believe is a huge issue. Since the dawn of time, our
background buffered writeback has sucked. When we do background
buffered writeback, it should have little impact on foreground
activity. That's the definition of background activity... But for as
long as I can remember, heavy buffered writers have not behaved like
that. For instance, if I do something like this:
$ dd if=/dev/zero of=foo bs=1M count=10k
on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. The same goes for server-oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.
A quick demonstration - a fio job that reads a file, while someone
else issues the above 'dd'. Run on a flash device, using XFS. The
vmstat output looks something like this:
-----io------ --system--- ------cpu------
   bi      bo    in    cs us sy  id wa st
  156    4648    58   151  0  1  98  1  0
    0       0    64    83  0  0 100  0  0
    0      32    76   119  0  0 100  0  0
26616       0  7574 13907  7  0  91  2  0
41992       0 10811 21395  0  2  95  3  0
46040       0 11836 23395  0  3  94  3  0
19376 1310736  5894 10080  0  4  93  3  0
  116 1974296  1858   455  0  4  93  3  0
  124 2020372  1964   545  0  4  92  4  0
  112 1678356  1955   620  0  3  93  3  0
 8560  405508  3759  4756  0  1  96  3  0
42496       0 10798 21566  0  0  97  3  0
42476       0 10788 21524  0  0  97  3  0
The read starts out fine, but goes to shit when we start background
flushing. The reader experiences latency spikes in the seconds range.
On flash.
With this set of patches applied, the situation looks like this instead:
-----io------ --system--- ------cpu------
   bi      bo    in    cs us sy  id wa st
33544       0  8650 17204  0  1  97  2  0
42488       0 10856 21756  0  0  97  3  0
42032       0 10719 21384  0  0  97  3  0
42544      12 10838 21631  0  0  97  3  0
42620       0 10982 21727  0  3  95  3  0
46392       0 11923 23597  0  3  94  3  0
36268  512000  9907 20044  0  3  91  5  0
31572  696324  8840 18248  0  1  91  7  0
30748  626692  8617 17636  0  2  91  6  0
31016  618504  8679 17736  0  3  91  6  0
30612  648196  8625 17624  0  3  91  6  0
30992  650296  8738 17859  0  3  91  6  0
30680  604075  8614 17605  0  3  92  6  0
30592  595040  8572 17564  0  2  92  6  0
31836  539656  8819 17962  0  2  92  5  0
and the reader never sees latency spikes above a few milliseconds.
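To put a number on the two vmstat samples above, here is a small Python
sketch that averages the read rate (bi) over the intervals where
writeback is active. The helper names and the bo threshold of 100000 are
assumptions made for illustration; they are not part of the patchset or
of vmstat, and the data is copied from the samples above.

```python
# Rough comparison of reader throughput while writeback is running.
# The data below is copied from the two vmstat samples in this mail.

BEFORE = """
156 4648 58 151 0 1 98 1 0
0 0 64 83 0 0 100 0 0
0 32 76 119 0 0 100 0 0
26616 0 7574 13907 7 0 91 2 0
41992 0 10811 21395 0 2 95 3 0
46040 0 11836 23395 0 3 94 3 0
19376 1310736 5894 10080 0 4 93 3 0
116 1974296 1858 455 0 4 93 3 0
124 2020372 1964 545 0 4 92 4 0
112 1678356 1955 620 0 3 93 3 0
8560 405508 3759 4756 0 1 96 3 0
42496 0 10798 21566 0 0 97 3 0
42476 0 10788 21524 0 0 97 3 0
"""

AFTER = """
33544 0 8650 17204 0 1 97 2 0
42488 0 10856 21756 0 0 97 3 0
42032 0 10719 21384 0 0 97 3 0
42544 12 10838 21631 0 0 97 3 0
42620 0 10982 21727 0 3 95 3 0
46392 0 11923 23597 0 3 94 3 0
36268 512000 9907 20044 0 3 91 5 0
31572 696324 8840 18248 0 1 91 7 0
30748 626692 8617 17636 0 2 91 6 0
31016 618504 8679 17736 0 3 91 6 0
30612 648196 8625 17624 0 3 91 6 0
30992 650296 8738 17859 0 3 91 6 0
30680 604075 8614 17605 0 3 92 6 0
30592 595040 8572 17564 0 2 92 6 0
31836 539656 8819 17962 0 2 92 5 0
"""


def parse_vmstat(text):
    """Return a list of (bi, bo) tuples, skipping non-numeric lines."""
    rows = []
    for line in text.strip().splitlines():
        fields = line.split()
        if len(fields) != 9 or not all(f.isdigit() for f in fields):
            continue  # header or malformed line
        rows.append((int(fields[0]), int(fields[1])))
    return rows


def mean_bi_during_writeback(rows, bo_threshold=100_000):
    """Average bi over samples whose bo is at or above the threshold.

    bo_threshold is an arbitrary cut-off for "writeback is active".
    """
    busy = [bi for bi, bo in rows if bo >= bo_threshold]
    return sum(busy) / len(busy) if busy else None


if __name__ == "__main__":
    for name, text in (("before", BEFORE), ("after", AFTER)):
        print(name, mean_bi_during_writeback(parse_vmstat(text)))
```

This only looks at throughput per one-second sample. It says nothing
about the per-IO latency spikes, which need something like fio's latency
reporting to see.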
The above was the why. The how is basically throttling background
writeback. We still want to issue big writes from the vm side of things,
so we get nice and big extents on the file system end. But we don't need
to flood the device with THOUSANDS of requests for background writeback.
For most devices, we don't need a whole lot to get decent throughput.
This adds some simple blk-wb code that limits how much buffered
writeback we keep in flight on the device end. The default is pretty
low. If we end up switching to WB_SYNC_ALL, we up the limits. If the
dirtying task ends up being throttled in balance_dirty_pages(), we up
the limit. If we need to reclaim memory, we up the limit. The cases
that need to clean memory at or near device speeds, they get to do
that. We still don't need thousands of requests to accomplish that.
And for the cases where we don't need to be near device limits, we
can clean at a more reasonable pace. Currently there are two tunables
associated with this, see the last patch for descriptions of those.
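The idea described above can be modelled in a few lines of userspace
Python. This is a toy sketch, not the blk-wb code: the class, the Reason
names and the limit values (4 and 16) are all hypothetical, and a
request that does not fit is simply refused here, whereas a real
implementation would have the submitter wait.

```python
from dataclasses import dataclass
from enum import Enum


class Reason(Enum):
    BACKGROUND = "background"  # plain background flushing
    THROTTLED = "throttled"    # dirtier is blocked in balance_dirty_pages()
    SYNC = "sync"              # WB_SYNC_ALL style writeback
    RECLAIM = "reclaim"        # memory reclaim needs pages cleaned


# Hypothetical limits, picked only for illustration: a low default for
# background writeback and a higher one for the cases that must clean
# memory at or near device speed.
LIMITS = {
    Reason.BACKGROUND: 4,
    Reason.THROTTLED: 16,
    Reason.SYNC: 16,
    Reason.RECLAIM: 16,
}


@dataclass
class WritebackThrottle:
    """Counts buffered writeback requests in flight on one device."""

    inflight: int = 0

    def may_queue(self, reason: Reason) -> bool:
        """Account one request if it fits under the limit for `reason`."""
        if self.inflight >= LIMITS[reason]:
            return False  # toy model: refuse instead of sleeping
        self.inflight += 1
        return True

    def complete(self) -> None:
        """Called when one in-flight request finishes."""
        if self.inflight == 0:
            raise RuntimeError("completion without a request in flight")
        self.inflight -= 1


if __name__ == "__main__":
    wb = WritebackThrottle()
    queued = sum(wb.may_queue(Reason.BACKGROUND) for _ in range(100))
    print("background requests admitted:", queued)
    print("reclaim admitted on top:", wb.may_queue(Reason.RECLAIM))
```

Even the raised limit stays far below thousands of requests, which is
the point: the queue depth needed for decent throughput is small.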
I welcome testing. The end goal here would be having much of this
auto-tuned, so that we don't lose substantial bandwidth for background
writes, while still maintaining decent non-wb performance and latencies.
The patchset should be fully stable; I have not observed problems. It
passes full xfstests runs, and a variety of benchmarks as well. It
should work equally well on blk-mq/scsi-mq, and "classic" setups.
You can also find this in a branch in the block git repo:
git://git.kernel.dk/linux-block.git wb-buf-throttle-v2
Patches are against current Linus' git, 4.5.0+.
Changes since v1
- Drop sync() WB_SYNC_NONE -> WB_SYNC_ALL change
- wb_start_writeback() fills in background/reclaim/sync info in
the writeback work, based on writeback reason.
- Use WRITE_SYNC for reclaim/sync IO
- Split balance_dirty_pages() sleep change into separate patch
- Drop get_request() u64 flag change, set the bit on the request
directly after-the-fact.
- Fix wrong sysfs return value
- Various small cleanups
 block/Makefile                   |   2
 block/blk-core.c                 |  15 ++
 block/blk-mq.c                   |  32 +++++
 block/blk-settings.c             |  11 +
 block/blk-sysfs.c                | 123 +++++++++++++++++++++
 block/blk-wb.c                   | 219 +++++++++++++++++++++++++++++++++++++++
 block/blk-wb.h                   |  27 ++++
 drivers/nvme/host/core.c         |   1
 drivers/scsi/sd.c                |   5
 fs/block_dev.c                   |   2
 fs/buffer.c                      |   2
 fs/f2fs/data.c                   |   2
 fs/f2fs/node.c                   |   2
 fs/fs-writeback.c                |  17 +++
 fs/gfs2/meta_io.c                |   3
 fs/mpage.c                       |   9 -
 fs/xfs/xfs_aops.c                |   2
 include/linux/backing-dev-defs.h |   2
 include/linux/blk_types.h        |   2
 include/linux/blkdev.h           |   7 +
 include/linux/writeback.h        |   8 +
 mm/page-writeback.c              |   2
 22 files changed, 479 insertions(+), 16 deletions(-)
--
Jens Axboe