Re: [PATCHSET v5] Make background writeback great again for the first time

From: Jens Axboe
Date: Thu Apr 28 2016 - 14:46:50 EST


On 04/28/2016 05:54 AM, Jan Kara wrote:
On Wed 27-04-16 14:59:15, Jens Axboe wrote:
On Wed, Apr 27 2016, Jens Axboe wrote:
On Wed, Apr 27 2016, Jens Axboe wrote:
On 04/27/2016 12:01 PM, Jan Kara wrote:
Hi,

On Tue 26-04-16 09:55:23, Jens Axboe wrote:
Since the dawn of time, our background buffered writeback has sucked.
When we do background buffered writeback, it should have little impact
on foreground activity. That's the definition of background activity...
But for as long as I can remember, heavy buffered writers have not
behaved like that. For instance, if I do something like this:

$ dd if=/dev/zero of=foo bs=1M count=10k

on my laptop, and then try and start chrome, it basically won't start
before the buffered writeback is done. Or, for server oriented
workloads, where installation of a big RPM (or similar) adversely
impacts database reads or sync writes. When that happens, I get people
yelling at me.

I have posted plenty of results previously, I'll keep it shorter
this time. Here's a run on my laptop, using read-to-pipe-async for
reading a 5g file, and rewriting it. You can find this test program
in the fio git repo.

I have tested your patchset on my test system. Generally I have observed
noticeable drop in average throughput for heavy background writes without
any other disk activity and also somewhat increased variance in the
runtimes. It is most visible on this simple testcases:

dd if=/dev/zero of=/mnt/file bs=1M count=10000

and

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync

The machine has 4GB of ram, /mnt is an ext3 filesystem that is freshly
created before each dd run on a dedicated disk.

Without your patches I get pretty stable dd runtimes for both cases:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 87.9611 87.3279 87.2554

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 93.3502 93.2086 93.541

With your patches the numbers look like:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 108.183, 97.184, 99.9587

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 104.9, 102.775, 102.892

I have checked whether the variance is due to some interaction with CFQ
which is used for the disk. When I switched the disk to deadline, I still
get some variance although, the throughput is still ~10% lower:

dd if=/dev/zero of=/mnt/file bs=1M count=10000
Runtimes: 100.417 100.643 100.866

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtimes: 104.208 106.341 105.483

The disk is rotational SATA drive with writeback cache, queue depth of the
disk reported in /sys/block/sdb/device/queue_depth is 1.

So I think we still need some tweaking on the low end of the storage
spectrum so that we don't lose 10% of throughput for simple cases like
this.

Thanks for testing, Jan! I haven't tried old QD=1 SATA. I wonder if
you are seeing smaller requests, and that is why it both varies and
you get lower throughput? I'll try and setup a test here similar to
yours.

Jan, care to try the below patch? I can't fully reproduce your issue on
a SCSI disk limited to QD=1, but I have a feeling this might help. It's
a bit of a hack, but the general idea is to allow one more request to
build up for QD=1 devices. That eliminates wait time between one request
finishing, and the next being submitted.

That accidentally added a potentially stall, this one is both cleaner
and should have that fixed.

..
- rwb->wb_max = 1 + ((depth - 1) >> min(31U, rwb->scale_step));
- rwb->wb_normal = (rwb->wb_max + 1) / 2;
- rwb->wb_background = (rwb->wb_max + 3) / 4;
+ if (rwb->queue_depth == 1) {
+ rwb->wb_max = rwb->wb_normal = 2;
+ rwb->wb_background = 1;

This breaks the detection of too big scale_step in scale_up() where we key
of wb_max == 1 value. However even with that fixed no luck :(:

Yeah, I need to look at that. For QD=1, I think the only sensible values for max/normal/bg is 2/2/1 and 1/1/1 if we step down.

dd if=/dev/zero of=/mnt/file bs=1M count=10000 conv=fsync
Runtime: 105.126 107.125 105.641

So about the same as before. I'll try to debug this later today...

Thanks, I'm very interested in what you find!

--
Jens Axboe