Re: Write throughput impaired by touching dirty_ratio

From: Mark Hills
Date: Wed Jun 24 2015 - 18:52:57 EST


On Wed, 24 Jun 2015, Vlastimil Babka wrote:

> [add some CC's]
>
> On 06/19/2015 05:16 PM, Mark Hills wrote:
> > I noticed that any change to vm.dirty_ratio causes write throughput to
> > plummet -- to around 5Mbyte/sec.
> >
> > <system bootup, kernel 4.0.5>
> >
> > # dd if=/dev/zero of=/path/to/file bs=1M
> >
> > # sysctl vm.dirty_ratio
> > vm.dirty_ratio = 20
> > <all ok; writes at ~150Mbyte/sec>
> >
> > # sysctl vm.dirty_ratio=20
> > <all continues to be ok>
> >
> > # sysctl vm.dirty_ratio=21
> > <writes drop to ~5Mbyte/sec>
> >
> > # sysctl vm.dirty_ratio=20
> > <writes continue to be slow at ~5Mbyte/sec>
> >
> > The test shows that return to the previous value does not restore the old
> > behaviour. I return the system to usable state with a reboot.
> >
> > Reads continue to be fast and are not affected.
> >
> > A quick look at the code suggests differing behaviour from
> > writeback_set_ratelimit on startup. And that some of the calculations (eg.
> > global_dirty_limit) are badly behaved once the system has booted.
>
> Hmm, so the only thing that dirty_ratio_handler() changes except the
> vm_dirty_ratio itself, is ratelimit_pages through writeback_set_ratelimit(). So
> I assume the problem is with ratelimit_pages. There's num_online_cpus() used in
> the calculation, which I think would differ between the initial system state
> (where we are called by page_writeback_init()) and later when all CPU's are
> onlined. But I don't see CPU onlining code updating the limit (unlike memory
> hotplug which does that), so that's suspicious.
>
> Another suspicious thing is that global_dirty_limits() looks at current
> process's flag. It seems odd to me that the process calling the sysctl would
> determine a value global to the system.

Yes, I also spotted this. The fragment of code is:

tsk = current;
if (tsk->flags & PF_LESS_THROTTLE || rt_task(tsk)) {
	background += background / 4;
	dirty += dirty / 4;
}

It seems to imply the code was not always used from the /proc interface.
It's relevant in a moment...

> If you are brave enough (and have kernel configured properly and with
> debuginfo),

I'm brave... :) I hadn't seen this tool before, thanks for introducing me
to it, I will use it more now, I'm sure.

> you can verify how value of ratelimit_pages variable changes on the live
> system, using the crash tool. Just start it, and if everything works,
> you can inspect the live system. It's a bit complicated since there are
> two static variables called "ratelimit_pages" in the kernel so we can't
> print them easily (or I don't know how). First we have to get the
> variable address:
>
> crash> sym ratelimit_pages
> ffffffff81e67200 (d) ratelimit_pages
> ffffffff81ef4638 (d) ratelimit_pages
>
> One will be absurdly high (probably less on your 32bit) so it's not the one we want:
>
> crash> rd -d ffffffff81ef4638 1
> ffffffff81ef4638: 4294967328768
>
> The second will have a smaller value:
> (my system after boot with dirty ratio = 20)
> crash> rd -d ffffffff81e67200 1
> ffffffff81e67200: 1577
>
> (after changing to 21)
> crash> rd -d ffffffff81e67200 1
> ffffffff81e67200: 1570
>
> (after changing back to 20)
> crash> rd -d ffffffff81e67200 1
> ffffffff81e67200: 1496

In my case there's only one such symbol (perhaps because this kernel
config is quite slimmed down?)

crash> sym ratelimit_pages
c148b618 (d) ratelimit_pages

(bootup with dirty_ratio 20)
crash> rd -d ratelimit_pages
c148b618: 78

(after changing to 21)
crash> rd -d ratelimit_pages
c148b618: 16

(after changing back to 20)
crash> rd -d ratelimit_pages
c148b618: 16

Compared to your system, even the bootup value seems pretty low.

So I am new to this code, but I took a look. Seems like we're basically
hitting the lower bound of 16.

void writeback_set_ratelimit(void)
{
	unsigned long background_thresh;
	unsigned long dirty_thresh;
	global_dirty_limits(&background_thresh, &dirty_thresh);
	global_dirty_limit = dirty_thresh;
	ratelimit_pages = dirty_thresh / (num_online_cpus() * 32);
	if (ratelimit_pages < 16)
		ratelimit_pages = 16;
}

From this code, we don't have dirty_thresh preserved, but we do have
global_dirty_limit:

crash> rd -d global_dirty_limit
c1545080: 0

And if that is zero then:

ratelimit_pages = 0 / (num_online_cpus() * 32)
= 0

which the lower-bound check then raises to 16, the value read back above.

So it seems like this is the path to follow.
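
As a rough sketch of that arithmetic in userspace C (not kernel code;
model_ratelimit_pages is a name made up here, and the CPU count is passed
in as a parameter instead of calling num_online_cpus()):

```c
/*
 * Userspace model of the calculation in writeback_set_ratelimit().
 * model_ratelimit_pages() is a hypothetical helper for illustration only.
 */
static unsigned long model_ratelimit_pages(unsigned long dirty_thresh,
					   unsigned int online_cpus)
{
	unsigned long pages = dirty_thresh / (online_cpus * 32UL);

	/* Same lower bound as in the kernel function quoted above */
	if (pages < 16)
		pages = 16;

	return pages;
}
```

With 8 CPUs, any dirty_thresh below 17 * 8 * 32 = 4352 pages ends up at the
floor of 16, so a threshold of zero is indistinguishable from a merely
tiny one when looking at ratelimit_pages alone.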

The function global_dirty_limits() produces the value for dirty_thresh
and, aside from a potential increase of 25% (the task-dependent
adjustment mentioned before) the value is derived as:

if (vm_dirty_bytes)
	dirty = DIV_ROUND_UP(vm_dirty_bytes, PAGE_SIZE);
else
	dirty = (vm_dirty_ratio * available_memory) / 100;

I checked the vm_dirty_bytes codepath and that works:

(vm.dirty_bytes = 1048576000, 1000MiB)
crash> rd -d ratelimit_pages
c148b618: 1000
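
That reading is what the formula predicts. A small userspace sketch,
assuming a 4096-byte page (usual for i686) and the 8 online CPUs of this
machine; ratelimit_from_dirty_bytes is a hypothetical name, not a kernel
function:

```c
/* Round-up division, written the same way as the kernel's DIV_ROUND_UP */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/*
 * Hypothetical helper: follow the vm_dirty_bytes branch of
 * global_dirty_limits() and then the division in
 * writeback_set_ratelimit(). The task-dependent 25% bonus is ignored.
 */
static unsigned long ratelimit_from_dirty_bytes(unsigned long dirty_bytes,
						unsigned long page_size,
						unsigned int online_cpus)
{
	unsigned long dirty = DIV_ROUND_UP(dirty_bytes, page_size);
	unsigned long pages = dirty / (online_cpus * 32UL);

	return pages < 16 ? 16 : pages;
}
```

ratelimit_from_dirty_bytes(1048576000, 4096, 8) works out as 256000 pages
divided by 256, which is 1000.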

Therefore it's the 'else' case, and this points to available_memory being
zero, or near it (in my case < 5). This value is the direct result of
global_dirtyable_memory(), which I've annotated with some values:

static unsigned long global_dirtyable_memory(void)
{
	unsigned long x;

	x = global_page_state(NR_FREE_PAGES);     //   2648091
	x -= min(x, dirty_balance_reserve);       // -  175522

	x += global_page_state(NR_INACTIVE_FILE); // +  156369
	x += global_page_state(NR_ACTIVE_FILE);   // +    3475 = 2632413

	if (!vm_highmem_is_dirtyable)
		x -= highmem_dirtyable_memory(x);

	return x + 1;	/* Ensure that we never return 0 */
}

If I'm correct here, the global total includes the highmem pages, and it
implies that highmem_dirtyable_memory() is returning a value only slightly
less than or equal to the sum of the others.
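
To see how that collapses the threshold, here is a userspace model of the
function above. It is only a sketch: the struct and function names are
invented for illustration, and the highmem estimate is a plain input that
I assume is capped at the running total (I have not reproduced the real
highmem_dirtyable_memory() here).

```c
/* Inputs to the model; field names are illustrative, not kernel names */
struct dirtyable_inputs {
	unsigned long free_pages;	/* NR_FREE_PAGES */
	unsigned long reserve;		/* dirty_balance_reserve */
	unsigned long inactive_file;	/* NR_INACTIVE_FILE */
	unsigned long active_file;	/* NR_ACTIVE_FILE */
	unsigned long highmem;		/* stand-in for the highmem estimate */
	int highmem_is_dirtyable;	/* value of the sysctl */
};

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

static unsigned long model_dirtyable_memory(const struct dirtyable_inputs *in)
{
	unsigned long x = in->free_pages;

	x -= min_ul(x, in->reserve);
	x += in->inactive_file;
	x += in->active_file;

	/* ASSUMPTION: the highmem estimate never exceeds the running total */
	if (!in->highmem_is_dirtyable)
		x -= min_ul(in->highmem, x);

	return x + 1;	/* never 0, as in the original */
}

/* The 'else' branch of global_dirty_limits(), in integer arithmetic */
static unsigned long model_dirty_thresh(unsigned long ratio,
					unsigned long available)
{
	return ratio * available / 100;
}
```

Feeding in the annotated values with any highmem estimate of 2632413 pages
or more (that input is made up, not measured) gives a dirtyable total of 1
and a threshold of 20 * 1 / 100 = 0. With highmem_is_dirtyable set, the
same inputs give 2632414 pages and a threshold of 526482. At a ratio of
20 the integer division yields 0 for any total below 5 pages.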

To test, I set vm_highmem_is_dirtyable to 1 (which had no effect until
I forced ratelimit_pages to be re-evaluated):

$ echo 1 > /proc/sys/vm/highmem_is_dirtyable
$ echo 21 > /proc/sys/vm/dirty_ratio
$ echo 20 > /proc/sys/vm/dirty_ratio

crash> rd -d ratelimit_pages
c148b618: 2186

The value is now healthy, more so than even the value we started
with on bootup.

My questions and observations are:

* What does highmem_is_dirtyable actually mean, and should it really
default to 0?

Is it actually a misnomer? Since it's only used in
global_dirtyable_memory(), it doesn't actually prevent dirtying of
highmem, it just attempts to place a limit that corresponds to the
amount of non-highmem. I have limited understanding at the moment, but
that would be something different.

* The codepath for setting highmem_is_dirtyable from /proc appears to be
broken; it also needs to make a call to writeback_set_ratelimit()

* Even with highmem_is_dirtyable=1, there's still a sizeable difference
between the value on bootup (78) and the evaluation once booted (2186).
This goes in the wrong direction and is far too big a difference to be
solely num_online_cpus() switching from 1 to 8.
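
To put numbers on that last point, the formula can be run backwards to
get the range of dirty_thresh values that produce a given ratelimit_pages.
A sketch, valid only while the lower bound of 16 is not in effect;
struct thresh_range and thresh_range_for are hypothetical names:

```c
/* Inclusive range of dirty_thresh values, in pages */
struct thresh_range {
	unsigned long lo;
	unsigned long hi;
};

/*
 * Invert ratelimit_pages = dirty_thresh / (online_cpus * 32).
 * Integer division means a whole range of thresholds maps to one result.
 */
static struct thresh_range thresh_range_for(unsigned long ratelimit_pages,
					    unsigned int online_cpus)
{
	unsigned long step = online_cpus * 32UL;
	struct thresh_range r;

	r.lo = ratelimit_pages * step;
	r.hi = (ratelimit_pages + 1) * step - 1;
	return r;
}
```

thresh_range_for(78, 1) gives 2496..2527 pages and thresh_range_for(78, 8)
gives 19968..20223, whereas thresh_range_for(2186, 8) gives
559616..559871. So whichever CPU count applied at bootup, dirty_thresh
was at least roughly 28 times smaller then than after the change.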

The machine is 32-bit with 12GiB of RAM.

For info, I posted a typical zoneinfo, below.

> So yes, it does differ but not drastically. A difference between 1 and 8
> online CPU's would look differently I think. So my theory above is
> questionable. But you might try what it looks like on your system...
>
> >
> > The system is an HP xw6600, running i686 kernel. This happens whether
> > internal SATA HDD, SSD or external USB drive is used. I first saw this on
> > kernel 4.0.4, and 4.0.5 is also affected.
>
> So what was the last version where you did change the dirty ratio and it worked
> fine?

Sorry, I don't know when it broke. I don't immediately have access to an
old kernel to test, but I could do that if necessary.

> > It would surprise me if I'm the only person who was setting dirty_ratio.
> >
> > Have others seen this behaviour? Thanks
> >
>

Thanks, I hope you find this useful.

--
Mark


Node 0, zone DMA
pages free 1566
min 196
low 245
high 294
scanned 0
spanned 4095
present 3989
managed 3970
nr_free_pages 1566
nr_alloc_batch 49
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 163
nr_active_file 1129
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 0
nr_file_pages 1292
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 842
nr_slab_unreclaimable 162
nr_page_table_pages 17
nr_kernel_stack 4
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied 661
nr_written 661
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_transparent_hugepages 0
nr_free_cma 0
protection: (0, 377, 12165, 12165)
pagesets
cpu: 0
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 1
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 2
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 3
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 4
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 5
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 6
count: 0
high: 0
batch: 1
vm stats threshold: 8
cpu: 7
count: 0
high: 0
batch: 1
vm stats threshold: 8
all_unreclaimable: 0
start_pfn: 1
inactive_ratio: 1
Node 0, zone Normal
pages free 37336
min 4789
low 5986
high 7183
scanned 0
spanned 123902
present 123902
managed 96773
nr_free_pages 37336
nr_alloc_batch 331
nr_inactive_anon 0
nr_active_anon 0
nr_inactive_file 4016
nr_active_file 26672
nr_unevictable 0
nr_mlock 0
nr_anon_pages 0
nr_mapped 1
nr_file_pages 30684
nr_dirty 4
nr_writeback 0
nr_slab_reclaimable 19865
nr_slab_unreclaimable 4673
nr_page_table_pages 1027
nr_kernel_stack 281
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 0
nr_dirtied 14354
nr_written 21672
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_transparent_hugepages 0
nr_free_cma 0
protection: (0, 0, 94302, 94302)
pagesets
cpu: 0
count: 78
high: 186
batch: 31
vm stats threshold: 24
cpu: 1
count: 140
high: 186
batch: 31
vm stats threshold: 24
cpu: 2
count: 116
high: 186
batch: 31
vm stats threshold: 24
cpu: 3
count: 100
high: 186
batch: 31
vm stats threshold: 24
cpu: 4
count: 70
high: 186
batch: 31
vm stats threshold: 24
cpu: 5
count: 82
high: 186
batch: 31
vm stats threshold: 24
cpu: 6
count: 144
high: 186
batch: 31
vm stats threshold: 24
cpu: 7
count: 59
high: 186
batch: 31
vm stats threshold: 24
all_unreclaimable: 0
start_pfn: 4096
inactive_ratio: 1
Node 0, zone HighMem
pages free 2536526
min 128
low 37501
high 74874
scanned 0
spanned 3214338
present 3017668
managed 3017668
nr_free_pages 2536526
nr_alloc_batch 10793
nr_inactive_anon 2118
nr_active_anon 118021
nr_inactive_file 80138
nr_active_file 273523
nr_unevictable 3475
nr_mlock 3475
nr_anon_pages 119672
nr_mapped 48158
nr_file_pages 357567
nr_dirty 0
nr_writeback 0
nr_slab_reclaimable 0
nr_slab_unreclaimable 0
nr_page_table_pages 0
nr_kernel_stack 0
nr_unstable 0
nr_bounce 0
nr_vmscan_write 0
nr_vmscan_immediate_reclaim 0
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 2766
nr_dirtied 1882996
nr_written 1695681
nr_pages_scanned 0
workingset_refault 0
workingset_activate 0
workingset_nodereclaim 0
nr_anon_transparent_hugepages 151
nr_free_cma 0
protection: (0, 0, 0, 0)
pagesets
cpu: 0
count: 171
high: 186
batch: 31
vm stats threshold: 64
cpu: 1
count: 80
high: 186
batch: 31
vm stats threshold: 64
cpu: 2
count: 91
high: 186
batch: 31
vm stats threshold: 64
cpu: 3
count: 173
high: 186
batch: 31
vm stats threshold: 64
cpu: 4
count: 114
high: 186
batch: 31
vm stats threshold: 64
cpu: 5
count: 159
high: 186
batch: 31
vm stats threshold: 64
cpu: 6
count: 130
high: 186
batch: 31
vm stats threshold: 64
cpu: 7
count: 62
high: 186
batch: 31
vm stats threshold: 64
all_unreclaimable: 0
start_pfn: 127998
inactive_ratio: 10