Re: page allocation stall in kernel 4.9 when copying files from one btrfs hdd to another

From: Duncan
Date: Tue Dec 13 2016 - 18:29:33 EST

Next message: Stephen Rothwell: "Re: [GIT PULL] Block core changes for 4.10"
Previous message: Stephen Boyd: "Re: [PATCH v6 1/3] clk: x86: Add Atom PMC platform clocks"
In reply to: David Arendt: "Re: page allocation stall in kernel 4.9 when copying files from one btrfs hdd to another"
Next in thread: Michal Hocko: "Re: page allocation stall in kernel 4.9 when copying files from one btrfs hdd to another"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

David Arendt posted on Tue, 13 Dec 2016 21:26:04 +0100 as excerpted:

> The crash is not an isolated one as I already had this crash multiple
> times with -rc7 and -rc8. It seems only to occur when copying from
> 7200rpm harddisks to 5600rpm ones, and never when copying between two
> 7200rpm or two 5400rpm.

That reads very much like a bug previously reported here and on LKML
itself (with Linus and other high-level kernel devs responding) that
resulted in a(nother) discussion of whether the write-cache defaults for
the /proc/sys/vm/dirty_* knobs should be updated.

It's generally accepted wisdom among kernel devs and sysadmins[1] that
the existing dirty_* write-cache defaults, set at a time when common
system memories were measured in MiB, not the GiB of today, are no longer
appropriate and should be lowered. But there's no agreement on precisely
what the new settings should be, and between that, inertia, and the lack
of practical pressure (those who know about the problem have long since
adjusted their own systems), the defaults remain as they are, even tho
they're generally agreed to be inappropriate. =:^(

These knobs can be tweaked in several ways. For temporary
experimentation, it's generally easiest to write (as root) updated values
directly to the /proc/sys/vm/dirty_* files themselves. Once you find
values you are comfortable with, most distros have an existing sysctl
config[2] that can be altered as appropriate, so the settings get
reapplied at each boot.
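
For the temporary route, a minimal sketch, assuming a Linux box with
procfs mounted at /proc and a root shell. The helper names and the VM_DIR
override are made up for illustration (VM_DIR just allows a dry run
against a scratch directory), and the values in the usage lines are only
examples:

```shell
# Hypothetical helpers, not part of any distro tooling.
# VM_DIR defaults to the real knob directory; point it at a scratch
# directory to dry-run without root.
: "${VM_DIR:=/proc/sys/vm}"

# Print every dirty_* knob as "name = value".
show_dirty_knobs() {
    for f in "$VM_DIR"/dirty_*; do
        printf '%s = %s\n' "${f##*/}" "$(cat "$f")"
    done
}

# Write one knob: $1 = name (e.g. dirty_ratio), $2 = new value.
# Against the real /proc this needs root and lasts until reboot.
set_dirty_knob() {
    printf '%s\n' "$2" > "$VM_DIR/$1"
}

# Usage (as root):
#   show_dirty_knobs
#   set_dirty_knob dirty_background_ratio 1
#   set_dirty_knob dirty_ratio 3
```

Noting the old values from show_dirty_knobs first makes it easy to put
them back if an experiment makes things worse.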

Various articles with the details are easily googled so I'll be brief
here, but here are the apropos settings and comments from my own
/etc/sysctl.conf and a brief explanation:

# write-cache, foreground/background flushing
# vm.dirty_ratio = 10 (% of RAM)
# make it 3% of 16G ~ half a gig
vm.dirty_ratio = 3
# vm.dirty_bytes = 0

# vm.dirty_background_ratio = 5 (% of RAM)
# make it 1% of 16G ~ 160 M
vm.dirty_background_ratio = 1
# vm.dirty_background_bytes = 0

# vm.dirty_expire_centisecs = 2999 (30 sec)
# vm.dirty_writeback_centisecs = 499 (5 sec)
# make it 10 sec
vm.dirty_writeback_centisecs = 1000

The *_bytes and *_ratio files configure the same thing in different ways,
ratio being percentage of RAM, bytes being... bytes. Set one or the
other as you prefer and the other one will be automatically zeroed out.
The vm.dirty_background_* settings control when the kernel starts lower
priority flushing, while high priority vm.dirty_* (not background)
settings control when the kernel forces threads trying to do further
writes to wait until some currently in-flight writes are completed.

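But those size thresholds only apply until dirty data reaches the expiry
time (vm.dirty_expire_centisecs), at which point it gets written back
regardless of how little of it there is. That's where the centisecs
settings come in: vm.dirty_writeback_centisecs sets how often the
flusher wakes up to check.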

The problem is that memory has gotten bigger much faster than the speed
of actually writing out to slow spinning rust has increased. (Fast ssds
have far fewer issues in this regard, tho slow flash like common USB thumb
drives remains affected, indeed sometimes even more so.) Common spinning
rust write speeds are around 100 MiB/sec and may drop to 30 MiB/sec or
lower under random-write loads. Meanwhile, the default 10% dirty_ratio,
at 16 GiB memory size, approaches[3] 1.6 GiB, ~1600 MiB. At 100 MiB/sec
that's 16 seconds worth of writeback to clear. At 30 MiB/sec, that's
about 53 seconds, well beyond the 30 second expiry time!
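
The arithmetic above, as a quick sketch. drain_seconds is a made-up
helper name; it uses integer shell arithmetic, so results are rounded
down, and the inputs are the round figures from the text rather than
measurements:

```shell
# Seconds needed to flush a backlog of dirty data.
# $1 = dirty data in MiB, $2 = sustained write speed in MiB/sec.
drain_seconds() {
    echo $(( $1 / $2 ))
}

drain_seconds 1600 100   # 10% of 16 GiB at 100 MiB/sec
drain_seconds 1600 30    # the same backlog at 30 MiB/sec
```

That prints 16 and 53, so at the slow end a full backlog takes nearly
twice the 30 second expiry time to clear.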

To be clear, there's still a bug if the system crashes as a result -- the
normal case should simply be a system that at worst doesn't respond for
the writeback period, to be sure a problem in itself when that period
exceeds double-digit seconds, but surely less of one than a total crash,
as long as the system /does/ come back after perhaps half a minute or so.

Anyway, as you can see from the above excerpt from my own sysctl.conf,
for my 16 GiB system, I use a much more reasonable 1% background writeback
trigger, ~160 MiB on 16 GiB, and 3% high-priority/foreground, ~ half a
GiB on 16 GiB. I actually set those long ago, before I switched to btrfs
and before I switched to ssd as well, but even tho ssd should work far
better with the defaults than spinning rust does, those settings don't
hurt on ssd either, and I've seen no reason to change them.
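
To check those figures, a rough sketch that turns a ratio into an
absolute size. ratio_to_mib is a hypothetical helper, and it applies the
percentage to total RAM, which slightly overstates the real threshold
(see footnote [3]):

```shell
# Approximate dirty threshold in MiB, rounded down.
# $1 = RAM in GiB, $2 = ratio in percent.
# The kernel applies the ratio to a somewhat smaller memory figure
# than total RAM, so treat the result as an upper bound.
ratio_to_mib() {
    echo $(( $1 * 1024 * $2 / 100 ))
}

ratio_to_mib 16 1    # background trigger on 16 GiB
ratio_to_mib 16 3    # foreground trigger on 16 GiB
ratio_to_mib 16 10   # the 10% dirty_ratio default discussed above
```

That prints 163, 491 and 1638, matching the ~160 MiB, ~half a GiB and
~1.6 GiB figures used in this post.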

So try 1% background and 3% foreground flushing ratios on your 32 GiB
system as well, and see if that helps, or possibly try setting the _bytes
values instead, since 1% is still quite huge in writeback time terms, on
32 GiB. Tweaking those down certainly helped with the previously reported
bug, as the reporter couldn't reproduce it after that. Based on your
posted meminfo you're running 2+ GiB dirty now, so lower settings
should reduce that, and hopefully eliminate the trigger for you, tho of
course it won't fix the root bug. As I said it shouldn't crash in any
case, even if it goes unresponsive for half a minute or so at a time, so
there's certainly a bug to fix, but that will hopefully let you work
without running into it.
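
One way to pick _bytes values is to work backwards from how long a flush
you're willing to tolerate. A sketch only: the helper name, the
30 MiB/sec speed (the slow end quoted earlier) and the 2 and 6 second
budgets are assumptions for illustration, not tested recommendations:

```shell
# Bytes a disk can write within a given time budget.
# $1 = sustained write speed in MiB/sec, $2 = budget in seconds.
budget_bytes() {
    echo $(( $1 * $2 * 1024 * 1024 ))
}

# Print sysctl.conf-style lines for a slow 30 MiB/sec target disk.
echo "vm.dirty_background_bytes = $(budget_bytes 30 2)"
echo "vm.dirty_bytes = $(budget_bytes 30 6)"
```

That works out to 62914560 and 188743680 bytes (60 and 180 MiB).
Setting the _bytes knobs zeroes the matching _ratio ones, as noted
earlier. Watching the Dirty and Writeback lines in /proc/meminfo during
a copy shows whether the backlog actually shrinks.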

Again, you can write the new values direct to the proc interface without
rebooting, for experimentation. Once you find values appropriate for
you, however, write them to sysctl.conf or whatever your distro uses
instead, so they get applied automatically at each boot.
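
A sketch of the persistent route, assuming your distro reads
/etc/sysctl.d/*.conf (see footnote [2]). The function name and the file
name 90-dirty-writeback.conf are made up for illustration; the values in
the usage lines are the ones from the sysctl.conf excerpt above:

```shell
# Emit a sysctl fragment: $1 = background ratio, $2 = foreground ratio.
emit_dirty_conf() {
    printf 'vm.dirty_background_ratio = %s\n' "$1"
    printf 'vm.dirty_ratio = %s\n' "$2"
}

# Usage (as root); the file name is hypothetical:
#   emit_dirty_conf 1 3 > /etc/sysctl.d/90-dirty-writeback.conf
#   sysctl -p /etc/sysctl.d/90-dirty-writeback.conf   # apply now
```

Running sysctl -p on the file applies it immediately, so the reboot can
wait.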

---
[1] Sysadmins: Like me, no claim to dev here, nor am I a professional
sysadmin, but arguably I do take the responsibility of adminning my own
systems more seriously than most appear to, enough to claim sysadmin as
an appropriate descriptor.

[2] Sysctl config. Look in /etc/sysctl.d/* and/or /etc/sysctl.conf, as
appropriate to your distro.

[3] Approaches: The memory figure used for calculating this percentage
excludes some things so it won't actually reach 10% of total memory. But
the exclusions are small enough that they can be hand-waved away for
purposes of this discussion.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
