Re: [PATCH] dm: Avoid sleeping while holding the dm_bufio lock

From: Doug Anderson
Date: Mon Dec 12 2016 - 19:08:20 EST


On Thu, Dec 8, 2016 at 3:20 PM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
> On Wed, 7 Dec 2016, Doug Anderson wrote:
>> Hi,
>> On Wed, Nov 23, 2016 at 12:57 PM, Mikulas Patocka <mpatocka@xxxxxxxxxx> wrote:
>> > Hi
>> >
>> > The GFP_NOIO allocation frees clean cached pages. The GFP_NOWAIT
>> > allocation doesn't. Your patch would incorrectly reuse buffers in a
>> > situation when the memory is filled with clean cached pages.
>> >
>> > Here I'm proposing an alternate patch that first tries GFP_NOWAIT
>> > allocation, then drops the lock and tries GFP_NOIO allocation.
>> >
>> > Note that the root cause why you are seeing this stacktrace is, that your
>> > block device is congested - i.e. there are too many requests in the
>> > device's queue - and note that fixing this wait won't fix the root cause
>> > (congested device).
>> >
>> > The congestion limits are set in blk_queue_congestion_threshold to 7/8 to
>> > 13/16 size of the nr_requests value.
>> >
>> > If you don't want your device to report the congested status, you can
>> > increase /sys/block/<device>/queue/nr_requests - you should test if your
>> > chromebook is faster of slower with this setting increased. But note that
>> > this setting won't increase the IO-per-second of the device.
>> Cool, thanks for the insight!
>> Can you clarify which block device is relevant here? Is this the DM
>> block device, the underlying block device, or the swap block device?
>> I'm not at all an expert on DM, but I think we have:
>> 1. /dev/mmcblk0 - the underlying storage device.
>> 2. /dev/dm-0 - The verity device that's run atop /dev/mmcblk0p3
>> 3. /dev/zram0 - Our swap device
> The /dev/mmcblk0 device is congested. You can see the number of requests
> in /sys/block/mmcblk0/inflight

Hrm. Doing some tests doesn't show that to be true.

I ran this command while doing my test (the "balloon" test described
in my patch, not the running the computer as a real user):

while true;
do grep -v '0 *0' /sys/block/*/inflight;
sleep .1;

The largest numbers I ever saw was:

/sys/block/mmcblk0/inflight: 6 0

...and in one case when I saw a "long bufio" mesage I was seeing:

/sys/block/zram0/inflight: 1 0
/sys/block/zram0/inflight: 0 1
/sys/block/zram0/inflight: 0 1
/sys/block/zram0/inflight: 1 1
/sys/block/zram0/inflight: 0 1

>> As stated in the original email, I'm running on a downstream kernel
>> (kernel-4.4) with bunches of local patches, so it's plausible that
>> things have changed in the meantime, but:
>> * At boot time the "nr_requests" for all block devices was 128
>> * I was unable to set the "nr_requests" for dm-0 and zram0 (it just
>> gives an error in sysfs).
>> * When I set "nr_requests" to 4096 for /dev/mmcblk0 it didn't seem to
>> affect the problem.
> The eMMC has some IOPS and the IOPS can't be improved. Use faster block
> device - but it will cost more money.

It's so strange since when I see this failure I'm not even doing lots
of eMMC traffic. As per above my swap goes to zram and there's no
disk backing, so there should be very minimal disk usage.

If I run iostat now after having been running my test for a while (and
having seen a bunch of "long bufio" messages, you can see that there
hasn't exactly been lots of activity.

localhost debug # iostat
Linux 4.4.21 (localhost) 12/09/16 _aarch64_ (6 CPU)

avg-cpu: %user %nice %system %iowait %steal %idle
6.11 33.87 36.63 1.78 0.00 21.61

Device: tps kB_read/s kB_wrtn/s kB_read kB_wrtn
loop0 2.74 39.01 68.95 33529 59264
loop1 0.07 1.23 0.00 1055 0
loop2 0.04 0.13 0.00 112 0
loop3 0.00 0.00 0.00 4 0
loop4 0.00 0.01 0.00 8 0
loop5 0.00 0.01 0.00 8 0
mmcblk1 0.56 44.45 0.00 38202 0
mmcblk0 14.37 636.29 38.74 546888 33300
mmcblk0rpmb 0.01 0.03 0.00 28 0
mmcblk0boot1 0.03 0.15 0.00 128 0
mmcblk0boot0 0.03 0.15 0.00 128 0
dm-0 11.50 419.25 0.00 360344 0
dm-1 2.15 36.53 68.95 31401 59264
zram0 42103.26 84111.64 84301.42 72293956 72457068


OK, so I just put a printk in wait_iff_congested() and it didn't show
me waiting for the timeout (!). I know that I saw
wait_iff_congested() in the originally reproduction of this problem,
but it appears that in my little "balloon" reproduction it's not
actually involved...

...I dug further and it appears that __alloc_pages_direct_reclaim() is
actually what's slow. Specifically it looks as if shrink_zone() can
actually take quite a while. As I've said, I'm not an expert on the
memory manager but I'm not convinced that it's wrong for the direct
reclaim path to be pretty slow at times, especially when I'm putting
an abnormally high amount of stress on it.

I'm going to take this as further evidence that the patch being
discussed in this thread is a good one (AKA don't hold the dm bufio
lock while allocating memory). :) If it's unexpected that
shrink_zone() might take several seconds when under extreme memory
pressure then I can do some additional digging. Do note that I am
running with "zram" and remember that I'm on an ancient 4.4-based
kernel, so perhaps one of those two factors causes problems.

> If you want to handle such a situation where you run 4 tasks each eating
> 900MB, just use more memory, don't expect that this will work smoothly on
> 4GB machine.
> If you want to protect the chromebook from runaway memory allocations, you
> can detect this situation in some watchdog process and either kill the
> process that consumes most memory with the kill syscall or trigger the
> kernel OOM killer by writing 'f' to /proc/sysrq-trigger.
> The question is what you really want - handle this situation smoothly
> (then, you must add more memory) or protect chromeOS from applications
> allocating too much memory?

These are definitely important things to think about. In the end user
bug report, though, we were seeing this super slow / janky behavior
_without_ a runaway process--they were just using memory (and swap)
normally. I've simulated it with the "balloon" process, but users
were seeing it without anything so crazy.