Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

From: Tetsuo Handa
Date: Sun Oct 25 2015 - 06:53:11 EST

Next message: Geert Uytterhoeven: "CONFIG_CGROUP_WRITEBACK=y/n complexity (was: Re: Build regressions/improvements in v4.3-rc7)"
Previous message: Yaniv Gardi: "[PATCH v8 4/8] add ufshcd_get_variant ufshcd_set_variant"
Next in thread: Tejun Heo: "Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Tejun Heo wrote:
> If this is an actual problem, a better approach would be something
> which detects the stall condition and kicks off the next work item but
> if we do that I think I'd still trigger a warning there. I don't
> know. Don't go busy waiting in kernel.

Busy waiting in kernel refers several cases.

(1) Wait for something with interrupts disabled.

(2) Wait for something with interrupts enabled but
without calling cond_resched() etc.

(3) Wait for something with interrupts enabled and
with calling cond_resched() etc.

(4) Wait for something with interrupts enabled and
with calling schedule_timeout() etc.

Kernel code tries to minimize (1). Kernel code does (2) if they are
not allowed to sleep. But kernel code is allowed to do (3) if they
are allowed to sleep, as long as cond_resched() is sometimes called.
And currently page allocator does (3). But kernel code invoked via
workqueue is expected to do (4) than (3).

This means that any kernel code which invokes a __GFP_WAIT allocation
might fail to do (4) when invoked via workqueue, regardless of flags
passed to alloc_workqueue()?

Michal Hocko wrote:
> On Fri 23-10-15 06:42:43, Tetsuo Handa wrote:
> > Tejun Heo wrote:
> > > On Thu, Oct 22, 2015 at 05:49:22PM +0200, Michal Hocko wrote:
> > > > I am confused. What makes rescuer to not run? Nothing seems to be
> > > > hogging CPUs, we are just out of workers which are loopin in the
> > > > allocator but that is preemptible context.
> > >
> > > It's concurrency management. Workqueue thinks that the pool is making
> > > positive forward progress and doesn't schedule anything else for
> > > execution while that work item is burning cpu cycles.
> >
> > Then, isn't below change easier to backport which will also alleviate
> > needlessly burning CPU cycles?
>
> This is quite obscure. If the vmstat_update fix needs workqueue tweaks
> as well then I would vote for your original patch which is clear,
> straightforward and easy to backport.

I think that inserting a short sleep into page allocator is better
because the vmstat_update fix will not require workqueue tweaks if
we sleep inside page allocator. Also, from the point of view of
protecting page allocator from going unresponsive when hundreds of tasks
started busy-waiting at __alloc_pages_slowpath() because we can observe
that XXX value in the "MemAlloc-Info: XXX stalling task," line grows
when we are unable to make forward progress.

----------------------------------------
>From a2f34850c26b5bb124d44983f5a2020b51249d53 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
Date: Sun, 25 Oct 2015 19:42:15 +0900
Subject: [PATCH] mm,page_alloc: Insert an uninterruptible sleep before
retrying.

Since "struct zone"->vm_stat[] is array of atomic_long_t, an effort
to reduce frequency of updating values in vm_stat[] is made by using
per cpu variables "struct per_cpu_pageset"->vm_stat_diff[].
Values in vm_stat_diff[] are merged into vm_stat[] periodically
using vmstat_update workqueue item (struct delayed_work vmstat_work).

When a task attempted to allocate memory and reached direct reclaim
path, shrink_zones() checks whether there are reclaimable pages by
calling zone_reclaimable(). zone_reclaimable() makes decision based
on values in vm_stat[] by calling zone_page_state(). This is usually
fine because values in vm_stat_diff[] are expected to be merged into
vm_stat[] shortly.

But workqueue and page allocator have different assumptions.

(A) The workqueue defers processing of other items unless currently
in-flight item enters into !TASK_RUNNING state.

(B) The page allocator never enters into !TASK_RUNNING state if there
is nothing to reclaim. (The page allocator calls cond_resched()
via wait_iff_congested(), but cond_resched() does not make the
task enter into !TASK_RUNNING state.)

Therefore, if a workqueue item which is processed before vmstat_update
item is processed got stuck inside memory allocation request, values in
vm_stat_diff[] cannot be merged into vm_stat[].

As a result, zone_reclaimable() continues using outdated vm_stat[] values
and the task which is doing direct reclaim path thinks that there are
still reclaimable pages and therefore continues looping.

The consequence is a silent livelock (hang up without any kernel messages)
because the OOM killer will not be invoked. We can hit such livelock by
e.g. disk_events_workfn workqueue item doing memory allocation from
bio_copy_kern().

----------------------------------------
[ 255.054205] kworker/3:1 R running task 0 45 2 0x00000008
[ 255.056063] Workqueue: events_freezable_power_ disk_events_workfn
[ 255.057715] ffff88007f805680 ffff88007c55f6d0 ffffffff8116463d ffff88007c55f758
[ 255.059705] ffff88007f82b870 ffff88007c55f6e0 ffffffff811646be ffff88007c55f710
[ 255.061694] ffffffff811bdaf0 ffff88007f82b870 0000000000000400 0000000000000000
[ 255.063690] Call Trace:
[ 255.064664] [<ffffffff8116463d>] ? __list_lru_count_one.isra.4+0x1d/0x80
[ 255.066428] [<ffffffff811646be>] ? list_lru_count_one+0x1e/0x20
[ 255.068063] [<ffffffff811bdaf0>] ? super_cache_count+0x50/0xd0
[ 255.069666] [<ffffffff8114ecf6>] ? shrink_slab.part.38+0xf6/0x2a0
[ 255.071313] [<ffffffff81151f78>] ? shrink_zone+0x2c8/0x2e0
[ 255.072845] [<ffffffff81152316>] ? do_try_to_free_pages+0x156/0x6d0
[ 255.074527] [<ffffffff810bc6b6>] ? mark_held_locks+0x66/0x90
[ 255.076085] [<ffffffff816ca797>] ? _raw_spin_unlock_irq+0x27/0x40
[ 255.077727] [<ffffffff810bc7d9>] ? trace_hardirqs_on_caller+0xf9/0x1c0
[ 255.079451] [<ffffffff81152924>] ? try_to_free_pages+0x94/0xc0
[ 255.081045] [<ffffffff81145b4a>] ? __alloc_pages_nodemask+0x72a/0xdb0
[ 255.082761] [<ffffffff8118cd06>] ? alloc_pages_current+0x96/0x1b0
[ 255.084407] [<ffffffff8133985d>] ? bio_alloc_bioset+0x20d/0x2d0
[ 255.086032] [<ffffffff8133aba4>] ? bio_copy_kern+0xc4/0x180
[ 255.087584] [<ffffffff81344f20>] ? blk_rq_map_kern+0x70/0x130
[ 255.089161] [<ffffffff814a334d>] ? scsi_execute+0x12d/0x160
[ 255.090696] [<ffffffff814a3474>] ? scsi_execute_req_flags+0x84/0xf0
[ 255.092466] [<ffffffff814b55f2>] ? sr_check_events+0xb2/0x2a0
[ 255.094042] [<ffffffff814c3223>] ? cdrom_check_events+0x13/0x30
[ 255.095634] [<ffffffff814b5a35>] ? sr_block_check_events+0x25/0x30
[ 255.097278] [<ffffffff813501fb>] ? disk_check_events+0x5b/0x150
[ 255.098865] [<ffffffff81350307>] ? disk_events_workfn+0x17/0x20
[ 255.100451] [<ffffffff810890b5>] ? process_one_work+0x1a5/0x420
[ 255.102046] [<ffffffff81089051>] ? process_one_work+0x141/0x420
[ 255.103625] [<ffffffff8108944b>] ? worker_thread+0x11b/0x490
[ 255.105159] [<ffffffff816c4e95>] ? __schedule+0x315/0xac0
[ 255.106643] [<ffffffff81089330>] ? process_one_work+0x420/0x420
[ 255.108217] [<ffffffff8108f4e9>] ? kthread+0xf9/0x110
[ 255.109634] [<ffffffff8108f3f0>] ? kthread_create_on_node+0x230/0x230
[ 255.111307] [<ffffffff816cb35f>] ? ret_from_fork+0x3f/0x70
[ 255.112785] [<ffffffff8108f3f0>] ? kthread_create_on_node+0x230/0x230
(...snipped...)
[ 273.930846] Showing busy workqueues and worker pools:
[ 273.932299] workqueue events: flags=0x0
[ 273.933465] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[ 273.935120] pending: vmpressure_work_fn, vmstat_shepherd, vmstat_update, vmw_fb_dirty_flush [vmwgfx]
[ 273.937489] workqueue events_freezable: flags=0x4
[ 273.938795] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 273.940446] pending: vmballoon_work [vmw_balloon]
[ 273.941973] workqueue events_power_efficient: flags=0x80
[ 273.943491] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 273.945167] pending: check_lifetime
[ 273.946422] workqueue events_freezable_power_: flags=0x84
[ 273.947890] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 273.949579] in-flight: 45:disk_events_workfn
[ 273.951103] workqueue ipv6_addrconf: flags=0x8
[ 273.952447] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/1
[ 273.954121] pending: addrconf_verify_work
[ 273.955541] workqueue xfs-reclaim/sda1: flags=0x4
[ 273.957036] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 273.958847] pending: xfs_reclaim_worker
[ 273.960392] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=3 idle: 186 26
----------------------------------------

Three approaches are proposed for fixing this silent livelock problem.

(1) Use zone_page_state_snapshot() instead of zone_page_state()
when doing zone_reclaimable() checks. This approach is clear,
straightforward and easy to backport. So far I cannot reproduce
this livelock using this change. But there might be more locations
which should use zone_page_state_snapshot().

(2) Use a dedicated workqueue for vmstat_update item which is guaranteed
to be processed immediately. So far I cannot reproduce this livelock
using a dedicated workqueue created with WQ_MEM_RECLAIM|WQ_HIGHPRI
(patch proposed by Christoph Lameter). But according to Tejun Heo,
if we want to guarantee that nobody can reproduce this livelock, we
need to modify workqueue API because commit 3270476a6c0c ("workqueue:
reimplement WQ_HIGHPRI using a separate worker_pool") which went to
Linux 3.6 lost the guarantee.

(3) Use a !TASK_RUNNING sleep inside page allocator side. This approach
is easy to backport. So far I cannot reproduce this livelock using
this approach. And I think that nobody can reproduce this livelock
because this changes the page allocator to obey the workqueue's
expectations. Even if we leave this livelock problem aside, not
entering into !TASK_RUNNING state for too long is an exclusive
occupation of workqueue which will make other items in the workqueue
needlessly deferred. We don't need to defer other items which do not
invoke a __GFP_WAIT allocation.

This patch does approach (3), by inserting an uninterruptible sleep into
page allocator side before retrying, in order to make sure that other
workqueue items (especially vmstat_update item) are given a chance to be
processed.

Although a different problem, by using approach (3), we can alleviate
needlessly burning CPU cycles even when we hit OOM-killer livelock problem
(hang up after the OOM-killer messages are printed because the OOM victim
cannot terminate due to dependency).

Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
---
mm/oom_kill.c | 8 +-------
mm/page_alloc.c | 19 +++++++++++++++++--
2 files changed, 18 insertions(+), 9 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d13a339..877b5a5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -722,15 +722,9 @@ bool out_of_memory(struct oom_control *oc)
dump_header(oc, NULL, NULL);
panic("Out of memory and no killable processes...\n");
}
- if (p && p != (void *)-1UL) {
+ if (p && p != (void *)-1UL)
oom_kill_process(oc, p, points, totalpages, NULL,
"Out of memory");
- /*
- * Give the killed process a good chance to exit before trying
- * to allocate memory again.
- */
- schedule_timeout_killable(1);
- }
return true;
}

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3687f4c..047ebda 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2726,7 +2726,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
*/
if (!mutex_trylock(&oom_lock)) {
*did_some_progress = 1;
- schedule_timeout_uninterruptible(1);
return NULL;
}

@@ -3385,6 +3384,15 @@ retry:
((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
/* Wait for some write requests to complete then retry */
wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
+ /*
+ * Give other workqueue items (especially vmstat_update item)
+ * a chance to be processed. There is no need to wait if I was
+ * chosen by the OOM killer, for I will leave this function
+ * using ALLOC_NO_WATERMARKS. But I need to wait even if I have
+ * SIGKILL pending, for I can't leave this function.
+ */
+ if (!test_thread_flag(TIF_MEMDIE))
+ schedule_timeout_uninterruptible(1);
goto retry;
}

@@ -3394,8 +3402,15 @@ retry:
goto got_pg;

/* Retry as long as the OOM killer is making progress */
- if (did_some_progress)
+ if (did_some_progress) {
+ /*
+ * Give the OOM victim a chance to leave this function
+ * before trying to allocate memory again.
+ */
+ if (!test_thread_flag(TIF_MEMDIE))
+ schedule_timeout_uninterruptible(1);
goto retry;
+ }

noretry:
/*
--
1.8.3.1
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Geert Uytterhoeven: "CONFIG_CGROUP_WRITEBACK=y/n complexity (was: Re: Build regressions/improvements in v4.3-rc7)"
Previous message: Yaniv Gardi: "[PATCH v8 4/8] add ufshcd_get_variant ufshcd_set_variant"
Next in thread: Tejun Heo: "Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]