Re: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().

From: Tetsuo Handa
Date: Fri Jul 27 2018 - 11:47:53 EST


On 2018/07/26 20:39, Michal Hocko wrote:
> On Thu 26-07-18 20:06:24, Tetsuo Handa wrote:
>> Before applying "an OOM lockup mitigation patch", I want to apply this
>> "another OOM lockup avoidance" patch.
>
> I do not really see why. All these are borderline interesting as the
> system is basically dead by the time you reach this state.

I don't like your "borderline interesting". We still don't have a watchdog
which tells something went wrong. Thus, "borderline interesting" has to be
examined and fixed.

>
>> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20180726.txt.xz
>> (which was captured with
>>
>> --- a/mm/oom_kill.c
>> +++ b/mm/oom_kill.c
>> @@ -1071,6 +1071,12 @@ bool out_of_memory(struct oom_control *oc)
>> {
>> unsigned long freed = 0;
>> bool delay = false; /* if set, delay next allocation attempt */
>> + static unsigned long last_warned;
>> + if (!last_warned || time_after(jiffies, last_warned + 10 * HZ)) {
>> + pr_warn("%s(%d) gfp_mask=%#x(%pGg), order=%d\n", current->comm,
>> + current->pid, oc->gfp_mask, &oc->gfp_mask, oc->order);
>> + last_warned = jiffies;
>> + }
>>
>> oc->constraint = CONSTRAINT_NONE;
>> if (oom_killer_disabled)
>>
>> in order to demonstrate that the GFP_NOIO allocation from disk_events_workfn() is
>> calling out_of_memory() rather than by error failing to give up direct reclaim).
>>
>> [ 258.619119] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [ 268.622732] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [ 278.635344] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [ 288.639360] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>> [ 298.642715] kworker/0:0(5) gfp_mask=0x600000(GFP_NOIO), order=0
>
> Hmm, so there is no other memory allocation to trigger the oom or they
> all just back off on the oom_lock trylock? In other words what is
> preventing from the oom killer invocation?

All __GFP_FS allocations got stuck at direct reclaim or workqueue.

[ 310.265293] systemd D12240 1 0 0x00000000
[ 310.268118] Call Trace:
[ 310.269894] ? __schedule+0x245/0x7f0
[ 310.271901] ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[ 310.274187] schedule+0x23/0x80
[ 310.275965] schedule_preempt_disabled+0x5/0x10
[ 310.278107] __mutex_lock+0x3f5/0x9b0
[ 310.280070] ? xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[ 310.282451] xfs_reclaim_inodes_ag+0x3b8/0x470 [xfs]
[ 310.284730] ? lock_acquire+0x51/0x70
[ 310.286678] ? xfs_ail_push_all+0x13/0x60 [xfs]
[ 310.288844] xfs_reclaim_inodes_nr+0x2c/0x40 [xfs]
[ 310.291092] super_cache_scan+0x14b/0x1a0
[ 310.293131] do_shrink_slab+0xfc/0x190
[ 310.295082] shrink_slab+0x240/0x2c0
[ 310.297028] shrink_node+0xe3/0x460
[ 310.298899] do_try_to_free_pages+0xcb/0x380
[ 310.300945] try_to_free_pages+0xbb/0xf0
[ 310.302874] __alloc_pages_slowpath+0x3c1/0xc50
[ 310.304931] ? lock_acquire+0x51/0x70
[ 310.306753] ? set_page_refcounted+0x40/0x40
[ 310.308686] __alloc_pages_nodemask+0x2a6/0x2c0
[ 310.310663] filemap_fault+0x437/0x8e0
[ 310.312446] ? lock_acquire+0x51/0x70
[ 310.314150] ? xfs_ilock+0x86/0x190 [xfs]
[ 310.315915] __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[ 310.318089] __do_fault+0x13/0x126
[ 310.319696] __handle_mm_fault+0xc8d/0x11c0
[ 310.321493] handle_mm_fault+0x17a/0x390
[ 310.323156] __do_page_fault+0x24c/0x4d0
[ 310.324782] do_page_fault+0x2a/0x70
[ 310.326289] ? page_fault+0x8/0x30
[ 310.327745] page_fault+0x1e/0x30

[ 313.714537] systemd-journal D12600 498 1 0x00000000
[ 313.716538] Call Trace:
[ 313.717683] ? __schedule+0x245/0x7f0
[ 313.719221] schedule+0x23/0x80
[ 313.720586] schedule_timeout+0x21f/0x400
[ 313.722163] wait_for_completion+0xb2/0x130
[ 313.723750] ? wake_up_q+0x70/0x70
[ 313.725134] flush_work+0x1db/0x2b0
[ 313.726535] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[ 313.728336] ? page_alloc_cpu_dead+0x30/0x30
[ 313.729936] drain_all_pages+0x174/0x1e0
[ 313.731446] __alloc_pages_slowpath+0x428/0xc50
[ 313.733120] __alloc_pages_nodemask+0x2a6/0x2c0
[ 313.734826] filemap_fault+0x437/0x8e0
[ 313.736296] ? lock_acquire+0x51/0x70
[ 313.737769] ? xfs_ilock+0x86/0x190 [xfs]
[ 313.739309] __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[ 313.741291] __do_fault+0x13/0x126
[ 313.742667] __handle_mm_fault+0xc8d/0x11c0
[ 313.744245] handle_mm_fault+0x17a/0x390
[ 313.745755] __do_page_fault+0x24c/0x4d0
[ 313.747290] do_page_fault+0x2a/0x70
[ 313.748728] ? page_fault+0x8/0x30
[ 313.750148] page_fault+0x1e/0x30

[ 324.987240] workqueue events_freezable_power_: flags=0x84
[ 324.989292] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 324.991482] in-flight: 5:disk_events_workfn
[ 324.993371] workqueue mm_percpu_wq: flags=0x8
[ 324.995167] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 324.997363] pending: vmstat_update, drain_local_pages_wq BAR(498)

[ 378.984241] INFO: task systemd-journal:498 blocked for more than 120 seconds.
[ 378.986657] Not tainted 4.18.0-rc6-next-20180724+ #248
[ 378.988741] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 378.991343] systemd-journal D12600 498 1 0x00000000
[ 378.993434] Call Trace:
[ 378.994643] ? __schedule+0x245/0x7f0
[ 378.996158] schedule+0x23/0x80
[ 378.997528] schedule_timeout+0x21f/0x400
[ 378.999112] wait_for_completion+0xb2/0x130
[ 379.000728] ? wake_up_q+0x70/0x70
[ 379.002170] flush_work+0x1db/0x2b0
[ 379.003667] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[ 379.005507] ? page_alloc_cpu_dead+0x30/0x30
[ 379.007176] drain_all_pages+0x174/0x1e0
[ 379.008836] __alloc_pages_slowpath+0x428/0xc50
[ 379.010625] __alloc_pages_nodemask+0x2a6/0x2c0
[ 379.012364] filemap_fault+0x437/0x8e0
[ 379.013881] ? lock_acquire+0x51/0x70
[ 379.015373] ? xfs_ilock+0x86/0x190 [xfs]
[ 379.016950] __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[ 379.019019] __do_fault+0x13/0x126
[ 379.020439] __handle_mm_fault+0xc8d/0x11c0
[ 379.022055] handle_mm_fault+0x17a/0x390
[ 379.023623] __do_page_fault+0x24c/0x4d0
[ 379.025167] do_page_fault+0x2a/0x70
[ 379.026621] ? page_fault+0x8/0x30
[ 379.028022] page_fault+0x1e/0x30

[ 432.759743] systemd-journal D12600 498 1 0x00000000
[ 432.761748] Call Trace:
[ 432.762892] ? __schedule+0x245/0x7f0
[ 432.764352] schedule+0x23/0x80
[ 432.765673] schedule_timeout+0x21f/0x400
[ 432.767428] wait_for_completion+0xb2/0x130
[ 432.769207] ? wake_up_q+0x70/0x70
[ 432.770764] flush_work+0x1db/0x2b0
[ 432.772223] ? flush_workqueue_prep_pwqs+0x1b0/0x1b0
[ 432.774125] ? page_alloc_cpu_dead+0x30/0x30
[ 432.775873] drain_all_pages+0x174/0x1e0
[ 432.777528] __alloc_pages_slowpath+0x428/0xc50
[ 432.779306] __alloc_pages_nodemask+0x2a6/0x2c0
[ 432.780977] filemap_fault+0x437/0x8e0
[ 432.782503] ? lock_acquire+0x51/0x70
[ 432.784012] ? xfs_ilock+0x86/0x190 [xfs]
[ 432.785556] __xfs_filemap_fault.constprop.21+0x37/0xc0 [xfs]
[ 432.787537] __do_fault+0x13/0x126
[ 432.788930] __handle_mm_fault+0xc8d/0x11c0
[ 432.790570] handle_mm_fault+0x17a/0x390
[ 432.792162] __do_page_fault+0x24c/0x4d0
[ 432.793773] do_page_fault+0x2a/0x70
[ 432.795196] ? page_fault+0x8/0x30
[ 432.796572] page_fault+0x1e/0x30

[ 444.234824] workqueue events_freezable_power_: flags=0x84
[ 444.236937] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 444.239138] in-flight: 5:disk_events_workfn
[ 444.241022] workqueue mm_percpu_wq: flags=0x8
[ 444.242829] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 444.245057] pending: vmstat_update, drain_local_pages_wq BAR(498)

>
> [...]
>
>> Since the patch shown below was suggested by Michal Hocko at
>> https://marc.info/?l=linux-mm&m=152723708623015 , it is from Michal Hocko.
>>
>> >From cd8095242de13ace61eefca0c3d6f2a5a7b40032 Mon Sep 17 00:00:00 2001
>> From: Michal Hocko <mhocko@xxxxxxxx>
>> Date: Thu, 26 Jul 2018 14:40:03 +0900
>> Subject: [PATCH] mm,page_alloc: PF_WQ_WORKER threads must sleep at should_reclaim_retry().
>>
>> Tetsuo Handa has reported that it is possible to bypass the short sleep
>> for PF_WQ_WORKER threads which was introduced by commit 373ccbe5927034b5
>> ("mm, vmstat: allow WQ concurrency to discover memory reclaim doesn't make
>> any progress") and moved by commit ede37713737834d9 ("mm: throttle on IO
>> only when there are too many dirty and writeback pages") and lock up the
>> system if OOM.
>>
>> This is because we are implicitly counting on falling back to
>> schedule_timeout_uninterruptible() in __alloc_pages_may_oom() when
>> schedule_timeout_uninterruptible() in should_reclaim_retry() was not
>> called due to __zone_watermark_ok() == false.
>
> How do we rely on that?

Unless disk_events_workfn() (PID=5) sleeps at schedule_timeout_uninterruptible()
in __alloc_pages_may_oom(), drain_local_pages_wq() which a thread doing __GFP_FS
allocation (PID=498) is waiting for cannot be started.

>
>> However, schedule_timeout_uninterruptible() in __alloc_pages_may_oom() is
>> not called if all allocating threads but a PF_WQ_WORKER thread got stuck at
>> __GFP_FS direct reclaim, for mutex_trylock(&oom_lock) by that PF_WQ_WORKER
>> thread succeeds and out_of_memory() remains no-op unless that PF_WQ_WORKER
>> thread is doing __GFP_FS allocation.
>
> I have really hard time to parse and understand this.

You can write as you like.

>
>> Tetsuo is observing that GFP_NOIO
>> allocation request from disk_events_workfn() is preventing other pending
>> works from starting.
>
> What about any other allocation from !PF_WQ_WORKER context? Why those do
> not jump in?

All __GFP_FS allocations got stuck at direct reclaim or workqueue.

>
>> Since should_reclaim_retry() should be a natural reschedule point,
>> let's do the short sleep for PF_WQ_WORKER threads unconditionally
>> in order to guarantee that other pending works are started.
>
> OK, this is finally makes some sense. But it doesn't explain why it
> handles the live lock.

As explained above, if disk_events_workfn() (PID=5) explicitly sleeps,
vmstat_update and drain_local_pages_wq from WQ_MEM_RECLAIM workqueue will
start, and unblock PID=498 which is waiting for drain_local_pages_wq and
allow PID=498 to invoke the OOM killer.

>
>> Reported-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
>> Signed-off-by: Michal Hocko <mhocko@xxxxxxxx>
>
> Your s-o-b is missing again. I have already told you that previously
> when you were posting the patch.

I'm waiting for you to post this change with your wording.

>
> I do not mind this change per se but I am not happy about _your_ changelog.
> It doesn't explain the underlying problem IMHO. Having a natural and
> unconditional scheduling point in should_reclaim_retry is a reasonable
> thing. But how the hack it relates to the livelock you are seeing. So
> namely the changelog should explain
> 1) why nobody is able to make forward progress during direct reclaim

Because GFP_NOIO allocation from one workqueue prevented WQ_MEM_RECLAIM
workqueues from starting, for it did not call schedule_timeout_*() because
mutex_trylock(&oom_lock) did not fail because nobody else could call
__alloc_pages_may_oom().

> 2) why nobody is able to trigger oom killer as the last resort

Because only one !__GFP_FS allocating thread which did not get stuck at
direct reclaim was able to call __alloc_pages_may_oom().