Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

From: Michal Hocko
Date: Tue Oct 27 2015 - 05:22:38 EST


On Sun 25-10-15 19:52:59, Tetsuo Handa wrote:
[...]
> Three approaches are proposed for fixing this silent livelock problem.
>
> (1) Use zone_page_state_snapshot() instead of zone_page_state()
> when doing zone_reclaimable() checks. This approach is clear,
> straightforward and easy to backport. So far I cannot reproduce
> this livelock using this change. But there might be more locations
> which should use zone_page_state_snapshot().
>
> (2) Use a dedicated workqueue for the vmstat_update item which is guaranteed
> to be processed immediately. So far I cannot reproduce this livelock
> using a dedicated workqueue created with WQ_MEM_RECLAIM|WQ_HIGHPRI
> (patch proposed by Christoph Lameter). But according to Tejun Heo,
> if we want to guarantee that nobody can reproduce this livelock, we
> need to modify the workqueue API, because commit 3270476a6c0c
> ("workqueue: reimplement WQ_HIGHPRI using a separate worker_pool"),
> which went into Linux 3.6, lost that guarantee.
>
> (3) Use a !TASK_RUNNING sleep on the page allocator side. This approach
> is easy to backport. So far I cannot reproduce this livelock using
> this approach, and I think that nobody can, because it changes the
> page allocator to obey the workqueue's expectations. Even leaving
> this livelock problem aside, not entering a !TASK_RUNNING state for
> too long monopolizes the workqueue and needlessly defers other items,
> including items which do not invoke a __GFP_WAIT allocation and
> therefore do not need to be deferred.
>
> This patch implements approach (3), inserting an uninterruptible sleep
> on the page allocator side before retrying, in order to make sure that
> other workqueue items (especially the vmstat_update item) are given a
> chance to be processed.
>
> Although it is a different problem, approach (3) also alleviates the
> needless burning of CPU cycles when we hit the OOM-killer livelock
> (a hang after the OOM-killer messages are printed, because the OOM
> victim cannot terminate due to a dependency).

I really dislike this approach. Waiting without an event to wait for is
just too ugly. I think 1) is the easiest to backport to stable kernels
without causing any other regressions. 2) is the way to move forward for
future kernels, and we should really think about whether WQ_MEM_RECLAIM
should also imply WQ_HIGHPRI by default. If there is a general consensus
that there are legitimate WQ_MEM_RECLAIM users which can do without the
other flag, then I am perfectly OK with using it for the vmstat and oom
sysrq dedicated workqueues.

> Signed-off-by: Tetsuo Handa <penguin-kernel@xxxxxxxxxxxxxxxxxxx>
[...]
--
Michal Hocko
SUSE Labs