Re: OOM detection regressions since 4.7
From: Michal Hocko
Date: Tue Aug 23 2016 - 03:34:02 EST
On Tue 23-08-16 13:52:45, Joonsoo Kim wrote:
[...]
> Hello, Michal.
>
> I agree with partial revert but revert should be a different form.
> Below change try to reuse should_compact_retry() version for
> !CONFIG_COMPACTION but it turned out that it also causes regression in
> Markus report [1].
I would argue that CONFIG_COMPACTION=n behaves so arbitrary for high
order workloads that calling any change in that behavior a regression
is little bit exaggerated. Disabling compaction should have a very
strong reason. I haven't heard any so far. I am even wondering whether
there is a legitimate reason for that these days.
> Theoretical reason for this regression is that it would stop retry
> even if there are enough lru pages. It only checks if freepage
> excesses min watermark or not for retry decision. To prevent
> pre-mature OOM killer, we need to keep allocation loop when there are
> enough lru pages. So, logic should be something like that.
>
> should_compact_retry()
> {
> for_each_zone_zonelist_nodemask {
> available = zone_reclaimable_pages(zone);
> available += zone_page_state_snapshot(zone, NR_FREE_PAGES);
> if (__zone_watermark_ok(zone, *0*, min_wmark_pages(zone),
> ac_classzone_idx(ac), alloc_flags, available))
> return true;
>
> }
> }
>
> I suggested it before and current situation looks like it is indeed
> needed.
this just opens doors for an unbounded reclaim/threshing becacause
you can reclaim as much as you like and there is no guarantee of a
forward progress. The reason why !COMPACTION should_compact_retry only
checks for the min_wmark without the reclaimable bias is that this will
guarantee a retry if we are failing due to high order wmark check rather
than a lack of memory. This condition is guaranteed to converge and the
probability of the unbounded reclaim is much more reduced.
> And, I still think that your OOM detection rework has some flaws.
>
> 1) It doesn't consider freeable objects that can be freed by shrink_slab().
> There are many subsystems that cache many objects and they will be
> freed by shrink_slab() interface. But, you don't account them when
> making the OOM decision.
I fully rely on the reclaim and compaction feedback. And that is the
place where we should strive for improvements. So if we are growing way
too many slab objects we should take care about that in the slab reclaim
which is tightly coupled with the LRU reclaim rather than up the layer
in the page allocator.
> Think about following situation that we are trying to find order-2
> freepage and some subsystem has order-2 freepage. It can be freed by
> shrink_slab(). Your logic doesn't guarantee that shrink_slab() is
> invoked to free this order-2 freepage in that subsystem. OOM would be
> triggered when compaction fails even if there is a order-2 freeable
> page. I think that if decision is made before whole lru list is
> scanned and then shrink_slab() is invoked for whole freeable objects,
> it would cause pre-mature OOM.
I do not see why we would need to scan through the whole LRU list when
we are under a high order pressure. It is true, though, that slab
shrinkers can and should be more sensitive to the requested order to
help release higher order pages preferably.
> It seems that you already knows this issue [2].
>
> 2) 'OOM detection rework' depends on compaction too much. Compaction
> algorithm is racy and has some limitation. It's failure doesn't mean we
> are in OOM situation.
As long as this is the only reliable source of higher order pages then
we do not have any other choice in order to have deterministic behavior.
> Even if Vlastimil's patchset and mine is
> applied, it is still possible that compaction scanner cannot find enough
> freepage due to race condition and return pre-mature failure. To
> reduce this race effect, I hope to give more chances to retry even if
> full compaction is failed.
Than we can improve compaction_failed() heuristic and do not call it the
end of the day after a single attempt to get a high order page after
scanning the whole memory. But to me this all sounds like an internal
implementation detail of the compaction and the OOM detection in the
page allocator should be as much independent on it as possible - same as
it is independent on the internal reclaim decisions. That was the whole
point of my rework. To actually melt "do something as long as at least a
single page is reclaimed" into an actual algorithm which can be measured
and reason about.
> We can remove this heuristic when we make sure that compaction is
> stable enough.
How do we know that, though, if we do not rely on it? Artificial tests
do not exhibit those corner cases. I was bashing my testing systems to
cause as much fragmentation as possible, yet I wasn't able to trigger
issues reported recently by real world workloads. Do not take me wrong,
I understand your concerns but OOM detection will never be perfect. We
can easily get to one or other extremes. We should strive to make it
work in most workloads. So far it seems that there were no regressions
for order-0 pressure and we can improve compaction to cover higher
orders. I am willing to reconsider this after we hit a cliff where we
cannot do much more in the compaction proper and still hit pre-mature
oom killer invocations in not-so-insane workloads, though.
I believe that Vlastimil's patches show the path to go longterm. Get rid
of the latency heuristics for allocations where that matters in the
first step. Then try to squeeze as much for reliability for !costly
orders as possible.
I also believe that these issues will be less of the problem once we
switch to vmalloc stacks because this is the primary source of high
order allocations these days. Most others are more an optimization than
a reliability thing.
> As you know, I said these things several times but isn't accepted.
> Please consider them more deeply at this time.
>
> Thanks.
>
> [1] http://lkml.kernel.org/r/20160731051121.GB307@x4
> [2] https://bugzilla.opensuse.org/show_bug.cgi?id=994066
--
Michal Hocko
SUSE Labs