Re: [PATCH] mm: thp: move deferred split queue to memcg's nodeinfo

From: Yang Shi
Date: Tue Oct 08 2019 - 19:09:45 EST

Next message: Rafael J. Wysocki: "Re: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems"
Previous message: Paul Walmsley: "[PATCH] Documentation: admin-guide: add earlycon documentation for the sifive serial driver"
In reply to: Michal Hocko: "Re: [PATCH] mm: thp: move deferred split queue to memcg's nodeinfo"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On 10/8/19 7:44 AM, Kirill A. Shutemov wrote:

On Mon, Oct 07, 2019 at 04:30:30PM +0200, Michal Hocko wrote:

On Mon 07-10-19 16:19:59, Vlastimil Babka wrote:

On 10/2/19 10:43 AM, Michal Hocko wrote:

On Wed 02-10-19 06:16:43, Yang Shi wrote:

The commit 87eaceb3faa59b9b4d940ec9554ce251325d83fe ("mm: thp: make
deferred split shrinker memcg aware") makes deferred split queue per
memcg to resolve memcg pre-mature OOM problem. But, all nodes end up
sharing the same queue instead of one queue per-node before the commit.
It is not a big deal for memcg limit reclaim, but it may cause global
kswapd shrink THPs from a different node.

And, 0-day testing reported -19.6% regression of stress-ng's madvise
test [1]. I didn't see that much regression on my test box (24 threads,
48GB memory, 2 nodes), with the same test (stress-ng --timeout 1
--metrics-brief --sequential 72 --class vm --exclude spawn,exec), I saw
average -3% (run the same test 10 times then calculate the average since
the test itself may have most 15% variation according to my test)
regression sometimes (not every time, sometimes I didn't see regression
at all).

This might be caused by deferred split queue lock contention. With some
configuration (i.e. just one root memcg) the lock contention my be worse
than before (given 2 nodes, two locks are reduced to one lock).

So, moving deferred split queue to memcg's nodeinfo to make it NUMA
aware again.

With this change stress-ng's madvise test shows average 4% improvement
sometimes and I didn't see degradation anymore.

My concern about this getting more and more complex
(http://lkml.kernel.org/r/20191002084014.GH15624@xxxxxxxxxxxxxx) holds
here even more. Can we step back and reconsider the whole thing please?

What about freeing immediately after split via workqueue and also have a
synchronous version called before going oom? Maybe there would be also
other things that would benefit from this scheme instead of traditional
reclaim and shrinkers?

That is exactly what we have discussed some time ago.

Yes, I've posted the patch:

http://lkml.kernel.org/r/20190827125911.boya23eowxhqmopa@box

But I still not sure that the approach is right. I expect it to trigger
performance regressions. For system with pleanty of free memory, we will
just pay split cost for nothing in many cases.

This is exactly what I'm concerned about as well.

Next message: Rafael J. Wysocki: "Re: [RFC/RFT][PATCH v8] cpuidle: New timer events oriented governor for tickless systems"
Previous message: Paul Walmsley: "[PATCH] Documentation: admin-guide: add earlycon documentation for the sifive serial driver"
In reply to: Michal Hocko: "Re: [PATCH] mm: thp: move deferred split queue to memcg's nodeinfo"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]