Re: Reply: [PATCH v3] mm/compaction:let proactive compaction order configurable

From: Khalid Aziz
Date: Tue May 11 2021 - 11:01:59 EST


On 5/11/21 1:48 AM, Chu,Kaiping wrote:


-----Original Message-----
From: Khalid Aziz <khalid.aziz@xxxxxxxxxx>
Sent: May 7, 2021 5:27
To: David Rientjes <rientjes@xxxxxxxxxx>; Chu,Kaiping
<chukaiping@xxxxxxxxx>
Cc: mcgrof@xxxxxxxxxx; keescook@xxxxxxxxxxxx; yzaikin@xxxxxxxxxx;
akpm@xxxxxxxxxxxxxxxxxxxx; vbabka@xxxxxxx; nigupta@xxxxxxxxxx;
bhe@xxxxxxxxxx; iamjoonsoo.kim@xxxxxxx; mateusznosek0@xxxxxxxxx;
sh_def@xxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; linux-fsdevel@xxxxxxxxxxxxxxx;
linux-mm@xxxxxxxxx
Subject: Re: [PATCH v3] mm/compaction:let proactive compaction order
configurable

On 4/25/21 9:15 PM, David Rientjes wrote:
On Sun, 25 Apr 2021, chukaiping wrote:

Currently the proactive compaction order is fixed to
COMPACTION_HPAGE_ORDER(9). This is OK on most machines with lots of
normal 4KB memory, but it is too high for machines with little
normal memory, for example machines with most memory configured
as 1GB hugetlbfs huge pages. On these machines the max order of free
pages is often below 9, and it stays below 9 even with hard
compaction. This leads to proactive compaction being triggered very
frequently. On these machines we only care about order 3 or 4.
This patch exports the order to proc and lets the user configure it;
the default value is still COMPACTION_HPAGE_ORDER.


As asked in the review of the v1 of the patch, why is this not a
userspace policy decision? If you are interested in order-3 or
order-4 fragmentation, for whatever reason, you could periodically
check /proc/buddyinfo and manually invoke compaction on the system.
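
A rough sketch of what such a userspace check could look like, in Python.
The threshold (min_blocks), the choice of the "Normal" zone and all
function names are assumptions made for illustration, not part of the
patch or of any existing tool. Writing to /proc/sys/vm/compact_memory
asks the kernel to compact all zones and needs root.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    zones = {}
    for line in text.splitlines():
        parts = line.split()
        # Expected shape: "Node 0, zone Normal c0 c1 c2 ..."
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        node = int(parts[1].rstrip(","))
        zones[(node, parts[3])] = [int(c) for c in parts[4:]]
    return zones


def free_blocks_at_or_above(counts, order):
    """Number of free blocks whose order is >= the requested order."""
    return sum(counts[order:])


def needs_compaction(text, order=3, min_blocks=16, zone="Normal"):
    """True if any matching zone has fewer than min_blocks free blocks
    of at least the given order. min_blocks is an arbitrary example."""
    return any(
        name == zone and free_blocks_at_or_above(counts, order) < min_blocks
        for (_node, name), counts in parse_buddyinfo(text).items()
    )


def trigger_compaction(path="/proc/sys/vm/compact_memory"):
    """Ask the kernel to compact all zones (requires root)."""
    with open(path, "w") as f:
        f.write("1")
```

A periodic job could read /proc/buddyinfo, pass its contents to
needs_compaction() and call trigger_compaction() when it returns True.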

In other words, why does this need to live in the kernel?


I have struggled with this question. Fragmentation and allocation stalls are
significant issues on large database systems which also happen to use memory
in similar ways (90+% of memory is allocated as hugepages) leaving just
enough memory to run the rest of the userspace processes. I had originally
proposed a kernel patch to monitor, do a trend analysis of memory usage and
take proactive action -
<https://lore.kernel.org/lkml/20190813014012.30232-1-khalid.aziz@oracle.com/>.
Based upon feedback, I moved the implementation to userspace -
<https://github.com/oracle/memoptimizer>. Test results across multiple
workloads have been very good. Results from one of the workloads are in this
blog - <https://blogs.oracle.com/linux/anticipating-your-memory-needs>. It
works well from userspace but it has limited ways to influence reclamation and
compaction. It uses watermark_scale_factor to boost watermarks and cause
reclamation to kick in earlier and run longer. It uses
/sys/devices/system/node/node%d/compact to force compaction on the node
expected to reach a high level of fragmentation soon. Neither of these is very
efficient from userspace even though they get the job done. Scaling the watermark
has a longer lasting impact than raising scanning priority in balance_pgdat()
temporarily. I plan to experiment with watermark_boost_factor to see if I can
use it in place of /sys/devices/system/node/node%d/compact and get the
same results. Doing all of this in the kernel can be more efficient and lessen
potential negative impact on the system. On the other hand, it is easier to fix
and update such policies in userspace although at the cost of having a
performance critical component live outside the kernel and thus not be active
on the system by default.
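
For readers unfamiliar with the two knobs mentioned above, here is a
minimal Python sketch of how a userspace tool might drive them. This is
not memoptimizer's actual code; the function names and the sysfs_root /
procfs_root parameters are made up for illustration (the latter only so
the functions can be exercised against a scratch directory). Both writes
need root on a real system.

```python
import os


def node_compact_path(node, sysfs_root="/sys"):
    """Path of the per-node compaction trigger for the given NUMA node."""
    return os.path.join(
        sysfs_root, "devices/system/node", "node%d" % node, "compact"
    )


def compact_node(node, sysfs_root="/sys"):
    """Force compaction on a single node by writing 1 to its trigger."""
    with open(node_compact_path(node, sysfs_root), "w") as f:
        f.write("1")


def set_watermark_scale_factor(value, procfs_root="/proc"):
    """Set vm.watermark_scale_factor; larger values widen the gap between
    watermarks so that reclaim starts earlier and runs longer."""
    path = os.path.join(procfs_root, "sys/vm/watermark_scale_factor")
    with open(path, "w") as f:
        f.write(str(int(value)))
```

How far to raise the scale factor, and when to lower it again, is the
policy part and is deliberately left out of this sketch.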

I have been studying your memoptimizer over the past few days, and I also agree with moving the implementation into the kernel so that it works together with the current proactive compaction mechanism and achieves higher efficiency.
By the way, I am interested in memoptimizer and would like to test it. How should I evaluate its effectiveness?



If you look at the blog I wrote on memoptimizer - <https://blogs.oracle.com/linux/anticipating-your-memory-needs>, under the section "Measuring stalls" I describe the workload I used to measure its effectiveness. The metric I use is the number of allocation/compaction stalls over a multi-hour run of the workload. The number of allocation/compaction stalls gives an idea of how effective the system is at proactively keeping free order-0 and larger pages available. Any workload that runs into a significant number of stalls is a good workload for measuring effectiveness.
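
One possible way to collect that metric, sketched in Python: snapshot
/proc/vmstat before and after the run and take the difference of the
stall counters. The sketch assumes the counters are named compact_stall
and allocstall* (recent kernels split allocstall per zone, e.g.
allocstall_normal); the helper names are made up for illustration and
this is not the script used for the blog.

```python
def parse_vmstat(text):
    """Parse "name value" lines from /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def stall_counts(counters):
    """Sum all allocstall* counters and pick out compact_stall."""
    alloc = sum(v for k, v in counters.items() if k.startswith("allocstall"))
    return {
        "allocstall": alloc,
        "compact_stall": counters.get("compact_stall", 0),
    }


def stall_delta(before_text, after_text):
    """Stalls accumulated between two /proc/vmstat snapshots."""
    before = stall_counts(parse_vmstat(before_text))
    after = stall_counts(parse_vmstat(after_text))
    return {name: after[name] - before[name] for name in after}
```

Reading /proc/vmstat once at the start of the workload and once at the
end, then passing both strings to stall_delta(), gives the stall counts
for that run; comparing runs with and without the tool active shows its
effect.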

--
Khalid