Re: [PATCH v3] mm/compaction:let proactive compaction order configurable

From: Khalid Aziz
Date: Thu May 06 2021 - 17:27:56 EST

On 4/25/21 9:15 PM, David Rientjes wrote:
> On Sun, 25 Apr 2021, chukaiping wrote:
>
>> Currently the proactive compaction order is fixed to
>> COMPACTION_HPAGE_ORDER(9). This is fine on most machines with lots
>> of normal 4KB memory, but it is too high for machines with little
>> normal memory, for example machines with most memory configured
>> as 1GB hugetlbfs huge pages. On these machines the max order of
>> free pages is often below 9, and it stays below 9 even with hard
>> compaction. This leads to proactive compaction being triggered very
>> frequently. On these machines we only care about order 3 or 4.
>> This patch exports the order to proc and lets the user configure
>> it; the default value is still COMPACTION_HPAGE_ORDER.
>
> As asked in the review of the v1 of the patch, why is this not a userspace
> policy decision? If you are interested in order-3 or order-4
> fragmentation, for whatever reason, you could periodically check
> /proc/buddyinfo and manually invoke compaction on the system.
>
> In other words, why does this need to live in the kernel?

I have struggled with this question. Fragmentation and allocation stalls are significant issues on large database systems, which happen to use memory in a similar way (90+% of memory is allocated as hugepages), leaving just enough memory to run the rest of the userspace processes.

I had originally proposed a kernel patch to monitor memory usage, do a trend analysis and take proactive action - <>. Based upon feedback, I moved the implementation to userspace - <>. Test results across multiple workloads have been very good. Results from one of the workloads are in this blog - <>.

It works well from userspace, but it has limited ways to influence reclamation and compaction. It uses watermark_scale_factor to boost watermarks and cause reclamation to kick in earlier and run longer. It uses /sys/devices/system/node/node%d/compact to force compaction on the node expected to reach a high level of fragmentation soon. Neither of these is very efficient from userspace, even though they get the job done. Scaling the watermarks has a longer-lasting impact than temporarily raising the scanning priority in balance_pgdat(). I plan to experiment with watermark_boost_factor to see if I can use it in place of /sys/devices/system/node/node%d/compact and get the same results.

Doing all of this in the kernel can be more efficient and lessen the potential negative impact on the system. On the other hand, it is easier to fix and update such policies in userspace, although at the cost of having a performance-critical component live outside the kernel and thus not be active on the system by default.
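
For illustration only, the userspace approach discussed above (check /proc/buddyinfo, then force compaction on a node through /sys/devices/system/node/node%d/compact) could be sketched roughly as follows. This is not the implementation referred to above. The function names, the SAMPLE data, the restriction to the "Normal" zone and the min_blocks threshold are all assumptions made up for the example, and writing to the compact file requires root and a kernel built with compaction and NUMA support.

```python
import re
from pathlib import Path

# Made-up sample in the /proc/buddyinfo layout: one column per order (0..10).
SAMPLE = """\
Node 0, zone      DMA      1      1      1      0      2      1      1      0      1      1      3
Node 0, zone   Normal    500    200     50      2      1      0      0      0      0      0      0
Node 1, zone   Normal    100     80     60     40     20     10      5      2      1      0      0
"""

LINE_RE = re.compile(r"Node\s+(\d+),\s+zone\s+(\S+)\s+(.*)")


def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for each order]}."""
    info = {}
    for line in text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue
        counts = [int(field) for field in m.group(3).split()]
        info[(int(m.group(1)), m.group(2))] = counts
    return info


def nodes_needing_compaction(text, order, min_blocks, zone="Normal"):
    """List nodes whose `zone` has fewer than `min_blocks` free blocks
    of at least `order`. Both thresholds are hypothetical policy knobs."""
    nodes = []
    for (node, zone_name), counts in parse_buddyinfo(text).items():
        if zone_name == zone and sum(counts[order:]) < min_blocks:
            nodes.append(node)
    return sorted(nodes)


def compact_node(node):
    """Ask the kernel to compact one NUMA node (needs root)."""
    Path(f"/sys/devices/system/node/node{node}/compact").write_text("1")


if __name__ == "__main__":
    # Dry run against the sample data; a real tool would read
    # Path("/proc/buddyinfo").read_text() and call compact_node().
    for node in nodes_needing_compaction(SAMPLE, order=3, min_blocks=10):
        print(f"node {node} is short of order>=3 blocks")
```

A periodic job built this way only reacts to the current snapshot. It does no trend analysis, and it cannot adjust reclaim behavior beyond what the sysctl and sysfs knobs expose, which is the limitation described above.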