Re: [RFC] mm: Proactive compaction

From: David Rientjes
Date: Tue Sep 17 2019 - 16:26:56 EST


On Tue, 17 Sep 2019, John Hubbard wrote:

> > We've had good success with periodically compacting memory on a regular
> > cadence on systems with hugepages enabled. The cadence itself is defined
> > by the admin but it causes khugepaged[*] to periodically wakeup and invoke
> > compaction in an attempt to keep zones as defragmented as possible
>
> That's an important data point, thanks for reporting it.
>
> And given that we have at least one data point validating it, I think we
> should feel fairly comfortable with this approach. Because the sys admin
> probably knows when are the best times to steal cpu cycles and recover
> some huge pages. Unlike the kernel, the sys admin can actually see the
> future sometimes, because he/she may know what is going to be run.
>
> It's still sounding like we can expect excellent results from simply
> defragmenting from user space, via a cron job and/or before running
> important tests, rather than trying to have the kernel guess whether
> it's a performance win to defragment at some particular time.
>
> Are you using existing interfaces, or did you need to add something? How
> exactly are you triggering compaction?
>

It's possible to do this through a cron job but there are a few reasons
that we preferred to do it through khugepaged:

- we use a lighter variation of compaction, MIGRATE_SYNC_LIGHT, than what
the per-node trigger provides since compact_node() forces MIGRATE_SYNC
and can stall for minutes and become disruptive under some
circumstances,

- we do not ignore the pageblock skip hint which compact_node() hardcodes
to ignore, and

- we didn't want to do this in process context so that the cpu time is
not taxed to any user cgroup since it's on behalf of the system as a
whole.

It seems much better to do this on a per-node basis, so that the work can
be partitioned, rather than through the sysctl that compacts the whole
system. Extending the per-node interface to use MIGRATE_SYNC_LIGHT and to
not ignore the pageblock skip hint is possible, but the work would still
be done in process context. If triggered from userspace, the process
would need to be attached to a cgroup that is not taxed for cpu time
spent on behalf of the entire system.
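
For reference, a rough sketch of what the userspace alternative looks
like: writing to each node's "compact" file under
/sys/devices/system/node, one node at a time. The helper names and the
sysfs_root parameter are hypothetical and only there so the logic can be
tried against a scratch directory. This path goes through compact_node(),
so it still uses MIGRATE_SYNC, still ignores the pageblock skip hint, and
runs in the context (and cgroup) of the writing process.

```python
import os
import re
import time

NODE_DIR_RE = re.compile(r"^node(\d+)$")
DEFAULT_ROOT = "/sys/devices/system/node"


def list_nodes(sysfs_root=DEFAULT_ROOT):
    """Return the sorted ids of NUMA nodes that expose a 'compact' file."""
    nodes = []
    for name in os.listdir(sysfs_root):
        match = NODE_DIR_RE.match(name)
        if match and os.path.exists(os.path.join(sysfs_root, name, "compact")):
            nodes.append(int(match.group(1)))
    return sorted(nodes)


def compact_node(nid, sysfs_root=DEFAULT_ROOT):
    """Ask the kernel to compact one node (needs root on a real system)."""
    path = os.path.join(sysfs_root, f"node{nid}", "compact")
    with open(path, "w") as f:
        f.write("1")


def compact_all_nodes(sysfs_root=DEFAULT_ROOT, pause_seconds=0.0):
    """Compact nodes one at a time, optionally pausing between them."""
    done = []
    for nid in list_nodes(sysfs_root):
        compact_node(nid, sysfs_root)
        done.append(nid)
        time.sleep(pause_seconds)
    return done
```

Run from cron, something like compact_all_nodes(pause_seconds=5) would
spread the work out per node, but the cpu time is still charged to
whatever cgroup the cron job runs in.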

Again, we're using khugepaged and allowing the period to be defined
through /sys/kernel/mm/transparent_hugepage/khugepaged but that is because
we only want to do this on systems where we want to dynamically allocate
hugepages on a regular basis.
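
The name of the file that holds the period is not given above, so the
following sketch does not guess at it. It only dumps whatever tunables
are present in the khugepaged directory (upstream that includes files
such as scan_sleep_millisecs and alloc_sleep_millisecs), which is enough
to see what a given kernel exposes there. The function name and the path
parameter are hypothetical.

```python
import os

KHUGEPAGED_DIR = "/sys/kernel/mm/transparent_hugepage/khugepaged"


def read_khugepaged_tunables(path=KHUGEPAGED_DIR):
    """Return {file name: contents} for each readable file in the directory."""
    values = {}
    for name in sorted(os.listdir(path)):
        full = os.path.join(path, name)
        if not os.path.isfile(full):
            continue  # skip subdirectories and anything that is not a file
        try:
            with open(full) as f:
                values[name] = f.read().strip()
        except OSError:
            continue  # unreadable entries are skipped
    return values


if __name__ == "__main__" and os.path.isdir(KHUGEPAGED_DIR):
    for name, value in read_khugepaged_tunables().items():
        print(f"{name} = {value}")
```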