Re: [RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom

From: Eiichi Tsukata
Date: Wed Feb 17 2021 - 06:09:00 EST


Hi All,

First, thank you for your careful review and attention to my patch
(and apologies for top-posting!). Let me explain why our use case
requires hugetlb rather than THP, then describe the difficulty we have
keeping the right number of hugepages in the pool, and finally why the
proposed approach would help us. Hopefully this extends to other use
cases and helps justify the proposal.

We use Linux to operate a KVM-based hypervisor. Using hugepages to
back VM memory significantly increases performance and density. Each
VM also incurs an overhead in regular 4k pages, which can vary
drastically even at runtime (e.g. depending on network traffic). In
addition, the software doesn't know upfront if users will power on one
large VM or several small VMs.

To manage the varying balance of 4k pages vs. hugepages, we originally
leveraged THP. However, constant fragmentation due to VM power cycles,
the varying overhead I mentioned above, and other operations like
reconfiguration of NIC RX buffers resulted in two problems:
1) There were no guarantees hugepages would be used; and
2) Constant memory compaction incurred a measurable overhead.

Having a userspace service managing hugetlb gave us significant
performance advantages and much needed determinism. It chooses when to
try and create more hugepages as well as how many hugepages to go
after. Elements like how many hugepages it actually gets, combined
with what operations are happening on the host, allow our service to
make educated decisions about when to compact memory, drop caches, and
retry growing (or shrinking) the pool.
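
As a rough illustration of that grow-and-retry loop (a sketch only, not
the actual service), the following assumes the standard sysctl files
under /proc/sys/vm: nr_hugepages, drop_caches and compact_memory. The
function names, the retry count and the vm_dir parameter are
hypothetical, and writing to the real files needs root.

```python
import os

PROC_VM = "/proc/sys/vm"  # real sysctl directory; writes need root


def read_int(path):
    with open(path) as f:
        return int(f.read().strip())


def write_str(path, value):
    with open(path, "w") as f:
        f.write(value)


def grow_pool(target, retries=3, vm_dir=PROC_VM):
    """Try to grow the hugetlb pool to `target` pages.

    Returns the pool size actually obtained, which may be lower than
    `target` when memory is fragmented. `retries` is an arbitrary
    illustrative default, not a recommended value.
    """
    nr = os.path.join(vm_dir, "nr_hugepages")
    got = read_int(nr)
    for _ in range(retries):
        write_str(nr, str(target))
        got = read_int(nr)  # the kernel may grant fewer pages than asked
        if got >= target:
            break
        # Make contiguous memory more likely before the next attempt:
        # drop clean page cache and slab objects, then compact all zones.
        write_str(os.path.join(vm_dir, "drop_caches"), "3")
        write_str(os.path.join(vm_dir, "compact_memory"), "1")
    return got
```

A real service would decide whether compaction or dropping caches is
worth the cost based on what is happening on the host, rather than
doing both unconditionally as this sketch does.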

But that comes with a challenge: despite listening for cgroup memory
pressure notifications (triggered by the runtime events we do not
control), the service is not guaranteed to sacrifice hugepages fast
enough, and the result is an OOM. The OOM killer will normally take out
a VM even if there are plenty of unused hugepages, which is obviously
disruptive for users. For us, free hugepages are almost always expendable.
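
To make "unused hugepages" concrete, here is a minimal sketch that
estimates how much memory could be handed back, assuming the usual
/proc/meminfo fields (HugePages_Free, HugePages_Rsvd, Hugepagesize).
Reserved pages are still counted as free, so they are subtracted. The
helper names and the sample numbers are made up for illustration.

```python
def parse_meminfo(text):
    """Return {field: int} from /proc/meminfo-style text (units dropped)."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if sep and rest.split():
            fields[name.strip()] = int(rest.split()[0])
    return fields


def expendable_hugepages(text):
    """Free hugepages not backing a reservation, and their size in kB."""
    info = parse_meminfo(text)
    pages = max(info["HugePages_Free"] - info["HugePages_Rsvd"], 0)
    return pages, pages * info["Hugepagesize"]


# Made-up sample, not taken from a real host.
SAMPLE = """\
MemAvailable:     204800 kB
HugePages_Total:    1024
HugePages_Free:      300
HugePages_Rsvd:       44
Hugepagesize:       2048 kB
"""

print(expendable_hugepages(SAMPLE))
```

On a live host the same function can be fed the contents of
/proc/meminfo instead of the sample string.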

For the bloat cases which are predictable, a memory management service
can adjust the hugepage pool size ahead of time. But it can be hard to
anticipate all scenarios, and some can be very volatile. Having a
failsafe mechanism as proposed in this patch offers invaluable
protection when things are missed.

The proposal solves this problem by sacrificing hugepages inline even
when the pressure comes from kernel allocations. The userspace service
can later readjust the pool size without being under pressure. Given
this is configurable and off by default, we thought it would be a
nice addition to the kernel and appreciated by other users who may
have similar requirements.

I welcome your comments and thank you again for your time!

Eiichi

> On Feb 17, 2021, at 16:57, Michal Hocko <mhocko@xxxxxxxx> wrote:
>
> On Tue 16-02-21 14:30:15, Mike Kravetz wrote:
> [...]
>> However, this is an 'opt in' feature. So, I would not expect anyone who
>> carefully plans the size of their hugetlb pool to enable such a feature.
>> If there is a use case where hugetlb pages are used in a non-essential
>> application, this might be of use.
>
> I would really like to hear about the specific usecase. Because it
> smells more like a misconfiguration. What would be non-essential hugetlb
> pages? This is not a resource to be pre-allocated just in case, right?
>
> --
> Michal Hocko
> SUSE Labs