Re: [PATCH v5 00/21] Free some vmemmap pages of hugetlb page

From: David Hildenbrand
Date: Fri Nov 20 2020 - 13:01:01 EST


On 20.11.20 18:45, Mike Kravetz wrote:
On 11/20/20 1:43 AM, David Hildenbrand wrote:
On 20.11.20 10:39, Michal Hocko wrote:
On Fri 20-11-20 10:27:05, David Hildenbrand wrote:
On 20.11.20 09:42, Michal Hocko wrote:
On Fri 20-11-20 14:43:04, Muchun Song wrote:
[...]

Thanks for improving the cover letter and providing some numbers. I have
only glanced through the patchset because I didn't really have more time
to dive depply into them.

Overall it looks promissing. To summarize. I would prefer to not have
the feature enablement controlled by compile time option and the kernel
command line option should be opt-in. I also do not like that freeing
the pool can trigger the oom killer or even shut the system down if no
oom victim is eligible.

One thing that I didn't really get to think hard about is what is the
effect of vmemmap manipulation wrt pfn walkers. pfn_to_page can be
invalid when racing with the split. How do we enforce that this won't
blow up?

I have the same concerns - the sections are online the whole time and
anybody with pfn_to_online_page() can grab them

I think we have similar issues with memory offlining when removing the
vmemmap, it's just very hard to trigger and we can easily protect by
grabbing the memhotplug lock.

I am not sure we can/want to span memory hotplug locking out to all pfn
walkers. But you are right that the underlying problem is similar but
much harder to trigger because vmemmaps are only removed when the
physical memory is hotremoved and that happens very seldom. Maybe it
will happen more with virtualization usecases. But this work makes it
even more tricky. If a pfn walker races with a hotremove then it would
just blow up when accessing the unmapped physical address space. For
this feature a pfn walker would just grab a real struct page re-used for
some unpredictable use under its feet. Any failure would be silent and
hard to debug.

Right, we don't want the memory hotplug locking, thus discussions regarding rcu. Luckily, for now I never saw a BUG report regarding this - maybe because the time between memory offlining (offline_pages()) and memory/vmemmap getting removed (try_remove_memory()) is just too long. Someone would have to sleep after pfn_to_online_page() for quite a while to trigger it.


[...]
To keep things easy, maybe simply never allow to free these hugetlb pages
again for now? If they were reserved during boot and the vmemmap condensed,
then just let them stick around for all eternity.

Not sure I understand. Do you propose to only free those vmemmap pages
when the pool is initialized during boot time and never allow to free
them up? That would certainly make it safer and maybe even simpler wrt
implementation.

Exactly, let's keep it simple for now. I guess most use cases of this (virtualization, databases, ...) will allocate hugepages during boot and never free them.

Not sure if I agree with that last statement. Database and virtualization
use cases from my employer allocate allocate hugetlb pages after boot. It
is shortly after boot, but still not from boot/kernel command line.

Right, but the ones that care about this optimization for now could be converted, I assume? I mean we are talking about "opt-in" from sysadmins, so requiring to specify a different cmdline parameter does not sound to weird to me. And it should simplify a first version quite a lot.

The more I think about this, the more I believe doing these vmemmap modifications after boot are very dangerous.


Somewhat related, but not exactly addressing this issue ...

One idea discussed in a previous patch set was to disable PMD/huge page
mapping of vmemmap if this feature was enabled. This would eliminate a bunch
of the complex code doing page table manipulation. It does not address
the issue of struct page pages going away which is being discussed here,
but it could be a way to simply the first version of this code. If this
is going to be an 'opt in' feature as previously suggested, then eliminating
the PMD/huge page vmemmap mapping may be acceptable. My guess is that
sysadmins would only 'opt in' if they expect most of system memory to be used
by hugetlb pages. We certainly have database and virtualization use cases
where this is true.

It sounds like a hack to me, which does not fully solve the problem. But yeah, it's a simplification.

--
Thanks,

David / dhildenb