Re: PROBLEM: Remapping hugepages mappings causes kernel to return EINVAL
From: C.Wehrmeyer
Date: Tue Oct 24 2017 - 03:42:55 EST
On 2017-10-23 20:02, Michal Hocko wrote:
> On Mon 23-10-17 19:52:27, C.Wehrmeyer wrote:
> [...]
>>> or you can mmap a larger block and
>>> munmap the initial unaligned part.
>>
>> And how is that supposed to be transparent? When I hear "transparent" I
>> think of a mechanism which I can put under a system so that it benefits
>> from it, while the system does not notice or at least does not need to
>> be aware of it. The system also does not need to be changed for it.
>
> How do you expect to get a huge page when the mapping itself is not
> properly aligned?
There are four ways I can think of off the top of my head, but only
one of them would be actually transparent.
1. Provide a flag to mmap, which might be something different from
MAP_HUGETLB. After all, your question revolved merely around proper
alignment - we don't want to *force* the kernel to reserve hugepages,
we just want it to provide the proper alignment in this case. That
wouldn't be very transparent, but it would be the easiest route to go
(and mmap already kind-of supports such a thing).
2. Based on transparent_hugepage/enabled, always churn out properly
aligned pages. In this case madvise(MADV_HUGEPAGE) becomes obsolete -
after all, it's mmap that decides what kind of addresses we get. First
getting *some* mapping that isn't properly aligned for hugepages and
*then* trying to mitigate the damage with another syscall not only
defies the meaning of "transparent", but might also be hard to
implement kernel-side. Say I map 8 MiB of memory without mmap knowing
that I'd prefer it to be backed by THPs. I could either go your route
(map 8 MiB and then some, trim the unaligned head and tail, and then
tell madvise that all of it is now going to be hugepages - something
the kernel could easily do itself, with its internal knowledge of the
actual page size and without all the context switches incurred by
mapping, munmapping, munmapping *again* and then *madvising* the
actual memory), or I'd go with my third option.
3. I map 8 MiB, get some misaligned address from mmap, and then try to
mitigate the damage by telling madvise that all of it is now supposed
to use hugepages. The dumb way of implementing this would be to split
the mapping - one section at the beginning with 256 4-KiB pages, the
next one using 3 2-MiB pages, and the last section with 256 4-KiB
pages again (or some such), effectively equalling 8 MiB. I don't even
know if Linux supports variable-page-size mappings, and of course
we're still carrying 512 4-KiB pages with us that could easily have
been one 2-MiB page, which is why I call it the dumb way.
4. Like 3), but a wee bit smarter: introduce another system call that
works like madvise(MADV_HUGEPAGE), but have it return the address of a
properly aligned mapping, thus giving userspace 4 genuine 2-MiB pages.
Just like 3) it wouldn't be transparent, but at least it's only 4
context switches and we don't end up with half-baked hugepages.
However, this approach would effectively just be 1), only more
complicated and less transparent.
tl;dr:
1. Provide mmap with some sort of flag (which would be redundant IMHO)
in order to churn out properly aligned pages (not transparent, but the
current MAP_HUGETLB flag isn't either).
2. Based on the THP enabling status, always churn out properly aligned
pages, and simply fall back to smaller pages if hugepages can't be
allocated (truly transparent).
3. Map in memory, then tell madvise to make as many hugepages out of it
as possible while still keeping the initial mapping (not transparent,
and not sure Linux can actually do that).
4. Introduce a new system call (not transparent from the get-go) to give
out properly aligned pages, or make them properly aligned while the
mapping is transformed from not-properly-aligned to properly-aligned.
Your call.