Re: [RFC PATCH 0/5] hugetlb: Change huge pmd sharing

From: Mike Kravetz
Date: Thu Apr 07 2022 - 12:18:24 EST


On 4/7/22 03:08, David Hildenbrand wrote:
> On 06.04.22 22:48, Mike Kravetz wrote:
>> hugetlb fault scalability regressions have recently been reported [1].
>> This is not the first such report, as regressions were also noted when
>> commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
>> synchronization") was added [2] in v5.7. At that time, a proposal to
>> address the regression was suggested [3] but went nowhere.
<snip>
>> Please help with comments or suggestions. I would like to come up with
>> something that is performant and safe.
>
> May I challenge the existence of huge PMD sharing? TBH I am not
> convinced that the code complexity is worth the benefit.
>

That is a fair question.

Huge PMD sharing is not a documented or well-known feature. Most people would
not notice it going away. However, I suspect some people will.

> Let me know if I get something wrong:
>
> Let's assume a 4 TiB device and 2 MiB hugepage size. That's 2097152 huge
> pages. Each such PMD entry consumes 8 bytes. That's 16 MiB.
>
> Sure, with thousands of processes sharing that memory, the size of page
> tables required would increase with each and every process. But TBH,
> that's in no way different to other file systems where we're even
> dealing with PTE tables.

The numbers I am frequently quoted for a real use case are something like:
1TB shared mapping, 10,000 processes sharing the mapping
4K PMD page per 1GB of shared mapping
1024 PMD pages * 4K = 4M saved for each process sharing the mapping
9,999 * 4M ~= 39GB savings
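
As a rough check of the arithmetic above, here is a minimal sketch. It assumes
x86-64 with 2MB huge pages and 8-byte PMD entries (so one 4K PMD page maps
1GB), and it assumes the best case where every process shares every PMD page.
The function names are made up for illustration and are not kernel interfaces.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB
TIB = 1024 * GIB

HUGE_PAGE_SIZE = 2 * MIB  # assumption: x86-64 PMD-level huge page
PMD_ENTRY_SIZE = 8        # assumption: bytes per PMD entry


def pmd_bytes_per_process(mapping_size):
    """Bytes of PMD pages one process needs to map mapping_size with 2 MiB pages."""
    entries = mapping_size // HUGE_PAGE_SIZE
    return entries * PMD_ENTRY_SIZE


def sharing_savings(mapping_size, nr_processes):
    """Best-case savings: one shared set of PMD pages instead of one set per process."""
    return (nr_processes - 1) * pmd_bytes_per_process(mapping_size)


per_process = pmd_bytes_per_process(1 * TIB)
savings = sharing_savings(1 * TIB, 10_000)
print(f"PMD pages per process: {per_process // MIB} MiB")
print(f"best-case savings:     {savings / GIB:.1f} GiB")
```

The same helper gives 16 MiB per process for the 4 TiB example quoted earlier.
Real savings would likely be lower, since sharing only applies to shared
mappings that are suitably aligned and not every process touches every part of
the mapping.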

However, commit 39dde65c9940c, which introduced huge pmd sharing, states that
performance rather than memory savings was the primary objective.

"For hugetlb, the saving on page table memory is not the primary
objective (as hugetlb itself already cuts down page table overhead
significantly), instead, the purpose of using shared page table on hugetlb is
to allow faster TLB refill and smaller cache pollution upon TLB miss.

With PT sharing, pte entries are shared among hundreds of processes, the
cache consumption used by all the page table is smaller and in return,
application gets much higher cache hit ratio. One other effect is that
cache hit ratio with hardware page walker hitting on pte in cache will be
higher and this helps to reduce tlb miss latency. These two effects
contribute to higher application performance."

That 'makes sense', but I have never tried to measure any such performance
benefit. It is easier to calculate the memory savings.

>
> Which results in me wondering if
>
> a) We should simply use gigantic pages for such extreme use case. Allows
> for freeing up more memory via vmemmap either way.

The only problem with this is that many processors in use today have
only a limited number of TLB entries for gigantic pages.

> b) We should instead look into reclaiming reconstruct-able page table.
> It's hard to imagine that each and every process accesses each and
> every part of the gigantic file all of the time.
> c) We should instead establish a more generic page table sharing
> mechanism.

Yes. I think that is the direction taken by the mshare() proposal. If we have
a more generic approach, we can certainly start deprecating hugetlb pmd
sharing.

>
>
> Consequently, I'd be much more in favor of ripping it out :/ but that's
> just my personal opinion.
>

--
Mike Kravetz