Re: [PATCH 5/5] hugetlbfs: Limit wait time when trying to share huge PMD

From: Mike Kravetz
Date: Wed Sep 11 2019 - 13:03:55 EST


On 9/11/19 8:44 AM, Waiman Long wrote:
> On 9/11/19 4:14 PM, Matthew Wilcox wrote:
>> On Wed, Sep 11, 2019 at 04:05:37PM +0100, Waiman Long wrote:
>>> When allocating a large amount of static hugepages (~500-1500GB) on a
>>> system with large number of CPUs (4, 8 or even 16 sockets), performance
>>> degradation (random multi-second delays) was observed when thousands
>>> of processes are trying to fault in the data into the huge pages. The
>>> likelihood of the delay increases with the number of sockets and hence
>>> the CPUs a system has. This only happens in the initial setup phase
>>> and will be gone after all the necessary data are faulted in.
>> Can;t the application just specify MAP_POPULATE?
>
> Originally, I thought that this happened in the startup phase when the
> pages were faulted in. The problem persists after steady state had been
> reached though. Every time you have a new user process created, it will
> have its own page table.

This is still at fault time. Although, for the particular application it
may be after the 'startup phase'.

> It is the sharing of the of huge page shared
> memory that is causing problem. Of course, it depends on how the
> application is written.

It may be the case that some applications would find the delays acceptable
for the benefit of shared pmds once they reach steady state. As you say, of
course this depends on how the application is written.

I know that Oracle DB would not like it if PMD sharing is disabled for them.
Based on what I know of their model, all processes which share PMDs perform
faults (write or read) during the startup phase. This is in environments as
big or bigger than you describe above. I have never looked at/for delays in
these environments around pmd sharing (page faults), but that does not mean
they do not exist. I will try to get the DB group to give me access to one
of their large environments for analysis.

We may want to consider making the timeout value and disable threshold user
configurable.
--
Mike Kravetz