RE: [External] RE: [PATCH v4 00/21] Free some vmemmap pages of hugetlb page
From: Song Bao Hua (Barry Song)
Date: Tue Nov 17 2020 - 06:08:30 EST
> -----Original Message-----
> From: Muchun Song [mailto:songmuchun@xxxxxxxxxxxxx]
> Sent: Tuesday, November 17, 2020 11:50 PM
> To: Song Bao Hua (Barry Song) <song.bao.hua@xxxxxxxxxxxxx>
> Cc: corbet@xxxxxxx; mike.kravetz@xxxxxxxxxx; tglx@xxxxxxxxxxxxx;
> mingo@xxxxxxxxxx; bp@xxxxxxxxx; x86@xxxxxxxxxx; hpa@xxxxxxxxx;
> dave.hansen@xxxxxxxxxxxxxxx; luto@xxxxxxxxxx; peterz@xxxxxxxxxxxxx;
> viro@xxxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; paulmck@xxxxxxxxxx;
> mchehab+huawei@xxxxxxxxxx; pawan.kumar.gupta@xxxxxxxxxxxxxxx;
> rdunlap@xxxxxxxxxxxxx; oneukum@xxxxxxxx; anshuman.khandual@xxxxxxx;
> jroedel@xxxxxxx; almasrymina@xxxxxxxxxx; rientjes@xxxxxxxxxx;
> willy@xxxxxxxxxxxxx; osalvador@xxxxxxx; mhocko@xxxxxxxx;
> duanxiongchun@xxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> linux-fsdevel@xxxxxxxxxxxxxxx
> Subject: Re: [External] RE: [PATCH v4 00/21] Free some vmemmap pages of
> hugetlb page
>
> On Tue, Nov 17, 2020 at 6:16 PM Song Bao Hua (Barry Song)
> <song.bao.hua@xxxxxxxxxxxxx> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: owner-linux-mm@xxxxxxxxx [mailto:owner-linux-mm@xxxxxxxxx] On
> > > Behalf Of Muchun Song
> > > Sent: Saturday, November 14, 2020 12:00 AM
> > > To: corbet@xxxxxxx; mike.kravetz@xxxxxxxxxx; tglx@xxxxxxxxxxxxx;
> > > mingo@xxxxxxxxxx; bp@xxxxxxxxx; x86@xxxxxxxxxx; hpa@xxxxxxxxx;
> > > dave.hansen@xxxxxxxxxxxxxxx; luto@xxxxxxxxxx; peterz@xxxxxxxxxxxxx;
> > > viro@xxxxxxxxxxxxxxxxxx; akpm@xxxxxxxxxxxxxxxxxxxx; paulmck@xxxxxxxxxx;
> > > mchehab+huawei@xxxxxxxxxx; pawan.kumar.gupta@xxxxxxxxxxxxxxx;
> > > > rdunlap@xxxxxxxxxxxxx; oneukum@xxxxxxxx; anshuman.khandual@xxxxxxx;
> > > jroedel@xxxxxxx; almasrymina@xxxxxxxxxx; rientjes@xxxxxxxxxx;
> > > willy@xxxxxxxxxxxxx; osalvador@xxxxxxx; mhocko@xxxxxxxx
> > > Cc: duanxiongchun@xxxxxxxxxxxxx; linux-doc@xxxxxxxxxxxxxxx;
> > > linux-kernel@xxxxxxxxxxxxxxx; linux-mm@xxxxxxxxx;
> > > > linux-fsdevel@xxxxxxxxxxxxxxx; Muchun Song <songmuchun@xxxxxxxxxxxxx>
> > > Subject: [PATCH v4 00/21] Free some vmemmap pages of hugetlb page
> > >
> > > Hi all,
> > >
> > > This patch series will free some vmemmap pages(struct page structures)
> > > associated with each hugetlbpage when preallocated to save memory.
> > >
> > > > Nowadays we track the status of physical page frames using struct page
> > > > structures arranged in one or more arrays, and there is a one-to-one
> > > > mapping between each physical page frame and its corresponding struct
> > > > page structure.
> > >
> > > > The HugeTLB support is built on top of multiple page size support that
> > > > is provided by most modern architectures. For example, x86 CPUs normally
> > > > support 4K and 2M (1G if architecturally supported) page sizes. Every
> > > > HugeTLB has more than one struct page structure. The 2M HugeTLB has 512
> > > > struct page structures and the 1G HugeTLB has 262144 (which occupy 4096
> > > > pages). But the core of HugeTLB only uses the first 4 struct page
> > > > structures to store metadata associated with each HugeTLB (the use of the
> > > > first 4 struct page structures comes from HUGETLB_CGROUP_MIN_ORDER). For
> > > > the rest of the struct page structures, usually only the compound_head
> > > > field is read, and it has the same value in all of them. If we can free
> > > > some of this struct page memory to the buddy system, we can save a lot of
> > > > memory.
> > >
> > > > When the system boots up, every 2M HugeTLB has 512 struct page structures,
> > > > whose total size is 8 pages (sizeof(struct page) * 512 / PAGE_SIZE).
> > >
> > > >    hugetlbpage                  struct pages(8 pages)      page frame(8 pages)
> > > >   +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> > > >   |           |                     |     0     | -------------> |     0     |
> > > >   |           |                     |     1     | -------------> |     1     |
> > > >   |           |                     |     2     | -------------> |     2     |
> > > >   |           |                     |     3     | -------------> |     3     |
> > > >   |           |                     |     4     | -------------> |     4     |
> > > >   |     2M    |                     |     5     | -------------> |     5     |
> > > >   |           |                     |     6     | -------------> |     6     |
> > > >   |           |                     |     7     | -------------> |     7     |
> > > >   |           |                     +-----------+                +-----------+
> > > >   |           |
> > > >   |           |
> > > >   +-----------+
> > >
> > >
> > > > When a hugetlbpage is preallocated, we can change the mapping from above
> > > > to below.
> > >
> > > >    hugetlbpage                  struct pages(8 pages)      page frame(8 pages)
> > > >   +-----------+ ---virt_to_page---> +-----------+   mapping to   +-----------+
> > > >   |           |                     |     0     | -------------> |     0     |
> > > >   |           |                     |     1     | -------------> |     1     |
> > > >   |           |                     |     2     | -------------> +-----------+
> > > >   |           |                     |     3     | -----------------^ ^ ^ ^ ^
> > > >   |           |                     |     4     | -------------------+ | | |
> > > >   |     2M    |                     |     5     | ---------------------+ | |
> > > >   |           |                     |     6     | -----------------------+ |
> > > >   |           |                     |     7     | -------------------------+
> > > >   |           |                     +-----------+
> > > >   |           |
> > > >   |           |
> > > >   +-----------+
> > >
> > > > For tail pages, the value of compound_head is the same. So we can reuse
> > > > the first page of tail page structs. We map the virtual addresses of the
> > > > remaining 6 pages of tail page structs to the first page of tail page
> > > > structs, and then free these 6 pages. Therefore, we need to reserve at
> > > > least 2 pages as vmemmap areas.
> > >
> > > When a hugetlbpage is freed to the buddy system, we should allocate six
> > > pages for vmemmap pages and restore the previous mapping relationship.
> > >
> > > > If we use the 1G hugetlbpage, we can save 4086 pages (there are 4096 pages
> > > > for struct page structures; we reserve 2 pages for vmemmap and 8 pages for
> > > > page tables, so we can save 4096 - 2 - 8 = 4086 pages). This is a very
> > > > substantial gain. On our server, we run some SPDK/QEMU applications which
> > > > will use 1024GB of hugetlbpage. With this feature enabled, we can save
> > > > ~16GB (1G hugepage) / ~11GB (2MB hugepage)
> >
> > Hi Muchun,
> >
> > Do we really save 11GB for 2MB hugepage?
> > > How much do we save if we only get one 2MB hugetlb from one 128MB
> > > mem_section? It seems we need to get at least one page for the PTEs since
> > > we are splitting the PMD of vmemmap into PTEs?
>
> There are 524288 (1024GB/2MB) 2MB HugeTLB pages. We can save 6 pages for
> each 2MB HugeTLB page, so we can save 3145728 pages. But we need to split
> the PMD page table for every 128MB mem_section, and every section needs one
> page as a PTE page table. So we need 8192 (1024GB/128MB) pages as PTE page
> tables. Finally, we can save 3137536 (3145728-8192) pages, which is 11.97GB.
The worst case I can see is this: if we get 100 hugetlb pages of 2MB each,
but every one of them comes from a different mem_section, we won't save at
the 11.97GB rate. Each hugetlb page then frees 6 vmemmap pages but costs one
PTE page, so we only save 5/8 of the vmemmap, i.e. 5/8 * 16GB = 10GB when
scaled to 1024GB.
Anyway, 11GB lies between 10GB and 11.97GB, so it sounds sensible :-)
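
For anyone who wants to re-check the two bounds, here is a rough sketch of
the arithmetic in Python. It assumes 4KB base pages and a 64-byte struct
page (typical for x86-64, not stated in the thread), plus the one PTE page
per split 128MB mem_section mentioned above. The variable names are made up
for illustration only.

```python
# Back-of-the-envelope check of the 2MB hugetlb vmemmap savings.
# Assumptions (not from the patch itself): 4KB base pages, 64-byte struct page.
PAGE_SIZE = 4096
STRUCT_PAGE_SIZE = 64
HUGEPAGE_SIZE = 2 << 20        # 2MB hugetlb page
SECTION_SIZE = 128 << 20       # 128MB mem_section
TOTAL_HUGETLB = 1024 << 30     # 1024GB of hugetlb pages
GIB = 1 << 30

# vmemmap pages backing one 2MB hugetlb page: 512 struct pages -> 8 pages
vmemmap_pages_per_hugepage = HUGEPAGE_SIZE // PAGE_SIZE * STRUCT_PAGE_SIZE // PAGE_SIZE
freed_per_hugepage = vmemmap_pages_per_hugepage - 2   # 2 pages stay mapped

# Best case: hugetlb pages fill every section, one PTE page per section.
nr_hugepages = TOTAL_HUGETLB // HUGEPAGE_SIZE
nr_sections = TOTAL_HUGETLB // SECTION_SIZE
best_pages = nr_hugepages * freed_per_hugepage - nr_sections
best_gib = best_pages * PAGE_SIZE / GIB

# Worst case: every hugetlb page sits in its own section, so each one
# pays for a whole PTE page. Net saving is 5 of its 8 vmemmap pages.
worst_ratio = (freed_per_hugepage - 1) / vmemmap_pages_per_hugepage
total_vmemmap_gib = (TOTAL_HUGETLB // PAGE_SIZE * STRUCT_PAGE_SIZE) / GIB
worst_gib = worst_ratio * total_vmemmap_gib

print(f"best case:  {best_pages} pages = {best_gib:.2f} GB")
print(f"worst case: {worst_ratio} of {total_vmemmap_gib:.0f} GB = {worst_gib:.0f} GB")
```

Note the worst-case figure is only the per-hugepage ratio scaled up to the
whole 16GB of vmemmap; with 1024GB actually populated by 2MB pages there
would be 64 hugetlb pages per section, which is the best case.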
Ideally, we would be able to free PageTail if we changed struct page in some
way. Then we would save much more for 2MB hugetlb, but that does not seem
easy.
Thanks
Barry