Re: [RFC PATCH 0/8] Reimplement huge pages without hugepd on powerpc 8xx

From: Christophe Leroy
Date: Wed May 22 2024 - 05:08:51 EST




Le 17/05/2024 à 16:27, Oscar Salvador a écrit :
> On Mon, Mar 25, 2024 at 03:55:53PM +0100, Christophe Leroy wrote:
>> This series reimplements hugepages with hugepd on powerpc 8xx.
>>
>> Unlike most architectures, powerpc 8xx HW requires a two-level
>> pagetable topology for all page sizes. So a leaf PMD-contig approach
>> is not feasible as such.
>>
>> Possible sizes are 4k, 16k, 512k and 8M.
>>
>> First level (PGD/PMD) covers 4M per entry. For 8M pages, two PMD entries
>> must point to a single entry level-2 page table. Until now that was
>> done using hugepd. This series changes it to use standard page tables
>> where the entry is replicated 1024 times on each of the two pagetables
>> refered by the two associated PMD entries for that 8M page.
>>
>> At the moment it has to look into each helper to know if the
>> hugepage ptep is a PTE or a PMD in order to know it is a 8M page or
>> a lower size. I hope this can me handled by core-mm in the future.
>>
>> There are probably several ways to implement stuff, so feedback is
>> very welcome.
>
>
> Hi Christophe,
>
> I have been looking into this because I am interested in the ongoing work of
> the hugetlb unification, but my knowledge of ppc pagetables tends to zero,
> So be prepared for some stupid questions.
>
> First, let me have a clear picture of the current situation:
>
> power8xx has 4KB, 16KB, 512KB, and 8MB page sizes, and operate on a 2Level
> pagetables. Wiki [1] mentions PGD + PTE, here you seem to be referring them
> as PMD + PTE though.
>
> And we can have 1024 PGDs, each of one covers 4MB, so we can cover a total of
> of 4GB.
>
> Looking at the page table diagram for power8xx, it seems power8xx has also some
> sort of CONTIG_PTE? (same as arm64 does) So we can have contig_ptes representing
> bigger page sizes?
> I also guess that although power8xx supports all these different sizes, only one
> of them can be active at any time, right?

Don't know what you mean by "active at any time". In a running system
with PAGE_SIZE defined as 4k, you can at any time have some hugepages of
size 16K, some 512K and some 8M.

>
> It also seems that this whole hugepd thing is only used when we are using 8MB
> PAGE_SIZE, right?

Today yes. In the past it was also used for 512K pages, until commit
b250c8c08c79 ("powerpc/8xx: Manage 512k huge pages as standard pages.")

> And when that is active, we do have 2 PGDs(4MB each) pointing to the same 8MB
> hugepd.
> E.g:
> ____________
> [PGD#0] ------------> | |
> | 8MB hugepd |
> [PGD#1] ------------> |____________|
>
> What you want to do with this work is to get rid of that hugepd abstraction
> because it is something power8xx/hugetlb specific and cannot be represented
> with our "normal" page table layout (PGD,P4D,PUD,PMD,PTE).

It is more than 8xx, also used on e500 and book3s/64 but for sure that's
specific to powerpc and it would help to get rid of it completely.

> I did not check, but I guess we cannot walk the hugepd thing with a normal
> page table walker, or can we? (how special is a hugepd? can you describe its
> internal layout?)

depends on what you mean by "normal". For instance
walk_page_range_novma() handles hugepd.

>
> So, what you proprose, is something like the following?
>
> [PGD#X] -----------> [PTE#0]
> -----------> [PTE..#1023]
> [PGD#Y] -----------> [PTE#0]
> -----------> [PTE..#1023]
>
> so a 8MB hugepage will be covered by PGD#X and PGD#Y using contiguos PTEs.
>
> The diagram at [1] for 8xx 16K seems a bit misleading to me (or maybe it is
> just me). They say that a Level2 table (aka PTE) covers 4KB chunks regardless
> of the pagesize, but either I read that wrong or..else.

What it means is that when PAGE_SIZE is 16k, the pte_t is a table of 4
long, see
https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/include/asm/pgtable-types.h#L11

> Because on 16K page size, they show that each pte covers for 16KB memory chunk.
> But that would mean 16KB << 10 = 16M each PGD, which is not really that, so what
> is the deal there? Or it is just that we always use 4KB PTEs, and use contiguous
> PTEs for bigger sizes?

In a way yes, we cheat the HW by defining a PTE as a table of 4 u32
values to behave like a cont-PTE.

>
> Now, it seems that power8xx has no-p4d, no-pud and no-pmd, right?
>
> Peter mentioned that we should have something like:
>
> X X
> [PGD] - [P4D] - [PUD] - [PMD] - [PTE]
>
> where the PMD and PTE would be the ones we use for representing the 2Level
> ptage table, and PGD,P4D and PUD would just be dummies.
>
> But, is not the convention to at least have PGD-PTE always, and have anything
> in between as optional? E.g:
>
> X ? ? ? X
> [PGD] - [P4D] - [PUD] - [PMD] - [PTE]
>

That's also my understanding hence the discussion with Peter. The fog is
that there is a mix between PGD and PMD, for instance to populate a PGD
entry you use pmd_populate(). To clear a PGD entry you use pmd_clear,
pgd_clear only exists when you have P4Ds.

> I mean, are page table walkers ready to deal with non-PGD? I thought they were
> not.

Yes that's how it is today but Peter was thinking for the future.

>
> Also, in patch#1, you mentioned:
>
> "At the time being, for 512k pages the flag is kept in the PTE and inserted in
> the PMD entry at TLB miss exception".
>
> Can you point out where to look for that in the code?

https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/kernel/head_8xx.S#L219

and

https://elixir.bootlin.com/linux/v6.9/source/arch/powerpc/kernel/head_8xx.S#L277

>
> Also, what exactly is the "sz" parameter that gets passed down to pmd_populate_size()?
> Is the size of the current mapping we are establishing?
> I see that you only make a distinction when the mapping size is 8MB.
> So the PMD will have _PMD_PAGE_8MB, so it works that all 1024 PTEs below are contiguous
> representing a 4MB chunk?
>
> I will start looking deeper into this series on Monday, but just wanted to have a better
> insight of what is going on.
>
> PD: I think we could make the changelog of the coverletter a bit fatter and cover some
> details in there, e.g: layout of page-tables for different page sizes, layout of hugepd,
> expected layout after the work, etc.
> I think it would help in reviewing this series.

I prefer keeping details in individual patches and keep the cover letter
for a high level summary and only administrative information because
information in cover letter are usually lost on the long term.

>
> Thanks!
>
> [1] https://github.com/linuxppc/wiki/wiki/Huge-pages
>
>