Re: [PATCH v1 0/5] mm, kpageflags: support folio and fix output for compound pages

From: Naoya Horiguchi
Date: Wed Oct 18 2023 - 01:25:27 EST


On Mon, Oct 16, 2023 at 12:36:22PM +0100, Ryan Roberts wrote:
> On 16/10/2023 11:13, David Hildenbrand wrote:
> >>>>> It does sound inconsistent. What exactly do you want to tell user space with
> >>>>> the new flag?
> >>>>
> >>>> The current most problematic behavior is to report folio as thp (order-2
> >>>> pagecache page is definitely a folio but not a thp), and this is what the
> >>>> new flag is intended to tell.
> >>>
> >>> We are currently considering calling these sub-PMD sized THPs "small-sized
> >>> THP". [1] Arguably, we're starting with the anon part where we won't get
> >>> around exposing them to the user in sysfs.
> >>>
> >>> So I wouldn't immediately say that these things are not THPs. They are not
> >>> PMD-sized THP. A slab/hugetlb is certainly not a thp but a folio. Whereby
> >>> slabs can also be order-0 folios, but hugetlb can't.
> >>
> >> I think this is a mistake.  Users expect THPs to be PMD sized.  We already
> >> have the term "large folio" in use for file-backed memory; why do we
> >> need to invent a new term for anon large folios?
> >
> > I changed my opinion two times, but I stabilized at "these are just huge pages
> > of different size" when it comes to user-visible features.
> >
> > Handling/calling them folios internally -- especially to abstract the page vs.
> > compound page and how we manage/handle the metadata -- is a reasonable thing to
> > do, because that's what we decided to pass around.
> >
> >
> > For future reference, here is a writeup about my findings and the reason for my
> > opinion:
> >
> >
> > (1) OS-independent concept
> >
> > Ignoring how the OS manages metadata (e.g., "struct page", "struct folio",
> > compound head/tail, memdesc, ...), the common term to describe a "the smallest
> > fixed-length contiguous block of physical memory into which memory pages are
> > mapped by the operating system.["[1] is a page frame -- people usually simplify
> > by dropping the "frame" part, so do I.
> >
> > Larger pages (which we call "huge pages", FreeBSD "superpages", Windows "large
> > pages") can come in different sizes and were traditionally based on architecture
> > support, whereby architectures can support multiple ones [1]; I think what we
> > see is that the OS might use intermediate sizes to manage memory more
> > efficiently, abstracting/evolving that concept from the actual hardware page
> > table mapping granularity.
> >
> > But the foundation is that we are dealing with "blocks of physical memory" in a
> > unit that is larger than the smallest page sizes. Larger pages.
> >
> > [the comment about SGI IRIX on [1] is an interesting read; so are "scattered
> > superpages"[3]]
> >
> > Users learned the difference between a "page" and a "huge page". I'm confident
> > that they can learn the difference between a "traditional huge page" and a
> > "small-sized huge page", just like they did with hugetlb (below).
> >
> > We just have to be careful with memory statistics and to default to the
> > traditional huge pages for now. Slowly, the term "THP" will become more generic.
> > Apart from that, I fail to see the big source of confusion.
> >
> > Note: FreeBSD currently similarly calls these things on arm64 "medium-sized
> > superpages", and did not invent new terms for that so far [2].
> >
> >
> > (2) hugetlb
> >
> > Traditional huge pages started out to be PMD-sized. Before 2008, we only
> > supported a single huge page size. Ever since, we added support for sizes larger
> > (gigantic) and smaller than that (cont-pte / cont-pmd).
> >
> > So (a) users did not panic because we also supported huge pages that were not
> > PMD-sized; (b) we managed to integrate it into the existing environment,
> > defaulting to the old PMD-sized huge pages towards the user but still providing
> > configuration knobs and (c) it is natural today to have multiple huge page sizes
> > supported in hugetlb.
> >
> > Nowadays, when somebody says that they are using hugetlb huge pages, the first
> > question frequently is "which huge page size?". The same will happen with
> > transparent huge pages I believe.
> >
> >
> > (3) THP preparation for multiple sizes
> >
> > With
> >     /sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> > added in 2016, we already provided a way for users to query the PMD size for
> > THP, implying that there might be multiple sizes in the future.
> >
> > Therefore, in commit 49920d28781d, Hugh already envisioned " some transparent
> > support for pud and pgd pages" and ended up calling it "_pmd_size". Turns out,
> > we want smaller THPs first, not larger ones.
> >
> >
> > (4) Metadata management
> >
> > How the OS manages metadata for its memory -- and how it calls the involved
> > datastructures -- is IMHO an implementation detail (an important one regarding
> > performance, robustness and metadata overhead as we learned, though ;) ).
> >
> > We were able to introduce folios without user-visible changes. We should be able
> > to implement memdesc (or memory type hierarchies) without user-visible changes
> > -- except for some interfaces that provide access to bare "struct page"
> > information (classifies as debugging interfaces IMHO).
> >
> >
> > Last but not least, we ended up consistently calling these "larger than a page"
> > things that we map into user space "(transparent) huge page" towards the user in
> > toggles, stats and documentation. Fortunately we didn't use the term "compound
> > page" back then; it would have been a mistake.
> >
> >
> > Regarding the pagecache, we managed to not expose any toggles towards the user,
> > because memory waste can be better controlled. So the term "folio" does not pop
> > up as a toggle in /sys and /proc.
> >
> >     t14s: ~  $ find /sys -name "*folio*" 2> /dev/null
> >     t14s: ~  $ find /proc -name "*folio*" 2> /dev/null
> >
> > Once we want to remove the (sub)page mapcount, we'll likely have to remove
> > _nr_pages_mapped. To make some workloads that are sensitive to memory
> > consumption [4] play along when not accounting only the actually mapped parts,
> > we might have to introduce other ways to control that, when
> > "/sys/kernel/debug/fault_around_bytes" no longer does the trick. I'm hoping we
> > can still find ways to avoid exposing any toggles for that; we'll see.
> >
> >
> > [1] https://en.wikipedia.org/wiki/Page_(computer_memory)
> > [2] https://www.freebsd.org/status/report-2022-04-2022-06/superpages/
> > [3] https://ieeexplore.ieee.org/document/6657040/similar#similar
> > [4] https://www.suse.com/support/kb/doc/?id=000019017
>
> +1 for David's reasoning.
>
> FWIW, the way I see it, everything is a folio; a folio is an implementation
> detail that neatly abstracts a physically contiguous, power-of-2 number of pages
> (including the single page case). So I'm not sure how useful it is to add the
> proposed KPF_FOLIO flag. The only real thing I can imagine user space using it
> for would be to tell if some extent of virtual memory is physically contiguous,
> and you can already do that from the PFN.
>
> Bigger picture interface-wise, I think it is simpler and more understandable to
> the user to extend an existing concept (THP) rather than invent a new one
> (folios) that substantially overlaps with the existing (PMD-sized) THP concept.
>
> That said, if you have plans in the folio roadmap that I'm not aware of, then
> perhaps those would change my mind. There is a thread here [1] where we are
> discussing the best way to expose "small-sized THP" (anon large folios) to user
> space - Metthew if you you stong feelings, please do reply!
>
> [1]
> https://lore.kernel.org/linux-mm/6d89fdc9-ef55-d44e-bf12-fafff318aef8@xxxxxxxxxx/
>
> Thanks,
> Ryan
>
>
> >
> >
> >>
> >>> Looking at other interfaces, we do expose:
> >>>
> >>> include/uapi/linux/kernel-page-flags.h:#define KPF_COMPOUND_HEAD        15
> >>> include/uapi/linux/kernel-page-flags.h:#define KPF_COMPOUND_TAIL        16
> >>>
> >>> So maybe we should just continue talking about compound pages or do we have
> >>> to use both terms here in this interface?
> >>
> >> I don;t know how easy it's going to be to distinguish between a head
> >> and tail page in the Glorious Future once pages and folios are separated.
> >
> > Probably a page-based interface would be the wrong interface for that;
> > fortunately, this interface has a "debugging" smell to it, so we might be able
> > to replace it.

This interface exposes per-pfn (not per-page) data records, specifying pfn by
file offset. It does not care about distinction between head and tail.
So I don't think that we can avoid referring to tail pages even after page-to-folio
conversion is complete.

But I agree that this interface is for debugging or testing. To clarify
this, we might consider relocating this interface to a more suitable location
within debugfs, making it effectively invisible to non-debugging processes.
And maybe this could be the case also for other similar interfaces /proc/kpage*.
So all these files can be handled together to address this problem.

Thanks,
Naoya Horiguchi