Re: [GIT PULL] Memory folios for v5.15
From: Johannes Weiner
Date: Thu Sep 09 2021 - 14:14:51 EST
On Thu, Sep 09, 2021 at 03:56:54PM +0200, Vlastimil Babka wrote:
> On 9/9/21 14:43, Christoph Hellwig wrote:
> > So what is the result here? Not having folios (with that or another
> > name) is really going to set back making progress on sane support for
> > huge pages. Both in the pagecache but also for other places like direct
> > I/O.
From my end, I have no objections to using the current shape of
Willy's data structure as a cache descriptor for the filesystem API:
	struct foo {
		/* private: don't document the anon union */
		union {
			struct {
		/* public: */
				unsigned long flags;
				struct list_head lru;
				struct address_space *mapping;
				pgoff_t index;
				void *private;
				atomic_t _mapcount;
				atomic_t _refcount;
	#ifdef CONFIG_MEMCG
				unsigned long memcg_data;
	#endif
		/* private: the union with struct page is transitional */
			};
			struct page page;
		};
	};
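As a rough illustration of why the transitional union is safe, here is a
userspace sketch that checks the overlay at compile time. The names
mock_page, cache_desc, DESC_MATCH and desc_from_page are made up for this
example, the field types are simplified stand-ins for the kernel types, and
the CONFIG_MEMCG member is left out. It is a sketch under those assumptions,
not the actual kernel definition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for struct page (simplified types). */
struct mock_page {
	unsigned long flags;
	void *lru_next, *lru_prev;	/* stands in for struct list_head lru */
	void *mapping;			/* stands in for struct address_space * */
	unsigned long index;		/* stands in for pgoff_t */
	void *private;
	int mapcount;			/* stands in for atomic_t _mapcount */
	int refcount;			/* stands in for atomic_t _refcount */
};

/* Hypothetical cache descriptor overlaying the page, as in the struct above. */
struct cache_desc {
	union {
		struct {
			unsigned long flags;
			void *lru_next, *lru_prev;
			void *mapping;
			unsigned long index;
			void *private;
			int mapcount;
			int refcount;
		};
		struct mock_page page;	/* transitional overlay */
	};
};

/* Fail the build if a descriptor field drifts away from its page twin. */
#define DESC_MATCH(field)						\
	static_assert(offsetof(struct mock_page, field) ==		\
		      offsetof(struct cache_desc, field),		\
		      "layout mismatch: " #field)

DESC_MATCH(flags);
DESC_MATCH(mapping);
DESC_MATCH(index);
DESC_MATCH(private);
DESC_MATCH(refcount);
static_assert(sizeof(struct cache_desc) == sizeof(struct mock_page),
	      "descriptor must not grow beyond the page it overlays");

/* Valid only for a page that is embedded in a descriptor. */
static inline struct cache_desc *desc_from_page(struct mock_page *p)
{
	return (struct cache_desc *)p;
}
```

With the offsets pinned like this, writing through the descriptor's index
and reading it back through the embedded page member yields the same value,
which is what lets existing page-based code keep working during the
transition.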
I also have no general objection to a *separate* folio or pageset or
whatever data structure to address the compound page mess inside VM
code. With its own cost/benefit analysis. For whatever is left after
the filesystems have been sorted out.
My objection is simply to one shared abstraction for both. There is
ample evidence from years of hands-on production experience that
compound pages aren't the way toward scalable and maintainable larger
page sizes from the MM side. And it's anything but obvious or
self-evident that just because struct page worked for both roles that
the same is true for compound pages.
Willy says it'll work out, I say it won't. We don't have code to prove
this either way right now.
Why expose the filesystems to this gamble?
Nothing prevents us from putting a 'struct pageset pageset' or 'struct
folio folio' into a cache descriptor like above later on, right?
[ And IMO, the fact that filesystem people are currently exposed to,
and blocked on, mind-numbing internal MM discussions just further
strengthens the argument to disconnect the page cache frontend from
the memory allocation backend. The fs folks don't care - and really
shouldn't care - about any of this. I understand the frustration. ]
Can we go ahead with the cache descriptor for now, and keep the door
open on how they are backed from the MM side? We should be able to
answer this without going too deep into MM internals.
In the short term, this would unblock the fs people.
In the longer term this would allow the fs people to focus on fs
problems, and MM people to solve MM problems.
> Yeah, the silence doesn't seem actionable. If naming is the issue, I believe
> Matthew had also a branch where it was renamed to pageset. If it's the
> unclear future evolution wrt supporting subpages of large pages, should we
> just do nothing until somebody turns that hypothetical future into code and
> we see whether it works or not?
Folio or pageset works for compound pages, but implies unnecessary
implementation details for a variable-sized cache descriptor, IMO.
I don't love the name folio for compound pages, but I think it's
actually hazardous for the filesystem API.
To move forward with the filesystem bits, can we:
1. call it something - anything - that isn't tied to the page, or the
   nature of multiple pages? fsmem, fsblock, cachemem, cachent, I
   don't care too deeply and would rather have a less snappy name than
   a clever misleading one,
2. make things like folio_order(), folio_nr_pages(), folio_page(),
   page_folio() private API in mm/internal.h, to acknowledge that
   these are current implementation details, not promises on how the
   cache entry will forever be backed in the future?
3. remove references to physical contiguity, PAGE_SIZE, anonymous
   pages - and really anything else that nobody has explicitly asked
   for yet - from the kerneldoc; generally keep things specced to what
   we need now, and not create dependencies against speculative future
   ambitions that may or may not pan out,
4. separate and/or table the bits that are purely about compound pages
   inside MM code and not relevant for the fs interface - things like
   the workingset.c and swap.c conversions (page_folio() usage seems
   like a good indicator for where it permeated too deeply into MM
   core code which then needs to translate back up again)?
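To make point 2 concrete, here is a small single-file sketch of the split.
Everything in it is hypothetical: the cachent name is borrowed from the
list in point 1, the helpers cachent_size(), cachent_order() and
cachent_nr_pages() do not exist in the kernel, and the 4096-byte base page
is an assumption of the example. The first part stands in for what a
filesystem would see, the second for what would stay in mm/internal.h.

```c
#include <assert.h>
#include <stddef.h>

/* ---- public part: what a filesystem would include (hypothetical) ---- */

struct cachent;					/* opaque to filesystems */
size_t cachent_size(const struct cachent *ce);	/* size in bytes */

/* ---- private part: the analogue of mm/internal.h (hypothetical) ---- */

#define SKETCH_BASE_PAGE 4096UL			/* assumed base page size */
static_assert((SKETCH_BASE_PAGE & (SKETCH_BASE_PAGE - 1)) == 0,
	      "base page size must be a power of two");

struct cachent {
	unsigned int order;	/* current backing detail: log2 of base pages */
};

/* Implementation details; not a promise to filesystems. */
static inline unsigned int cachent_order(const struct cachent *ce)
{
	return ce->order;
}

static inline unsigned long cachent_nr_pages(const struct cachent *ce)
{
	return 1UL << cachent_order(ce);
}

/* The only size query exposed to filesystems; hides how it is backed. */
size_t cachent_size(const struct cachent *ce)
{
	return (size_t)cachent_nr_pages(ce) * SKETCH_BASE_PAGE;
}
```

A filesystem written against the public part only ever asks for a byte
size, so the order and page-count helpers can change or disappear without
touching fs code.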