Re: [PATCH v3 0/6] add mTHP support for anonymous shmem

From: Daniel Gomez
Date: Tue Jun 04 2024 - 05:29:44 EST


On Fri, May 31, 2024 at 04:43:32PM +0200, David Hildenbrand wrote:
> Hi Daniel,
>
> > > Let me summarize the takeaway from the bi-weekly MM meeting as I understood
> > > it, that includes Hugh's feedback on per-block tracking vs. mTHP:
> >
> > Thanks David for the summary. Please, find below some follow up questions.
> >
> > I want to understand if zeropage scanning overhead is preferred over per-block
> > tracking complexity or if we still need to verify this.
> >
> > >
> > > (1) Per-block tracking
> > >
> > > Per-block tracking is currently considered unwarranted complexity in
> > > shmem.c. We should try to get it done without that. For any test cases that
> > > fail, we should consider if they are actually valid for shmem.
> >
> > I agree it was unwarranted complexity but only if this is just to fix lseek() as
> > we can simply make the test pass by checking if holes are reported as data. That
> > would be the minimum required for lseek() to be compliant with the syscall.
> >
> > How can we use per-block tracking for reclaiming memory and what changes would
> > be needed? Or is per-block really a non-viable option?
>
> The interesting thing is: with mTHP toggles it is opt-in -- like for
> PMD-sized THP with shmem_enabled -- and we don't have to be that concerned
> about this problem right now.

Without respecting the size when allocating large folios, mTHP toggles would
over-allocate. My proposal earlier in this thread is to combine the two
to avoid that case. Otherwise, shouldn't we try to find a solution for the
reclaim path?
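
To put a number on the over-allocation concern, here is a minimal user-space
sketch. It is not kernel code: overalloc_pages() is a hypothetical helper, and
it assumes the file is backed entirely by folios of one fixed order, ignoring
i_size.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel function): number of pages allocated
 * beyond size_pages when every folio has the given order and the allocation
 * does not respect the file size.
 */
static unsigned long overalloc_pages(unsigned long size_pages,
				     unsigned int order)
{
	unsigned long nr = 1UL << order;	/* pages per folio */
	/* Round the file size up to a whole number of folios. */
	unsigned long allocated = (size_pages + nr - 1) & ~(nr - 1);

	return allocated - size_pages;
}
```

For a 5-page file and only an order-4 toggle enabled, this gives 11 pages that
sit past EOF and can only be recovered by splitting or reclaim.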

>
> >
> > Clearly, if per-block is a viable option, the shmem_fault() bug would need to be
> > fixed first. Any ideas on how to make it reproducible?
> >
> > The alternatives discussed were sub-page refcounting and zeropage scanning.
>
> Yeah, I don't think sub-page refcounting is a feasible (and certainly not
> desired ;) ) option in the folio world.
>
> > The first one is not possible (IIUC) because of a refactor years back that
> > simplified the code and also requires extra complexity. The second option would
> > require additional overhead as we would involve scanning.
>
> We'll likely need something similar (scanning, tracking?) for anonymous
> memory as well. There was a proposal for a THP shrinker some time ago, that
> would solve part of the problem.

It's good to know we have the same problem in different places. I'm more
inclined to keep the information rather than add extra overhead, unless
the complexity is really overwhelming. Considering the concerns here, I'm not
sure how hard we should try merging with iomap as per Ritesh's suggestion.

Regarding the THP shrinker, could you please confirm whether it is the
following thread?

https://lore.kernel.org/all/cover.1667454613.git.alexlzhu@xxxxxx/

>
> For example, for shmem you could simply flag folios that failed splitting
> during fallocate() as reclaim candidates and try to reclaim memory later. So
> you don't have to scan arbitrary folios (which might also be desired,
> though).

Thanks for the suggestion. I'll look into this.

>
> >
> > >
> > > To optimize FALLOC_FL_PUNCH_HOLE for the cases where splitting+freeing
> > > is not possible at fallocate() time, detecting zeropages later and
> > > retrying to split+free might be an option, without per-block tracking.
> >
> > >
> > > (2) mTHP controls
> > >
> > > As a default, we should not be using large folios / mTHP for any shmem, just
> > > like we did with THP via shmem_enabled. This is what this series currently
> > > does, and is part of the whole mTHP user-space interface design.
> >
> > That was clear for me too. But what is the reason we want to boot in 'safe
> > mode'? What are the implications of not respecting that?
>
> [...]
>
> >
> > As I understood from the call, mTHP with sysctl knobs is preferred over
> > optimistic falloc/write allocation? But it is still unclear to me why the former
> > is preferred.
>
> I think Hugh's point was that this should be an opt-in feature, just like
> PMD-sized THP started out, and still is, an opt-in feature.

I'd be keen to understand the use case for this, and even for the current THP
controls we have in tmpfs. I guess these are just scenarios with no swap
involved? Are these use cases the same for both tmpfs and anonymous shmem?

>
> Problematic interaction with khugepaged (that will be fixed) is one thing,
> interaction with memory reclaim (without any kind of memory reclaim
> mechanisms in place) might be another one. Controlling and tuning for
> specific folio sizes might be another one Hugh raised. [summarizing what I
> recall from the discussion, there might be more].
>
> >
> > Are large folios a non-viable option?
>
> I think you mean "allocating large folios without user space control".

Yes.

>
> Because mTHP gives user space full control, to the point where you can
> enable all sizes and obtain the same result.

Agree.

>
> >
> > >
> > > Also, we should properly fall back within the configured sizes, and not jump
> > > "over" configured sizes. Unless there is a good reason.
> > >
> > > (3) khugepaged
> > >
> > > khugepaged needs to handle larger folios properly as well. Until fixed,
> > > using smaller THP sizes as fallback might prohibit collapsing a PMD-sized
> > > THP later. But really, khugepaged needs to be fixed to handle that.
> > >
> > > (4) force/disable
> > >
> > > These settings are rather testing artifacts from the old ages. We should not
> > > add them to the per-size toggles. We might "inherit" it from the global one,
> > > though.
> > >
> > > "within_size" might have value, and especially for consistency, we should
> > > have them per size.
> > >
> > >
> > >
> > > So, this series only tackles anonymous shmem, which is a good starting
> > > point. Ideally, we'd get support for other shmem (especially during fault
> > > time) soon afterwards, because we won't be adding separate toggles for that
> > > from the interface POV, and having inconsistent behavior between kernel
> > > versions would be a bit unfortunate.
> > >
> > >
> > > @Baolin, this series likely does not consider (4) yet. And I suggest we have
> > > to take a lot of the "anonymous thp" terminology out of this series,
> > > especially when it comes to documentation.
> > >
> > > @Daniel, Pankaj, what are your plans regarding that? It would be great if we
> > > could get an understanding on the next steps on !anon shmem.
> >
> > I realize I've raised so many questions, but it's essential for us to grasp the
> > mm concerns and usage scenarios. This understanding will provide clarity on the
> > direction regarding folios for !anon shmem.
>
> If I understood correctly, Hugh had strong feelings against not respecting
> mTHP toggles for shmem. Without per-block tracking, I agree with that.

My understanding was the same. But I have this follow-up question: should
we respect mTHP toggles without considering mapping constraints (size and
index)? Or perhaps we should use within_size where we can fit this intermediate
approach, as long as mTHP granularity is respected?
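
To make the question concrete, below is a rough user-space sketch of what I
mean by respecting the toggles and the mapping constraints at the same time.
It is not taken from the series: pick_order(), the enabled_orders bitmask and
SKETCH_MAX_ORDER are all made up for illustration, and sizes are in pages.

```c
#include <assert.h>

/* Assumption: order 9 is the PMD order (4K base pages, 2M PMD). */
#define SKETCH_MAX_ORDER 9

/*
 * Hypothetical helper (not a kernel function): return the highest enabled
 * order whose naturally aligned range around 'index' fits entirely below
 * 'size_pages'. Bit N of 'enabled_orders' stands for the order-N toggle.
 * Returns 0 (a single page) if no enabled large order fits.
 */
static int pick_order(unsigned long enabled_orders, unsigned long index,
		      unsigned long size_pages)
{
	int order;

	for (order = SKETCH_MAX_ORDER; order > 0; order--) {
		unsigned long nr, start;

		if (!(enabled_orders & (1UL << order)))
			continue;	/* never jump to a size that is not enabled */

		nr = 1UL << order;		/* pages per folio */
		start = index & ~(nr - 1);	/* align the index down */
		if (start + nr <= size_pages)	/* within_size-like check */
			return order;
	}
	return 0;
}
```

With orders 9, 4 and 2 enabled and a 100-page file, index 0 would get order 4,
index 96 would get order 2, and anything that cannot fit falls back to a
single page. Only configured sizes are ever used, and nothing is allocated
past EOF.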

Daniel

>
> --
> Cheers,
>
> David / dhildenb
>