Re: [PATCH v3 7/7] mm: switch deferred split shrinker to list_lru

From: Lorenzo Stoakes (Oracle)

Date: Wed Apr 01 2026 - 13:36:13 EST


On Mon, Mar 30, 2026 at 12:40:22PM -0400, Johannes Weiner wrote:
> Hey Lorenzo,
>
> Thanks for taking an in-depth look at this!
>
> Apologies for the late reply, I was traveling last week.
>
> On Tue, Mar 24, 2026 at 01:48:06PM +0000, Lorenzo Stoakes (Oracle) wrote:
> > On Wed, Mar 18, 2026 at 03:53:25PM -0400, Johannes Weiner wrote:
> > > The deferred split queue handles cgroups in a suboptimal fashion. The
> > > queue is per-NUMA node or per-cgroup, not the intersection. That means
> > > on a cgrouped system, a node-restricted allocation entering reclaim
> > > can end up splitting large pages on other nodes:
> > >
> > > alloc/unmap
> > > deferred_split_folio()
> > > list_add_tail(memcg->split_queue)
> > > set_shrinker_bit(memcg, node, deferred_shrinker_id)
> >
> > So here it's:
> >
> > __do_huge_pmd_anonymous_page() / do_huge_zero_wp_pmd()
> > -> map_anon_folio_pmd_pf()
> > -> map_anon_folio_pmd_nopf()
> > -> deferred_split_folio()
> > -> set_shrinker_bit()
> >
> > Yeah it makes sense to make the first bit succinct anyway :)
>
> Yep I was trying to keep the focus on those parts that interact with
> the other stacks listed below. Not biblically accurate
> deferred_split_folio() callsites ;)
>
> > > for_each_zone_zonelist_nodemask(restricted_nodes)
> > > mem_cgroup_iter()
> > > shrink_slab(node, memcg)
> > > shrink_slab_memcg(node, memcg)
> > > if test_shrinker_bit(memcg, node, deferred_shrinker_id)
> >
> > Hmm there's no such function, so this is kind of pseudocode, essentially?
> >
> > Wouldn't it be clearer to reference:
> >
> > shrink_slab_memcg() -> do_shrink_slab() -> shrinker->scan_objects ->
> > deferred_split_scan()?
> >
> > Though I get that it adds verbosity :)
> >
> > Sorry not (just) being pedantic here, also so I can understand myself :)
>
> Yes, it's pseudocode. Direct reclaim (shrink_zones) and kswapd
> (wake_all_kswapds) do the "for all nodes" loops. Nested within them
> are the loops that iterate the cgroup tree (shrink_node_memcgs,
> shrink_many, depending on LRU vs MGLRU). And then nested within
> *those* we have the shrink_slab() calls.
>
> Again, I tried to keep the boilerplate low and to the point. The most
> important thing is: 1) the shrinker bit gets set for each node, memcg
> pair that has objects, 2) shrinker visits all nodes x memcgs with it
> set 3) but the shrinker scans happen on per-memcg queue, not node-aware.
>
> Let me know if that works as is, or if you would like me to rephrase
> that. I would like to keep it as simple as possible, but no simpler
> than that ;)

Yeah that's fine, maybe just add a little note to say 'simplified for clarity'
or something so very-literal nutters like me don't get confused ;)
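
FWIW, spelling out my mental model after your explanation, still simplified
and with the LRU/MGLRU variants collapsed (so treat the exact callsites as
approximate):

	direct reclaim / kswapd            <- shrink_zones() / wake_all_kswapds()
	  for each node in the nodemask
	    for each memcg in the tree     <- shrink_node_memcgs() / shrink_many()
	      shrink_slab(node, memcg)
	        shrink_slab_memcg()        <- checks the per-node, per-memcg shrinker bit
	          do_shrink_slab()
	            deferred_split_scan()  <- but walks the per-memcg queue,
	                                      which is not node-aware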

>
> > > deferred_split_scan()
> > > walks memcg->split_queue
> >
> > Ok so overall there's a per-memcg memcg->split_queue, but we set a bit in
> > memcg->nodeinfo[nid]->shrinker_info->unit[blah]->map for it, and when we
> > enter shrink_slab_memcg(), we figure out the shrink_control from the
> > for_each_set_bit() across memcg->...->unit->map?
> >
> > > The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> > > has a single large page on the node of interest, all large pages owned
> > > by that memcg, including those on other nodes, will be split.
> >
> > So at this point, regardless of node, we are invoking deferred_split_scan()
> > and it's the same memcg->split_queue we are walking, regardless of node?
> >
> > Do let me know if my understanding is correct here!
> >
> > Hmm that does sound sub-optimal.
>
> That is exactly what's happening. We have all this dance to be precise
> about the shrinker bit, but the queue itself is not node-aware.
>
> That makes for some pretty erratic behavior.

Yeah that sucks!

>
> > > list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> > > streamlines a lot of the list operations and reclaim walks. It's used
> > > widely by other major shrinkers already. Convert the deferred split
> > > queue as well.
> >
> > It's odd that it wasn't used before?
>
> The list_lru API was initially explicitly for *slab objects*, doing
> virt_to_page() and mem_cgroup_from_slab_obj() on the passed in list
> heads to derive the proper queue. It was only extended more recently
> in 0a97c01cd20b ("list_lru: allow explicit memcg and NUMA node
> selection") to allow passing in the queue routing information and
> embedding the list heads in other things (such as folios).
>
> The only missing piece now was the support for callside locking of the
> list_lru structures.

Ack

>
> > > include/linux/huge_mm.h | 6 +-
> > > include/linux/memcontrol.h | 4 -
> > > include/linux/mmzone.h | 12 --
> > > mm/huge_memory.c | 342 ++++++++++++-------------------------
> > > mm/internal.h | 2 +-
> > > mm/khugepaged.c | 7 +
> > > mm/memcontrol.c | 12 +-
> > > mm/memory.c | 52 +++---
> > > mm/mm_init.c | 15 --
> > > 9 files changed, 151 insertions(+), 301 deletions(-)
> > >
> > > diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> > > index bd7f0e1d8094..8d801ed378db 100644
> > > --- a/include/linux/huge_mm.h
> > > +++ b/include/linux/huge_mm.h
> > > @@ -414,10 +414,9 @@ static inline int split_huge_page(struct page *page)
> > > {
> > > return split_huge_page_to_list_to_order(page, NULL, 0);
> > > }
> > > +
> > > +extern struct list_lru deferred_split_lru;
> >
> > It might be nice for the sake of avoiding a global to instead expose this
> > as a getter?
> >
> > Or actually better, since every caller outside of huge_memory.c that
> > references this uses folio_memcg_list_lru_alloc(), do something like:
> >
> > int folio_memcg_alloc_deferred(struct folio *folio, gfp_t gfp);
> >
> > in mm/huge_memory.c:
> >
> > /**
> > * blah blah blah put on error blah
> > */
> > int folio_memcg_alloc_deferred(struct folio *folio, gfp_t gfp)
> > {
> > int err;
> >
> > err = folio_memcg_list_lru_alloc(folio, &deferred_split_lru, gfp);
> > if (err) {
> > folio_put(folio);
> > return err;
> > }
> >
> > return 0;
> > }
> >
> > And then the callers can just invoke this, and you can make
> > deferred_split_lru static in mm/huge_memory.c?
>
> That sounds reasonable. Let me make this change.

Thanks!

>
> > > void deferred_split_folio(struct folio *folio, bool partially_mapped);
> > > -#ifdef CONFIG_MEMCG
> > > -void reparent_deferred_split_queue(struct mem_cgroup *memcg);
> > > -#endif
> > >
> > > void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> > > unsigned long address, bool freeze);
> > > @@ -650,7 +649,6 @@ static inline int try_folio_split_to_order(struct folio *folio,
> > > }
> > >
> > > static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> > > -static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
> > > #define split_huge_pmd(__vma, __pmd, __address) \
> > > do { } while (0)
> > >
> > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > > index 086158969529..0782c72a1997 100644
> > > --- a/include/linux/memcontrol.h
> > > +++ b/include/linux/memcontrol.h
> > > @@ -277,10 +277,6 @@ struct mem_cgroup {
> > > struct memcg_cgwb_frn cgwb_frn[MEMCG_CGWB_FRN_CNT];
> > > #endif
> > >
> > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > - struct deferred_split deferred_split_queue;
> > > -#endif
> > > -
> > > #ifdef CONFIG_LRU_GEN_WALKS_MMU
> > > /* per-memcg mm_struct list */
> > > struct lru_gen_mm_list mm_list;
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > > index 7bd0134c241c..232b7a71fd69 100644
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -1429,14 +1429,6 @@ struct zonelist {
> > > */
> > > extern struct page *mem_map;
> > >
> > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > -struct deferred_split {
> > > - spinlock_t split_queue_lock;
> > > - struct list_head split_queue;
> > > - unsigned long split_queue_len;
> > > -};
> > > -#endif
> > > -
> > > #ifdef CONFIG_MEMORY_FAILURE
> > > /*
> > > * Per NUMA node memory failure handling statistics.
> > > @@ -1562,10 +1554,6 @@ typedef struct pglist_data {
> > > unsigned long first_deferred_pfn;
> > > #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
> > >
> > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > - struct deferred_split deferred_split_queue;
> > > -#endif
> > > -
> > > #ifdef CONFIG_NUMA_BALANCING
> > > /* start time in ms of current promote rate limit period */
> > > unsigned int nbp_rl_start;
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index 3fc02913b63e..e90d08db219d 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -14,6 +14,7 @@
> > > #include <linux/mmu_notifier.h>
> > > #include <linux/rmap.h>
> > > #include <linux/swap.h>
> > > +#include <linux/list_lru.h>
> > > #include <linux/shrinker.h>
> > > #include <linux/mm_inline.h>
> > > #include <linux/swapops.h>
> > > @@ -67,6 +68,8 @@ unsigned long transparent_hugepage_flags __read_mostly =
> > > (1<<TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG)|
> > > (1<<TRANSPARENT_HUGEPAGE_USE_ZERO_PAGE_FLAG);
> > >
> > > +static struct lock_class_key deferred_split_key;
> > > +struct list_lru deferred_split_lru;
> > > static struct shrinker *deferred_split_shrinker;
> > > static unsigned long deferred_split_count(struct shrinker *shrink,
> > > struct shrink_control *sc);
> > > @@ -919,6 +922,13 @@ static int __init thp_shrinker_init(void)
> > > if (!deferred_split_shrinker)
> > > return -ENOMEM;
> > >
> > > + if (list_lru_init_memcg_key(&deferred_split_lru,
> > > + deferred_split_shrinker,
> > > + &deferred_split_key)) {
> > > + shrinker_free(deferred_split_shrinker);
> > > + return -ENOMEM;
> > > + }
> > > +
> >
> > It's kind of out of scope for the series, but I hate that the huge zero
> > folio stuff has an early exit for persistent but if not, falls through to
> > trying to set that up - it'd be nice to split out the huge zero folio stuff
> > into another function.
> >
> > But probably one for a follow up!
>
> It doesn't bother me personally, but I wouldn't object to it being
> moved to its own function and have the caller do necessary cleanups.
>
> But yeah sounds like an unrelated refactor :)

Yeah it is entirely unrelated for sure, can be an optional follow up!

>
> > > deferred_split_shrinker->count_objects = deferred_split_count;
> > > deferred_split_shrinker->scan_objects = deferred_split_scan;
> > > shrinker_register(deferred_split_shrinker);
> > > @@ -939,6 +949,7 @@ static int __init thp_shrinker_init(void)
> > >
> > > huge_zero_folio_shrinker = shrinker_alloc(0, "thp-zero");
> > > if (!huge_zero_folio_shrinker) {
> > > + list_lru_destroy(&deferred_split_lru);
> > > shrinker_free(deferred_split_shrinker);
> >
> > Presumably no probably-impossible-in-reality race on somebody entering the
> > shrinker and referencing the deferred_split_lru before the shrinker is freed?
>
> Ah right, I think for clarity it would indeed be better to destroy the
> shrinker, then the queue. Let me re-order this one.
>
> But yes, in practice, none of the above fails. If we have trouble
> doing a couple of small kmallocs during a subsys_initcall(), that
> machine is unlikely to finish booting, let alone allocate enough
> memory to enter the THP shrinker.

Yeah I thought that might be the case, but it seems more logical to kill the
shrinker first, thanks!
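
i.e. on the error path, just swapped:

	shrinker_free(deferred_split_shrinker);
	list_lru_destroy(&deferred_split_lru);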

>
> > > @@ -953,6 +964,7 @@ static int __init thp_shrinker_init(void)
> > > static void __init thp_shrinker_exit(void)
> > > {
> > > shrinker_free(huge_zero_folio_shrinker);
> > > + list_lru_destroy(&deferred_split_lru);
> >
> > Same question above as to race/ordering.
>
> ... and this one as well.
>
> > > shrinker_free(deferred_split_shrinker);
> > > }
>
> > > @@ -3854,34 +3761,34 @@ static int __folio_freeze_and_split_unmapped(struct folio *folio, unsigned int n
> > > struct folio *end_folio = folio_next(folio);
> > > struct folio *new_folio, *next;
> > > int old_order = folio_order(folio);
> > > + struct list_lru_one *l;
> >
> > Nit, and maybe this is a convention, but hate single letter variable names,
> > 'lru' or something might be nicer?
>
> Yeah I stuck with the list_lru internal naming, which uses `lru` for
> the struct list_lru, and `l` for struct list_lru_one. I suppose that
> was fine for the very domain-specific code and short functions in
> there, but it's grating in large, general MM functions like these.
>
> Since `lru` is taken, any preferences? llo?

ljs? ;)

Could be list?

>
> > > + bool dequeue_deferred;
> > > int ret = 0;
> > > - struct deferred_split *ds_queue;
> > >
> > > VM_WARN_ON_ONCE(!mapping && end);
> > > /* Prevent deferred_split_scan() touching ->_refcount */
> > > - ds_queue = folio_split_queue_lock(folio);
> > > + dequeue_deferred = folio_test_anon(folio) && old_order > 1;
> >
> > Why anon? (This review is partly me learning about the shrinker, an area
> > I'm weak on :)
>
> This is the type of folios that can be queued through
> deferred_split_folio(). The order check is inside that function
> itself; and the function is only called for anon pages.
>
> File pages are split differently, they don't use the deferred split
> shrinker and we don't allocate list_lru heads for them. However, they
> still come through this split path here, and so I need to gate whether
> it's safe to do list_lru_lock(), __list_lru_del() etc.
>
> > > + if (dequeue_deferred) {
> > > + rcu_read_lock();
> > > + l = list_lru_lock(&deferred_split_lru,
> > > + folio_nid(folio), folio_memcg(folio));
> >
> > Hm don't adore this sort of almost 'hidden' RCU lock here, but this
> > function is pretty disgusting and needs serious work in general.
> >
> > And any function that took the RCU lock and list_lru lock/did the unlock
> > equivalent would be equally horrible so yeah, I guess needs deferring to a
> > refactor.
> >
> > OTOH, this could be a good excuse for us to pay down some technical debt
> > and split out for instance the folio_ref_freeze() bits?
> >
> > Could we do something like:
> >
> > bool frozen;
> >
> > ...
> >
> > dequeue_deferred = folio_test_anon(folio) && old_order > 1;
> > frozen = folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1);
> >
> > if (dequeue_deferred && frozen) {
> > struct list_lru_one *lru;
> >
> > rcu_read_lock();
> > lru = list_lru_lock(&deferred_split_lru,
> > folio_nid(folio), folio_memcg(folio));
> > __list_lru_del(&deferred_split_lru, lru,
> > &folio->_deferred_list, folio_nid(folio));
> > if (folio_test_partially_mapped(folio)) {
> > folio_clear_partially_mapped(folio);
> > mod_mthp_stat(old_order,
> > MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> > }
> > list_lru_unlock(lru);
> > rcu_read_unlock();
> > }
> >
> > if (!frozen)
> > return -EAGAIN;
> >
> > ... rest of logic now one indent level lower ...
> >
> > Or maybe factor that out into a helper function or something?
> >
> > static void execute_deferred_dequeue(...) { ... }
> >
> > With this implemented either way you'd be able to get rid of the else block
> > too.
> >
> > obviously only valid if you are able to do the freezing earlier?
>
> The reason it locks the list_lru first is because there can be a race
> between the shrinker and a synchronous split attempt.

Ahh!

>
> This is what that comment above it is about:
>
> /* Prevent deferred_split_scan() touching ->_refcount */

I _may_ have glossed over this :)

>
> If the shrinker fails to "get" the folio while holding the shrinker
> lock, it thinks it beat the free path to the list_lru lock and will
> feel responsible for cleaning up the deferred split state:
>
> /* caller holds the list_lru lock already */
> static enum lru_status deferred_split_isolate(struct list_head *item,
> struct list_lru_one *lru,
> void *cb_arg)
> {
> struct folio *folio = container_of(item, struct folio, _deferred_list);
> struct list_head *freeable = cb_arg;
>
> if (folio_try_get(folio)) {
> list_lru_isolate_move(lru, item, freeable);
> return LRU_REMOVED;
> }
>
> /* We lost race with folio_put() */
> list_lru_isolate(lru, item);
> if (folio_test_partially_mapped(folio)) {
> folio_clear_partially_mapped(folio);
> mod_mthp_stat(folio_order(folio),
> MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> }
> return LRU_REMOVED;
> }
>
> So if the split path freezes before acquiring the list_lru lock, we
> get them both stepping on the deferred split state.

So we're sort of intentionally doing a lock inversion on the other path (give me
some rope here, I mean broadly speaking :) to avoid this?

>
> This may or may not be safe given the current manifestation of
> list_lru_del() and the folio_test_partially_mapped() tests. But it
> feels pretty hairy to let them race and rely on individual tests into
> what should be a coherent aggregate state.

Yeah sure best not.

But, and I _know_ it's nitty, sorry, but maybe worth expanding that comment to
explain that e.g. 'we must take the list_lru lock prior to freezing the folio
reference count to avoid racing with deferred_split_scan()' or similar?
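
Maybe something like this, though obviously word it however you see fit, I'm
just going from my understanding of the race you describe:

	/*
	 * Take the list_lru lock before freezing ->_refcount: if the
	 * refcount were frozen first, deferred_split_scan() could fail
	 * folio_try_get() and clean up the deferred split state
	 * (partially_mapped flag, mTHP stats) concurrently with us.
	 */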

>
> > > + }
> > > if (folio_ref_freeze(folio, folio_cache_ref_count(folio) + 1)) {
> > > struct swap_cluster_info *ci = NULL;
> > > struct lruvec *lruvec;
> > >
> > > - if (old_order > 1) {
> >
> > Before was this also applicable to non-anon folios?
>
> This branch, yes. But non-anon pages would then fail this:
>
> > > - if (!list_empty(&folio->_deferred_list)) {
> > > - ds_queue->split_queue_len--;
> > > - /*
> > > - * Reinitialize page_deferred_list after removing the
> > > - * page from the split_queue, otherwise a subsequent
> > > - * split will see list corruption when checking the
> > > - * page_deferred_list.
> > > - */
> > > - list_del_init(&folio->_deferred_list);
> > > - }
>
> ..since they aren't ever added. This was okay because accessing that
> list head is always safe. Now I need to be explicit to determine if
> it's safe to call __list_lru_del() which does all these list_lru
> references, which aren't allocated for file.
>
> > > + if (dequeue_deferred) {
> > > + __list_lru_del(&deferred_split_lru, l,
> > > + &folio->_deferred_list, folio_nid(folio));
> > > if (folio_test_partially_mapped(folio)) {
> > > folio_clear_partially_mapped(folio);
> > > mod_mthp_stat(old_order,
> > > MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> > > }
>
> And honestly, I think it's nicer to be explicit here about the
> expectations of which pages get which treatment.

Yeah for sure.

>
> Filtering file pages on !list_empty() was not very obvious.

Yes :)

>
> > > + int nid = folio_nid(folio);
> > > unsigned long flags;
> > > bool unqueued = false;
> > >
> > > WARN_ON_ONCE(folio_ref_count(folio));
> > > WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio));
> > >
> > > - ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
> > > - if (!list_empty(&folio->_deferred_list)) {
> > > - ds_queue->split_queue_len--;
> > > + rcu_read_lock();
> > > + l = list_lru_lock_irqsave(&deferred_split_lru, nid, folio_memcg(folio), &flags);
> > > + if (__list_lru_del(&deferred_split_lru, l, &folio->_deferred_list, nid)) {
> >
> > Maybe worth factoring __list_lru_del() into something that explicitly
> > references &folio->_deferred_list rather than open codingin both places?
>
> Hm, I wouldn't want to encode this into list_lru API, but we could do
> a huge_memory.c-local helper?
>
> folio_deferred_split_del(folio, l, nid)

Well, I kind of hate how we're using the global deferred_split_lru all over the
place, so a helper would be preferable, ideally one that could also be used
from khugepaged.c and memory.c?
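
Something like this perhaps, entirely untested and the name is just a
placeholder (caller holds RCU and the list_lru lock, mirroring your
__list_lru_del() callsites):

	static bool folio_deferred_split_del(struct folio *folio,
					     struct list_lru_one *lru)
	{
		return __list_lru_del(&deferred_split_lru, lru,
				      &folio->_deferred_list, folio_nid(folio));
	}

Then the &folio->_deferred_list / folio_nid() plumbing only lives in one place.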

>
> > > @@ -4406,7 +4320,11 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
> > > if (folio_test_swapcache(folio))
> > > return;
> > >
> > > - ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
> > > + nid = folio_nid(folio);
> > > +
> > > + rcu_read_lock();
> > > + memcg = folio_memcg(folio);
> > > + l = list_lru_lock_irqsave(&deferred_split_lru, nid, memcg, &flags);
> >
> > Do we really need to hold the lock over all the below, but hmm we can't do
> > an irqsave/restore list_lru_add(), and maybe not worth adding one either,
> > OK.
> >
> > Just seems odd to have <lock> <__unlocked_add_variant> <unlock>,
> > instinctively feels like it should be just <locked_add_variant>.
>
> It protects that auxiliary folio_test_partially_mapped() state that
> the shrinker also touches. LRU + that page flag + the counts is what
> that lock protects.

Ack thanks

>
> > > @@ -4473,45 +4375,47 @@ static bool thp_underused(struct folio *folio)
> > > return false;
> > > }
> > >
> > > +static enum lru_status deferred_split_isolate(struct list_head *item,
> > > + struct list_lru_one *lru,
> > > + void *cb_arg)
> > > +{
> > > + struct folio *folio = container_of(item, struct folio, _deferred_list);
> > > + struct list_head *freeable = cb_arg;
> > > +
> > > + if (folio_try_get(folio)) {
> > > + list_lru_isolate_move(lru, item, freeable);
> > > + return LRU_REMOVED;
> > > + }
> > > +
> > > + /* We lost race with folio_put() */
> >
> > Hmm, in the original code, this comment is associated with the partially
> > mapped logic, BUT this seems actually correct, because folio_try_get()
> > because it does folio_ref_add_unless_zero() only fails if the folio lost
> > the race.
> >
> > So I think you're more correct right?
>
> I think the original placement was because that else if made it
> awkward to place where it should be. But yes, the above is more
> correct: the comment refers to what happens when the "get" fails.

Right yeah

>
> > > - ds_queue = split_queue_lock_irqsave(sc->nid, sc->memcg, &flags);
> > > - /* Take pin on all head pages to avoid freeing them under us */
> > > - list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
> > > - _deferred_list) {
> > > - if (folio_try_get(folio)) {
> > > - folio_batch_add(&fbatch, folio);
> > > - } else if (folio_test_partially_mapped(folio)) {
> > > - /* We lost race with folio_put() */
> > > - folio_clear_partially_mapped(folio);
> > > - mod_mthp_stat(folio_order(folio),
> > > - MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
> > > - }
>
> > > @@ -4534,64 +4438,32 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> > > }
> > > folio_unlock(folio);
> > > next:
> > > - if (did_split || !folio_test_partially_mapped(folio))
> > > - continue;
> > > /*
> > > * Only add back to the queue if folio is partially mapped.
> > > * If thp_underused returns false, or if split_folio fails
> > > * in the case it was underused, then consider it used and
> > > * don't add it back to split_queue.
> > > */
> > > - fqueue = folio_split_queue_lock_irqsave(folio, &flags);
> > > - if (list_empty(&folio->_deferred_list)) {
> > > - list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
> > > - fqueue->split_queue_len++;
> > > + if (!did_split && folio_test_partially_mapped(folio)) {
> > > + rcu_read_lock();
> > > + l = list_lru_lock_irqsave(&deferred_split_lru,
> > > + folio_nid(folio),
> > > + folio_memcg(folio),
> > > + &flags);
> > > + __list_lru_add(&deferred_split_lru, l,
> > > + &folio->_deferred_list,
> > > + folio_nid(folio), folio_memcg(folio));
> > > + list_lru_unlock_irqrestore(l, &flags);
> >
> > Hmm this does make me think it'd be nice to have a list_lru_add() variant
> > for irqsave/restore then, since it's a repeating pattern!
>
> Yeah, this site calls for it the most :( I tried to balance callsite
> prettiness with the need to extend the list_lru api; it's just one
> caller. And the possible mutations and variants with these locks is
> seemingly endless once you open that can of worms...

True...

>
> Case in point: this is process context and we could use
> spin_lock_irq() here. I'm just using list_lru_lock_irqsave() because
> that's the common variant used by the add and del paths already.
>
> If I went with a helper, I could do list_lru_add_irq().
>
> I think it would actually nicely mirror the list_lru_shrink_walk_irq()
> a few lines up.

Yeah, I mean I'm pretty sure this pattern repeats quite a few times, so it's
worthy of a helper.
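
Roughly, sticking with the irqsave variant your other callsites already use
(names made up, untested sketch):

	static void deferred_split_add(struct folio *folio)
	{
		struct list_lru_one *l;
		int nid = folio_nid(folio);
		struct mem_cgroup *memcg;
		unsigned long flags;

		rcu_read_lock();
		memcg = folio_memcg(folio);
		l = list_lru_lock_irqsave(&deferred_split_lru, nid, memcg, &flags);
		__list_lru_add(&deferred_split_lru, l, &folio->_deferred_list,
			       nid, memcg);
		list_lru_unlock_irqrestore(l, &flags);
		rcu_read_unlock();
	}

And an _irq flavour could trivially sit alongside it, as you say.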

>
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1081,6 +1081,7 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > > }
> > >
> > > count_vm_event(THP_COLLAPSE_ALLOC);
> > > +
> > > if (unlikely(mem_cgroup_charge(folio, mm, gfp))) {
> > > folio_put(folio);
> > > *foliop = NULL;
> > > @@ -1089,6 +1090,12 @@ static enum scan_result alloc_charge_folio(struct folio **foliop, struct mm_stru
> > >
> > > count_memcg_folio_events(folio, THP_COLLAPSE_ALLOC, 1);
> >
> > Do we want to put this stat counter underneath the below so it's not
> > incremented on fail?
>
> We currently increment it also when the cgroup charge fails, which is
> common. The list_lru allocation (a smaller kmalloc) in turn should
> basically never fail - we just managed to get a PMD-sized folio.
>
> Executive summary being: the way we bump them currently is weird. I
> tried to stay out of it to keep this series on track ;)

Ack ok fair enough :)

>
> > > + if (folio_memcg_list_lru_alloc(folio, &deferred_split_lru, gfp)) {
> > > + folio_put(folio);
> > > + *foliop = NULL;
> > > + return SCAN_CGROUP_CHARGE_FAIL;
> >
> > Do we not need to uncharge here?
>
> folio_put() does that. We could do it for symmetry, but there is no
> need. And the hard and fast rule is that folio_memcg() is immutable
> until the put hits 0, so it would look weird/unexpected.

So it does!

void __folio_put(struct folio *folio)
{
...
mem_cgroup_uncharge(folio);
...
}

I didn't _think_ it would but I guess I didn't realise how much work
__folio_put() was doing :)

>
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index a47fb68dd65f..f381cb6bdff1 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -4015,11 +4015,6 @@ static struct mem_cgroup *mem_cgroup_alloc(struct mem_cgroup *parent)
> > > for (i = 0; i < MEMCG_CGWB_FRN_CNT; i++)
> > > memcg->cgwb_frn[i].done =
> > > __WB_COMPLETION_INIT(&memcg_cgwb_frn_waitq);
> > > -#endif
> > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > - spin_lock_init(&memcg->deferred_split_queue.split_queue_lock);
> > > - INIT_LIST_HEAD(&memcg->deferred_split_queue.split_queue);
> > > - memcg->deferred_split_queue.split_queue_len = 0;
> > > #endif
> > > lru_gen_init_memcg(memcg);
> > > return memcg;
> > > @@ -4167,11 +4162,10 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> > > zswap_memcg_offline_cleanup(memcg);
> > >
> > > memcg_offline_kmem(memcg);
> > > - reparent_deferred_split_queue(memcg);
> > > /*
> > > - * The reparenting of objcg must be after the reparenting of the
> > > - * list_lru and deferred_split_queue above, which ensures that they will
> > > - * not mistakenly get the parent list_lru and deferred_split_queue.
> > > + * The reparenting of objcg must be after the reparenting of
> > > + * the list_lru in memcg_offline_kmem(), which ensures that
> > > + * they will not mistakenly get the parent list_lru.
> > > */
> > > memcg_reparent_objcgs(memcg);
> > > reparent_shrinker_deferred(memcg);
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 219b9bf6cae0..e68ceb4aa624 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -4651,13 +4651,19 @@ static struct folio *alloc_swap_folio(struct vm_fault *vmf)
> > > while (orders) {
> > > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > > folio = vma_alloc_folio(gfp, order, vma, addr);
> > > - if (folio) {
> > > - if (!mem_cgroup_swapin_charge_folio(folio, vma->vm_mm,
> > > - gfp, entry))
> > > - return folio;
> > > + if (!folio)
> > > + goto next;
> > > + if (mem_cgroup_swapin_charge_folio(folio, vma->vm_mm, gfp, entry)) {
> > > count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK_CHARGE);
> > > folio_put(folio);
> > > + goto next;
> > > }
> > > + if (folio_memcg_list_lru_alloc(folio, &deferred_split_lru, gfp)) {
> >
> > Do we need to uncharge here?
>
> Same here. It's folio state that the put cleans up.

Ack, indeed!

>
> > > + folio_put(folio);
> > > + goto fallback;
> > > + }
> > > + return folio;
> > > +next:
> > > count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
> > > order = next_order(&orders, order);
> > > }
> > > @@ -5169,24 +5175,28 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> > > while (orders) {
> > > addr = ALIGN_DOWN(vmf->address, PAGE_SIZE << order);
> > > folio = vma_alloc_folio(gfp, order, vma, addr);
> > > - if (folio) {
> > > - if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> > > - count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> > > - folio_put(folio);
> > > - goto next;
> > > - }
> > > - folio_throttle_swaprate(folio, gfp);
> > > - /*
> > > - * When a folio is not zeroed during allocation
> > > - * (__GFP_ZERO not used) or user folios require special
> > > - * handling, folio_zero_user() is used to make sure
> > > - * that the page corresponding to the faulting address
> > > - * will be hot in the cache after zeroing.
> > > - */
> > > - if (user_alloc_needs_zeroing())
> > > - folio_zero_user(folio, vmf->address);
> > > - return folio;
> > > + if (!folio)
> > > + goto next;
> >
> > This applies to the above equivalent refactorings, but maybe worth
> > separating out this:
> >
> > if (folio) { ... big branch ... } ->
> > if (!folio) goto next; ... what was big branch ...
> >
> > Refactoring into a separate patch? Makes it easier to see pertinent logic
> > changes + helps review/bisectability/fixes/etc. etc.
>
> Will do.

Thanks!

>
> > > + if (mem_cgroup_charge(folio, vma->vm_mm, gfp)) {
> > > + count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK_CHARGE);
> > > + folio_put(folio);
> > > + goto next;
> > > }
> > > + if (folio_memcg_list_lru_alloc(folio, &deferred_split_lru, gfp)) {
> >
> > Again, do we need to uncharge here?
>
> folio_put() does it :)

Yeah, I should have just checked __folio_put() :>)

<Insert quote about good programmers, lazy, hubris etc. here :P>

>
> > > + folio_put(folio);
> > > + goto fallback;
> > > + }
> > > + folio_throttle_swaprate(folio, gfp);
> > > + /*
> > > + * When a folio is not zeroed during allocation
> > > + * (__GFP_ZERO not used) or user folios require special
> > > + * handling, folio_zero_user() is used to make sure
> > > + * that the page corresponding to the faulting address
> > > + * will be hot in the cache after zeroing.
> > > + */
> > > + if (user_alloc_needs_zeroing())
> > > + folio_zero_user(folio, vmf->address);
> > > + return folio;
> > > next:
> > > count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
> > > order = next_order(&orders, order);
> > > diff --git a/mm/mm_init.c b/mm/mm_init.c
> > > index cec7bb758bdd..f293a62e652a 100644
> > > --- a/mm/mm_init.c
> > > +++ b/mm/mm_init.c
> > > @@ -1388,19 +1388,6 @@ static void __init calculate_node_totalpages(struct pglist_data *pgdat,
> > > pr_debug("On node %d totalpages: %lu\n", pgdat->node_id, realtotalpages);
> > > }
> > >
> > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > > -static void pgdat_init_split_queue(struct pglist_data *pgdat)
> > > -{
> > > - struct deferred_split *ds_queue = &pgdat->deferred_split_queue;
> > > -
> > > - spin_lock_init(&ds_queue->split_queue_lock);
> > > - INIT_LIST_HEAD(&ds_queue->split_queue);
> > > - ds_queue->split_queue_len = 0;
> > > -}
> > > -#else
> > > -static void pgdat_init_split_queue(struct pglist_data *pgdat) {}
> > > -#endif
> > > -
> > > #ifdef CONFIG_COMPACTION
> > > static void pgdat_init_kcompactd(struct pglist_data *pgdat)
> > > {
> > > @@ -1416,8 +1403,6 @@ static void __meminit pgdat_init_internals(struct pglist_data *pgdat)
> > >
> > > pgdat_resize_init(pgdat);
> > > pgdat_kswapd_lock_init(pgdat);
> > > -
> > > - pgdat_init_split_queue(pgdat);
> > > pgdat_init_kcompactd(pgdat);
> > >
> > > init_waitqueue_head(&pgdat->kswapd_wait);
> > > --
> > > 2.53.0
> > >
> >
> > Overall a lovely amount of code deletion here :)
> >
> > Thanks for doing this, Cheers Lorenzo
>
> Thanks for your feedback!

No worries, thanks for the series :)

Cheers, Lorenzo