Re: [PATCH v5 0/9] mm: switch THP shrinker to list_lru
From: Lance Yang
Date: Wed Jun 03 2026 - 08:00:18 EST
On 2026/6/3 19:41, Johannes Weiner wrote:
On Wed, Jun 03, 2026 at 12:44:26PM +0800, Lance Yang wrote:
On Tue, Jun 02, 2026 at 05:46:02PM -0400, Johannes Weiner wrote:
On Mon, Jun 01, 2026 at 04:36:52PM +0800, Lance Yang wrote:
As the changelog above says, the old queue is per-memcg only, rather
than per-memcg-per-node. So reclaim on one node can still walk the whole
memcg queue and split underused THPs from other nodes in the same memcg.
But I think the new one can lose reclaim in the cgroup.memory=nokmem
case ...
With nokmem, the deferred shrinker can still run from memcg reclaim,
because it is SHRINKER_NONSLAB. But the list_lru is no longer per-memcg:
__list_lru_init() clears memcg_aware,
	if (mem_cgroup_kmem_disabled())
		memcg_aware = false;
so list_lru_from_memcg_idx() falls back to the shared node list:
	static inline struct list_lru_one *
	list_lru_from_memcg_idx(struct list_lru *lru, int nid, int idx)
	{
		if (list_lru_memcg_aware(lru) && idx >= 0) {
			[...]
		}
		return &lru->node[nid].lru;
	}
That makes the shrinker bit unreliable. __list_lru_add() still sets the
bit on the memcg passed in, but only when the list goes from empty to
non-empty:
	bool __list_lru_add(struct list_lru *lru, struct list_lru_one *l,
			    struct list_head *item, int nid,
			    struct mem_cgroup *memcg)
	{
		if (list_empty(item)) {
			[...]
			if (!l->nr_items++)
				set_shrinker_bit(memcg, nid, lru_shrinker_id(lru));
			[...]
			return true;
		}
		return false;
	}
If memcg A adds the first folio, A gets the bit. If memcg B later adds a
folio to the same shared list, B does not get a bit, because the list
was already non-empty.
So in the A-first/B-later case, reclaim from B may not call the deferred
shrinker at all. The shared list is scanned from memcg reclaim only if
reclaim runs from the memcg that has the bit, such as A here, or from
global reclaim :)
Anyway, only after the shared list is emptied does the next memcg to add
a folio get to be the one with the bit, IIUC :)
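The A-first/B-later case can be sketched as a small userspace model. This is not kernel code: toy_shared_list, toy_list_add() and toy_memcg_reclaim_scans() are made-up names, and the model assumes a single node and only two memcgs. It copies just the "set the bit on the empty to non-empty transition" rule from the __list_lru_add() excerpt above.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_MEMCGS 2	/* toy: memcg A = 0, memcg B = 1 */

/* Hypothetical stand-in for the shared per-node list used under nokmem. */
struct toy_shared_list {
	long nr_items;
	bool shrinker_bit[NR_MEMCGS];	/* one shrinker bit per memcg, one node */
};

/* Same rule as the quoted __list_lru_add(): set the bit only on 0 -> 1. */
static void toy_list_add(struct toy_shared_list *l, int memcg)
{
	assert(memcg >= 0 && memcg < NR_MEMCGS);
	if (!l->nr_items++)
		l->shrinker_bit[memcg] = true;
}

/* Memcg reclaim reaches the shrinker only if that memcg holds the bit. */
static bool toy_memcg_reclaim_scans(const struct toy_shared_list *l, int memcg)
{
	assert(memcg >= 0 && memcg < NR_MEMCGS);
	return l->shrinker_bit[memcg];
}
```

Adding from A (0) and then from B (1) to a zero-initialized list leaves nr_items at 2 with only A's bit set, so in this model reclaim from B never reaches the shared list.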
Sorry for the delay, this took me a bit to think about. The shrinker
code is a mess.
I read it the same way you do. And this is true for all list_lru users
when nokmem is set: we just set random nonsense shrinker bits.
HOWEVER, the generic shrinker code fixes that up by IGNORING random
shrinker bits like this when !memcg_kmem_online(). And shrinking
correctly happens only against the shared root queue when the reclaim
iterator walks root_mem_cgroup.
HOWEVER, the THP shrinker explicitly sets SHRINKER_NONSLAB, which in
turn overrides the previous override. So yes there is a weirdness: we
get the root cgroup invocation against the shared queue, and then one
more time triggered by that random memcg bit.
The most direct fix is to just drop SHRINKER_NONSLAB. It declares
independence from kmem, which is no longer true.
Cleaning up the shrinker code is left for another day.
Thanks for working on this!
Wondering if this fix trades one problem for another, though ...
Before this series, the deferred split shrinker had a real per-memcg
queue. Even with cgroup.memory=nokmem, memcg reclaim could still scan
that memcg's own deferred_split_queue:
	memcg reclaim -> deferred split shrinker -> sc->memcg->deferred_split_queue
With the fix, nokmem + w/o SHRINKER_NONSLAB falls back to a
non-memcg-aware shrinker:
	memcg reclaim -> skip deferred split shrinker
	root/global reclaim -> deferred split shrinker -> shared list_lru
Is that expected? There would be no memcg-driven deferred split reclaim
under nokmem, IIUC ...
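To make the nokmem cases easier to compare, here is a rough userspace sketch of the decision as described in this thread. It is not the real shrinker code: toy_shrinker and toy_invokes_shrinker() are hypothetical names, and the model assumes only what was said above, i.e. root reclaim always reaches the shared queue, stray memcg bits are ignored when kmem is offline, and a NONSLAB shrinker bypasses that check.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the one shrinker flag that matters here. */
struct toy_shrinker {
	bool nonslab;	/* models SHRINKER_NONSLAB being set */
};

/*
 * Would a reclaim pass against one cgroup invoke the shrinker?
 * Models only the behaviour described in this thread.
 */
static bool toy_invokes_shrinker(bool is_root, bool kmem_online,
				 bool bit_set, const struct toy_shrinker *s)
{
	assert(s);
	if (is_root)
		return true;	/* root walks the shared queue */
	if (!kmem_online && !s->nonslab)
		return false;	/* stray memcg bits are ignored */
	return bit_set;		/* otherwise the memcg bit decides */
}
```

In this model, with kmem offline and a stray bit set on a non-root memcg, the function returns true when nonslab is set (the extra scan mentioned above) and false once the flag is dropped, while root reclaim returns true either way.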
Yes, this is all correct. list_lru is still inherently tied to the
kmem component of memcg (memcg_kmem_id()).
So without kmem, no isolation. But without kmem, no isolation *for a
lot of stuff*. It's a legacy knob from when slab accounting was new and
expensive. But so many things depend on it now, disabling it just
punches a massive hole into memcg functionality and isolation
coverage. It's not a sanctioned production use flag.
This change is negligible from a memcg semantics POV.
Thanks for clarifying!
No strong objection from me. Just wanted to call out the nokmem
behavior change and hear what folks think :D
Not sure what the right fix is, as I am not a memcg expert ...