Re: [PATCH v3 0/3] mm/percpu: Fix possible NOFS/NOIO reclaim recursion

From: Dennis Zhou

Date: Wed Jun 17 2026 - 03:03:28 EST

Hello,

On Fri, Jun 12, 2026 at 10:26:45AM +0800, Kaitao Cheng wrote:
> From: chengkaitao <chengkaitao@xxxxxxxxxx>
>
> Hi all,
>
> After v1 was posted, there were many different opinions, mainly around
> optimizing pcpu_alloc_mutex. This v3 is intended to describe the existing
> problems more clearly and provide a conventional fix approach.
>
> Commit 9a5b183941b5 ("mm, percpu: do not consider sleepable allocations
> atomic") allowed GFP_NOFS and GFP_NOIO percpu allocations to use
> pcpu_alloc_mutex and the chunk creation slow path. This restored the
> allocation capability that was lost when those constrained allocations
> were treated as atomic, but it also makes the percpu slow path visible
> to callers from constrained reclaim contexts.
>
> There are two related problems.
>
> First, the create and populate slow paths do not fully preserve the
> caller's allocation constraints. pcpu_alloc_noprof() derives pcpu_gfp from
> the caller supplied GFP mask and passes it down to the percpu backing page
> allocator. However, chunk creation calls pcpu_get_vm_areas(), and chunk
> population can allocate temporary metadata or vmalloc page tables while
> mapping backing pages. Those internal allocations can still use GFP_KERNEL,
> so a caller using GFP_NOFS or GFP_NOIO can enter unconstrained FS or IO
> reclaim while holding pcpu_alloc_mutex.
>
> One possible case is blk-cgroup after commit 5d726c4dbeed
> ("blk-cgroup: fix possible deadlock while configuring policy").
> blkg_conf_prep() now serializes against blkcg_deactivate_policy() with
> q->blkcg_mutex, and blkg_alloc() uses GFP_NOIO because queue freeze and IO
> reclaim dependencies can otherwise deadlock. If the percpu slow path loses
> that GFP_NOIO context, direct reclaim or writeback can issue IO to a frozen
> queue while q->blkcg_mutex is held.
>
> Second, allowing sleepable GFP_NOFS/GFP_NOIO allocations to take
> pcpu_alloc_mutex means that unconstrained backing allocations made under
> the mutex can create an FS/IO reclaim dependency against a constrained
> caller which already holds an FS or IO lock and then waits for
> pcpu_alloc_mutex.
>
> This series fixes those issues in three steps:
>
> - pass the caller supplied GFP mask into pcpu_get_vm_areas() and use it
> for vmalloc metadata and KASAN shadow allocations;
> - pass the GFP mask through the chunk population path, including the
> temporary pages array and vmalloc page table allocation scope;
> - restrict percpu backing allocations performed while holding
> pcpu_alloc_mutex to GFP_NOIO, so they cannot recurse into IO or FS
> reclaim.
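
To make the third step concrete: restricting a mask to GFP_NOIO means
clearing the IO and FS reclaim bits while leaving the other bits the
caller passed in place. Below is a minimal userspace sketch of that
masking. The DEMO_* bit values and the helper name demo_backing_gfp()
are made up for illustration; they are not the kernel's values, and the
code is not taken from the patch.

```c
#include <assert.h>

/* Illustrative stand-ins only; real values are defined by the kernel. */
typedef unsigned int gfp_t;

#define DEMO_GFP_IO      0x1u	/* reclaim may start IO */
#define DEMO_GFP_FS      0x2u	/* reclaim may call into the FS */
#define DEMO_GFP_RECLAIM 0x4u	/* allocation may sleep and reclaim */
#define DEMO_GFP_NOFAIL  0x8u	/* caller cannot handle failure */

/* Same composition as the kernel's GFP_KERNEL / GFP_NOFS / GFP_NOIO. */
#define DEMO_GFP_KERNEL (DEMO_GFP_RECLAIM | DEMO_GFP_IO | DEMO_GFP_FS)
#define DEMO_GFP_NOFS   (DEMO_GFP_RECLAIM | DEMO_GFP_IO)
#define DEMO_GFP_NOIO   (DEMO_GFP_RECLAIM)

/*
 * Hypothetical helper: mask used for backing pages while the mutex is
 * held. Drops the IO and FS bits, keeps everything else.
 */
static gfp_t demo_backing_gfp(gfp_t gfp)
{
	gfp_t out = gfp & ~(DEMO_GFP_IO | DEMO_GFP_FS);

	/* The result must never allow IO or FS reclaim. */
	assert((out & (DEMO_GFP_IO | DEMO_GFP_FS)) == 0);
	return out;
}
```

In this model GFP_KERNEL and GFP_NOFS callers both end up with a
GFP_NOIO backing mask, and unrelated bits such as the NOFAIL stand-in
pass through unchanged.
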
>
> This keeps sleepable GFP_NOFS/GFP_NOIO percpu allocations working, while
> avoiding the reclaim recursion risks introduced by making those allocations
> eligible for the mutex-protected slow path.
>
> Changes in v3:
> - Allow @gfp to pass __GFP_NOFAIL through. (Andrew Morton)
>
> Changes in v2:
> - split the previous first patch into vmalloc-area creation and chunk
> population changes; (Pedro Falcato)
> - pass the GFP mask explicitly to pcpu_get_vm_areas(); (Pedro Falcato)
> - apply the corresponding memalloc scope around vmalloc page table
> allocation during chunk population;
> - replace the reclaim recursion avoidance with a GFP_NOIO backing
> allocation mask instead of only rejecting nested reclaim.
> (Michal Hocko)
>
> Link to v2:
> https://lore.kernel.org/all/20260604113101.89510-1-kaitao.cheng@xxxxxxxxx/
>
> Link to v1:
> https://lore.kernel.org/all/20260528132917.81123-1-kaitao.cheng@xxxxxxxxx/
>
> Kaitao Cheng (3):
> mm/vmalloc: honor GFP constraints in pcpu_get_vm_areas()
> mm/percpu: honor GFP constraints when populating chunks
> mm/percpu: Avoid IO/FS reclaim in backing allocations
>
> include/linux/vmalloc.h | 4 ++--
> mm/percpu-vm.c | 40 +++++++++++++++++++++++++++-------------
> mm/percpu.c | 18 ++++++++++++------
> mm/vmalloc.c | 23 ++++++++++++-----------
> 4 files changed, 53 insertions(+), 32 deletions(-)
>
> --
> 2.43.0
>

Thanks for taking on this work. I definitely missed this earlier.

I acked patches 1 and 2. I think patch 3 is good, but the __GFP_NOFAIL
handling warrants more discussion. I think my take back then was that a
single percpu allocation can trigger a large number of backing page
allocations. As a result, while the caller may not be asking for a lot
of memory, we may need substantially more to back that allocation. Given
that discrepancy, __GFP_NOFAIL currently only selects mutex_lock() over
mutex_lock_killable().
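
A rough userspace sketch of the two points above, under stated
assumptions: the page count is a lower bound that assumes the region is
page aligned and that every CPU's unit needs fresh pages, and the DEMO_*
value, the enum, and both helper names are hypothetical rather than
kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned int gfp_t;
#define DEMO_GFP_NOFAIL 0x8u	/* illustrative bit, not the kernel value */

enum demo_lock_mode {
	DEMO_LOCK_UNINTERRUPTIBLE,	/* models mutex_lock() */
	DEMO_LOCK_KILLABLE,		/* models mutex_lock_killable() */
};

/* NOFAIL only changes how the mutex is taken in this model. */
static enum demo_lock_mode demo_pick_lock(gfp_t gfp)
{
	return (gfp & DEMO_GFP_NOFAIL) ? DEMO_LOCK_UNINTERRUPTIBLE
				       : DEMO_LOCK_KILLABLE;
}

/*
 * Lower-bound estimate of backing pages for one percpu allocation:
 * pages per unit (rounded up) times the number of CPUs.
 */
static size_t demo_backing_pages(size_t size, size_t page_size,
				 unsigned int nr_cpus)
{
	assert(page_size > 0);
	return ((size + page_size - 1) / page_size) * nr_cpus;
}
```

With these assumptions a 64-byte request on a 128-CPU machine with 4096
byte pages can still need 128 backing pages, which is the discrepancy
between what the caller asked for and what has to be allocated.
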

Thanks,
Dennis