Re: Ask help about this patch c0cd6f557b90 "mm: page_alloc: fix freelist movement during block conversion"
From: Johannes Weiner
Date: Wed Apr 02 2025 - 15:44:42 EST
Hi Carlos,
On Wed, Apr 02, 2025 at 11:31:58AM +0000, Carlos Song wrote:
> Hi, all
>
> I found a 300ms~600ms IRQ-off period when writing 1Gb of data to a storage device on an i.MX7D SDB board with Linux kernel v6.14.
> From this discussion I find the regression root cause:
> https://lore.kernel.org/linux-mm/CAJuCfpGajtAP8-kw5B5mKmhfyq6Pn67+PJgMjBeozW-qzjQMkw@xxxxxxxxxxxxxx/T/
Thanks for the report!
> 2. After adding this patch: c0cd6f557b90 "mm: page_alloc: fix freelist movement during block conversion"
> # tracer: irqsoff
> #
> # irqsoff latency trace v1.1.5 on 6.9.0-rc4-00116-gc0cd6f557b90
> # --------------------------------------------------------------------
> # latency: 93635 us, #13758/13758, CPU#0 | (M:server VP:0, KP:0, SP:0 HP:0 #P:2)
> # -----------------
> # | task: dd-764 (uid:0 nice:0 policy:0 rt_prio:0)
> # -----------------
> # => started at: _raw_spin_lock_irqsave
> # => ended at: _raw_spin_unlock_irqrestore
> #
> #
> # _------=> CPU#
> # / _-----=> irqs-off/BH-disabled
> # | / _----=> need-resched
> # || / _---=> hardirq/softirq
> # ||| / _--=> preempt-depth
> # |||| / _-=> migrate-disable
> # ||||| / delay
> # cmd pid |||||| time | caller
> # \ / |||||| \ | /
> dd-764 0d.... 1us!: _raw_spin_lock_irqsave
> dd-764 0d.... 206us : find_suitable_fallback <-__rmqueue_pcplist
> dd-764 0d.... 209us : find_suitable_fallback <-__rmqueue_pcplist
> dd-764 0d.... 210us : find_suitable_fallback <-__rmqueue_pcplist
> dd-764 0d.... 213us+: steal_suitable_fallback <-__rmqueue_pcplist
> dd-764 0d.... 281us : find_suitable_fallback <-__rmqueue_pcplist
> dd-764 0d.... 282us : find_suitable_fallback <-__rmqueue_pcplist
> dd-764 0d.... 284us : find_suitable_fallback <-__rmqueue_pcplist
> dd-764 0d.... 286us : find_suitable_fallback <-__rmqueue_pcplist
> dd-764 0d.... 288us+: steal_suitable_fallback <-__rmqueue_pcplist
These are the freelists being replenished by a loop over
__rmqueue(). Two things stand out:
1. steal_suitable_fallback() is the expensive part. The patch in
   question made this slightly worse: stealability is now checked
   up front instead of stealing optimistically as before, so the
   pages in the block are iterated over twice. This can explain
   some of the issue, but not a 100x increase in lock hold time.
2. We're doing it *a lot*, and this is the likely culprit. Whereas
   before the patch we'd steal whole buddies and their remainders,
   afterwards there is a lot more single-page stealing when grabbing
   the whole block fails. This means __rmqueue_smallest() fails more
   often, and we end up doing a lot more top-down fallback scans:
> dd-767 0d.... 2043us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2045us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2047us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2049us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2101us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2103us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2181us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2184us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2220us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2222us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2304us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2306us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2365us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2367us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2368us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2370us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2372us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2434us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2436us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2438us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2442us : __mod_zone_page_state <-__rmqueue_pcplist
The __mod_zone_page_state() marks the successful allocation after
attempts to claim several different blocks. Had one of those claims
succeeded, it would have replenished the native freelist and we'd
see the next __mod_zone_page_state() quickly. Alas, they all failed:
> dd-767 0d.... 2445us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2446us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2448us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2450us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2490us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2492us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2548us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2550us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2586us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2588us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2652us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2654us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2712us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2714us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2715us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2717us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2719us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2720us+: try_to_claim_block <-__rmqueue_pcplist
> dd-767 0d.... 2778us : find_suitable_fallback <-__rmqueue_pcplist
> dd-767 0d.... 2780us : __mod_zone_page_state <-__rmqueue_pcplist
... and we go through the whole fallback spiel for the next page.
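To make the per-page cost concrete, here is a rough sketch of the
bulk fill loop described above. The helper names and signatures are
made up for illustration; this is not the actual allocator code:

        spin_lock_irqsave(&zone->lock, flags);
        for (i = 0; i < count; i++) {
                /* The native freelists are empty, so this keeps failing... */
                page = rmqueue_smallest(zone, order, migratetype);

                /*
                 * ...and every single page pays for a full top-down
                 * fallback scan. When claiming the whole block fails,
                 * we only steal one page, the native freelists stay
                 * empty, and the next iteration repeats the scan.
                 */
                for (o = MAX_PAGE_ORDER; !page && o >= order; o--) {
                        int fb = suitable_fallback_mt(zone, o, migratetype);

                        if (fb >= 0)
                                page = claim_block_or_steal_page(zone, o, fb);
                }
                if (!page)
                        break;
                list_add_tail(&page->pcp_list, list);
        }
        spin_unlock_irqrestore(&zone->lock, flags);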
We can definitely do better. rmqueue_bulk() holds the zone->lock the
entire time, which means nobody else can modify the freelists
underneath us. Once block claiming has failed, there is no point in
trying it again for the next page.
In fact, the recent kernel test bot report [1] appears to be related
to this. It points to c2f6ea38fc1b640aa7a2e155cc1c0410ff91afa2 ("mm:
page_alloc: don't steal single pages from biggest buddy"), a patch
that further forces bottom-up freelist scans if block stealing fails.
Attached is a patch that has __rmqueue() remember which fallback level
it had to stoop to in order to succeed; for the next page, it restarts
the search from there.
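Roughly, the idea looks like this. This is only a simplified sketch
with made-up names to illustrate the mechanism; the attached patch is
the authoritative version:

        /* Cheapest to most expensive ways of getting a page. */
        enum fill_mode {
                FILL_SMALLEST,  /* native freelists */
                FILL_CLAIM,     /* claim a whole fallback block */
                FILL_STEAL,     /* steal a single fallback page */
        };

        /*
         * rmqueue_bulk() keeps @mode across its fill loop and passes
         * it down on every call. Since zone->lock is held the whole
         * time, a level that failed for the previous page cannot
         * suddenly work for the next one, so we skip straight to
         * where we last found pay dirt.
         */
        switch (*mode) {
        case FILL_SMALLEST:
                page = rmqueue_smallest(zone, order, migratetype);
                if (page)
                        return page;
                fallthrough;
        case FILL_CLAIM:
                page = claim_fallback_block(zone, order, migratetype);
                if (page) {
                        /* The claim replenished the native freelists. */
                        *mode = FILL_SMALLEST;
                        return page;
                }
                fallthrough;
        case FILL_STEAL:
                page = steal_fallback_page(zone, order, migratetype);
                if (page) {
                        *mode = FILL_STEAL;
                        return page;
                }
        }
        return NULL;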
I cannot reproduce Carlos' setup, but testing with lru-file-mmap-read
from the kernel test bot shows a stark difference:
             upstream      patched
real         0m8.939s     0m5.546s
user         0m2.617s     0m2.528s
sys         0m52.885s    0m30.183s
Tracepoints confirm that try_to_claim_block() is called about two
orders of magnitude less often than before.
[1] https://lore.kernel.org/all/202503271547.fc08b188-lkp@xxxxxxxxx/
---