Re: [PATCH 3/3] mm: optimize free_area_empty() check using per-migratetype counts

From: Hongru Zhang

Date: Tue Mar 03 2026 - 03:05:45 EST


> On Sat, Nov 29, 2025 at 8:04 AM Barry Song <21cnbao@xxxxxxxxx> wrote:
> >
> > On Fri, Nov 28, 2025 at 11:13 AM Hongru Zhang <zhanghongru06@xxxxxxxxx> wrote:
> > >
> > > From: Hongru Zhang <zhanghongru@xxxxxxxxxx>
> > >
> > > Using per-migratetype counts instead of list_empty() helps save a
> > > few CPU instructions.
> > >
> > > Signed-off-by: Hongru Zhang <zhanghongru@xxxxxxxxxx>
> > > ---
> > > mm/internal.h | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 1561fc2ff5b8..7759f8fdf445 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -954,7 +954,7 @@ int find_suitable_fallback(struct free_area *area, unsigned int order,
> > >
> > > static inline bool free_area_empty(struct free_area *area, int migratetype)
> > > {
> > > - return list_empty(&area->free_list[migratetype]);
> > > + return !READ_ONCE(area->mt_nr_free[migratetype]);
> >
> > I'm not quite sure about this. Since the counter is written and read more
> > frequently, cache coherence traffic may actually be higher than for the list
> > head.
> >
> > I'd prefer to drop this unless there is real data showing it performs better.
>
> If the goal is to optimize free_area list checks and list_add,
> a reasonable approach is to organize the data structure
> to reduce false sharing between different mt and order entries.
>
> struct mt_free_area {
> 	struct list_head free_list;
> 	unsigned long nr_free;
> } ____cacheline_aligned;
>
> struct free_area {
> 	struct mt_free_area mt_free_area[MIGRATE_TYPES];
> };
>
> However, without supporting data, it’s unclear if the space increase
> is justified :-)

I designed a test model to trigger more false sharing and collected data under
it to see which layout performs better.

Test model
- Based on the microbench that was removed from mmtests
  Commit: beeaeb89 ("pagealloc: Remove bit-rotted benchmark")
- Goal: generate concurrent kernel page alloc/free activity across multiple
  orders and migratetypes to observe cacheline sharing and contention in the
  buddy free_area
- Mechanism: a systemtap module exposes a write-only
  /proc/mmtests-pagealloc-micro. Writing a 64-bit encoded value triggers
  repeated page alloc/free in kernel space
  - bits 7:0 -> mt (0=UNMOVABLE, 1=MOVABLE, 2=RECLAIMABLE)
  - bits 15:8 -> order
  - bits 63:16 -> batch
- Workload distribution:
  - order = cpu % 4 (orders 0/1/2/3)
  - mt = cpu % 3 (UNMOVABLE/MOVABLE/RECLAIMABLE)
  - cpu0 and cpu1 are not used for the test
- Sampling:
  - load the stap module
  - determine the encoded value from the CPU id and bind the workload to that CPU
  - after a short delay, run 'perf mem record' for 100s
  - unload the stap module
- Test tool:
  - https://gist.github.com/zhr250/72e56f87ac703e833b11b5341d616cb0
- Data analysis tool:
  - https://gist.github.com/zhr250/f4a385ffa9fae2993d22748f31e18588

CPU topo info of my machine:
Package L#0
  NUMANode L#0 (P#0 15GB)
    L3 L#0 (25MB)
      L2 L#0 (1280KB) + L1d L#0 (48KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#1)
      L2 L#1 (1280KB) + L1d L#1 (48KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#2)
        PU L#3 (P#3)
      L2 L#2 (1280KB) + L1d L#2 (48KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#4)
        PU L#5 (P#5)
      L2 L#3 (1280KB) + L1d L#3 (48KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#6)
        PU L#7 (P#7)
      L2 L#4 (1280KB) + L1d L#4 (48KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#8)
        PU L#9 (P#9)
      L2 L#5 (1280KB) + L1d L#5 (48KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#10)
        PU L#11 (P#11)
      L2 L#6 (1280KB) + L1d L#6 (48KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#12)
        PU L#13 (P#13)
      L2 L#7 (1280KB) + L1d L#7 (48KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#14)
        PU L#15 (P#15)
      L2 L#8 (2048KB)
        L1d L#8 (32KB) + L1i L#8 (64KB) + Core L#8 + PU L#16 (P#16)
        L1d L#9 (32KB) + L1i L#9 (64KB) + Core L#9 + PU L#17 (P#17)
        L1d L#10 (32KB) + L1i L#10 (64KB) + Core L#10 + PU L#18 (P#18)
        L1d L#11 (32KB) + L1i L#11 (64KB) + Core L#11 + PU L#19 (P#19)

Actual (order, mt) distribution on my machine:
order=0, mt=0: cpu12
order=0, mt=1: cpu4, cpu16
order=0, mt=2: cpu8
order=1, mt=0: cpu9
order=1, mt=1: cpu13
order=1, mt=2: cpu5, cpu17
order=2, mt=0: cpu6, cpu18
order=2, mt=1: cpu10
order=2, mt=2: cpu2, cpu14
order=3, mt=0: cpu3, cpu15
order=3, mt=1: cpu7, cpu19
order=3, mt=2: cpu11

Different migratetype/order combinations are placed on CPUs that do not share
L1/L2 caches to maximize cacheline contention. For our test goal, I think this
distribution is relatively reasonable. I ran 10 rounds for each kernel and
found the data to be stable.

Layouts tested (capturing load/store samples in free_area[0..MAX_PAGE_ORDER]):
- vanilla kernel:
	struct free_area {
		struct list_head free_list[MIGRATE_TYPES];
		unsigned long nr_free;
	};
- patched kernel:
	struct free_area {
		struct list_head free_list[MIGRATE_TYPES];
		unsigned long nr_free;
	+	unsigned long mt_nr_free[MIGRATE_TYPES];
	};
- mtlist kernel:
	+struct mt_free_list {
	+	struct list_head list;
	+	unsigned long nr_free;
	+};
	+
	struct free_area {
	-	struct list_head free_list[MIGRATE_TYPES];
	+	struct mt_free_list mt_free_list[MIGRATE_TYPES];
		unsigned long nr_free;
	};

summary:
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
| Kernel | inrange samples | HitM (%) | L1 hit inc LFB/MAB (%) | L2 hit (%) | L3 hit inc HitM (%) | RAM hit (%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
| vanilla | 192,468 | 45,421 (23.60%) | 94,486 (49.09%) | 1,952 (1.01%) | 91,240 (47.41%) | 4,790 (2.49%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
| patched | 227,196 | 27,293 (12.01%) | 165,238 (72.73%) | 1,194 (0.53%) | 54,609 (24.04%) | 6,155 (2.71%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+
| mtlist | 240,694 | 50,911 (21.15%) | 132,827 (55.19%) | 3,165 (1.31%) | 98,556 (40.95%) | 6,146 (2.55%) |
+---------+-----------------+-----------------+------------------------+---------------+---------------------+---------------+

Detailed data:
- https://gist.github.com/zhr250/2ccf8902080ecaf85477d9c051e72a96

For both the L1 hit rate and the HitM rate, the patched kernel is the best of
the three.

In this test model, I also collected memory allocation counts. The patched
kernel delivers the best performance: about 7.00% higher than vanilla and
4.93% higher than mtlist.

Detailed data:
https://gist.github.com/zhr250/4439523b7ca3c18f4a2d2c97b24c4965