Re: [PATCH 0/8] mm/mglru: improve reclaim loop and dirty folio handling
From: Eric Naim
Date: Wed Mar 25 2026 - 05:33:39 EST
On 3/25/26 1:47 PM, Kairui Song wrote:
> On Wed, Mar 25, 2026 at 1:04 PM Eric Naim <dnaim@xxxxxxxxxxx> wrote:
>>
>> Hi Kairui,
>>
>> On 3/18/26 3:08 AM, Kairui Song via B4 Relay wrote:
>>> This series cleans up and slightly improves MGLRU's reclaim loop and
>>> dirty flush logic. As a result, we see an up to ~50% reduction in
>>> file faults and a 30% increase in MongoDB throughput with YCSB, with
>>> no swap involved; other common benchmarks show no regression, LOC is
>>> reduced, and we see fewer unexpected OOMs in our production
>>> environment.
>>>
>
> ...
>
>>
>> I applied this patch set on top of 7.0-rc5 and noticed the system locking up when running the test below.
>>
>> # create a 5 GiB file named "5G"
>> fallocate -l 5G 5G
>> # tail buffers /dev/zero (which has no newlines) indefinitely, consuming memory until OOM-killed
>> while true; do tail /dev/zero; done
>> # re-read the file in a loop, sleeping min_ttl_ms (rounded up to whole seconds) between reads
>> while true; do time cat 5G > /dev/null; sleep $(($(cat /sys/kernel/mm/lru_gen/min_ttl_ms)/1000+1)); done
>>
>> After reading [1], I suspect that this was because the system was using zram as swap, and indeed, if zram is disabled the lockup does not occur.
>
> Hi Eric,
>
> Thanks for the report. I was about to send V2, but seeing your report
> I'll try to reproduce your issue first.
>
> So far I haven't noticed any regression. Is this an issue caused by
> this patch, or is it an existing one? I don't have much context on how
> you are running the test. BTW, the calculation in patch "mm/mglru:
> restructure the reclaim loop" needs a lower bound of
> "max(nr_to_scan, SWAP_CLUSTER_MAX)" for small machines; not sure if
> that's related, but I will add it in V2.
>
As of writing this, I have some new information that makes this a bit more confusing. The kernel that doesn't have the issue was patched with [1] as a means of protecting the working set (similar to lru_gen_min_ttl_ms).
So this time, on an unpatched kernel, the system still freezes but recovers on its own after about 2 seconds. With this patchset applied, the system freezes and does not recover quickly (if at all).
Curiously, I had the user test again, this time with lru_gen_min_ttl_ms = 100. With that set, the system doesn't freeze at all, with or without this patchset.
> And about the test you posted:
> while true; do tail /dev/zero; done
>
> I believe this will just consume all memory with zero pages and then
> get OOM killed; that's exactly what the test is meant to do. I'm not
> sure what you mean by lockup, since you mentioned an OOM kill. Did the
> system actually hang, or did just the desktop die?
The system actually hung; the user needed a hard reset to recover it. (Pure speculation: given a few minutes the system would likely have recovered itself, as this seems to be a common scenario.)
>
> I just ran that with or without ZRAM on two machines and my laptop,
> everything looks good here with this series.
>
>> zram as swap seems to be unsupported by upstream.
>
> That's simply not true, other distros like Fedora even have ZRAM as
> swap by default:
> https://fedoraproject.org/wiki/Changes/SwapOnZRAM
>
> And systemd has widely used ZRAM swap support:
> https://github.com/systemd/zram-generator
>
> Android also uses it, and we are using ZRAM by default in our fleet,
> which runs fine.
>
>> The user that tested this wasn't able to get a good kernel trace; the
>> only thing left was a trace of the OOM killer firing.
>
> No worries, that's fine. Just send me the OOM trace or log; the more
> detailed context I get, the better.
Mar 25 08:24:22 osiris kernel: Call Trace:
Mar 25 08:24:22 osiris kernel: <TASK>
Mar 25 08:24:22 osiris kernel: dump_stack_lvl+0x61/0x80
Mar 25 08:24:22 osiris kernel: dump_header+0x4a/0x160
Mar 25 08:24:22 osiris kernel: oom_kill_process+0x18f/0x1f0
Mar 25 08:24:22 osiris kernel: out_of_memory+0x4ab/0x5c0
Mar 25 08:24:22 osiris kernel: __alloc_pages_slowpath+0x9ac/0x1060
Mar 25 08:24:22 osiris kernel: __alloc_frozen_pages_noprof+0x29a/0x320
Mar 25 08:24:22 osiris kernel: alloc_pages_mpol+0x107/0x1b0
Mar 25 08:24:22 osiris kernel: folio_alloc_noprof+0x85/0xb0
Mar 25 08:24:22 osiris kernel: __filemap_get_folio_mpol+0x1ff/0x4c0
Mar 25 08:24:22 osiris kernel: filemap_fault+0x3e3/0x6e0
Mar 25 08:24:22 osiris kernel: __do_fault+0x46/0x140
Mar 25 08:24:22 osiris kernel: do_pte_missing+0x154/0xea0
Mar 25 08:24:22 osiris kernel: ? __pte_offset_map+0x1d/0xd0
Mar 25 08:24:22 osiris kernel: handle_mm_fault+0x89c/0x1280
Mar 25 08:24:22 osiris kernel: do_user_addr_fault+0x23b/0x720
Mar 25 08:24:22 osiris kernel: exc_page_fault+0x75/0xe0
Mar 25 08:24:22 osiris kernel: asm_exc_page_fault+0x26/0x30
Mar 25 08:24:22 osiris kernel: RIP: 0033:0x7fec4beb43c0
Mar 25 08:24:22 osiris kernel: Code: Unable to access opcode bytes at 0x7fec4beb4396.
Mar 25 08:24:22 osiris kernel: RSP: 002b:00007ffcb348d698 EFLAGS: 00010293
Mar 25 08:24:22 osiris kernel: RAX: 00000000c70f6907 RBX: 00007ffcb348d8d0 RCX: 00007fec4bb1604d
Mar 25 08:24:22 osiris kernel: RDX: c6a4a7935bd1e995 RSI: 4fb7dae88ad99bfb RDI: 000055ee77cc8150
Mar 25 08:24:22 osiris kernel: RBP: 00007ffcb348dd60 R08: 000055ee77cc8158 R09: 000000000000000c
Mar 25 08:24:22 osiris kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000c
Mar 25 08:24:22 osiris kernel: R13: 000055ee77cc8150 R14: 0000000000000064 R15: 431bde82d7b634db
Mar 25 08:24:22 osiris kernel: </TASK>
Above is the call trace that was recovered. Here are some mm-related settings that we set in our kernel, in case they're useful:
vm.compact_unevictable_allowed = 0
vm.compaction_proactiveness = 0
vm.page-cluster = 0
vm.swappiness = 150
vm.vfs_cache_pressure = 50
vm.dirty_bytes = 268435456
vm.dirty_background_bytes = 67108864
vm.dirty_writeback_centisecs = 1500
vm.watermark_boost_factor = 0
/sys/kernel/mm/transparent_hugepage/defrag = defer+madvise
[1] https://github.com/firelzrd/le9uo/
--
Regards,
Eric