Re: [REGRESSION] Null pointer dereference while shrinking zswap
From: Chengming Zhou
Date: Tue Apr 16 2024 - 23:45:31 EST
On 2024/4/17 08:22, Nhat Pham wrote:
> On Tue, Apr 16, 2024 at 4:29 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>>
>> On Tue, Apr 16, 2024 at 3:14 PM Nhat Pham <nphamcs@xxxxxxxxx> wrote:
>>>
>>> On Tue, Apr 16, 2024 at 5:19 AM Christian Heusel <christian@xxxxxxxxx> wrote:
>>>>
>>>> Hello everyone,
>>>
>>> Thanks for the report, Christian! Looking at it now.
>>>
>>>>
>>>> while rebuilding a few packages in Arch Linux we have recently come
>>>> across a regression in the linux kernel which was made visible by a test
>>>> failure in libguestfs[0], where the booted kernel showed a Call Trace
>>>> like the following one:
>>>>
>>>> [ 218.738568] CPU: 0 PID: 167 Comm: guestfsd Not tainted 6.7.0-rc4-1-mainline-00158-gb5ba474f3f51 #1 bf39861cf50acae7a79c534e25532f28afe4e593^M
>>>
>>> Is this one of the kernel versions that was broken? That looks a bit
>>> odd, as zswap shrinker landed on 6.8...
>>
>> Ah ignore this - I understand the versioning now...
>>
>>>
>>>> [ 218.739007] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014^M
>>>> [ 218.739787] RIP: 0010:memcg_page_state+0x9/0x30^M
>>>> [ 218.740299] Code: 0d b8 ff ff 66 66 2e 0f 1f 84 00 00 00 00 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48^M
>>>> [ 218.740727] RSP: 0018:ffffb5fa808dfc10 EFLAGS: 00000202^M
>>>> [ 218.740862] RAX: 0000000000000000 RBX: ffffb5fa808dfce0 RCX: 0000000000000002^M
>>>> [ 218.741016] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000^M
>>>> [ 218.741168] RBP: 0000000000000000 R08: ffff976681ff8000 R09: 0000000000000000^M
>>>> [ 218.741322] R10: 0000000000000001 R11: ffff9766833f9d00 R12: ffff9766ffffe780^M
>>>> [ 218.742167] R13: 0000000000000000 R14: ffff976680cc1800 R15: ffff976682204d80^M
>>>> [ 218.742376] FS: 00007f1479d9f540(0000) GS:ffff9766fbc00000(0000) knlGS:0000000000000000^M
>>>> [ 218.742569] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033^M
>>>> [ 218.743256] CR2: 0000000000000600 CR3: 0000000103606000 CR4: 0000000000750ef0^M
>>>> [ 218.743494] PKRU: 55555554^M
>>>> [ 218.743593] Call Trace:^M
>>>> [ 218.743733] <TASK>^M
>>>> [ 218.743847] ? __die+0x23/0x70^M
>>>> [ 218.743957] ? page_fault_oops+0x171/0x4e0^M
>>>> [ 218.744056] ? free_unref_page+0xf6/0x180^M
>>>> [ 218.744458] ? exc_page_fault+0x7f/0x180^M
>>>> [ 218.744551] ? asm_exc_page_fault+0x26/0x30^M
>>>> [ 218.744684] ? memcg_page_state+0x9/0x30^M
>>>> [ 218.744779] zswap_shrinker_count+0x9d/0x110^M
>>>> [ 218.744896] do_shrink_slab+0x3a/0x360^M
>>>> [ 218.744990] shrink_slab+0xc7/0x3c0^M
>>>> [ 218.745609] drop_slab+0x85/0x140^M
>>>> [ 218.745691] drop_caches_sysctl_handler+0x7e/0xd0^M
>>>> [ 218.745799] proc_sys_call_handler+0x1c0/0x2e0^M
>>>> [ 218.745912] vfs_write+0x23d/0x400^M
>>>> [ 218.745998] ksys_write+0x6f/0xf0^M
>>>> [ 218.746080] do_syscall_64+0x64/0xe0^M
>>>> [ 218.746169] ? exit_to_user_mode_prepare+0x132/0x1f0^M
>>>> [ 218.746873] entry_SYSCALL_64_after_hwframe+0x6e/0x76^M
>>>>
>
> Actually, inspecting the code a bit more - can memcg be null here?
>
> Specifically, if mem_cgroup_disabled() is true, can we see null memcg
> here? Looks like in this case, mem_cgroup_iter() can return null, and
> the first iteration of drop_slab_node() can have null memcg if it's
> returned by mem_cgroup_iter().
>
> shrink_slab() will still proceed and call do_shrink_slab() if the
> memcg is null - provided that mem_cgroup_disabled() holds:
>
> if (!mem_cgroup_disabled() && !mem_cgroup_is_root(memcg))
> return shrink_slab_memcg(gfp_mask, nid, memcg, priority);
>
Ah, I think your analysis is very right, here memcg is NULL but memcg_page_state
won't check.
> Inside zswap_shrink_count(), all the memcg accessors in this area seem
> to always check for null memcg (mem_cgroup_lruvec,
> mem_cgroup_zswap_writeback_enabled), *except* memcg_page_state, which
> is the one line that fail.
>
> If this is all to it, we can simply add a null check or
> mem_cgroup_disabled() check here, and use pool stats instead?
Both look ok to me. The VM could only set sc->memcg to NULL when memcg
disabled, right?
Thanks.