Re: [PATCH v2] mm: memcg: Don't periodically flush stats when memcg is disabled

From: T.J. Mercier
Date: Thu Feb 01 2024 - 16:02:30 EST


On Thu, Feb 1, 2024 at 6:26 AM Michal Koutný <mkoutny@xxxxxxxx> wrote:
>
> On Fri, Jan 26, 2024 at 09:19:25PM +0000, "T.J. Mercier" <tjmercier@xxxxxxxxxx> wrote:
> > The root memcg is onlined even when memcg is disabled. When it's onlined
> > a 2 second periodic stat flush is started, but no stat flushing is
> > required when memcg is disabled because there can be no child memcgs.
> > Most calls to flush memcg stats are avoided when memcg is disabled as a
> > result of the mem_cgroup_disabled check added in 7d7ef0a4686a
> > ("mm: memcg: restore subtree stats flushing"), but the periodic flushing
> > started in mem_cgroup_css_online is not. Skip it.
>
> Have you tried
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -6099,6 +6099,9 @@ int __init cgroup_init(void)
> cgroup_unlock();
>
> for_each_subsys(ss, ssid) {
> + if (!cgroup_ssid_enabled(ssid))
> + continue;
> +
> if (ss->early_init) {
> struct cgroup_subsys_state *css =
> init_css_set.subsys[ss->id];
> @@ -6118,9 +6121,6 @@ int __init cgroup_init(void)
> * disabled flag and cftype registration needs kmalloc,
> * both of which aren't available during early_init.
> */
> - if (!cgroup_ssid_enabled(ssid))
> - continue;
> -
> if (cgroup1_ssid_disabled(ssid))
> pr_info("Disabling %s control group subsystem in v1 mounts\n",
> ss->legacy_name);
> ?
> I'm asking about a try because I'm not sure whether this does not blow
> up due to something missing. But I think disabled controllers would not
> need to be (root-)onlined at all.
>
> Thanks,
> Michal

Hi Michal,

It does blow up, but not how I was expecting. There's a null pointer
dereference inside find_css_set when trying to get a css pointer for
the memory controller, I think because the allocation in
cgroup_init_subsys is skipped:

[ 9.591766] BUG: kernel NULL pointer dereference, address: 0000000000000000
[ 9.591909] #PF: supervisor read access in kernel mode
[ 9.592023] #PF: error_code(0x0000) - not-present page
[ 9.592138] PGD 0 P4D 0
[ 9.592199] Oops: 0000 [#1] PREEMPT SMP PTI
[ 9.592289] CPU: 1 PID: 1 Comm: init Tainted: G E
6.7.0-mainline-maybe-dirty #1
[ 9.592490] Hardware name: emulation qemu-x86/qemu-x86, BIOS
2023.07-rc6-gb8315a3184b2-ab11347395 07/01/2023
[ 9.592731] RIP: 0010:find_css_set+0x5c3/0x7a0
[ 9.592850] Code: 20 69 cf b2 4e 8b 3c 33 48 c7 c7 1d 3b 41 b2 4c
89 fe e8 10 b0 f7 00 49 8b b4 24 a0 00 00 00 4e 8d 24 2b 49 81 c4 b0
fc ff ff <49> 8b 0f 4c 01 e9 48 c7 c7 c4 c0 46 b2 4c 89 e2 e8 e8 af f7
00 49
[ 9.593251] RSP: 0018:ffffb6218000bb90 EFLAGS: 00010087
[ 9.593370] RAX: 0000000000000021 RBX: ffff99181044a200 RCX: 4f5e789f089a0c00
[ 9.593554] RDX: ffffb6218000ba50 RSI: ffffffffb2448284 RDI: ffffffffb2c91950
[ 9.593735] RBP: ffffb6218000bc28 R08: 0000000000000fff R09: ffffffffb2c79950
[ 9.593909] R10: 0000000000002ffd R11: 0000000000000004 R12: ffff99181044a2d8
[ 9.594102] R13: 0000000000000428 R14: 0000000000000020 R15: 0000000000000000
[ 9.594291] FS: 00007f3d2f986fc8(0000) GS:ffff99182bd00000(0000)
knlGS:0000000000000000
[ 9.594467] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9.594610] CR2: 0000000000000000 CR3: 00000001001d8001 CR4: 0000000000370eb0
[ 9.594818] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9.595007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 9.595178] Call Trace:
[ 9.595237] <TASK>
[ 9.595297] ? __die_body+0x5e/0xb0
[ 9.595392] ? __die+0x9e/0xb0
[ 9.595481] ? page_fault_oops+0x35f/0x3d0
[ 9.595574] ? do_user_addr_fault+0x6ab/0x780
[ 9.595690] ? prb_read_valid+0x28/0x60
[ 9.595782] ? exc_page_fault+0x83/0x220
[ 9.595874] ? asm_exc_page_fault+0x2b/0x30
[ 9.595966] ? find_css_set+0x5c3/0x7a0
[ 9.596059] cgroup_migrate_prepare_dst+0x75/0x2b0
[ 9.596194] cgroup_attach_task+0x293/0x450
[ 9.596305] ? cgroup_attach_task+0xb6/0x450
[ 9.596449] __cgroup_procs_write+0xef/0x1a0
[ 9.596589] cgroup_procs_write+0x16/0x30
[ 9.596733] cgroup_file_write+0x9d/0x260
[ 9.596840] kernfs_fop_write_iter+0x145/0x1e0
[ 9.596981] vfs_write+0x276/0x2e0
[ 9.597092] ksys_write+0x73/0xe0
[ 9.597198] __x64_sys_write+0x1a/0x30
[ 9.597303] do_syscall_64+0x5a/0x100
[ 9.597430] entry_SYSCALL_64_after_hwframe+0x6e/0x76


But there are also calls to mem_cgroup_css_from_folio that I think
would cause a different null pointer deref even if we had the css but
no root_mem_cgroup. There seems to be an assumption that we'll have a
memcg to charge against even when the controller is disabled. To me it
looks like that's to simplify the possible combinations of
CONFIG_MEMCG and memcg being boot-time disabled or not.

Best,
T.J.