Re: cgroup pointed by sock is leaked on mode switch

From: Yang Yingliang
Date: Fri May 08 2020 - 22:31:13 EST



On 2020/5/6 15:51, Zefan Li wrote:
On 2020/5/6 10:16, Zefan Li wrote:
On 2020/5/6 9:50, Yang Yingliang wrote:
+cc lizefan@xxxxxxxxxx

On 2020/5/6 0:06, Tejun Heo wrote:
Hello, Yang.

On Sat, May 02, 2020 at 06:27:21PM +0800, Yang Yingliang wrote:
I found that nr_dying_descendants keeps increasing:
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 80
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/lxc/cgroup.stat:nr_dying_descendants 79
/sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/cgroup.stat:nr_dying_descendants 78
/sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/system.slice/cgroup.stat:nr_dying_descendants 78
Those numbers are nowhere close to causing OOM issues. There are some
aspects of page and other cache draining which are being improved, but unless
you're seeing numbers multiple orders of magnitude higher, this isn't the
source of your problem.

The situation is the same as the one described in commit bd1060a1d671
("sock, cgroup: add sock->sk_cgroup"):
"On mode switch, cgroup references which are already being pointed to by
socks may be leaked."
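
For reference, the dual-mode field that commit added is sock_cgroup_data
(include/linux/cgroup-defs.h). A rough sketch of the little-endian layout
(the big-endian layout mirrors it):

struct sock_cgroup_data {
        union {
                struct {
                        u8      is_data;        /* 1: prioidx/classid below are valid */
                        u8      padding;
                        u16     prioidx;        /* net_prio index */
                        u32     classid;        /* net_cls classid */
                } __packed;
                u64     val;    /* in v2 mode: a struct cgroup * (low bit 0,
                                 * since pointers are at least 2-byte aligned) */
        };
};

In v2 mode the stored pointer pins the cgroup; once net_prio or net_cls
writes real data into the field, is_data becomes non-zero and the pointer
is gone.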
I'm doubtful that you're hitting that issue. Mode switching means memcg
being switched between cgroup1 and cgroup2 hierarchies, which is unlikely to
be what's happening when you're launching docker containers.

The first step would be identifying where memory is going and finding out
whether memcg is actually being switched between cgroup1 and 2 - look at the
hierarchy number in /proc/cgroups; if that's switching between 0 and
something non-zero, it is switching.


I think there's a bug here which can lead to an unlimited memory leak.
This should reproduce the bug:

    # mount -t cgroup -o netprio xxx /cgroup/netprio
    # mkdir /cgroup/netprio/xxx
    # echo PID > /cgroup/netprio/xxx/tasks
    /* the process with this PID does some network activity and then exits */
    # rmdir /cgroup/netprio/xxx
    /* now this cgroup will never be freed */


Correction (still not tested):

    # mount -t cgroup2 none /cgroup/v2
    # mkdir /cgroup/v2/xxx
    # echo PID > /cgroup/v2/xxx/cgroup.procs
    /* the process with this PID starts doing some network activity */

    # mount -t cgroup -o netprio xxx /cgroup/netprio
    # mkdir /cgroup/netprio/xxx
    # echo PID > /cgroup/netprio/xxx/tasks
    ...
    /* the PID process exits */

    # rmdir /cgroup/netprio/xxx
    # rmdir /cgroup/v2/xxx
    /* now it looks like this v2 cgroup will never be freed */
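
What pins the v2 cgroup is cgroup_sk_alloc(), called when the socket is
created while the process is still in /cgroup/v2/xxx. A simplified sketch
(kernel/cgroup/cgroup.c; the socket-clone path is omitted):

void cgroup_sk_alloc(struct sock_cgroup_data *skcd)
{
        /* once a v1 net controller is in use, stop taking v2 refs */
        if (cgroup_sk_alloc_disabled)
                return;

        rcu_read_lock();
        while (true) {
                struct cgroup *cgrp = task_dfl_cgroup(current);

                /* take a reference on the creator's v2 cgroup and
                 * store the bare pointer; is_data stays 0 */
                if (likely(cgroup_tryget(cgrp))) {
                        skcd->val = (unsigned long)cgrp;
                        break;
                }
                cpu_relax();
        }
        rcu_read_unlock();
}

cgroup_sk_free() is expected to put that reference when the socket is
destroyed.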

Look at the code:

static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
{
        ...
        sock_cgroup_set_prioidx(skcd, task_netprioidx(current));
}

static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
                                           u16 prioidx)
{
        ...
        if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
                return;
        ...
        /* the elided code clears skcd_buf.val and sets is_data, which
         * discards the v2 cgroup pointer without putting its reference */
        skcd_buf.prioidx = prioidx;
        WRITE_ONCE(skcd->val, skcd_buf.val);
}

task_netprioidx() returns the cgrp id of xxx, which is not 1, but
sock_cgroup_prioidx(&skcd_buf) returns 1 because the skcd still carries a v2
cgroup pointer (v2 mode). The early return is therefore not taken, the
pointer is overwritten, and the reference on the v2 cgroup is leaked.
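
Concretely, once is_data is set the original pointer can no longer be
recovered, so the reference taken at socket creation is never put. Roughly
(include/linux/cgroup-defs.h and kernel/cgroup/cgroup.c):

static inline struct cgroup *sock_cgroup_ptr(struct sock_cgroup_data *skcd)
{
        unsigned long v = READ_ONCE(skcd->val);

        /* after sock_cgroup_set_prioidx() the low bit is 1, so this
         * returns the default root instead of the v2 cgroup the
         * socket originally pinned */
        if (v & 1)
                return &cgrp_dfl_root.cgrp;

        return (struct cgroup *)v ?: &cgrp_dfl_root.cgrp;
}

void cgroup_sk_free(struct sock_cgroup_data *skcd)
{
        /* the put goes to the wrong cgroup, so xxx's refcount never
         * drops to zero and it stays a dying descendant forever */
        cgroup_put(sock_cgroup_ptr(skcd));
}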

I think the easiest fix is to do the mode switch here:

diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b905747..2397866 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -240,6 +240,8 @@ static void net_prio_attach(struct cgroup_taskset *tset)
         struct task_struct *p;
         struct cgroup_subsys_state *css;
 
+        cgroup_sk_alloc_disable();
+
         cgroup_taskset_for_each(p, css, tset) {
                 void *v = (void *)(unsigned long)css->cgroup->id;
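
cgroup_sk_alloc_disable() is the same knob the write_priomap()/write_classid()
paths already flip: once set, new sockets stop taking v2 cgroup references at
all, so there is no pointer left to clobber. Sockets created before the first
attach can still leak their reference, which is the bounded mode-switch leak
the original commit message accepted; the point of the patch is turning an
unbounded leak back into that bounded one. Roughly (kernel/cgroup/cgroup.c):

bool cgroup_sk_alloc_disabled __read_mostly;

void cgroup_sk_alloc_disable(void)
{
        if (cgroup_sk_alloc_disabled)
                return;
        pr_info("cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation\n");
        cgroup_sk_alloc_disabled = true;
}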

I've done some tests with this change; here are the results:


Without this change, nr_dying_descendants keeps increasing and the cgroup is leaked:

linux-dVpNUK:~ # mount | grep cgroup
tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup2 on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime,nsdelegate)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
linux-dVpNUK:~ # mkdir /sys/fs/cgroup/net_cls,net_prio/test
linux-dVpNUK:~ # ps -ef | grep bash
root     12151 12150  0 00:31 pts/0    00:00:00 -bash
root     12322 12321  0 00:31 pts/1    00:00:00 -bash
root     12704 12703  0 00:31 pts/2    00:00:00 -bash
root     13359 12704  0 00:31 pts/2    00:00:00 grep --color=auto bash
linux-dVpNUK:~ # echo 12151 > /sys/fs/cgroup/net_cls,net_prio/test/tasks
linux-dVpNUK:~ # echo 12322 > /sys/fs/cgroup/net_cls,net_prio/test/tasks

(Use the bash processes (12151/12322) to do some network activity, then kill them; nr_dying_descendants increases.)
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1
linux-dVpNUK:~ # kill 12151 12322
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 3
/sys/fs/cgroup/unified/user.slice/cgroup.stat:nr_dying_descendants 2
/sys/fs/cgroup/unified/user.slice/user-0.slice/cgroup.stat:nr_dying_descendants 2
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1

(After rmdir'ing test, nr_dying_descendants does not decrease.)
linux-dVpNUK:~ # rmdir /sys/fs/cgroup/net_cls,net_prio/test
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 3
/sys/fs/cgroup/unified/user.slice/cgroup.stat:nr_dying_descendants 2
/sys/fs/cgroup/unified/user.slice/user-0.slice/cgroup.stat:nr_dying_descendants 2
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1


With this change, nr_dying_descendants does not increase:

linux-dVpNUK:~ # mkdir /sys/fs/cgroup/net_cls,net_prio/test
linux-dVpNUK:~ # ps -ef | grep bash
root      5466  5443  0 00:50 pts/1    00:00:00 -bash
root      5724  5723  0 00:50 pts/2    00:00:00 -bash
root      6701 17013  0 00:50 pts/0    00:00:00 grep --color=auto bash
root     17013 17012  0 00:46 pts/0    00:00:00 -bash
linux-dVpNUK:~ # echo 5466 > /sys/fs/cgroup/net_cls,net_prio/test/tasks
linux-dVpNUK:~ # echo 5724 > /sys/fs/cgroup/net_cls,net_prio/test/tasks
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1
linux-dVpNUK:~ # kill 5466 5724
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1
linux-dVpNUK:~ # rmdir /sys/fs/cgroup/net_cls,net_prio/test/
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1
