On 2020/5/6 9:50, Yang Yingliang wrote:
+cc lizefan@xxxxxxxxxx
On 2020/5/6 0:06, Tejun Heo wrote:
Hello, Yang.
On Sat, May 02, 2020 at 06:27:21PM +0800, Yang Yingliang wrote:
I find the number nr_dying_descendants is increasing:
linux-dVpNUK:~ # find /sys/fs/cgroup/ -name cgroup.stat -exec grep '^nr_dying_descendants [^0]' {} +
/sys/fs/cgroup/unified/cgroup.stat:nr_dying_descendants 80
/sys/fs/cgroup/unified/system.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/system.slice/system-hostos.slice/cgroup.stat:nr_dying_descendants 1
/sys/fs/cgroup/unified/lxc/cgroup.stat:nr_dying_descendants 79
/sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/cgroup.stat:nr_dying_descendants 78
/sys/fs/cgroup/unified/lxc/5f1fdb8c54fa40c3e599613dab6e4815058b76ebada8a27bc1fe80c0d4801764/system.slice/cgroup.stat:nr_dying_descendants 78

Those numbers are nowhere close to causing oom issues. There are some
aspects of page and other cache draining which are being improved but unless
you're seeing numbers multiple orders of magnitude higher, this isn't the
source of your problem.

The situation is the same as what commit bd1060a1d671 ("sock, cgroup: add
sock->sk_cgroup") describes:
"On mode switch, cgroup references which are already being pointed to by
socks may be leaked."

I'm doubtful that you're hitting that issue. Mode switching means memcg
being switched between cgroup1 and cgroup2 hierarchies, which is unlikely to
be what's happening when you're launching docker containers.
The first step would be identifying where memory is going and finding out
whether memcg is actually being switched between cgroup1 and 2 - look at the
hierarchy number in /proc/cgroups, if that's switching between 0 and
something not zero, it is switching.
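
The /proc/cgroups check described above can be sketched as a small shell helper. This is only an illustration, not part of the original mail: the function name hierarchy_id and its optional file argument are hypothetical, and it assumes the usual /proc/cgroups column order (subsys_name, hierarchy, num_cgroups, enabled).

```shell
# Print the hierarchy number of one controller from a /proc/cgroups-style table.
# Assumed column order: subsys_name hierarchy num_cgroups enabled.
# Usage: hierarchy_id CONTROLLER [FILE]   (FILE defaults to /proc/cgroups)
# hierarchy_id is a hypothetical helper name, not an existing tool.
hierarchy_id() {
    # The header line starts with "#subsys_name", so it never matches a
    # plain controller name and is skipped automatically.
    awk -v name="$1" '$1 == name { print $2 }' "${2:-/proc/cgroups}"
}
```

Running something like `hierarchy_id memory` repeatedly while containers are started and stopped should show whether the value moves between 0 and a nonzero number; if it stays constant, the controller is not being switched.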
I think there's a bug here which can lead to unlimited memory leak.
This should reproduce the bug:
  # mount -t cgroup -o netprio xxx /cgroup/netprio
  # mkdir /cgroup/netprio/xxx
  # echo PID > /cgroup/netprio/xxx/tasks
  /* this PID process starts to do some network thing and then exits */
  # rmdir /cgroup/netprio/xxx
  /* now this cgroup will never be freed */
Look at the code:
static inline void sock_update_netprioidx(struct sock_cgroup_data *skcd)
{
    ...
    sock_cgroup_set_prioidx(skcd, task_netprioidx(current));
}
static inline void sock_cgroup_set_prioidx(struct sock_cgroup_data *skcd,
                                           u16 prioidx)
{
    ...
    if (sock_cgroup_prioidx(&skcd_buf) == prioidx)
        return;
    ...
    skcd_buf.prioidx = prioidx;
    WRITE_ONCE(skcd->val, skcd_buf.val);
}
task_netprioidx() will be the cgrp id of xxx which is not 1, but
sock_cgroup_prioidx(&skcd_buf) is 1 because it thought it's in v2 mode.
Now we have a memory leak.
I think the easiest fix is to do the mode switch here:
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index b905747..2397866 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -240,6 +240,8 @@ static void net_prio_attach(struct cgroup_taskset *tset)
        struct task_struct *p;
        struct cgroup_subsys_state *css;
+       cgroup_sk_alloc_disable();
+
        cgroup_taskset_for_each(p, css, tset) {
                void *v = (void *)(unsigned long)css->cgroup->id;