RE: [PATCHSET v4] netfilter, cgroup: implement cgroup2 path match in xt_cgroup
From: Dexuan Cui
Date: Mon Dec 14 2015 - 08:27:02 EST
> -----Original Message-----
> From: netdev-owner@xxxxxxxxxxxxxxx [mailto:netdev-owner@xxxxxxxxxxxxxxx]
> On Behalf Of Tejun Heo
> Sent: Tuesday, December 8, 2015 6:39
> To: davem@xxxxxxxxxxxxx; pablo@xxxxxxxxxxxxx; kaber@xxxxxxxxx;
> kadlec@xxxxxxxxxxxxxxxxx; daniel@xxxxxxxxxxxxx; daniel.wagner@xxxxxxxxxxxx;
> Cc: lizefan@xxxxxxxxxx; hannes@xxxxxxxxxxx; netdev@xxxxxxxxxxxxxxx;
> netfilter-devel@xxxxxxxxxxxxxxx; coreteam@xxxxxxxxxxxxx;
> cgroups@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; kernel-team@xxxxxx;
> Subject: [PATCHSET v4] netfilter, cgroup: implement cgroup2 path match in
> This is v4 of the xt_cgroup2 patchset. Changes from v3 are
> * Refreshed on top of net-next/master e72c932d3f8a ("cxgb3: Convert
> simple_strtoul to kstrtox"). Dropped "cgroups: Allow dynamically
> changing net_classid" from series as it's already applied.
> The changes from v2 to v3 are
> * Folded cgroup2 path matching into xt_cgroup as a new revision rather
> than a separate xt_cgroup2 match as suggested by Pablo.
> * Refreshed on top of Nina's net_cls dynamic config update fix patch.
> I included the fix patch as part of this series to ease reviewing.
> The changes from v1 to v2 are
> * Instead of adding sock->sk_cgroup separately, sock->sk_cgrp_data now
> carries either (prioidx, classid) pair or cgroup2 pointer. This
> avoids inflating struct sock with yet another cgroup related field.
> Unfortunately, this does add some complexity but that's the
> trade-off and the complexity is contained in cgroup proper.
> * Various small updats as per David and Jan's reviews.
> In cgroup v1, dealing with cgroup membership was difficult because the
> number of membership associations was unbound. As a result, cgroup v1
> grew several controllers whose primary purpose is either tagging
> membership or pull in configuration knobs from other subsystems so
> that cgroup membership test can be avoided.
> net_cls and net_prio controllers are examples of the latter. They
> allow configuring network-specific attributes from cgroup side so that
> network subsystem can avoid testing cgroup membership; unfortunately,
> these are not only cumbersome but also problematic.
> Both net_cls and net_prio aren't properly hierarchical. Both inherit
> configuration from the parent on creation but there's no interaction
> afterwards. An ancestor doesn't restrict the behavior in its subtree
> in anyway and configuration changes aren't propagated downwards.
> Especially when combined with cgroup delegation, this is problematic
> because delegatees can mess up whatever network configuration
> implemented at the system level. net_prio would allow the delegatees
> to set whatever priority value regardless of CAP_NET_ADMIN and net_cls
> the same for classid.
> While it is possible to solve these issues from controller side by
> implementing hierarchical allowable ranges in both controllers, it
> would involve quite a bit of complexity in the controllers and further
> obfuscate network configuration as it becomes even more difficult to
> tell what's actually being configured looking from the network side.
> While not much can be done for v1 at this point, as membership
> handling is sane on cgroup v2, it'd be better to make cgroup matching
> behave like other network matches and classifiers than introducing
> further complications.
> This patchset includes the following eight patches.
> 0001-0003 are prepatory patches in kernfs and cgroup. These patches
> are available in the following branch which will stay stable.
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git for-4.5-ancestor-test
> 0004-0006 consolidate two cgroup related fields in struct sock into
> cgroup_sock_data and update it so that it can alternatively carry a
> cgroup pointer.
> 0007-0008 implement cgroup2 patch matching in xt_cgroup.
> This patchset is on top of net-next/master and also available in the
> following git branch.
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-xt_cgroup2
> I'll post iptables extension as a reply. diffstat follows. Thanks.
> fs/kernfs/dir.c | 46 +++++++++++
> include/linux/cgroup-defs.h | 126 +++++++++++++++++++++++++++++++
> include/linux/cgroup.h | 66 +++++++++++++++-
> include/linux/kernfs.h | 12 ++
> include/net/cls_cgroup.h | 11 +-
> include/net/netprio_cgroup.h | 16 +++
> include/net/sock.h | 13 ---
> include/uapi/linux/netfilter/xt_cgroup.h | 15 +++
> kernel/cgroup.c | 126 ++++++++++++++++++++++++-------
> net/Kconfig | 6 +
> net/core/dev.c | 3
> net/core/netclassid_cgroup.c | 11 +-
> net/core/netprio_cgroup.c | 19 ++++
> net/core/scm.c | 4
> net/core/sock.c | 17 ----
> net/netfilter/nft_meta.c | 2
> net/netfilter/xt_cgroup.c | 108 ++++++++++++++++++++++----
> 17 files changed, 513 insertions(+), 88 deletions(-)
With today's linux-next (next-20151214), I still got the same back trace, which
was previously reported at http://lists.openwall.net/netdev/2015/11/23/80:
[ 15.129701] BUG: spinlock bad magic on CPU#6, (systemd)/1012
[ 15.129701] lock: cgroup_sk_update_lock+0x0/0x40, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[ 15.129701] CPU: 6 PID: 1012 Comm: (systemd) Not tainted 4.4.0-rc4-next-20151214+ #3
[ 15.129701] Hardware name: Microsoft Corporation Virtual Machine/Virtual Machine, BIOS 090006 05/23/2012
[ 15.129701] ffffffffae6cddc0 ffff8800e158bab0 ffffffffad317212 0000000000000000
[ 15.129701] ffff8800e158bad0 ffffffffad0a1b8c ffffffffae6cddc0 ffffffffad800ee6
[ 15.129701] ffff8800e158baf0 ffffffffad0a1c06 ffffffffae6cddc0 ffff8800ead9f080
[ 15.129701] Call Trace:
[ 15.129701] [<ffffffffad317212>] dump_stack+0x44/0x62
[ 15.129701] [<ffffffffad0a1b8c>] spin_dump+0x7c/0xd0
[ 15.129701] [<ffffffffad0a1c06>] spin_bug+0x26/0x30
[ 15.129701] [<ffffffffad0a1d55>] do_raw_spin_lock+0xe5/0x120
[ 15.129701] [<ffffffffad5968b9>] _raw_spin_lock+0x39/0x40
[ 15.129701] [<ffffffffad4c27b3>] ? update_classid_sock+0x33/0x80
[ 15.129701] [<ffffffffad4c27b3>] update_classid_sock+0x33/0x80
[ 15.129701] [<ffffffffad4c2780>] ? write_classid+0x30/0x30
[ 15.129701] [<ffffffffad1e2c2a>] iterate_fd+0x5a/0x90
[ 15.129701] [<ffffffffad4c2717>] update_classid+0x47/0x80
[ 15.129701] [<ffffffffad4c2825>] cgrp_attach+0x25/0x30
[ 15.129701] [<ffffffffad0de43b>] cgroup_taskset_migrate+0x14b/0x280
[ 15.129701] [<ffffffffad0de62f>] cgroup_migrate+0xbf/0x100
[ 15.129701] [<ffffffffad0de575>] ? cgroup_migrate+0x5/0x100
[ 15.129701] [<ffffffffad0de725>] cgroup_attach_task+0xb5/0x100
[ 15.129701] [<ffffffffad0de675>] ? cgroup_attach_task+0x5/0x100
[ 15.129701] [<ffffffffad0dea0a>] __cgroup_procs_write+0x1da/0x310
[ 15.129701] [<ffffffffad0de88e>] ? __cgroup_procs_write+0x5e/0x310
[ 15.129701] [<ffffffffad0deb74>] cgroup_procs_write+0x14/0x20
[ 15.129701] [<ffffffffad0db3d0>] cgroup_file_write+0x40/0x130
[ 15.129701] [<ffffffffad242fa0>] kernfs_fop_write+0x130/0x180
[ 15.129701] [<ffffffffad1c4618>] __vfs_write+0x28/0xe0
[ 15.129701] [<ffffffffad09bb3c>] ? percpu_down_read+0x3c/0x90
[ 15.129701] [<ffffffffad1c7ddc>] ? __sb_start_write+0xdc/0xf0
[ 15.129701] [<ffffffffad1c7ddc>] ? __sb_start_write+0xdc/0xf0
[ 15.129701] [<ffffffffad1c4eb9>] vfs_write+0xa9/0x190
[ 15.129701] [<ffffffffad1c5e19>] SyS_write+0x49/0xa0
[ 15.129701] [<ffffffffad5972f6>] entry_SYSCALL_64_fastpath+0x16/0x7a
My kernel config is attached FYI.