Re: [PATCH] capabilities: add capability cgroup controller

From: Petr Mladek
Date: Fri Jul 08 2016 - 05:13:49 EST


On Thu 2016-07-07 20:27:13, Topi Miettinen wrote:
> On 07/07/16 09:16, Petr Mladek wrote:
> > On Sun 2016-07-03 15:08:07, Topi Miettinen wrote:
> >> The attached patch would make any uses of capabilities generate audit
> >> messages. It works for simple tests as you can see from the commit
> >> message, but unfortunately the call to audit_cgroup_list() deadlocks the
> >> system when booting a full blown OS. There's no deadlock when the call
> >> is removed.
> >>
> >> I guess that in some cases, cgroup_mutex and/or css_set_lock could be
> >> already held earlier before entering audit_cgroup_list(). Holding the
> >> locks is however required by task_cgroup_from_root(). Is there any way
> >> to avoid this? For example, only print some kind of cgroup ID numbers
> >> (are there unique and stable IDs, available without locks?) for those
> >> cgroups where the task is registered in the audit message?
> >
> > I am not sure if anyone knows what really happens here. I suggest
> > enabling lockdep. It might detect a possible deadlock even before it
> > really happens, see Documentation/locking/lockdep-design.txt
> >
> > It can be enabled by
> >
> > CONFIG_PROVE_LOCKING=y
> >
> > It depends on
> >
> > CONFIG_DEBUG_KERNEL=y
> >
> > and maybe some more options, see lib/Kconfig.debug
>
> Thanks a lot! I caught this stack dump:
>
> starting version 230
> [ 3.416647] ------------[ cut here ]------------
> [ 3.417310] WARNING: CPU: 0 PID: 95 at
> /home/topi/d/linux.git/kernel/locking/lockdep.c:2871
> lockdep_trace_alloc+0xb4/0xc0
> [ 3.417605] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
> [ 3.417923] Modules linked in:
> [ 3.418288] CPU: 0 PID: 95 Comm: systemd-udevd Not tainted 4.7.0-rc5+ #97
> [ 3.418444] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
> BIOS Debian-1.8.2-1 04/01/2014
> [ 3.418726] 0000000000000086 000000007970f3b0 ffff88000016fb00
> ffffffff813c9c45
> [ 3.418993] ffff88000016fb50 0000000000000000 ffff88000016fb40
> ffffffff81091e9b
> [ 3.419176] 00000b3705e2c798 0000000000000046 0000000000000410
> 00000000ffffffff
> [ 3.419374] Call Trace:
> [ 3.419511] [<ffffffff813c9c45>] dump_stack+0x67/0x92
> [ 3.419644] [<ffffffff81091e9b>] __warn+0xcb/0xf0
> [ 3.419745] [<ffffffff81091f1f>] warn_slowpath_fmt+0x5f/0x80
> [ 3.419868] [<ffffffff810e9a84>] lockdep_trace_alloc+0xb4/0xc0
> [ 3.419988] [<ffffffff8120dc42>] kmem_cache_alloc_node+0x42/0x600
> [ 3.420156] [<ffffffff8110432d>] ? debug_lockdep_rcu_enabled+0x1d/0x20
> [ 3.420170] [<ffffffff8163183b>] __alloc_skb+0x5b/0x1d0
> [ 3.420170] [<ffffffff81144f6b>] audit_log_start+0x29b/0x480
> [ 3.420170] [<ffffffff810a2925>] ? __lock_task_sighand+0x95/0x270
> [ 3.420170] [<ffffffff81145cc9>] audit_log_cap_use+0x39/0xf0
> [ 3.420170] [<ffffffff8109cd75>] ns_capable+0x45/0x70
> [ 3.420170] [<ffffffff8109cdb7>] capable+0x17/0x20
> [ 3.420170] [<ffffffff812a2f50>] oom_score_adj_write+0x150/0x2f0
> [ 3.420170] [<ffffffff81230997>] __vfs_write+0x37/0x160
> [ 3.420170] [<ffffffff810e33b7>] ? update_fast_ctr+0x17/0x30
> [ 3.420170] [<ffffffff810e3449>] ? percpu_down_read+0x49/0x90
> [ 3.420170] [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
> [ 3.420170] [<ffffffff81233d47>] ? __sb_start_write+0xb7/0xf0
> [ 3.420170] [<ffffffff81231048>] vfs_write+0xb8/0x1b0
> [ 3.420170] [<ffffffff812533c6>] ? __fget_light+0x66/0x90
> [ 3.420170] [<ffffffff81232078>] SyS_write+0x58/0xc0
> [ 3.420170] [<ffffffff81001f2c>] do_syscall_64+0x5c/0x300
> [ 3.420170] [<ffffffff81849c9a>] entry_SYSCALL64_slow_path+0x25/0x25
> [ 3.420170] ---[ end trace fb586899fb556a5e ]---
> [ 3.447922] random: systemd-udevd urandom read with 3 bits of entropy
> available
> [ 4.014078] clocksource: Switched to clocksource tsc
> Begin: Loading essential drivers ... done.
>
> This is with qemu and the boot continues normally. With real computer,
> there's no such output and system just seems to freeze.
>
> Could it be possible that the deadlock happens because there's some IO
> towards /sys/fs/cgroup, which causes a capability check and that in turn
> causes locking problems when we try to print cgroup list?

The above warning is printed by the following code at
kernel/locking/lockdep.c:2871:

static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
{
[...]
	/* We're only interested __GFP_FS allocations for now */
	if (!(gfp_mask & __GFP_FS))
		return;

	/*
	 * Oi! Can't be having __GFP_FS allocations with IRQs disabled.
	 */
	if (DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags)))
		return;


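The backtrace shows that your new audit_log_cap_use() is called in
the vfs_write() path, via oom_score_adj_write() and capable(), and
that the allocation done by audit_log_start() happens with IRQs
disabled. You might try calling audit_log_start() with GFP_NOFS
instead of GFP_KERNEL, so that the allocation does not carry __GFP_FS
and the check above is skipped.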
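
To illustrate why the flag matters for this particular warning, the
check quoted above can be modelled in userspace. This is only a sketch:
the MODEL_* names and model_trace_alloc_warns() are made up for this
mail, the bit values are arbitrary, and the only thing taken from the
kernel is the composition of GFP_KERNEL and GFP_NOFS as I understand it
from include/linux/gfp.h (GFP_KERNEL includes __GFP_FS, GFP_NOFS does
not). It says nothing about whether such an allocation is otherwise
safe with IRQs disabled.

```c
#include <stdbool.h>

/* Illustrative bit values only; the real ones are in include/linux/gfp.h. */
#define MODEL_GFP_IO		0x1u
#define MODEL_GFP_FS		0x2u
#define MODEL_GFP_RECLAIM	0x4u

/* Assumed composition: GFP_KERNEL carries the FS bit, GFP_NOFS does not. */
#define MODEL_GFP_KERNEL	(MODEL_GFP_RECLAIM | MODEL_GFP_IO | MODEL_GFP_FS)
#define MODEL_GFP_NOFS		(MODEL_GFP_RECLAIM | MODEL_GFP_IO)

/*
 * Hypothetical model of the two tests quoted from __lockdep_trace_alloc().
 * Returns true when the modelled check would print the warning.
 */
bool model_trace_alloc_warns(unsigned int gfp_mask, bool irqs_disabled)
{
	/* Allocations without the FS bit are not checked at all. */
	if (!(gfp_mask & MODEL_GFP_FS))
		return false;

	/* An FS allocation with IRQs disabled triggers the warning. */
	return irqs_disabled;
}
```

With IRQs disabled, model_trace_alloc_warns(MODEL_GFP_KERNEL, true)
reports the warning while model_trace_alloc_warns(MODEL_GFP_NOFS, true)
does not, which is the effect the suggested flag change would have on
this check.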

Note that this advice is based on intuition rather than certainty. I
still need to learn a lot about memory management, and about the kernel
in general, before I can be sure what the correct solution is.

Best Regards,
Petr