On Thu, Jun 3, 2021 at 2:53 PM Daniel Borkmann <daniel@xxxxxxxxxxxxx> wrote:[...]
I did run this entire discussion by both of the other BPF co-maintainers
(Alexei, Andrii, CC'ed) and together we did further brainstorming on the
matter on how we could solve this, but couldn't find a sensible & clean
solution so far.
Before I jump into the patch below I just want to say that I
appreciate you looking into solutions on the BPF side of things.
However, I voted "no" on this patch previously and since you haven't
really changed it, my "no"/NACK vote remains, at least until we
exhaust a few more options.
[PATCH] bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks
Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
added an implementation of the locked_down LSM hook to SELinux, with the aim
to restrict which domains are allowed to perform operations that would breach
lockdown. This is indirectly also getting audit subsystem involved to report
events. The latter is problematic, as reported by Ondrej and Serhei, since it
can bring down the whole system via audit:
1) The audit events that are triggered due to calls to security_locked_down()
can OOM kill a machine, see below details [0].
2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
when trying to wake up kauditd, for example, when using trace_sched_switch()
tracepoint, see details in [1]. Triggering this was not via some hypothetical
corner case, but with existing tools like runqlat & runqslower from bcc, for
example, which make use of this tracepoint. Rough call sequence goes like:
rq_lock(rq) -> -------------------------+
trace_sched_switch() -> |
bpf_prog_xyz() -> +-> deadlock
selinux_lockdown() -> |
audit_log_end() -> |
wake_up_interruptible() -> |
try_to_wake_up() -> |
rq_lock(rq) --------------+
Since BPF is a bit of chaotic nightmare in the sense that it basically
out-of-tree kernel code that can be called from anywhere and do pretty
much anything; it presents quite the challenge for those of us worried
about LSM access controls.
So let's look at this from a different angle. Let's look at the two
problems you mention above.
If we start with the runqueue deadlock we see the main problem is that
audit_log_end() pokes the kauditd_wait waitqueue to ensure the
kauditd_thread thread wakes up and processes the audit queue. The
audit_log_start() function does something similar, but it is
conditional on a number of factors and isn't as likely to be hit. If
we relocate these kauditd wakeup calls we can remove the deadlock in
trace_sched_switch(). In the case of CONFIG_AUDITSYSCALL=y we can
probably just move the wakeup to __audit_syscall_exit() and in the
case of CONFIG_AUDITSYSCALL=n we can likely just change the
wait_event_freezable() call in kauditd_thread to a
wait_event_freezable_timeout() call with a HZ timeout (the audit
stream will be much less on these systems anyway so a queue overflow
is much less likely). I'm building a kernel with these changes now, I
should have something to test when I wake up tomorrow morning. It
might even provide a bit of a performance boost as we would only be
calling a wakeup function once for each syscall.
The other issue is related to security_locked_down() and using the
right subject for the access control check. As has been pointed out
several times in this thread, the current code uses the current() task
as the subject, which is arguably incorrect for many of the BPF helper
functions. In the case of BPF, we have talked about using the
credentials of the task which loaded the BPF program instead of
current(), and that does make a certain amount of sense. Such an
approach should make the security policy easier to develop and
rationalize, leading to a significant decrease in audit records coming
from LSM access denials. The question is how to implement such a
change. The current SELinux security_bpf_prog_alloc() hook causes the
newly loaded BPF program to inherit the subject context from the task
which loads the BPF program; if it is possible to reference the
bpf_prog struct, or really just the associated bpf_prog_aux->security
blob, from inside a security_bpf_locked_down() function we use that
subject information to perform the access check. BPF folks, is there
a way to get that information from within a BPF kernel helper
function? If it isn't currently possible, could it be made possible
(or something similar)?