[RFC] proposal for resolving the cred_guard_mutex deadlock
From: Jann Horn
Date: Thu Oct 11 2018 - 17:14:22 EST
Hi!
There is that old deadlock of cred_guard_mutex that I originally heard
about from Oleg, and that has come up on LKML sometimes since then; to
recap, essentially, the problem is demonstrated by the following
testcase:
========
#include <pthread.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <unistd.h>
#include <signal.h>
void *thread(void *arg) {
ptrace(PTRACE_TRACEME, 0, 0, 0);
return NULL;
}
int main(void) {
int pid = fork();
if (!pid) {
pthread_t pt;
pthread_create(&pt, NULL, thread, NULL);
pthread_join(pt, NULL);
execlp("echo", "echo", "passed", NULL);
}
sleep(1);
ptrace(PTRACE_ATTACH, pid, 0, 0);
kill(pid, SIGCONT);
return 0;
}
========
The parent gets stuck in ptrace(), waiting for the cred_guard_mutex:
========
[<0>] ptrace_attach+0x97/0x250
[<0>] __x64_sys_ptrace+0xa2/0x100
[<0>] do_syscall_64+0x55/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
========
The child gets stuck in execve(), waiting in de_thread() for its other thread:
========
? __schedule+0x2b7/0x880
schedule+0x28/0x80
flush_old_exec+0xc8/0x6c0
load_elf_binary+0x291/0x15d0
? get_user_pages_remote+0x137/0x1f0
? get_acl+0x1a/0x100
search_binary_handler+0x90/0x1c0
__do_execve_file.isra.37+0x6b8/0x880
__x64_sys_execve+0x34/0x40
do_syscall_64+0x55/0x110
entry_SYSCALL_64_after_hwframe+0x44/0xa9
========
But the other thread can't exit until the ptrace parent handles it;
and the ptrace parent is stuck in another syscall, and therefore can't
wait() for that thread.
This can cause debuggers to lock up, and also gets in the way of
taking cred_guard_mutex in more places (since that'd make the deadlock
worse).
IMO at the core of the problem is that the cred_guard_mutex is held
across pretty much all of sys_execve(), and in particular across
do_thread(). I think that actually, we could drop the cred_guard_mutex
before waiting for sibling threads if we make things like the ptrace
check a bit more complicated.
At the moment, cred_guard_mutex is held across de_thread() and beyond
to protect against things like seccomp TSYNC from a sibling thread or
ptrace attach. For TSYNC, this is necessary because another thread
could try to TSYNC to us as long as we're not done with de_thread();
for ptrace attach, this is necessary because we only switch
credentials after de_thread(). But just doing the credential switch
earlier would also be dangerous.
So here's my idea:
When we're done calculating the post-execve creds (which happens
before de_thread()), we stash a pointer to the post-exec creds in
current->signal (or in current?). Then, in de_thread(), once we know
that all our siblings have been zapped, we drop the cred_guard_mutex
*before* the loop that waits for the siblings to exit. At that point,
sibling threads that are mid-syscall and other processes can try to
interact with us again. This means:
For accesses that are permitted based on whether the access is
same-thread (in particular TSYNC and probably also the
same_thread_group() check in ptrace), we'll have to check whether exec
creds are stashed in current->signal; if so, we have to bail out. Any
error code we return there shouldn't be visible to userspace, since
this can only happen to a thread that has already been zapped.
For accesses that go through ptrace_may_access(), we'll have to expand
the check such that, if an attempt is made to access a task that is
currently going through execve, the access check is performed against
*both* its old and its new credentials. This also means that the
target credentials will have to be plumbed into the LSMs in addition
to the target task pointer.
Opinions? If nobody thinks that this is an incredibly stupid idea,
I'll try to come up with some patches.