Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1

From: Linus Torvalds
Date: Thu Apr 02 2020 - 15:05:03 EST


On Wed, Apr 1, 2020 at 9:16 AM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>
> The work on exec starts solving a long standing issue with exec that
> it takes mutexes of blocking userspace applications, which makes exec
> extremely deadlock prone. For the moment this adds a second mutex
> with a narrower scope that handles all of the easy cases. Which
> makes the tricky cases easy to spot. With a little luck the code to
> solve those deadlocks will be ready by next merge window.

So this worries me.

I've pulled it, but I'm not entirely happy about some of it.

For example, the "rationale" for some of the changes is

This should be safe, as the credentials are only used for reading.

which is just nonsensical. "Only used for reading" is immaterial, and
there's no explanation for why that would matter at all. Most of the
credentials are ever used for reading, and the worry about execve() is
that the credentials can change, and people race with them and use the
new 'suid' credentials and allow things that shouldn't be allowed. So
the rationale makes no sense at all.

Btw, if "this only takes it for reading" is such a big deal, why is
that mutex not an rw-semaphore?

The pidfd change at least has a rationale that makes sense:

This should be safe, as the credentials do not change
before exec_update_mutex is locked. Therefore whatever
file access is possible with holding the cred_guard_mutex
here is also possbile with the exec_update_mutex.

so now you at least have an explanation of why that particular lock
makes sense and works and is equivalent.

It's still not a *great* explanation for why it would be equivalent,
because cred_guard_mutex ends up not just guarding the write of the
credentials, but makes it atomic wrt *other* data. That's the same
problem as "I'm only reading".

Locking is not about *one* value being consistent - if that was the
case, then you could just do a "get refcount on the credentials, now I
have a stable set of creds I can read forever". No lock needed.

So locking is about *multiple* values being consistent, which is why
that "I'm only reading" is not an explanation for why you can change
the lock.

It's also why that second one is questionable: it's a _better_ attempt
at explaining things, but the point is really that cred_guard_mutex
protects *other* things too.

A real explanation would have absolutely *nothing* to do with
"reading" or "same value of credentials". Both of those are entirely
immaterial, since - as mentioned - you could just get a snapshot of
the creds instead.

A real explanation would be about how there is no other state that
cred_guard_mutex protects that matters.

See what I'm saying?

This code is subtle as h*ll, and we've had bugs in it, and it has a
series of tens of patches to fix them. But that also means that the
explanations for the patches should take the subtleties into account,
and not gloss over them with things like this.

Ok, enough about the explanations. The actual _code_ is kind of odd
too. For example, you have that "bprm->called_exec_mmap" flag to say
"I've taken the exec_update_mutex, and need to drop it".

But that flag is not set anywhere _near_ actually taking the lock.
Sure, it is taken after exec_mmap() returns successfully, and that
makes sense from a naming standpoint, but wouldn't it have been a
_lot_ more obvious if you just set the flag when you took that lock,
and instead of naming it by some magical code sequence, you named it
for what it does?

Again, this looks all technically correct, but it's written in a way
that doesn't seem to make a lot of sense. Why is the code literally
written with a magical assumption of "calling exec_mmap takes this
lock, so if the flag named called_exec_mmap is set, I have to free
that lock that is not named that at all".

I hate conditional locking in the first place, but if it has to exist,
then the conditional should be named after the lock, and the lock
getting should be very very explicitly tied to it.

Wouldn't it have been much clearer if you called that flag
"exec_update_mutex_taken", and set it WHEN YOU TAKE IT?

In fact, then you could drop the

mutex_unlock(&tsk->signal->exec_update_mutex);

in the error case of exec_mmap(), because now the error handling in
free_bprm() would do the cleanup automatically.

See what I'm saying? You've made the locking more complex and subtle
than it needed to be. And since the whole point of the *new* lock is
that it should replace an old lock that was really complex and subtle,
that's a problem.

Linus