Re: [GIT PULL] Please pull proc and exec work for 5.7-rc1
From: Bernd Edlinger
Date: Fri Apr 03 2020 - 11:09:15 EST
On 4/2/20 9:04 PM, Linus Torvalds wrote:
> On Wed, Apr 1, 2020 at 9:16 AM Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
>>
>> The work on exec starts solving a long standing issue with exec that
>> it takes mutexes of blocking userspace applications, which makes exec
>> extremely deadlock prone. For the moment this adds a second mutex
>> with a narrower scope that handles all of the easy cases. Which
>> makes the tricky cases easy to spot. With a little luck the code to
>> solve those deadlocks will be ready by next merge window.
>
> So this worries me.
>
> I've pulled it, but I'm not entirely happy about some of it.
>
> For example, the "rationale" for some of the changes is
>
> This should be safe, as the credentials are only used for reading.
>
> which is just nonsensical. "Only used for reading" is immaterial, and
> there's no explanation for why that would matter at all. Most of the
> credentials are ever used for reading, and the worry about execve() is
> that the credentials can change, and people race with them and use the
> new 'suid' credentials and allow things that shouldn't be allowed. So
> the rationale makes no sense at all.
>
> Btw, if "this only takes it for reading" is such a big deal, why is
> that mutex not an rw-semaphore?
>
> The pidfd change at least has a rationale that makes sense:
>
> This should be safe, as the credentials do not change
> before exec_update_mutex is locked. Therefore whatever
> file access is possible with holding the cred_guard_mutex
> here is also possbile with the exec_update_mutex.
>
> so now you at least have an explanation of why that particular lock
> makes sense and works and is equivalent.
>
> It's still not a *great* explanation for why it would be equivalent,
> because cred_guard_mutex ends up not just guarding the write of the
> credentials, but makes it atomic wrt *other* data. That's the same
> problem as "I'm only reading".
>
> Locking is not about *one* value being consistent - if that was the
> case, then you could just do a "get refcount on the credentials, now I
> have a stable set of creds I can read forever". No lock needed.
>
> So locking is about *multiple* values being consistent, which is why
> that "I'm only reading" is not an explanation for why you can change
> the lock.
>
> It's also why that second one is questionable: it's a _better_ attempt
> at explaining things, but the point is really that cred_guard_mutex
> protects *other* things too.
>
Can we still edit the change logs, maybe that is a clear indication
that they are not sufficiently clear, when one don't understand the
patch without following the whole email thread.
> A real explanation would have absolutely *nothing* to do with
> "reading" or "same value of credentials". Both of those are entirely
> immaterial, since - as mentioned - you could just get a snapshot of
> the creds instead.
>
The problem we have here is that *another* thread can change no_new_privs
of the execve thread, that is a write. I think that must be avoided
whatever it costs. Those are the hard issues,
and reading another thread's credentials, an taking a reference of the
vm need to be consistent, so should just not happen while the vm
is updated, but the credentials not yet.
Or am I missing something here?
> A real explanation would be about how there is no other state that
> cred_guard_mutex protects that matters.
>
> See what I'm saying?
>
> This code is subtle as h*ll, and we've had bugs in it, and it has a
> series of tens of patches to fix them. But that also means that the
> explanations for the patches should take the subtleties into account,
> and not gloss over them with things like this.
>
:-)
> Ok, enough about the explanations. The actual _code_ is kind of odd
> too. For example, you have that "bprm->called_exec_mmap" flag to say
> "I've taken the exec_update_mutex, and need to drop it".
>
previously that was bprm->mm == NULL, that is even more hacky.
> But that flag is not set anywhere _near_ actually taking the lock.
> Sure, it is taken after exec_mmap() returns successfully, and that
> makes sense from a naming standpoint, but wouldn't it have been a
> _lot_ more obvious if you just set the flag when you took that lock,
> and instead of naming it by some magical code sequence, you named it
> for what it does?
>
Linus, I take full responsibility for this part of the patch.
In this case, I just did not want to change the name again.
That name was in a previous version of my patch, that I merged
with Eric's and at the same time had to fix the mutex-lock-order
issue in Eric's original patch. But if anybody would have
suggested a better name, and advised me to pass a parameter to
exec_mmap that would have happened.
So a kind of laziness on my side, and unfortunately I forgot to
point to all the changes in a revision log, I usually do that,
but this time I forgot it somehow. This was a 16-part patch
series at that time, so I just was really busy with following
each mail of the previous patch version, and also get the latest
revision of the change log (I use the mail maybe I should have
pulled Eric's tree, but I am still a newbie here ... :-) ).
Anyhow I was surprised that Eric did not see my changes by
looking at them, but that is the human nature, nothing to be
blamed for.
> Again, this looks all technically correct, but it's written in a way
> that doesn't seem to make a lot of sense. Why is the code literally
> written with a magical assumption of "calling exec_mmap takes this
> lock, so if the flag named called_exec_mmap is set, I have to free
> that lock that is not named that at all".
>
Names can be changed. In the peer review everybody was happy with it.
But that is not set in stone.
Initially I only wanted to address the ptrace attach, but Eric
came up with the user mode page fault handler, that made the patch
a lot more complicated, if that goal is dropped, also the place
where the mutex need to be taken could be a different one.
> I hate conditional locking in the first place, but if it has to exist,
> then the conditional should be named after the lock, and the lock
> getting should be very very explicitly tied to it.
>
> Wouldn't it have been much clearer if you called that flag
> "exec_update_mutex_taken", and set it WHEN YOU TAKE IT?
>
Can be done. I don't care. It is one additional register taken
with a parameter to exec_mmap and it is probably inlined, nothing
more nothing less.
> In fact, then you could drop the
>
> mutex_unlock(&tsk->signal->exec_update_mutex);
>
> in the error case of exec_mmap(), because now the error handling in
> free_bprm() would do the cleanup automatically.
>
The error handling is sometimes called when the exec_update_mutex is
not taken, in fact even de_thread not called.
Can you say how you would suggest that to be done?
> See what I'm saying? You've made the locking more complex and subtle
> than it needed to be. And since the whole point of the *new* lock is
> that it should replace an old lock that was really complex and subtle,
> that's a problem.
>
> Linus
>