Re: [PATCH 0/1] futex: check PF_KTHREAD rather than !p->mm to filter out kthreads

From: Peter Zijlstra
Date: Mon Feb 02 2015 - 10:12:10 EST


On Mon, Feb 02, 2015 at 03:05:15PM +0100, Oleg Nesterov wrote:

> First of all, why exactly do we need this mm/PF_KTHREAD check added by
> f0d71b3dcb8332f7971 ? Of course, it is simply wrong to declare a random
> kernel thread to be the owner as the changelog says. But why kthread is
> worse than a random user-space task, say, /sbin/init?

As the changelog says, we _should_ equally disallow other userspace
tasks that do not share the futex value with us, its just that at the
time we could not come up with a sensible (and cheap) way of testing for
this.

> IIUC, the fact that we can abuse ->pi_state_list is not that bad, no matter
> if this (k)thread will exit or not. AFAICS, the only problem is that we can
> boost the prio of this thread. Or I missed another problem?

No that's it.

> I am asking because we need to backport some fixes, and I am trying to
> convince myself that I actually understand what I am trying to do ;)



> And another question. Lets forget about this ->mm check. I simply can not
> understand this
>
> ret = (p->flags & PF_EXITPIDONE) ? -ESRCH : -EAGAIN
>
> logic in attach_to_pi_owner(). First of all, why do we need to retry if
> PF_EXITING is set but PF_EXITPIDONE is not? Why we can not simply ignore
> PF_EXITING and rely on exit_pi_state_list() if PF_EXITPIDONE is not set?
>
> I must have missed something but this looks buggy, I do not see any
> preemption point in this "retry" loop. Suppose that max_cpus=1 and rt_task()
> preempts the non-rt PF_EXITING owner. Looks like futex_lock_pi() can spin
> forever in this case? (OK, ignoring RT throttling).

This is not something I've ever looked at before; 778e9a9c3e71
("pi-futex: fix exit races and locking problems") seems to suggest its
possible to get onto tsk->pi_state_list after exit_pi_state_list().

So while the below shows preemption points; those don't actually help
against RT tasks, a FIFO-99 task will always be more eligible to run
than most others.

So yes, I do like your proposal of putting PF_EXITPIDONE under the
->pi_lock section that handles exit_pi_state_list().

I further think we can remove the smp_mb(); raw_spin_unlock_wait() from
do_exit() -- this would offset the new unconditional ->pi_lock
acquisition in exit_pi_state_list(). The comment there suggests robust
futexes are involved but I cannot find any except the PI state muck
testing ->flags.

As for the recursive fault; I think the safer option is to set
EXITPIDONE and not register more PI states, as opposed to allowing more
and more states to be added. Yes we'll leak whatever currently is there,
but no point in allowing it to get worse.


do_exit()
{
exit_signals(tsk); /* sets PF_EXITING */

smp_mb();
raw_spin_unlock_wait(&tsk->pi_lock);

exit_mm() {
mm_release() {
exit_pi_state_list();
}
}

tsk->flags |= PF_EXITPIDONE;
}

vs

futex_lock_pi()
{
retry:
...

ret = futex_lock_pi_atomic() {
attach_to_pi_owner() {
raw_spin_lock(&tsk->pi_lock);
if (PF_EXITING) {
ret = PF_EXITPIDONE ? -ESRCH : -AGAIN;
raw_spin_unlock(&tsk->pi_lock);
return ret;
}
}
}
if (ret) {
switch(ret) {
...

case -EAGAIN:
...
cond_resched();
goto retry;
}
}
}

vs

futex_requeue()
{
retry:
...

ret = futex_proxy_trylock_atomic() {
ret = futex_lock_pi_atomic() {
attach_to_pi_owner() {
raw_spin_lock(&tsk->pi_lock);
if (PF_EXITING) {
ret = PF_EXITPIDONE ? -ESRCH : -AGAIN;
raw_spin_unlock(&tsk->pi_lock);
return ret;
}
}
}
}

if (ret > 0) {
ret = lookup_pi_state() {
attach_to_pi_owner() {
raw_spin_lock(&tsk->pi_lock);
if (PF_EXITING) {
ret = PF_EXITPIDONE ? -ESRCH : -AGAIN;
raw_spin_unlock(&tsk->pi_lock);
return ret;
}
}
}
}

...
switch(ret) {
...
case -EAGAIN:
...
cond_resched();
goto retry;
}
}

vs


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/