Re: Q: wait_task_inactive() and !CONFIG_SMP && CONFIG_PREEMPT

From: Oleg Nesterov
Date: Fri Aug 01 2008 - 07:46:36 EST


On 07/31, Roland McGrath wrote:
>
> > I dont think this is right.
> >
> > Firstly, the above always fails if match_state == 0, this is not right.
>
> A call with 0 is the "legacy case", where the return value is 0 and nothing
> but the traditional wait_task_inactive behavior is expected. On UP, this
> was a nop before and still is.
>
> Anyway, this is moot since we are soon to have no callers that pass 0.

This means we can't use wait_task_inactive(p) unless we know p->state
is already != TASK_RUNNING. OK, the current callers assume exactly this.

But perhaps we should change the state checks from "==" to "&", note
that pfm_task_incompatible() checks task_is_stopped_or_traced().

> > But more importantly, we can't just check ->state == match_state. And
> > preempt_disable() buys nothing.
>
> It ensures that the samples of ->state and ->nvcsw both came while the
> target could never have run in between. Without it, a preemption after the
> ->state check could mean the ->nvcsw value we use is from a later block in
> a different state than the one intended.

I meant, it buys nothing because the task can set its state == TASK_RUNNING
right after preempt_enable(), because the task can be runnable despite
the fact its state is (say) TASK_UNINTERRUPTIBLE.

Please see below.

> > Let's look at task_current_syscall(). The "target" can set, say,
> > TASK_UNINTERRUPTIBLE many times, do a lot of syscalls, and not once
> > call schedule().
> >
> > And the task remains fully preemptible even if it runs in
> > TASK_UNINTERRUPTIBLE state.
>
> One of us is missing something basic. We are on the only CPU. If target
> does *anything*, it means we got preempted, then target switched in, did
> things, and then called schedule (including via preemption)--so that we
> could possibly be running again now afterwards. That schedule call bumped
> the counter after we sampled it. The second call done for "is it still
> blocked afterwards?" will see a different count and abort. Am I confused?

I think we misunderstood each other. I never said the _current_ callers have
problems (except the dummy version can't just return 1 of course).

I proposed to synchronize 2 implementations to avoid the possible problems,
and personally I still dislike the fact that !SMP version has the very
different behaviour.

But yes, if we only want to "fix" the current callers, we can make the !SMP
version

static inline unsigned long wait_task_inactive(struct task_struct *p,
long match_state)
{
int ret = 0;

preempt_disable();
if (!match_state || p->state == match_state)
ret = (p->nvcsw + p->nivcsw) | LONG_MIN;
preempt_enable();

return ret;
}

> Ah, I think it was me who was missing something when I let you talk me into
> checking only ->nvcsw. It really should be ->nivcsw + ->nvcsw as I had it
> originally (| LONG_MIN as you've done, a good trick). That makes what I
> just said true in the preemption case. This bit:
>
> if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
>
> will not hit, so switch_count = &prev->nivcsw; remains from before.
> This is why it was nivcsw + nvcsw to begin with.

This is fine on SMP, afaics.

More precisely, this is fine if wait_task_inactive(p) returns success only
when p->se.on_rq == 0, we shouldn't worry about preemption (nivcsw) at all.

Let's forget about the overflow, and suppose that 2 subsequent calls return
the same ->nvcsw. Every time the task leaves the runqueue (so that !.on_rq
becomes true), shedule() bumps its ->nvcsw. If the task was scheduled in
between we must notice this, because the second success means the task
was deactivate_task()'ed again.

We shouldn't look at nivcsw at all. While the task is deactivated, its
nivcsw/nvcsw can't be changed. If it is scheduled again, it must increment
->nvcsw before wait_task_inactive() can return success.

(Dmitry cc'ed to check my understanding).

Oleg.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/