Re: Linux 3.19-rc5

From: Peter Zijlstra
Date: Mon Feb 02 2015 - 08:19:48 EST


On Sat, Jan 31, 2015 at 01:54:36PM -0800, Linus Torvalds wrote:
> On Sat, Jan 31, 2015 at 12:16 PM, Peter Zijlstra <peterz@xxxxxxxxxxxxx> wrote:
> >
> > All the stuff it flagged are genuinely wrong, albeit not disastrously
> > so, things mostly just work.
>
> I really disagree.
>
> They weren't wrong. They *could* occasionally result in extra
> reschedules, but that was never wrong to begin with.
>
> But the debugging code made that much much worse, and made that "could
> result in extra reschedules" happen all the time if it triggered.

Yes, that was a bad choice. Again, sorry about that.

> So you say "genuinely wrong", and I say "but that's how things were
> designed - it's an optimistic approach, not an exact one". Your
> debugging code changed that behavior, and actually introduced a real
> bug, exactly because you felt that the "no nested sleeps" was a harder
> requirement than it has ever actually been.
>
> In other words, I think the debugging code itself is wrong, and then
> that sched_annotate_sleep() thing is just a symptom of how it is
> wrong. If you have to sprinkle these kinds of random workarounds in a
> few core scheduler places (ok, mainly "wait()" paths, it looks like),
> why would you expect random *drivers* to have to care about things
> that even core kernel code says "I'm not going to care about this,
> I'll just shut the warning up, because the warning is wrong".

I was not aware this pattern was so widespread / accepted. I totally
underestimated the amount of fallout here.

> So don't get me wrong - I think the whole "add debug code to find
> places where we might have issues" was well worth it, and resulted in
> improvements.
>
> But once the low-hanging fruit and the core code that everybody hits
> has been fixed, and people cannot apparently even be bothered with the
> other cases it finds (like the pccardd case), at that point the value
> of the debug code becomes rather less obvious.

My bad (again) for not paying proper attention there. No excuses for
that.

> The pccardd example is an example of legacy use of our old and
> original semantics of how the whole nested sleep was supposed to work.
> And it *does* work. It's not a bug. It's how things have worked time
> immemorial, and clearly nobody is really willing to bother with
> changing working - but legacy - cardbus code. And at that point, I
> think the debug code is actually *wrong*, and causes more problems
> than it "fixes".
>
> And debug code that causes more problems than it fixes should either
> be removed, or improved to the point where the problems go away.
>
> The "improved" part might be about saying "it's actually perfectly
> _fine_ to have nested sleeps, as long as it is truly rare that the
> nested sleep actually sleeps". And thus make the debug code really
> test that it's *rare*, rather than test that it never happens. Warn if
> it happens more than a couple of times a second for any particular
> process, or something like that.

So the thing that makes it work is the fact that schedule() is called
in a loop; without that loop things come apart very quickly.

Now most core primitives have this loop and are safe against spurious
wakeups -- and I think we can class this under that. But there is a heap
of legacy code with non-looping calls to schedule() that really breaks
on spurious wakeups (I ran into that a few years ago when I made
schedule() return 'randomly' for some other reason).

So I worry about separating the nested-sleep-in-a-loop case, which is
mostly harmless, from the non-loop case, which is really bad.
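
To make that concrete, a minimal sketch (with a hypothetical condition
'cond') of the two shapes:

	/* Looping wait: a spurious or nested wakeup just costs one
	 * extra trip around the loop. */
	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (cond)
			break;
		schedule();
	}
	__set_current_state(TASK_RUNNING);

	/* Non-looping wait: assumes the only possible wakeup is the
	 * one we want; any spurious wakeup returns here with !cond
	 * and we carry on regardless. */
	set_current_state(TASK_UNINTERRUPTIBLE);
	if (!cond)
		schedule();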

FWIW the bug that started all this was someone calling blocking
functions from io_schedule():

io_schedule() -> blk_flush_plug() -> block-layer-magic ->
device-magic() -> mutex_lock().

And mutex_lock() used:

	if (need_resched())
		schedule();

for preemption and would never return (because !TASK_RUNNING).

Now, we've since fixed the mutex code to not assume TASK_RUNNING, so we
won't actually trigger the reported 'deadlock' anymore.
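
Roughly, it looked like this (a condensed sketch of the failure, not
the literal code):

	/* the caller has set the task state before io_schedule(): */
	set_current_state(TASK_UNINTERRUPTIBLE);

	/* io_schedule() flushes pending block requests before sleeping: */
	blk_flush_plug(current);

	/* ... levels down, some driver took a mutex whose slowpath
	 * contained the preemption point above: */
	if (need_resched())
		schedule();	/* meant as a brief reschedule of a
				 * runnable task; but our state is still
				 * TASK_UNINTERRUPTIBLE, so we block for
				 * real with no wakeup queued anywhere */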

In any case, I can certainly make the warning go away for now and try
again later with a (hopefully) more intelligent version.




On a related note, would it be possible to make sparse/coccinelle try
to warn on broken wait primitives? I.e., no loop around schedule(), or
no re-test of the condition after set_current_state()?

Examples:

drivers/atm/atmtcp.c-	while (test_bit(flag,&vcc->flags) == old_test) {
drivers/atm/atmtcp.c-		mb();
drivers/atm/atmtcp.c-		out_vcc = PRIV(vcc->dev) ? PRIV(vcc->dev)->vcc : NULL;
drivers/atm/atmtcp.c-		if (!out_vcc) {
drivers/atm/atmtcp.c-			error = -EUNATCH;
drivers/atm/atmtcp.c-			break;
drivers/atm/atmtcp.c-		}

		set_bit(flag, &vcc->flags);
		wake_up_process(foo);

drivers/atm/atmtcp.c-		set_current_state(TASK_UNINTERRUPTIBLE);
drivers/atm/atmtcp.c:		schedule();
drivers/atm/atmtcp.c-	}

It seems that thread would never wake again: the condition is tested
before the task state is set, so if the wakeup (the interleaved
set_bit()/wake_up_process() above) lands in that window it finds the
task still TASK_RUNNING, and the subsequent schedule() then sleeps with
nobody left to wake it.
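
The conventional fix is to set the task state before testing the
condition, so a wakeup landing in the window is either seen by the
re-test or puts the task back to TASK_RUNNING, making schedule() return
immediately. A sketch using the names from the snippet above:

	for (;;) {
		/* implies the barrier the old mb() provided */
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (test_bit(flag, &vcc->flags) != old_test)
			break;
		out_vcc = PRIV(vcc->dev) ? PRIV(vcc->dev)->vcc : NULL;
		if (!out_vcc) {
			error = -EUNATCH;
			break;
		}
		schedule();
	}
	__set_current_state(TASK_RUNNING);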



And I'm not entirely sure what this is supposed to do, but it looks
fishy (and it seems to be popular; I've found more instances of it):

drivers/iommu/amd_iommu_v2.c-static void put_device_state_wait(struct device_state *dev_state)
drivers/iommu/amd_iommu_v2.c-{
drivers/iommu/amd_iommu_v2.c-	DEFINE_WAIT(wait);
drivers/iommu/amd_iommu_v2.c-
drivers/iommu/amd_iommu_v2.c-	prepare_to_wait(&dev_state->wq, &wait, TASK_UNINTERRUPTIBLE);
drivers/iommu/amd_iommu_v2.c-	if (!atomic_dec_and_test(&dev_state->count))
drivers/iommu/amd_iommu_v2.c:		schedule();
drivers/iommu/amd_iommu_v2.c-	finish_wait(&dev_state->wq, &wait);
drivers/iommu/amd_iommu_v2.c-
drivers/iommu/amd_iommu_v2.c-	free_device_state(dev_state);
drivers/iommu/amd_iommu_v2.c-}

Did they want to write:

	if (atomic_dec_and_test(&dev_state->count)) {
		wake_up(&dev_state->wq);
		return;
	}

	wait_event(dev_state->wq, !atomic_read(&dev_state->count));
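
For contrast, wait_event() already wraps schedule() in the
condition-rechecking loop, so the whole function could plausibly be
written as below (a sketch of the apparent intent; it assumes whichever
path drops the last reference also issues the wake_up()):

	static void put_device_state_wait(struct device_state *dev_state)
	{
		/* Drop our reference; if others remain, sleep until the
		 * count hits zero. wait_event() sets the task state and
		 * re-tests the condition in a loop around schedule(),
		 * so spurious wakeups are harmless. */
		if (!atomic_dec_and_test(&dev_state->count))
			wait_event(dev_state->wq,
				   !atomic_read(&dev_state->count));

		free_device_state(dev_state);
	}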



Also, people seem to have gotten the memo that sched_yield() is
undesired, but not actually understood why; there's an absolute ton of:

	while (!foo)
		schedule();

Some explicitly set TASK_RUNNING, most not so much.
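
Note that with the task state left at TASK_RUNNING, that loop never
sleeps at all; schedule() keeps the task on the runqueue and returns as
soon as it is picked again, so it is sched_yield() in disguise:

	/* Busy-wait: burns CPU until whoever sets foo gets to run. */
	while (!foo)
		schedule();	/* task stays runnable; returns right
				 * away when nothing else wants the CPU */

Actually sleeping requires the set_current_state() loop sketched
earlier, paired with a wake_up_process() from whoever sets foo.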



I realize we're never going to fix all of that; but it would be good to
avoid growing more of it. And I think the amd_iommu_v2 case shows it's
not only legacy code.