Re: INFO: task hung in wdm_flush

From: Dmitry Vyukov
Date: Sat Nov 23 2019 - 01:53:01 EST


On Tue, Nov 19, 2019 at 12:34 PM Bjørn Mork <bjorn@xxxxxxx> wrote:
>
> Oliver Neukum <oneukum@xxxxxxx> writes:
> > On Tuesday, 19.11.2019, at 10:14 +0100, Bjørn Mork wrote:
> >
> >> Anyway, I believe this is not a bug.
> >>
> >> wdm_flush will wait forever for the IN_USE flag to be cleared or the
> >
> > Damn. Too obvious. So you think we simply have pending output that
> > just does not complete?
>
> I do miss a lot of stuff, so I might be wrong, but I can't see any other
> way this can happen. The out_callback will unconditionally clear the
> IN_USE flag and wake up the wait_queue.
>
> >> DISCONNECTING flag to be set. The only way you can avoid this is by
> >> creating a device that works normally up to a point and then completely
> >> ignores all messages,
> >
> > Devices may crash. I don't think we can ignore that case.
>
> Sure, but I've never seen that happen without the device falling off the
> bus. Which is a disconnect.
>
> But I am all for handling this *if* someone reproduces it with a real
> device. I just don't think it's worth the effort if it's only a
> theoretical problem.
>
> >> but without resetting or disconnecting. It is
> >> obviously possible to create such a device. But I think the current
> >> error handling is more than sufficient, unless you show me some way to
> >> abuse this or reproduce the issue with a real device.
> >
> > Malicious devices are real. Potentially, at least.
> > But you are right: we need not bend over backwards to handle them
> > well, but we ought to be able to handle them.
>
> Sure, we need to handle malicious devices. But only if they can be used
> for real harm.
>
> This warning requires physical access and is only slightly annoying.
> Like a USB device making loud farting sounds. You'd just disconnect the
> device. No need for Linux to detect the sound and handle it
> automatically, I think.

Hi Bjørn,

Besides the production use you are referring to, there are 2 cases we
should take into account as well:
1. Testing.
Any kernel testing system needs a binary criterion for detecting kernel
bugs. It seems right to treat unkillable hung tasks as kernel bugs,
which means that we need to resolve this in some way regardless of the
production scenario.
2. Reliable killing of processes.
It's a very important property that an admin or script can reliably
kill whatever process/container they need to kill for whatever reason.
This case results in an unkillable process, which means scripts will
fail, automated systems will misbehave, and admins will waste time (if
they are qualified to resolve this at all).
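The hang discussed above comes down to a wait with only two exits:
wdm_flush sleeps until IN_USE is cleared or DISCONNECTING is set, so a
device that does neither keeps the task asleep. The following is a
minimal userspace sketch, not the cdc-wdm code: the struct, field and
function names are hypothetical, and a pthread condition variable stands
in for the driver's wait_queue. It adds a third exit condition modelling
a fatal signal, roughly what a killable kernel wait such as
wait_event_killable() gives the caller.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the state a flush waits on. */
struct flush_state {
    pthread_mutex_t lock;
    pthread_cond_t wait;   /* stands in for the driver's wait_queue */
    bool in_use;           /* models IN_USE: a write is still pending */
    bool disconnecting;    /* models DISCONNECTING */
    bool killed;           /* models a fatal signal sent to the waiter */
};

void flush_state_init(struct flush_state *st)
{
    pthread_mutex_init(&st->lock, NULL);
    pthread_cond_init(&st->wait, NULL);
    st->in_use = false;
    st->disconnecting = false;
    st->killed = false;
}

/* Like out_callback: unconditionally clear IN_USE and wake waiters. */
void complete_write(struct flush_state *st)
{
    pthread_mutex_lock(&st->lock);
    st->in_use = false;
    pthread_cond_broadcast(&st->wait);
    pthread_mutex_unlock(&st->lock);
}

/* Models the disconnect path setting DISCONNECTING. */
void set_disconnecting(struct flush_state *st)
{
    pthread_mutex_lock(&st->lock);
    st->disconnecting = true;
    pthread_cond_broadcast(&st->wait);
    pthread_mutex_unlock(&st->lock);
}

/* Models an admin or script killing the waiting process. */
void deliver_kill(struct flush_state *st)
{
    pthread_mutex_lock(&st->lock);
    st->killed = true;
    pthread_cond_broadcast(&st->wait);
    pthread_mutex_unlock(&st->lock);
}

/*
 * Killable flush: returns 0 once the write has completed, -ENODEV if
 * the device is disconnecting, -EINTR if the waiter was killed.
 * Dropping the 'killed' check gives the two-exit wait, which never
 * returns if the device neither completes the write nor disconnects.
 */
int flush_wait(struct flush_state *st)
{
    int ret;

    assert(st != NULL);
    pthread_mutex_lock(&st->lock);
    while (st->in_use && !st->disconnecting && !st->killed)
        pthread_cond_wait(&st->wait, &st->lock);

    if (!st->in_use)
        ret = 0;
    else if (st->disconnecting)
        ret = -ENODEV;
    else
        ret = -EINTR;
    pthread_mutex_unlock(&st->lock);
    return ret;
}
```

The loop re-checks all three conditions after every wakeup, so spurious
wakeups from pthread_cond_wait are harmless. The only point of the
sketch is that a waiter with an extra, user-controlled exit cannot be
pinned forever by a device that stops responding without disconnecting.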