Re: [PATCH v1 5/5] driver-core: add driver asynchronous probe support

From: Luis R. Rodriguez
Date: Tue Sep 30 2014 - 11:24:34 EST


On Tue, Sep 30, 2014 at 11:22:14AM +0200, Tom Gundersen wrote:
> On Tue, Sep 30, 2014 at 4:27 AM, Luis R. Rodriguez <mcgrof@xxxxxxxx> wrote:
> > On Sun, Sep 28, 2014 at 07:07:24PM +0200, Tom Gundersen wrote:
> >> On Fri, Sep 26, 2014 at 11:57 PM, Luis R. Rodriguez
> >> <mcgrof@xxxxxxxxxxxxxxxx> wrote:
> >> > From: "Luis R. Rodriguez" <mcgrof@xxxxxxxx>
> >> > Systemd has a general timeout for all workers currently set to 180
> >> > seconds after which it will send a sigkill signal. Systemd now has a
> >> > warning which is issued once it reaches 1/3 of the timeout. The original
> >> > motivation for the systemd timeout was to help track device drivers
> >> > which do not use asynch firmware loading on init() and the timeout was
> >> > originally set to 30 seconds.
> >>
> >> Please note that the motivation for the timeout in systemd had nothing
> >> to do with async firmware loading (that was just the case where
> >> problems cropped up).
> >
> > *Part *of the original kill logic, according to the commit log, was actually
> > due to the assumption that the issues observed *were* synchronous firmware
> > loading on module init():
> >
> > commit e64fae5573e566ce4fd9b23c68ac8f3096603314
> > Author: Kay Sievers <kay.sievers@xxxxxxxx>
> > Date: Wed Jan 18 05:06:18 2012 +0100
> >
> > udevd: kill hanging event processes after 30 seconds
> >
> > Some broken kernel drivers load firmware synchronously in the module init
> > path and block modprobe until the firmware request is fulfilled.
> > <...>
>
> This was a workaround to avoid a deadlock between udev and the kernel.
> The 180 s timeout was already in place before this change, and was not
> motivated by firmware loading. Also note that this patch was not about
> "tracking device drivers", just about avoiding dead-lock.

Thanks, can you elaborate on how a deadlock can occur if the kmod
worker is not at some point sigkilled?

> > My point here is not to point fingers but to explain why we went on with
> > this and how we failed to realize only until later that the driver core
> > ran probe together with init. When a few folks pointed out the issues
> > with the kill the issue was punted back to kernel developers and the
> > assumption even among some kernel maintainers was that it was init paths
> > with sync behaviour that was causing some delays and they were broken
> > drivers. It is important to highlight these assumptions ended up setting
> > us off on the wrong path for a while in a hunt to try to fix this issue
> > either in driver or elsewhere.
>
> Ok. I'm not sure the motivations for user-space changes is important
> to include in the commit message, but if you do I'll try to clarify
> things to avoid misunderstandings.

I can try to omit it on the next series.

> > Thanks for clarifying this, can you explain what issues could arise
> > from making an exception to allowing kmod workers to hang around
> > completing init + probe over a certain defined amount of time without
> > being killed?
>
> We could run out of udev workers and the whole boot would hang.

Is the issue that if there is no extra worker available and all are
idling on sleep / synchronous long work boot will potentially hang
unless a new worker becomes available to do more work? If so I can
see the sigkill helping for hanging tasks but it doesn't necessarily
mean its a good idea to kill modules loading taking a while. Also
what if the sigkill is just avoided for *just* kmod workers?

> The way I see it, the current status from systemd's side is: our
> short-term work-around is to increase the timeout, and at the moment
> it appears no long-term solution is needed (i.e., it seems like the
> right thing to do is to make sure insmod can be near instantaneous, it
> appears people are working towards this goal, and so far no examples
> have cropped up showing that it is fundamentally impossible (once/if
> they do, we should of course revisit the problem)).

That again would be reactive behaviour, what would prevent avoiding the
sigkill only for kmod workers? Is it known the deadlock is immiment?
If the amount of workers for kmod that would hit the timeout is
considered low I don't see how that's possible and why not just lift
the sigkill.

Luis
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/