Re: [PATCH 4.13 20/27] Revert "firmware: add sanity check on shutdown/suspend"

From: Luis R. Rodriguez
Date: Wed Sep 13 2017 - 14:38:48 EST


Rafeal a question for you below.

On Tue, Sep 12, 2017 at 09:11:46PM -0700, Linus Torvalds wrote:
> On Tue, Sep 12, 2017 at 5:47 PM, Greg Kroah-Hartman
> <gregkh@xxxxxxxxxxxxxxxxxxx> wrote:
> >
> >> If reverting this commit please consider reverting also commit
> >> 06a45a93e7d34a ("firmware: move umh try locks into the umh code").
> >
> > Ok, I can queue that revert up in my tree and will send it to Linus once
> > 4.14-rc1 is out.
>
> I want to see a _reason_ for that revert. The two have absolutely
> nothing to do with each other., Reverting one is *not* a reason for
> reverting the other.

There is a dependency between both commits, the reason is not obvious though.
I'll explain below.

> Commit 06a45a93e7d34a seems to be a cleanup. The arguments in
> 06a45a93e7d3 ("firmware: move umh try locks into the umh code") seem
> valid, and there's no real reason to worry about that FW_OPT_NOWAIT
> etc for the direct-from-filesystem loading. That's simply not
> sensible.

Indeed! That stupid UMH lock *seems* wrong on the direct filesystem path!

Hence these changes.

The devil is in the details though. That UMH lock however carried an implicit
suspend guard, the "cleanup" actually then has a functional change. The commit
which was reverted provided the safe guard in generic form, in case we already
had become dependent on the suspend guard. This UMH lock on the direct FS path
then added an implicit "arbitrary rule", as you put it, on the firmware API.

Commit 81f95076281fd clarified and made it explicit just to be safe.

*Only* carrying 06a45a93e7d34a then was a long term goal, eventually I intended
to remove the code which you removed, but I would have preferred to have waited
at least a release.

By only having 06a45a93e7d34a we then loose the suspend guard from the direct
FS lookup path immediately. If we're happy to live with that right away then
great, this may be one of those *random arbitrary rules* inherited worth
removing, it was a long term goal to remove that code, however I just want
folks to be very well aware of the original goal: 06a45a93e7d34a without
81f95076281fd is not a cleanup, there is a hidden functional change there.

So to be clear: I'm happy with 81f95076281fd reverted as it was my long term
goal, however I would not be doing my job if I were not explaining that
06a45a93e7d34a *has* a functional change and that 81f95076281fd was an
attempt at a safe baby step forward.

> That whole FW_OPT_NOWAIT thing only makes sense for the actual
> user-mode helper, so that commit actually seems to move the testing
> and the logic to a place that really does make sense.

Indeed!

> So why would we revert a commit that makes SENSE?

Because although the UMH lock should only be used for code which needs the UMH
helpers, the UMH lock also had added a suspend guard which we grew to rely on.
Removing it could mean allowing races on resume, and there was no clear way
generically detecting this. The UMH lock was *also* never used on *any other*
UMH code, so from the UMH perspective it also begs the question if the other
UMH code is error prone as it lacks the safe UMH locking safe guards.

> For example, what might be senseible is to add a warning that tries to
> verify that people do *not* do firmware loads from the ->resume()
> callbacks.

The firmware API *allows* for resume() callbacks to use the firmware API in
ways which *should* not block much if any at all, however they must have first
at least called the firmware API once so that the device gets a devres entry
with a string associated; this is the firmware cache. Then upon suspend the
firmware API requests for each of these firmwares. If a resume() callback then
calls the firmware API for any of these files it would work without issue as
the firmware is loaded in the cache, as simple pointer reference. Processing
and consuming the firmware on the driver though *can* take a while and block
longer.

Because of the firmware cache implementation then we needed a whitelist, so the
check which I implemented in 81f95076281fd *only* complains if we've passed the
firmware cache check, and we *know* then the incoming call is new.

> But then it should literally check *that*: it could do
> something like
>
> WARN_ON_ONCE(current == resume_thread, "Firmware loading
> called synchronously during resume");

Sure that would be *much* smaller code.

Rafael do you have helpers for this sort of thing or are OK with them being
added?

> or whatever, exactly because it's obviously *not* ok to block the same
> process that is going to resume all the other devices that might be
> *needed* for the firmware loading.

Your interest in this seems to be the blocking implications on synchronous
calls. That's certainly a new concern I had not heard raised yet and is
worthwhile addressing. Are you not concerned in any way though about the
filesystem being ready?

> But on the other hand, if somebody then does an independent thread to
> resume their firmware, we could just block to wait for it - it
> wouldn't be the same kind of chicken-and-egg issue with IO at resume
> time.
>
> So instead of "arbitrary rules", there should be things that actually
> make sense.

The arbitrary rule here was hidden underneath code Rafael added years ago.
I'm pretty sure he did not intend to add the check on the direct FS path
as direct FS path came later as the code evolved, *but* we kept the lock
on the direct FS path so we have no option but to review carefully its
removal.

> The commit that Luis now argues for _also_ reverting makes a lot of
> sense to me to keep. I'm not seeing why that should be reverted, when
> the _only_ reason seems to be some spiteful "well, if you reverted one
> commit, you should randomly revert another one too".

I hope to have clarified there is no spite here, I'm simply being very
careful to avoid a regression with a real hidden functional change.

Luis