Re: [PATCH v4 00/79] Address some issues with PM runtime at media subsystem

From: Johan Hovold
Date: Thu Apr 29 2021 - 11:17:08 EST


On Thu, Apr 29, 2021 at 12:18:16PM +0200, Mauro Carvalho Chehab wrote:
> Em Wed, 28 Apr 2021 17:50:08 +0200
> Johan Hovold <johan@xxxxxxxxxx> escreveu:
>
> > On Wed, Apr 28, 2021 at 04:51:21PM +0200, Mauro Carvalho Chehab wrote:
>
> > > 1. despite its name, this is actually a PM runtime resume call,
> > > but some developers didn't seem to realize that, as I got this
> > > pattern on some drivers:
> > >
> > > pm_runtime_get_sync(&client->dev);
> > > pm_runtime_disable(&client->dev);
> > > pm_runtime_set_suspended(&client->dev);
> > > pm_runtime_put_noidle(&client->dev);
> > >
> > > It makes no sense to resume PM just to suspend it again ;-)
> >
> > This is perfectly alright. Take a look at ov7740_remove() for example:
> >
> > pm_runtime_get_sync(&client->dev);
> > pm_runtime_disable(&client->dev);
> > pm_runtime_set_suspended(&client->dev);
> > pm_runtime_put_noidle(&client->dev);
> >
> > ov7740_set_power(ov7740, 0);
> >
> > There's an explicit power-off after balancing the PM count and this will
> > work regardless of the power state when entering this function.
>
> Ok, but this should equally work:
>
> pm_runtime_disable(&client->dev);
> pm_runtime_set_suspended(&client->dev);
>
> ov7740_set_power(ov7740, 0);
>
> as there's no additional cleanup made on this particular driver
> between pm_runtime_get_sync() and pm_runtime_put_noidle().

No, that would break the driver as I pointed out to you yesterday:

https://lore.kernel.org/r/YImG1klSPkFSaS3a@xxxxxxxxxxxxxxxxxxxx

If the device is already suspended when remove is called then you'll
end up with an unbalanced call to ov7740_set_power() that will try to
disable an already disabled clock.

> > So this has nothing to do with pm_runtime_get_sync() per se.
>
> Yes, but some patches on this series are cleaning up the driver release
> logic.

You mentioned this example as an argument against using
pm_runtime_get_sync(), which I don't think makes sense.

> > > 2. Usual *_get() methods only increment their use count on success,
> > > but pm_runtime_get_sync() increments it unconditionally. Due to
> > > that, several drivers were mistakenly not calling
> > > pm_runtime_put_noidle() when it fails;
> >
> > Sure, but pm_runtime_get_async() also works this way. You just won't be
> > notified if the async resume fails.
>
> Granted, it makes sense along the pm_runtime kAPI.
>
> It is inconsistent with the behavior of kobject_get*() and other
> *_get*() methods that are based or inspired on it, as, on those, the
> operations are atomic: either everything succeeds and it doesn't return
> an error, or the usage counter is not incremented and the object
> state doesn't change after the call.

Right, and I'm aware that some people have overlooked this. But its not
the end of the world since hardly any driver can handle resume failures
properly anyway.

This is mostly just an exercise to shut up static checkers.

> > > 3. The name of the new variant is a lot clearer:
> > > pm_runtime_resume_and_get()
> > > As its same clearly says that this is a PM runtime resume function,
> > > that also increments the usage counter on success;
> >
> > It also introduced an inconsistency in the API and does not pair as well
> > with the pm_runtime_put variants.
>
> Agreed. A name that would be more consistent with PM runtime would
> probably be:
>
> pm_runtime_resume_if_get()

Naw, since the get part always succeeds.

It should start with pm_runtime_get, but pm_runtime_get_sync() is
unfortunately taken.

> as there are already:
>
> pm_runtime_get_if_in_use()
> pm_runtime_get_if_active()
>
> But any such discussions are out of the scope of this patchset ;-)

Right.

> > > 4. Consistency: we did similar changes subsystem wide with
> > > for instance strlcpy() and strcpy() that got replaced by
> > > strscpy(). Having all drivers using the same known-to-be-safe
> > > methods is a good thing;
> >
> > It's not known to be safe; there are ways to get also this interface
> > wrong as for example this series has shown.
>
> Very true. Yet, it is a lot simpler to use functions that won't change
> the state of the objects when returning an error, as this is by far
> the most common pattern within the Kernel.

A resume failure does change the state (and needs to be recovered from),
but I get what you're saying.

> Human brains are trained to identify certain patterns. When there's
> something using a similar pattern, but with a different behavior,
> our brains are more subject to fail identifying problems.

Sure. But I'm not sure that having two interfaces with different
semantics to do the job is doing us any favours here. But again, that
discussion has already been had.

And I realise that this is partly also your motive here (even if the old
interface isn't going to go away).

> > > compile-tested only.
> > > Patches 1 to 7 fix some issues that already exists at the current
> > > PM runtime code;
> > >
> > > patches 8 to 20 fix some usage_count problems that still exists
> > > at the media subsystem;
> > >
> > > patches 21 to 78 repaces pm_runtime_get_sync() by
> > > pm_runtime_resume_and_get();
> > >
> > > Patch 79 (and a hunk on patch 78) documents the two exceptions
> > > where pm_runtime_get_sync() will still be used for now.

80 patches in one series (posted to lkml) is a bit excessive. Perhaps
you can break it up in a fixes part and one or more cleanups parts?

Johan