Re: [PATCH RFC 01/10] PM: runtime: Add helpers to resume consumers
From: Rafael J. Wysocki
Date: Wed Apr 09 2025 - 13:55:41 EST
Hi,
On Fri, Mar 28, 2025 at 10:59 AM Miquel Raynal
<miquel.raynal@xxxxxxxxxxx> wrote:
>
> Hello Rafael,
>
> >> The runtime PM core currently allows runtime resuming/suspending a
> >> device, or its suppliers.
> >>
> >> Let's make it also possible to runtime resume/suspend consumers.
> >>
> >> Consumers and suppliers are seen here through the description made by
> >> device_links.
> >
> > It would be good to explain why all of this is needed.
> >
> > I gather that it is used for resolving some synchronization issues in
> > the clk framework, but neither the cover letter nor this changelog
> > explains how it is used.
>
> The explanation is quite long; there have already been three full
> threads from people attempting to fix a problem that resides in the
> clock subsystem (and that is probably present in other subsystems too,
> just not uncovered so far). I don't know if you took the time to read
> the cover letter:
> https://lore.kernel.org/linux-clk/20250326-cross-lock-dep-v1-0-3199e49e8652@xxxxxxxxxxx/
> It tries to explain the problem and the approach taken to fix it, but
> let me try to give a runtime-PM-focused view of it here.
>
> [Problem]
>
> We have an ABBA locking situation between clk and any other subsystem
> that might be in use during runtime_resume() operations, provided that
> those subsystems also make clk calls at some point. The usual suspects
> here are power domains.
>
> There are different approaches that can be taken but the one that felt
> the most promising when we discussed it during last LPC (and also the
> one that was partially implemented in the clk subsystem already for a
> tiny portion of it) is the rule that "subsystem locks should not be kept
> acquired while calling in some other subsystems".
>
> Typically in the clk subsystem the logic is:
>
> func() {
>         mutex_lock(clk);
>         runtime_resume(clk);
>         ...
> }
>
> Whereas what would definitely work without locking issues is the
> opposite:
>
> func() {
>         runtime_resume(clk);
>         mutex_lock(clk);
>         ...
> }
>
> Of course life is not so simple, and the clock core is highly
> recursive, which means inverting the two calls like I hinted above
> simply does not work as we go deeper in the subcalls. As a result, we
> need to runtime resume *all* the relevant clocks in advance, before
> calling functions recursively (the lock itself is allowed to re-enter
> and is not blocking in this case).
>
> I followed all possible paths in the clock subsystem and identified
> four main cases. The list of clocks we need to runtime resume in
> advance can be either:
> 1- the parent clocks
> 2- the child clocks
> 3- the parent and child clocks
> 4- all the clocks (typically for debugfs/sysfs purposes).
>
> [Solution 1: discarded]
>
> The first approach was to do some guessing based on the clock tree
> topology. Unfortunately this approach does not stand, because it is
> virtually unbounded: in order to know the clock topology we must
> acquire the main clock lock, but in order to runtime resume we must
> release it, so the walk may have to be restarted again and again (even
> though in practice it would converge at some point). So this approach
> was discarded by Steven.
>
> [Solution 2: this proposal]
>
> After the LPC discussion with Steven, I also discussed this with
> Saravana, and he pointed out that since we are using fw_devlink=rpm by
> default now, all providers, including clock controllers of course,
> would already be runtime resumed the first time we made a
> runtime_resume(clk) call, so all the nested calls were no longer
> needed. This native mechanism already addresses point #1 above (and
> partially point #3), and all I had to do was to make a similar helper
> for point #2.
So this depends on DT being used and fw_devlink=rpm being used, doesn't it?
You cannot really assume in general that there will be device links
between parents and children.