Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate

From: Stephen Boyd
Date: Tue Apr 12 2022 - 00:47:58 EST


Quoting Stephen Boyd (2022-04-11 13:20:27)
>
> I think it's some IPA unclocked access because it's an SError async
> abort. I looked some at the IPA clk implementation and it is a little
> concerning. I see two problems. The first problem is that clk-rpmh is
> only voting on the active state for the IPA clk. Maybe RPMh will move
> the sleep/wake state over from the active state automatically as long as
> we never make a request in either of those states? I don't know, but it
> looks concerning and either needs to be fixed or a big comment
> indicating that the copy happens. This patch makes it set a frequency in
> each state bucket.
>

This patch doesn't seem to matter. I think that's because RPMh copies
over active state settings to sleep and wake state automatically if
those states are never changed by that processor. Someone at qcom needs
to confirm this theory. I'll send the patch and see if that spurs
someone at qcom to respond.

>
> The second problem I see is that the IPA clk resource is "IP0" but it is
> still defined in the interconnect/qcom/sc7180.c file.
>
> $ git grep \"IP0\" -- drivers/{interconnect,clk}/
> drivers/clk/qcom/clk-rpmh.c:DEFINE_CLK_RPMH_BCM(sdm845, ipa, "IP0");
> drivers/clk/qcom/clk-rpmh.c:DEFINE_CLK_RPMH_BCM(sdx55, ipa, "IP0");
> drivers/interconnect/qcom/sc7180.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave);
> drivers/interconnect/qcom/sc8180x.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &slv_ipa_core_slave);
> drivers/interconnect/qcom/sdx55.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave);
> drivers/interconnect/qcom/sm8150.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave);
> drivers/interconnect/qcom/sm8250.c:DEFINE_QBCM(bcm_ip0, "IP0", false, &ipa_core_slave);
>
> That sounds pretty bad because it means both interconnect and clk frameworks
> are trying to control the same RPMh resource, trampling over each other.
> Combine that with unused clks and sync_state support and I don't know
> what the state of the IP0 resource really is anymore.

Alright, some more debugging has confirmed this. I put a print in the
rpmh driver to figure out when the IP0 resource, according to cmd-db, is
being modified. Luckily, the interconnect driver uses the rpmh batch API
while the clk driver uses the direct write API. I put a dump_stack()
when the batch API is called on the IP0 address and that sometimes gets
zeroed out (i.e. IPA clk is turned off) after the rpmh clk driver turned
it ON. The rpmh driver in the kernel doesn't do any aggregation between
kernel clients, so the problem is this sequence:

IPA driver probes
runtime PM get()
IPA clk enabled -> IP0 resource is ON
interconnect zeroes out the IP0 resource -> IP0 resource is off
IPA driver tries to access a register and blows up

The interconnect framework is zeroing it out now because there's a
sync_state callback once all drivers have probed. This is why I saw it
more easily when the IPA driver is builtin vs. a module. The IPA driver
is much more likely to have turned on the resource before sync_state is
called, but the use of autosuspend made it variable. If the IPA driver
autosuspend callback is delayed just enough, then IPA will turn on the
IP0 resource via the clk API, sync state will turn it off from the
interconnect framework, and then the pm_runtime_get_sync() in the IPA
driver will skip powering the clk on again because it never turned it
off. The runtime PM state of the device is correct, but the clk is off.
Oops!

I think changing the IPA_AUTOSUSPEND_DELAY define to be multiple seconds
makes it even more likely, because then the IPA device definitely won't
be suspended by the time interconnect sync state happens. Anyway, this
also explains why the IPA runtime PM patch made things better. Sometimes
the IPA device will be runtime suspended, and thus it will power on the
IP0 resource on runtime resume even after interconnect sync state
happened. This is the situation on v5.10.y, where runtime PM isn't
happening at all, but sync state is once commit b95b668eaaa2
("interconnect: qcom: icc-rpmh: Add BCMs to commit list in
pre_aggregate") is included. The IP0 resource is guaranteed to be turned
off after sync state drops the request.

So the fix is simple, completely remove IP0 from the interconnect driver
so that sync state doesn't turn it off. Then the IPA driver without
runtime PM will work because the clk resource is turned on and kept on
until system suspend. When the runtime PM patch in IPA is applied, it
will also work because runtime PM will power things on correctly, or
keep them powered on because autosuspend is delayed.

I'll send this patch to the list shortly. And split it up for the other
drivers too.