Re: [PATCH v3] interconnect: qcom: icc-rpmh: Add BCMs to commit list in pre_aggregate

From: Alex Elder
Date: Mon Apr 11 2022 - 11:59:18 EST


On 4/5/22 6:00 PM, Stephen Boyd wrote:
Quoting Georgi Djakov (2021-11-25 09:47:51)
From: Mike Tipton <mdtipton@xxxxxxxxxxxxxx>

We're only adding BCMs to the commit list in aggregate(), but there are
cases where pre_aggregate() is called without subsequently calling
aggregate(). In particular, in icc_sync_state() when a node with initial
BW has zero requests. Since BCMs aren't added to the commit list in
these cases, we don't actually send the zero BW request to HW. So the
resources remain on unnecessarily.

Add BCMs to the commit list in pre_aggregate() instead, which is always
called even when there are no requests.

Signed-off-by: Mike Tipton <mdtipton@xxxxxxxxxxxxxx>
[georgi: remove icc_sync_state for platforms with incomplete support]
Signed-off-by: Georgi Djakov <djakov@xxxxxxxxxx>

I'm back from vacation and am finally giving proper attention to
this. I want to make sure I understand the problem, because there
are (at least) two parts to it.

- The first problem you observe is that you are not seeing XO
shutdown on suspend on a Lazor device.
- You didn't say this directly but I think you are seeing this
on Linux v5.15.y (the 5.15 LTS branch), or perhaps on something
derived from that branch.
- You find that if you back-port (or cherry-pick?) the commit
that landed upstream as b95b668eaaa2 ("interconnect: qcom:
icc-rpmh: Add BCMs to commit list in pre_aggregate"), you
*do* see XO shutdown on suspend, as desired.

Here's what I understand that commit to do:
- In some cases, the bus clock managers (BCMs) are configured
by the boot loader so that some interconnects have non-zero
initial bandwidth.
- There is no sense in keeping an interconnect active if Linux
has nothing that requires its use. So we would like Linux to
ensure the configured bandwidth for an *unused* interconnect
is zero.
- Prior to that commit, BCM-managed hardware was only queued
to update its configuration when the ->aggregate interconnect
provider function was called. After that commit, updates were
queued by the ->pre_aggregate provider function.
- Unlike the ->aggregate callback, the ->pre_aggregate provider
function is always called, even when there are no requests, so
it queues updates to the hardware configuration whether or not
the interconnect has active users.
- The result of this commit is that the hardware configuration
for all defined BCM-managed interconnects is updated, and in
particular, the configured bandwidth for unused interconnects
is set to zero.
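
To make the difference concrete, here is a minimal user-space
sketch in C. It is a toy model under stated assumptions, not the
icc-rpmh code: every toy_* name, both structures, and the
queue_in_pre switch are made up for illustration. It models one
BCM-managed node whose boot-time bandwidth is non-zero, and shows
that with zero requests the zero-bandwidth update only reaches
"hardware" when queuing happens in the pre_aggregate step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_BCMS 8

/* Toy stand-in for a bus clock manager (hypothetical, not a kernel struct). */
struct toy_bcm {
	uint64_t hw_bw;   /* bandwidth currently programmed in "hardware" */
	uint64_t sum_bw;  /* bandwidth aggregated for the pending update */
	bool dirty;       /* true while the BCM sits on the commit list */
};

/* Toy commit list: BCMs whose configuration will be sent to hardware. */
struct toy_commit_list {
	struct toy_bcm *bcm[TOY_MAX_BCMS];
	size_t count;
};

/* Queue a BCM for the next commit, at most once. */
static void toy_queue(struct toy_commit_list *cl, struct toy_bcm *bcm)
{
	if (bcm->dirty)
		return;
	assert(cl->count < TOY_MAX_BCMS);
	bcm->dirty = true;
	cl->bcm[cl->count++] = bcm;
}

/* Runs once per node, even when the node has no requests. */
static void toy_pre_aggregate(struct toy_commit_list *cl,
			      struct toy_bcm *bcm, bool queue_here)
{
	bcm->sum_bw = 0;
	if (queue_here)
		toy_queue(cl, bcm);
}

/* Runs once per request, so never for a node with zero requests. */
static void toy_aggregate(struct toy_commit_list *cl, struct toy_bcm *bcm,
			  uint64_t bw, bool queue_here)
{
	bcm->sum_bw += bw;
	if (queue_here)
		toy_queue(cl, bcm);
}

/* Send every queued update to "hardware" and empty the list. */
static void toy_commit(struct toy_commit_list *cl)
{
	for (size_t i = 0; i < cl->count; i++) {
		cl->bcm[i]->hw_bw = cl->bcm[i]->sum_bw;
		cl->bcm[i]->dirty = false;
	}
	cl->count = 0;
}

/*
 * Walk one node that booted with boot_bw and has n requests.
 * queue_in_pre selects where the BCM is queued: in the pre_aggregate
 * step (true) or in the aggregate step (false).
 * Returns the bandwidth left programmed in "hardware".
 */
uint64_t toy_sync(uint64_t boot_bw, const uint64_t *req, size_t n,
		  bool queue_in_pre)
{
	struct toy_bcm bcm = { .hw_bw = boot_bw };
	struct toy_commit_list cl = { .count = 0 };

	toy_pre_aggregate(&cl, &bcm, queue_in_pre);
	for (size_t i = 0; i < n; i++)
		toy_aggregate(&cl, &bcm, req[i], !queue_in_pre);
	toy_commit(&cl);

	return bcm.hw_bw;
}
```

In this model, toy_sync(1000, NULL, 0, false) leaves the boot-time
value of 1000 in place because nothing was ever queued, while
toy_sync(1000, NULL, 0, true) drops it to 0. With one or more
requests both variants program the same sum.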

When unused interconnects are configured for zero bandwidth, they
do not require an active main XO clock, and so with this commit
it becomes possible for the XO clock to be shut down.

And that's why this commit addresses your XO shutdown problem on
the Linux 5.15 LTS branch.

Is the above an accurate description?

Looking at that branch, I see this commit: f753067494c27
("Revert "interconnect: qcom: icc-rpmh: Add BCMs to commit
list in pre_aggregate""). This shows that an attempt was made
to include this commit in the 5.15 LTS branch, but it caused
some *other* regressions, which suggests this might not be
easy to fix.

---

The second problem you have is exhibited by the IPA driver if
the "fix" commit (upstream b95b668eaaa2) is back-ported to the
Linux 5.10.y LTS branch (along with some other prerequisite
commits). We can conclude that applying the above commit
makes the bandwidth for an unused interconnect (or perhaps
the rate for the IPA core clock) get set to zero. And in that
case, an attempt to access IPA hardware leads to the crash you
observed.

The IPA driver does not implement runtime power management
until Linux v5.15. You later said you thought enabling that
might ensure the clock and interconnects were active when
needed by the IPA driver, and I concur (but there could be a
little more to it).

In any case, based on the time stamp in your log, it seems
this problem is likely occurring upon the first access to IPA
hardware.

I have a hunch about what might be happening here. There is
some synchronization that must occur between the AP and modem
when IPA is starting up. Until that synchronization step has
completed, we can't allow the IPA network device to be opened.
In later kernels I think this is precluded, but perhaps in
Linux v5.10 it isn't. Until I look a little more closely I'm
not sure what would happen, but it *could* be this.
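
As an illustration of what "precluded" could mean, here is a
minimal user-space sketch in C. It is purely hypothetical: the
toy_ipa structure, its fields, and the toy_* functions are
invented for this example and are not taken from the IPA driver
in any kernel version. The idea is only that the open path
refuses to touch hardware until the setup step has completed.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical driver state; none of these fields come from the IPA driver. */
struct toy_ipa {
	bool setup_complete;  /* set once the AP/modem synchronization is done */
	bool hw_powered;      /* clock and interconnects are active */
	int hw_accesses;      /* counts successful register accesses */
};

/*
 * Stand-in for a register access. On real hardware, touching an
 * unpowered block can raise an external abort; here it returns -EIO.
 */
static int toy_hw_access(struct toy_ipa *ipa)
{
	if (!ipa->hw_powered)
		return -EIO;
	ipa->hw_accesses++;
	return 0;
}

/* Open path that refuses to reach hardware before setup has completed. */
int toy_ipa_open(struct toy_ipa *ipa)
{
	assert(ipa != NULL);

	if (!ipa->setup_complete)
		return -EBUSY;

	return toy_hw_access(ipa);
}
```

With such a guard, an early open fails with an error code instead
of reaching hardware whose clock or interconnect bandwidth may
have been dropped to zero. Whether the newer kernels do anything
like this is exactly what still needs to be checked.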

I'm going to look a little at how the particular access that
caused the crash is prevented in newer kernels. It could
be that back-porting that (or re-implementing it for the
older kernel) will address the crash you're seeing.

-Alex

This patch fixes suspend/resume for me on sc7180-trogdor-lazor. Without
it I can't achieve XO shutdown. It seems that it fixes the sync_state
support that was added in commit b1d681d8d324 ("interconnect: Add sync
state support"). Before that commit suspend worked because the
interconnect wasn't maxed out at boot. After that commit we started
maxing out the interconnect state and never dropping it.

It would be good to pick this back to stable kernels so we have a
working suspend/resume on LTS kernels. I tried picking it back to
5.10.109 (latest 5.10 LTS) and booting it on my Lazor w/ LTE device but
it crashes at boot pretty reliably in the IPA driver. Interestingly I
can't get it to crash on 5.15.32 when I pick it back, so maybe something
has changed between 5.10 and 5.15 for IPA? I'll try to bisect it.

[ 23.708432] Internal error: synchronous external abort: 96000010 [#1] PREEMPT SMP
[ 23.708451] Modules linked in: veth rfcomm algif_hash
algif_skcipher af_alg uinput xt_MASQUERADE uvcvideo videobuf2_vmalloc
venus_enc venus_dec videobuf2_dma_sg videobuf2_memops venus_core
v4l2_mem2mem videobuf2_v4l2 cros_ec_typec videobuf2_common hci_uart
typec btqca qcom_q6v5_mss ipa qcom_pil_info qcom_q6v5 qcom_common
rmtfs_mem ip6table_nat fuse 8021q bluetooth ecdh_generic ecc
ath10k_snoc ath10k_core ath lzo_rle lzo_compress mac80211 zram
cfg80211 r8152 mii joydev
[ 23.708565] CPU: 5 PID: 3706 Comm: mmdata_mgr Not tainted 5.10.109+ #61
[ 23.708571] Hardware name: Google Lazor (rev1+) with LTE (DT)
[ 23.708578] pstate: 60400009 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[ 23.708597] pc : gsi_channel_start+0x78/0x1dc [ipa]
[ 23.708609] lr : gsi_channel_start+0x4c/0x1dc [ipa]
[ 23.708614] sp : ffffffc013d9ba20
[ 23.708619] x29: ffffffc013d9ba20 x28: 0000000000000000
[ 23.708628] x27: 0000000000000000 x26: ffffffc013d9bc20
[ 23.708637] x25: 000000000001c000 x24: 0000000000000000
[ 23.708646] x23: ffffffab00cb9410 x22: 00000000712dcf80
[ 23.708654] x21: ffffffab486bc148 x20: ffffffab486b8a18
[ 23.708663] x19: ffffffab486b8000 x18: 00000000ffff0a00
[ 23.708671] x17: 000000002f7254f1 x16: ffffffeb3db6f344
[ 23.708680] x15: 00000000ffee6094 x14: ffffffffffffffff
[ 23.708689] x13: 0000000000000010 x12: 0101010101010101
[ 23.708697] x11: 000000000000013c x10: 0000000000000000
[ 23.708706] x9 : 000000000001c000 x8 : ffffffc018f1c000
[ 23.708715] x7 : fefefefefeff2f60 x6 : 0000808080808080
[ 23.708724] x5 : 0000000000000000 x4 : 8080808080800000
[ 23.708732] x3 : 0000000000000000 x2 : ffffffab5089eac0
[ 23.708741] x1 : 0000000000000000 x0 : 0000000000000000
[ 23.708750] Call trace:
[ 23.708762] gsi_channel_start+0x78/0x1dc [ipa]
[ 23.708773] ipa_endpoint_enable_one+0x34/0xc0 [ipa]
[ 23.708785] ipa_open+0x30/0x98 [ipa]
[ 23.708795] __dev_open+0xd8/0x190
[ 23.708803] __dev_change_flags+0xbc/0x1c8
[ 23.708810] dev_change_flags+0x30/0x70
[ 23.708818] devinet_ioctl+0x274/0x500
[ 23.708824] inet_ioctl+0x10c/0x394
[ 23.708831] sock_do_ioctl+0x58/0x324
[ 23.708837] compat_sock_ioctl+0x238/0xdb0
[ 23.708845] __arm64_compat_sys_ioctl+0xcc/0x104
[ 23.708854] el0_svc_common+0xec/0x1dc
[ 23.708860] do_el0_svc_compat+0x28/0x54
[ 23.708868] el0_svc_compat+0x10/0x1c
[ 23.708874] el0_sync_compat_handler+0xc0/0xf0
[ 23.708880] el0_sync_compat+0x184/0x1c0
[ 23.708890] Code: 51286129 53037d29 1b166529 8b090108 (b9400108)

Note I had to pick a handful of other patches for nvmem to get normal
boot on 5.10.109. I'll send those over to stable maintainers shortly.

fd3bb8f54a88 ("nvmem: core: Add support for keepout regions")
de0534df9347 ("nvmem: core: fix error handling while validating
keepout regions")
044ee8f85267 ("nvmem: qfprom: Don't touch certain fuses")
437145dbcdee ("arm64: dts: qcom: sc7180: Add soc-specific qfprom
compat string")
437cdef515e2 ("arm64: dts: qcom: sc7180:: modified qfprom CORR size
as per RAW size")