Re: [PATCH] regulator: qcom-rpmh: Revert "regulator: qcom-rpmh: Use PROBE_FORCE_SYNCHRONOUS"

From: Doug Anderson
Date: Tue May 16 2023 - 17:24:27 EST


Hi,

On Tue, May 16, 2023 at 11:12 AM Amit Pundir <amit.pundir@xxxxxxxxxx> wrote:
>
> On Mon, 15 May 2023 at 20:33, Doug Anderson <dianders@xxxxxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Mon, May 15, 2023 at 7:42 AM Amit Pundir <amit.pundir@xxxxxxxxxx> wrote:
> > >
> > > On Sun, 14 May 2023 at 18:11, Caleb Connolly <caleb.connolly@xxxxxxxxxx> wrote:
> > > >
> > > >
> > > >
> > > > On 13/05/2023 18:08, Amit Pundir wrote:
> > > > > On Fri, 24 Mar 2023 at 19:05, Douglas Anderson <dianders@xxxxxxxxxxxx> wrote:
> > > > >>
> > > > >> This reverts commit 58973046c1bf ("regulator: qcom-rpmh: Use
> > > > >> PROBE_FORCE_SYNCHRONOUS"). Further digging into the problems that
> > > > >> prompted the us to switch to synchronous probe showed that the root
> > > > >> cause was a missing "rootwait" in the kernel command line
> > > > >> arguments. Let's reinstate asynchronous probe.
> > > > >
> > > > > Hi, the asynchronous probe is broken on Dragonboard 845c (SDM845)
> > > > > running AOSP (Android Open Source Project) with v6.4-rc1
> > > > > https://bugs.linaro.org/show_bug.cgi?id=5975.
> > > > > Can we please go back to synchronous probe.
> > > > >
> > > > > AOSP do not make use of rootwait, IIRC, but it is added by the
> > > > > bootloader anyway. And the device fails to boot AOSP regardless of
> > > > > "rootwait" bootarg being present or not.
> > > >
> > > > Could you try applying this diff to enable some log spam and let me know
> > > > what you get? I'm keen to try and figure this one out. My mail client
> > > > might crunch this a bit so I have pasted it here too
> > > > https://p.calebs.dev/ab74b7@raw
> > >
> > > These prints add just enough delay for the UFS probe to succeed that I
> > > can't reproduce the failure anymore.
> >
> > I'd prefer doing at least a little debugging before jumping to a
> > revert. From looking at your dmesg [1], it looks as if the async probe
> > is allowing RPMH to probe at the same time as "qcom-vadc-common".
> > That's something that talks on the SPMI bus and is (potentially)
> > talking to the same PMICs that RPMH-regulator is, right? I'm by no
> > means an expert on how Qualcomm's PMICs work, but it seems plausible
> > that the "qcom-vadc-common" is somehow causing problems and screwing
> > up RPMH. Does that seem plausible to you?
> >
> > If so, one interesting way to track it down would be to move around
> > delays. Put ~500ms sleep at the _end_ of vadc_probe(). Presumably that
> > _won't_ fix the problem. Now put a ~500ms sleep at the beginning of
> > vadc_probe(). Maybe that will fix the problem? If so, you can move the
> > delay around to narrow down the conflict. My wild guess would be that
> > vadc_reset() could be throwing things for a loop?
> >
> > If the above doesn't work, maybe we could add more tracing / printouts
> > to see what is probing at the same time as RPMH?
>
> Tried out a few changes today but none of them worked or were
> effective enough to debug this crash further, other than setting
> fw_devlink=permissive.
>
> Adding more tracing / prints (BOOTTIME_TRACING and
> FUNCTION_GRAPH_TRACER) didn't work and didn't help in reproducing the
> crash either. They added just enough delay to boot the device
> successfully everytime.
>
> I tried to reason with the kernel modules which gets loaded before and
> after the qcom-rpmh-regulator (QCOM_REBOOT_MODE, QCOM_PON, IIO/VADC,
> SPMI_PMIC* etc) as suggested, but I run into the same crash even if I
> disable those driver modules. So I don't think that the other driver
> modules which gets loaded at around the same time as
> qcom-rpmh-regulator by default have any impact on this failure.

Ugh, Heisenbugs are no fun to debug. :( It sorta sounds as if pretty
much anything you can do to change the timing fixes you. That does
make reverting the async probe of the regulator less appealing. If, as
you said, it's not just some other driver loading at the same time
that's interfering then the revert "fixes" you in the same way that a
"msleep" would fix you. That doesn't seem like enough of a
justification for the revert to me.

It still feels to me like _something_ is happening at the same time as
the RPMH regulator driver is loading, though, I'm just not sure how to
suggest debugging it. I guess other thoughts:

* When RPMH complains, is it always with the same regulator (lvs1), or
sometimes different ones? Any clue there?

* How much can you control module loading order? If rpmh-regulator
module loads first, does it "fix" things? If it does, maybe you could
bisect to find the place where problems start cropping up. Does that
give any clues?


> The only way I can boot successfully everytime is if I boot with
> fw_devlink=permissive bootarg. So I'll have to check if there is any
> new dependency which got added recently in DT or somewhere else that
> is causing this breakage.

I guess I'll assume that `fw_devlink=permissive` only fixes you
because it changes the timing...