On Mon, Aug 9, 2021 at 7:52 AM Akhil P Oommen <akhilpo@xxxxxxxxxxxxxx> wrote:The power domains (for cx and gx) are defined in the GMU DT, the OPPs in the GPU DT. For the sake of simplicity I'll refer to the lowest frequency (257000000) and OPP level (RPMH_REGULATOR_LEVEL_LOW_SVS) as the "min" state, and the highest frequency (710000000) and OPP level (RPMH_REGULATOR_LEVEL_TURBO_L1) as the "max" state. These are defined in sdm845.dtsi under the gpu node.
On 8/8/2021 10:22 PM, Rob Clark wrote:
On Sun, Aug 8, 2021 at 7:33 AM Caleb Connolly <caleb.connolly@xxxxxxxxxx> wrote:
On 07/08/2021 21:04, Rob Clark wrote:
On Sat, Aug 7, 2021 at 12:21 PM Caleb Connolly
<caleb.connolly@xxxxxxxxxx> wrote:
Hi Rob, Akhil,
On 29/07/2021 21:53, Rob Clark wrote:
On Thu, Jul 29, 2021 at 1:28 PM Caleb ConnollyIt looks like this affects at least the OnePlus 6 and PocoPhone F1, I've done some more poking and the following diff
<caleb.connolly@xxxxxxxxxx> wrote:
On 29/07/2021 21:24, Rob Clark wrote:
On Thu, Jul 29, 2021 at 1:06 PM Caleb ConnollyOh yeah that sounds like a more sensible workaround than mine .
<caleb.connolly@xxxxxxxxxx> wrote:
Hi Rob,
I've done some more testing! It looks like before that patch ("drm/msm: Devfreq tuning") the GPU would never get above
the second frequency in the OPP table (342MHz) (at least, not in glxgears). With the patch applied it would more
aggressively jump up to the max frequency which seems to be unstable at the default regulator voltages.
*ohh*, yeah, ok, that would explain it
Hacking the pm8005 s1 regulator (which provides VDD_GFX) up to 0.988v (instead of the stock 0.516v) makes the GPU stable
at the higher frequencies.
Applying this patch reverts the behaviour, and the GPU never goes above 342MHz in glxgears, losing ~30% performance in
glxgear.
I think (?) that enabling CPR support would be the proper solution to this - that would ensure that the regulators run
at the voltage the hardware needs to be stable.
Is hacking the voltage higher (although ideally not quite that high) an acceptable short term solution until we have
CPR? Or would it be safer to just not make use of the higher frequencies on a630 for now?
tbh, I'm not sure about the regulator stuff and CPR.. Bjorn is already
on CC and I added sboyd, maybe one of them knows better.
In the short term, removing the higher problematic OPPs from dts might
be a better option than this patch (which I'm dropping), since there
is nothing stopping other workloads from hitting higher OPPs.
Perhaps you won the sillicon lottery - iirc sdm850 is binned for higher clocks as is out of the factory.
I'm slightly curious why I didn't have problems at higher OPPs on my
c630 laptop (sdm850)
Would it be best to drop the OPPs for all devices? Or just those affected? I guess it's possible another c630 might
crash where yours doesn't?
I've not heard any reports of similar issues from the handful of other
folks with c630's on #aarch64-laptops.. but I can't really say if that
is luck or not.
seems to fix the stability issues completely, it seems the delay is required to let the update propagate.
This doesn't feel like the right fix, but hopefully it's enough to come up with a better solution than disabling the new
devfreq behaviour on a630.
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index d7cec7f0dde0..69e2a5e84dae 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -139,6 +139,10 @@ void a6xx_gmu_set_freq(struct msm_gpu *gpu, struct dev_pm_opp *opp)
return;
}
+ dev_pm_opp_set_opp(&gpu->pdev->dev, opp);
+
+ usleep_range(300, 500);
+
I am a bit confused. We don't define a power domain for gpu in dt,
correct? Then what exactly set_opp do here? Do you think this usleep is
what is helping here somehow to mask the issue?
I believe this hacky workaround expresses the root cause of the issue quite clearly, by setting the OPP first and allowing the gx regulator to become stable before telling the GPU to change clock speeds, we avoid the edge case and prevent the crashes.
Hmm, I thought "opp-level = RPMH_REGULATOR_LEVEL_*" did *something*,
but tbh I'm not sure exactly what..
I feel we should just leave the new dcvs feature (shall we call it NAP?)
disabled for a630 (and 10ms devfreq interval), until this is root caused.
Ack
I suppose "NAP" is a reasonable name.
But I think that reverting to previous behavior would not be enough,
there is nothing stopping devfreq from jumping from min to max freq,
which AFAIU should be enough to trigger this. I guess that there just
hasn't been enough testing with different game workloads on those
phones to trigger this.
My db845c arrives this week, I'll definitely try and reproduce this.
That said, I haven't seen similar issues on my sdm850 laptop, where I
defn have triggered mix->max freq transitions.. I guess it would be
interesting to know if this issue could be reproduced on db845c, or if
it really is board specific?
Based on my reasoning above, I came up with the following: reducing thrashing by preventing rapid idle/active transitions. The minimum active time of 30ms was just used for testing, I think some number between 2 and 4 frames would be a sensible choice - the higher the safer.
To workaround, I think we'd need to implement some way to limit that
maximum frequency jump (and then use delayed work to continue ramping
up the freq over time until we hit the target).. which seems like a
lot of work if this is just a board(s) specific workaround and isn't
needed once CPR is supported
BR,
-R
Hmm, this is going to be in the critical path on idle -> activeThat would be a workaround, however I'd really like to avoid limiting performance as a solution if I can help it,
transition (ie. think response time to user-input).. so we defn don't
want to do this unconditionally..
If I understand the problem, we just want to limit how far we jump the
gpu freq in one go.. maybe deleting the lowest (and perhaps highest)
OPP would accomplish that? Could that be done in the board(s)'s
toplevel dts files?
especially as the fix might just be "set the opp first, wait for it to apply, then set the core clock".
Is there a sensible way to get a callback from the opp notify chain? Or from rpmh directly? Or is this solution really
not the right way to go?
It does seem a bit strange to me that we are telling GMU to change
freq before calling dev_pm_opp_set_opp().. if dev_pm_opp_set_opp() is
increasing voltage, it seems like you'd want to do that *before*
increasing freq (but reverse the order when decreasing freq).. But I'm
not an expert on the ways of the GMU.. maybe Akhil or Jordan knows
better how this is supposed to work.
For legacy gmu, we trigger DCVS using DCVS OOB which comes later in this
function. But the order between regulator and clock which you mentioned
is correct.
But the delay seems like papering something over, and I'm trying to go
in the other direction and reduce latency between user input and
pageflip..
BR,
-R
BR,
-R
gmu_write(gmu, REG_A6XX_GMU_DCVS_ACK_OPTION, 0);
gmu_write(gmu, REG_A6XX_GMU_DCVS_PERF_SETTING,
@@ -158,7 +162,6 @@ void a6xx_gmu_set_freq(struct msm_gpu *gpu, struct dev_pm_opp *opp)
if (ret)
dev_err(gmu->dev, "GMU set GPU frequency error: %d\n", ret);
- dev_pm_opp_set_opp(&gpu->pdev->dev, opp);
pm_runtime_put(gmu->dev);
}
Maybe just remove it for affected devices? But I'll defer to Bjorn.
BR,
-R
--
Kind Regards,
Caleb (they/them)
--
Kind Regards,
Caleb (they/them)