Re: [PATCH] cpufreq: CPPC: Resolve the large frequency discrepancy from cpuinfo_cur_freq
From: Ionela Voinescu
Date: Tue Feb 13 2024 - 18:03:11 EST
Hi,
On Friday 09 Feb 2024 at 11:55:08 (+0100), Beata Michalska wrote:
> On Tue, Feb 06, 2024 at 04:02:15PM +0800, lihuisong (C) wrote:
[..]
> > > > > >
> > > > > > > Would you be able to test [1] on your platform and use case?
> > > > > > I have tested it on my platform (CPU count: 64, SMT: off and CPU base
> > > > > > frequency: 2.7GHz).
> > > > > > According to the test results:
> > > > > > 1> I found that patches [1] and [2] cannot cover the non-housekeeping CPUs.
> > > > > > They still face the large frequency discrepancy issue my patch
> > > > > > mentioned.
> > > > > > 2> Additionally, the frequency values of all CPUs are almost the same
> > > > > > when using the 'arch_freq_scale' factor approach. I'm not sure if that is OK.
> > > > > >
> > > > > > Patch [1] has been modified slightly, as below:
> > > > > > -->
> > > > > > @@ -1756,7 +1756,10 @@ static unsigned int cpufreq_verify_current_freq(struct cpufreq_policy *policy, bool update)
> > > > > > {
> > > > > > unsigned int new_freq;
> > > > > >
> > > > > > - new_freq = cpufreq_driver->get(policy->cpu);
> > > > > > + new_freq = arch_freq_get_on_cpu(policy->cpu);
> > > > > > + if (!new_freq)
> > > > > > + new_freq = cpufreq_driver->get(policy->cpu);
> > > > > > +
> > > > > As pointed out, this change will not make it into the next version of the patch,
> > > > > so I'd say you can safely ignore it and assume that arch_freq_get_on_cpu will
> > > > > only be wired up for the scaling_cur_freq/cpuinfo_cur_freq sysfs nodes.
> > > > > > if (!new_freq)
> > > > > > return 0;
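> > > > > FWIW, the sysfs-only wiring would look roughly like the sketch below
> > > > > (illustrative only; show_cpuinfo_cur_freq/__cpufreq_get are the current
> > > > > cpufreq.c names, and the exact form may differ in the next version):
> > > > >
> > > > > static ssize_t show_cpuinfo_cur_freq(struct cpufreq_policy *policy,
> > > > > 				     char *buf)
> > > > > {
> > > > > 	/* Try the tick-sampled, counter-based estimate first ... */
> > > > > 	unsigned int cur_freq = arch_freq_get_on_cpu(policy->cpu);
> > > > >
> > > > > 	/* ... and fall back to the driver's ->get() if it is unavailable. */
> > > > > 	if (!cur_freq)
> > > > > 		cur_freq = __cpufreq_get(policy);
> > > > >
> > > > > 	if (cur_freq)
> > > > > 		return sprintf(buf, "%u\n", cur_freq);
> > > > >
> > > > > 	return sprintf(buf, "<unknown>\n");
> > > > > }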
> > > > > >
> > > > > > And the results are as follows:
> > > > > > *Case 1:* *nohz_full not set and cpufreq using the performance governor*
> > > > > > *--> Step 1: *read 'cpuinfo_cur_freq' with no pressure
> > > > > > 0: 2699264 2: 2699264 4: 2699264 6: 2699264
> > > > > > 8: 2696628 10: 2696628 12: 2696628 14: 2699264
> > > > > > 16: 2699264 18: 2696628 20: 2699264 22: 2696628
> > > > > > 24: 2699264 26: 2696628 28: 2699264 30: 2696628
> > > > > > 32: 2696628 34: 2696628 36: 2696628 38: 2696628
> > > > > > 40: 2699264 42: 2699264 44: 2696628 46: 2696628
> > > > > > 48: 2696628 50: 2699264 52: 2699264 54: 2696628
> > > > > > 56: 2696628 58: 2696628 60: 2696628 62: 2696628
> > > > > > 64: 2696628 66: 2699264 68: 2696628 70: 2696628
> > > > > > 72: 2699264 74: 2696628 76: 2696628 78: 2699264
> > > > > > 80: 2696628 82: 2696628 84: 2699264 86: 2696628
> > > > > > 88: 2696628 90: 2696628 92: 2696628 94: 2699264
> > > > > > 96: 2696628 98: 2699264 100: 2699264 102: 2696628
> > > > > > 104: 2699264 106: 2699264 108: 2699264 110: 2696628
> > > > > > 112: 2699264 114: 2699264 116: 2699264 118: 2699264
> > > > > > 120: 2696628 122: 2699264 124: 2696628 126: 2699264
> > > > > > Note: the frequencies of all CPUs are almost the same.
> > > > > Were you expecting smth else?
> > > > The frequency of each CPU might have a different value.
> > > > The values for all CPUs are the same under high pressure.
> > > > I don't know what the behavior is on other platforms.
> > > > Do you know who else has tested it?
> > > So I might have rushed a bit with my previous comment/question: apologies for
> > > that.
> > > The numbers above: those are on a fairly idle/lightly loaded system, right?
> > Yes.
> > > Would you mind having another go with just the arch_freq_get_on_cpu
> > > implementation being added and dropping the changes in cpufreq, and
> > All my tests were done with the "performance" cpufreq policy and the OS not
> > under high load, reading "scaling_cur_freq" or "cpuinfo_cur_freq" for each
> > physical core on the platform.
> >
> > The test results for "cpuinfo_cur_freq" with your changes on a fairly idle
> > and on a highly loaded system can also be found in this thread.
> > *A: the result with your changes*
> > --> Reading "scaling_cur_freq"
> > 0: 2688720 2: 2696628 4: 2699264 6: 2696628
> > 8: 2699264 10: 2696628 12: 2699264 14: 2699264
> > 16: 2699264 18: 2696628 20: 2696628 22: 2696628
> > 24: 2699264 26: 2696628 28: 2696628 30: 2696628
> > 32: 2699264 34: 2691356 36: 2696628 38: 2699264
> > 40: 2699264 42: 2696628 44: 2696628 46: 2699264
> > 48: 2699264 50: 2696628 52: 2696628 54: 2696628
> > 56: 2696628 58: 2699264 60: 2691356 62: 2696628
> > 64: 2696628 66: 2696628 68: 2696628 70: 2696628
> > 72: 2696628 74: 2696628 76: 2699264 78: 2696628
> > 80: 2696628 82: 2696628 84: 2699264 86: 2696628
> > 88: 2625456 90: 2696628 92: 2699264 94: 2696628
> > 96: 2696628 98: 2696628 100: 2699264 102: 2699264
> > 104: 2699264 106: 2696628 108: 2699264 110: 2696628
> > 112: 2699264 114: 2699264 116: 2696628 118: 2696628
> > 120: 2696628 122: 2699264 124: 2696628 126: 2696628
> > -->Reading "cpuinfo_cur_freq"
> > 0: 2696628 2: 2696628 4: 2699264 6: 2688720
> > 8: 2699264 10: 2700000 12: 2696628 14: 2698322
> > 16: 2699264 18: 2699264 20: 2696628 22: 2699264
> > 24: 2699264 26: 2699264 28: 2699264 30: 2699264
> > 32: 2699264 34: 2693992 36: 2696628 38: 2696628
> > 40: 2699264 42: 2699264 44: 2699264 46: 2696628
> > 48: 2696628 50: 2699264 52: 2696628 54: 2696628
> > 56: 2699264 58: 2699264 60: 2696628 62: 2699264
> > 64: 2696628 66: 2699264 68: 2696628 70: 2699264
> > 72: 2696628 74: 2696628 76: 2696628 78: 2693992
> > 80: 2696628 82: 2696628 84: 2696628 86: 2696628
> > 88: 2696628 90: 2699264 92: 2696628 94: 2699264
> > 96: 2699264 98: 2696628 100: 2699264 102: 2699264
> > 104: 2691356 106: 2699264 108: 2699264 110: 2699264
> > 112: 2699264 114: 2696628 116: 2699264 118: 2699264
> > 120: 2696628 122: 2696628 124: 2696628 126: 2696628
> >
> > *B: the result without your changes*
> > -->Reading "scaling_cur_freq"
> > 0: 2698245 2: 2706690 4: 2699649 6: 2702105
> > 8: 2704362 10: 2697993 12: 2701672 14: 2704362
> > 16: 2701052 18: 2701052 20: 2694385 22: 2699650
> > 24: 2706802 26: 2702389 28: 2698299 30: 2698299
> > 32: 2697333 34: 2697993 36: 2701337 38: 2699328
> > 40: 2700330 42: 2700330 44: 2698019 46: 2697697
> > 48: 2699659 50: 2701700 52: 2703401 54: 2701700
> > 56: 2704013 58: 2697658 60: 2695000 62: 2697666
> > 64: 2697902 66: 2701052 68: 2698245 70: 2695789
> > 72: 2701315 74: 2696655 76: 2693666 78: 2695317
> > 80: 2704912 82: 2699649 84: 2698245 86: 2695454
> > 88: 2697966 90: 2697959 92: 2699319 94: 2700680
> > 96: 2695317 98: 2698996 100: 2700000 102: 2700334
> > 104: 2701320 106: 2695065 108: 2700986 110: 2703960
> > 112: 2697635 114: 2704421 116: 2700680 118: 2702040
> > 120: 2700334 122: 2697993 124: 2700334 126: 2705351
> > -->Reading "cpuinfo_cur_freq"
> > 0: 2696853 2: 2695454 4: 2699649 6: 2706993
> > 8: 2706060 10: 2704362 12: 2704362 14: 2697658
> > 16: 2707719 18: 2697192 20: 2702456 22: 2699650
> > 24: 2705782 26: 2698299 28: 2703061 30: 2705802
> > 32: 2700000 34: 2700671 36: 2701337 38: 2697658
> > 40: 2700330 42: 2700330 44: 2699672 46: 2697697
> > 48: 2703061 50: 2696610 52: 2692542 54: 2704406
> > 56: 2695317 58: 2699331 60: 2698996 62: 2702675
> > 64: 2704912 66: 2703859 68: 2699649 70: 2698596
> > 72: 2703908 74: 2703355 76: 2697658 78: 2695317
> > 80: 2702105 82: 2707719 84: 2702105 86: 2699649
> > 88: 2697966 90: 2691525 92: 2701700 94: 2700680
> > 96: 2695317 98: 2698996 100: 2698666 102: 2700334
> > 104: 2690429 106: 2707590 108: 2700986 110: 2701320
> > 112: 2696283 114: 2692881 116: 2697627 118: 2704421
> > 120: 2698996 122: 2696321 124: 2696655 126: 2695000
> >
> So in both cases, whether you use arch_freq_get_on_cpu or not
> (so with and without the patch), you get roughly the same frequencies
> on all cores - or am I missing smth from the dump above?
> And those reflect the max freq you provided earlier (?)
> Note that arch_freq_get_on_cpu will return an average frequency for
> the last tick, so even if your system is roughly idle with the performance
> governor, those numbers make sense (some/most of the cores might be idle,
> but you will see the last freq the core was running at before going idle).
> I do not think there is an agreement on what should be shown for idle cores
> when querying their freq through sysfs. Showing the last known freq makes
> sense, even more than waking up a core just to try to get one.
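> To illustrate (a simplified sketch, not the actual arm64 code): on each
> tick the delivered and reference counters are sampled, and the ratio of
> the deltas gives the average frequency over that window:
>
> 	u64 delta_core  = core_cnt  - prev_core_cnt;   /* cycles at the actual freq */
> 	u64 delta_const = const_cnt - prev_const_cnt;  /* cycles at the constant ref rate */
> 	u64 avg_freq_khz = div64_u64(ref_freq_khz * delta_core, delta_const);
>
> An idle core then simply keeps reporting the ratio from its last tick.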
>
> @Ionela: Please jump in if I got things wrong.
Yes, that's how I see things as well. When using the performance
governor, while the CPU is active, its frequency should be the
maximum one (unless there has been firmware/hardware capping), and that
would be reflected by cpuinfo_cur_freq, either through the use of the
frequency scale factor (based on the samples from the last tick) or the
driver's .get() function (having woken up the CPU to sample the
counters).
So the values above look alright to me.
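For reference, the scale factor path boils down to something like this
(simplified; the factor is updated from the counters on each tick and
normalised to SCHED_CAPACITY_SCALE):

	cur_freq = (arch_freq_scale * max_freq) >> SCHED_CAPACITY_SHIFT;

so with the performance governor and an active CPU it should sit at (or
very near) max_freq, as in the dumps above.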
Thanks,
Ionela.
>
> > > then reading 'scaling_cur_freq', doing several reads at some intervals?
> > It seems that the above behavior does not have much to do with the read intervals.
> > > The change has been tested on an RD-N2 model (Neoverse N2 reference platform);
> > > it has also been discussed here [1].
> > I didn't find the test results for this platform in that thread.
> It might be missing the exact numbers, but the conclusions should be there [1].
>
> > > > > > *--> Step 2: *read 'cpuinfo_cur_freq' under high memory access pressure.
> > > > > > 0: 2696628 2: 2696628 4: 2696628 6: 2696628
> > > > > > 8: 2696628 10: 2696628 12: 2696628 14: 2696628
> > > > > > 16: 2696628 18: 2696628 20: 2696628 22: 2696628
> > > > > > 24: 2696628 26: 2696628 28: 2696628 30: 2696628
> > > > > > 32: 2696628 34: 2696628 36: 2696628 38: 2696628
> > > > > > 40: 2696628 42: 2696628 44: 2696628 46: 2696628
> > > > > > 48: 2696628 50: 2696628 52: 2696628 54: 2696628
> > > > > > 56: 2696628 58: 2696628 60: 2696628 62: 2696628
> > > > > > 64: 2696628 66: 2696628 68: 2696628 70: 2696628
> > > > > > 72: 2696628 74: 2696628 76: 2696628 78: 2696628
> > > > > > 80: 2696628 82: 2696628 84: 2696628 86: 2696628
> > > > > > 88: 2696628 90: 2696628 92: 2696628 94: 2696628
> > > > > > 96: 2696628 98: 2696628 100: 2696628 102: 2696628
> > > > > > 104: 2696628 106: 2696628 108: 2696628 110: 2696628
> > > > > > 112: 2696628 114: 2696628 116: 2696628 118: 2696628
> > > > > > 120: 2696628 122: 2696628 124: 2696628 126: 2696628
> > > > > >
> > > > > > *Case 2: nohz_full set and cpufreq using the ondemand governor*
> > > > > > There is "isolcpus=1-10,41-50 nohz_full=1-10,41-50 rcu_nocbs=1-10,41-50" in
> > > > > > /proc/cmdline.
> > > > > Right, so if I remember correctly nohz_full implies rcu_nocbs, so there is no
> > > > > need to set that one.
> > > > > Now, afair, isolcpus will make the selected CPUs disappear from the
> > > > > scheduler's view (no balancing, no migrating), so unless you affine smth
> > > > > explicitly to those CPUs, you will not see much activity there.
> > > > Correct.
> > > > > Need to double check though as it has been a while ...
> > > > > > *--> Step 1: *set the ondemand governor for all policies and read
> > > > > > 'cpuinfo_cur_freq' with no pressure.
> > > > > > The frequencies of all CPUs are about 400MHz.
> > > > > > *--> Step 2:* read 'cpuinfo_cur_freq' under high memory access pressure.
> > > > > > The high memory access pressure comes from the command: "stress-ng -c 64
> > > > > > --cpu-load 100% --taskset 0-63"
> > > > > I'm not entirely convinced that this will affine the load to the isolated cpus,
> > > > > especially as the affinity mask spans all available cpus. If that is the case,
> > > > > no wonder your isolated cpus are going to waste sitting idle. But I would have
> > > > > to double check how this is being handled.
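> > > > > If you want to be sure the load actually lands on the isolated cpus,
> > > > > pinning it there explicitly should do, e.g. smth like (just an
> > > > > illustration; adjust the worker count to the isolated cpu count):
> > > > >
> > > > > 	taskset -c 1-10,41-50 stress-ng -c 20 --cpu-load 100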
> > > > > > The result:
> > > > > > 0: 2696628 1: 400000 2: 400000 3: 400909
> > > > > > 4: 400000 5: 400000 6: 400000 7: 400000
> > > > > > 8: 400000 9: 400000 10: 400600 11: 2696628
> > > > > > 12: 2696628 13: 2696628 14: 2696628 15: 2696628
> > > > > > 16: 2696628 17: 2696628 18: 2696628 19: 2696628
> > > > > > 20: 2696628 21: 2696628 22: 2696628 23: 2696628
> > > > > > 24: 2696628 25: 2696628 26: 2696628 27: 2696628
> > > > > > 28: 2696628 29: 2696628 30: 2696628 31: 2696628
> > > > > > 32: 2696628 33: 2696628 34: 2696628 35: 2696628
> > > > > > 36: 2696628 37: 2696628 38: 2696628 39: 2696628
> > > > > > 40: 2696628 41: 400000 42: 400000 43: 400000
> > > > > > 44: 400000 45: 398847 46: 400000 47: 400000
> > > > > > 48: 400000 49: 400000 50: 400000 51: 2696628
> > > > > > 52: 2696628 53: 2696628 54: 2696628 55: 2696628
> > > > > > 56: 2696628 57: 2696628 58: 2696628 59: 2696628
> > > > > > 60: 2696628 61: 2696628 62: 2696628 63: 2699264
> > > > > >
> > > > > > Note:
> > > > > > (1) CPUs 1-10 and 41-50 run at the lowest frequency.
> > > > > > It turns out that nohz_full was already working.
> > > > > > I guess that stress-ng cannot use the CPUs in the nohz_full range,
> > > > > > because the CPU frequency rises to 2.7GHz when binding another
> > > > > > application to those CPUs.
> > > > > > (2) According to ftrace, the frequency of the nohz_full cores is
> > > > > > calculated by the get() callback.
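> > > > > > For reference, one way to check this with ftrace (assuming the CPPC
> > > > > > driver, whose get() callback is cppc_cpufreq_get_rate):
> > > > > > echo cppc_cpufreq_get_rate > /sys/kernel/tracing/set_ftrace_filter
> > > > > > echo function > /sys/kernel/tracing/current_tracer
> > > > > > cat /sys/kernel/tracing/trace_pipe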
> > > > > That is because there is no sched tick on those CPUs, and apparently there is
> > > > > nothing running on them either.
> > > > Yes.
> > > > If we take your approach and the above behavior is normal,
> > > > the large frequency discrepancy issue is resolved, along the way, for CPUs
> > > > with a sched tick.
> > > > But the nohz_full cores still face this issue, so this patch is also
> > > > needed.
> > > >
> > > Yes, nohz_full cores have to be handled by the cpufreq driver.
> > Correct. So we still have to address the issue with this patch and push it
> > forward.
> > Beata, would you please review this patch?
> Just to clarify for my benefit (apologies, but I do have to context switch
> pretty often these days): by reviewing this patch, do you mean:
> 1) review your changes (if so, I think there are a few comments already to be
> addressed, but I can try to have another look)
> 2) review the changes for the AMU-based arch_freq_get_on_cpu?
>
> *note: I will still try to have a look at the non-housekeeping cpus case
>
> ---
> [1] https://lore.kernel.org/lkml/691d3eb2-cd93-f0fc-a7a4-2a8c0d44262c@xxxxxxxxxx/
> ---
>
> BR
> Beata
> >
> >
> > /Huisong
> > >
> [...]