Re: [PATCH v2 1/1] x86,sched: On AMD EPYC set freq_max = max_boost in schedutil invariant formula

From: Rafael J. Wysocki
Date: Wed Feb 03 2021 - 08:42:19 EST


On Wed, Feb 3, 2021 at 9:39 AM Giovanni Gherdovich <ggherdovich@xxxxxxx> wrote:
>
> Hello,
>
> both Rafael and Viresh make a similar remark: why adding a new "max_boost"
> variable, since "max_freq" is already available and could be used instead.
>
> Replying here to both.
>
> On Tue, 2021-02-02 at 20:26 +0100, Rafael J. Wysocki wrote:
> > On Tue, Feb 2, 2021 at 7:59 PM Rafael J. Wysocki <rafael@xxxxxxxxxx> wrote:
> > >
> > > On Fri, Jan 22, 2021 at 9:47 PM Giovanni Gherdovich <ggherdovich@xxxxxxx> wrote:
> > >
> > > [cut]
> > > > @@ -779,15 +829,25 @@ static int acpi_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > > > freq_table[valid_states-1].frequency / 1000)
> > > > continue;
> > > >
> > > > + freq = perf->states[i].core_frequency * 1000;
> > > > freq_table[valid_states].driver_data = i;
> > > > - freq_table[valid_states].frequency =
> > > > - perf->states[i].core_frequency * 1000;
> > > > + freq_table[valid_states].frequency = freq;
> > > > +
> > > > + if (freq > max_freq)
> > > > + max_freq = freq;
> > > > +
> > > > valid_states++;
> > > > }
> > > > freq_table[valid_states].frequency = CPUFREQ_TABLE_END;
> > > > policy->freq_table = freq_table;
> > > > perf->state = 0;
> > > >
> > > > + if (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
> > > > + amd_max_boost(max_freq, &max_boost)) {
> > > > + policy->cpuinfo.max_boost = max_boost;
> > >
> > > Why not to set max_freq to max_boost instead?
> >
> > I mean, would setting the frequency in the last table entry to max_boost work?
> >
> > Alternatively, one more (artificial) entry with the frequency equal to
> > max_boost could be added.
>
> On Wed, 2021-02-03 at 11:34 +0530, Viresh Kumar wrote:
> > [cut]
> >
> > On 22-01-21, 21:40, Giovanni Gherdovich wrote:
> > > diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> > > index 6931f0cdeb80..541f3db3f576 100644
> > > --- a/kernel/sched/cpufreq_schedutil.c
> > > +++ b/kernel/sched/cpufreq_schedutil.c
> > > @@ -159,8 +159,12 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
> > > unsigned long util, unsigned long max)
> > > {
> > > struct cpufreq_policy *policy = sg_policy->policy;
> > > - unsigned int freq = arch_scale_freq_invariant() ?
> > > - policy->cpuinfo.max_freq : policy->cur;
> > > + unsigned int freq, max_freq;
> > > +
> > > + max_freq = cpufreq_driver_has_max_boost() ?
> > > + policy->cpuinfo.max_boost : policy->cpuinfo.max_freq;
> >
> > Also, can't we update max_freq itself from the cpufreq driver? What
> > troubles will it cost ?
>
> I could add the max boost frequency to the frequency table (and
> policy->cpuinfo.max_freq would follow), yes, but that would trigger the
> following warning from acpi-cpufreq.c:
>
> static void acpi_cpufreq_cpu_ready(struct cpufreq_policy *policy)
> {
> struct acpi_processor_performance *perf = per_cpu_ptr(acpi_perf_data,
> policy->cpu);
>
> if (perf->states[0].core_frequency * 1000 != policy->cpuinfo.max_freq)
> pr_warn(FW_WARN "P-state 0 is not max freq\n");
> }

This check can be changed, though.

> so I thought that to stay out of troubles I'd supply a different variable,
> max_boost, to be used only in the schedutil formula.

Which is not necessary and the extra static branch is not necessary.

Moreover, there is no reason whatsoever to believe that EPYC is the
only affected processor model. If I'm not mistaken, the regression
will be visible on every CPU where the scale invariance algorithm uses
the max frequency greater than the max frequency used acpi_cpufreq.

Also, AFAICS, it should be sufficient to modify acpi_cpufreq to remedy
this for all of the affected CPUs, not just EPYC.

> After schedutil figures out a desired next_freq then the usual comparison with the
> firmware-supplied frequency table could take place.
>
> Altering the frequency table seemed more invasive because once a freq value is
> in there, it's going to be actually requested by the driver to the platform.

This need not be the case if the control structure for the new entry
is copied from an existing one.

> I only want my max_boost to stretch the range of schedutil's next_freq.

Right, but that can be done in a different way which would be cleaner too IMO.

I'm going to send an alternative patch to fix this problem.

> On Tue, 2021-02-02 at 19:59 +0100, Rafael J. Wysocki wrote:
> >
> > [cut]
> > Also notice that the static branch is global and the max_boost value
> > for different CPUs may be different, at least in theory.
>
> In theory yes, but I'm guarding the code with two conditions:
>
> * vendor is X86_VENDOR_AMD
> * cppc_get_perf_caps() returns success
>
> this identifies AMD EPYC cpus with model 7xx2 and later, where max_boost is
> the same on all cores. I may have added synchronization so that only one cpu
> sets the value, but I didn't in the interest of simplicity for an -rc patch
> (I'd have to consider hotplug, the maxcpus= command line param, ecc).

And what about the other potentially affected processors?

I wouldn't worry about the -rc time frame too much. If we can do a
better fix now, let's do it.