Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver
From: David Dai
Date: Fri Aug 04 2023 - 19:46:32 EST
Hi Pavan,
Thanks for reviewing!
On Wed, Aug 2, 2023 at 9:18 PM Pavan Kondeti <quic_pkondeti@xxxxxxxxxxx> wrote:
>
> On Mon, Jul 31, 2023 at 10:46:09AM -0700, David Dai wrote:
> > Introduce a virtualized cpufreq driver for guest kernels to improve
> > performance and power of workloads within VMs.
> >
> > This driver does two main things:
> >
> > 1. Sends the frequency of vCPUs as a hint to the host. The host uses the
> > hint to schedule the vCPU threads and decide physical CPU frequency.
> >
> > 2. If a VM does not support a virtualized FIE(like AMUs), it queries the
> > host CPU frequency by reading a MMIO region of a virtual cpufreq device
> > to update the guest's frequency scaling factor periodically. This enables
> > accurate Per-Entity Load Tracking for tasks running in the guest.
> >
> > Co-developed-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> > Signed-off-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> > Signed-off-by: David Dai <davidai@xxxxxxxxxx>
>
> [...]
>
> > +static void virt_scale_freq_tick(void)
> > +{
> > + struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
> > + struct virt_cpufreq_drv_data *data = policy->driver_data;
> > + u32 max_freq = (u32)policy->cpuinfo.max_freq;
> > + u64 cur_freq;
> > + u64 scale;
> > +
> > + cpufreq_cpu_put(policy);
> > +
> > + cur_freq = (u64)data->ops->get_freq(policy);
> > + cur_freq <<= SCHED_CAPACITY_SHIFT;
> > + scale = div_u64(cur_freq, max_freq);
> > +
> > + this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > +}
> > +
>
> We expect the host to provide the frequency in kHz, can you please add a
> comment about it. It is not very obvious when you look at the
> REG_CUR_FREQ_OFFSET register name.
I’ll include a KHZ in the offset names.
>
> > +static struct scale_freq_data virt_sfd = {
> > + .source = SCALE_FREQ_SOURCE_VIRT,
> > + .set_freq_scale = virt_scale_freq_tick,
> > +};
> > +
> > +static unsigned int virt_cpufreq_set_perf(struct cpufreq_policy *policy)
> > +{
> > + struct virt_cpufreq_drv_data *data = policy->driver_data;
> > + /*
> > + * Use cached frequency to avoid rounding to freq table entries
> > + * and undo 25% frequency boost applied by schedutil.
> > + */
> > + u32 freq = mult_frac(policy->cached_target_freq, 80, 100);
> > +
> > + data->ops->set_freq(policy, freq);
> > + return 0;
> > +}
>
> Why do we undo the frequency boost? A governor may apply other boosts
> like RT (uclamp), iowait. It is not clear why we need to worry about
> governor policies here.
See Saravana’s response to Quentin for more details, but in short,
we’ll remove this particular snippet in the driver.
>
> > +
> > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > + unsigned int target_freq)
> > +{
> > + virt_cpufreq_set_perf(policy);
> > + return target_freq;
> > +}
> > +
> > +static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
> > + unsigned int index)
> > +{
> > + return virt_cpufreq_set_perf(policy);
> > +}
> > +
> > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > +{
> > + struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
> > + struct cpufreq_frequency_table *table;
> > + struct device *cpu_dev;
> > + int ret;
> > +
> > + cpu_dev = get_cpu_device(policy->cpu);
> > + if (!cpu_dev)
> > + return -ENODEV;
> > +
> > + ret = dev_pm_opp_of_add_table(cpu_dev);
> > + if (ret)
> > + return ret;
> > +
> > + ret = dev_pm_opp_get_opp_count(cpu_dev);
> > + if (ret <= 0) {
> > + dev_err(cpu_dev, "OPP table can't be empty\n");
> > + return -ENODEV;
> > + }
> > +
> > + ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
> > + if (ret) {
> > + dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
> > + return ret;
> > + }
> > +
> > + policy->freq_table = table;
> > + policy->dvfs_possible_from_any_cpu = false;
> > + policy->fast_switch_possible = true;
> > + policy->driver_data = drv_data;
> > +
> > + /*
> > + * Only takes effect if another FIE source such as AMUs
> > + * have not been registered.
> > + */
> > + topology_set_scale_freq_source(&virt_sfd, policy->cpus);
> > +
> > + return 0;
> > +
> > +}
> > +
>
> Do we need to register as FIE source even with the below commit? By
> registering as a source, we are not supplying any accurate metric. We
> still fallback on the same source that cpufreq implements it.
The arch_set_freq_scale() done at cpufreq driver’s frequency updates
at cpufreq_freq_transition_end() and cpufreq_driver_fast_switch() only
represent the guest’s frequency request. However, this does not
accurately represent the physical CPU’s frequency that the vCPU is
running on. E.g. There may be other processes sharing the same
physical CPU that results in a much higher CPU frequency than what’s
requested by the vCPU.
Thanks,
David
>
> 874f63531064 ("cpufreq: report whether cpufreq supports Frequency
> Invariance (FI)")
>
> Thanks,
> Pavan
>
> --
> To unsubscribe from this group and stop receiving emails from it, send an email to kernel-team+unsubscribe@xxxxxxxxxxx.
>