Re: [PATCH v3 2/2] cpufreq: add virtual-cpufreq driver

From: Pavan Kondeti
Date: Sun Aug 06 2023 - 23:23:06 EST


On Fri, Aug 04, 2023 at 04:46:11PM -0700, David Dai wrote:
> Hi Pavan,
>
> Thanks for reviewing!
>
> On Wed, Aug 2, 2023 at 9:18 PM Pavan Kondeti <quic_pkondeti@xxxxxxxxxxx> wrote:
> >
> > On Mon, Jul 31, 2023 at 10:46:09AM -0700, David Dai wrote:
> > > Introduce a virtualized cpufreq driver for guest kernels to improve
> > > performance and power of workloads within VMs.
> > >
> > > This driver does two main things:
> > >
> > > 1. Sends the frequency of vCPUs as a hint to the host. The host uses the
> > > hint to schedule the vCPU threads and decide physical CPU frequency.
> > >
> > > 2. If a VM does not support a virtualized FIE(like AMUs), it queries the
> > > host CPU frequency by reading a MMIO region of a virtual cpufreq device
> > > to update the guest's frequency scaling factor periodically. This enables
> > > accurate Per-Entity Load Tracking for tasks running in the guest.
> > >
> > > Co-developed-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> > > Signed-off-by: Saravana Kannan <saravanak@xxxxxxxxxx>
> > > Signed-off-by: David Dai <davidai@xxxxxxxxxx>
> >
> > [...]
> >
> > > +static void virt_scale_freq_tick(void)
> > > +{
> > > + struct cpufreq_policy *policy = cpufreq_cpu_get(smp_processor_id());
> > > + struct virt_cpufreq_drv_data *data = policy->driver_data;
> > > + u32 max_freq = (u32)policy->cpuinfo.max_freq;
> > > + u64 cur_freq;
> > > + u64 scale;
> > > +
> > > + cpufreq_cpu_put(policy);
> > > +
> > > + cur_freq = (u64)data->ops->get_freq(policy);
> > > + cur_freq <<= SCHED_CAPACITY_SHIFT;
> > > + scale = div_u64(cur_freq, max_freq);
> > > +
> > > + this_cpu_write(arch_freq_scale, (unsigned long)scale);
> > > +}
> > > +
> >
> > We expect the host to provide the frequency in kHz, can you please add a
> > comment about it. It is not very obvious when you look at the
> > REG_CUR_FREQ_OFFSET register name.
>
> I’ll include a KHZ in the offset names.
>

Sure, that would help. Also, can you limit the scale to
SCHED_CAPACITY_SCALE? It may be possible that host may be running at a
higher frequency than max_freq advertised on this guest.

> >
> > > +
> > > +static unsigned int virt_cpufreq_fast_switch(struct cpufreq_policy *policy,
> > > + unsigned int target_freq)
> > > +{
> > > + virt_cpufreq_set_perf(policy);
> > > + return target_freq;
> > > +}
> > > +
> > > +static int virt_cpufreq_target_index(struct cpufreq_policy *policy,
> > > + unsigned int index)
> > > +{
> > > + return virt_cpufreq_set_perf(policy);
> > > +}
> > > +
> > > +static int virt_cpufreq_cpu_init(struct cpufreq_policy *policy)
> > > +{
> > > + struct virt_cpufreq_drv_data *drv_data = cpufreq_get_driver_data();
> > > + struct cpufreq_frequency_table *table;
> > > + struct device *cpu_dev;
> > > + int ret;
> > > +
> > > + cpu_dev = get_cpu_device(policy->cpu);
> > > + if (!cpu_dev)
> > > + return -ENODEV;
> > > +
> > > + ret = dev_pm_opp_of_add_table(cpu_dev);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + ret = dev_pm_opp_get_opp_count(cpu_dev);
> > > + if (ret <= 0) {
> > > + dev_err(cpu_dev, "OPP table can't be empty\n");
> > > + return -ENODEV;
> > > + }
> > > +
> > > + ret = dev_pm_opp_init_cpufreq_table(cpu_dev, &table);
> > > + if (ret) {
> > > + dev_err(cpu_dev, "failed to init cpufreq table: %d\n", ret);
> > > + return ret;
> > > + }
> > > +
> > > + policy->freq_table = table;
> > > + policy->dvfs_possible_from_any_cpu = false;
> > > + policy->fast_switch_possible = true;
> > > + policy->driver_data = drv_data;
> > > +
> > > + /*
> > > + * Only takes effect if another FIE source such as AMUs
> > > + * have not been registered.
> > > + */
> > > + topology_set_scale_freq_source(&virt_sfd, policy->cpus);
> > > +
> > > + return 0;
> > > +
> > > +}
> > > +
> >
> > Do we need to register as FIE source even with the below commit? By
> > registering as a source, we are not supplying any accurate metric. We
> > still fallback on the same source that cpufreq implements it.
>
> The arch_set_freq_scale() done at cpufreq driver’s frequency updates
> at cpufreq_freq_transition_end() and cpufreq_driver_fast_switch() only
> represent the guest’s frequency request. However, this does not
> accurately represent the physical CPU’s frequency that the vCPU is
> running on. E.g. There may be other processes sharing the same
> physical CPU that results in a much higher CPU frequency than what’s
> requested by the vCPU.
>

understood that policy->cur may not reflect the actual frequency. Is this
something needs to be advertised to cpufreq core so that it query the
underlying cpufreq driver and use it for frequency scale updates. This
also gives userspace to read the actual frequency when read from sysfs.

In fact, cpufreq_driver_fast_switch() comment says that
cpufreq_driver::fast_switch() should return the *actual* frequency and
the same is used to update frequency scale updates. I understand that it
depends on other things like if host defer the frequency switch, the
value read from REG_CUR_FREQ_OFFSET may reflect the old value..

May be a comment in code would help.

Thanks,
Pavan