Re: [PATCH 4/4] sched: cpufreq_cfs: pelt-based cpu frequency scaling

From: Michael Turquette
Date: Thu May 07 2015 - 10:25:31 EST


Quoting Juri Lelli (2015-05-07 03:49:27)
> Hi Mike,
>
> On 07/05/15 05:17, Michael Turquette wrote:
> > Quoting Peter Zijlstra (2015-05-06 05:22:40)
> >> On Tue, May 05, 2015 at 11:23:47AM -0700, Michael Turquette wrote:
> >>> Quoting Peter Zijlstra (2015-05-05 02:00:42)
> >>>> On Mon, May 04, 2015 at 03:10:41PM -0700, Michael Turquette wrote:
> >>>>> This policy is implemented using the cpufreq governor interface for two
> >>>>> main reasons:
> >>>>>
> >>>>> 1) re-using the cpufreq machine drivers without using the governor
> >>>>> interface is hard.
> >>>>>
> >>>>> 2) using the cpufreq interface allows us to switch between the
> >>>>> scheduler-driven policy and legacy cpufreq governors such as ondemand at
> >>>>> run-time. This is very useful for comparative testing and tuning.
> >>>>
> >>>> Urgh, so I don't really like that. It adds a lot of noise to the
> >>>> system. You do the irq work thing to kick the cpufreq threads which do
> >>>> their little thing -- and their wakeup will influence the cfs
> >>>> accounting, which in turn will start the whole thing anew.
> >>>>
> >>>> I would really prefer you did a whole new system with directly invoked
> >>>> drivers that avoid the silly dance. Your 'new' ARM systems should be
> >>>> well capable of that.
> >>>
> >>> Thanks for the review Peter.
> >>
> >> Well, I didn't actually get beyond the Changelog; although I have
> >> rectified this now. A few more comments below.
> >
> > Thanks for the Real Deal review Peter.
> >
> >>
> >>> We'll need something in process context for the many cpufreq drivers
> >>> that might sleep during their cpu frequency transition, no? This is due
> >>> to calls into the regulator framework, the clock framework, and sometimes
> >>> other things, such as conversing with a power management IC or voting
> >>> with a system controller.
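
(For illustration, the driver side of such a transition typically boils
down to something like the sketch below. The my_cpu_dvfs_* names are
made up for this example; the point is that regulator_set_voltage() and
clk_set_rate() may sleep, so none of this can run from the scheduler's
hot path.)

#include <linux/clk.h>
#include <linux/cpufreq.h>
#include <linux/regulator/consumer.h>

/* Hypothetical per-driver state, just for this sketch. */
struct my_cpu_dvfs {
        struct clk *cpu_clk;
        struct regulator *cpu_reg;
        struct cpufreq_frequency_table *freq_table;
        int *volt_table;        /* microvolts, indexed like freq_table */
};

static struct my_cpu_dvfs dvfs;

/*
 * Both regulator_set_voltage() and clk_set_rate() may sleep (I2C writes
 * to a PMIC, PLL relock waits), hence the need for process context.
 */
static int my_cpu_dvfs_target_index(struct cpufreq_policy *policy,
                                    unsigned int index)
{
        unsigned long old_rate = clk_get_rate(dvfs.cpu_clk);
        unsigned long new_rate = dvfs.freq_table[index].frequency * 1000UL;
        int new_uV = dvfs.volt_table[index];
        int ret = 0;

        /* Scaling up: raise the voltage before the clock. */
        if (new_rate > old_rate) {
                ret = regulator_set_voltage(dvfs.cpu_reg, new_uV, new_uV);
                if (ret)
                        return ret;
        }

        ret = clk_set_rate(dvfs.cpu_clk, new_rate);
        if (ret)
                return ret;

        /* Scaling down: drop the voltage after the clock. */
        if (new_rate < old_rate)
                ret = regulator_set_voltage(dvfs.cpu_reg, new_uV, new_uV);

        return ret;
}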
> >>
> >> Yes, we need _something_. I just spoke to a bunch of people on IRC and
> >> it does indeed seem that I was mistaken in my assumption that modern ARM
> >> systems were 'easy' in this regard.
> >>
> >>>> As to the drivers, they're mostly fairly small and self-contained; it
> >>>> should not be too hard to hack them up to work without cpufreq.
> >>>
> >>> The drivers are not the only thing. I want to leverage the existing
> >>> cpufreq core infrastructure:
> >>>
> >>> * rate change notifiers
> >>> * cpu hotplug awareness
> >>> * methods to fetch frequency tables from firmware (acpi, devicetree)
> >>> * other stuff I can't think of now
> >>>
> >>> So I do not think we should throw out the baby with the bath water. The
> >>> thing that people really don't like about cpufreq is the governors, IMO.
> >>> Let's fix that by creating a list of requirements that we really want
> >>> for scheduler-driven cpu frequency selection:
> >>>
> >>> 0) do something smart with scheduler statistics to create an improved
> >>> frequency selection algorithm over existing cpufreq governors
> >>>
> >>> 1) support upcoming and legacy hardware, within reason
> >>>
> >>> 2) if a system exports a fast, async frequency selection interface to
> >>> Linux, then provide a solution that doesn't do any irq_work or kthread
> >>> stuff. Do it all in the fast path
> >>>
> >>> 3) if a system has a slow, synchronous interface for frequency
> >>> selection, then provide an acceptable solution for kicking this work to
> >>> process context. Limit time in the fast path
> >>>
> >>> The patch set tries to tackle 0, 1 and 3. Would the inclusion of #2 make
> >>> you feel better about supporting "newer" hardware with a fire-and-forget
> >>> frequency selection interface?
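
(To make the split between #2 and #3 concrete, the governor-side
dispatch could look roughly like the sketch below. struct gov_data, the
fast_switch flag and the fast_set_freq() hook are placeholders for this
example; no such driver callback exists today.)

#include <linux/cpufreq.h>
#include <linux/irq_work.h>

/* Hypothetical per-policy governor data, just for this sketch. */
struct gov_data {
        struct cpufreq_policy *policy;
        struct irq_work irq_work;       /* kicks the slow-path kthread */
        unsigned int requested_freq;
        bool fast_switch;               /* driver can switch from this context */
        void (*fast_set_freq)(struct cpufreq_policy *policy,
                              unsigned int freq);
};

/* Called from the scheduler hot path with a frequency already chosen. */
static void gov_request_freq(struct gov_data *gd, unsigned int freq)
{
        gd->requested_freq = freq;

        if (gd->fast_switch) {
                /*
                 * Requirement #2: the platform has a fast, async,
                 * non-sleeping interface, so program it right here and
                 * skip the irq_work/kthread dance entirely.  The
                 * fast_set_freq() callback is hypothetical.
                 */
                gd->fast_set_freq(gd->policy, freq);
                return;
        }

        /*
         * Requirement #3: the driver may sleep (regulators, PLLs), so
         * only queue an irq_work here; its handler wakes a kthread that
         * performs the transition in process context.
         */
        irq_work_queue(&gd->irq_work);
}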
> >>
> >> I should probably look at doing an x86 support patch to try some of this
> >> out, I'll try and find some time to play with that.
> >>
> >> So two comments on the actual code:
> >>
> >> 1) Ideally these hooks would be called from places where we've just
> >> computed the cpu utilization already. I understand this is currently not
> >> the case and we need to do it in-situ.
> >
> > Are you thinking of placing the hook somewhere such as
> > update_entity_load_avg?
> >
> > I did not choose that as a call site since I was worried that it would
> > be too noisy; we can enter that function multiple times in the same
> > pass through the scheduler.
> >
> > On the other hand, if dvfs transitions are cheap for a platform, then it
> > might not be so bad to call it multiple times. For platforms where dvfs
> > transitions are expensive, we could store per-cpu new_capacity data
> > multiple times and make sure that we only program the hardware later on,
> > as deferred work.
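
(On the expensive-transition platforms that could look roughly like the
following; pcpu_capacity, capacity_irq_work and friends are
illustrative names only.)

#include <linux/irq_work.h>
#include <linux/percpu.h>

/* Illustrative per-cpu store for the most recent capacity request. */
static DEFINE_PER_CPU(unsigned long, pcpu_capacity);

static void capacity_irq_work_fn(struct irq_work *work)
{
        /*
         * Deferred side: wake the governor kthread, which reads the
         * stored per-cpu requests and programs the hardware once,
         * instead of once per call site.
         */
}

static struct irq_work capacity_irq_work;  /* init_irq_work() at gov start */

/*
 * Cheap enough to call several times per pass through the scheduler:
 * just record the request and only kick the deferred work on a change.
 */
static void update_cpu_capacity_request(int cpu, unsigned long new_capacity)
{
        if (per_cpu(pcpu_capacity, cpu) == new_capacity)
                return;

        per_cpu(pcpu_capacity, cpu) = new_capacity;
        irq_work_queue(&capacity_irq_work);
}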
> >
> >>
> >> That said, it would be good to have the interface such that we pass the
> >> utilization in; this would of course mean we'd have to compute it at the
> >> call sites, but this should be fine, I think.
> >
> > I changed the code to do this but didn't have time to test it. Will send
> > the patch with these changes tomorrow.
> >
> >>
> >>
> >> 2) I dislike how policy->cpus is handled; it feels upside down.
> >
> > I agree. It feels better to push the utilization from cfs instead of
> > pulling it from the frequency selection policy.
> >
> > I chose to do it this way since there isn't an obvious way to pass some
> > private data to an irq_work callback, but I've overcome this with some
> > per-cpu data internal to the governor.
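
(Concretely the trick is just to embed the irq_work in per-cpu governor
data and use container_of() in the callback. A sketch with illustrative
names; init_irq_work() and kthread creation are omitted.)

#include <linux/irq_work.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/sched.h>

/* Illustrative per-cpu governor state; the irq_work is embedded so that
 * the callback can recover the rest of the structure. */
struct gov_cpu_data {
        struct irq_work irq_work;
        struct task_struct *worker;     /* process-context side */
        unsigned long requested_capacity;
        unsigned int requested_freq;
};

static DEFINE_PER_CPU(struct gov_cpu_data, gov_cpu_data);

static void gov_irq_work_fn(struct irq_work *work)
{
        /*
         * irq_work callbacks only get the irq_work pointer, so recover
         * the enclosing per-cpu data with container_of().
         */
        struct gov_cpu_data *gcd = container_of(work, struct gov_cpu_data,
                                                irq_work);

        wake_up_process(gcd->worker);
}

/* Fast path: record the request and kick this CPU's irq_work. */
static void gov_kick(unsigned long capacity)
{
        struct gov_cpu_data *gcd = this_cpu_ptr(&gov_cpu_data);

        gcd->requested_capacity = capacity;
        irq_work_queue(&gcd->irq_work);
}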
> >
>
> This new interface looks like what I was also proposing in the
> discussion we had [1]. Are you going to send out something along these
> lines?

Yes, it will look something like your patch, though with some
differences. I'll send it out shortly.

I'm not reducing how often we kick the whole machinery; instead, I'm
adding extra points to bail out early if:

1) capacity utilization has not changed
2) the frequency would not change as a result of the capacity utilization change
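
In code form, the two bail-out points boil down to something like this,
building on the illustrative gov_cpu_data above; capacity_to_freq()
stands in for whatever maps a capacity request onto a frequency table
entry:

/* Hypothetical mapping from a capacity request to a target frequency,
 * e.g. a walk of the policy's frequency table. */
static unsigned int capacity_to_freq(int cpu, unsigned long capacity);

static void gov_update(int cpu, unsigned long capacity)
{
        struct gov_cpu_data *gcd = &per_cpu(gov_cpu_data, cpu);
        unsigned int freq;

        /* 1) capacity utilization did not change: nothing to do */
        if (capacity == gcd->requested_capacity)
                return;

        gcd->requested_capacity = capacity;

        /* 2) the change would not land us on a different OPP: bail */
        freq = capacity_to_freq(cpu, capacity);
        if (freq == gcd->requested_freq)
                return;

        gcd->requested_freq = freq;
        irq_work_queue(&gcd->irq_work);
}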

Regards,
Mike

>
> I think it would be good to have something like that. As Peter is
> saying, together with Morten's/Dietmar's series, we could try to make
> wiser decisions on when to trigger the whole machinery.
>
> Thanks,
>
> - Juri
>
> [1] https://lists.linaro.org/pipermail/eas-dev/2015-April/000129.html
>
> >> If, per 1), each CPU has already computed its utilization and provides it
> >> in the call, we should not have to recompute it and its scale factor
> >> (which btw seems done slightly oddly too; I know humans like base 10, but
> >> computers suck at it).
> >>
> >> Why not something like:
> >>
> >> usage = ((1024 + 256) * usage) >> 10; /* +25% */
> >
> > This works when the max capacity is always 1024, but that is not the
> > case on some new ARM systems. For cpus whose max capacity is less than
> > 1024 we would still need to normalize against capacity_orig_of().
> >
> > Doing the SCHED_LOAD_SCALE thing is fine for me now, but at some point
> > it will probably need to be changed by Morten/Juri/Dietmar for their EAS
> > patch series, which deals with some of these cpu capacity scale issues.
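
(For reference, the normalized version would be something along these
lines, keeping the +25% headroom but scaling against the CPU's own
maximum capacity rather than a fixed 1024. capacity_orig_of() is the
existing scheduler helper; the surrounding function name is
illustrative.)

/*
 * Scale the raw utilization into the 0..SCHED_CAPACITY_SCALE range of
 * this particular CPU before adding the +25% headroom, so CPUs whose
 * max capacity is below 1024 (e.g. little cores) are handled too.
 */
static unsigned long requested_capacity(int cpu, unsigned long usage)
{
        unsigned long max_cap = capacity_orig_of(cpu);

        /* normalize against this CPU's own maximum capacity */
        usage = usage * SCHED_CAPACITY_SCALE / max_cap;

        /* +25%, same shift trick as above */
        return ((SCHED_CAPACITY_SCALE + 256) * usage) >> SCHED_CAPACITY_SHIFT;
}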
> >
> >>
> >> old_usage = __this_cpu_read(gd->usage);
> >> __this_cpu_write(gd->usage, usage);
> >>
> >> max_usage = 0;
> >> for_each_cpu(cpu, policy->cpus)
> >>         max_usage = max(max_usage, per_cpu(gd->usage, cpu));
> >>
> >> if (max_usage < old_usage || /* dropped */
> >>     (max_usage == usage && max_usage != old_usage)) /* raised */
> >>         request_change(max_usage);
> >
> > Yes, this should work fine. I've pulled it into the patch that I'll
> > test and publish tomorrow.
> >
> > Thanks,
> > Mike
> >
> >>
> >>
> >
>