Re: [PATCH 0/2] Add exit_prepare callback to the cpufreq_driver interface.

From: Dirk Brandewie
Date: Fri Mar 14 2014 - 11:10:57 EST


On 03/13/2014 09:59 PM, Viresh Kumar wrote:
On Thu, Mar 13, 2014 at 11:06 PM, <dirk.brandewie@xxxxxxxxx> wrote:
From: Dirk Brandewie <dirk.j.brandewie@xxxxxxxxx>

Some drivers (intel_pstate) need to modify state on a core before it
is completely offline. The ->exit() callback is executed during the
CPU_POST_DEAD phase of the cpu offline process which is too late to
change the state of the core.

Patch 1 adds an optional callback function to the cpufreq_driver
interface which is called by the core during the CPU_DOWN_PREPARE
phase of cpu offline in __cpufreq_remove_dev_prepare().

Patch 2 changes intel_pstate to use the ->exit_prepare callback to do
its cleanup during cpu offline.

Copying stuff from other mail thread here so that we can discuss on a
single mail chain.

On 14 March 2014 03:09, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
On Thursday, March 13, 2014 12:56:02 PM Viresh Kumar wrote:
On 13 March 2014 08:07, Rafael J. Wysocki <rjw@xxxxxxxxxxxxx> wrote:
On Wednesday, March 12, 2014 02:27:07 PM Dirk Brandewie wrote:

I see two possibilities:
1. Move the exit() callback to __cpufreq_remove_dev_prepare(). I don't
have a good understanding of what carnage this would cause in the core
or other scaling drivers.

2. Add another callback to the cpufreq_driver interface that would be call
from __cpufreq_remove_dev_prepare() if the callback was set.

I prefer 2, the reason being that it pretty much is guaranteed not to break
anything. For the record, I'm not a fan of adding new cpufreq driver callbacks,
but in this particular case it seems we can't really do anything better.

I haven't thought a lot about which one of these two looks better, probably
Rafael might be correct. But I can see another way out here as this is very
much driver specific. Why can't we do a register_hotcpu_notifier() from
pstate driver alone?

Why would that be better than adding a new callback?

Because its becoming more and more confusing. Probably we got the problem
right but have wrong solutions for it.

But having considered this issue in detail now, I have more inputs. All Dirk and
Patrick wanted is to set core to min P-state before it gets offlined. We already
have some infrastructure in core which is called this today:
cpufreq_generic_suspend(). We can rework on that to get a more ideal solution
for both the problems.

Are you proposing adding cpufreq_generic_suspend() to the core I can not find
it in the mainline code.


Over that I don't think Dirk's solution is going to work if we twist
the systems a bit.

Could you explain what "twist the systems a bit" means?

For example, Dirk probably wanted to set P-STATE of every core to MIN
when it goes down. But his solution probably doesn't do that right now.


No, I wanted to set the core that was being off-lined to min P state.

As exit() is called only when the policy expires or all the CPUs of that policy
are down. Suppose only one CPU out of 4 goes down from a policy, then
pstate driver would never know that happened. And that core wouldn't go
to min state.

My patch does not change the semantics of exit() or when it is called. For
intel_pstate their is one cpu per policy so I moved all the cleanup to
exit_prepare() if there were work I needed to do at CPU_POST_DEAD I would have
continued to export the *optional* exit callback.

The callback name exit_prepare() was probably unfortunate and might be causing
some confusion. I will be changing the name per Rafael's feedback.

I think we have two solutions here:
- If its just about setting core a particular freq when it goes down, I think it
looks a generic enough problem and so better fix core for that. Probably with
help of flags field/suspend-freq (maybe renamed) and without calling drivers
exit() at all..

ATM the only thing that needs to be done in this case is to allow intel_pstate
to set the P state on the core when it is going done. My solution from the
cores point of view is more generic, it allows any driver that needs to do work
during CPU_DOWN_PREPARE to do it without adding any new logic to the core.


- If this is highly driver specific (which doesn't look like if all we
have to do is setting freq to MIN),

That is why intel_pstate is the *only* driver using exit_prepare(). If
other drivers need to do work during this phase of the cpu hotplug process
then they can export the callback otherwise it has no affect on existing
or future scaling drivers.

then better have something like
register_hotcpu_notifier() with priority set to -1, so that it gets called after
cpufreq.

intel_pstate or any scaling driver having its own hotcpu_notifier makes possible
for the core and the driver to disagree on the state of the driver. Having
the driver take hotplug notifications from two directions is a bad idea IMHO.


--
viresh


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/