Re: [PATCH V4 2/3] tick/cpuidle: Initialize hrtimer mode of broadcast

From: Daniel Lezcano
Date: Tue Feb 11 2014 - 17:05:36 EST


On 02/11/2014 05:09 PM, Preeti U Murthy wrote:
Hi Daniel,

Thank you very much for the review.

On 02/11/2014 03:46 PM, Daniel Lezcano wrote:
On 02/07/2014 09:06 AM, Preeti U Murthy wrote:
From: Thomas Gleixner <tglx@xxxxxxxxxxxxx>

On some architectures, in certain CPU deep idle states the local
timers stop.
An external clock device is used to wakeup these CPUs. The kernel
support for the
wakeup of these CPUs is provided by the tick broadcast framework by
using the
external clock device as the wakeup source.

However not all implementations of architectures provide such an external
clock device. This patch includes support in the broadcast framework
to handle
the wakeup of the CPUs in deep idle states on such systems by queuing
a hrtimer
on one of the CPUs, which is meant to handle the wakeup of CPUs in
deep idle states.

This patchset introduces a pseudo clock device which can be registered
by the
archs as tick_broadcast_device in the absence of a real external clock
device. Once registered, the broadcast framework will work as is for
these
architectures as long as the archs take care of the BROADCAST_ENTER
notification failing for one of the CPUs. This CPU is made the stand
by CPU to
handle wakeup of the CPUs in deep idle and it *must not enter deep
idle states*.

The CPU with the earliest wakeup is chosen to be this CPU. Hence this
way the
stand by CPU dynamically moves around and so does the hrtimer which is
queued
to trigger at the next earliest wakeup time. This is consistent with
the case where
an external clock device is present. The smp affinity of this clock
device is
set to the CPU with the earliest wakeup.

Hi Preeti,

jumping a bit late in the thread...

Setting the smp affinity on the earliest timer should be handled
automatically with the CLOCK_EVT_FEAT_DYNIRQ flag. Did you look at using
this flag ?

This patch is not setting the smp affinity of the pseudo clock device at
all. Its not required to for the reason that it does not exist.

I mentioned this point because we assign a CPU with the earliest wakeup
as standby. I compared this logic to the one used by the tick broadcast
framework for archs which have an external clock device to set the smp
affinity of the device.

If these archs do not have the flag CLOCK_EVT_FEAT_DYNIRQ set for the
external clock device, the tick broadcast framework sets the smp
affinity of this device to the CPU with the earliest wakeup. We are
using the same logic in this patchset as well to assign the stand by CPU.


Another comment is the overall approach. We enter the cpuidle idle
framework with a specific state to go to and it is the tick framework
telling us we mustn't go to this state. IMO the logic is wrong, the
decision to not enter this state should be moved somewhere else.

Its not the tick framework which tells us that we cannot enter deep idle
state, its the *tick broadcast* framework specifically. The tick
broadcast framework was introduced with the primary intention of
handling wakeup of CPUs in deep idle states when the local timers become
non-functional. Therefore there is a co-operation between this tick
broadcast framework and cpuidle. This has always been the case.

That is why just before cpus go into deep idle, they call into the
broadcast framework. Till now it was assumed that the tick broadcast
framework would find no problems with the cpus entering deep idle.
Therefore cpuidle would simply assume that all is well and go ahead and
enter deep idle state.
But today there is a scenario when there could be problems if all cpus
enter deep idle states and the tick broadcast framework now notifies the
cpuidle framework to hold back one cpu. This is just a simple extension
of the current interaction between cpuidle and tick broadcast framework.


Why don't you create a cpuidle driver with the shallow idle states
assigned to a cpu (let's say cpu0) and another one with all the deeper
idle states for the rest of the cpus ? Using the multiple cpuidle driver
support makes it possible. The timer won't be moving around and a cpu
will be dedicated to act as the broadcast timer.


Having a dedicated stand by cpu for broadcasting has some issues which
were pointed to when I posted the initial versions of this patchset.
https://lkml.org/lkml/2013/7/27/14

1. This could create power/thermal imbalance on the chip since only the
standby cpu cannot enter deep idle state at all times.

2. If it is cpu0 it is fine, else with the logic that you suggest,
hot-plugging out the dedicated stand by cpu would mean moving the work
of broadcasting to another cpu and modifying the cpuidle state table for
it. Even with cpu0, if support to hotplug it out is enabled (maybe it is
already), we will face the same issue and this gets very messy.

Wouldn't make sense and be less intrusive than the patchset you proposed ?

Actually this patchset brings in a solution that is as less intrusive as
possible. It makes the problem nearly invisible except for a failed
return from a call into the broadcast framework. It simply asks the
archs which do not have an external clock device to register a pseudo
device with the broadcast framework and from then on everything just
falls in place to enable deep idle states for such archs.

Hi Preeti,

thanks for the detailed explanation. I understand better now why you did it this way. Furthermore, as Thomas pointed me, what I missed is one cpu can't setup a local timer for another cpu, so what I was talking about does not make sense.

-- Daniel


--
<http://www.linaro.org/> Linaro.org â Open source software for ARM SoCs

Follow Linaro: <http://www.facebook.com/pages/Linaro> Facebook |
<http://twitter.com/#!/linaroorg> Twitter |
<http://www.linaro.org/linaro-blog/> Blog

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/