Re: [RFC 08/14] sched/tune: add detailed documentation

From: Ricky Liang
Date: Fri Sep 04 2015 - 04:07:27 EST


Hi Patrick,

Please find my replies inline.

On Thu, Sep 3, 2015 at 5:18 PM, Patrick Bellasi <patrick.bellasi@xxxxxxx> wrote:
> On Wed, Sep 02, 2015 at 07:49:58AM +0100, Ricky Liang wrote:
>> Hi Patrick,
>
> Hi Ricky,
>
>> I wonder if this can replace the boost function in the interactive
>> governor [0], which is widely used in both Android and ChromeOS
>> kernels.
>
> In my view, one of the main goals of sched-DVFS is to be a solid and
> generic replacement for the different CPUFreq governors.
> Being driven by the scheduler, sched-DVFS can exploit information on
> CPU demand of active tasks in order to select the optimal Operating
> Performance Point (OPP) using a "proactive" approach instead of the
> "reactive" approach commonly used by existing governors.
>
> In the current implementation proposed by this RFC, SchedTune is just
> a simple mechanism on top of sched-DVFS to bias the selection of the
> OPP.
> If a task runs only for a limited amount of time at each of its
> (sporadic) activations, it does not contribute enough CPU load to
> select a higher OPP. Thus, the actual performance (i.e. time to
> completion) of that task depends on which other tasks are co-scheduled
> with it. If it happens to be scheduled on a loaded CPU it will run
> fast; conversely, it will be slower when scheduled alone on a CPU
> running at the lowest OPP.
>
> If this task is (for whatever reason) "important" and should always
> complete an activation as soon as possible, the current situation is:
> a) we use the "performance" governor when we know that the task could
> be active, thus running the whole system in "race-to-idle" mode
> b) we use the "interactive" governor (if possible, since it is not
> in mainline) and ensure that this task pokes the "boost" attribute
> when it is active
>
> Notice that, for both these solutions:
> 1) unless we pin the task on a specific set of CPUs, we must enable
> this governor for all the frequency domains, since we do not know on
> which CPU the scheduler will end up running the task
> 2) the tuning for a single task is likely to affect the whole system:
> once the task has been started, all the tasks are going to be boosted,
> even when this task is not runnable
>
> SchedTune provides a "global tunable" which allows us to get the same
> results as a) and b), with the main advantage that only the specific
> frequency domain where the task is RUNNING is boosted. Since we do not
> need to pin the task to get this result, this can (eventually)
> simplify the modifications required in user-space while still getting
> optimal performance for the task without compromising overall system
> consumption.
>
> AFAIU, regarding specifically the boost modes supported by the
> Interactive governor:
>
> 1) the "boost" tunable is substantially similar to setting the
> SchedTune boost value to 100%. Userspace is in charge of triggering
> the start and end of a boost period
>
> 2) the "boostpulse" tunable triggers a 100% boost.
>
> The main difference is that the Interactive governor resets the
> boost after a configurable time (usually 80ms), while in SchedTune
> the boost value is asserted until released by userspace.
>
> This has advantages and disadvantages. With SchedTune, userspace
> has to release the boost explicitly. With the Interactive governor
> this is automatic, but userspace still has to define a suitable
> timeout. However, this can be different for different tasks.
>
> 3) the "boost_input" tunable is just a hook exposed to kernel drivers
> which can generate input events expected to impact the user
> experience.
> The actual implementation is similar to the previous knob.
>
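To make the difference between the two release policies concrete, here is a
rough Python model (not kernel code). The class and method names are made up
for illustration, and the 80 ms default is simply the "usually 80ms" figure
quoted above. It only assumes that a pulse boost expires after a fixed window
and that a SchedTune-style boost stays asserted until explicitly released.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PulseBoost:
    """Interactive-style pulse (hypothetical model): the boost expires by itself."""
    duration_ms: int = 80  # "usually 80ms" per the text above
    expires_at_ms: Optional[int] = None

    def pulse(self, now_ms: int) -> None:
        # Each pulse (re)starts the timeout window.
        self.expires_at_ms = now_ms + self.duration_ms

    def active(self, now_ms: int) -> bool:
        return self.expires_at_ms is not None and now_ms < self.expires_at_ms


@dataclass
class HeldBoost:
    """SchedTune-style boost (hypothetical model): held until userspace releases it."""
    boost_pct: int = 0

    def set(self, pct: int) -> None:
        # Clamp to the 0..100 percent range.
        self.boost_pct = max(0, min(100, pct))

    def release(self) -> None:
        self.boost_pct = 0

    def active(self, now_ms: int) -> bool:
        # Time plays no role here: only an explicit release ends the boost.
        return self.boost_pct > 0
```

With the pulse model, a caller that forgets about the boost gets it cleared
after the window; with the held model, a caller that forgets to call
release() leaves the boost asserted indefinitely.
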
> IMHO the "boostpulse/input_pulse" tunables are a simple solution to
> the problem of running fast to get better UI interactive response.
> Indeed, the driver/task which generates the input event is not
> necessarily the actual target of the load and/or user perceived
> response.
> Moreover, it boosts all the frequency domains regardless of where
> the actual UI related workload is running.
>
> By exploiting scheduler information on the actual workload demand of
> some tasks, we could aim at a more effective solution which boosts
> just the required CPUs and only when the task affecting the UI
> experience is actually running. This is what the "per-task" SchedTune
> boosting is trying to enable.
>
> I'm wondering if you could provide some examples to better describe
> when the "boostpulse" tunables are used in ChromiumOS.
> Maybe by starting from the description of some use-cases we could
> better understand if the tunables provided by the Interactive governor
> are really required, or if we can figure out a possibly better, even
> if different, approach to be implemented in SchedTune.
>

In addition to the "boost" or "boostpulse" that are triggered by user space, in
ChromiumOS we register an input event handler in the interactive governor
which triggers an interactive boost upon receiving any input event. The handler
causes the interactive governor to boost all CPUs, and the boost lasts until the
CPUs go idle - in other words, the boost lasts until there's no work for the
CPUs to do. Sometimes it's not trivial to tell which processes are crucial to
interactive response, so we are doing a global boost.
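
As a rough illustration of that behaviour, here is a toy Python model (not the
actual governor code). The class and method names are invented for this sketch,
and it assumes that each CPU drops its own boost when it enters idle, which is
one possible reading of "until the CPUs go idle".

```python
class InputBoostModel:
    """Toy model: an input event boosts every CPU, idle entry clears it."""

    def __init__(self, nr_cpus: int) -> None:
        self.boosted = [False] * nr_cpus

    def on_input_event(self) -> None:
        # Global boost: we do not know which tasks matter for the
        # interactive response, so every CPU is boosted.
        self.boosted = [True] * len(self.boosted)

    def on_cpu_idle(self, cpu: int) -> None:
        # Assumption: a CPU with no work left releases its own boost.
        self.boosted[cpu] = False

    def any_boosted(self) -> bool:
        return any(self.boosted)
```

The point of the model is that the release condition is "no more work" rather
than a timeout or an explicit write from user space.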

This is a use case specific to ChromiumOS, so it's probably not suitable to
be included in the mainline kernel. However, there are probably other similar
use cases out there so it's interesting to explore how SchedTune could
support this use case.

>> My understanding is that the boost in interactive governor is to
>> simply raise the OPP on selected cores.
>
> AFAIU the "boost" of the Interactive governor affects all the (online)
> CPUs. Thus if you have a multi frequency domain system (e.g.
> big.LITTLE), the Interactive governor switches to performance mode for
> all the CPUs. This makes sense since that boosting is triggered by an
> event, but it does not exploit any information on which tasks really
> need boosting and where they are executed by the scheduler.
>

You can also boost specific CPUs from user space. The boost can be enabled
at per-policy granularity. In any case you are right: the interactive governor
doesn't have context about tasks, so SchedTune can be more effective.

>> The SchedTune boost works by adding a margin to the original load of
>> a task which makes the kernel think that the task is more demanding
>> than it actually is. My intuition was that they work differently and
>> could cause different reaction in the kernel.
>
> That's absolutely true, they work differently. However, it is worth
> noting that the SchedTune boost value is "consumed" just by
> sched-DVFS, when it has to select an OPP.
> There are no other links with the scheduler and/or signals "consumed"
> by the scheduler. Specifically, all the task/RQ specific signals used
> by the scheduler are not affected by the SchedTune value.
>
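To check my understanding of "a margin added only at OPP selection time", here
is a small Python sketch (not the patch's code). The margin formula
(proportional to the remaining headroom), the 1024 scale, the OPP capacities
and all function names are assumptions made for illustration. The tracked
utilization itself is never modified; only the value used to pick the OPP is.

```python
SCALE = 1024  # assumed value representing full CPU capacity


def boosted_util(util: int, boost_pct: int) -> int:
    """Return util plus a margin proportional to the unused headroom."""
    margin = (SCALE - util) * boost_pct // 100
    return util + margin


def select_opp(capacities: list, util: int, boost_pct: int = 0) -> int:
    """Pick the lowest OPP capacity that covers the (boosted) demand."""
    demand = boosted_util(util, boost_pct)
    for cap in sorted(capacities):
        if cap >= demand:
            return cap
    return max(capacities)  # demand exceeds every OPP: use the highest
```

For example, with hypothetical OPP capacities [256, 512, 768, 1024] and a task
utilization of 200, a 0% boost selects 256, a 50% boost selects 768, and a
100% boost always selects the highest OPP, which matches the comparison with
the "boost" tunable made earlier in the thread.
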
> This is what happens in the SchedTune version presented by this
> RFC. Internally we are working on an extension which integrates the
> Energy-Aware scheduler (EAS).
> In that case you are right, the boost value could affect some
> decisions of the EAS scheduler. For example, boosted tasks could end
> up being moved to a more capable CPU of a big.LITTLE system even if
> they are not generating a big utilization.
>
>> I feel that the per-task cgroup SchedTune boost should work as
>> expected, as it only boosts a set of tasks and makes them appear
>> relatively more demanding compared to other tasks. But if the
>> SchedTune boost is applied globally to boost all the tasks in the
>> system, will it cause unnecessary task migrations, as all the tasks
>> appear to be highly demanding to the kernel?
>
> IMHO the best usage of SchedTune is via "per-task" boosting, where it
> is easier to control when the system must work at higher OPPs.
> However, this will probably require more effort in the user-space
> middleware layers to feed the scheduler with sensible information
> about task demands.
>
> Meanwhile, the current solutions are based on system-wide tuning, and
> that's why SchedTune has been proposed with support for "global"
> boosting.
>
> When we are boosting globally, the only information we are providing
> to the kernel is that we are in a rush and everything is important.
> Thus yes, small tasks could eventually end up being moved to a more
> capable CPU.
>
> However, how SchedTune is going to bias task allocation is part of our
> internal developments targeting its integration with EAS.
>
>> Specifically, my question is: When the global SchedTune boost is
>> enabled in an on-demand manner, is it possible that a light task gets
>> migrated to the big core, and in turn kicks out a heavy task
>> originally on that core?
>
> In this RFC we presented just the initial idea of task boosting with a
> solution which is generic enough to possibly replace some of the most
> commonly used CPUFreq governors (e.g. Performance, Ondemand and
> Interactive) while still being completely unrelated to the scheduler's
> decisions on task allocation.
>
> We think that the approach of posting small and self-contained updates
> can be more effective at creating consensus by working together on
> designing and building a solution which fits many different needs.
>
>> I'm wondering whether global SchedTune boost could result in a
>> "priority inversion" causing the heavy task to run on the little
>> core and the light task to run on the big core.
>
> That's an interesting point we should take into consideration for the
> design of the complete solution.
> I would prefer to postpone this discussion on the list until we
> present the next extension of SchedTune, which integrates into EAS.
>
>
>> [0]: https://android.googlesource.com/kernel/common.git/+/android-3.18/drivers/cpufreq/cpufreq_interactive.c
>>
>> Thanks,
>> Ricky
>
> Cheers,
> Patrick
>
> --
> #include <best/regards.h>
>
> Patrick Bellasi
>