Re: [RFC PATCH 00/14] sched: entity load-tracking re-work

From: Paul E. McKenney
Date: Tue Mar 13 2012 - 12:49:58 EST


On Mon, Mar 12, 2012 at 10:39:27AM +0000, Morten Rasmussen wrote:
> On Thu, Feb 02, 2012 at 01:38:26AM +0000, Paul Turner wrote:
> > As referenced above this also allows us to potentially improve decisions within
> > the load-balancer, for both distribution and power-management.
> >
> > Example: consider 1x80% task and 2x40% tasks on a 2-core machine. It's
> > currently a bit of a gamble as to whether you get an {AB, B} or {A,
> > BB} split since they have equal weight (assume 1024). With per-task
> > tracking we can actually consider them at their contributed weight and
> > see a stable ~{800,{400, 400}} load-split. Likewise within balance_tasks we can
> > consider the load migrated to be that actually contributed.
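
A minimal arithmetic sketch of the example above (plain userspace C; the
contrib() helper is purely illustrative, not a field or function from the
patch set):

    #include <stdio.h>

    /* contributed load ~= nice-0 weight * fraction of time runnable */
    static unsigned long contrib(unsigned long weight, unsigned int pct)
    {
            return weight * pct / 100;
    }

    int main(void)
    {
            unsigned long w = 1024; /* nice-0 weight */

            printf("A: %lu\n", contrib(w, 80));      /* ~819, i.e. ~800 */
            printf("B: %lu each\n", contrib(w, 40)); /* ~409, i.e. ~400 */
            /* stable split: {~800} on one CPU, {~400, ~400} on the other */
            return 0;
    }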
>
> Hi Paul (and LKML),

Hello, Morten!

> As a follow up to the discussions held during the scheduler mini-summit
> at the last Linaro Connect I would like to share what I (working for
> ARM) have observed so far in my experiments with big.LITTLE scheduling.
>
> I see task affinity on big.LITTLE systems as a combination of
> user-space affinity (via cgroups+cpuset etc) and introspective affinity
> as a result of intelligent load balancing in the scheduler. I see the
> entity load tracking in this patch set as a step towards the latter. I
> am very interested in better task profiling in the scheduler as this is
> crucial for selecting which tasks should go on which type of core.
>
> I am using the patches for some very crude experiments with scheduling
> on big.LITTLE to explore possibilities and learn about potential issues.
> What I want to achieve is that high-priority, CPU-intensive tasks will
> be scheduled on fast and less power-efficient big cores and background
> tasks will be scheduled on power-efficient little cores. At the same
> time I would also like to minimize the performance impact experienced
> by the user. The following is a summary of the observations I have
> made so far. I would appreciate comments and suggestions on the best way
> to go from here.
>
> I have set up two sched_domains of two cores each on a 4-core ARM
> system, representing the big and little clusters, and disabled load
> balancing between them. The aim is to separate heavy and high-priority
> tasks from less important tasks using the two domains. Based on
> load_avg_contrib, tasks are assigned to one of the domains by
> select_task_rq().
> However, this does not work out very well. If a task in the little
> domain suddenly consumes more CPU time and never goes to sleep, it will
> never get the chance to migrate to the big domain. On a homogeneous
> system it doesn't really matter _where_ a task goes if imbalance is
> unavoidable as all cores have equal performance. For heterogeneous
> systems like big.LITTLE it makes a huge difference. To mitigate this
> issue I am periodically checking the currently running task on each
> little core to see if a CPU-intensive task is stuck there. If there is,
> it is migrated to a core in the big domain using stop_one_cpu_nowait(),
> similar to the active load balance mechanism. It is not a pretty
> solution, so I am open to suggestions. Furthermore, by
> only checking the current task there is a chance of missing busy tasks
> waiting on the runqueue but checking the entire runqueue seems too
> expensive.
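
A rough sketch of the periodic check described above (kernel-style C as
it might live in kernel/sched/fair.c; the load threshold and the
find_idle_big_cpu() helper are hypothetical, and the reuse of
active_load_balance_cpu_stop() is my assumption, not code from this
patch set):

    /*
     * Hypothetical: if the task currently running on a little CPU
     * contributes most of a nice-0 weight, push it to an idle big CPU
     * by reusing the active load balance stopper machinery.
     */
    static void check_for_stuck_task(int cpu)
    {
            struct rq *rq = cpu_rq(cpu);
            struct task_struct *p;
            unsigned long flags;
            int target;

            raw_spin_lock_irqsave(&rq->lock, flags);
            p = rq->curr;
            if (p->sched_class == &fair_sched_class &&
                p->se.avg.load_avg_contrib > 800 && /* hypothetical threshold */
                !rq->active_balance) {
                    target = find_idle_big_cpu();   /* hypothetical helper */
                    if (target >= 0) {
                            rq->active_balance = 1;
                            rq->push_cpu = target;
                            raw_spin_unlock_irqrestore(&rq->lock, flags);
                            stop_one_cpu_nowait(cpu,
                                            active_load_balance_cpu_stop,
                                            rq, &rq->active_balance_work);
                            return;
                    }
            }
            raw_spin_unlock_irqrestore(&rq->lock, flags);
    }

Note that, as described above, only rq->curr is inspected, so a busy
task sitting further back on the runqueue would still be missed.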
>
> My observations are based on a simple mobile workload modelling web
> browsing: basically two threads waking up occasionally to render
> a web page. Using my current setup, the most CPU-intensive of the two
> will be scheduled on the big cluster as intended. The remaining
> background threads stay on the little cluster, leaving the big
> cluster idle to save power when it is not rendering. The
> task-stuck-on-little problem is most easily observed with
> CPU-intensive workloads such as the sysbench CPU workload.
>
> I have looked at traces of both runnable time and usage time trying to
> understand why you use runnable time as your load metric and not usage
> time which seems more intuitive. What I see is that runnable time
> depends on the total runqueue load. If you have many tasks on the
> runqueue they will wait longer and therefore have higher individual
> load_avg_contrib than they would if they were scheduled across more CPUs.
> Usage time is also affected by the number of tasks on the runqueue as
> more tasks means less CPU time. However, less usage can also just mean
> that the task does not execute very often. This would make a load
> contribution estimate based on usage time less accurate. Is this your
> reason for choosing runnable time?
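
A minimal numeric sketch of the distinction (plain userspace C; the
y^32 = 1/2 decay and ~1ms accumulation period follow the patch set's
description, the rest is illustrative): a task that is always runnable
but shares its CPU with one other task waits half the time, so its
runnable fraction saturates near 1.0 while its usage fraction stays
near 0.5.

    #include <stdio.h>

    /* decay a per-period sum by y, where y^32 = 1/2 */
    static double decay(double sum)
    {
            return sum * 0.978572; /* y = 0.5^(1/32) */
    }

    int main(void)
    {
            double runnable = 0.0, running = 0.0, period = 0.0;
            int t;

            for (t = 0; t < 1000; t++) {                 /* ~1ms periods */
                    runnable = decay(runnable) + 1.0;    /* on the runqueue */
                    running  = decay(running) + 0.5;     /* actually executing */
                    period   = decay(period) + 1.0;
            }
            printf("runnable fraction: %.2f\n", runnable / period); /* ~1.00 */
            printf("usage fraction:    %.2f\n", running / period);  /* ~0.50 */
            return 0;
    }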

It might be a tradeoff between accuracy of scheduling and CPU cost of
scheduling, but I have to defer to Peter Z, Paul Turner, and the rest
of the scheduler guys on this one.

Thanx, Paul

> Do you have any thoughts or comments on how entity load tracking could
> be applied to introspectively select tasks for appropriate CPUs in
> system like big.LITTLE?
>
> Best regards,
> Morten
>
