Re: [PATCH?] Livelock in pick_next_task_fair() / idle_balance()
From: Morten Rasmussen
Date: Thu Jul 02 2015 - 07:38:11 EST
On Thu, Jul 02, 2015 at 09:05:39AM +0800, Yuyang Du wrote:
> Hi Mike,
>
> On Thu, Jul 02, 2015 at 10:05:47AM +0200, Mike Galbraith wrote:
> > On Thu, 2015-07-02 at 07:25 +0800, Yuyang Du wrote:
> >
> > > That being said, it is also obvious to prevent the livelock from happening:
> > > idle pulling until the source rq's nr_running is 1, because otherwise we
> > > just avoid idleness by making another idleness.
> >
> > Yeah, but that's just the symptom, not the disease. Better for the idle
> > balance symptom may actually be to only pull one when idle balancing.
> > After all, the immediate goal is to find something better to do than
> > idle, not to achieve continual perfect (is the enemy of good) balance.
> >
> Symptom? :)
>
> You mean "pull one and stop, can't be greedy"? Right, but still need to
> assure you don't make another idle CPU (meaning until nr_running == 1), which
> is the cure to disease.
>
> I am ok with at most "pull one", but probably we stick to the load_balance()
> by pulling a fair amount, assuming load_balance() magically computes the
> right imbalance, otherwise you may have to do multiple "pull one"s.
Talking about the disease and looking at the debug data that Rabin has
provided, I think the problem is due to the way blocked load is handled
(or not handled) in calculate_imbalance().
We have three entities in the root cfs_rq on cpu1:
1. Task entity pid 7, load_avg_contrib = 5.
2. Task entity pid 30, load_avg_contrib = 10.
3. Group entity, load_avg_contrib = 118, but contains task entity pid
413 further down the hierarchy with task_h_load() = 0. The 118 comes
from the blocked load contribution in the system.slice task group.
calculate_imbalance() works out the average loads as:
cpu0: load/capacity = 0*1024/1024 = 0
cpu1: load/capacity = (5 + 10 + 118)*1024/1024 = 133
domain: load/capacity = (0 + 133)*1024/(2*1024) = 66
env->imbalance = 66
Rabin reported env->imbalance = 60 after pulling the rcu task with
load_avg_contrib = 5. It doesn't match my numbers exactly, but it is pretty
close ;-)
detach_tasks() will attempt to pull 66 based on the tasks' task_h_load(), but
the task_h_load() sum is only 5 + 10 + 0 = 15 and hence detach_tasks() will
empty the src_rq.
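The load/capacity arithmetic above can be reproduced with a small user-space sketch. This is only an approximation of what calculate_imbalance() does, not the kernel code: struct cpu_stat, scaled_load() and simple_imbalance() are hypothetical names made up for illustration, and the sketch assumes a two-cpu domain with one group per cpu.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* Hypothetical per-cpu summary for illustration; not a kernel struct. */
struct cpu_stat {
	unsigned long load;     /* summed load_avg_contrib on the root cfs_rq */
	unsigned long capacity; /* 1024 == one full cpu */
};

/* "load/capacity" as used above: load scaled to a capacity of 1024. */
static unsigned long scaled_load(unsigned long load, unsigned long capacity)
{
	assert(capacity > 0);
	return load * SCHED_CAPACITY_SCALE / capacity;
}

/*
 * Simplified imbalance: the smaller of what the busiest cpu has above
 * the domain average and what the local cpu lacks below it.
 * Integer division truncates, as it would in the kernel.
 */
static unsigned long simple_imbalance(struct cpu_stat local,
				      struct cpu_stat busiest)
{
	unsigned long domain = scaled_load(local.load + busiest.load,
					   local.capacity + busiest.capacity);
	unsigned long l = scaled_load(local.load, local.capacity);
	unsigned long b = scaled_load(busiest.load, busiest.capacity);
	unsigned long above, below;

	if (b <= domain || domain <= l)
		return 0; /* nothing worth pulling */

	above = (b - domain) * busiest.capacity;
	below = (domain - l) * local.capacity;
	return (above < below ? above : below) / SCHED_CAPACITY_SCALE;
}
```

With cpu0 = { 0, 1024 } as local and cpu1 = { 5 + 10 + 118, 1024 } as busiest, simple_imbalance() returns 66. Note that 118 of the 133 on cpu1 is blocked load that no runnable task can carry across.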
IOW, since task groups include blocked load in the load_avg_contrib (see
__update_group_entity_contrib() and __update_cfs_rq_tg_load_contrib()), the
imbalance includes blocked load and hence env->imbalance can be >=
sum(task_h_load(p)) for all tasks p on the rq. This leads to
detach_tasks() emptying the rq completely in the reported scenario, where
blocked load > runnable load.
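To make the emptying effect concrete, here is a rough user-space model of the pull loop. It is a sketch under simplifying assumptions, not the real detach_tasks(): sim_detach_tasks() is a hypothetical name, the h_load[] array stands in for task_h_load(p), and checks such as can_migrate_task() and the loop limits are left out.

```c
#include <stddef.h>

/*
 * Simplified model of the detach_tasks() loop: walk the tasks on the
 * source rq and detach each one whose load fits the remaining imbalance.
 * Returns the number of tasks detached.
 */
static size_t sim_detach_tasks(const long *h_load, size_t nr_running,
			       long imbalance)
{
	size_t detached = 0;

	for (size_t i = 0; i < nr_running; i++) {
		long load = h_load[i];

		if (load / 2 > imbalance)
			continue; /* too heavy for what is left to move */

		detached++;       /* stands in for detach_task() */
		imbalance -= load;

		if (imbalance <= 0)
			break;    /* moved enough load */
	}
	return detached;
}
```

With h_load = { 5, 10, 0 } and an imbalance of 66, the loop never reaches imbalance <= 0, so it returns 3, i.e. every task on the rq is detached. With an imbalance of 15 (the runnable load only) it stops after two tasks.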
Whether emptying the src_rq is the right thing to do depends on your
point of view. Does balanced load (runnable+blocked) take priority over
keeping cpus busy or not? For idle_balance() it seems intuitively
correct to not empty the rq and hence you could consider env->imbalance
to be too big.
I think we will see more problems of this kind if we include
weighted_cpuload() as well. Parts of the imbalance calculation code are
quite old and could use some attention first.
A short term fix could be what Yuyang proposes: stop pulling tasks when
there is only one left in detach_tasks(). It won't affect active load
balance, where we may want to migrate the last task, as active load
balance doesn't use detach_tasks().
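As a sketch of that short term fix, the same kind of simplified user-space loop can be given a guard that leaves the last task alone. This is not a patch: sim_detach_tasks_keep_one() is a hypothetical name, h_load[] stands in for task_h_load(p), and where exactly the check would sit in the real detach_tasks() is an assumption.

```c
#include <stddef.h>

/*
 * Simplified pull loop with the proposed guard: stop as soon as only
 * one task would remain on the source rq.
 * Returns the number of tasks detached.
 */
static size_t sim_detach_tasks_keep_one(const long *h_load,
					size_t nr_running, long imbalance)
{
	size_t detached = 0;

	for (size_t i = 0; i < nr_running; i++) {
		long load = h_load[i];

		/* Proposed guard: never take the last task off the rq. */
		if (nr_running - detached <= 1)
			break;

		if (load / 2 > imbalance)
			continue; /* too heavy for what is left to move */

		detached++;       /* stands in for detach_task() */
		imbalance -= load;

		if (imbalance <= 0)
			break;    /* moved enough load */
	}
	return detached;
}
```

With h_load = { 5, 10, 0 } and an imbalance of 66 this returns 2, so one task stays on the source rq even though the imbalance is never satisfied. With a single task on the rq it returns 0.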
Morten