Re: [PATCH v2] sched/fair: Fix insertion in rq->leaf_cfs_rq_list
From: Peter Zijlstra
Date: Wed Jan 30 2019 - 08:04:23 EST
On Wed, Jan 30, 2019 at 06:22:47AM +0100, Vincent Guittot wrote:
> The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that
> it will walk down to the root the first time a cfs_rq is used, and that we
> will finish by adding either a cfs_rq without a parent or a cfs_rq whose
> parent is already on the list. But this is not always true in the presence
> of throttling.
> Because a cfs_rq can be throttled even if it has never been used, when other
> CPUs of the cgroup have already used all the bandwidth, we are not guaranteed
> to walk down to the root and add every cfs_rq to the list.
>
> Ensure that all cfs_rq will be added in the list even if they are throttled.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e2ff4b6..826fbe5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -352,6 +352,20 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  	}
>  }
>
> +static inline void list_add_branch_cfs_rq(struct sched_entity *se, struct rq *rq)
> +{
> +	struct cfs_rq *cfs_rq;
> +
> +	for_each_sched_entity(se) {
> +		cfs_rq = cfs_rq_of(se);
> +		list_add_leaf_cfs_rq(cfs_rq);
> +
> +		/* If parent is already in the list, we can stop */
> +		if (rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
> +			break;
> +	}
> +}
> +
>  /* Iterate through all leaf cfs_rq's on a runqueue: */
>  #define for_each_leaf_cfs_rq(rq, cfs_rq) \
>  	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> @@ -5179,6 +5197,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>  	}
>
> +	/* Ensure that all cfs_rq have been added to the list */
> +	list_add_branch_cfs_rq(se, rq);
> +
>  	hrtick_update(rq);
>  }

So I don't much like this; at all. But maybe I misunderstand, this is
somewhat tricky stuff and I've not looked at it in a while.

So per normal we do:

	enqueue_task_fair()
	  for_each_sched_entity() {
	    if (se->on_rq)
	      break;
	    enqueue_entity()
	      list_add_leaf_cfs_rq();
	  }

This ensures that all parents are already enqueued, right? Because this
is what enqueues those parents.
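
For illustration only, here is a minimal userspace sketch of that walk. It is
not kernel code: struct toy_se, its fields and the toy_* helpers are made up
for this example, and the leaf list is reduced to a per-level flag. Under
those assumptions it shows why the loop leaves every level of the branch on
the list, whether it walks to the root or stops at an already-enqueued parent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for one sched_entity and the cfs_rq it is queued on. */
struct toy_se {
	struct toy_se *parent;	/* NULL at the root level */
	bool on_rq;		/* entity is already enqueued */
	bool cfs_rq_on_list;	/* its cfs_rq is on the (toy) leaf list */
};

/*
 * Walk towards the root, enqueueing each level until an entity that is
 * already enqueued is found. Returns the number of levels enqueued.
 */
static int toy_enqueue(struct toy_se *se)
{
	int levels = 0;

	for (; se; se = se->parent) {
		if (se->on_rq)
			break;
		se->on_rq = true;		/* models enqueue_entity() */
		se->cfs_rq_on_list = true;	/* models list_add_leaf_cfs_rq() */
		levels++;
	}
	return levels;
}

/* True if every level from se up to the root is on the toy list. */
static bool toy_branch_complete(const struct toy_se *se)
{
	for (; se; se = se->parent)
		if (!se->cfs_rq_on_list)
			return false;
	return true;
}

/* Two tasks in one group: the second walk stops at the enqueued group. */
int toy_demo(void)
{
	struct toy_se root = { NULL, false, false };
	struct toy_se grp = { &root, false, false };
	struct toy_se a = { &grp, false, false };
	struct toy_se b = { &grp, false, false };
	int first = toy_enqueue(&a);	/* task, group and root */
	int second = toy_enqueue(&b);	/* task only */

	assert(first == 3);
	assert(toy_branch_complete(&a) && toy_branch_complete(&b));
	return second;
}
```

In this model the early break on se->on_rq is safe because an enqueued entity
was itself put there by an earlier pass of the same walk.
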
And in this case you add an unconditional second
for_each_sched_entity(); even though it is completely redundant, afaict.

The problem seems to stem from the whole throttled crud; which (also)
breaks out of the above enqueue loop on throttle state, and there the
parent can go missing.

So why doesn't this live in unthrottle_cfs_rq() ?
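
To make the question concrete, a toy userspace model (again not kernel code;
struct lvl and the lvl_* helpers are hypothetical names, and the unthrottle
step is only one assumed way the walk could be resumed) of an enqueue walk
that stops at a throttled level, leaving the levels above it off the list
until something continues the walk:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy level: one sched_entity plus the cfs_rq it sits on. */
struct lvl {
	struct lvl *parent;	/* NULL at the root */
	bool on_rq;		/* entity already enqueued */
	bool throttled;		/* toy version of cfs_rq_throttled() */
	bool on_list;		/* cfs_rq is on the toy leaf list */
};

/* Enqueue walk that also stops once it has enqueued on a throttled level. */
static void lvl_enqueue(struct lvl *l)
{
	for (; l; l = l->parent) {
		if (l->on_rq)
			break;
		l->on_rq = true;
		l->on_list = true;
		if (l->throttled)
			break;	/* ancestors are never visited */
	}
}

/* Count the levels from l up to the root that are missing from the list. */
static int lvl_missing(const struct lvl *l)
{
	int n = 0;

	for (; l; l = l->parent)
		if (!l->on_list)
			n++;
	return n;
}

/* Toy unthrottle (an assumption): clear the flag, resume the walk upwards. */
static void lvl_unthrottle(struct lvl *l)
{
	l->throttled = false;
	lvl_enqueue(l->parent);
}

/* A group throttled before first use: returns the gap seen before unthrottle. */
int lvl_demo(void)
{
	struct lvl root = { NULL, false, false, false };
	struct lvl grp = { &root, false, true, false };
	struct lvl task = { &grp, false, false, false };
	int before;

	lvl_enqueue(&task);		/* stops at grp, root is skipped */
	before = lvl_missing(&task);
	lvl_unthrottle(&grp);		/* walk resumes from grp's parent */
	assert(lvl_missing(&task) == 0);
	return before;
}
```

In this sketch the gap is closed at the point where the throttle state changes,
rather than by a second unconditional walk on every enqueue.
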