Re: Crash in list_add_leaf_cfs_rq due to bad tmp_alone_branch

From: Vincent Guittot
Date: Fri Jan 25 2019 - 09:31:57 EST


Hi Sargun,

On Mon, 21 Jan 2019 at 15:46, Vincent Guittot
<vincent.guittot@xxxxxxxxxx> wrote:
>
> Hi Sargun,
>
> On Friday 18 Jan 2019 at 15:06:28 (+0100), Vincent Guittot wrote:
> > On Fri, 18 Jan 2019 at 11:16, Vincent Guittot
> > <vincent.guittot@xxxxxxxxxx> wrote:
> > >
> > > On Wed, 9 Jan 2019 at 23:43, Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> > > >
> > > > On Wed, Jan 9, 2019 at 2:14 PM Sargun Dhillon <sargun@xxxxxxxxx> wrote:
> > > > >
> > > > > I picked up c40f7d74c741a907cfaeb73a7697081881c497d0 sched/fair: Fix
> > > > > infinite loop in update_blocked_averages() by reverting a9e7f6544b9c
> > > > > and put it on top of 4.19.13. In addition to this, I uninlined
> > > > > list_add_leaf_cfs_rq for debugging.
> >
> > With the fix above applied, the code that manages the leaf_cfs_rq_list
> > has been the same since v4.9.
> > Have you noticed a similar problem on other, older kernel versions between
> > v4.9 and v4.19? The problem might have been introduced while modifying
> > another part of the scheduler, such as the sequence for adding/removing
> > cgroups.
> >
> > Knowing the most recent kernel version without the problem could help
> > to narrow it down.
> >
> > Thanks,
> > Vincent
> >
> > > > >
> > > > > This revealed a new bug that we didn't get to because we kept getting
> > > > > crashes from the previous issue. When we are running with cgroups that
> > > > > are rapidly changing, with CFS bandwidth control, and in addition
> > > > > using the cpusets cgroup, we see this crash. Specifically, it seems to
> > > > > occur with cgroups that are throttled and we change the allowed
> > > > > cpuset.
> > >
> > > Thanks for the context. I will try to reproduce the problem and
> > > understand how we can stop in the middle of walking the
> > > sched_entity branch while a parent has not yet been added.
> > >
> > > How many cgroup levels do you have in your setup?
> > >
> > > > >
> > > >
> > > > This patch from Gabriel should fix the problem:
> > > >
> > > >
> > > > [PATCH] sched/fair: Reset tmp_alone_branch on cfs_rq delete
> > > >
> > > > When a child cfs_rq is added to the leaf cfs_rq list before its parent,
> > > > tmp_alone_branch is set to point to the child in preparation for the
> > > > parent being added.
> > > >
> > > > If the child is deleted before the parent is added, then tmp_alone_branch
> > > > points to a freed cfs_rq. Any future reference to tmp_alone_branch will
> > > > result in a use-after-free.
> > >
> > > So, the patch below is a temporary fix that helps to recover from the
> > > situation where tmp_alone_branch doesn't end up pointing back to
> > > rq->leaf_cfs_rq_list.
> > > But this situation should not happen in the first place.
>
> I have been able to reproduce the situation where tmp_alone_branch doesn't
> point to rq->leaf_cfs_rq_list after enqueuing a task.
>
> Can you try the patch below, which ensures that all cfs_rq of a cgroup branch
> are added to the list even if throttled?

Did you get a chance to test this patch?

Regards,
Vincent

>
> The algorithm used to order cfs_rq in rq->leaf_cfs_rq_list assumes that
> it will walk down to root the 1st time a cfs_rq is used and that we will finish
> by adding either a cfs_rq without a parent or a cfs_rq with a parent that is
> already on the list. But this is not always true in the presence of throttling.
> Because a cfs_rq can be throttled even if it has never been used, when other CPUs
> of the cgroup have already used all the bandwidth, we are not sure to go down to
> the root and add all cfs_rq to the list.
>
> Ensure that all cfs_rq will be added to the list even if they are throttled.
>
> Signed-off-by: Vincent Guittot <vincent.guittot@xxxxxxxxxx>
> ---
> kernel/sched/fair.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6483834..ae468ab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -352,6 +352,20 @@ static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
>  	}
>  }
>
> +static inline void list_add_branch_cfs_rq(struct sched_entity *se, struct rq *rq)
> +{
> +	struct cfs_rq *cfs_rq;
> +
> +	for_each_sched_entity(se) {
> +		cfs_rq = cfs_rq_of(se);
> +		list_add_leaf_cfs_rq(cfs_rq);
> +
> +		/* If parent is already in the list, we can stop */
> +		if (rq->tmp_alone_branch == &rq->leaf_cfs_rq_list)
> +			break;
> +	}
> +}
> +
>  /* Iterate through all leaf cfs_rq's on a runqueue: */
>  #define for_each_leaf_cfs_rq(rq, cfs_rq) \
>  	list_for_each_entry_rcu(cfs_rq, &rq->leaf_cfs_rq_list, leaf_cfs_rq_list)
> @@ -5177,6 +5191,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>
>  	}
>
> +	/* Ensure that all cfs_rq have been added to the list */
> +	list_add_branch_cfs_rq(se, rq);
> +
>  	hrtick_update(rq);
>  }
>
>
>
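For readers following the thread, the ordering problem described in the patch
above can be sketched in user space. This is only a simplified model, not the
kernel code: the toy_* types and functions are hypothetical names, the list
helpers stand in for the kernel's list_head API, and "throttling" is modelled
simply as an enqueue that stops after adding the leaf. It shows
tmp_alone_branch left away from the list head after a partial add, and
returning to the head once the whole branch is walked.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list (stand-in for struct list_head). */
struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }

/* Insert 'item' right after 'pos'. */
static void list_add_after(struct list_head *item, struct list_head *pos)
{
	item->next = pos->next;
	item->prev = pos;
	pos->next->prev = item;
	pos->next = item;
}

/* Insert 'item' right before 'pos' (the tail when 'pos' is the head). */
static void list_add_before(struct list_head *item, struct list_head *pos)
{
	list_add_after(item, pos->prev);
}

struct toy_rq {
	struct list_head leaf_list;
	struct list_head *tmp_alone_branch; /* tail of a not-yet-connected branch */
};

struct toy_cfs_rq {
	struct list_head node;
	struct toy_cfs_rq *parent; /* NULL for the root */
	struct toy_rq *rq;
	bool on_list;
};

/* Hypothetical model of the three insertion cases discussed in the thread. */
static void toy_list_add_leaf(struct toy_cfs_rq *c)
{
	struct toy_rq *rq = c->rq;

	assert(rq != NULL);
	if (c->on_list)
		return;
	if (c->parent && c->parent->on_list) {
		/* Parent already listed: child goes before it, branch connected. */
		list_add_before(&c->node, &c->parent->node);
		rq->tmp_alone_branch = &rq->leaf_list;
	} else if (!c->parent) {
		/* Root: goes to the tail, branch connected. */
		list_add_before(&c->node, &rq->leaf_list);
		rq->tmp_alone_branch = &rq->leaf_list;
	} else {
		/* Parent not listed yet: start or extend the detached branch. */
		list_add_after(&c->node, rq->tmp_alone_branch);
		rq->tmp_alone_branch = &c->node;
	}
	c->on_list = true;
}

/* Walk up the hierarchy until the branch is connected to the list. */
static void toy_list_add_branch(struct toy_cfs_rq *c)
{
	for (; c; c = c->parent) {
		toy_list_add_leaf(c);
		if (c->rq->tmp_alone_branch == &c->rq->leaf_list)
			break;
	}
}

/* Returns true when tmp_alone_branch is back at the list head. */
bool toy_scenario(bool walk_whole_branch)
{
	struct toy_rq rq;
	struct toy_cfs_rq root = { .parent = NULL, .rq = &rq };
	struct toy_cfs_rq mid = { .parent = &root, .rq = &rq };
	struct toy_cfs_rq leaf = { .parent = &mid, .rq = &rq };

	list_init(&rq.leaf_list);
	rq.tmp_alone_branch = &rq.leaf_list;

	/* Enqueue stops early, as if 'mid' were throttled: only 'leaf' is added. */
	toy_list_add_leaf(&leaf);
	if (walk_whole_branch)
		toy_list_add_branch(&leaf);

	return rq.tmp_alone_branch == &rq.leaf_list;
}
```

In this model, toy_scenario(false) leaves the cursor pointing at the leaf,
while toy_scenario(true) ends with the list ordered leaf, mid, root and the
cursor back at the head.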
> > >
> > >
> > > >
> > > > Signed-off-by: Gabriel Hartmann <gabriel.hartmann@xxxxxxxxx>
> > > > Reported-by: Sargun Dhillon <sargun@xxxxxxxxx>
> > > > ---
> > > > kernel/sched/fair.c | 5 +++++
> > > > 1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 7137bc343b4a..0987629cbb76 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -347,6 +347,11 @@ static inline void list_add_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > > >  static inline void list_del_leaf_cfs_rq(struct cfs_rq *cfs_rq)
> > > >  {
> > > >  	if (cfs_rq->on_list) {
> > > > +		struct rq *rq = rq_of(cfs_rq);
> > > > +
> > > > +		if (rq->tmp_alone_branch == &cfs_rq->leaf_cfs_rq_list)
> > > > +			rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
> > > > +
> > > >  		list_del_rcu(&cfs_rq->leaf_cfs_rq_list);
> > > >  		cfs_rq->on_list = 0;
> > > >  	}
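
The effect of the reset in the hunk above can be illustrated with a small
user-space model. This is a sketch under stated assumptions, not the kernel
implementation: the dl_* and mini_* names are hypothetical, the list is a
plain doubly linked list without RCU, and the "freed" cfs_rq is just a stack
object whose address is compared, never dereferenced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Minimal circular doubly linked list (stand-in for struct list_head). */
struct dl_node { struct dl_node *next, *prev; };

static void dl_init(struct dl_node *h) { h->next = h; h->prev = h; }

/* Insert 'item' right after 'pos'. */
static void dl_add_after(struct dl_node *item, struct dl_node *pos)
{
	item->next = pos->next;
	item->prev = pos;
	pos->next->prev = item;
	pos->next = item;
}

/* Unlink 'item' from its list. */
static void dl_del(struct dl_node *item)
{
	item->prev->next = item->next;
	item->next->prev = item->prev;
	item->next = NULL;
	item->prev = NULL;
}

struct mini_rq {
	struct dl_node leaf_list;
	struct dl_node *tmp_alone_branch;
};

struct mini_cfs_rq {
	struct dl_node node;
	struct mini_rq *rq;
	bool on_list;
};

/* Add a child whose parent is not listed yet: it becomes the branch tail. */
static void mini_add_orphan(struct mini_cfs_rq *c)
{
	assert(!c->on_list);
	dl_add_after(&c->node, c->rq->tmp_alone_branch);
	c->rq->tmp_alone_branch = &c->node;
	c->on_list = true;
}

/* Delete 'c'; with 'reset_cursor' set, apply the reset from the patch. */
static void mini_del(struct mini_cfs_rq *c, bool reset_cursor)
{
	if (!c->on_list)
		return;
	if (reset_cursor && c->rq->tmp_alone_branch == &c->node)
		c->rq->tmp_alone_branch = &c->rq->leaf_list;
	dl_del(&c->node);
	c->on_list = false;
}

/* Returns true if tmp_alone_branch still points at the deleted node. */
bool mini_cursor_dangles(bool reset_cursor)
{
	struct mini_rq rq;
	struct mini_cfs_rq child = { .rq = &rq };

	dl_init(&rq.leaf_list);
	rq.tmp_alone_branch = &rq.leaf_list;

	mini_add_orphan(&child);       /* parent never gets added */
	mini_del(&child, reset_cursor); /* child goes away first */

	return rq.tmp_alone_branch == &child.node;
}
```

Without the reset, the cursor keeps the address of the removed node, which in
the kernel would be freed memory; with the reset, it falls back to the list
head. As noted earlier in the thread, this only recovers from the situation
and does not explain why the branch was left incomplete.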