Re: [PATCH] sched: fix infinity loop in update_blocked_averages

From: Linus Torvalds
Date: Thu Dec 27 2018 - 20:37:17 EST


On Thu, Dec 27, 2018 at 5:15 PM Tejun Heo <tj@xxxxxxxxxx> wrote:
>
> I'm pretty sure enqueue_entity() *has* to be called with the rq lock
> held. unthrottle_cfs_rq() is called from tg_set_cfs_bandwidth(),
> distribute_cfs_runtime() and unthrottle_offline_cfs_rqs(). The first
> two grab the rq_lock just around the calls and the last one has a
> lockdep assert on the rq_lock. What am I missing?

No, I think you're right. I just didn't follow things deeply enough:
I didn't see any rq locking in the loop in unthrottle_offline_cfs_rqs(),
and didn't realize that the rq is locked by the caller.

> > But that still makes me go "how come this was only noticed 18 months
> > after the fact"?
>
> Unless I'm totally confused, which is definitely possible, I don't
> think there's a race condition and the only bug is the
> tmp_alone_branch pointer being left dangling, which maybe doesn't
> happen all that much?

Ahh. That would explain the list corruption. The next
list_add_leaf_cfs_rq() could try to add to a removed entry.

How would you reset it? Do something like

    rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;

for every removal, or make it conditional on it matching the removed entry?
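
To make the two options concrete, here is a minimal userspace sketch. It
is not the scheduler code: struct toy_rq, the toy_list_* helpers and both
toy_del_leaf_* functions are hypothetical stand-ins that only borrow the
field names tmp_alone_branch and leaf_cfs_rq_list from the discussion.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's circular doubly linked list. */
struct list_head {
	struct list_head *next, *prev;
};

static void toy_list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert entry right after pos. */
static void toy_list_add(struct list_head *entry, struct list_head *pos)
{
	entry->next = pos->next;
	entry->prev = pos;
	pos->next->prev = entry;
	pos->next = entry;
}

/* Unlink entry; its own pointers become unusable afterwards. */
static void toy_list_del(struct list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
	entry->next = NULL;
	entry->prev = NULL;
}

/* Hypothetical, stripped-down stand-in for struct rq. */
struct toy_rq {
	struct list_head leaf_cfs_rq_list;
	struct list_head *tmp_alone_branch;
};

static void toy_rq_init(struct toy_rq *rq)
{
	toy_list_init(&rq->leaf_cfs_rq_list);
	rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
}

/* Option 1: reset the insertion cursor on every removal. */
static void toy_del_leaf_always(struct toy_rq *rq, struct list_head *entry)
{
	toy_list_del(entry);
	rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
}

/* Option 2: reset only if the cursor points at the entry being removed. */
static void toy_del_leaf_if_match(struct toy_rq *rq, struct list_head *entry)
{
	if (rq->tmp_alone_branch == entry)
		rq->tmp_alone_branch = &rq->leaf_cfs_rq_list;
	toy_list_del(entry);
}
```

In this toy model, both variants guarantee that the cursor never points
at an unlinked entry, so a later toy_list_add() at rq->tmp_alone_branch
stays inside the list. The conditional variant leaves the cursor alone
when an unrelated entry is removed; whether that matters for the real
code depends on what the scheduler expects tmp_alone_branch to hold at
that point.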

Linus