Re: [PATCH 2/2] sched: Rewrite per entity runnable load average tracking

From: bsegall
Date: Wed Jul 09 2014 - 13:08:52 EST

Yuyang Du <yuyang.du@xxxxxxxxx> writes:

> Thanks, Ben.
> On Tue, Jul 08, 2014 at 10:04:22AM -0700, bsegall@xxxxxxxxxx wrote:
>> > The sampling of cfs_rq->load.weight should be equivalent to the current
>> > code in that at the end of day cfs_rq->load.weight worth of runnable would
>> > contribute to runnable_load_avg/blocked_load_avg for both the current and
>> > the rewrite.
>> Yes, but the point is that it looks at load.weight when delta/1024 > 0
>> and assumes that it has had that load.weight the entire time, when this
>> might not actually be true. The largest error I can think of is if you
>> have a very high weight task that tends to run for very short periods of
>> time, you could end up losing their entire contribution to load. This
>> may be acceptable, I'm not certain.
> That I believe is not a problem. It is a matter of when cfs_rq->load.weight
> changes and when we look at it to contribute to the cfs_rq's load_avg.
> Fortunately, we will not miss any change of cfs_rq->load.weight, always
> contributing to the load_avg the right amount. Put another way, we always
> use the right cfs_rq->load.weight.

You always call __update_load_avg with every needed load.weight, but if
now - sa->last_update_time < 1024, then it will not do anything with
that weight, and the next actual update may be with a different weight.

>> > So left are the migrate_task_rq_fair() not holding lock issue and cfs_rq->avg.load_avg
>> > overflow issue. I need some time to study them.
>> >
>> > Overall, I think none of these issues are originally caused by combination/split
>> > of runnable and blocked. It is just a matter of how synchronized we want to be
>> > (this rewrite is the most synchronized), and the workaround I need to borrow from the
>> > current codes.
>> If you can manage to find a to deal with the migration issue without it,
>> that would be great, I'm just pretty sure that's why we did the split.
> After thinking about it the whole morning, this issue is a killer... :)
> But I think, even by spliting and by using atomic operations, the current code
> does not substract the migrating task's load absolutely synchronized:
> When se migrating, you atomic_read cfs_rq->decay_counter (say t1), and then atomic_add
> the se's load_avg_contrib to removed_load. However, when updating the cfs_rq->blocked_load_avg
> it is not guranteed when substracting the removed_load from the blocked_load_avg,
> the actual decay_counter (say t2) is the same as t1. Because at t1 time (when you
> atomic_read it), the decay_counter can be being updated (before you atomic_add to
> removed_load). Atomic operation does not help synchronize them. Right?
> So essentially, to perfectly substract the right amount, we must first synchronize
> the migrating task with its cfs_rq (catch up), and then substract it right away.
> Those two steps must be synchronized (with lock) or atomically done at once. Considering
> migrating and load averaging are not sequential (strict interleaving). Separating the
> two steps but synchronizing them is not enough, they must be done
> atomically.

Yeah, I don't think it is actually entirely safe, and looking at it
again, there's not even any guarantee that the decay_counter read is
anything close to current, even on something like x86, which is... bad.
I know this is why subtract_blocked_load_contrib makes sure to clamp at

To do this actually correctly, you'd need cmpxchg16b or locking. Given
that we end up pulling a cacheline to do the removed_load add RMW
anyway, it might be reasonable to do something like that, I don't know.

> That is chalenging... Can someone (Peter) grant us a lock of the remote rq? :)
> Thanks,
> Yuyang
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at