Re: [PATCH v3 03/12] sched/fair: Change the variable to hold the number of periods to 32bit
From: Morten Rasmussen
Date: Tue May 10 2016 - 05:10:21 EST
On Fri, May 06, 2016 at 02:19:32AM +0800, Yuyang Du wrote:
> Hi Morten,
>
> On Thu, May 05, 2016 at 12:13:10PM +0100, Morten Rasmussen wrote:
> > On Wed, May 04, 2016 at 04:02:44AM +0800, Yuyang Du wrote:
> > > In sched average update, a period is about 1ms, so a 32-bit unsigned
> > > integer can approximately hold a maximum of 49 (=2^32/1000/3600/24)
> > > days.
> > >
> > > For usual cases, 32bit is big enough and 64bit is needless. But if
> > > a task sleeps longer than it, there can be two outcomes:
> > >
> > > Consider a task sleeps whatever m milliseconds, then n = (u32)m.
> >
> > Maybe it crystal clear that you mean: For any m > UINT_MAX?
>
> En, literally whatever.
Your second case below is not correct for m < UINT_MAX. You are only
talking about cases where n != m which is m > UINT_MAX.
>
> > > 1. If n >= 32*64, then the task's sched avgs will be surely decayed
> > > to 0. In this case, it really doesn't matter that the 32bit is not
> > > big enough to hold m. In other words, a task sleeps 2 secs or sleeps
> > > 50 days are the same from sched average point of view.
> > >
> > > 2. If m < 32*64, the chance to be here is very very low, but if so,
> >
> > Should that be: n < 32*64 ?
> >
> > Talking about 32*64, I don't get why we don't use LOAD_AVG_MAX_N. I had
> > a patch ready to post for that:
> >
> > >From 5055e5f82c8d207880035c2ec4ecf1ac1e7f9e91 Mon Sep 17 00:00:00 2001
> > From: Morten Rasmussen <morten.rasmussen@xxxxxxx>
> > Date: Mon, 11 Apr 2016 15:41:37 +0100
> > Subject: [PATCH] sched/fair: Fix decay to zero period in decay_load()
> >
> > In __compute_runnable_contrib() we are happy with returning LOAD_AVG_MAX
> > when the update period n >= LOAD_AVG_MAX_N (=345), so we should be happy
> > with returning zero for n >= LOAD_AVG_MAX_N when decaying in
> > decay_load() as well instead of only returning zero for n >
> > LOAD_AVG_PERIOD * 63 (=2016).
>
> So basically, you want to add another rule in addition to the exponential
> decay rule.
No, not at all. I want to make the 'rules' symmetrical for accumulation
and decay exactly like the patch does.
>
> > > the task's sched avgs MAY NOT be decayed to 0, depending on how
> > > big its sums are, and the chance to 0 is still good as load_sum
> > > is way less than ~0ULL and util_sum way less than ~0U.
> >
> > I don't get the last bit about load_sum < ~0ULL and util_sum < ~0U.
> > Whether you get to zero depends on the sums (as you say) and the actual
> > value of 'n'. It is true that you might get to zero even if n <
> > LOAD_AVG_MAX_N if the sums are small.
>
> Frankly, util Ben brought it up, I didn't think a task sleeping so long
> is even possible. But I do admit it may happen.
>
> However, I will say this. A task sleeping so long is already very rare,
> and among all those long sleep cases, the chance that after waking up the
> avgs will not be decayed to zero is much less than 0.5 in a million
> (=32*64/2^32=1/2^21), assuming the sleeping time is uniformly distributed.
You can't just ignore cases because they have a low probability. Going
by that logic we could remove a lot of synchronization overhead in the
kernel.
My concern is whether we introduce any assumptions that might hit us
later when everybody has forgotten about them. This one would be
extremely hard to debug later.
> > > Nevertheless, what really maters is what happens in the worst-case
> > > scenario, which is when (u32)m = 0? So in that case, it would be like
> > > after so long a sleep, we treat the task as it never slept, and has the
> > > same sched averages as before it slept. That is actually not bad or
> > > nothing to worry about, and think of it as the same as how we treat
> > > a new born task.
> >
> > There is subtle but important difference between not decaying a task
> > correctly and adding new task: The sleeping task is already accounted
> > for in the cfs_rq.avg.{load,util}_sum. The sleeping task's contribution
> > to cfs_rq.avg has been decayed correctly in the mean time which means
> > that by not guaranteeing a decay of the se.avg at wake-up you introduce
> > a discrepancy between the task's owen view of its contribution (se.avg)
> > and the cfs_rq view (cfs_rq.avg). That might lead to trouble if the task
> > is migrated at wake-up as we remove se.avg amount of contribution from
> > the previous cpu's cfs_rq.avg. In other words, you remove too much from
> > the cfs_rq.avg.
> >
> > The discrepancy will decay and disappear after LOAD_AVG_MAX_N ms, which
> > might be acceptable, but it is not a totally harmless change IMHO.
>
> That is just an explanation, :) nevermind, or I really don't think that is
> a deal.
I don't really follow. I analysed the implications of the overflow that
you are introducing. In my opinion this is what you should have done
before proposing this patch. I think it is essential to understand what
assumptions we make and introducing new ones should be carefully
considered. I think it is a big 'deal' if you are not more careful when you
are submitting patches and just ignore feedback. We spent a lot of time
reviewing them.
Morten