Re: [PATCH] fix granularity of task_u/stime(), v2

From: Hidetoshi Seto
Date: Thu Nov 19 2009 - 21:01:03 EST


Stanislaw Gruszka wrote:
> On Tue, Nov 17, 2009 at 02:24:48PM +0100, Peter Zijlstra wrote:
>>> Seems issue reported then was exactly the same as reported now by
>>> you. Looks like commit 49048622eae698e5c4ae61f7e71200f265ccc529 just
>>> make probability of bug smaller and you did not note it until now.
>>>
>>> Could you please test this patch, if it solve all utime decrease
>>> problems for you:
>>>
>>> http://patchwork.kernel.org/patch/59795/
>>>
>>> If you confirm it work, I think we should apply it. Otherwise
>>> we need to go to propagate task_{u,s}time everywhere, which is not
>>> (my) preferred solution.
>> That patch will create another issue, it will allow a process to hide
>> from top by arranging to never run when the tick hits.
>

Yes. Nowadays, with many threads running on high-speed hardware,
such a process can appear almost anywhere, far more easily than before.

E.g. assume that there are 2 tasks:

Task A: interrupted by the timer only a few times
(utime, stime, se.sum_exec_runtime) = (50, 50, 1000000000)
=> total runtime is 1 sec, but utime + stime is 100 ms

Task B: interrupted by the timer many times
(utime, stime, se.sum_exec_runtime) = (50, 50, 10000000)
=> total runtime is 10 ms, but utime + stime is 100 ms

You can see that task_[su]time() works well for these tasks.
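
To make the arithmetic concrete, here is a minimal user-space sketch of the scaling, not the kernel code itself. The struct, the function name and the use of plain milliseconds are assumptions made for illustration only.

```c
#include <stdint.h>

/* Hypothetical user-space model; all values are in milliseconds. */
struct task_sample {
	uint64_t utime;   /* tick-sampled user time */
	uint64_t stime;   /* tick-sampled system time */
	uint64_t runtime; /* precise runtime from the scheduler */
};

/* Scale the precise runtime by the tick-sampled utime:stime ratio. */
static uint64_t scaled_utime(const struct task_sample *t)
{
	uint64_t total = t->utime + t->stime;
	uint64_t temp = t->runtime;

	if (total) {
		temp *= t->utime;
		temp /= total;
	}
	return temp;
}

/* Task A and Task B from the example above. */
static const struct task_sample task_a = { 50, 50, 1000 };
static const struct task_sample task_b = { 50, 50, 10 };
```

With these inputs scaled_utime() gives 500 ms for Task A and 5 ms for Task B, i.e. half of each task's real runtime, even though both tasks show the same tick-sampled utime of 50 ms.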

> What about that?
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 1f8d028..9db1cbc 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5194,7 +5194,7 @@ cputime_t task_utime(struct task_struct *p)
> }
> utime = (cputime_t)temp;
>
> - p->prev_utime = max(p->prev_utime, utime);
> + p->prev_utime = max(p->prev_utime, max(p->utime, utime));
> return p->prev_utime;
> }

I think this makes things worse.

without this patch:
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 5 ms (= accurate)
with this patch:
Task A prev_utime: 500 ms (= accurate)
Task B prev_utime: 50 ms (= not accurate)

Note that task_stime() calculates prev_stime using this prev_utime:

without this patch:
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 5 ms (= accurate)
with this patch:
Task A prev_stime: 500 ms (= accurate)
Task B prev_stime: 0 ms (= not accurate)
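
The figures above can be reproduced with a small user-space model. This is only a sketch: the struct and function names are hypothetical, values are plain milliseconds, and model_stime() assumes that task_stime() takes the precise runtime minus task_utime() and only updates prev_stime when the result is non-negative.

```c
#include <stdint.h>

/* Hypothetical user-space model; all values are in milliseconds. */
struct task_model {
	int64_t utime, stime;            /* tick-sampled values */
	int64_t runtime;                 /* precise scheduler runtime */
	int64_t prev_utime, prev_stime;  /* monotonic values handed out */
};

static int64_t max64(int64_t a, int64_t b)
{
	return a > b ? a : b;
}

/* task_utime() model; 'patched' selects the proposed max(p->utime, utime). */
static int64_t model_utime(struct task_model *p, int patched)
{
	int64_t total = p->utime + p->stime;
	int64_t utime = p->runtime;

	if (total)
		utime = utime * p->utime / total;
	if (patched)
		utime = max64(p->utime, utime);
	p->prev_utime = max64(p->prev_utime, utime);
	return p->prev_utime;
}

/* task_stime() model: whatever is left of the runtime after utime. */
static int64_t model_stime(struct task_model *p, int patched)
{
	int64_t stime = p->runtime - model_utime(p, patched);

	if (stime >= 0)
		p->prev_stime = max64(p->prev_stime, stime);
	return p->prev_stime;
}
```

For Task B (50, 50, 10 ms) the patched variant reports prev_utime = 50 ms, more than the task ever ran, so the remainder is negative and prev_stime stays at 0 ms.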

>
> diff --git a/kernel/sys.c b/kernel/sys.c
> index ce17760..8be5b75 100644
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -914,8 +914,8 @@ void do_sys_times(struct tms *tms)
> struct task_cputime cputime;
> cputime_t cutime, cstime;
>
> - thread_group_cputime(current, &cputime);
> spin_lock_irq(&current->sighand->siglock);
> + thread_group_cputime(current, &cputime);
> cutime = current->signal->cutime;
> cstime = current->signal->cstime;
> spin_unlock_irq(&current->sighand->siglock);
>
> It's on top of Hidetoshi patch and fix utime decrease problem
> on my system.

What about the stime decrease problem, which can be caused by the
same logic?

According to my labeling, there are 2 unresolved problems: [1]
"thread_group_cputime() vs exit" and [2] "use of task_s/utime()".

I still believe the real fix for this problem is a combination of
the above fix for do_sys_times() (for problem [1]) and (I know it is
not preferred, but for [2]) the following:

>> diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
>> index 5c9dc22..e065b8a 100644
>> --- a/kernel/posix-cpu-timers.c
>> +++ b/kernel/posix-cpu-timers.c
>> @@ -248,8 +248,8 @@ void thread_group_cputime(struct task_struct *tsk, struct task_cputime *times)
>>
>> t = tsk;
>> do {
>> - times->utime = cputime_add(times->utime, t->utime);
>> - times->stime = cputime_add(times->stime, t->stime);
>> + times->utime = cputime_add(times->utime, task_utime(t));
>> + times->stime = cputime_add(times->stime, task_stime(t));
>> times->sum_exec_runtime += t->se.sum_exec_runtime;
>>
>> t = next_thread(t);

Consider this diff, assuming task C is in the same thread group as tasks A and B:

sys_times() on C while A and B are alive returns:
(utime, stime)
= task_[su]time(C) + ([su]time(A)+[su]time(B)+...) + in_signal(exited)
= task_[su]time(C) + ( (50,50) + (50,50) +...) + in_signal(exited)
If A exits, it increases:
(utime, stime)
= task_[su]time(C) + ([su]time(B)+...) + in_signal(exited)+task_[su]time(A)
= task_[su]time(C) + ( (50,50) +...) + in_signal(exited)+(500,500)
If instead B exits, it decreases:
(utime, stime)
= task_[su]time(C) + ([su]time(A)+...) + in_signal(exited)+task_[su]time(B)
= task_[su]time(C) + ( (50,50) +...) + in_signal(exited)+(5,5)

With this fix, sys_times() returns:
(utime, stime)
= task_[su]time(C) + (task_[su]time(A)+task_[su]time(B)+...) + in_signal(exited)
= task_[su]time(C) + ( (500,500) + (5,5) +...) + in_signal(exited)
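
The jump and the drop can be checked with a small user-space sketch. It is only an illustration: the names are hypothetical, values are plain milliseconds, only utime is modeled (stime is symmetric), and the contributions of C and of earlier exited threads are left out because they are the same in every case.

```c
#include <stdint.h>

/* Hypothetical model; values in milliseconds. */
struct thread_sample {
	uint64_t utime, stime;  /* tick-sampled */
	uint64_t runtime;       /* precise scheduler runtime */
	int exited;             /* already accumulated into the signal struct */
};

static uint64_t scaled_utime(const struct thread_sample *t)
{
	uint64_t total = t->utime + t->stime;

	return total ? t->runtime * t->utime / total : t->runtime;
}

/*
 * Group utime as sys_times() would report it. Exited threads were
 * accumulated with the scaled value; live threads use either the raw
 * tick value (fixed == 0) or the scaled value (fixed != 0).
 */
static uint64_t group_utime(const struct thread_sample *t, int n, int fixed)
{
	uint64_t sum = 0;

	for (int i = 0; i < n; i++) {
		if (t[i].exited || fixed)
			sum += scaled_utime(&t[i]);
		else
			sum += t[i].utime;
	}
	return sum;
}
```

With A = (50, 50, 1000) and B = (50, 50, 10), the unfixed sum is 100 ms while both live, 550 ms once A has exited and 55 ms once B has exited, so the reported value can go backwards. With fixed set it is 505 ms in all three cases.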

> Are we not doing something nasty here?
>
> cputime_t utime = p->utime, total = utime + p->stime;
> u64 temp;
>
> /*
> * Use CFS's precise accounting:
> */
> temp = (u64)nsecs_to_cputime(p->se.sum_exec_runtime);
>
> if (total) {
> temp *= utime;
> do_div(temp, total);
> }
> utime = (cputime_t)temp;

Not here, but doing do_div() for each thread could be called nasty.
I mean that
__task_[su]time(sum(A, B, ...))
would be better than:
sum(task_[su]time(A)+task_[su]time(B)+...)

However, that would bring another issue, because
__task_[su]time(sum(A, B, ...))
might not be equal to
__task_[su]time(sum(B, ...)) + task_[su]time(A)
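
A quick way to see the mismatch is a user-space sketch with made-up numbers. Everything here is an assumption for illustration: the names, the millisecond units, and the sample values, which differ from Task A and Task B above because two tasks with the same utime:stime ratio would hide the effect.

```c
#include <stdint.h>

/* Hypothetical model; values in milliseconds. */
struct cpu_sample {
	uint64_t utime, stime;  /* tick-sampled */
	uint64_t runtime;       /* precise scheduler runtime */
};

/* Scale the precise runtime by the tick-sampled utime:stime ratio. */
static uint64_t scaled_utime(struct cpu_sample s)
{
	uint64_t total = s.utime + s.stime;

	return total ? s.runtime * s.utime / total : s.runtime;
}

/* Sum the raw fields first, so that scaling is done once per group. */
static struct cpu_sample add_samples(struct cpu_sample a, struct cpu_sample b)
{
	struct cpu_sample r = {
		a.utime + b.utime,
		a.stime + b.stime,
		a.runtime + b.runtime,
	};
	return r;
}
```

Take x = (90, 10, 1000) and y = (10, 90, 10). Scaling the sum gives 1010 * 100 / 200 = 505 ms, while scaling each task and adding gives 900 + 1 = 901 ms. So if x exits and is accounted with its own scaled value, the group total would change from 505 to 901 even though nothing ran in between.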


Thanks,
H.Seto
