Re: [PATCH] sched/fair: update_curr changes sum_exec_runtime to 1 when sum_exec_runtime is 0, because some kernel code uses sum_exec_runtime == 0 to test whether a task has just been forked

From: Chong Qiao
Date: Wed Aug 28 2019 - 01:02:55 EST


I found this when running a docker test script.

#-----script begin--

for((i=0; i<1000000; i++))
do
    echo "=================================== test $i start ================================"
    rm -rf /media/ram/docker/test/*
    docker run -d --name test -h test -v /media/docker/app/dataindex:/media/dataindex -v /media/ram/docker/test:/run -v /media/docker/app/test:/media/container -v /media/wave:/media/wave:ro -v /run/systemd/journal/dev-log:/dev/log --ipc=host --log-driver=journald --init arpubt:mips64el-1.2

    docker ps -s
    docker stats --no-stream

    cp /media/arp/testsvc.tar.gz /media/ram/docker/test/
    docker exec -it test app-manager install testsvc
    sleep 3

    docker ps -s
    docker stats --no-stream
    docker exec -it test monit status
    sleep 3

    docker exec -it test app-manager disable testsvc
    sleep 1

    docker exec -it test monit status
    sleep 3

    docker exec -it test app-manager enable testsvc
    sleep 1

    docker exec -it test app-manager disable testsvc
    sleep 1

    docker exec -it test app-manager enable testsvc
    sleep 1

    docker exec -it test app-manager disable testsvc
    sleep 1

    docker exec -it test app-manager enable testsvc
    sleep 1

    docker ps -s
    docker stats --no-stream
    sleep 1

    docker exec -it test app-manager uninstall testsvc
    sleep 1

    docker rm -f test

    sleep 1
    docker ps -s
    docker stats --no-stream

    echo "================================== test $i end =================================="
done

#---- script end---

Sometimes, docker exec stalls for a long time.

testsvc is a busy-loop application:

int i;

int main()
{
	while (1)
		i++;
}

Using kgdb, I see there are two tasks on the run queue; docker exec's vruntime is very large, and its value is almost the same as the cgroup's min_vruntime. Like below:

CFS0 0x98000002de50f6e0, enter cgroup 0xffffffff81349900
CFS1 0x98000002dba65700, enter cgroup 0x98000002dbc4a700
SE:0x98000002dab95618 vruntime:49388638792 14348 14344 runc:[2:INIT] ra=0xffffffff80e8e110 sp=0x98000002d2b33d40 epc=0x12006b0f0 usp=0xffeafc6810 task=0x98000002dab955b0 tinfo=0x98000002d2b30000 uregs=0x98000002d2b33eb0 stat=0x1
CFS2 0x98000002dba57700, enter cgroup 0x98000000014bf100
se:0x98000002daa92a78 vruntime:49391638792 (always run) 14266 13758 testsvc ra=0xffffffff80e8e110 sp=0x98000002d2cbfe30 epc=0x1200017f0 usp=0xffffc8bc10 task=0x98000002daa92a10 tinfo=0x98000002d2cbc000 uregs=0x98000002d2cbfeb0 stat=0x0
se:0x98000002dab5b488 vruntime:242669408647 (large) 14349 14344 runc:[2:INIT] ra=0xffffffff80e8e110 sp=0x98000002d2b37ba0 epc=0x12006b3d8 usp=0xffea7c6798 task=0x98000002dab5b420 tinfo=0x98000002d2b34000 uregs=0x98000002d2b37eb0 stat=0x0

I added check code in sched_move_task and found that a task which has run and is stopped (on_rq == 0) does not get its vruntime changed by sched_move_task, because se->sum_exec_runtime == 0.

I added debug code in update_curr():

	if (unlikely((s64)delta_exec <= 0)) {
		if (!curr->sum_exec_runtime) {
			struct task_struct *p;

			p = container_of(curr, struct task_struct, se);
			if (entity_is_task(curr) && (p->nvcsw + p->nivcsw) != 0) {
				curr->sum_exec_runtime += 1;
				dump_stack();
			}
		}
		return;
	}

dump_stack() was indeed called sometimes.

Now, after reading your reply, I understand the reason:

	unsigned long long __weak sched_clock(void)
	{
		return (unsigned long long)(jiffies - INITIAL_JIFFIES)
						* (NSEC_PER_SEC / HZ);
	}

My kernel uses the default sched_clock(), which uses jiffies as the scheduler clock.

I should use another high-resolution scheduler clock; then update_curr()'s delta_exec will not be 0.

So thanks a lot, but I still think the curr->sum_exec_runtime check is not accurate enough.


On 2019/8/27 10:36 PM, Peter Zijlstra wrote:
> On Mon, Aug 26, 2019 at 07:46:50PM +0800, QiaoChong wrote:
>> From: Chong Qiao <qiaochong@xxxxxxxxxxx>
>>
>> Such as:
>> cpu_cgroup_attach>
>> sched_move_task>
>> task_change_group_fair>
>> task_move_group_fair>
>> detach_task_cfs_rq>
>> vruntime_normalized>
>>
>> /*
>>  * When !on_rq, vruntime of the task has usually NOT been normalized.
>>  * But there are some cases where it has already been normalized:
>>  *
>>  * - A forked child which is waiting for being woken up by
>>  *   wake_up_new_task().
>>  * - A task which has been woken up by try_to_wake_up() and
>>  *   waiting for actually being woken up by sched_ttwu_pending().
>>  */
>> if (!se->sum_exec_runtime ||
>>     (p->state == TASK_WAKING && p->sched_remote_wakeup))
>> 	return true;
>>
>> p->se.sum_exec_runtime being 0 does not mean the task has not run (as for a forked child which is waiting to be woken up by wake_up_new_task()).
>>
>> A task may have been scheduled multiple times while p->se.sum_exec_runtime is still 0, because delta_exec may be 0 in update_curr():
>>
>> static void update_curr(struct cfs_rq *cfs_rq)
>> {
>> ...
>> 	delta_exec = now - curr->exec_start;
>> 	if (unlikely((s64)delta_exec <= 0))
>> 		return;
>> ...
>> 	curr->sum_exec_runtime += delta_exec;
>> ...
>> }
>>
>> So a task may have run and be stopped (on_rq == 0) with its vruntime not normalized, but se->sum_exec_runtime == 0.
>> This makes vruntime_normalized() return true, so the vruntime is not normalized.
>> The task then keeps its old vruntime from the old cgroup, which may be much larger than the vruntimes in the new cgroup.
>> That may keep the task off the run queue for a long time after it is woken up.
>>
>> Now I change sum_exec_runtime to 1 when sum_exec_runtime == 0 in update_curr to make sum_exec_runtime not 0.
>
> Have you actually observed this? It is very hard to have a 0 delta
> between two scheduling events.


--


Chong Qiao <qiaochong@xxxxxxxxxxx>

Wed, 28 Aug 2019 11:31:43 CST