Re: high latency in CFS when autogroup is disabled, bug or not?
From: Li RongQing
Date: Mon Oct 23 2017 - 21:48:59 EST
I have found the root cause: the delayed process had run for a very long
time before, so its vruntime became very large and it must wait behind
all the other processes; that is why it is delayed.

The reason is as follows:
1. there is a low-weight process A (nice=19, weight=15)
2. there is a process B which is doing IO
3. there is a process C whose nice value is 0

The running sequence looks like this:
Step 1: while C is running, B wakes up and preempts C; B starts to run.
Step 2: when B sleeps, the vruntime of A is the minimum, but A cannot
preempt C (cfs_rq->last); so C resumes running, and Step 1 repeats.

A cannot preempt B or C as long as the vruntime of B and C is not ahead
of A's vruntime by more than 4ms*1024/15 ~= 273ms
[sched_wakeup_granularity_ns * 1024 / weight of nice 19];
but while that condition holds, all the other processes (about 500
processes) are blocked from running for up to 4ms*1024/15.

So I think we should consider more when doing preemption. The relevant
code is pick_next_entity() in kernel/sched/fair.c:
/*
 * Pick the next process, keeping these things in mind, in this order:
 * 1) keep things fair between processes/task groups
 * 2) pick the "next" process, since someone really wants that to run
 * 3) pick the "last" process, for cache locality
 * 4) do not run the "skip" process, if something else is available
 */
static struct sched_entity *pick_next_entity(struct cfs_rq *cfs_rq)
{
	struct sched_entity *se = __pick_first_entity(cfs_rq);
	struct sched_entity *left = se;

	/*
	 * Avoid running the skip buddy, if running something else can
	 * be done without getting too unfair.
	 */
	if (cfs_rq->skip == se) {
		struct sched_entity *second = __pick_next_entity(se);

		if (second && wakeup_preempt_entity(second, left) < 1)
			se = second;
	}

	/*
	 * Prefer last buddy, try to return the CPU to a preempted task.
	 */
	if (cfs_rq->last && wakeup_preempt_entity(cfs_rq->last, left) < 1)
		se = cfs_rq->last;

	/*
	 * Someone really wants this to run. If it's not unfair, run it.
	 */
	if (cfs_rq->next && wakeup_preempt_entity(cfs_rq->next, left) < 1)
		se = cfs_rq->next;

	clear_buddies(cfs_rq, se);

	return se;
}
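
For reference, the wakeup_preempt_entity() check above scales the wakeup
granularity by the weight of its second argument (left, i.e. A in the
scenario above), so a tiny weight produces a huge granularity; that is
where the 4ms*1024/15 bound comes from. Slightly simplified from
kernel/sched/fair.c (calc_delta_fair() multiplies by
NICE_0_LOAD / se->load.weight):

static unsigned long
wakeup_gran(struct sched_entity *curr, struct sched_entity *se)
{
	unsigned long gran = sysctl_sched_wakeup_granularity;

	/*
	 * Scale the granularity by NICE_0_LOAD / se->load.weight:
	 * for a nice 19 se (weight 15), 4ms becomes 4ms*1024/15 ~= 273ms.
	 */
	return calc_delta_fair(gran, se);
}

/*
 * Returns 1 if 'se' should preempt 'curr', i.e. curr's vruntime is
 * ahead of se's by more than the (scaled) wakeup granularity.
 */
static int
wakeup_preempt_entity(struct sched_entity *curr, struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;

	gran = wakeup_gran(curr, se);
	if (vdiff > gran)
		return 1;

	return 0;
}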
On Fri, Sep 1, 2017 at 11:28 AM, Li RongQing <roy.qing.li@xxxxxxxxx> wrote:
> I have a program to test CFS latency/fairness under high load;
> the test shows that CFS has high latency when autogroup is disabled.
>
> the program is in
> https://gist.github.com/anonymous/af21ae289cfa5c6310eeac73b7a478ff.
> the program creates many threads, and every thread does something like:
>
> do {
>     now = gettime();
>     loop to spend 10 milliseconds;
>     delta = gettime() - now;
>
>     if (delta > max_latency) {
>         max_latency = delta;
>         printf(max_latency);
>     }
> } while (1);
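
(For reference, a minimal runnable version of that per-thread loop might
look like the sketch below; this is my reconstruction, not the exact
gist code, and it assumes CLOCK_MONOTONIC plus a spin count calibrated
to roughly 10ms of CPU time:)

#include <pthread.h>
#include <stdio.h>
#include <time.h>

/* assumption: calibrate so the spin takes ~10ms of CPU time */
#define SPIN_ITERS 10000000L

static long long now_ms(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000;
}

static void *worker(void *arg)
{
	long long max_latency = 0;

	(void)arg;
	for (;;) {
		long long start = now_ms();

		/* ~10ms of pure computation */
		for (volatile long i = 0; i < SPIN_ITERS; i++)
			;

		long long delta = now_ms() - start;

		if (delta > max_latency) {
			max_latency = delta;
			printf("delay %lld ms\n", max_latency);
		}
	}
	return NULL;
}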
>
>
> I ran this program on a machine with 48 processors; the program
> created 2000 threads, so every processor has about 40 threads.
> Every thread should then finish its 10ms computation within about
> 400ms = 40 threads * 10ms, but the test results show that some
> threads take a very long time:
>
> # ./pipe_test -n 2000
> eat cpu with 2000 threads, delay time 1000 ms
> 648 delay 1037 ms
> 48923 delay 1038 ms
> 1810 delay 1349 ms
> 49142 delay 1482 ms
> 1728 delay 1574 ms
> 1518 delay 1713 ms
> 808 delay 1714 ms
> 1702 delay 1733 ms
> 49004 delay 1783 ms
> 48821 delay 1785 ms
> 451 delay 1865 ms
> 990 delay 1910 ms
> 1626 delay 1957 ms
> 537 delay 2420 ms
> 2021 delay 3242 ms
> 763 delay 3488 ms
> 622 delay 3614 ms
> 1887 delay 4474 ms
> 1267 delay 4924 ms
> 721 delay 5406 ms
> 1810 delay 5873 ms
> 1470 delay 5969 ms
> 1702 delay 7834 ms
> 48821 delay 8063 ms
> ^C
>
> The kernel version is 4.9.23 and autogroup is disabled; if autogroup
> is enabled, this issue does not occur.
>
>
> ftrace result:
>
> <...>-48821 [003] d... 64151.635476: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=baas_agent next_pid=17118 next_prio=120
> <...>-48821 [003] d... 64151.636531: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=baas_agent next_pid=17118 next_prio=120
> <...>-48821 [003] d... 64151.639570: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=baas_agent next_pid=17275 next_prio=120
> <...>-48821 [003] d... 64159.703051: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R+ ==>
> next_comm=kworker/u97:0 next_pid=36929 next_prio=120
> <...>-48821 [003] d... 64159.703091: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=kworker/u97:0 next_pid=36929 next_prio=120
> <...>-48821 [003] d... 64159.703978: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=baas_agent next_pid=17275 next_prio=120
> <...>-48821 [003] d... 64159.705542: sched_switch:
> prev_comm=a.out prev_pid=48821 prev_prio=120 prev_state=R ==>
> next_comm=baas_agent next_pid=16879 next_prio=120
>
>
> # grep sched_migrate_task trace|grep 48821
> <...>-688 [019] d... 64161.828654: sched_migrate_task:
> comm=a.out pid=48821 prio=120 orig_cpu=3 dest_cpu=19
> <...>-48821 [019] d... 64161.828862: sched_migrate_task:
> comm=a.out pid=49053 prio=120 orig_cpu=43 dest_cpu=19
> #