Re: PSI idle-shutoff
From: Pavan Kondeti
Date: Thu Sep 15 2022 - 02:20:44 EST
On Tue, Sep 13, 2022 at 07:38:17PM +0530, Pavan Kondeti wrote:
> Hi
>
> Since psi_avgs_work()->collect_percpu_times()->get_recent_times() runs
> from a kworker thread, the PSI_NONIDLE condition is always observed, as
> there is a RUNNING task. So we would always end up re-arming the work.
>
> If the work is re-armed from psi_avgs_work() itself, the backing-off
> logic in psi_task_change() (will be moved to psi_task_switch soon) can't
> help. The work is already scheduled, so we don't do anything there.
>
> Probably I am missing something here. Can you please clarify how we
> shut off re-arming the psi avg work?
>
I have collected traces on an idle system (running android12-5.10 with minimal
user space). This is an older kernel; however, going by code inspection, the
issue remains on the latest kernel.
I have eliminated the noise created by other work items, for example
vmstat_work. This is a deferrable work, but it still gets executed because it
is queued on the same CPU on which the PSI work timer is queued. So I have
increased sysctl_stat_interval to 60 * HZ to suppress this work.
As we can see from the traces, CPU#7 comes out of idle every 2 seconds only to
execute the PSI work. The work is always re-armed from psi_avgs_work() because
it finds the PSI_NONIDLE condition. The non-idle time is essentially
non_idle_time = (work_start_now - wakeup_now) + (sleep_prev - work_end_prev)
The first term accounts for the non-idle time from when the kworker is woken up
(queued) to the start of the work item's execution. It is ~4 usec
(54.119420 - 54.119416).
The second term accounts for the tail of the previous update, from the end of
the work item until the kworker goes back to sleep. It is ~2 usec
(52.135424 - 52.135422).
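As a sanity check on the arithmetic, here is a minimal sketch that recomputes
both terms from the trace timestamps in this mail. The helper to_usec() and the
variable names are only for illustration; they are not kernel identifiers.

```python
from decimal import Decimal

def to_usec(ts: str) -> int:
    """Convert a trace timestamp given in seconds to integer microseconds."""
    return int(Decimal(ts) * 1_000_000)

# Timestamps copied from the CPU#7 trace in this mail.
wakeup_now = to_usec("54.119416")      # sched_wakeup of kworker/7:3
work_start_now = to_usec("54.119420")  # workqueue_execute_start
work_end_prev = to_usec("52.135422")   # previous workqueue_execute_end
sleep_prev = to_usec("52.135424")      # previous switch back to swapper/7

first_term = work_start_now - wakeup_now
second_term = sleep_prev - work_end_prev
non_idle_time = first_term + second_term

print(first_term, second_term, non_idle_time)
```

Decimal is used instead of float so that the microsecond differences come out
exact rather than as rounded binary fractions.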
The PSI work needs to run only when there has been some activity since the last
update, i.e. since the last time the work ran. Since we use a non-deferrable
timer, the other deferrable timers also get to run when it fires; they might
queue work or wake up other threads, creating activity which in turn causes
the PSI work to be scheduled again.
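The self-sustaining loop can be illustrated with a toy model. This is not
kernel code: ToyPsiGroup, avgs_work(), simulate() and the 6 usec figure (the
4 + 2 usec measured above) are assumptions made only to mimic the rule "if any
non-idle time was seen since the last update, re-arm".

```python
from dataclasses import dataclass

WORK_COST_US = 6  # non-idle time caused by the work itself (assumed, from the trace)

@dataclass
class ToyPsiGroup:
    nonidle_us: int = 0  # non-idle time accumulated since the last update
    rearmed: int = 0     # how many times the work re-armed itself

    def avgs_work(self, count_own_runtime: bool) -> bool:
        """Run one update; return True if the work re-arms itself."""
        if count_own_runtime:
            # The kworker is a RUNNING task, so its own wakeup and
            # execution are accounted as non-idle time.
            self.nonidle_us += WORK_COST_US
        seen_activity = self.nonidle_us > 0
        self.nonidle_us = 0  # the update consumes the sample
        if seen_activity:
            self.rearmed += 1
        return seen_activity

def simulate(periods: int, count_own_runtime: bool) -> int:
    """Count re-arms on a system that goes fully idle after one burst of activity."""
    group = ToyPsiGroup(nonidle_us=1)  # some real activity before going idle
    armed = True
    for _ in range(periods):
        if not armed:
            break
        armed = group.avgs_work(count_own_runtime)
    return group.rearmed

print(simulate(10, count_own_runtime=True))   # work keeps itself alive
print(simulate(10, count_own_runtime=False))  # work stops after the real activity
```

In this model the work re-arms in every period as long as its own runtime is
counted, and stops after a single re-arm once it is not.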
The PSI work can't simply be made a deferrable work, because it is system-level
work: if the CPU on which it is queued stays idle for a long duration while
other CPUs are active, we miss PSI updates. What we probably need is a global
deferrable timer [1], i.e. a timer that is not bound to any CPU but runs when
any CPU comes out of idle. As long as one CPU is busy, we keep running the PSI
work, but if the whole system is idle, we never wake up.
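To make the difference concrete, here is a rough sketch comparing a CPU-bound
deferrable timer with the global deferrable timer idea. It only models the
intended semantics; the function names, the tick values and the busy_at layout
are hypothetical and do not correspond to any existing kernel API.

```python
from typing import List, Optional

def cpu_bound_fire_time(expiry: int, busy_at: List[List[int]], cpu: int) -> Optional[int]:
    """First tick >= expiry at which the owning CPU is out of idle, else None."""
    candidates = [t for t in busy_at[cpu] if t >= expiry]
    return min(candidates) if candidates else None

def global_fire_time(expiry: int, busy_at: List[List[int]]) -> Optional[int]:
    """First tick >= expiry at which any CPU is out of idle, else None."""
    candidates = [t for ticks in busy_at for t in ticks if t >= expiry]
    return min(candidates) if candidates else None

# busy_at[cpu] lists the ticks at which that CPU is out of idle.
one_cpu_busy = [[], [520, 700]]  # CPU0 idle, CPU1 active
all_idle = [[], []]              # whole system idle

print(cpu_bound_fire_time(500, one_cpu_busy, cpu=0))  # None: update is missed
print(global_fire_time(500, one_cpu_busy))            # 520: piggybacks on CPU1
print(global_fire_time(500, all_idle))                # None: nobody is woken up
```

A timer that returns None here simply never fires, which is the desired
behaviour only in the fully idle case.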
<idle>-0 [007] 52.135402: cpu_idle: state=4294967295 cpu_id=7
<idle>-0 [007] 52.135415: workqueue_activate_work: work struct 0xffffffc011bd5010
<idle>-0 [007] 52.135417: sched_wakeup: comm=kworker/7:3 pid=196 prio=120 target_cpu=007
<idle>-0 [007] 52.135421: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
kworker/7:3-196 [007] 52.135421: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
kworker/7:3-196 [007] 52.135422: timer_start: timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294905814 [timeout=494] cpu=7 idx=123 flags=D|P|I
kworker/7:3-196 [007] 52.135422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
kworker/7:3-196 [007] 52.135424: sched_switch: prev_comm=kworker/7:3 prev_pid=196 prev_prio=120 prev_state=I ==> next_comm=swapper/7 next_pid=0 next_prio=120
<idle>-0 [007] 52.135428: cpu_idle: state=0 cpu_id=7
<system is idle and gets woken up after 2 seconds due to PSI work>
<idle>-0 [007] 54.119402: cpu_idle: state=4294967295 cpu_id=7
<idle>-0 [007] 54.119414: workqueue_activate_work: work struct 0xffffffc011bd5010
<idle>-0 [007] 54.119416: sched_wakeup: comm=kworker/7:3 pid=196 prio=120 target_cpu=007
<idle>-0 [007] 54.119420: sched_switch: prev_comm=swapper/7 prev_pid=0 prev_prio=120 prev_state=R ==> next_comm=kworker/7:3 next_pid=196 next_prio=120
kworker/7:3-196 [007] 54.119420: workqueue_execute_start: work struct 0xffffffc011bd5010: function psi_avgs_work
kworker/7:3-196 [007] 54.119421: timer_start: timer=0xffffffc011bd5040 function=delayed_work_timer_fn expires=4294906315 [timeout=499] cpu=7 idx=122 flags=D|P|I
kworker/7:3-196 [007] 54.119422: workqueue_execute_end: work struct 0xffffffc011bd5010: function psi_avgs_work
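For anyone who wants to check the wakeup period on their own traces, the
interval between idle exits can be extracted with a short script. This is a
sketch that assumes the trace line layout shown above; idle_exit_times() is a
made-up helper, and the sample only repeats three cpu_idle lines from this
mail.

```python
import re
from decimal import Decimal
from typing import List

TRACE = """\
<idle>-0 [007] 52.135402: cpu_idle: state=4294967295 cpu_id=7
<idle>-0 [007] 52.135428: cpu_idle: state=0 cpu_id=7
<idle>-0 [007] 54.119402: cpu_idle: state=4294967295 cpu_id=7
"""

# state=4294967295 is (u32)-1, which the cpu_idle tracepoint reports on idle exit.
IDLE_EXIT = re.compile(
    r"\[(\d+)\]\s+(\d+\.\d+): cpu_idle: state=4294967295 cpu_id=\d+"
)

def idle_exit_times(trace: str, cpu: int) -> List[Decimal]:
    """Return the timestamps (in seconds) at which the given CPU left idle."""
    return [
        Decimal(m.group(2))
        for m in IDLE_EXIT.finditer(trace)
        if int(m.group(1)) == cpu
    ]

exits = idle_exit_times(TRACE, cpu=7)
intervals = [later - earlier for earlier, later in zip(exits, exits[1:])]
print(intervals)
```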
[1]
https://lore.kernel.org/lkml/1430188744-24737-1-git-send-email-joonwoop@xxxxxxxxxxxxxx/
Thanks,
Pavan