Re: [PATCH] psi: fix aggregation idle shut-off
From: Suren Baghdasaryan
Date: Mon Jan 28 2019 - 18:11:10 EST
On Mon, Jan 28, 2019 at 3:06 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> > psi has provisions to shut off the periodic aggregation worker when
> > there is a period of no task activity - and thus no data that needs
> > aggregating. However, while developing psi monitoring, Suren noticed
> > that the aggregation clock currently won't stay shut off for good.
> >
> > Debugging this revealed a flaw in the idle design: an aggregation run
> > will see no task activity and decide to go to sleep; shortly
> > thereafter, the kworker thread that executed the aggregation will go
> > idle and cause a scheduling change, during which the psi callback will
> > kick the !pending worker again. This will ping-pong forever, and is
> > equivalent to having no shut-off logic at all (but with more code!)
> >
> > Fix this by exempting aggregation workers from psi's clock waking
> > logic when the state change is them going to sleep. To do this, tag
> > workers with the last work function they executed, and if in psi we
> > see a worker going to sleep after aggregating psi data, we will not
> > reschedule the aggregation work item.
> >
> > What if the worker is also executing other items before or after?
> >
> > Any psi state times that were incurred by work items preceding the
> > aggregation work will have been collected from the per-cpu buckets
> > during the aggregation itself. If there are work items following the
> > aggregation work, the worker's last_func tag will be overwritten and
> > the aggregator will be kept alive to process this genuine new activity.
> >
> > If the aggregation work is the last thing the worker does, and we
> > decide to go idle, the brief period of non-idle time incurred between
> > the aggregation run and the kworker's dequeue will be stranded in the
> > per-cpu buckets until the clock is woken by later activity. But that
> > should not be a problem. The buckets can hold 4s worth of time, and
> > future activity will wake the clock with a 2s delay, giving us 2s
> > worth of data we can leave behind when disabling aggregation. If it
> > takes a worker more than two seconds to go idle after it finishes its
> > last work item, we likely have bigger problems in the system, and
> > won't notice one sample that was averaged with a bogus per-CPU weight.
> >
> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -480,9 +481,6 @@ static void psi_group_change(struct psi_group *group, int cpu,
> > groupc->tasks[t]++;
> >
> > write_seqcount_end(&groupc->seq);
> > -
> > - if (!delayed_work_pending(&group->clock_work))
> > - schedule_delayed_work(&group->clock_work, PSI_FREQ);
> > }
> >
> > static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
>
> This breaks Suren's "psi: introduce psi monitor":
>
> --- kernel/sched/psi.c~psi-introduce-psi-monitor
> +++ kernel/sched/psi.c
> @@ -752,8 +1012,25 @@ static void psi_group_change(struct psi_
>
> write_seqcount_end(&groupc->seq);
>
> - if (!delayed_work_pending(&group->clock_work))
> - schedule_delayed_work(&group->clock_work, PSI_FREQ);
> + /*
> + * Polling flag resets to 0 at the max rate of once per update window
> + * (at least 500ms interval). smp_wmb is required after group->polling
> + * 0-to-1 transition to order groupc->times and group->polling writes
> + * because stall detection logic in the slowpath relies on groupc->times
> + * changing before group->polling. Explicit smp_wmb is missing because
> + * cmpxchg() implies smp_mb.
> + */
> + if ((state_mask & group->trigger_mask) &&
> + atomic_cmpxchg(&group->polling, 0, 1) == 0) {
> + /*
> + * Start polling immediately even if the work is already
> + * scheduled
> + */
> + mod_delayed_work(system_wq, &group->clock_work, 1);
> + } else {
> + if (!delayed_work_pending(&group->clock_work))
> + schedule_delayed_work(&group->clock_work, PSI_FREQ);
> + }
> }
>
> and I'm too lazy to go in and figure out how to fix it.
>
> If we're sure about "psi: fix aggregation idle shut-off" (and I am not)
> then can I ask for a redo of "psi: introduce psi monitor"?
Yes, I'll redo it after applying "psi: fix aggregation idle shut-off".
I have to repost it anyway to address the comment from Johannes.