Re: [PATCH] psi: fix aggregation idle shut-off

From: Suren Baghdasaryan
Date: Mon Jan 28 2019 - 17:20:42 EST


On Mon, Jan 28, 2019 at 2:03 PM Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> > psi has provisions to shut off the periodic aggregation worker when
> > there is a period of no task activity - and thus no data that needs
> > aggregating. However, while developing psi monitoring, Suren noticed
> > that the aggregation clock currently won't stay shut off for good.
> >
> > Debugging this revealed a flaw in the idle design: an aggregation run
> > will see no task activity and decide to go to sleep; shortly
> > thereafter, the kworker thread that executed the aggregation will go
> > idle and cause a scheduling change, during which the psi callback will
> > kick the !pending worker again. This will ping-pong forever, and is
> > equivalent to having no shut-off logic at all (but with more code!)
> >
> > Fix this by exempting aggregation workers from psi's clock waking
> > logic when the state change is them going to sleep. To do this, tag
> > workers with the last work function they executed, and if in psi we
> > see a worker going to sleep after aggregating psi data, we will not
> > reschedule the aggregation work item.
> >
> > What if the worker is also executing other items before or after?
> >
> > Any psi state times that were incurred by work items preceding the
> > aggregation work will have been collected from the per-cpu buckets
> > during the aggregation itself. If there are work items following the
> > aggregation work, the worker's last_func tag will be overwritten and
> > the aggregator will be kept alive to process this genuine new activity.
> >
> > If the aggregation work is the last thing the worker does, and we
> > decide to go idle, the brief period of non-idle time incurred between
> > the aggregation run and the kworker's dequeue will be stranded in the
> > per-cpu buckets until the clock is woken by later activity. But that
> > should not be a problem. The buckets can hold 4s worth of time, and
> > future activity will wake the clock with a 2s delay, giving us 2s
> > worth of data we can leave behind when disabling aggregation. If it
> > takes a worker more than two seconds to go idle after it finishes its
> > last work item, we likely have bigger problems in the system, and
> > won't notice one sample that was averaged with a bogus per-CPU weight.
>
> Did we ever hear back from Suren about the testing?
>
> Some words here about the new wq_worker_last_func() would be
> appropriate.
>
> It's an ugly-looking thing :( Tejun, did you check this change?

Yes, sorry for the delay. I tried the patch as is and with some
tweaking of my local setup was able to see the cases when this
mechanism prevents the system from unnecessary reactivation. I
encountered some weird things in my traces (some state transitions
were missing in the trace) and wanted to investigate further, however
did not get a chance to do that yet. Overall looks like the patch is
doing what it's supposed to do. Just wanted to get the full picture
before reporting the results.

> > --- a/kernel/sched/psi.c
> > +++ b/kernel/sched/psi.c
> > @@ -124,6 +124,7 @@
> > * sampling of the aggregate task states would be.
> > */
> >
> > +#include "../workqueue_internal.h"
>
> "Only to be included by workqueue and core kernel subsystems"
>
> I'm not sure that psi qualifies. Perhaps wq_worker_last_func() should
> be declared in include/linux/workqueue.h.
>
> And perhaps implemented there as well. It's similar to
> current_wq_worker(), which is inlined and wq_worker_last_func() is
> small enough to justify inlining.
>
> > --- a/kernel/workqueue.c
> > +++ b/kernel/workqueue.c
> > @@ -909,6 +909,26 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task)
> > return to_wakeup ? to_wakeup->task : NULL;
> > }
> >
> > +/**
> > + * wq_worker_last_func - retrieve worker's last work function
> > + *
> > + * Determine the last function a worker executed. This is called from
> > + * the scheduler to get a worker's last known identity.
> > + *
> > + * CONTEXT:
> > + * spin_lock_irq(rq->lock)
> > + *
> > + * Return:
> > + * The last work function %current executed as a worker, NULL if it
> > + * hasn't executed any work yet.
> > + */
> > +work_func_t wq_worker_last_func(struct task_struct *task)
> > +{
> > + struct worker *worker = kthread_data(task);
> > +
> > + return worker->last_func;
> > +}
>
> The semantics are troublesome. What guarantees that worker->last_func
> won't change under the caller's feet? The caller should hold some lock
> (presumably worker->pool->lock) in order to stabilize the
> wq_worker_last_func() return value?
>
> Also, the comment isn't really true - this is called from PSI, which is
> hardly "the scheduler"?
>
>