Re: [PATCH] psi: fix aggregation idle shut-off

From: Andrew Morton
Date: Mon Jan 28 2019 - 17:03:41 EST


On Wed, 16 Jan 2019 14:35:01 -0500 Johannes Weiner <hannes@xxxxxxxxxxx> wrote:

> psi has provisions to shut off the periodic aggregation worker when
> there is a period of no task activity - and thus no data that needs
> aggregating. However, while developing psi monitoring, Suren noticed
> that the aggregation clock currently won't stay shut off for good.
>
> Debugging this revealed a flaw in the idle design: an aggregation run
> will see no task activity and decide to go to sleep; shortly
> thereafter, the kworker thread that executed the aggregation will go
> idle and cause a scheduling change, during which the psi callback will
> kick the !pending worker again. This will ping-pong forever, and is
> equivalent to having no shut-off logic at all (but with more code!)
>
> Fix this by exempting aggregation workers from psi's clock waking
> logic when the state change is them going to sleep. To do this, tag
> workers with the last work function they executed, and if in psi we
> see a worker going to sleep after aggregating psi data, we will not
> reschedule the aggregation work item.
>
> What if the worker is also executing other items before or after?
>
> Any psi state times that were incurred by work items preceding the
> aggregation work will have been collected from the per-cpu buckets
> during the aggregation itself. If there are work items following the
> aggregation work, the worker's last_func tag will be overwritten and
> the aggregator will be kept alive to process this genuine new activity.
>
> If the aggregation work is the last thing the worker does, and we
> decide to go idle, the brief period of non-idle time incurred between
> the aggregation run and the kworker's dequeue will be stranded in the
> per-cpu buckets until the clock is woken by later activity. But that
> should not be a problem. The buckets can hold 4s worth of time, and
> future activity will wake the clock with a 2s delay, giving us 2s
> worth of data we can leave behind when disabling aggregation. If it
> takes a worker more than two seconds to go idle after it finishes its
> last work item, we likely have bigger problems in the system, and
> won't notice one sample that was averaged with a bogus per-CPU weight.

Did we ever hear back from Suren about the testing?

Some words here about the new wq_worker_last_func() would be
appropriate.

It's an ugly-looking thing :( Tejun, did you check this change?


> --- a/kernel/sched/psi.c
> +++ b/kernel/sched/psi.c
> @@ -124,6 +124,7 @@
> * sampling of the aggregate task states would be.
> */
>
> +#include "../workqueue_internal.h"

"Only to be included by workqueue and core kernel subsystems"

I'm not sure that psi qualifies. Perhaps wq_worker_last_func() should
be declared in include/linux/workqueue.h.

And perhaps implemented there as well. It's similar to
current_wq_worker(), which is inlined and wq_worker_last_func() is
small enough to justify inlining.

> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -909,6 +909,26 @@ struct task_struct *wq_worker_sleeping(struct task_struct *task)
> return to_wakeup ? to_wakeup->task : NULL;
> }
>
> +/**
> + * wq_worker_last_func - retrieve worker's last work function
> + *
> + * Determine the last function a worker executed. This is called from
> + * the scheduler to get a worker's last known identity.
> + *
> + * CONTEXT:
> + * spin_lock_irq(rq->lock)
> + *
> + * Return:
> + * The last work function %current executed as a worker, NULL if it
> + * hasn't executed any work yet.
> + */
> +work_func_t wq_worker_last_func(struct task_struct *task)
> +{
> + struct worker *worker = kthread_data(task);
> +
> + return worker->last_func;
> +}

The semantics are troublesome. What guarantees that worker->last_func
won't change under the caller's feet? The caller should hold some lock
(presumably worker->pool->lock) in order to stabilize the
wq_worker_last_func() return value?

Also, the comment isn't really true - this is called from PSI, which is
hardly "the scheduler"?