Re: [PATCH v7 2/3] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too

From: Marcelo Tosatti
Date: Fri Sep 09 2022 - 15:36:11 EST


On Fri, Sep 09, 2022 at 02:12:24PM +0200, Frederic Weisbecker wrote:
> On Wed, Aug 17, 2022 at 04:13:48PM -0300, Marcelo Tosatti wrote:
> > From: Aaron Tomlin <atomlin@xxxxxxxxxx>
> >
> > In the context of the idle task and an adaptive-tick mode/or a nohz_full
> > CPU, quiet_vmstat() can be called: before stopping the idle tick,
> > entering an idle state and on exit. In particular, for the latter case,
> > when the idle task is required to reschedule, the idle tick can remain
> > stopped
>
> Since quiet_vmstat() is only called when ts->tick_stopped = false, this
> can only happen if the idle loop did not enter into dynticks idle mode
> but the exiting idle task eventually stops the tick
> (tick_nohz_idle_update_tick()).
>
> This can happen for example if we enter the idle loop with a timer callback
> pending in one jiffies, then once that timer fires, which wakes up a task,
> we exit the idle loop and then tick_nohz_idle_update_tick() doesn't see any
> timer callback pending left and the tick can be stopped.
>
> Or am I missing something?

For the scenario where we re-enter idle without calling quiet_vmstat:


CPU-0 CPU-1

0) vmstat_shepherd notices its necessary to queue vmstat work
to remote CPU, queues deferrable timer into timer wheel, and calls
trigger_dyntick_cpu (target_cpu == cpu-1).

1) Stop the tick (get_next_timer_interrupt will not take deferrable
timers into account), calls quiet_vmstat, which keeps the vmstat work
(vmstat_update function) queued.
2) Idle
3) Idle exit
4) Run thread on CPU, some activity marks vmstat dirty
5) Idle
6) Goto 3

At 5, since the tick is already stopped, the deferrable
timer for the delayed work item will not execute,
and vmstat_shepherd will consider

static void vmstat_shepherd(struct work_struct *w)
{
int cpu;

cpus_read_lock();
/* Check processors whose vmstat worker threads have been disabled */
for_each_online_cpu(cpu) {
struct delayed_work *dw = &per_cpu(vmstat_work, cpu);

if (!delayed_work_pending(dw) && need_update(cpu))
queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);

cond_resched();
}
cpus_read_unlock();

schedule_delayed_work(&shepherd,
round_jiffies_relative(sysctl_stat_interval));
}

As far as i can tell...

> > and the timer expiration time endless i.e., KTIME_MAX. Now,
> > indeed before a nohz_full CPU enters an idle state, CPU-specific vmstat
> > counters should be processed to ensure the respective values have been
> > reset and folded into the zone specific 'vm_stat[]'. That being said, it
> > can only occur when: the idle tick was previously stopped, and
> > reprogramming of the timer is not required.
> >
> > A customer provided some evidence which indicates that the idle tick was
> > stopped; albeit, CPU-specific vmstat counters still remained populated.
> > Thus one can only assume quiet_vmstat() was not invoked on return to the
> > idle loop.
> >
> > If I understand correctly, I suspect this divergence might erroneously
> > prevent a reclaim attempt by kswapd. If the number of zone specific free
> > pages are below their per-cpu drift value then
> > zone_page_state_snapshot() is used to compute a more accurate view of
> > the aforementioned statistic. Thus any task blocked on the NUMA node
> > specific pfmemalloc_wait queue will be unable to make significant
> > progress via direct reclaim unless it is killed after being woken up by
> > kswapd (see throttle_direct_reclaim()).
> >
> > Consider the following theoretical scenario:
> >
> > 1. CPU Y migrated running task A to CPU X that was
> > in an idle state i.e. waiting for an IRQ - not
> > polling; marked the current task on CPU X to
> > need/or require a reschedule i.e., set
> > TIF_NEED_RESCHED and invoked a reschedule IPI to
> > CPU X (see sched_move_task())
>
> CPU Y is nohz_full right?
>
> >
> > 2. CPU X acknowledged the reschedule IPI from CPU Y;
> > generic idle loop code noticed the
> > TIF_NEED_RESCHED flag against the idle task and
> > attempts to exit of the loop and calls the main
> > scheduler function i.e. __schedule().
> >
> > Since the idle tick was previously stopped no
> > scheduling-clock tick would occur.
> > So, no deferred timers would be handled
> >
> > 3. Post transition to kernel execution Task A
> > running on CPU Y, indirectly released a few pages
> > (e.g. see __free_one_page()); CPU Y's
> > 'vm_stat_diff[NR_FREE_PAGES]' was updated and zone
> > specific 'vm_stat[]' update was deferred as per the
> > CPU-specific stat threshold
> >
> > 4. Task A does invoke exit(2) and the kernel does
> > remove the task from the run-queue; the idle task
> > was selected to execute next since there are no
> > other runnable tasks assigned to the given CPU
> > (see pick_next_task() and pick_next_task_idle())
>
> This happens on CPU X, right?
>
> >
> > 5. On return to the idle loop since the idle tick
> > was already stopped and can remain so (see [1]
> > below) e.g. no pending soft IRQs, no attempt is
> > made to zero and fold CPU Y's vmstat counters
> > since reprogramming of the scheduling-clock tick
> > is not required/or needed (see [2])
>
> And now back to CPU Y, confused...

Aaron, can you explain the diagram above?

AFAIU the problem can also be understood with the simpler
explanation above.

> [...]
> > Index: linux-2.6/kernel/time/tick-sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/time/tick-sched.c
> > +++ linux-2.6/kernel/time/tick-sched.c
> > @@ -26,6 +26,7 @@
> > #include <linux/posix-timers.h>
> > #include <linux/context_tracking.h>
> > #include <linux/mm.h>
> > +#include <linux/rcupdate.h>
> >
> > #include <asm/irq_regs.h>
> >
> > @@ -519,6 +520,20 @@ void __tick_nohz_task_switch(void)
> > }
> > }
> >
> > +void __tick_nohz_user_enter_prepare(void)
> > +{
> > + struct tick_sched *ts;
> > +
> > + if (tick_nohz_full_cpu(smp_processor_id())) {
> > + ts = this_cpu_ptr(&tick_cpu_sched);
> > +
> > + if (ts->tick_stopped)
> > + quiet_vmstat();
>
> Wasn't it supposed to be part of the quiescing in task isolation
> mode?

Not requiring application changes seems useful (so if we can drop
the need for task isolation ioctls from userspace, that is better).

> Because currently vmstat is a deferrable timer but that deferrability
> may not apply to nohz_full anymore (outside idle). And quiet_vmstat()
> doesn't cancel the timer so you'll still get the disturbance.
>
> See this patch: https://lore.kernel.org/lkml/20220725104356.GA2950296@lothringen/

Right.

But i think the nohz_full applications prefer not to be interrupted
with the vmstat_work in the first place.

> > + rcu_nocb_flush_deferred_wakeup();
> > + }
> > +}
> >
> > +EXPORT_SYMBOL_GPL(__tick_nohz_user_enter_prepare);
> > +
> > /* Get the boot-time nohz CPU list from the kernel parameters. */
> > void __init tick_nohz_full_setup(cpumask_var_t cpumask)
> > {
> > @@ -890,6 +905,9 @@ static void tick_nohz_stop_tick(struct t
> > ts->do_timer_last = 0;
> > }
> >
> > + /* Attempt to fold when the idle tick is stopped or not */
> > + quiet_vmstat();
> > +
> > /* Skip reprogram of event if its not changed */
> > if (ts->tick_stopped && (expires == ts->next_tick)) {
> > /* Sanity check: make sure clockevent is actually programmed */
>
> But that chunk looks good.
>
> Thanks.

Do you see any issue with syncing the vmstat on return to userspace as
well, if nohz_full and tick is disabled?

Note the syscall entry/exit numbers, the overhead is minimal.

> > @@ -911,7 +929,6 @@ static void tick_nohz_stop_tick(struct t
> > */
> > if (!ts->tick_stopped) {
> > calc_load_nohz_start();
> > - quiet_vmstat();
> >
> > ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
> > ts->tick_stopped = 1;
> >
> >
>
>