Re: [PATCH v7 2/3] tick/sched: Ensure quiet_vmstat() is called when the idle tick was stopped too

From: Aaron Tomlin
Date: Mon Sep 12 2022 - 10:38:34 EST


On Fri 2022-09-09 16:35 -0300, Marcelo Tosatti wrote:
> For the scenario where we re-enter idle without calling quiet_vmstat:
>
>
> 0) CPU-0: vmstat_shepherd notices it's necessary to queue vmstat work
> to the remote CPU, queues a deferrable timer into the timer wheel, and
> calls trigger_dyntick_cpu (target_cpu == cpu-1).
>
> 1) CPU-1: Stop the tick (get_next_timer_interrupt will not take
> deferrable timers into account), call quiet_vmstat, which keeps the
> vmstat work (the vmstat_update function) queued.
> 2) CPU-1: Idle
> 3) CPU-1: Idle exit
> 4) CPU-1: Run thread on CPU, some activity marks vmstat dirty
> 5) CPU-1: Idle
> 6) CPU-1: Goto 3
>
> At 5, since the tick is already stopped, the deferrable
> timer for the delayed work item will not execute,
> and vmstat_shepherd will keep skipping the CPU, since
> delayed_work_pending() remains true:
>
> static void vmstat_shepherd(struct work_struct *w)
> {
> 	int cpu;
>
> 	cpus_read_lock();
> 	/* Check processors whose vmstat worker threads have been disabled */
> 	for_each_online_cpu(cpu) {
> 		struct delayed_work *dw = &per_cpu(vmstat_work, cpu);
>
> 		if (!delayed_work_pending(dw) && need_update(cpu))
> 			queue_delayed_work_on(cpu, mm_percpu_wq, dw, 0);
>
> 		cond_resched();
> 	}
> 	cpus_read_unlock();
>
> 	schedule_delayed_work(&shepherd,
> 		round_jiffies_relative(sysctl_stat_interval));
> }
>
> As far as I can tell...

Hi Marcelo,

Yes, I agree with the scenario above; a minimal model of the skipped
re-queue is sketched below.
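
To make the skip concrete, here is a minimal user-space C model of the
check in vmstat_shepherd() (the struct, fields and single-CPU setup are
illustrative only, not kernel code): once the deferrable timer behind
the delayed work cannot fire because the tick is stopped,
delayed_work_pending() stays true, so the "!pending && need_update()"
test never re-queues the work even though the counters are dirty.

/* Minimal user-space model of the vmstat_shepherd skip (illustrative only). */
#include <stdbool.h>
#include <stdio.h>

struct cpu_state {
	bool work_pending;	/* models delayed_work_pending(&vmstat_work) */
	bool tick_stopped;	/* models ts->tick_stopped on the target CPU */
	int  stat_diff;		/* models a dirty per-CPU vmstat counter     */
};

/* Models need_update(): is there anything left to fold? */
static bool need_update(const struct cpu_state *c)
{
	return c->stat_diff != 0;
}

/* Models one vmstat_shepherd pass over a single CPU. */
static void shepherd_pass(struct cpu_state *c)
{
	if (!c->work_pending && need_update(c)) {
		c->work_pending = true;
		printf("shepherd: queued vmstat work\n");
	} else {
		printf("shepherd: skipped (pending=%d, dirty=%d)\n",
		       c->work_pending, need_update(c));
	}
}

/* Models the deferrable timer: it only fires while the tick is running. */
static void timer_tick(struct cpu_state *c)
{
	if (c->work_pending && !c->tick_stopped) {
		c->stat_diff = 0;		/* vmstat_update() folds the diff */
		c->work_pending = false;
		printf("timer: vmstat work ran, counters folded\n");
	}
}

int main(void)
{
	struct cpu_state cpu1 = {
		.work_pending = true,	/* step 0: shepherd queued the work   */
		.tick_stopped = true,	/* step 1: CPU-1 stopped the tick     */
		.stat_diff    = 1,	/* step 4: activity dirtied the stats */
	};

	for (int i = 0; i < 3; i++) {	/* steps 5/6: idle re-entry loop */
		timer_tick(&cpu1);	/* deferrable timer never fires  */
		shepherd_pass(&cpu1);	/* always skips: work is pending */
	}
	return 0;
}

Every pass prints "skipped", so the dirty counters are never folded.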

> > > Consider the following theoretical scenario:
> > >
> > > 1. CPU Y migrated running task A to CPU X that was
> > > in an idle state i.e. waiting for an IRQ - not
> > > polling; marked the current task on CPU X to
> > > need/or require a reschedule i.e., set
> > > TIF_NEED_RESCHED and invoked a reschedule IPI to
> > > CPU X (see sched_move_task())
> >
> > CPU Y is nohz_full right?
> >
> > >
> > > 2. CPU X acknowledged the reschedule IPI from CPU Y;
> > > generic idle loop code noticed the
> > > TIF_NEED_RESCHED flag against the idle task and
> > > attempts to exit of the loop and calls the main
> > > scheduler function i.e. __schedule().
> > >
> > > Since the idle tick was previously stopped no
> > > scheduling-clock tick would occur.
> > > So, no deferred timers would be handled
> > >
> > > 3. Post transition to kernel execution Task A
> > > running on CPU Y, indirectly released a few pages
> > > (e.g. see __free_one_page()); CPU Y's
> > > 'vm_stat_diff[NR_FREE_PAGES]' was updated and zone
> > > specific 'vm_stat[]' update was deferred as per the
> > > CPU-specific stat threshold
> > >
> > > 4. Task A does invoke exit(2) and the kernel does
> > > remove the task from the run-queue; the idle task
> > > was selected to execute next since there are no
> > > other runnable tasks assigned to the given CPU
> > > (see pick_next_task() and pick_next_task_idle())
> >
> > This happens on CPU X, right?
> >
> > >
> > > 5. On return to the idle loop since the idle tick
> > > was already stopped and can remain so (see [1]
> > > below) e.g. no pending soft IRQs, no attempt is
> > > made to zero and fold CPU Y's vmstat counters
> > > since reprogramming of the scheduling-clock tick
> > > is not required/or needed (see [2])
> >
> > And now back to CPU Y, confused...
>
> Aaron, can you explain the diagram above?

Hi Frederic,

Sorry about that. How about the following:

- Note: CPU X is part of 'tick_nohz_full_mask'

1. CPU Y migrated running task A to CPU X, which
   was in an idle state, i.e. waiting for an IRQ;
   it marked the current task on CPU X as needing
   a reschedule, i.e. set TIF_NEED_RESCHED, and
   sent a reschedule IPI to CPU X
   (see sched_move_task())

2. CPU X acknowledged the reschedule IPI. The
   generic idle loop code noticed the
   TIF_NEED_RESCHED flag against the idle task,
   exited the loop and called the main scheduler
   function, i.e. __schedule().

   Since the idle tick was previously stopped, no
   scheduling-clock tick occurs, so no deferred
   timers are handled

3. After the transition to kernel execution, task A
   running on CPU X indirectly released a few pages
   (e.g. see __free_one_page()); CPU X's
   'vm_stat_diff[NR_FREE_PAGES]' was updated and the
   zone-specific 'vm_stat[]' update was deferred as
   per the CPU-specific stat threshold

4. Task A invoked exit(2) and the kernel removed it
   from the run-queue; the idle task was selected to
   run next since there were no other runnable tasks
   assigned to the given CPU
   (see pick_next_task() and pick_next_task_idle())

5. On return to the idle loop, since the idle tick
   was already stopped and can remain so (see [1]
   below), e.g. no pending soft IRQs, no attempt is
   made to zero and fold CPU X's vmstat counters,
   since reprogramming of the scheduling-clock tick
   is not required (see [2]); a simplified model of
   this re-entry path is sketched below
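
To tie steps 1-5 together, below is a small user-space C sketch of the
idle-entry decision as I understand it (the struct, helper names and
values are illustrative, not the kernel implementation): the per-CPU
counters are only folded on the transition that actually stops the
tick, so a CPU that re-enters idle with the tick already stopped never
folds the counters dirtied in step 3.

/* Illustrative user-space model of idle re-entry; not kernel code. */
#include <stdbool.h>
#include <stdio.h>

struct cpu_model {
	bool tick_stopped;	/* models ts->tick_stopped             */
	int  vm_stat_diff;	/* models a dirty per-CPU vmstat delta */
};

/* Models the effect of quiet_vmstat(): fold the per-CPU delta, if any. */
static void fold_vmstat(struct cpu_model *c)
{
	if (c->vm_stat_diff) {
		printf("folded %d page(s) into the zone counters\n",
		       c->vm_stat_diff);
		c->vm_stat_diff = 0;
	}
}

/*
 * Models the current idle-entry behaviour: the fold is only attempted
 * on the transition that stops the tick, not when idle is re-entered
 * with the tick already stopped.
 */
static void enter_idle(struct cpu_model *c)
{
	if (!c->tick_stopped) {
		c->tick_stopped = true;
		fold_vmstat(c);
	} else {
		printf("tick already stopped: diff=%d stays dirty\n",
		       c->vm_stat_diff);
	}
}

int main(void)
{
	struct cpu_model cpu_x = { .tick_stopped = false, .vm_stat_diff = 0 };

	enter_idle(&cpu_x);		/* first idle entry: tick is stopped   */
	cpu_x.vm_stat_diff += 3;	/* steps 1-4: task A runs, frees pages */
	enter_idle(&cpu_x);		/* step 5: re-entry, no fold occurs    */

	return 0;
}

The second enter_idle() call is the case the patch aims to cover:
noticing the dirty counters even though the tick is already stopped.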



Kind regards,

--
Aaron Tomlin