Re: fast path cycle muncher (vmstat: make vmstat_updater deferrable again and shut down on idle)

From: Michal Hocko
Date: Mon Jan 25 2016 - 15:13:30 EST

On Mon 25-01-16 12:02:06, Christoph Lameter wrote:
> On Mon, 25 Jan 2016, Michal Hocko wrote:
> > On Sat 23-01-16 17:21:55, Mike Galbraith wrote:
> > > Hi Christoph,
> > >
> > > While you're fixing that commit up, can you perhaps find a better home
> > > for quiet_vmstat()? It not only munches cycles when switching
> > > cross-core mightily, for -rt it injects a sleeping lock into the idle task.
> > >
> > > 12.89% [kernel] [k] refresh_cpu_vm_stats.isra.12
> > > 4.75% [kernel] [k] __schedule
> > > 4.70% [kernel] [k] mutex_unlock
> > > 3.14% [kernel] [k] __switch_to
> >
> > Hmm, I wouldn't have expected that refresh_cpu_vm_stats could have
> > such a large footprint. I guess this would be just an expensive noop
> > because we have to check all the zones*counters and do an expensive
> > this_cpu_xchg. Is the whole deferred thing worth this overhead?
> Why would the deferring cause this overhead?

I guess the profile speaks for itself, doesn't it?

> Also there is no cross core activity from quiet_vmstat(). It simply
> disables the local vmstat updates.

It doesn't go cross core but it still does nr_zones * counters atomic
operations.
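
To make the cost argument concrete, here is a rough userspace model of
the fold loop. It is only a sketch: NR_ZONES, NR_ITEMS, model_xchg() and
model_refresh() are made-up names and sizes for illustration, not the
kernel's. It shows that even when every differential is already zero,
each call still performs one exchange per zone per counter.

```c
#include <stddef.h>

#define NR_ZONES 4   /* assumed zone count, illustration only */
#define NR_ITEMS 32  /* assumed number of per-zone counters */

/* Models one CPU's pending per-zone counter differentials. */
static signed char cpu_diff[NR_ZONES][NR_ITEMS];
/* Models the global counters the differentials are folded into. */
static long global_stat[NR_ITEMS];
/* Counts how many exchanges the fold loop performed. */
static unsigned long xchg_count;

/* Stand-in for this_cpu_xchg(): store newval, return the old value. */
static int model_xchg(signed char *p, int newval)
{
    int old = *p;

    *p = (signed char)newval;
    xchg_count++;
    return old;
}

/*
 * Fold all pending differentials into the global counters.
 * Returns the number of counters that actually changed.
 */
static int model_refresh(void)
{
    int changes = 0;

    for (size_t z = 0; z < NR_ZONES; z++) {
        for (size_t i = 0; i < NR_ITEMS; i++) {
            int v = model_xchg(&cpu_diff[z][i], 0);

            if (v) {
                global_stat[i] += v;
                changes++;
            }
        }
    }
    return changes;
}
```

With nothing pending, model_refresh() returns 0 but xchg_count still
grows by NR_ZONES * NR_ITEMS, which is the "expensive noop" above.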

> > Unless there is a clear and huge win from doing the vmstat update
> > deferrable then I think a revert is more appropriate IMHO.
> It reduces the OS events that the application experiences by folding it
> into the tick events. If it's not deferrable then a timer event will be
> generated in addition to the tick. We do not want that.

Yes, this is what I have read in the changelog. But the "how much" part
is really missing. Is this even quantifiable?

> Workqueues are used in many places. If RT can sleep within workqueue
> management functions then spinlocks cannot be taken anymore and there may
> be issues with preemption.

RT can sleep in _any_ spinlock except for raw spin locks. That the !RT
kernel does not sleep there doesn't really matter much, because
cancel_delayed_work is quite a heavy function which shouldn't be called
from the idle context AFAIU. Sure, most of the time it will boil down to
del_timer, but it can hit the slowpath as well if the timer got migrated
to a different CPU and we have to race with the WQ pool management IIUC.

Maybe this overhead can be reduced by outsourcing the functionality to
vmstat_shepherd, which can check idle CPUs, cancel the timer for them,
update the differentials and put them to cpu_stat_off?
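
Something along these lines, as a userspace toy model only. Every name
here (struct cpu_model, shepherd_pass(), shep_global) is hypothetical,
and it glosses over how the differentials of a remote CPU would be
folded safely in the real kernel:

```c
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS  4  /* assumed CPU count, illustration only */
#define NR_ITEMS 8  /* assumed number of counters */

/* Hypothetical per-CPU state tracked by the model. */
struct cpu_model {
    bool idle;          /* CPU is currently idle */
    bool timer_armed;   /* its vmstat update work is pending */
    bool stat_off;      /* models membership in cpu_stat_off */
    int diff[NR_ITEMS]; /* pending counter differentials */
};

static struct cpu_model cpus[NR_CPUS];
static long shep_global[NR_ITEMS];

/*
 * One shepherd pass: for every idle CPU with a pending update,
 * cancel the timer, fold its differentials into the global
 * counters and mark it as switched off.
 * Returns the number of CPUs that were quiesced.
 */
static int shepherd_pass(void)
{
    int quiesced = 0;

    for (size_t c = 0; c < NR_CPUS; c++) {
        struct cpu_model *cpu = &cpus[c];

        if (!cpu->idle || !cpu->timer_armed)
            continue;

        cpu->timer_armed = false;           /* cancel the timer */
        for (size_t i = 0; i < NR_ITEMS; i++) {
            shep_global[i] += cpu->diff[i]; /* update differentials */
            cpu->diff[i] = 0;
        }
        cpu->stat_off = true;               /* put it to cpu_stat_off */
        quiesced++;
    }
    return quiesced;
}
```

The point is that the idle path of the CPU itself does nothing at all;
the cancel and the fold are paid for by whoever runs the shepherd.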

> The regression that I know of (independent of "RT") is, as far as I
> know, due to the switch of the parameters of some vmstat functions to 64
> bit instead of 32 bit.

I am not sure I am following.

Michal Hocko