Re: [PATCH 00/36] cputime: Convert core use of cputime_t to nsecs

From: Martin Schwidefsky
Date: Mon Nov 21 2016 - 02:00:11 EST

On Fri, 18 Nov 2016 15:47:02 +0100
Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:

> On Fri, Nov 18, 2016 at 01:08:46PM +0100, Martin Schwidefsky wrote:
> > On Thu, 17 Nov 2016 19:08:07 +0100
> > Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> >
> > > I'm sorry for the patchbomb, especially as I usually complain about
> > > these myself but I don't see any way to split this patchset into
> > > standalone pieces, none of which would make any sense... All I can do
> > > is to isolate about 3 cleanup patches.
> >
> > On first glance the patches look ok-ish, but I am not happy about the
> > direction this takes.
> >
> > I can understand the wish to consolidate the common code to a single
> > format which is nano-seconds. It will have repercussions though.
> >
> > First the obvious problem, it does not compile for s390:
> >
> > arch/s390/kernel/vtime.c: In function 'do_account_vtime':
> > arch/s390/kernel/vtime.c:140:25: error: implicit declaration of function
> > 'cputime_to_nsecs' [-Werror=implicit-function-declaration]
> > account_user_time(tsk, cputime_to_nsecs(user));
> > ^~~~~~~~~~~~~~~~
> > arch/s390/kernel/idle.c: In function 'enabled_wait':
> > arch/s390/kernel/idle.c:46:20: error: implicit declaration of function
> > 'cputime_to_nsecs' [-Werror=implicit-function-declaration]
> > account_idle_time(cputime_to_nsecs(idle_time));
> > ^~~~~~~~~~~~~~~~
> > arch/s390/kernel/idle.c: In function 'arch_cpu_idle_time':
> > arch/s390/kernel/idle.c:100:9: error: implicit declaration of function
> > 'cputime_to_nsec' [-Werror=implicit-function-declaration]
> > return cputime_to_nsec(idle_enter ? ((idle_exit ?: now) - idle_enter) : 0);
> > ^~~~~~~~~~~~~~~
> Yes sorry I haven't yet done much build-testing. I should have written that it's
> not build-tested yet. This patchset in its current state is rather an RFC.

No big deal, I got it to compile with a small change.

> > The error at idle.c:100 is a typo cputime_to_nsec vs cputime_to_nsecs.
> > The other two could probably be solved with an additional include but the
> > default cputime_to_nsecs is in include/linux/cputime.h is this:
> >
> > #ifndef cputime_to_nsecs
> > # define cputime_to_nsecs(__ct) \
> > (cputime_to_usecs(__ct) * NSEC_PER_USEC)
> > #endif
> >
> > which downgrades the accuracy for s390 from better than nano-seconds
> > to micro-seconds. Not good. For the s390 cputime format you would have
> > to do
> >
> > static inline unsigned long long cputime_to_nsecs(const cputime_t cputime)
> > {
> > return ((__force unsigned long long) cputime * 1000) >> 12;
> > }
> I agree, that loss of acurracy is my biggest worry. Hence the accumulation
> idea, but more about that later.

We can not allow that to happen, but the accumulation should take care of it.

> >
> > But this *example* function has an overflow problem.
> >
> > > So currently, cputime_t serves the purpose, for s390 and
> > > powerpc (on CONFIG_VIRT_CPU_ACCOUNTING_NATIVE=y), to avoid converting
> > > arch clock counters to nanosecs or jiffies while accounting cputime.
> >
> > The cputime_t has several purposes:
> > 1) Allow for different units in the calculations for virtual cpu time.
> > There are currently three models: jiffies, nano-seconds and the native
> > TOD clock format for s390 which is a bit better than nano-seconds.
> Sure, I don't disagree with that, just with the way it is done (ie: stored
> and maintained in the core to this very obscure type).
> > 2) Act as a marker in the common code where a virtual cpu time is used.
> > This is more important than you might think, unfortunately it is very
> > easy to confuse a wall-clock delta with cpu time.
> There you lost me, I don't get which confusion you're pointing.

The confusion stems from the fact that you do *not* have a simple nano-second
value but a modal value that depends on the architecture. More below..

> > 3) Avoid expensive operations on the fast path to convert the native cpu
> > time to something else. Instead move the expensive calculation to the
> > read-out code, e.g. fs/proc.
> >
> > You patches breaks all three of these purposes. My main gripe is with 3).
> >
> > > But this comes at the cost of a lot of complexity and uglification
> > > in the core code to deal with such an opaque type that relies on lots of
> > > mutators and accessors in order to deal with a random granularity time
> > > unit that also involve lots of workarounds and likely some performance
> > > penalties.
> >
> > Having an opaque type with a set of helper functions is the whole point, no?
> > And I would not call the generic implementations for jiffies or nano-seconds
> > complex, these are easy enough to understand. And what are the performance
> > penalties you are talking about?
> Just because some code isn't too complex doesn't mean we really want to keep it.
> I get regular questions about what unit does cputime_t map to on a given
> configuration. Everybody gets confused about that. On many of the
> patches we got on cputime for the last years, I had to fix quite some issues
> with bad granularity assumption. In fact most fixes that came to kernel/sched/cputime.c
> recently, after merge or review, were about people getting confused with cputime_t granularity.

These regular question you get about the cputime_t is exactly what I was referring
to. If the value would just be a u64 the guys asking the question about cputime_t
would just assume the value to be nano-seconds and then go ahead and break things.

> Especially for stats that come from nsecs clocks (steal and irqtime), we always have to maintain an
> accumulator and make sure we don't lose some nanosec deltas.


> And we have to maintain several workarounds, sometimes even in the fastpath in
> order to cope with the cputime_t random granularity all over.
> Some fastpath examples:
> * steal time accounting (need to convert nsecs to cputime then back)
> * irqtime accounting (maintain accumulators)
> * cputime_adjust, used on any user read of cputime (need to convert from nsecs
> to cputime on cputime_adjust)
> But the worst really is about maintainance. This patchset removes around 600 lines.

Well 300 lines is from the powerpc and s390 cputime.h header and ~200 from
the generic cputime_jiffies.h and cputime_nsecs.h.

> >
> > > So this patchset proposes to convert most of the cputime_t uses to nsecs.
> > > In the end it's only used by s390 and powerpc. This all comes at the
> > > expense of those two archs which then need to perform a cputime_to_nsec()
> > > conversion everytime they update the cputime to the core. Now I expect
> > > we can leverage this performance loss with flushing the cputime only on
> > > ticks so that we accumulate time as cputime_t in between and make the
> > > conversions more rare.
> >
> > It is not just one cputime_to_nsec that we would have to add but several.
> > Three in do_account_vtime and one in vtime_account_irq_enter.
> >
> > The do_account_vtime function is called once per jiffy and once per task
> > switch. HZ is usually set to 100 for s390, the conversion once per jiffy
> > would not be so bad, but the call on the scheduling path *will* hurt.
> I don't think we need to flush on task switch. If we maintain the accumulators
> on the task/thread struct instead of per-cpu, then the remaining time after
> task switch out will be accounted on next tick after after next task switch in.

You can not properly calculate steal time if you allow sleeping tasks to sit on
up to 5*HZ worth of cpu time. I think we *have* to do accounting on task switch.
At least on s390, likely on powerpc as well. Why not make that an option for
the architecture with the yet-to-be-written accumulating code.

> > What is even worse is the vtime_account_irq_enter path, that is call several
> > times for each *interrupt*, at least two times for an interrupt without
> > additional processing and four times if a softirq is triggered.
> Actually maintaining an accumulator to flush on ticks is probably going to increase
> the perf because of that. account_system_time() is called twice per interrupt, and
> such function do much more than just account the time to the task_struct and cpustat
> fields. The same applies to userspace boundaries and context switch. The account_*_time()
> functions can be expensive.

The account_system_time twice per interrupt can be removed with the accumulation
idea. We will have to see how expensive the accounting_xxx_time calls are on
the context switch path.

> >
> > Now it has been proposed to implement lazy accounting to accumulate deltas
> > and do the expensive conversions only infrequently. This is pretty straight-
> > forward for account_user_time but to do this for the account_system_time
> > function is more complicated. The function has to differentiate between
> > guest/hardirq/softirq and pure system time. We would need to keep sums for
> > each bucket and provide a separate function to add to each bucket. Like
> > account_guest_time(), account_hardirq_time(), account_softirq_time() and
> > account_system_time(). Then it is up to the arch code to sort out the details
> > and call the accounting code once per jiffy for each of the buckets.
> That wouldn't be too hard really. The s390 code in vtime.c already does that.

Yes, I agree that the accumulating change would not be too hard. Can I make the
request that we try to get that done first before doing the cleanup ?

> > We still have to do the whole thing on each task switch though.
> Not if we maintain the deltas in the task_struct.
> >
> > But I am still not happy about the approach. What is the compelling reason
> > for this change except for the "but it looks ugly"?
> The diffstat (600 lines removed). Also the fact that we have all these workarounds
> in the core code just for the special case of 1 arch (s390) and a half
> I'd much rather have all that complexity moved in a vtime_native.c shared by s390 and powerpc
> that takes care of proper accumulation in cputime_t and flushes that on ticks in nsecs rather
> than having all these cputime_t game all over the kernel.

The goal to have nano-seconds only in the core code is a good one. And with the
accumulator I think s390 can live with it. The change would have a real upside
too. There are these stupid divisions for scaled cputime that we have to calculate
for every call to account_xxx_time(). These would not be done for the interrupts

blue skies,

"Reality continues to ruin my life." - Calvin.