Re: [RFC PATCH 07/30] cputime: Convert kcpustat to nsecs

From: Frederic Weisbecker
Date: Mon Dec 01 2014 - 11:10:44 EST

On Mon, Dec 01, 2014 at 03:14:02PM +0100, Martin Schwidefsky wrote:
> On Fri, 28 Nov 2014 19:23:37 +0100
> Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
> > Kernel cpu stats are stored in cputime_t which is an architecture
> > defined type, and hence a bit opaque and requiring accessors and mutators
> > for any operation.
> >
> > Converting them to nsecs simplifies the code a little bit.
> Quite honestly I do not see much of an improvement here, on set of
> functions (cputime_to_xxx) gets replaced by another (nsecs_to_xxx).

Well it's not just that. Irqtime accounting gets simplified (no more
temporary buffer getting flushed on tick), same goes for guest accounting.
cpufreq gets one less level of conversion. Some places also lost their cputime_t
accessors due to internal nsecs use.
Plus a few other simplifications here and there that I haven't yet finished
like VIRT_CPU_ACCOUNTING_GEN that won't need cputime_t accessors anymore.

Also once the patchset is complete, we should be able to remove a significant
part of cputime_t accessors and mutators if only archs use them for one or
two conversions (probably cputime_to_nsecs() alone would be enough). And there are
many implementations of cputime_t: jiffies, nsecs, powerpc and s390. Expect
the removal of jiffies and nsecs based cputime_t plus the largest part of powerpc
and s390 implementations.

> On the contrary for s390 I see a degradation, consider a hunk like
> this:
> > @@ -128,9 +128,9 @@ static inline u64 get_cpu_idle_time_jiffy(unsigned int cpu, u64 *wall)
> >
> > idle_time = cur_wall_time - busy_time;
> > if (wall)
> > - *wall = cputime_to_usecs(cur_wall_time);
> > + *wall = div_u64(cur_wall_time, NSEC_PER_USEC);
> >
> > - return cputime_to_usecs(idle_time);
> > + return div_u64(idle_time, NSEC_PER_USEC);
> > }
> >
> > u64 get_cpu_idle_time(unsigned int cpu, u64 *wall, int io_busy)
> For s390 cputime_to_usecs is a shift, with the new code we now have a division.
> Fortunately this is in a piece of code that s390 does not use..

Speaking about the degradation in s390:

s390 is really a special case. And it would be a shame if we prevent from a
real core cleanup just for this special case especially as it's fairly possible
to keep a specific treatment for s390 in order not to impact its performances
and time precision. We could simply accumulate the cputime in per-cpu values:

struct s390_cputime {
cputime_t user, sys, softirq, hardirq, steal;

DEFINE_PER_CPU(struct s390_cputime, s390_cputime);

Then on irq entry/exit, just add the accumulated time to the relevant buffer
and account for real (through any account_...time() functions) only on tick
and task switch. There the costly operations (unit conversion and call to
account_...._time() functions) are deferred to a rarer yet periodic enough
event. This is what s390 does already for user/system time and kernel

This way we should even improve the situation compared to what we have
upstream. It's going to be faster because calling the accounting functions
can be costlier than simple per-cpu ops. And also we keep the cputime_t
granularity. For archs like s390 which have a granularity higher than nsecs,
we can have:

u64 cputime_to_nsecs(cputime_t time, u64 *rem);

And to avoid remainder losses, we can do that from the tick:

delta_cputime = this_cpu_read(s390_cputime.hardirq);
delta_nsec = cputime_to_nsecs(delta_cputime, &rem);
account_system_time(delta_nsec, HARDIRQ_OFFSET);
this_cpu_write(s390_cputime.hardirq, rem);

Although I doubt that remainders below one nsec lost each tick matter that much.
But if it does, it's fairly possible to handle like above.
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at