Re: [RFC PATCH 07/30] cputime: Convert kcpustat to nsecs

From: Christian Borntraeger
Date: Mon Dec 01 2014 - 15:14:18 EST

On 01.12.2014 at 18:15, Thomas Gleixner wrote:
> On Mon, 1 Dec 2014, Martin Schwidefsky wrote:
>> On Mon, 1 Dec 2014 17:10:34 +0100
>> Frederic Weisbecker <fweisbec@xxxxxxxxx> wrote:
>>> Speaking about the degradation in s390:
>>> s390 is really a special case. It would be a shame to block a real core
>>> cleanup just for this special case, especially as it's quite possible to
>>> keep a specific treatment for s390 so that its performance and time
>>> precision are not affected. We could simply accumulate the cputime in
>>> per-cpu values:
>>> struct s390_cputime {
>>> 	cputime_t user, sys, softirq, hardirq, steal;
>>> };
>>> DEFINE_PER_CPU(struct s390_cputime, s390_cputime);
>>> Then on irq entry/exit, just add the accumulated time to the relevant buffer
>>> and do the real accounting (through the account_...time() functions) only on
>>> tick and task switch. That way the costly operations (unit conversion and
>>> calls to the account_..._time() functions) are deferred to a rarer yet
>>> still periodic event. This is what s390 does already for user/system time
>>> and kernel boundaries.
>>> This way we should even improve the situation compared to what we have
>>> upstream. It's going to be faster because calling the accounting functions
>>> can be costlier than simple per-cpu ops. We also keep the cputime_t
>>> granularity. For archs like s390, whose granularity is finer than nsecs,
>>> we can have:
>>> u64 cputime_to_nsecs(cputime_t time, u64 *rem);
>>> And to avoid remainder losses, we can do that from the tick:
>>> delta_cputime = this_cpu_read(s390_cputime.hardirq);
>>> delta_nsec = cputime_to_nsecs(delta_cputime, &rem);
>>> account_system_time(delta_nsec, HARDIRQ_OFFSET);
>>> this_cpu_write(s390_cputime.hardirq, rem);
>>> That said, I doubt that remainders below one nsec lost on each tick matter
>>> that much. But if they do, they can be handled as shown above.
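
A minimal user-space sketch of the remainder handling proposed above. It
assumes the s390 convention of 4096 cputime units per microsecond, which
reduces to 512 units per 125 ns, so only whole groups of 512 units convert
without loss and anything smaller is carried over to the next tick. The
struct bucket type and the bucket_add()/bucket_tick() helpers are
hypothetical stand-ins for the per-cpu variable and the tick hook; they are
not kernel APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t cputime_t;

/* Assumption: 4096 cputime units per usec, i.e. 512 units == 125 ns. */
#define CPUTIME_PER_CHUNK 512ULL
#define NSEC_PER_CHUNK    125ULL

/*
 * Convert whole 512-unit chunks to nanoseconds. Whatever does not fill a
 * chunk is returned in *rem, still in cputime units, so nothing is lost.
 */
static uint64_t cputime_to_nsecs(cputime_t time, uint64_t *rem)
{
	assert(rem != NULL);
	*rem = time % CPUTIME_PER_CHUNK;
	return (time / CPUTIME_PER_CHUNK) * NSEC_PER_CHUNK;
}

/* Hypothetical stand-in for one per-cpu bucket such as s390_cputime.hardirq. */
struct bucket {
	cputime_t pending;     /* accumulated but not yet accounted */
	uint64_t accounted_ns; /* what the accounting core would have received */
};

/* Cheap path, as on irq entry/exit: just accumulate the delta. */
static void bucket_add(struct bucket *b, cputime_t delta)
{
	b->pending += delta;
}

/* Rare path, as on tick: convert, account, keep the remainder. */
static void bucket_tick(struct bucket *b)
{
	uint64_t rem;

	b->accounted_ns += cputime_to_nsecs(b->pending, &rem);
	b->pending = rem;
}
```

With this split, a tick that sees 700 pending units accounts 125 ns and keeps
188 units; once another 324 units arrive, the next tick accounts the second
125 ns and the bucket is empty again.
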
>> To make that work we would have to move some of the logic from account_system_time
>> to the architecture code. The decision whether a system time delta is guest
>> time, irq time, softirq time or simply system time is currently made in
>> kernel/sched/cputime.c.
>> As the conversion and the accounting are delayed to a regular tick, we would
>> have to split the accounting code into decision functions that determine
>> which bucket a system time delta should go to, and introduce new functions
>> to account to the different buckets.
>> Instead of a single account_system_time we would have account_guest_time,
>> account_system_time, account_system_time_irq and account_system_time_softirq.
>> In principle not a bad idea, that would make the interrupt path for s390 faster
>> as we would not have to call account_system_time, only the decision function
>> which could be an inline function.
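
A rough user-space model of such an inline decision function, under stated
assumptions: struct ctx is a hypothetical snapshot of the state the kernel
would read from the task flags and the preempt count, and the bucket names
are made up for illustration. The ordering (guest first, then hardirq, then
softirq, then plain system time) is a simplification of what
account_system_time() does today.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum bucket {
	BUCKET_GUEST,
	BUCKET_HARDIRQ,
	BUCKET_SOFTIRQ,
	BUCKET_SYSTEM,
	BUCKET_COUNT
};

/* Hypothetical snapshot of the context in which the delta was measured. */
struct ctx {
	bool vcpu;       /* task is running a guest (PF_VCPU in the kernel) */
	bool in_hardirq; /* delta was spent servicing a hard interrupt */
	bool in_softirq; /* delta was spent servicing a softirq */
};

/* Decision only: pick the bucket, touch no accounting state. */
static inline enum bucket pick_bucket(const struct ctx *c)
{
	if (c->vcpu && !c->in_hardirq && !c->in_softirq)
		return BUCKET_GUEST;
	if (c->in_hardirq)
		return BUCKET_HARDIRQ;
	if (c->in_softirq)
		return BUCKET_SOFTIRQ;
	return BUCKET_SYSTEM;
}

/* Stand-in for the per-cpu accumulators, in native cputime units. */
static uint64_t pending[BUCKET_COUNT];

/* Interrupt-path work: one decision plus one addition. */
static inline void accumulate(const struct ctx *c, uint64_t delta)
{
	pending[pick_bucket(c)] += delta;
}
```

The per-bucket accounting functions would then only be called when the
accumulators are drained, not on every interrupt.
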
> Why make this s390 specific?
> We can decouple the accounting from the time accumulation for all
> architectures.
> struct cputime_record {
> 	u64 user, sys, softirq, hardirq, steal;
> };

Won't we need guest, nice, guest_nice as well?

> DEFINE_PER_CPU(struct cputime_record, cputime_record);
> Now let account_xxx_time() just work on that per-cpu data
> structure. That would just accumulate the deltas based on whatever
> the architecture uses as a cputime source with whatever resolution it
> provides.
> Then we collect those accumulated results for the various buckets on a
> regular basis and convert them to nanoseconds. This is not even
> required to be at the tick, it could be done by some async worker and
> on idle enter/exit.
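
A sketch of how the architecture-independent variant could look in plain C,
assuming the record also carries the guest, nice and guest_nice buckets. The
array layout, the struct cputime_ratio type and cputime_flush() are
hypothetical; the ratio says how many native units make up how many
nanoseconds, so an architecture that already counts in nanoseconds would use
1:1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum cputime_bucket {
	CT_USER, CT_NICE, CT_SYS, CT_SOFTIRQ, CT_HARDIRQ,
	CT_STEAL, CT_GUEST, CT_GUEST_NICE, CT_NR
};

/* One counter per bucket; an array so that all buckets can be flushed in a loop. */
struct cputime_record {
	uint64_t v[CT_NR];
};

/* Hypothetical per-architecture ratio: `units` native units == `nsecs` ns. */
struct cputime_ratio {
	uint64_t units;
	uint64_t nsecs;
};

/*
 * Drain every bucket: move whole `units` groups into the nanosecond totals
 * and leave the rest pending. This could run from the tick, from an async
 * worker, or on idle enter/exit.
 */
static void cputime_flush(struct cputime_record *pending,
			  struct cputime_record *total_ns,
			  const struct cputime_ratio *r)
{
	for (size_t i = 0; i < CT_NR; i++) {
		total_ns->v[i] += (pending->v[i] / r->units) * r->nsecs;
		pending->v[i] %= r->units;
	}
}
```
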
> Thanks,
> tglx
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@xxxxxxxxxxxxxxx
> More majordomo info at
> Please read the FAQ at
