Re: [PATCH 1/5] sched,time: count actually elapsed irq & softirq time
From: Frederic Weisbecker
Date: Mon Jun 27 2016 - 08:26:04 EST
On Wed, Jun 22, 2016 at 10:25:47PM -0400, riel@xxxxxxxxxx wrote:
> From: Rik van Riel <riel@xxxxxxxxxx>
>
> Currently, if there was any irq or softirq time during 'ticks'
> jiffies, the entire period will be accounted as irq or softirq
> time.
>
> This is inaccurate if only a subset of 'ticks' jiffies was
> actually spent handling irqs, and could conceivably mis-count
> all of the ticks during a period as irq time, when there was
> some irq and some softirq time.
Good catch!
Many comments following.
>
> This can actually happen when irqtime_account_process_tick
> is called from account_idle_ticks, which can pass a larger
> number of ticks down all at once.
>
> Fix this by changing irqtime_account_hi_update and
> irqtime_account_si_update to round elapsed irq and softirq
> time to jiffies, and return the number of jiffies spent in
> each mode, similar to how steal time is handled.
>
> Additionally, have irqtime_account_process_tick take into
> account how much time was spent in each of steal, irq,
> and softirq time.
>
> The latter could help improve the accuracy of timekeeping
Maybe you meant cputime? Timekeeping is rather about jiffies and GTOD.
> when returning from idle on a NO_HZ_IDLE CPU.
>
> Properly accounting how much time was spent in hardirq and
> softirq time will also allow the NO_HZ_FULL code to re-use
> these same functions for hardirq and softirq accounting.
>
> Signed-off-by: Rik van Riel <riel@xxxxxxxxxx>
> ---
> kernel/sched/cputime.c | 81 +++++++++++++++++++++++++++++++++++++-------------
> 1 file changed, 60 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 3d60e5d76fdb..15b813c014be 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -79,40 +79,54 @@ void irqtime_account_irq(struct task_struct *curr)
> }
> EXPORT_SYMBOL_GPL(irqtime_account_irq);
>
> -static int irqtime_account_hi_update(void)
> +static unsigned long irqtime_account_hi_update(unsigned long max_jiffies)
> {
> u64 *cpustat = kcpustat_this_cpu->cpustat;
> + unsigned long irq_jiffies = 0;
> unsigned long flags;
> - u64 latest_ns;
> - int ret = 0;
> + u64 irq;
Should be cputime_t ? And perhaps renamed to irq_cputime to distinguish
clearly against irq_jiffies?
>
> local_irq_save(flags);
> - latest_ns = this_cpu_read(cpu_hardirq_time);
> - if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_IRQ])
> - ret = 1;
> + irq = this_cpu_read(cpu_hardirq_time) - cpustat[CPUTIME_IRQ];
cpu_hardirq_time is made of nsecs whereas cpustat is of cputime_t (in fact
even cputime64_t). So you need to convert cpu_hardirq_time before doing the
substract.
> + if (irq > cputime_one_jiffy) {
> + irq_jiffies = min(max_jiffies, cputime_to_jiffies(irq));
> + cpustat[CPUTIME_IRQ] += jiffies_to_cputime(irq_jiffies);
> + }
> local_irq_restore(flags);
> - return ret;
> + return irq_jiffies;
I think we shouldn't use jiffies at all here and only rely on cputime_t.
max_jiffies should be of cputime_t and this function should return the cputime_t.
The resulting code should be more simple and in fact more precise (more below on that).
> }
>
> -static int irqtime_account_si_update(void)
> +static unsigned long irqtime_account_si_update(unsigned long max_jiffies)
> {
> u64 *cpustat = kcpustat_this_cpu->cpustat;
> + unsigned long si_jiffies = 0;
> unsigned long flags;
> - u64 latest_ns;
> - int ret = 0;
> + u64 softirq;
>
> local_irq_save(flags);
> - latest_ns = this_cpu_read(cpu_softirq_time);
> - if (nsecs_to_cputime64(latest_ns) > cpustat[CPUTIME_SOFTIRQ])
> - ret = 1;
> + softirq = this_cpu_read(cpu_softirq_time) - cpustat[CPUTIME_SOFTIRQ];
> + if (softirq > cputime_one_jiffy) {
> + si_jiffies = min(max_jiffies, cputime_to_jiffies(softirq));
> + cpustat[CPUTIME_SOFTIRQ] += jiffies_to_cputime(si_jiffies);
> + }
> local_irq_restore(flags);
> - return ret;
> + return si_jiffies;
So same comments apply here.
[...]
> * Accumulate raw cputime values of dead tasks (sig->[us]time) and live
> * tasks (sum on group iteration) belonging to @tsk's group.
> */
> @@ -344,19 +378,24 @@ static void irqtime_account_process_tick(struct task_struct *p, int user_tick,
> {
> cputime_t scaled = cputime_to_scaled(cputime_one_jiffy);
> u64 cputime = (__force u64) cputime_one_jiffy;
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
> + unsigned long other;
>
> - if (steal_account_process_tick(ULONG_MAX))
> + /*
> + * When returning from idle, many ticks can get accounted at
> + * once, including some ticks of steal, irq, and softirq time.
> + * Subtract those ticks from the amount of time accounted to
> + * idle, or potentially user or system time. Due to rounding,
> + * other time can exceed ticks occasionally.
> + */
> + other = account_other_ticks(ticks);
> + if (other >= ticks)
> return;
> + ticks -= other;
>
> cputime *= ticks;
> scaled *= ticks;
So instead of dealing with ticks here, I think you should rather use the above
cputime as both the limit and the remaining time to account after steal/irqs.
This should avoid some middle conversions and improve precision when cputime_t == nsecs granularity.
If we account 2 ticks to idle (lets say HZ=100) and irq time to account is 15 ms. 2 ticks = 20 ms
so we have 5 ms left to account to idle. With the jiffies granularity in this patch, we would account
one tick to irqtime (1 tick = 10 ms, there will be 5 ms to account back later) and one tick to idle
time whereas if you deal with cputime_t, you are going to account the correct amount of idle time.
Beyond the scope of this patchset, we should probably kill cputime_t and account everything to nsecs.
I have a patchset that did that 2 years ago, I should probably re-iterate it.
Thanks.