Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle

From: Shrikanth Hegde

Date: Wed Feb 25 2026 - 05:37:31 EST


Hi Christophe.

On 2/25/26 3:15 PM, Christophe Leroy (CS GROUP) wrote:
Hi Hegde,

Le 25/02/2026 à 08:46, Shrikanth Hegde a écrit :
Hi Christophe,

On 2/24/26 9:11 PM, Christophe Leroy (CS GROUP) wrote:
Hi Hegde,

Le 19/02/2026 à 19:30, Shrikanth Hegde a écrit :


On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
Currently the tick subsystem stores the idle cputime accounting in
private fields, allowing cohabitation with architecture idle vtime
accounting. The former is fetched on online CPUs, the latter on offline
CPUs.

For consolidation purpose, architecture vtime accounting will continue
to account the cputime but will make a break when the idle tick is
stopped. The dyntick cputime accounting will then be relayed by the tick
subsystem so that the idle cputime is still seen advancing coherently
even when the tick isn't there to flush the idle vtime.

Prepare for that and introduce three new APIs which will be used in
subsequent patches:

_ vtime_dynticks_start() is deemed to be called when idle enters in
   dyntick mode. The idle cputime that elapsed so far is accumulated.

- vtime_dynticks_stop() is deemed to be called when idle exits from
   dyntick mode. The vtime entry clocks are fast-forward to current time
   so that idle accounting restarts elapsing from now.

- vtime_reset() is deemed to be called from dynticks idle IRQ entry to
   fast-forward the clock to current time so that the IRQ time is still
   accounted by vtime while nohz cputime is paused.

Also accumulated vtime won't be flushed from dyntick-idle ticks to avoid
accounting twice the idle cputime, along with nohz accounting.

Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>

Reviewed-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxx>

---
  arch/powerpc/kernel/time.c | 41 +++++++++++++++++++++++++++++++++ + ++++
  include/linux/vtime.h      |  6 ++++++
  2 files changed, 47 insertions(+)

diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 4bbeb8644d3d..18506740f4a4 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -376,6 +376,47 @@ void vtime_task_switch(struct task_struct *prev)
          acct->starttime = acct0->starttime;
      }
  }
+
+#ifdef CONFIG_NO_HZ_COMMON
+/**
+ * vtime_reset - Fast forward vtime entry clocks
+ *
+ * Called from dynticks idle IRQ entry to fast-forward the clocks to current time
+ * so that the IRQ time is still accounted by vtime while nohz cputime is paused.
+ */
+void vtime_reset(void)
+{
+    struct cpu_accounting_data *acct = get_accounting(current);
+
+    acct->starttime = mftb();

I figured out why those huge values happen.

This happens because mftb is from when the system is booted.
I was doing kexec to start the new kernel and mftb wasn't getting
reset.

I thought about this. This is concern for pseries too, where LPAR's
restart but system won't restart and mftb will continue to run instead of
reset.

I think we should be using sched_clock instead of mftb here.
Though we need it a few more places and some cosmetic changes around it.

Note: Some values being huge exists without series for few CPUs, with series it
shows up in most of the CPUs.

So I am planning send out fix below fix separately keeping your
series as dependency.

---
  arch/powerpc/include/asm/accounting.h |  4 ++--
  arch/powerpc/include/asm/cputime.h    | 14 +++++++-------
  arch/powerpc/kernel/time.c            | 22 +++++++++++-----------
  3 files changed, 20 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/accounting.h b/arch/powerpc/ include/asm/accounting.h
index 6d79c31700e2..50f120646e6d 100644
--- a/arch/powerpc/include/asm/accounting.h
+++ b/arch/powerpc/include/asm/accounting.h
@@ -21,8 +21,8 @@ struct cpu_accounting_data {
      unsigned long steal_time;
      unsigned long idle_time;
      /* Internal counters */
-    unsigned long starttime;    /* TB value snapshot */
-    unsigned long starttime_user;    /* TB value on exit to usermode */
+    unsigned long starttime;    /* Time value snapshot */
+    unsigned long starttime_user;    /* Time value on exit to usermode */
  #ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
      unsigned long startspurr;    /* SPURR value snapshot */
      unsigned long utime_sspurr;    /* ->user_time when - >startspurr set */
diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/ include/ asm/cputime.h
index aff858ca99c0..eb6b629b113f 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -20,9 +20,9 @@
  #include <asm/time.h>
  #include <asm/param.h>
  #include <asm/firmware.h>
+#include <linux/sched/clock.h>

  #ifdef __KERNEL__
-#define cputime_to_nsecs(cputime) tb_to_ns(cputime)

  /*
   * PPC64 uses PACA which is task independent for storing accounting data while
@@ -44,20 +44,20 @@
   */
  static notrace inline void account_cpu_user_entry(void)
  {
-    unsigned long tb = mftb();
+    unsigned long now = sched_clock();

Now way !

By doing that you'll kill performance for no reason. All we need when accounting time spent in kernel or in user is the difference between time at entry and time at exit, no mater what the time was at boot time.


No. With this patch there will not be any performance difference.
All it does is, instead of using mftb uses sched_clock at those places.


In arch/powerpc/kernel/time.c we have sched_clock().
notrace unsigned long long sched_clock(void)
{
         return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
}

It does the same mftb call, and accounts only the time after boot, which is
what /proc/stat should do as well.

"
the amount of time, measured in units of USER_HZ
(1/100ths of a second on most architectures

user   (1) Time spent in user mode.

idle   (4) Time spent in the idle task.  This value
        should be USER_HZ times the second entry in
        the /proc/uptime pseudo-file.
"
/proc/uptime is based on sched_clock, so i infer /proc/stat also should show
values w.r.t to boot of the OS.


Also sched_clock() returns nanoseconds which implies calculation from timebase. This is pointless CPU consumption. The current implementation calculates nanoseconds at task switch when calling vtime_flush().Your change will now do it at every kernel entry and kernel exit by calling sched_clock().

This change doesn't add any additional paths. Even without patches, mftb would have
been called in every kernel entry/exit.  See mftb usage account_cpu_user_exit/enter

Now instead of mftb sched_clock is used, that's all. No additional entry/exit points.
And previously when accounting we would have done cputime_to_nsecs, now that conversion
is done automatically in sched_clock. So overall computation-wise it should be same.

What i am missing to see it here?

Ok, lets try to explain in more details:

While a process is running, it will enter and leave the kernel multiple times, without task switch. For instance for system calls or for interrupts.

At every kernel entry and exit, account_cpu_user_entry() and account_cpu_user_exit() are called. That's a very hot path.

I have added the following functions to see what the code looks like:

+
+void my_account_cpu_user_entry(void);
+void my_account_cpu_user_entry(void)
+{
+       account_cpu_user_entry();
+}
+
+void my_account_cpu_user_exit(void);
+void my_account_cpu_user_exit(void)
+{
+       account_cpu_user_exit();
+}

What we have today is very optimised:

00000148 <my_account_cpu_user_entry>:
 148:    7d 0c 42 e6     mftb    r8
 14c:    80 e2 00 08     lwz     r7,8(r2)
 150:    81 22 00 28     lwz     r9,40(r2)
 154:    91 02 00 24     stw     r8,36(r2)
 158:    7d 29 38 50     subf    r9,r9,r7
 15c:    7d 29 42 14     add     r9,r9,r8
 160:    91 22 00 08     stw     r9,8(r2)
 164:    4e 80 00 20     blr

00000168 <my_account_cpu_user_exit>:
 168:    7d 0c 42 e6     mftb    r8
 16c:    80 e2 00 0c     lwz     r7,12(r2)
 170:    81 22 00 24     lwz     r9,36(r2)
 174:    91 02 00 28     stw     r8,40(r2)
 178:    7d 29 38 50     subf    r9,r9,r7
 17c:    7d 29 42 14     add     r9,r9,r8
 180:    91 22 00 0c     stw     r9,12(r2)
 184:    4e 80 00 20     blr



With your change we now get a call to sched_clock() instead of a simple mftb,

00000154 <my_account_cpu_user_entry>:
 154:    94 21 ff f0     stwu    r1,-16(r1)
 158:    7c 08 02 a6     mflr    r0
 15c:    90 01 00 14     stw     r0,20(r1)
 160:    48 00 00 01     bl      160 <my_account_cpu_user_entry+0xc>
            160: R_PPC_REL24    sched_clock
 164:    81 02 00 08     lwz     r8,8(r2)
 168:    81 22 00 28     lwz     r9,40(r2)
 16c:    90 82 00 24     stw     r4,36(r2)
 170:    7d 29 40 50     subf    r9,r9,r8
 174:    7d 29 22 14     add     r9,r9,r4
 178:    91 22 00 08     stw     r9,8(r2)
 17c:    80 01 00 14     lwz     r0,20(r1)
 180:    38 21 00 10     addi    r1,r1,16
 184:    7c 08 03 a6     mtlr    r0
 188:    4e 80 00 20     blr

0000018c <my_account_cpu_user_exit>:
 18c:    94 21 ff f0     stwu    r1,-16(r1)
 190:    7c 08 02 a6     mflr    r0
 194:    90 01 00 14     stw     r0,20(r1)
 198:    48 00 00 01     bl      198 <my_account_cpu_user_exit+0xc>
            198: R_PPC_REL24    sched_clock
 19c:    81 02 00 0c     lwz     r8,12(r2)
 1a0:    81 22 00 24     lwz     r9,36(r2)
 1a4:    90 82 00 28     stw     r4,40(r2)
 1a8:    7d 29 40 50     subf    r9,r9,r8
 1ac:    7d 29 22 14     add     r9,r9,r4
 1b0:    91 22 00 0c     stw     r9,12(r2)
 1b4:    80 01 00 14     lwz     r0,20(r1)
 1b8:    38 21 00 10     addi    r1,r1,16
 1bc:    7c 08 03 a6     mtlr    r0
 1c0:    4e 80 00 20     blr

And sched_clock() is heavy, first it has the sequence mftbu/mftb/mftbu, and then it does awful lot of calculations including many multiply:

000004d8 <sched_clock>:
 4d8:    7d 2d 42 e6     mftbu   r9
 4dc:    7d 0c 42 e6     mftb    r8
 4e0:    7d 4d 42 e6     mftbu   r10
 4e4:    7c 09 50 40     cmplw   r9,r10
 4e8:    40 82 ff f0     bne     4d8 <sched_clock>
 4ec:    3d 40 00 00     lis     r10,0
            4ee: R_PPC_ADDR16_HA    .data..ro_after_init
 4f0:    38 ca 00 00     addi    r6,r10,0
            4f2: R_PPC_ADDR16_LO    .data..ro_after_init
 4f4:    3c e0 00 00     lis     r7,0
            4f6: R_PPC_ADDR16_HA    .data..read_mostly
 4f8:    38 87 00 00     addi    r4,r7,0
            4fa: R_PPC_ADDR16_LO    .data..read_mostly
 4fc:    80 66 00 04     lwz     r3,4(r6)
 500:    80 e7 00 00     lwz     r7,0(r7)
            502: R_PPC_ADDR16_LO    .data..read_mostly
 504:    80 c4 00 04     lwz     r6,4(r4)
 508:    81 4a 00 00     lwz     r10,0(r10)
            50a: R_PPC_ADDR16_LO    .data..ro_after_init
 50c:    7c 63 40 10     subfc   r3,r3,r8
 510:    7d 0a 49 10     subfe   r8,r10,r9
 514:    7d 27 19 d6     mullw   r9,r7,r3
 518:    7d 43 30 16     mulhwu  r10,r3,r6
 51c:    7c 08 31 d6     mullw   r0,r8,r6
 520:    7d 4a 48 14     addc    r10,r10,r9
 524:    7c 67 18 16     mulhwu  r3,r7,r3
 528:    39 20 00 00     li      r9,0
 52c:    7c c8 30 16     mulhwu  r6,r8,r6
 530:    7c a9 49 14     adde    r5,r9,r9
 534:    7d 67 41 d6     mullw   r11,r7,r8
 538:    7d 4a 00 14     addc    r10,r10,r0
 53c:    7c a5 01 94     addze   r5,r5
 540:    7c 63 30 14     addc    r3,r3,r6
 544:    7d 29 49 14     adde    r9,r9,r9
 548:    80 84 00 08     lwz     r4,8(r4)
 54c:    7c 63 58 14     addc    r3,r3,r11
 550:    7c e7 40 16     mulhwu  r7,r7,r8
 554:    7d 29 01 94     addze   r9,r9
 558:    7c 63 28 14     addc    r3,r3,r5
 55c:    7d 29 39 14     adde    r9,r9,r7
 560:    35 44 ff e0     addic.  r10,r4,-32
 564:    41 80 00 10     blt     574 <sched_clock+0x9c>
 568:    7c 63 50 30     slw     r3,r3,r10
 56c:    38 80 00 00     li      r4,0
 570:    4e 80 00 20     blr
 574:    21 04 00 1f     subfic  r8,r4,31
 578:    54 6a f8 7e     srwi    r10,r3,1
 57c:    7d 29 20 30     slw     r9,r9,r4
 580:    7d 4a 44 30     srw     r10,r10,r8
 584:    7c 64 20 30     slw     r4,r3,r4
 588:    7d 43 4b 78     or      r3,r10,r9
 58c:    4e 80 00 20     blr

I think the difference is obvious, no need of benchmarking. We shall refrain from calling sched_clock() at every kernel entry/exit. Converting from timebase to nanoseconds only need to be done in vtime_flush() called by vtime_task_switch() during task switch.

Hope it is more explicit now.


Got it. The main concern was around with additional computation that sched_clock,
not any additional paths per se.

yes, that would be possible,


How about we do below? This adds only one subtraction.
This achieves the same outcome.

---

diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index aff858ca99c0..7afba0202568 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -44,7 +44,7 @@
*/
static notrace inline void account_cpu_user_entry(void)
{
- unsigned long tb = mftb();
+ unsigned long tb = mftb() - get_boot_tb();
struct cpu_accounting_data *acct = raw_get_accounting(current);
acct->utime += (tb - acct->starttime_user);
@@ -53,7 +53,7 @@ static notrace inline void account_cpu_user_entry(void)
static notrace inline void account_cpu_user_exit(void)
{
- unsigned long tb = mftb();
+ unsigned long tb = mftb() - get_boot_tb();
struct cpu_accounting_data *acct = raw_get_accounting(current);
acct->stime += (tb - acct->starttime);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 18506740f4a4..ff5524e6cdc7 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -215,7 +215,7 @@ static unsigned long vtime_delta(struct cpu_accounting_data *acct,
WARN_ON_ONCE(!irqs_disabled());
- now = mftb();
+ now = mftb() - get_boot_tb();
stime = now - acct->starttime;
acct->starttime = now;
@@ -388,7 +388,7 @@ void vtime_reset(void)
{
struct cpu_accounting_data *acct = get_accounting(current);
- acct->starttime = mftb();
+ acct->starttime = mftb() - get_boot_tb();
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
acct->startspurr = read_spurr(acct->starttime);
#endif