Re: [PATCH 04/15] powerpc/time: Prepare to stop elapsing in dynticks-idle
From: Shrikanth Hegde
Date: Wed Feb 25 2026 - 05:37:31 EST
Hi Christophe.
On 2/25/26 3:15 PM, Christophe Leroy (CS GROUP) wrote:
Hi Hegde,
Le 25/02/2026 à 08:46, Shrikanth Hegde a écrit :
Hi Christophe,
On 2/24/26 9:11 PM, Christophe Leroy (CS GROUP) wrote:
Hi Hegde,
Le 19/02/2026 à 19:30, Shrikanth Hegde a écrit :
On 2/6/26 7:52 PM, Frederic Weisbecker wrote:
Currently the tick subsystem stores the idle cputime accounting in
private fields, allowing cohabitation with architecture idle vtime
accounting. The former is fetched on online CPUs, the latter on offline
CPUs.
For consolidation purpose, architecture vtime accounting will continue
to account the cputime but will make a break when the idle tick is
stopped. The dyntick cputime accounting will then be relayed by the tick
subsystem so that the idle cputime is still seen advancing coherently
even when the tick isn't there to flush the idle vtime.
Prepare for that and introduce three new APIs which will be used in
subsequent patches:
- vtime_dynticks_start() is deemed to be called when idle enters
dyntick mode. The idle cputime that elapsed so far is accumulated.
- vtime_dynticks_stop() is deemed to be called when idle exits from
dyntick mode. The vtime entry clocks are fast-forwarded to current time
so that idle accounting restarts elapsing from now.
- vtime_reset() is deemed to be called from dynticks idle IRQ entry to
fast-forward the clock to current time so that the IRQ time is still
accounted by vtime while nohz cputime is paused.
Also accumulated vtime won't be flushed from dyntick-idle ticks to avoid
accounting twice the idle cputime, along with nohz accounting.
Signed-off-by: Frederic Weisbecker <frederic@xxxxxxxxxx>
Reviewed-by: Shrikanth Hegde <sshegde@xxxxxxxxxxxxx>
---
arch/powerpc/kernel/time.c | 41 +++++++++++++++++++++++++++++++++++++++++
include/linux/vtime.h | 6 ++++++
2 files changed, 47 insertions(+)
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 4bbeb8644d3d..18506740f4a4 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -376,6 +376,47 @@ void vtime_task_switch(struct task_struct *prev)
acct->starttime = acct0->starttime;
}
}
+
+#ifdef CONFIG_NO_HZ_COMMON
+/**
+ * vtime_reset - Fast forward vtime entry clocks
+ *
+ * Called from dynticks idle IRQ entry to fast-forward the clocks to current time
+ * so that the IRQ time is still accounted by vtime while nohz cputime is paused.
+ */
+void vtime_reset(void)
+{
+ struct cpu_accounting_data *acct = get_accounting(current);
+
+ acct->starttime = mftb();
I figured out why those huge values happen: mftb counts from when the
system was booted. I was using kexec to start the new kernel, and mftb
wasn't getting reset across the kexec.
I thought about this. It is a concern for pseries too, where LPARs
restart but the system doesn't, so mftb continues to run instead of
being reset.
I think we should be using sched_clock instead of mftb here.
Though we need it in a few more places, plus some cosmetic changes around it.
Note: some huge values show up even without the series on a few CPUs;
with the series they show up on most CPUs.
So I am planning to send out the fix below separately, with your
series as a dependency.
---
arch/powerpc/include/asm/accounting.h | 4 ++--
arch/powerpc/include/asm/cputime.h | 14 +++++++-------
arch/powerpc/kernel/time.c | 22 +++++++++++-----------
3 files changed, 20 insertions(+), 20 deletions(-)
diff --git a/arch/powerpc/include/asm/accounting.h b/arch/powerpc/include/asm/accounting.h
index 6d79c31700e2..50f120646e6d 100644
--- a/arch/powerpc/include/asm/accounting.h
+++ b/arch/powerpc/include/asm/accounting.h
@@ -21,8 +21,8 @@ struct cpu_accounting_data {
unsigned long steal_time;
unsigned long idle_time;
/* Internal counters */
- unsigned long starttime; /* TB value snapshot */
- unsigned long starttime_user; /* TB value on exit to usermode */
+ unsigned long starttime; /* Time value snapshot */
+ unsigned long starttime_user; /* Time value on exit to usermode */
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
unsigned long startspurr; /* SPURR value snapshot */
unsigned long utime_sspurr; /* ->user_time when ->startspurr set */
diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index aff858ca99c0..eb6b629b113f 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -20,9 +20,9 @@
#include <asm/time.h>
#include <asm/param.h>
#include <asm/firmware.h>
+#include <linux/sched/clock.h>
#ifdef __KERNEL__
-#define cputime_to_nsecs(cputime) tb_to_ns(cputime)
/*
* PPC64 uses PACA which is task independent for storing accounting data while
@@ -44,20 +44,20 @@
*/
static notrace inline void account_cpu_user_entry(void)
{
- unsigned long tb = mftb();
+ unsigned long now = sched_clock();
No way!
By doing that you'll kill performance for no reason. All we need when accounting time spent in kernel or in user is the difference between the time at entry and the time at exit, no matter what the time was at boot.
No. With this patch there will not be any performance difference.
All it does is use sched_clock() instead of mftb in those places.
In arch/powerpc/kernel/time.c we have sched_clock():
notrace unsigned long long sched_clock(void)
{
return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
}
It does the same mftb call, and accounts only the time after boot, which is
what /proc/stat should do as well.
"
the amount of time, measured in units of USER_HZ
(1/100ths of a second on most architectures
user (1) Time spent in user mode.
idle (4) Time spent in the idle task. This value
should be USER_HZ times the second entry in
the /proc/uptime pseudo-file.
"
/proc/uptime is based on sched_clock, so I infer /proc/stat should also show
values relative to the boot of the OS.
Also, sched_clock() returns nanoseconds, which implies a conversion from the timebase. That is pointless CPU consumption. The current implementation converts to nanoseconds at task switch, when vtime_flush() is called. Your change will now do it at every kernel entry and exit by calling sched_clock().
This change doesn't add any additional paths. Even without the patches, mftb would have
been called on every kernel entry/exit; see the mftb usage in account_cpu_user_entry/exit.
Now sched_clock is used instead of mftb, that's all; no additional entry/exit points.
And previously, when accounting, we would have done cputime_to_nsecs; now that conversion
is done automatically in sched_clock. So overall, computation-wise it should be the same.
What am I missing here?
Ok, let's try to explain in more detail:
While a process is running, it will enter and leave the kernel multiple times, without task switch. For instance for system calls or for interrupts.
At every kernel entry and exit, account_cpu_user_entry() and account_cpu_user_exit() are called. That's a very hot path.
I have added the following functions to see what the code looks like:
+
+void my_account_cpu_user_entry(void);
+void my_account_cpu_user_entry(void)
+{
+ account_cpu_user_entry();
+}
+
+void my_account_cpu_user_exit(void);
+void my_account_cpu_user_exit(void)
+{
+ account_cpu_user_exit();
+}
What we have today is very optimised:
00000148 <my_account_cpu_user_entry>:
148: 7d 0c 42 e6 mftb r8
14c: 80 e2 00 08 lwz r7,8(r2)
150: 81 22 00 28 lwz r9,40(r2)
154: 91 02 00 24 stw r8,36(r2)
158: 7d 29 38 50 subf r9,r9,r7
15c: 7d 29 42 14 add r9,r9,r8
160: 91 22 00 08 stw r9,8(r2)
164: 4e 80 00 20 blr
00000168 <my_account_cpu_user_exit>:
168: 7d 0c 42 e6 mftb r8
16c: 80 e2 00 0c lwz r7,12(r2)
170: 81 22 00 24 lwz r9,36(r2)
174: 91 02 00 28 stw r8,40(r2)
178: 7d 29 38 50 subf r9,r9,r7
17c: 7d 29 42 14 add r9,r9,r8
180: 91 22 00 0c stw r9,12(r2)
184: 4e 80 00 20 blr
With your change we now get a call to sched_clock() instead of a simple mftb:
00000154 <my_account_cpu_user_entry>:
154: 94 21 ff f0 stwu r1,-16(r1)
158: 7c 08 02 a6 mflr r0
15c: 90 01 00 14 stw r0,20(r1)
160: 48 00 00 01 bl 160 <my_account_cpu_user_entry+0xc>
160: R_PPC_REL24 sched_clock
164: 81 02 00 08 lwz r8,8(r2)
168: 81 22 00 28 lwz r9,40(r2)
16c: 90 82 00 24 stw r4,36(r2)
170: 7d 29 40 50 subf r9,r9,r8
174: 7d 29 22 14 add r9,r9,r4
178: 91 22 00 08 stw r9,8(r2)
17c: 80 01 00 14 lwz r0,20(r1)
180: 38 21 00 10 addi r1,r1,16
184: 7c 08 03 a6 mtlr r0
188: 4e 80 00 20 blr
0000018c <my_account_cpu_user_exit>:
18c: 94 21 ff f0 stwu r1,-16(r1)
190: 7c 08 02 a6 mflr r0
194: 90 01 00 14 stw r0,20(r1)
198: 48 00 00 01 bl 198 <my_account_cpu_user_exit+0xc>
198: R_PPC_REL24 sched_clock
19c: 81 02 00 0c lwz r8,12(r2)
1a0: 81 22 00 24 lwz r9,36(r2)
1a4: 90 82 00 28 stw r4,40(r2)
1a8: 7d 29 40 50 subf r9,r9,r8
1ac: 7d 29 22 14 add r9,r9,r4
1b0: 91 22 00 0c stw r9,12(r2)
1b4: 80 01 00 14 lwz r0,20(r1)
1b8: 38 21 00 10 addi r1,r1,16
1bc: 7c 08 03 a6 mtlr r0
1c0: 4e 80 00 20 blr
And sched_clock() is heavy: first it has the sequence mftbu/mftb/mftbu, and then it does an awful lot of calculation, including many multiplies:
000004d8 <sched_clock>:
4d8: 7d 2d 42 e6 mftbu r9
4dc: 7d 0c 42 e6 mftb r8
4e0: 7d 4d 42 e6 mftbu r10
4e4: 7c 09 50 40 cmplw r9,r10
4e8: 40 82 ff f0 bne 4d8 <sched_clock>
4ec: 3d 40 00 00 lis r10,0
4ee: R_PPC_ADDR16_HA .data..ro_after_init
4f0: 38 ca 00 00 addi r6,r10,0
4f2: R_PPC_ADDR16_LO .data..ro_after_init
4f4: 3c e0 00 00 lis r7,0
4f6: R_PPC_ADDR16_HA .data..read_mostly
4f8: 38 87 00 00 addi r4,r7,0
4fa: R_PPC_ADDR16_LO .data..read_mostly
4fc: 80 66 00 04 lwz r3,4(r6)
500: 80 e7 00 00 lwz r7,0(r7)
502: R_PPC_ADDR16_LO .data..read_mostly
504: 80 c4 00 04 lwz r6,4(r4)
508: 81 4a 00 00 lwz r10,0(r10)
50a: R_PPC_ADDR16_LO .data..ro_after_init
50c: 7c 63 40 10 subfc r3,r3,r8
510: 7d 0a 49 10 subfe r8,r10,r9
514: 7d 27 19 d6 mullw r9,r7,r3
518: 7d 43 30 16 mulhwu r10,r3,r6
51c: 7c 08 31 d6 mullw r0,r8,r6
520: 7d 4a 48 14 addc r10,r10,r9
524: 7c 67 18 16 mulhwu r3,r7,r3
528: 39 20 00 00 li r9,0
52c: 7c c8 30 16 mulhwu r6,r8,r6
530: 7c a9 49 14 adde r5,r9,r9
534: 7d 67 41 d6 mullw r11,r7,r8
538: 7d 4a 00 14 addc r10,r10,r0
53c: 7c a5 01 94 addze r5,r5
540: 7c 63 30 14 addc r3,r3,r6
544: 7d 29 49 14 adde r9,r9,r9
548: 80 84 00 08 lwz r4,8(r4)
54c: 7c 63 58 14 addc r3,r3,r11
550: 7c e7 40 16 mulhwu r7,r7,r8
554: 7d 29 01 94 addze r9,r9
558: 7c 63 28 14 addc r3,r3,r5
55c: 7d 29 39 14 adde r9,r9,r7
560: 35 44 ff e0 addic. r10,r4,-32
564: 41 80 00 10 blt 574 <sched_clock+0x9c>
568: 7c 63 50 30 slw r3,r3,r10
56c: 38 80 00 00 li r4,0
570: 4e 80 00 20 blr
574: 21 04 00 1f subfic r8,r4,31
578: 54 6a f8 7e srwi r10,r3,1
57c: 7d 29 20 30 slw r9,r9,r4
580: 7d 4a 44 30 srw r10,r10,r8
584: 7c 64 20 30 slw r4,r3,r4
588: 7d 43 4b 78 or r3,r10,r9
58c: 4e 80 00 20 blr
I think the difference is obvious, no need for benchmarking. We shall refrain from calling sched_clock() at every kernel entry/exit. Converting from timebase to nanoseconds only needs to be done in vtime_flush(), called by vtime_task_switch() at task switch.
Hope it is more explicit now.
Got it. So the main concern is the additional computation that sched_clock does,
not any additional paths per se.
How about we do the below instead? It adds only one subtraction per entry/exit
and achieves the same outcome.
---
diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index aff858ca99c0..7afba0202568 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -44,7 +44,7 @@
*/
static notrace inline void account_cpu_user_entry(void)
{
- unsigned long tb = mftb();
+ unsigned long tb = mftb() - get_boot_tb();
struct cpu_accounting_data *acct = raw_get_accounting(current);
acct->utime += (tb - acct->starttime_user);
@@ -53,7 +53,7 @@ static notrace inline void account_cpu_user_entry(void)
static notrace inline void account_cpu_user_exit(void)
{
- unsigned long tb = mftb();
+ unsigned long tb = mftb() - get_boot_tb();
struct cpu_accounting_data *acct = raw_get_accounting(current);
acct->stime += (tb - acct->starttime);
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 18506740f4a4..ff5524e6cdc7 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -215,7 +215,7 @@ static unsigned long vtime_delta(struct cpu_accounting_data *acct,
WARN_ON_ONCE(!irqs_disabled());
- now = mftb();
+ now = mftb() - get_boot_tb();
stime = now - acct->starttime;
acct->starttime = now;
@@ -388,7 +388,7 @@ void vtime_reset(void)
{
struct cpu_accounting_data *acct = get_accounting(current);
- acct->starttime = mftb();
+ acct->starttime = mftb() - get_boot_tb();
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIME
acct->startspurr = read_spurr(acct->starttime);
#endif