Re: [PATCH v6 1/1] MIPS: Fix idle VS timer enqueue

From: Maciej W. Rozycki
Date: Fri Mar 28 2025 - 06:43:08 EST

Next message: Maud Spierings | GOcontroll: "Re: [PATCH v3 5/8] arm64: dts: freescale: add Ka-Ro Electronics tx8p-ml81 COMy"
Previous message: Jagadeesh Kona: "Re: [PATCH v3 05/18] clk: qcom: common: Handle runtime power management in qcom_cc_really_probe"
In reply to: Maciej W. Rozycki: "Re: [PATCH v6 1/1] MIPS: Fix idle VS timer enqueue"
Next in thread: Maciej W. Rozycki: "Re: [PATCH v6 1/1] MIPS: Fix idle VS timer enqueue"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi Marco,

> > Whatever manual you quote it refers to MIPS Release 2, which is only
> > dated 2003
>
> About the MIPS manual, anyhow, it is "MIPS32 M4 Processor Core" (year 2008).
> Maybe I've also picked the wrong manual.

It has to be M4K there, and that isn't the most fortunate choice, because
it's a manual for the specific microarchitecture and then one that doesn't
usually run Linux, because it has no TLB. For MIPS Release 1 onwards the
architecture manuals are a better choice, but still they don't necessarily
get the details right for older legacy MIPS ISA revisions.

For your reference the architecture manuals are:

- "MIPS32 Architecture For Programmers, Volume I: Introduction to the
MIPS32 Architecture",

- "MIPS32 Architecture For Programmers, Volume II: The MIPS32 Instruction
Set",

- "MIPS32 Architecture For Programmers, Volume III: The MIPS32 Privileged
Resource Architecture",

and their MIPS64 counterparts, released repeatedly as the architecture
evolved with the document's major revision number matching the respective
architecture release. I do believe the most recent revisions continue
being available from the MIPS web site. But as the titles imply these
manuals document the architecture and not the intricacies of the MIPS
assembly language dialect or the various aspects of programming for the
architecture. For that you need another book.

> > Best is to avoid using a `.set noreorder' region in the first place.
> > But is it really needed here? Does the rollback area have to be of a
> > hardcoded size rather than one calculated by the assembler based on
> > actual machine code produced? It seems to me having it calculated would
> > reduce complexity here and let us use the EI instruction where available
> > as well.
>
> Well, considering the complexity and how the code looks fragile even with
> a small change, yes, that's likely better to avoid noreorder.

FAOD this is my general observation, regardless of its applicability
here.

> I think I'm going to need some guidance here.
> Please, correct me where something is wrong.
>
> 1)
> When you say "let us use the EI instruction where available" are you
> referring to do
> something like below?
>
> #if defined(CONFIG_CPU_HAS_DIEI)
> ei
> #else
> MFC0 t0, CP0_STATUS
> ori t0, 0x1f
> xori t0, 0x1e
> MTC0 t0, CP0_STATUS
> #endif

I think let's just simplify this stuff a lot and make use of the existing
code infrastructure. Since we move interrupt enabling into `__r4k_wait',
we can just get rid of the C wrapper the existence of which makes no sense
anymore along with the extra nesting of function calls, and have something
along the lines of:

.section .cpuidle.text,"ax"
/* Align to 32 bytes for the maximum idle interrupt region size. */
.align 5
LEAF(r4k_wait)
/* Keep the ISA bit clear for calculations on local labels here. */
0: .fill 0
/* Start of idle interrupt region. */
local_irq_enable
/*
* If an interrupt lands here, before going idle on the next
* instruction, we must *NOT* go idle since the interrupt could
* have set TIF_NEED_RESCHED or caused a timer to need resched.
* Fall through -- see rollback_handler below -- and have the
* idle loop take care of things.
*/
1: .fill 0
/* The R2 EI/EHB sequence takes 8 bytes, otherwise pad up. */
.if 1b - 0b > 32
.error "overlong idle interrupt region"
.elseif 1b - 0b > 8
.align 4
.endif
2: .fill 0
.equ r4k_wait_idle_size, 2b - 0b
/* End of idle interrupt region; size has to be a power of 2. */
.set MIPS_ISA_ARCH_LEVEL_RAW
r4k_wait_insn:
wait
.set mips0
local_irq_disable
jr ra
END(r4k_wait)
.previous

.macro BUILD_ROLLBACK_PROLOGUE handler
FEXPORT(rollback_\handler)
.set push
.set noat
MFC0 k0, CP0_EPC
/* Subtract 2 to let the ISA bit propagate through the mask. */
PTR_LA k1, r4k_wait_insn - 2
ori k0, r4k_wait_idle_size - 2
bne k0, k1, \handler
MTC0 k0, CP0_EPC
.set pop
.endm

There's some complication here coming from the need to factor in the ISA
bit in the microMIPS mode; something that hasn't been discussed so far.
The `.fill 0' approach is a hack and it has struck me that we need to add
a `.noinsn' pseudo-op to GAS for this purpose, complementing `.insn', but
we need to stick with the hack for now anyway as it will take years until
we can rely on a new feature in the assembler.

NB don't refer to a local label from a macro as such a reference may end
up pointing not where you want it to from the place the macro is pasted
at. For example with your v6 code BUILD_ROLLBACK_PROLOGUE refers to the
intended label in `__r4k_wait' when pasted for `handle_int', but then to
the label in `except_vec4' instead when pasted for `except_vec_vi' further
down.

> 2)
> Removing "noreorder" would let the compiler add "nops" where they are needed.

The assembler, not the compiler.

> But that still means the 3 ssnop and ehb are still needed, right?

Yes, in the pessimistic case, which the code above avoids where not
needed. E.g. for R2 we now have:

807f1700 <r4k_wait>:
807f1700: 41606020 ei
807f1704: 000000c0 ehb

807f1708 <r4k_wait_insn>:
807f1708: 42000020 wait
807f170c: 41606000 di
807f1710: 000000c0 ehb
807f1714: 03e00008 jr ra
807f1718: 00000000 nop
...

-- nice and sweet, and for the R10k likewise:

a80000000084e540 <r4k_wait>:
a80000000084e540: 400c6000 mfc0 t0,$12
a80000000084e544: 358c0001 ori t0,t0,0x1
a80000000084e548: 408c6000 mtc0 t0,$12
a80000000084e54c: 00000000 nop

a80000000084e550 <r4k_wait_insn>:
a80000000084e550: 42000020 wait
a80000000084e554: 400c6000 mfc0 t0,$12
a80000000084e558: 358c0001 ori t0,t0,0x1
a80000000084e55c: 398c0001 xori t0,t0,0x1
a80000000084e560: 408c6000 mtc0 t0,$12
a80000000084e564: 03e00008 jr ra
a80000000084e568: 00000000 nop
...

-- because it's fully interlocked, so no extra NOPs other than to pad the
idle interrupt region to a power-of-two size.

> My subsequent dumb question is: there is the guarantee that the
> compiler will not
> reorder / change something we did?

Not in this case; the assembler isn't that smart (which is why compiled
code is usually a better choice where feasible) and except for one case
all it can do is adding extra NOPs between instructions to avoid pipeline
hazards (and then ones coming from data dependencies in register use
only), including ones between non-adjacent instruction pairs.

The only exception are jumps/branches where the assembler will schedule
their delay slot where possible by swapping the jump/branch with the
immediately preceding instruction where no data dependency exists between
the two instructions. Otherwise the delay slot will be filled with a NOP.

For example with the code sequences above the last instruction produced
by `local_irq_disable' could be scheduled into the delay slot of `jr ra'.
It wouldn't be an issue here however and it doesn't actually happen with
GAS, likely because `local_irq_disable' is a user macro. For built-in
instruction macros such as `la' the swapping of the final instruction does
happen.

> This question also came after reading the manual you quoted (paragraph
> "Coprocessor Hazards"):
>
> "For example, after an mtc0 to the Status register which changes an
> interrupt mask bit,
> there will be two further instructions before the interrupt is
> actually enabled or disabled.
> [...]
> To cope with these situations usually requires the programmer to take explicit
> action to prevent the assembler from scheduling inappropriate
> instructions after a
> dangerous mtc0. This is done by using the .set noreorder directive,
> discussed below"

Technically it is correct and likely the original MIPSCO assembler from
1985 or one supplied with IRIX were smarter, but GAS won't itself ever
reorder instructions other than to fill branch delay slots, so we don't
have to be worried. Here's the relevant comment from GAS sources:

/* There are a lot of optimizations we could do that we don't.
In particular, we do not, in general, reorder instructions.
If you use gcc with optimization, it will reorder
instructions and generally do much more optimization then we
do here; repeating all that work in the assembler would only
benefit hand written assembly code, and does not seem worth
it. */

And in cases like this where we just don't want a bunch of instructions to
be moved across a certain point we don't actually have to schedule any
code by hand as a lone:

.set noreorder
.set reorder

barrier will do (or actually just a label should as well).

> 3)
> Considering the size is determined by the compiler, the check about
> the idle interrupt
> size region should not be needed, correct?

I gave it some thought and concluded that the interrupt handling path has
to be optimised for performance and the idle routine does not. Therefore
I think we need to stick to the power-of-two size for the idle interrupt
region, because in that case the check for an interrupt arriving within
`r4k_wait' but ahead of the WAIT instruction can be done with a pair of
ALU operations and a single branch. Anything more complex would require
more operations in the interrupt handling path.

> 4)
> ori and PTR_ADDIU should be removed of course from the rollback handler macro.

I can't imagine how we'd advance past WAIT without these instructions,
what do you have in mind?

> Can I have some hints about the needed change?
> Using QEMU is always working, so I'm not sure if what I will change is
> also correct.

Any verification with real hardware can only be based on the probability
of an interrupt arriving at the right time and the right conditions for a
pending timer. I'd expect such an event to be relatively rare, so for the
large part I think we need to rely on correct code generation rather than
run-time verification. Making use of the existing infrastructure rather
than having an ad-hoc handcoded variant improves robustness here and also
means no change will likely be needed should any further platform variant
of the various hazard resolution macros be added in the future.

NB how do you actually verify this stuff with QEMU? Is it by injecting
an interrupt by hand at a chosen code location via GDB attached to QEMU's
built-in debug stub?

> Many thanks in advance, also for your time!

You're welcome.

Below I've included a complete change based on the outline above. It
seems to do the right thing for a couple of my configurations, but I've
only eyeballed the resulting code and haven't tried running it. Most of
my hardware doesn't implement the WAIT instruction anyway.

The missing `.cpuidle.text' section assignment is a separate preexisting
problem and the fix might be worth splitting off into a preparatory patch,
for backporting or documentation purposes.

You can add my:

Signed-off-by: Maciej W. Rozycki <macro@xxxxxxxxxxx>

on top of yours when using this code, since it's still ultimately yours,
just massaged a little. FWIW I went ahead and chose to cook this piece up
myself since I realised how complex this issue actually is and that it
would take us forever to get all the individual aspects nailed over e-mail
communication.

Let me know if you find anything here unclear or have any questions or
comments.

Maciej

---
arch/mips/include/asm/idle.h | 3 --
arch/mips/kernel/genex.S | 59 ++++++++++++++++++++++++-------------------
arch/mips/kernel/idle.c | 7 -----
3 files changed, 34 insertions(+), 35 deletions(-)

linux-mips-idle-vs-timer.diff
Index: linux-macro/arch/mips/include/asm/idle.h
===================================================================
--- linux-macro.orig/arch/mips/include/asm/idle.h
+++ linux-macro/arch/mips/include/asm/idle.h
@@ -6,8 +6,7 @@
#include <linux/linkage.h>

extern void (*cpu_wait)(void);
-extern void r4k_wait(void);
-extern asmlinkage void __r4k_wait(void);
+extern asmlinkage void r4k_wait(void);
extern void r4k_wait_irqoff(void);

static inline int using_rollback_handler(void)
Index: linux-macro/arch/mips/kernel/genex.S
===================================================================
--- linux-macro.orig/arch/mips/kernel/genex.S
+++ linux-macro/arch/mips/kernel/genex.S
@@ -104,41 +104,48 @@ NESTED(except_vec3_r4000, 0, sp)

__FINIT

- .align 5 /* 32 byte rollback region */
-LEAF(__r4k_wait)
- .set push
- .set noreorder
- /* start of rollback region */
- LONG_L t0, TI_FLAGS($28)
- nop
- andi t0, _TIF_NEED_RESCHED
- bnez t0, 1f
- nop
- nop
- nop
-#ifdef CONFIG_CPU_MICROMIPS
- nop
- nop
- nop
- nop
-#endif
+ .section .cpuidle.text,"ax"
+ /* Align to 32 bytes for the maximum idle interrupt region size. */
+ .align 5
+LEAF(r4k_wait)
+ /* Keep the ISA bit clear for calculations on local labels here. */
+0: .fill 0
+ /* Start of idle interrupt region. */
+ local_irq_enable
+ /*
+ * If an interrupt lands here, before going idle on the next
+ * instruction, we must *NOT* go idle since the interrupt could
+ * have set TIF_NEED_RESCHED or caused a timer to need resched.
+ * Fall through -- see rollback_handler below -- and have the
+ * idle loop take care of things.
+ */
+1: .fill 0
+ /* The R2 EI/EHB sequence takes 8 bytes, otherwise pad up. */
+ .if 1b - 0b > 32
+ .error "overlong idle interrupt region"
+ .elseif 1b - 0b > 8
+ .align 4
+ .endif
+2: .fill 0
+ .equ r4k_wait_idle_size, 2b - 0b
+ /* End of idle interrupt region; size has to be a power of 2. */
.set MIPS_ISA_ARCH_LEVEL_RAW
+r4k_wait_insn:
wait
- /* end of rollback region (the region size must be power of two) */
-1:
+ .set mips0
+ local_irq_disable
jr ra
- nop
- .set pop
- END(__r4k_wait)
+ END(r4k_wait)
+ .previous

.macro BUILD_ROLLBACK_PROLOGUE handler
FEXPORT(rollback_\handler)
.set push
.set noat
MFC0 k0, CP0_EPC
- PTR_LA k1, __r4k_wait
- ori k0, 0x1f /* 32 byte rollback region */
- xori k0, 0x1f
+ /* Subtract 2 to let the ISA bit propagate through the mask. */
+ PTR_LA k1, r4k_wait_insn - 2
+ ori k0, r4k_wait_idle_size - 2
bne k0, k1, \handler
MTC0 k0, CP0_EPC
.set pop
Index: linux-macro/arch/mips/kernel/idle.c
===================================================================
--- linux-macro.orig/arch/mips/kernel/idle.c
+++ linux-macro/arch/mips/kernel/idle.c
@@ -35,13 +35,6 @@ static void __cpuidle r3081_wait(void)
write_c0_conf(cfg | R30XX_CONF_HALT);
}

-void __cpuidle r4k_wait(void)
-{
- raw_local_irq_enable();
- __r4k_wait();
- raw_local_irq_disable();
-}
-
/*
* This variant is preferable as it allows testing need_resched and going to
* sleep depending on the outcome atomically. Unfortunately the "It is

Next message: Maud Spierings | GOcontroll: "Re: [PATCH v3 5/8] arm64: dts: freescale: add Ka-Ro Electronics tx8p-ml81 COMy"
Previous message: Jagadeesh Kona: "Re: [PATCH v3 05/18] clk: qcom: common: Handle runtime power management in qcom_cc_really_probe"
In reply to: Maciej W. Rozycki: "Re: [PATCH v6 1/1] MIPS: Fix idle VS timer enqueue"
Next in thread: Maciej W. Rozycki: "Re: [PATCH v6 1/1] MIPS: Fix idle VS timer enqueue"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]