Re: [PATCH] locking/lockdep: skip irq save/restore in hardirq context in lock_release()
From: Waiman Long
Date: Tue Jun 30 2026 - 14:26:49 EST
On 6/30/26 4:13 AM, Deepanshu Kartikey wrote:
On Tue, Jun 30, 2026 at 10:28 AM Waiman Long <longman@xxxxxxxxxx> wrote:
As I have said previously, the only possible explanation that I can think of is speculative execution. I think the x86 sti instruction is not serializing. Depending on the actual processor, it may be possible for a CPU to speculatively execute an sti instruction and enable interrupts before it realizes that the speculation is incorrect and rewinds it. If so, adding an in_hardirq() conditional check may have just narrowed the window further so that the problem no longer shows up. So which processors are showing the test failures?
I looked at the generated code of raw_local_irq_restore():
./arch/x86/include/asm/irqflags.h:
146 return !(flags & X86_EFLAGS_IF);
0x00000000000082b9 <+9>: test $0x200,%edi
0x00000000000082bf <+15>: je 0x82c2 <cpuset_test+18>
42 asm volatile("sti": : :"memory");
0x00000000000082c1 <+17>: sti
kernel/cgroup/cpuset.c:
4553 }
0x00000000000082c2 <+18>: jmp 0x82c7
sti should only be executed if the saved flags have the IF bit set. In
hardirq context, the IF bit shouldn't be set. Is my interpretation correct?
Regards,
Longman
You are correct - in hardirq context, the IF bit in EFLAGS should
already be 0 (IRQs disabled by CPU on interrupt entry). Therefore
raw_local_irq_restore() should not execute sti.
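The check discussed here can be modeled in user space. This is only a minimal sketch, not kernel code: `struct model_cpu`, `model_sti()` and `model_irq_restore()` are hypothetical names invented for illustration, and only the 0x200 IF mask is taken from the disassembly. It shows that, architecturally, a restore whose saved flags have IF clear never reaches sti.

```c
#include <stdbool.h>

/* IF bit in EFLAGS; matches the `test $0x200,%edi` in the disassembly. */
#define MODEL_EFLAGS_IF 0x200UL

/* Hypothetical stand-in for CPU state; not a kernel structure. */
struct model_cpu {
	unsigned long eflags;   /* simulated EFLAGS register */
	unsigned int sti_count; /* how many times the modeled sti ran */
};

/* Same test as arch_irqs_disabled_flags(): IF clear means IRQs disabled. */
static bool model_irqs_disabled_flags(unsigned long flags)
{
	return !(flags & MODEL_EFLAGS_IF);
}

/* Modeled sti: set IF and count the call. */
static void model_sti(struct model_cpu *cpu)
{
	cpu->eflags |= MODEL_EFLAGS_IF;
	cpu->sti_count++;
}

/* Same shape as arch_local_irq_restore(): enable only if IF was saved as set. */
static void model_irq_restore(struct model_cpu *cpu, unsigned long flags)
{
	if (!model_irqs_disabled_flags(flags))
		model_sti(cpu);
}
```

Calling `model_irq_restore()` with saved flags of 0x2 (IF clear, as on hardirq entry) leaves `sti_count` at zero, while saved flags of 0x202 run the modeled sti once. The model says nothing about speculation; it only captures the architectural behavior.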
However, the syzkaller reproducer consistently triggers the RCU stall,
indicating a real issue exists. Our fix is correct regardless of the
root cause - by completely skipping the raw_local_irq_save/restore
dance in hardirq context, we avoid any potential issues in this path.
Hardirq handlers must never manipulate IRQ state mid-execution since
the CPU hardware manages it automatically on entry/exit. This is a
fundamental rule of interrupt handling.
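To make the comparison concrete, here is a user-space sketch of the two release paths. It is not the patch itself: `model_release_unguarded()`, `model_release_guarded()` and `struct model_cpu` are hypothetical names, and the lock bookkeeping is elided. It assumes the save step clears IF unconditionally and the restore step sets IF only when the saved flags had it set.

```c
#include <stdbool.h>

#define MODEL_EFLAGS_IF 0x200UL /* IF bit in EFLAGS */

/* Hypothetical stand-in for CPU state; not a kernel structure. */
struct model_cpu {
	unsigned long eflags;   /* simulated EFLAGS register */
	bool in_hardirq;        /* stands in for in_hardirq() */
	unsigned int cli_count; /* modeled cli executions */
	unsigned int sti_count; /* modeled sti executions */
};

/* Modeled irq save: return the old flags, then clear IF. */
static unsigned long model_irq_save(struct model_cpu *cpu)
{
	unsigned long flags = cpu->eflags;

	cpu->eflags &= ~MODEL_EFLAGS_IF;
	cpu->cli_count++;
	return flags;
}

/* Modeled irq restore: re-enable only if IF was set when saved. */
static void model_irq_restore(struct model_cpu *cpu, unsigned long flags)
{
	if (flags & MODEL_EFLAGS_IF) {
		cpu->eflags |= MODEL_EFLAGS_IF;
		cpu->sti_count++;
	}
}

/* Release path that always does the save/restore dance. */
static void model_release_unguarded(struct model_cpu *cpu)
{
	unsigned long flags = model_irq_save(cpu);

	/* ... lock bookkeeping would go here ... */
	model_irq_restore(cpu, flags);
}

/* Release path that skips save/restore in hardirq context. */
static void model_release_guarded(struct model_cpu *cpu)
{
	bool hardirq = cpu->in_hardirq;
	unsigned long flags = 0;

	if (!hardirq)
		flags = model_irq_save(cpu);
	/* ... lock bookkeeping would go here ... */
	if (!hardirq)
		model_irq_restore(cpu, flags);
}
```

Starting from IF clear with `in_hardirq` set, both paths end with IF still clear and zero modeled sti executions; the only difference is one extra modeled cli in the unguarded path. In this model the guard does not change the architectural outcome, so it cannot by itself explain why the guard makes the stall go away.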
If you have insights into the actual root cause, we'd appreciate
understanding it better.
Thank you for the thorough review.
As an experiment, you can try inserting some delay between the IF flag check and the actual sti instruction to see whether that also avoids the test failures, for example:
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 6f25de05ed58..174962fcc37c 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -155,8 +155,10 @@ static __always_inline int arch_irqs_disabled(void)
 static __always_inline void arch_local_irq_restore(unsigned long flags)
 {
-	if (!arch_irqs_disabled_flags(flags))
+	if (!arch_irqs_disabled_flags(flags)) {
+		smp_mb();
 		arch_local_irq_enable();
+	}
 }
 #endif /* !__ASSEMBLER__ */
If that turns out to be the real root cause, we will have to contact some Intel/AMD engineers with connections to the CPU hardware side to figure out the best way forward. Your patch is currently not mergeable without a clear root cause analysis (RCA).
Cheers,
Longman