Re: [PATCH] locking/lockdep: skip irq save/restore in hardirq context in lock_release()

From: Waiman Long

Date: Tue Jun 30 2026 - 14:26:49 EST



On 6/30/26 4:13 AM, Deepanshu Kartikey wrote:
On Tue, Jun 30, 2026 at 10:28 AM Waiman Long <longman@xxxxxxxxxx> wrote:

I looked at the generated code of raw_local_irq_restore():

./arch/x86/include/asm/irqflags.h:
146 return !(flags & X86_EFLAGS_IF);
0x00000000000082b9 <+9>: test $0x200,%edi
0x00000000000082bf <+15>: je 0x82c2 <cpuset_test+18>

42 asm volatile("sti": : :"memory");
0x00000000000082c1 <+17>: sti

kernel/cgroup/cpuset.c:
4553 }
0x00000000000082c2 <+18>: jmp 0x82c7

sti should only be executed if the saved flags have the IF bit set. In
hardirq context, the IF bit shouldn't be set. Is my interpretation correct?

Regards,
Longman

You are correct - in hardirq context, the IF bit in EFLAGS should
already be 0 (IRQs disabled by CPU on interrupt entry). Therefore
raw_local_irq_restore() should not execute sti.
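
For reference, the architectural behaviour described above can be sketched as a small user-space model. This is only an illustration under stated assumptions: every model_* name is hypothetical and not a kernel symbol, the live interrupt flag is a plain variable rather than EFLAGS, and the model says nothing about speculation or any other microarchitectural effect. It only shows that a save/restore pair entered with IF clear never reaches the sti path.

```c
#include <assert.h>
#include <stdbool.h>

#define X86_EFLAGS_IF 0x200UL	/* bit 9 of EFLAGS: interrupt enable flag */

/* Hypothetical stand-ins for CPU state; not kernel symbols. */
static bool model_irqs_enabled;		/* models the live IF bit */
static unsigned int model_sti_count;	/* how many times "sti" ran */

/* Mirrors the check in the disassembly: test $0x200 on the saved flags. */
static bool model_irqs_disabled_flags(unsigned long flags)
{
	return !(flags & X86_EFLAGS_IF);
}

/* Models sti: sets IF and records that it was executed. */
static void model_local_irq_enable(void)
{
	model_irqs_enabled = true;
	model_sti_count++;
}

/* Models save: capture the current IF bit, then disable interrupts. */
static unsigned long model_local_irq_save(void)
{
	unsigned long flags = model_irqs_enabled ? X86_EFLAGS_IF : 0;

	model_irqs_enabled = false;
	return flags;
}

/* Models restore: sti runs only when the saved flags had IF set. */
static void model_local_irq_restore(unsigned long flags)
{
	if (!model_irqs_disabled_flags(flags))
		model_local_irq_enable();
}

/*
 * Run one save/restore pair starting from the given IF state and
 * return how many times the modelled sti was executed.
 */
static unsigned int model_save_restore_pair(bool irqs_enabled_on_entry)
{
	unsigned long flags;

	model_irqs_enabled = irqs_enabled_on_entry;
	model_sti_count = 0;

	flags = model_local_irq_save();
	assert(!model_irqs_enabled);	/* critical section runs with IF clear */
	model_local_irq_restore(flags);

	return model_sti_count;
}
```

Calling model_save_restore_pair(false), which corresponds to the hardirq case, executes the modelled sti zero times and leaves interrupts disabled; model_save_restore_pair(true) executes it exactly once.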

However, the syzkaller reproducer consistently triggers the RCU stall,
indicating a real issue exists. Our fix is correct regardless of the
root cause - by completely skipping the raw_local_irq_save/restore
dance in hardirq context, we avoid any potential issues in this path.

Hardirq handlers must never manipulate IRQ state mid-execution since
the CPU hardware manages it automatically on entry/exit. This is a
fundamental rule of interrupt handling.

If you have insights into the actual root cause, we'd appreciate
understanding it better.

Thank you for the thorough review.

As I have said previously, the only possible explanation that I can think of is speculative execution. I think the x86 sti instruction is not serializing. Depending on the actual processor, it may be possible for a CPU to speculatively execute an sti instruction and enable interrupts before it realizes that the speculation was incorrect and rewinds it. If so, adding an in_hardirq() conditional check may have just narrowed the window further, so that the problem no longer shows up. So which processors are showing the test failures?

As an experiment, you can try inserting some delay between the IF flag check and the actual sti instruction to see whether that also avoids the test failures, for example:

diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index 6f25de05ed58..174962fcc37c 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -155,8 +155,10 @@ static __always_inline int arch_irqs_disabled(void)

 static __always_inline void arch_local_irq_restore(unsigned long flags)
 {
-       if (!arch_irqs_disabled_flags(flags))
+       if (!arch_irqs_disabled_flags(flags)) {
+               smp_mb();
                arch_local_irq_enable();
+       }
 }
 #endif /* !__ASSEMBLER__ */

If that is the real root cause, we will have to contact some Intel/AMD engineers with connections to their CPU hardware teams to figure out the best way forward. Your patch is not mergeable in its current form without a clear RCA.

Cheers,
Longman