Re: [PATCH 0/2] arm64: Allow the rescheduling IPI to bypass irq_enter/exit
From: Thomas Gleixner
Date: Fri Nov 20 2020 - 09:17:36 EST
Marc,
On Fri, Nov 20 2020 at 09:20, Marc Zyngier wrote:
> On 2020-11-03 20:32, Thomas Gleixner wrote:
>> On Sun, Nov 01 2020 at 13:14, Marc Zyngier wrote:
>
> More seriously, it seems to me that we have a bit of a
> cross-architecture disconnect here. I have been trying to join the
> dots between what you describe above, and the behaviour of arm64 (and
> probably a large number of the non-x86 architectures), and I feel
> massively confused.
>
> Up to 5.9, our handling of the rescheduling IPI was "do as little as
> possible": decode the interrupt from the lowest possible level (the
> root irqchip), call into arch code, end-up in scheduler_ipi(), the end.
>
> No lockdep, no RCU, no nothing.
>
> What changed? Have we missed some radical change in the way the core
> kernel expects the arch code to do thing? I'm aware of the
> kernel/entry/common.c stuff, which implements most of the goodies you
> mention, but this effectively is x86-only at the moment.
>
> If arm64 has forever been broken, I'd really like to know and fix it.
So with the pre 5.8 scheduler IPI we had scheduler_ipi() doing wonderful
things
scheduler_ipi()
...
if (llist_empty(&this_rq()->wake_list) && !got_nohz_idle_kick())
return;
irq_enter();
...
irq_exit();
When Peter and I started to investigate the correctness and robustness
of syscalls, interrupts and exceptions versus RCU, instrumentation,
breakpoints for live patching etc., we stumbled over this and a lot of
other things.
Especially instrumentation had grown warts over time to deal with the
problem that it can hit into a RCU idle section. That started to get
worse which made us look deeper and the decision was to have strict
non-instrumentable sections which call out into instrumentable sections
_after_ establishing state. That moved all of the context tracking, RCU,
tracing muck out into C code which is what we have in kernel/entry/ now.
Back to the scheduler IPI. 5.8 made the scheduler IPI an almost empty
function which just folds need resched on architectures which need that.
Of course when I wrote the last reply I had the current x86
implementation in mind which does:
DEFINE_IDTENTRY_SYSVEC_SIMPLE(sysvec_reschedule_ipi)
{
ack_APIC_irq();
trace_reschedule_entry(RESCHEDULE_VECTOR);
inc_irq_stat(irq_resched_count);
scheduler_ipi();
trace_reschedule_exit(RESCHEDULE_VECTOR);
}
ack_APIC_irq() can be traced and the tracepoints of course want a proper
established context too.
Now you surely could do:
ASM-IPI-enter:
asm_ack_irq();
asm_fold_need_resched();
reti();
But that's not enough, because the point of the scheduler_ipi is
unsurprisingly to reevaluate need_resched() and to schedule. So you get:
ASM-IPI-enter:
asm_ack_irq();
asm_fold_need_resched()
if (!asm_need_resched())
reti();
/* context tracking, rcu, .... */
handle_enter_from_user_or_kernel();
preempt_schedule_irq();
/* context tracking, rcu, .... */
handle_exit_to_user_or_kernel();
reti();
We did not do that because we have the tracepoints there and I couldn't
find a reason why having yet another slightly different codepath to get
straight vs. context tracking and RCU was desirable. So we ended up
with:
ASM-IPI-enter:
asm_enter();
irqentry_enter() {
handle_enter_from_user_or_kernel();
/* Start of safe and instrumentable section */
}
__irq_enter_raw();
sysvec_scheduler_ipi();
__irq_exit_raw();
irqentry_exit() {
if (need_resched())
preempt_schedule_irq();
/* End of safe and instrumentable section */
handle_exit_to_user_or_kernel();
}
asm_exit();
The only difference to other IPIs is that it avoids the accounting
overhead because accounting the almost empty scheduler_ipi() function is
pretty much pointless.
Hope that clarifies it.
> If arm64 has forever been broken, I'd really like to know and fix it.
Define broken. It kinda works with all the warts and bolts in tracing,
context tracking and other places. Is it entirely correct? Probably not.
The scheduler IPI is trivial compared to the more interesting ones like
NMI, MCE, breakpoint, debug exceptions etc. We found quite some
interesting issues with all of that muck during our 'noinstr' chase.
Thanks,
tglx