Re: [GIT PULL] arm64 fixes for 5.15-rc5
From: Pingfan Liu
Date: Sun Oct 17 2021 - 04:00:09 EST
Hi Linus,
I am not in the Cc list, so just aware of it the day before yesterday.
On Mon, Oct 11, 2021 at 12:54:49PM -0700, Linus Torvalds wrote:
> On Mon, Oct 11, 2021 at 3:47 AM Mark Rutland <mark.rutland@xxxxxxx> wrote:
> >
> > Sorry; I agree that commit messages don't explain this thoroughly. I can
> > go and rework the commit messages to clarify things, if you'd like?
>
> So I really hate that patch, even with explanation.
>
> Having the Kconfig option for "do the right thing" is not how this
> should be fixed.
>
> Either the generic code is generic, or it isn't.
>
> And if the generic code doesn't work for arm64, we shouldn't add a
> random Kconfig option for "this architecture wants different
> behavior".
>
> > The TL;DR is that a bunch of constraints conspire such that we can't
> > defer accounting things to the irqdomain or irqchip code without making
> > this even more complicated/fragile, and many architectures get this
> > subtly wrong today -- it's not that arm64 is necessarily much different
> > from other architectures using this code, just that we're trying to
> > solve this first.
>
> So then I think the fix is to just say "accounting in this generic
> function is wrong, and the accounting needs to be moved to the
> callers".
>
> That's particularly true if you think arm64 is only the only actually
> _tested_ case that gets this wrong, and other architectures most
> likely have the exact same issue, but you only fixed it for arm64.
>
> So do it unconditionally - possibly even using a coccinelle script if
> there are lots of callers.
>
> Because generic code that just isn't generic, but randomly does
> different things based on subtle Kconfig options that are different
> for different architectures is bad, bad, bad.
>
When composing the patch, I failed to realize it, but now, I learn.
> And yes, I realize that we've had that kind of stuff before (and we
> have odd Kconfig option testing in that irqdesc.c file elsewhere), but
> the Kconfig options we currently have are mostly either
>
> (a) actual real honest-to-goodness config options with semantic
> meaning (ie things like CONFIG_SMP and CONFIG_NUMA)
>
> (b) really ugly workarounds for odd special module exports (that
> CONFIG_KVM_BOOK3S_64_HV_MODULE thing for irq_to_desc() that we *tried*
> but failed to remove).
>
> And so the reason I really hate that patch is that it introduces a new
> "different architectures randomly and inexplicably do different
> things, and the generic behavior is very different on arm64 than it is
> elsewhere".
>
> That's just the worst kind of hack to me.
>
> And in this case, it's really *horribly* hard to see what the call
> chain is. It all ends up being actively obfuscated and obscured
> through that 'handle_arch_irq' function pointer, that is sometimes set
> through set_handle_irq(), and sometimes set directly.
>
> I really think that if the rule is "we can't do accounting in
> handle_domain_irq(), because it's too late for arm64", then the fix
> really should be to just not do that.
>
> Move the irq_enter()/irq_exit() to the callers - quite possibly far up
> the call chain to the root of it all, and just say "architecture code
> needs to do this in the low-level code before calling
> handle_arch_irq".
>
> Or, if it turns out that 99% of callers do want it - and don't have
> the issues arm64 has - maybe we can have a helper wrapper that does
> the irq_enter/irq_exit, and another version that doesn't do it, and at
> least it's clear to the callers which one it is. But it looks like
> particularly with the odd indirection through that handle_arch_irq
> function, it's best to just say "this needs to be done".
>
> What architectures actually care? From what I can follow, it's really
> GENERIC_IRQ_MULTI_HANDLER that ends up doing this all, and then arm64
> has it's own special case for no obvious reason (why isn't it using
> GENERIC_IRQ_MULTI_HANDLER that seems to do the EXACT same thing in
> generic code except for a special "default != NULL" case?)
>
> Anyway, it _looks_ to me like the pattern is very simple:
>
> Step 1:
> - remove irq_enter/irq_exit from handle_domain_irq(), move it to all callers.
>
> This clearly doesn't change anything at all, but also doesn't fix the
> problem you have. But it's easy to verify that the code is the same
> before-and-after.
>
> Step 2 is the pattern matching step:
>
> - if the caller of handle_domain_irq() ends up being a function that
> is registered with set_handle_irq(), then we
> (a) remove the irq_enter/irq_exit from it
> (b) add it to the architectures that use handle_arch_irq.
> (c) make sure that if there are other callers of it (not through
> handle_arch_irq) we move that irq_enter/irq_exit into them too
>
> I _suspect_ - but didn't check - that Step 2(c) doesn't actually
> exist. But who knows.
>
> It really looks like there is a very tight connection between "uses
> handle_domain_irq()" and "uses handle_arch_irq/set_handle_irq()". No?
>
> Is this a bit more work? Yes.
>
> Would this fix arm64? Yes - it ends up effectively doing what you did.
>
> Would this fix _other_ architectures doing the same thing that you
> suspect are broken? YES. Instead of leaving them randomly broken.
>
> And most importantly, it would make the rules for "generic" code
> clear, and actually generic.
>
> Now, it may be that I'm wrong. I'm willing to be convinced, and if you
> beat me over the head enough I guess I can take that pull request and
> just be unhappy about it.
>
I had thought if any arch adapts GENERIC_ENTRY, then handle_domain_irq()
can be a bug. As GENERIC_ENTRY is a not _random_ config option, so try
to re-anchor the config depending on GENERIC_ENTRY.
But finally I gave up, since there is no direct link between them at a
glance. And what is more, as you said, "the rules for "generic" code
clear, and actually generic", so it is better to go in that way.
I think Mark has started the work or I will be happy to re-work on these
patches.
Thanks,
Pingfan