Re: [PATCH] interrupt: discover and disable very frequent interrupts
From: Zhang Xincheng
Date: Mon Oct 17 2022 - 21:07:30 EST
> >
> > +config FREQUENT_IRQ_DEBUG
> > + bool "Support for finding and reporting frequent interrupt"
> > + default n
> > + help
> > +
>
> Pointless newline.
>
Thank you very much for your advice, I get it.
> > + This is a mechanism to detect and report that interrupts
> > + are triggered too frequently.
> > +
> > +config COUNT_PER_SECOND
> > + int "Interrupt limit per second"
> > + depends on FREQUENT_IRQ_DEBUG
> > + default "2000"
> > + help
>
> 2000 interrupts per second is just a hillarious low default. Trivial to
> reach with networking. Aside of that on systems where the per CPU timer
> interrupt goes through this code, that's trivial to exceed with
> something simple like a periodic timer with a 250us interval.
>
This limit is really too low on some high-performance PCs, and I neglected.
> > +#ifdef CONFIG_FREQUENT_IRQ_DEBUG
> > +#define COUNT_PER_SECOND_MASK 0x0000ffff
> > +#define DURATION_LIMIT_MASK 0xffff0000
> > +#define DURATION_LIMIT_COUNT 0x00010000
> > +#define DURATION_LIMIT_OFFSET 16
> > +static unsigned int count_per_second = CONFIG_COUNT_PER_SECOND;
> > +static unsigned int duration_limit = CONFIG_DURATION_LIMIT;
> > +static bool disable_frequent_irq;
> > +#endif /* CONFIG_FREQUENT_IRQ_DEBUG */
> > +
> > /*
> > * We wait here for a poller to finish.
> > *
> > @@ -189,18 +199,16 @@ static inline int bad_action_ret(irqreturn_t action_ret)
> > * (The other 100-of-100,000 interrupts may have been a correctly
> > * functioning device sharing an IRQ with the failing one)
> > */
> > -static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret)
> > +static void __report_bad_irq(struct irq_desc *desc, irqreturn_t action_ret, const char *msg)
> > {
> > unsigned int irq = irq_desc_get_irq(desc);
> > struct irqaction *action;
> > unsigned long flags;
> >
> > if (bad_action_ret(action_ret)) {
> > - printk(KERN_ERR "irq event %d: bogus return value %x\n",
> > - irq, action_ret);
> > + printk(msg, irq, action_ret);
>
> This wants to be pr_err() and that change needs to be split out into a
> seperate patch if at all.
>
This is a good suggestion, I get it.
> > +#ifdef CONFIG_FREQUENT_IRQ_DEBUG
> > +/*
> > + * Some bad hardware will trigger interrupts very frequently, which will
> > + * cause the CPU to process hardware interrupts all the time. We found
> > + * and reported it, and disabling it is optional.
> > + */
> > +void report_frequent_irq(struct irq_desc *desc, irqreturn_t action_ret)
>
> static, no?
>
Yes, indeed it should be static.
> > +{
> > + if (desc->have_reported)
> > + return;
> > +
> > + if ((desc->gap_count & DURATION_LIMIT_MASK) == 0)
>
> What's the point of this mask dance here? Use seperate variables. This
> is unreadable and overoptimized for no value.
>
This mask is probably not really needed.
> Also why is a simple count per second not sufficient? Why do you need
> the extra duration limit?
>
The extra duration limit is is increased to ignore transient conditions.
> > + desc->gap_time = get_jiffies_64();
>
> Why does this need 64bit jiffies? 32bit are plenty enough.
>
Yes,32bit are plenty enough.
> > +
> > + desc->gap_count++;
> > +
> > + if ((desc->gap_count & COUNT_PER_SECOND_MASK) >= count_per_second) {
> > + if ((get_jiffies_64() - desc->gap_time) < HZ) {
> > + desc->gap_count += DURATION_LIMIT_COUNT;
> > + desc->gap_count &= DURATION_LIMIT_MASK;
> > + } else {
> > + desc->gap_count = 0;
> > + }
> > +
> > + if ((desc->gap_count >> DURATION_LIMIT_OFFSET) >= duration_limit) {
> > + __report_bad_irq(desc, action_ret, KERN_ERR "irq %d: triggered too "
> > + "frequently\n");
> > + desc->have_reported = true;
> > + if (disable_frequent_irq)
> > + irq_disable(desc);
>
> How is this rate limiting? This is simply disabling the interrupt.
>
This does not limit the frequency of an interrupt, it just reports an
interrupt that is triggered very frequently and may not be normal.
As for how the administrator will deal with this in the future, it is
up to the administrator.
>
> So again if your limit is too narrow you might simply disable the wrong
> interrupt and render the machine useless.
>
> So if enabled in Kconfig it must be default off and you need a command
> line parameter to turn it on, but TBH I'm less than convinced that this
> is actually useful for general purpose debugging in it's current form
> simply because it is hard to get the limit right.
>
At present, this mechanism needs to artificially give a suitable limit,
which is really not very easy to use. But it's really hard to automatically
get a suitable limit for different machines, and I need to reconsider.
Thanks,
Zhang Xincheng