Re: [PATCH 3/3] [BUGFIX] x86/x86_64: fix IRQ migration triggered active device IRQ interrruption

From: Yinghai Lu
Date: Sat Apr 11 2009 - 02:46:53 EST


On Wed, Apr 8, 2009 at 4:08 PM, Gary Hade <garyhade@xxxxxxxxxx> wrote:
> On Wed, Apr 08, 2009 at 03:03:49PM -0700, Yinghai Lu wrote:
>> On Wed, Apr 8, 2009 at 2:07 PM, Gary Hade <garyhade@xxxxxxxxxx> wrote:
>> > Impact: Eliminates an issue that can leave the system in an
>> >        unusable state.
>> >
>> > This patch addresses an issue where device generated IRQs
>> > are no longer seen by the kernel following IRQ affinity
>> > migration while the device is generating IRQs at a high rate.
>> > We have seen this problem happen when IRQ affinities are
>> > adjusted in response to CPU offlining but I believe it
>> > could also happen in during user initiated IRQ affinity
>> > changes unrelated to CPU offlining. e.g. while the
>> > irqbalance daemon is adjusting IRQ affinities when the
>> > system is heavily loaded.
>> >
>> > I have been able to consistently reproduce the problem on
>> > some of our systems by running the following script (VICTIM_IRQ
>> > specifies the IRQ for the aic94xx device) while a single instance
>> > of the command
>> >  # while true; do find / -exec file {} \;; done
>> > is keeping the filesystem activity and IRQ rate reasonably high.
>> >
>> > #!/bin/sh
>> >
>> > SYS_CPU_DIR=/sys/devices/system/cpu
>> > VICTIM_IRQ=25
>> > IRQ_MASK=f0
>> >
>> > iteration=0
>> > while true; do
>> >  echo $iteration
>> >  echo $IRQ_MASK > /proc/irq/$VICTIM_IRQ/smp_affinity
>> >  for cpudir in $SYS_CPU_DIR/cpu[1-9] $SYS_CPU_DIR/cpu??; do
>> >    echo 0 > $cpudir/online
>> >  done
>> >  for cpudir in $SYS_CPU_DIR/cpu[1-9] $SYS_CPU_DIR/cpu??; do
>> >    echo 1 > $cpudir/online
>> >  done
>> >  iteration=`expr $iteration + 1`
>> > done
>> >
>> > The root cause is a known issue already addressed for some
>> > code paths [e.g. ack_apic_level() and the now obsolete
>> > migrate_irq_remapped_level_desc()] where the ioapic can
>> > misbehave when the I/O redirection table register is written
>> > while the Remote IRR bit is set.
>> >
>> > The proposed fix uses the same avoidance method and much
>> > of same code that the Interrupt Remapping code previously
>> > used to avoid the same problem.
>> >
>> > Signed-off-by: Gary Hade <garyhade@xxxxxxxxxx>
>> >
>> > ---
>> >  arch/x86/kernel/apic/io_apic.c |   72 ++++++++++++++++++++++++++++++-
>> >  1 file changed, 71 insertions(+), 1 deletion(-)
>> >
>> > Index: linux-2.6.30-rc1/arch/x86/kernel/apic/io_apic.c
>> > ===================================================================
>> > --- linux-2.6.30-rc1.orig/arch/x86/kernel/apic/io_apic.c        2009-04-08 09:24:11.000000000 -0700
>> > +++ linux-2.6.30-rc1/arch/x86/kernel/apic/io_apic.c     2009-04-08 09:24:23.000000000 -0700
>> > @@ -2331,7 +2331,8 @@ set_desc_affinity(struct irq_desc *desc,
>> >  }
>> >
>> >  static void
>> > -set_ioapic_affinity_irq_desc(struct irq_desc *desc, const struct cpumask *mask)
>> > +set_ioapic_irq_affinity_desc(struct irq_desc *desc,
>> > +                            const struct cpumask *mask)
>> >  {
>> >        struct irq_cfg *cfg;
>> >        unsigned long flags;
>> > @@ -2352,6 +2353,75 @@ set_ioapic_affinity_irq_desc(struct irq_
>> >  }
>> >
>> >  static void
>> > +delayed_irq_move(struct work_struct *work)
>> > +{
>> > +       unsigned int irq;
>> > +       struct irq_desc *desc;
>> > +
>> > +       for_each_irq_desc(irq, desc) {
>> > +               if (desc->status & IRQ_MOVE_PENDING) {
>> > +                       unsigned long flags;
>> > +
>> > +                       spin_lock_irqsave(&desc->lock, flags);
>> > +                       if (!desc->chip->set_affinity ||
>> > +                           !(desc->status & IRQ_MOVE_PENDING)) {
>> > +                               desc->status &= ~IRQ_MOVE_PENDING;
>> > +                               spin_unlock_irqrestore(&desc->lock, flags);
>> > +                               continue;
>> > +                       }
>> > +
>> > +                       desc->chip->set_affinity(irq, desc->pending_mask);
>> > +                       spin_unlock_irqrestore(&desc->lock, flags);
>> > +               }
>> > +       }
>> > +}
>> > +
>> > +static DECLARE_DELAYED_WORK(delayed_irq_move_work, delayed_irq_move);
>> > +
>> > +static void
>> > +set_ioapic_irq_affinity_level_desc(struct irq_desc *desc)
>> > +{
>> > +
>> > +       struct irq_cfg *cfg = desc->chip_data;
>> > +
>> > +       mask_IO_APIC_irq_desc(desc);
>> > +
>> > +       if (io_apic_level_ack_pending(cfg)) {
>> > +               /*
>> > +                * Interrupt in progress. Migrating irq now will change
>> > +                * the vector information in the IO-APIC RTE which will
>> > +                * confuse the EOI broadcast performed by cpu.
>> > +                * So, we delay the irq migration.
>> > +                */
>> > +               schedule_delayed_work(&delayed_irq_move_work, 1);
>> > +               goto unmask;
>> > +       }
>> > +
>> > +       /* Interrupt not in progress. we can change the vector
>> > +        * information in the IO-APIC RTE. */
>> > +       set_ioapic_irq_affinity_desc(desc, desc->pending_mask);
>> > +
>> > +       desc->status &= ~IRQ_MOVE_PENDING;
>> > +       cpumask_clear(desc->pending_mask);
>> > +
>> > +unmask:
>> > +       unmask_IO_APIC_irq_desc(desc);
>> > +}
>> > +
>> > +static void
>> > +set_ioapic_affinity_irq_desc(struct irq_desc *desc,
>> > +                            const struct cpumask *mask)
>> > +{
>> > +       if (desc->status & IRQ_LEVEL) {
>> > +               desc->status |= IRQ_MOVE_PENDING;
>> > +               cpumask_copy(desc->pending_mask, mask);
>> > +               set_ioapic_irq_affinity_level_desc(desc);
>> > +               return;
>> > +       }
>> > +       set_ioapic_irq_affinity_desc(desc, mask);
>> > +}
>> > +
>> > +static void
>> >  set_ioapic_affinity_irq(unsigned int irq, const struct cpumask *mask)
>> >  {
>> >        struct irq_desc *desc;
>> > --
>>
>> it seems, ack_apic_level() already checked io_apic_level_ack_pending()
>>
>>                 cfg = desc->chip_data;
>>                 if (!io_apic_level_ack_pending(cfg))
>>                         move_masked_irq(irq);
>>                 unmask_IO_APIC_irq_desc(desc);
>
> Yes, I have actually observed instances where the command
>   `echo $IRQ_MASK > /proc/irq/$VICTIM_IRQ/smp_affinity`
> in the above test script failed to modify the affinity due to
> a debug printk confirmed non-zero return from io_apic_level_ack_pending()
> at this location.  However, this is definitely not catching all the cases.
> This was confirmed this with a printk in __target_IO_APIC_irq() that would
> sometimes shows the Remote IRR bit set before the io_apic_modify() call.
> e.g.
> __target_IO_APIC_irq: XXX before io_apic_modify irq=25 vector=0xd9 reg=0x1e0d9


that is strange, before that irq is masked,

#ifdef CONFIG_GENERIC_PENDING_IRQ
/* If we are moving the irq we need to mask it */
if (unlikely(desc->status & IRQ_MOVE_PENDING)) {
do_unmask_irq = 1;
mask_IO_APIC_irq_desc(desc);
}
#endif

YH

>
>>
>> also calling path without irremap - for level triggered.
>> 1. user use /proc/irq/xx/smp_affinity to set affinity
>> 2. kernel/irq/manager.c::irq_set_affinity() is called, and
>> desc->pending_mask is changed
>> 3a. one irq is arrived: arch/x86/kernel/irq.c::do_IRQ() is called,
>> hand_irq==>desc->handle_irq
>>     ==>kernel/irq/chip.c::handle_fasteoi_irq() ==> call desc->chip->eoi()
>> 3b. desc->chip->eoi() aka
>> arch/x86/kernel/apic/io_apic.c::ack_apic_level() will call
>> mask_IO_APIC_irq_desc() and kernel/irq/migration.c::move_masked_irq()
>> 3c. move_masked_irq will call desc->set_affinity(), vector and target
>> in io_apic controller and IDT are changed
>>     desc->affinity is changed at last. ( for desc moving,
>> move_in_progress and extra_move flag is set too.)
>> 3d then ack_apic_level() will unmask the irq.
>> 3e. finished other call in handle_fasteoi_irq()
>
> Need to think about this.  I'll get back to you.
>
> Thanks!
>
> Gary
>
> --
> Gary Hade
> System x Enablement
> IBM Linux Technology Center
> 503-578-4503  IBM T/L: 775-4503
> garyhade@xxxxxxxxxx
> http://www.ibm.com/linux/ltc
>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/