Question related to patch: x86/irq: Remove x86_io_apic_ops.set_affinity and related interfaces

From: Mika Westerberg
Date: Tue Sep 08 2015 - 06:41:00 EST


Hi Jiang and Thomas,

With recent kernels I'm getting crash when a GPIO interrupt triggers. For
unknown reasons I'm not able to capture the crash on the serial console so I
did following change to irq_move_masked_irq():

assert_raw_spin_locked(&desc->lock);

to

WARN_ON(!raw_spin_is_locked(&desc->lock));

The backtrace looks like:

[ 6.733640] WARNING: CPU: 0 PID: 5 at kernel/irq/migration.c:32 irq_move_masked_irq+0xb8/0xc0()
[ 6.768765] Modules linked in: x86_pkg_temp_thermal i2c_hid(+)
[ 6.775425] CPU: 0 PID: 5 Comm: kworker/0:0H Not tainted 4.2.0+ #124
[ 6.782639] ffffffff81747e59 ffff8801c3c03e18 ffffffff815d6165 0000000000000000
[ 6.791110] ffff8801c3c03e50 ffffffff81050c9d ffff8801bf00a600 ffff8801beb594c0
[ 6.799540] ffff8801beb594c0 0000000000000002 0000000042000000 ffff8801c3c03e60
[ 6.807989] Call Trace:
[ 6.810755] <IRQ> [<ffffffff815d6165>] dump_stack+0x4e/0x79
[ 6.817307] [<ffffffff81050c9d>] warn_slowpath_common+0x7d/0xb0
[ 6.824121] [<ffffffff81050d85>] warn_slowpath_null+0x15/0x20
[ 6.830736] [<ffffffff810a0a88>] irq_move_masked_irq+0xb8/0xc0
[ 6.837450] [<ffffffff8103c161>] ioapic_ack_level+0x111/0x130
[ 6.844079] [<ffffffff812bbfe8>] intel_gpio_irq_handler+0x148/0x1c0
[ 6.851306] [<ffffffff81006bfb>] handle_irq+0xab/0x180
[ 6.857235] [<ffffffff812a8a87>] ? debug_smp_processor_id+0x17/0x20
[ 6.864440] [<ffffffff810063c2>] do_IRQ+0x52/0xe0
[ 6.869877] [<ffffffff815dcfbf>] common_interrupt+0x7f/0x7f
[ 6.876292] <EOI> [<ffffffff815dbf6d>] ? _raw_write_unlock_irq+0xd/0x30
[ 6.884008] [<ffffffff815dbf99>] _raw_spin_unlock_irq+0x9/0x10
[ 6.890721] [<ffffffff810745bc>] finish_task_switch+0x7c/0x1a0
[ 6.897438] [<ffffffff815d7b9b>] __schedule+0x33b/0xb20
[ 6.903461] [<ffffffff8107cc77>] ? __enqueue_entity+0x67/0x70
[ 6.910079] [<ffffffff815d8408>] schedule+0x38/0x90
[ 6.915711] [<ffffffff815db3a8>] schedule_timeout+0x158/0x250
[ 6.922328] [<ffffffff812a8a87>] ? debug_smp_processor_id+0x17/0x20
[ 6.929534] [<ffffffff815d933f>] wait_for_completion_killable+0xaf/0x220
[ 6.937231] [<ffffffff81076d80>] ? wake_up_q+0x60/0x60
[ 6.943158] [<ffffffff810682d0>] ? process_one_work+0x4a0/0x4a0
[ 6.949970] [<ffffffff8106d5c2>] kthread_create_on_node+0xd2/0x170
[ 6.957079] [<ffffffff8129ac39>] ? snprintf+0x39/0x40
[ 6.962907] [<ffffffff810665b9>] create_worker+0xb9/0x190
[ 6.969130] [<ffffffff810685d3>] worker_thread+0x303/0x4b0
[ 6.975452] [<ffffffff810682d0>] ? process_one_work+0x4a0/0x4a0
[ 6.982262] [<ffffffff8106d8bf>] kthread+0xcf/0xf0
[ 6.987797] [<ffffffff815dbf99>] ? _raw_spin_unlock_irq+0x9/0x10
[ 6.994722] [<ffffffff8106d7f0>] ? kthread_worker_fn+0x190/0x190
[ 7.001636] [<ffffffff815dc86f>] ret_from_fork+0x3f/0x70
[ 7.007765] [<ffffffff8106d7f0>] ? kthread_worker_fn+0x190/0x190

The driver in question is drivers/pinctrl/intel/pinctrl-intel.c and the
corresponding code which triggers this is below:

static void intel_gpio_irq_handler(unsigned irq, struct irq_desc *desc)
{
struct gpio_chip *gc = irq_desc_get_handler_data(desc);
struct intel_pinctrl *pctrl = gpiochip_to_pinctrl(gc);
struct irq_chip *chip = irq_desc_get_chip(desc);
int i;

chained_irq_enter(chip, desc);

/* Need to check all communities for pending interrupts */
for (i = 0; i < pctrl->ncommunities; i++)
intel_gpio_community_irq_handler(gc, &pctrl->communities[i]);

chained_irq_exit(chip, desc);
}

So once chained_irq_exit() is called, it in turn calls chip->irq_eoi()
which in this case is ioapic_ack_level().

I'm sure such crash did not happen when pinctrl-intel.c was developed so I
started to investigate what might have changed. Note that it may be the
driver does something wrong and the crash is expected. However, many other
drivers do pretty much the same (with the exception that the parent IRQ
chip is not IO-APIC in most cases).

To me assert_raw_spin_locked(&desc->lock) looks valid as the function goes
and modifies desc right after the check. Also in intel_gpio_irq_handler()
desc->lock is not locked (not sure if it needs to be).

After "manual" bisection I found the commit that causes the crash to be
triggered:

commit aa5cb97f14a2dd5aefabed6538c35ebc087d7c24
Author: Jiang Liu <jiang.liu@xxxxxxxxxxxxxxx>
Date: Tue Apr 14 10:29:42 2015 +0800

x86/irq: Remove x86_io_apic_ops.set_affinity and related interfaces

Now there is no user of x86_io_apic_ops.set_affinity anymore, so remove
it.

It says that there are no users for x86_io_apic_ops.set_affinity but then
it does this:

- x86_io_apic_ops.set_affinity(idata, mask, false);
+ irq_set_affinity(irq, mask);

The difference is that x86_io_apic_ops.set_affinity() programs affinity
directly to the hardware (if I understand it right) but irq_set_affinity()
calls irqd_set_move_pending() which defers programming the hardware later.

Now when an interrupt triggers we end up calling irq_move_masked_irq() with
unlocked descriptor.

Without the above change we never do that and the crash does not happen.

Since I don't know much about genirq and IO-APIC code in particular, I
would like to get your input on how to solve the problem. Do I need to
change the pinctrl driver somehow to lock the descriptor or maybe the
commit above misses something?

It is really easy to trigger so please let me know if further debugging is
needed.

Thanks in advance.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/