Re: Question related to patch: x86/irq: Remove x86_io_apic_ops.set_affinity and related interfaces

From: Jiang Liu
Date: Tue Sep 08 2015 - 08:44:53 EST


On 2015/9/8 18:40, Mika Westerberg wrote:
> Hi Jiang and Thomas,
>
> With recent kernels I'm getting crash when a GPIO interrupt triggers. For
> unknown reasons I'm not able to capture the crash on the serial console so I
> did following change to irq_move_masked_irq():
>
> assert_raw_spin_locked(&desc->lock);
>
> to
>
> WARN_ON(!raw_spin_is_locked(&desc->lock));
>
> The backtrace looks like:
>
> [ 6.733640] WARNING: CPU: 0 PID: 5 at kernel/irq/migration.c:32 irq_move_masked_irq+0xb8/0xc0()
> [ 6.768765] Modules linked in: x86_pkg_temp_thermal i2c_hid(+)
> [ 6.775425] CPU: 0 PID: 5 Comm: kworker/0:0H Not tainted 4.2.0+ #124
> [ 6.782639] ffffffff81747e59 ffff8801c3c03e18 ffffffff815d6165 0000000000000000
> [ 6.791110] ffff8801c3c03e50 ffffffff81050c9d ffff8801bf00a600 ffff8801beb594c0
> [ 6.799540] ffff8801beb594c0 0000000000000002 0000000042000000 ffff8801c3c03e60
> [ 6.807989] Call Trace:
> [ 6.810755] <IRQ> [<ffffffff815d6165>] dump_stack+0x4e/0x79
> [ 6.817307] [<ffffffff81050c9d>] warn_slowpath_common+0x7d/0xb0
> [ 6.824121] [<ffffffff81050d85>] warn_slowpath_null+0x15/0x20
> [ 6.830736] [<ffffffff810a0a88>] irq_move_masked_irq+0xb8/0xc0
> [ 6.837450] [<ffffffff8103c161>] ioapic_ack_level+0x111/0x130
> [ 6.844079] [<ffffffff812bbfe8>] intel_gpio_irq_handler+0x148/0x1c0
> [ 6.851306] [<ffffffff81006bfb>] handle_irq+0xab/0x180
> [ 6.857235] [<ffffffff812a8a87>] ? debug_smp_processor_id+0x17/0x20
> [ 6.864440] [<ffffffff810063c2>] do_IRQ+0x52/0xe0
> [ 6.869877] [<ffffffff815dcfbf>] common_interrupt+0x7f/0x7f
> [ 6.876292] <EOI> [<ffffffff815dbf6d>] ? _raw_write_unlock_irq+0xd/0x30
> [ 6.884008] [<ffffffff815dbf99>] _raw_spin_unlock_irq+0x9/0x10
> [ 6.890721] [<ffffffff810745bc>] finish_task_switch+0x7c/0x1a0
> [ 6.897438] [<ffffffff815d7b9b>] __schedule+0x33b/0xb20
> [ 6.903461] [<ffffffff8107cc77>] ? __enqueue_entity+0x67/0x70
> [ 6.910079] [<ffffffff815d8408>] schedule+0x38/0x90
> [ 6.915711] [<ffffffff815db3a8>] schedule_timeout+0x158/0x250
> [ 6.922328] [<ffffffff812a8a87>] ? debug_smp_processor_id+0x17/0x20
> [ 6.929534] [<ffffffff815d933f>] wait_for_completion_killable+0xaf/0x220
> [ 6.937231] [<ffffffff81076d80>] ? wake_up_q+0x60/0x60
> [ 6.943158] [<ffffffff810682d0>] ? process_one_work+0x4a0/0x4a0
> [ 6.949970] [<ffffffff8106d5c2>] kthread_create_on_node+0xd2/0x170
> [ 6.957079] [<ffffffff8129ac39>] ? snprintf+0x39/0x40
> [ 6.962907] [<ffffffff810665b9>] create_worker+0xb9/0x190
> [ 6.969130] [<ffffffff810685d3>] worker_thread+0x303/0x4b0
> [ 6.975452] [<ffffffff810682d0>] ? process_one_work+0x4a0/0x4a0
> [ 6.982262] [<ffffffff8106d8bf>] kthread+0xcf/0xf0
> [ 6.987797] [<ffffffff815dbf99>] ? _raw_spin_unlock_irq+0x9/0x10
> [ 6.994722] [<ffffffff8106d7f0>] ? kthread_worker_fn+0x190/0x190
> [ 7.001636] [<ffffffff815dc86f>] ret_from_fork+0x3f/0x70
> [ 7.007765] [<ffffffff8106d7f0>] ? kthread_worker_fn+0x190/0x190
>
> The driver in question is drivers/pinctrl/intel/pinctrl-intel.c and the
> corresponding code which triggers this is below:
>
> static void intel_gpio_irq_handler(unsigned irq, struct irq_desc *desc)
> {
> struct gpio_chip *gc = irq_desc_get_handler_data(desc);
> struct intel_pinctrl *pctrl = gpiochip_to_pinctrl(gc);
> struct irq_chip *chip = irq_desc_get_chip(desc);
> int i;
>
> chained_irq_enter(chip, desc);
>
> /* Need to check all communities for pending interrupts */
> for (i = 0; i < pctrl->ncommunities; i++)
> intel_gpio_community_irq_handler(gc, &pctrl->communities[i]);
>
> chained_irq_exit(chip, desc);
> }
>
> So once chained_irq_exit() is called, it in turn calls chip->irq_eoi()
> which in this case is ioapic_ack_level().
>
> I'm sure such crash did not happen when pinctrl-intel.c was developed so I
> started to investigate what might have changed. Note that it may be the
> driver does something wrong and the crash is expected. However, many other
> drivers do pretty much the same (with the exception that the parent IRQ
> chip is not IO-APIC in most cases).
>
> To me assert_raw_spin_locked(&desc->lock) looks valid as the function goes
> and modifies desc right after the check. Also in intel_gpio_irq_handler()
> desc->lock is not locked (not sure if it needs to be).
>
> After "manual" bisection I found the commit that causes the crash to be
> triggered:
>
> commit aa5cb97f14a2dd5aefabed6538c35ebc087d7c24
> Author: Jiang Liu <jiang.liu@xxxxxxxxxxxxxxx>
> Date: Tue Apr 14 10:29:42 2015 +0800
>
> x86/irq: Remove x86_io_apic_ops.set_affinity and related interfaces
>
> Now there is no user of x86_io_apic_ops.set_affinity anymore, so remove
> it.
>
> It says that there are no users for x86_io_apic_ops.set_affinity but then
> it does this:
>
> - x86_io_apic_ops.set_affinity(idata, mask, false);
> + irq_set_affinity(irq, mask);
>
> The difference is that x86_io_apic_ops.set_affinity() programs affinity
> directly to the hardware (if I understand it right) but irq_set_affinity()
> calls irqd_set_move_pending() which defers programming the hardware later.

Hi Mika,
I feel this is the root cause and will investigate it tomorrow.
Thanks!
Gerry

>
> Now when an interrupt triggers we end up calling irq_move_masked_irq() with
> unlocked descriptor.
>
> Without the above change we never do that and the crash does not happen.
>
> Since I don't know much about genirq and IO-APIC code in particular, I
> would like to get your input on how to solve the problem. Do I need to
> change the pinctrl driver somehow to lock the descriptor or maybe the
> commit above misses something?
>
> It is really easy to trigger so please let me know if further debugging is
> needed.
>
> Thanks in advance.
>
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/