Re: [PATCH v3 1/2] spi: spi-fsl-dspi: Fix external abort on interrupt in resume or exit paths
From: Krzysztof Kozlowski
Date: Mon Jun 22 2020 - 06:31:16 EST
On Fri, Jun 19, 2020 at 12:24:56AM +0300, Vladimir Oltean wrote:
> Hi Krzysztof,
>
> On Tue, 16 Jun 2020 at 12:42, Krzysztof Kozlowski <krzk@xxxxxxxxxx> wrote:
> >
> > If shared interrupt comes late, during probe error path or device remove
> > (could be triggered with CONFIG_DEBUG_SHIRQ), the interrupt handler
> > dspi_interrupt() will access registers with the clock being disabled.
> > This leads to external abort on non-linefetch on Toradex Colibri VF50
> > module (with Vybrid VF5xx):
> >
> > $ echo 4002d000.spi > /sys/devices/platform/soc/40000000.bus/4002d000.spi/driver/unbind
> >
> > Unhandled fault: external abort on non-linefetch (0x1008) at 0x8887f02c
> > Internal error: : 1008 [#1] ARM
> > Hardware name: Freescale Vybrid VF5xx/VF6xx (Device Tree)
> > Backtrace:
> > (regmap_mmio_read32le)
> > (regmap_mmio_read)
> > (_regmap_bus_reg_read)
> > (_regmap_read)
> > (regmap_read)
> > (dspi_interrupt)
> > (free_irq)
> > (devm_irq_release)
> > (release_nodes)
> > (devres_release_all)
> > (device_release_driver_internal)
> >
> > The resource-managed framework should not be used for shared interrupt
> > handling, because the interrupt handler might be called after releasing
> > other resources and disabling clocks.
> >
> > Similar bug could happen during suspend - the shared interrupt handler
> > could be invoked after suspending the device. Each device sharing this
> > interrupt line should disable the IRQ during suspend so handler will be
> > invoked only in following cases:
> > 1. None suspended,
> > 2. All devices resumed.
> >
> > Fixes: 349ad66c0ab0 ("spi:Add Freescale DSPI driver for Vybrid VF610 platform")
> > Cc: <stable@xxxxxxxxxxxxxxx>
> > Signed-off-by: Krzysztof Kozlowski <krzk@xxxxxxxxxx>
> >
> > ---
> >
> > Changes since v2:
> > 1. Go back to v1 and use non-devm interface,
> > 2. Fix also suspend/resume paths.
> >
> > Changes since v1:
> > 1. Disable the IRQ instead of using non-devm interface.
> > ---
> > drivers/spi/spi-fsl-dspi.c | 17 +++++++++++++----
> > 1 file changed, 13 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/spi/spi-fsl-dspi.c b/drivers/spi/spi-fsl-dspi.c
> > index 58190c94561f..7ecc90ec8f2f 100644
> > --- a/drivers/spi/spi-fsl-dspi.c
> > +++ b/drivers/spi/spi-fsl-dspi.c
> > @@ -1109,6 +1109,8 @@ static int dspi_suspend(struct device *dev)
> > struct spi_controller *ctlr = dev_get_drvdata(dev);
> > struct fsl_dspi *dspi = spi_controller_get_devdata(ctlr);
> >
> > + if (dspi->irq > 0)
>
> Since the line right above "goto poll_mode" is executing "dspi->irq =
> 0;" on error, all these checks could have been simplified as "if
> (dspi->irq)" which is slightly easier on the eye.
Indeed.
>
>
> > + disable_irq(dspi->irq);
> > spi_controller_suspend(ctlr);
> > clk_disable_unprepare(dspi->clk);
> >
> > @@ -1129,6 +1131,8 @@ static int dspi_resume(struct device *dev)
> > if (ret)
> > return ret;
> > spi_controller_resume(ctlr);
> > + if (dspi->irq > 0)
> > + enable_irq(dspi->irq);
> >
> > return 0;
> > }
> > @@ -1385,8 +1389,8 @@ static int dspi_probe(struct platform_device *pdev)
> > goto poll_mode;
> > }
> >
> > - ret = devm_request_irq(&pdev->dev, dspi->irq, dspi_interrupt,
> > - IRQF_SHARED, pdev->name, dspi);
> > + ret = request_threaded_irq(dspi->irq, dspi_interrupt, NULL,
> > + IRQF_SHARED, pdev->name, dspi);
> > if (ret < 0) {
> > dev_err(&pdev->dev, "Unable to attach DSPI interrupt\n");
> > goto out_clk_put;
> > @@ -1400,7 +1404,7 @@ static int dspi_probe(struct platform_device *pdev)
> > ret = dspi_request_dma(dspi, res->start);
> > if (ret < 0) {
> > dev_err(&pdev->dev, "can't get dma channels\n");
> > - goto out_clk_put;
> > + goto out_free_irq;
> > }
> > }
> >
> > @@ -1415,11 +1419,14 @@ static int dspi_probe(struct platform_device *pdev)
> > ret = spi_register_controller(ctlr);
> > if (ret != 0) {
> > dev_err(&pdev->dev, "Problem registering DSPI ctlr\n");
> > - goto out_clk_put;
> > + goto out_free_irq;
> > }
> >
> > return ret;
> >
> > +out_free_irq:
> > + if (dspi->irq > 0)
> > + free_irq(dspi->irq, dspi);
> > out_clk_put:
> > clk_disable_unprepare(dspi->clk);
> > out_ctlr_put:
> > @@ -1435,6 +1442,8 @@ static int dspi_remove(struct platform_device *pdev)
> >
> > /* Disconnect from the SPI framework */
> > dspi_release_dma(dspi);
> > + if (dspi->irq > 0)
> > + free_irq(dspi->irq, dspi);
>
> There is a subtle but important bug here: by waiting for the current
> interrupt to finish and removing the handler, you are effectively
> causing dspi_transfer_one_message to deadlock. The way this driver
> works (and I'm sure many others) is that it transmits a SPI buffer
> through a train of interrupts: each completion of a chunk triggers the
> transmission of the next one, which will eventually raise an IRQ and
> the process continues. But if you just disable the IRQ like that, you
> effectively chop the interrupt train in the middle, and
> dspi_transfer_one_message will remain stuck in
> wait_for_completion(&dspi->xfer_done).
> The reason why you don't do this (while I very much do) is that you
> seem to be on Vybrid which uses DMA mode, and for DMA, the DSPI
> interrupt line doesn't really do much of anything at all (which makes
> it a bit funny that you're spending so much time working on it).
> Instead, the completion events go from the DSPI controller directly to
> the DMA controller, requesting for more data, and that's where IRQs
> are raised to the CPU.
> So, to avoid this deadlock, I would suggest you move
> spi_unregister_controller as the first thing that gets called in the
> teardown process. And yes, this should be a bugfix patch in itself.
You're right. It also makes sense from the reverse-probe-order point of
view.
>
> And while I'm at it, let me add another one: this recent commit:
>
> commit dc234825997ec6ff05980ca9e2204f4ac3f8d695
> Author: Peng Ma <peng.ma@xxxxxxx>
> Date: Fri Apr 24 14:12:16 2020 +0800
>
> spi: spi-fsl-dspi: Adding shutdown hook
>
> We need to ensure dspi controller could be stopped in order for kexec
> to start the next kernel.
> So add the shutdown operation support.
>
> Signed-off-by: Peng Ma <peng.ma@xxxxxxx>
> Link: https://lore.kernel.org/r/20200424061216.27445-1-peng.ma@xxxxxxx
> Signed-off-by: Mark Brown <broonie@xxxxxxxxxx>
>
> is causing some very similar issues: if you look at the code it adds,
> it disables the FIFOs and halts the module, which is bad for the same
> reason that it will hang an ongoing interrupt train in
> dspi_transfer_one_message:
>
> [17159.264026] INFO: task init:1870 blocked for more than 120 seconds.
> [17159.270385] Not tainted 5.8.0-rc1-00133-g923e4b5032dd-dirty #208
> [17159.276953] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
> disables this message.
> [17159.284828] init D 0 1870 1 0x00000008
> [17159.290353] Call trace:
> [17159.292828] __switch_to+0x108/0x158
> [17159.296433] __schedule+0x320/0x870
> [17159.299933] schedule+0x50/0x110
> [17159.303188] schedule_timeout+0x35c/0x438
> [17159.307229] wait_for_completion+0xb0/0x148
> [17159.311444] dspi_transfer_one_message+0x8c/0x4d0
> [17159.316180] __spi_pump_messages+0x434/0x9a8
> [17159.320480] __spi_sync+0x2ec/0x3a8
> [17159.323994] spi_sync+0x3c/0x60
> [17159.327149] spi_sync_transfer+0x94/0xb8
> [17159.331101] sja1105_xfer.isra.0+0x26c/0x2f8
> [17159.335402] sja1105_xfer_buf+0x20/0x30
> [17159.339267] sja1105_dynamic_config_write+0x18c/0x1e0
> [17159.344353] sja1105_bridge_stp_state_set+0x58/0xd0
> [17159.349265] dsa_port_set_state+0x74/0xd0
> [17159.353303] dsa_port_set_state_now+0x28/0x58
> [17159.357690] dsa_port_disable_rt+0x74/0x78
> [17159.361818] dsa_slave_close+0x2c/0xd0
> [17159.365601] __dev_close_many+0xcc/0x150
> [17159.369553] dev_close_many+0x8c/0x130
> [17159.373330] rollback_registered_many+0x128/0x550
> [17159.378066] unregister_netdevice_queue+0x98/0x120
> [17159.382891] unregister_netdev+0x2c/0x40
> [17159.386843] dsa_slave_destroy+0x4c/0x80
> [17159.390795] dsa_port_teardown+0xa0/0xd0
> [17159.394746] dsa_tree_teardown_switches+0x40/0x98
> [17159.399481] dsa_unregister_switch+0xbc/0x170
> [17159.403868] sja1105_remove+0x20/0x30
> [17159.407557] spi_drv_remove+0x34/0x58
> [17159.411247] device_release_driver_internal+0x104/0x1d8
> [17159.416506] device_release_driver+0x20/0x30
> [17159.420808] bus_remove_device+0xdc/0x168
> [17159.424847] device_del+0x15c/0x3b0
> [17159.428363] device_unregister+0x28/0x80
> [17159.432313] spi_unregister_device+0x8c/0xb0
> [17159.436613] __unregister+0x18/0x28
> [17159.440130] device_for_each_child+0x64/0xb0
> [17159.444430] spi_unregister_controller+0x44/0x140
> [17159.449168] dspi_shutdown+0x88/0x98
> [17159.452770] platform_drv_shutdown+0x28/0x38
> [17159.457072] device_shutdown+0x168/0x340
> [17159.461024] kernel_restart_prepare+0x40/0x50
> [17159.465416] kernel_restart+0x20/0x70
> [17159.469107] __do_sys_reboot+0x22c/0x258
> [17159.473058] __arm64_sys_reboot+0x2c/0x38
> [17159.477098] el0_svc_common.constprop.0+0x7c/0x180
> [17159.481921] do_el0_svc+0x2c/0x98
> [17159.485261] el0_sync_handler+0x9c/0x1b8
> [17159.489213] el0_sync+0x158/0x180
> [17159.492556] INFO: lockdep is turned off.
> [17159.496508] Kernel panic - not syncing: hung_task: blocked tasks
> [17159.502537] CPU: 1 PID: 25 Comm: khungtaskd Not tainted
> 5.8.0-rc1-00133-g923e4b5032dd-dirty #208
> [17159.516767] Call trace:
> [17159.519218] dump_backtrace+0x0/0x1e0
> [17159.522889] show_stack+0x20/0x30
> [17159.526214] dump_stack+0xec/0x158
> [17159.529624] panic+0x16c/0x37c
> [17159.532686] watchdog+0x3d8/0x7f8
> [17159.536008] kthread+0x158/0x168
> [17159.539245] ret_from_fork+0x10/0x1c
> [17159.542844] Kernel Offset: 0x5bd125880000 from 0xffff800010000000
> [17159.548957] PHYS_OFFSET: 0xffffafdb00000000
> [17159.553151] CPU features: 0x240022,21806008
> [17159.557346] Memory Limit: none
> [17159.560414] Rebooting in 3 seconds..
>
> So for this third issue, what I would suggest would be to merge
> dspi_shutdown with the reworked version of dspi_remove (i.e. to move
> "/* Disable RX and TX */" and "/* Stop Running */" to dspi_remove,
> right before clk_disable_unprepare, and then just call dspi_remove
> from dspi_shutdown. It would end up looking like this:
>
> static int dspi_remove(struct platform_device *pdev)
> {
> struct spi_controller *ctlr = platform_get_drvdata(pdev);
> struct fsl_dspi *dspi = spi_controller_get_devdata(ctlr);
>
> /* Disconnect from the SPI framework */
> spi_unregister_controller(dspi->ctlr);
>
> dspi_release_dma(dspi);
> if (dspi->irq > 0)
> free_irq(dspi->irq, dspi);
>
> /* Disable RX and TX */
> regmap_update_bits(dspi->regmap, SPI_MCR,
> SPI_MCR_DIS_TXF | SPI_MCR_DIS_RXF,
> SPI_MCR_DIS_TXF | SPI_MCR_DIS_RXF);
>
> /* Stop Running */
> regmap_update_bits(dspi->regmap, SPI_MCR, SPI_MCR_HALT, SPI_MCR_HALT);
>
> clk_disable_unprepare(dspi->clk);
>
> return 0;
> }
>
> static void dspi_shutdown(struct platform_device *pdev)
> {
> dspi_remove(pdev);
> }
Yes, good point. The shutdown also lacks the interrupt freeing in my
patch.
Best regards,
Krzysztof
>
> If you think it's too complicated or that you can't test, let me know
> and I can take over the patch series.
>
> > clk_disable_unprepare(dspi->clk);
> > spi_unregister_controller(dspi->ctlr);
> >
> > --
> > 2.7.4
> >
>
> Thanks,
> -Vladimir