RE: [EXT] regression due to soc_device_match not handling defer (Was: [PATCH v4 4/4] soc: imx8m: change to use platform driver)
From: Alice Guo (OSS)
Date: Mon Mar 29 2021 - 22:42:22 EST
Hi,
Thanks for reporting this issue, I'll check and add a fix to handle defer probe.
Best regards,
Alice Guo
> -----Original Message-----
> From: Dominique MARTINET <dominique.martinet@xxxxxxxxxxxxxxxxx>
> Sent: 2021年3月29日 17:09
> To: Alice Guo <alice.guo@xxxxxxx>; Shawn Guo <shawnguo@xxxxxxxxxx>;
> Krzysztof Kozlowski <krzk@xxxxxxxxxx>
> Cc: robh+dt@xxxxxxxxxx; devicetree@xxxxxxxxxxxxxxx; Peng Fan
> <peng.fan@xxxxxxx>; dl-linux-imx <linux-imx@xxxxxxx>;
> linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx
> Subject: [EXT] regression due to soc_device_match not handling defer (Was:
> [PATCH v4 4/4] soc: imx8m: change to use platform driver)
>
> Caution: EXT Email
>
> Hi,
>
> First thanks for the patch, it fixes the kexec hang I was looking at...
>
> Unfortunately it means the soc is now init much later and other piece of drivers
> that depend on querying the soc will fail, I am getting a BUG in the caam crypto
> driver from 7d981405d0fd ("soc: imx8m: change to use platform driver") +
> ce58459d8c7f ("arm64: dts: imx8m: add SoC ID
> compatible") on the imx8mp evk as follow:
>
> [ 2.575505] caam 30900000.crypto: caam pkc algorithms registered in
> /proc/crypto
> [ 2.588986] caam 30900000.crypto: registering rng-caam
> [ 2.594168] caam_jr 30901000.jr: job ring error: irqstate: 00000103
> [ 2.600492] ------------[ cut here ]------------
> [ 2.605109] kernel BUG at drivers/crypto/caam/jr.c:187!
> [ 2.610338] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
> [ 2.615829] Modules linked in:
> [ 2.618895] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc5 #8
> [ 2.625168] Hardware name: NXP i.MX8MPlus EVK board (DT)
> [ 2.630482] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
> [ 2.636493] pc : caam_jr_interrupt+0x150/0x154
> [ 2.640944] lr : caam_jr_interrupt+0x150/0x154
> [ 2.645396] sp : ffff800010003e90
> [ 2.648709] x29: ffff800010003e90 x28: ffff800011f82e80
> [ 2.654035] x27: ffff800011cd2000 x26: ffff0000c1988400
> [ 2.659353] x25: ffff800011cd2000 x24: ffff800011f789e0
> [ 2.664674] x23: 000000000000002e x22: ffff800012261000
> [ 2.669994] x21: ffff0000c1979410 x20: ffff0000c1988a80
> [ 2.675315] x19: 0000000000000103 x18: 0000000000000000
> [ 2.680637] x17: 0000000000000007 x16: 000000000000000e
> [ 2.685958] x15: 0000000000000030 x14: ffffffffffffffff
> [ 2.691279] x13: ffff800090003aa7 x12: ffff800010003aaf
> [ 2.696601] x11: 0000000000000003 x10: 0101010101010101
> [ 2.701921] x9 : ffff8000100039d0 x8 : ffff800011fa3830
> [ 2.707242] x7 : ffff800011ffb830 x6 : ffff800011ffb830
> [ 2.712560] x5 : 0000000000000000 x4 : 0000000000000000
> [ 2.717881] x3 : 0000000000000000 x2 : 0000000000000000
> [ 2.723203] x1 : 0000000000000000 x0 : ffff800011f82e80
> [ 2.728528] Call trace:
> [ 2.730976] caam_jr_interrupt+0x150/0x154
> [ 2.735080] __handle_irq_event_percpu+0x6c/0x280
> [ 2.739791] handle_irq_event+0x70/0x160
> [ 2.743716] handle_fasteoi_irq+0xb0/0x200
> [ 2.747822] __handle_domain_irq+0x8c/0xf0
> [ 2.751924] gic_handle_irq+0xd8/0x160
> [ 2.755683] el1_irq+0xb4/0x180
> [ 2.758829] arch_cpu_idle+0x18/0x30
> [ 2.762412] default_idle_call+0x4c/0x1d0
> [ 2.766431] do_idle+0x238/0x2b0
> [ 2.769664] cpu_startup_entry+0x30/0x7c
> [ 2.773595] rest_init+0xe4/0xf4
> [ 2.776832] arch_call_rest_init+0x1c/0x28
> [ 2.780937] start_kernel+0x570/0x5a8
> [ 2.784602] 0x0
> [ 2.786452] Code: aa1503e0 90005c41 91108021 940da95d (d4210000)
> [ 2.792560] ---[ end trace 968b8515172abc2e ]---
> [ 2.797181] Kernel panic - not syncing: Oops - BUG: Fatal exception in
> interrupt
> [ 2.804580] SMP: stopping secondary CPUs
> [ 2.808508] Kernel Offset: disabled
> [ 2.811998] CPU features: 0x00240002,2000200c
> [ 2.816360] Memory Limit: none
> [ 2.819419] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception
> in interrupt ]---
>
>
> This particular crash can be fixed by making the caam driver delay as well if the
> device isn't inited yet, e.g. this works:
> --------
> diff --git a/drivers/base/soc.c b/drivers/base/soc.c index
> 0af5363a582c..f179f9e55b49 100644
> --- a/drivers/base/soc.c
> +++ b/drivers/base/soc.c
> @@ -36,6 +36,7 @@ static DEVICE_ATTR(family, 0444,
> soc_info_show, NULL);
> static DEVICE_ATTR(serial_number, 0444, soc_info_show, NULL);
> static DEVICE_ATTR(soc_id, 0444, soc_info_show, NULL);
> static DEVICE_ATTR(revision, 0444, soc_info_show, NULL);
> +static int init_done;
>
> struct device *soc_device_to_device(struct soc_device *soc_dev) { @@
> -157,6 +158,7 @@ struct soc_device *soc_device_register(struct
> soc_device_attribute *soc_dev_attr
> return ERR_PTR(ret);
> }
>
> + init_done = true;
> return soc_dev;
>
> out3:
> @@ -243,6 +245,9 @@ const struct soc_device_attribute
> *soc_device_match( {
> int ret = 0;
>
> + if (!init_done)
> + return ERR_PTR(-EPROBE_DEFER);
> +
> if (!matches)
> return NULL;
>
> diff --git a/drivers/crypto/caam/ctrl.c b/drivers/crypto/caam/ctrl.c index
> ca0361b2dbb0..d08f8cc4131f 100644
> --- a/drivers/crypto/caam/ctrl.c
> +++ b/drivers/crypto/caam/ctrl.c
> @@ -635,6 +635,9 @@ static int caam_probe(struct platform_device *pdev)
> nprop = pdev->dev.of_node;
>
> imx_soc_match = soc_device_match(caam_imx_soc_table);
> + if (IS_ERR(imx_soc_match))
> + return PTR_ERR(imx_soc_match);
> +
> caam_imx = (bool)imx_soc_match;
>
> if (imx_soc_match) {
> -------
>
> But it obviously doesn't cover the ~50 (!) other soc_device_match users in the
> code base which would all need to start handling potential error return code.
>
> (I also had problems with other drivers when trying to backport the patch to the
> 5.4.70_2.3.0 imx kernel but I just gave up on that one)
>
>
> I think having all drivers handle potential EPROBE_DEFER errors is the best way
> forward, how do you propose to move on?
> I can do some but have no way of testing most of these so am a bit reluctant
> to...
>
> Thanks,
> --
> Dominique