regression due to soc_device_match not handling defer (Was: [PATCH v4 4/4] soc: imx8m: change to use platform driver)
From: Dominique MARTINET
Date: Mon Mar 29 2021 - 05:10:13 EST
Hi,
First thanks for the patch, it fixes the kexec hang I was looking at...
Unfortunately it means the soc is now init much later and other piece of
drivers that depend on querying the soc will fail, I am getting a BUG in
the caam crypto driver from 7d981405d0fd ("soc: imx8m: change to use
platform driver") + ce58459d8c7f ("arm64: dts: imx8m: add SoC ID
compatible") on the imx8mp evk as follow:
[ 2.575505] caam 30900000.crypto: caam pkc algorithms registered in /proc/crypto
[ 2.588986] caam 30900000.crypto: registering rng-caam
[ 2.594168] caam_jr 30901000.jr: job ring error: irqstate: 00000103
[ 2.600492] ------------[ cut here ]------------
[ 2.605109] kernel BUG at drivers/crypto/caam/jr.c:187!
[ 2.610338] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
[ 2.615829] Modules linked in:
[ 2.618895] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.12.0-rc5 #8
[ 2.625168] Hardware name: NXP i.MX8MPlus EVK board (DT)
[ 2.630482] pstate: 60000085 (nZCv daIf -PAN -UAO -TCO BTYPE=--)
[ 2.636493] pc : caam_jr_interrupt+0x150/0x154
[ 2.640944] lr : caam_jr_interrupt+0x150/0x154
[ 2.645396] sp : ffff800010003e90
[ 2.648709] x29: ffff800010003e90 x28: ffff800011f82e80
[ 2.654035] x27: ffff800011cd2000 x26: ffff0000c1988400
[ 2.659353] x25: ffff800011cd2000 x24: ffff800011f789e0
[ 2.664674] x23: 000000000000002e x22: ffff800012261000
[ 2.669994] x21: ffff0000c1979410 x20: ffff0000c1988a80
[ 2.675315] x19: 0000000000000103 x18: 0000000000000000
[ 2.680637] x17: 0000000000000007 x16: 000000000000000e
[ 2.685958] x15: 0000000000000030 x14: ffffffffffffffff
[ 2.691279] x13: ffff800090003aa7 x12: ffff800010003aaf
[ 2.696601] x11: 0000000000000003 x10: 0101010101010101
[ 2.701921] x9 : ffff8000100039d0 x8 : ffff800011fa3830
[ 2.707242] x7 : ffff800011ffb830 x6 : ffff800011ffb830
[ 2.712560] x5 : 0000000000000000 x4 : 0000000000000000
[ 2.717881] x3 : 0000000000000000 x2 : 0000000000000000
[ 2.723203] x1 : 0000000000000000 x0 : ffff800011f82e80
[ 2.728528] Call trace:
[ 2.730976] caam_jr_interrupt+0x150/0x154
[ 2.735080] __handle_irq_event_percpu+0x6c/0x280
[ 2.739791] handle_irq_event+0x70/0x160
[ 2.743716] handle_fasteoi_irq+0xb0/0x200
[ 2.747822] __handle_domain_irq+0x8c/0xf0
[ 2.751924] gic_handle_irq+0xd8/0x160
[ 2.755683] el1_irq+0xb4/0x180
[ 2.758829] arch_cpu_idle+0x18/0x30
[ 2.762412] default_idle_call+0x4c/0x1d0
[ 2.766431] do_idle+0x238/0x2b0
[ 2.769664] cpu_startup_entry+0x30/0x7c
[ 2.773595] rest_init+0xe4/0xf4
[ 2.776832] arch_call_rest_init+0x1c/0x28
[ 2.780937] start_kernel+0x570/0x5a8
[ 2.784602] 0x0
[ 2.786452] Code: aa1503e0 90005c41 91108021 940da95d (d4210000)
[ 2.792560] ---[ end trace 968b8515172abc2e ]---
[ 2.797181] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
[ 2.804580] SMP: stopping secondary CPUs
[ 2.808508] Kernel Offset: disabled
[ 2.811998] CPU features: 0x00240002,2000200c
[ 2.816360] Memory Limit: none
[ 2.819419] ---[ end Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt ]---
This particular crash can be fixed by making the caam driver delay as
well if the device isn't inited yet, e.g. this works:
--------
diff --git a/drivers/base/soc.c b/drivers/base/soc.c
index 0af5363a582c..f179f9e55b49 100644
--- a/drivers/base/soc.c
+++ b/drivers/base/soc.c
@@ -36,6 +36,7 @@ static DEVICE_ATTR(family, 0444, soc_info_show, NULL);
static DEVICE_ATTR(serial_number, 0444, soc_info_show, NULL);
static DEVICE_ATTR(soc_id, 0444, soc_info_show, NULL);
static DEVICE_ATTR(revision, 0444, soc_info_show, NULL);
+static int init_done;
struct device *soc_device_to_device(struct soc_device *soc_dev)
{
@@ -157,6 +158,7 @@ struct soc_device *soc_device_register(struct soc_device_attribute *soc_dev_attr
return ERR_PTR(ret);
}
+ init_done = true;
return soc_dev;
out3:
@@ -243,6 +245,9 @@ const struct soc_device_attribute *soc_device_match(
{
int ret = 0;
+ if (!init_done)
+ return ERR_PTR(-EPROBE_DEFER);
+
if (!matches)
return NULL;
diff --git a/drivers/crypto/caam/ctrl.c b/drivers/crypto/caam/ctrl.c
index ca0361b2dbb0..d08f8cc4131f 100644
--- a/drivers/crypto/caam/ctrl.c
+++ b/drivers/crypto/caam/ctrl.c
@@ -635,6 +635,9 @@ static int caam_probe(struct platform_device *pdev)
nprop = pdev->dev.of_node;
imx_soc_match = soc_device_match(caam_imx_soc_table);
+ if (IS_ERR(imx_soc_match))
+ return PTR_ERR(imx_soc_match);
+
caam_imx = (bool)imx_soc_match;
if (imx_soc_match) {
-------
But it obviously doesn't cover the ~50 (!) other soc_device_match users
in the code base which would all need to start handling potential error
return code.
(I also had problems with other drivers when trying to backport the
patch to the 5.4.70_2.3.0 imx kernel but I just gave up on that one)
I think having all drivers handle potential EPROBE_DEFER errors is the
best way forward, how do you propose to move on?
I can do some but have no way of testing most of these so am a bit
reluctant to...
Thanks,
--
Dominique