Re: [PATCH v2 0/9] deferred_probe_timeout logic clean up

From: Geert Uytterhoeven
Date: Tue Jun 07 2022 - 16:08:36 EST


Hi Saravana,

On Wed, Jun 1, 2022 at 12:46 PM Saravana Kannan <saravanak@xxxxxxxxxx> wrote:
> This series is based on linux-next + these 2 small patches applies on top:
> https://lore.kernel.org/lkml/20220526034609.480766-1-saravanak@xxxxxxxxxx/
>
> A lot of the deferred_probe_timeout logic is redundant with
> fw_devlink=on. Also, enabling deferred_probe_timeout by default breaks
> a few cases.
>
> This series tries to delete the redundant logic, simplify the frameworks
> that use driver_deferred_probe_check_state(), enable
> deferred_probe_timeout=10 by default, and fixes the nfsroot failure
> case.
>
> The overall idea of this series is to replace the global behavior of
> driver_deferred_probe_check_state() where all devices give up waiting on
> supplier at the same time with a more granular behavior:
>
> 1. Devices with all their suppliers successfully probed by late_initcall
> probe as usual and avoid unnecessary deferred probe attempts.
>
> 2. At or after late_initcall, in cases where boot would break because of
> fw_devlink=on being strict about the ordering, we
>
> a. Temporarily relax the enforcement to probe any unprobed devices
> that can probe successfully in the current state of the system.
> For example, when we boot with a NFS rootfs and no network device
> has probed.
> b. Go back to enforcing the ordering for any devices that haven't
> probed.
>
> 3. After deferred probe timeout expires, we permanently give up waiting
> on supplier devices without drivers. At this point, whatever devices
> can probe without some of their optional suppliers end up probing.
>
> In the case where module support is disabled, it's fairly
> straightforward and all device probes are completed before the initcalls
> are done.
>
> Patches 1 to 3 are fairly straightforward and can probably be applied
> right away.
>
> Patches 4 to 6 are for fixing the NFS rootfs issue and setting the
> default deferred_probe_timeout back to 10 seconds when modules are
> enabled.
>
> Patches 7 to 9 are further clean up of the deferred_probe_timeout logic
> so that no framework has to know/care about deferred_probe_timeout.
>
> Yoshihiro/Geert,
>
> If you can test this patch series and confirm that the NFS root case
> works, I'd really appreciate that.

Thanks, I gave this a try on various boards I have access to.
The results were quite positive. E.g. the compile error I saw on v1
(implicit declation of fw_devlink_unblock_may_probe(), which is no longer
used in v2) is gone.

However, I'm seeing a weird error when userspace (Debian9 nfsroot) is
starting:

[ OK ] Started D-Bus System Message Bus.
Unable to handle kernel NULL pointer dereference at virtual
address 0000000000000000
Unable to handle kernel NULL pointer dereference at virtual
address 0000000000000000
Mem abort info:
ESR = 0x0000000096000004
Mem abort info:
ESR = 0x0000000096000004
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EC = 0x25: DABT (current EL), IL = 32 bits
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
SET = 0, FnV = 0
Data abort info:
ISV = 0, ISS = 0x00000004
EA = 0, S1PTW = 0
FSC = 0x04: level 0 translation fault
CM = 0, WnR = 0
user pgtable: 4k pages, 48-bit VAs, pgdp=000000004ec45000
[0000000000000000] pgd=0000000000000000, p4d=0000000000000000
Data abort info:
Internal error: Oops: 96000004 [#1] PREEMPT SMP
CPU: 0 PID: 374 Comm: v4l_id Tainted: G W
5.19.0-rc1-arm64-renesas-00799-gc13c3e49e8bd #1660
ISV = 0, ISS = 0x00000004
Hardware name: Renesas Ebisu-4D board based on r8a77990 (DT)
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
CM = 0, WnR = 0
pc : subdev_open+0x8c/0x128
lr : subdev_open+0x78/0x128
sp : ffff80000aadba60
x29: ffff80000aadba60 x28: 0000000000000000 x27: ffff80000aadbc58
x26: 0000000000020000 x25: ffff00000b3aaf00 x24: 0000000000000000
x23: ffff00000c331c00 x22: ffff000009aa61b8 x21: ffff000009aa6000
x20: ffff000008bae3e8 x19: ffff00000c3fe200 x18: 0000000000000000
x17: ffff800076945000 x16: ffff800008004000 x15: 00008cc6bf550c7c
x14: 000000000000038f x13: 000000000000001a x12: ffff00007fba8618
x11: 0000000000000001 x10: 0000000000000000 x9 : ffff800009253954
x8 : ffff00000b3aaf00 x7 : 0000000000000004 x6 : 000000000000001a
x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000001
x2 : 0000000100000001 x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
subdev_open+0x8c/0x128
v4l2_open+0xa4/0x120
chrdev_open+0x78/0x178
do_dentry_open+0xfc/0x398
vfs_open+0x28/0x30
path_openat+0x584/0x9c8
do_filp_open+0x80/0x108
do_sys_openat2+0x20c/0x2d8
user pgtable: 4k pages, 48-bit VAs, pgdp=000000004ec53000
do_sys_open+0x54/0xa0
__arm64_sys_openat+0x20/0x28
invoke_syscall+0x40/0xf8
el0_svc_common.constprop.0+0xf0/0x110
do_el0_svc+0x20/0x78
el0_svc+0x48/0xd0
el0t_64_sync_handler+0xb0/0xb8
el0t_64_sync+0x148/0x14c
Code: f9405280 f9400400 b40000e0 f9400280 (f9400000)
---[ end trace 0000000000000000 ]---

This only happens on the Ebisu-4D board (r8a77990-ebisu.dts).
I do not see this on the Salvator-X(S) boards.

Bisection shows this starts to happen with "[PATCH v2 7/9] driver core:
Set fw_devlink.strict=1 by default".

Adding more debug info:

subdev_open:54: file v4l-subdev1
Unable to handle kernel NULL pointer dereference at virtual
address 0000000000000000
subdev_open:54: file v4l-subdev2
Unable to handle kernel NULL pointer dereference at virtual
address 0000000000000000

Matching the subdev using sysfs gives:

/sys/devices/platform/soc/e6500000.i2c/i2c-0/0-0070/video4linux/v4l-subdev1
/sys/devices/platform/soc/e6500000.i2c/i2c-0/0-0070/video4linux/v4l-subdev2

The i2c device is the adi,adv7482 at address 0x70.
But now I'm lost...

Gr{oetje,eeting}s,

Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@xxxxxxxxxxxxxx

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
-- Linus Torvalds