Re: [PATCH 5.17 127/298] driver core: Fix wait_for_device_probe() & deferred_probe_timeout interaction

From: Saravana Kannan
Date: Fri Aug 18 2023 - 16:21:28 EST


On Thu, Aug 17, 2023 at 4:13 PM Shreeya Patel
<shreeya.patel@xxxxxxxxxxxxx> wrote:
>
> Hi Geert, Saravana,
>
> On 18/08/23 00:03, Saravana Kannan wrote:
> > On Thu, Aug 17, 2023 at 4:37 AM Shreeya Patel
> > <shreeya.patel@xxxxxxxxxxxxx> wrote:
> >> Hi Greg,
> >>
> >> On 16/08/23 20:33, Greg Kroah-Hartman wrote:
> >>> On Wed, Aug 16, 2023 at 03:09:27PM +0530, Shreeya Patel wrote:
> >>>> On 13/06/22 15:40, Greg Kroah-Hartman wrote:
> >>>>> From: Saravana Kannan <saravanak@xxxxxxxxxx>
> >>>>>
> >>>>> [ Upstream commit 5ee76c256e928455212ab759c51d198fedbe7523 ]
> >>>>>
> >>>>> Mounting NFS rootfs was timing out when deferred_probe_timeout was
> >>>>> non-zero [1]. This was because ip_auto_config() initcall times out
> >>>>> waiting for the network interfaces to show up when
> >>>>> deferred_probe_timeout was non-zero. While ip_auto_config() calls
> >>>>> wait_for_device_probe() to make sure any currently running deferred
> >>>>> probe work or asynchronous probe finishes, that wasn't sufficient to
> >>>>> account for devices being deferred until deferred_probe_timeout.
> >>>>>
> >>>>> Commit 35a672363ab3 ("driver core: Ensure wait_for_device_probe() waits
> >>>>> until the deferred_probe_timeout fires") tried to fix that by making
> >>>>> sure wait_for_device_probe() waits for deferred_probe_timeout to expire
> >>>>> before returning.
> >>>>>
> >>>>> However, if wait_for_device_probe() is called from the kernel_init()
> >>>>> context:
> >>>>>
> >>>>> - Before deferred_probe_initcall() [2], it causes the boot process to
> >>>>> hang due to a deadlock.
> >>>>>
> >>>>> - After deferred_probe_initcall() [3], it blocks kernel_init() from
> >>>>> continuing till deferred_probe_timeout expires and beats the point of
> >>>>> deferred_probe_timeout that's trying to wait for userspace to load
> >>>>> modules.
> >>>>>
> >>>>> Neither of this is good. So revert the changes to
> >>>>> wait_for_device_probe().
> >>>>>
> >>>>> [1] - https://lore.kernel.org/lkml/TYAPR01MB45443DF63B9EF29054F7C41FD8C60@xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx/
> >>>>> [2] - https://lore.kernel.org/lkml/YowHNo4sBjr9ijZr@dev-arch.thelio-3990X/
> >>>>> [3] - https://lore.kernel.org/lkml/Yo3WvGnNk3LvLb7R@xxxxxxxxxxxxx/
> >>>> Hi Saravana, Greg,
> >>>>
> >>>>
> >>>> KernelCI found this patch causes the baseline.bootrr.deferred-probe-empty test to fail on r8a77960-ulcb,
> >>>> see the following details for more information.
> >>>>
> >>>> KernelCI dashboard link:
> >>>> https://linux.kernelci.org/test/plan/id/64d2a6be8c1a8435e535b264/
> >>>>
> >>>> Error messages from the logs :-
> >>>>
> >>>> + UUID=11236495_1.5.2.4.5
> >>>> + set +x
> >>>> + export 'PATH=/opt/bootrr/libexec/bootrr/helpers:/lava-11236495/1/../bin:/sbin:/usr/sbin:/bin:/usr/bin'
> >>>> + cd /opt/bootrr/libexec/bootrr
> >>>> + sh helpers/bootrr-auto
> >>>> e6800000.ethernet
> >>>> e6700000.dma-controller
> >>>> e7300000.dma-controller
> >>>> e7310000.dma-controller
> >>>> ec700000.dma-controller
> >>>> ec720000.dma-controller
> >>>> fea20000.vsp
> >>>> feb00000.display
> >>>> fea28000.vsp
> >>>> fea30000.vsp
> >>>> fe9a0000.vsp
> >>>> fe9af000.fcp
> >>>> fea27000.fcp
> >>>> fea2f000.fcp
> >>>> fea37000.fcp
> >>>> sound
> >>>> ee100000.mmc
> >>>> ee140000.mmc
> >>>> ec500000.sound
> >>>> /lava-11236495/1/../bin/lava-test-case
> >>>> <8>[ 17.476741] <LAVA_SIGNAL_TESTCASE TEST_CASE_ID=deferred-probe-empty RESULT=fail>
> >>>>
> >>>> Test case failing :-
> >>>> Baseline Bootrr deferred-probe-empty test - https://github.com/kernelci/bootrr/blob/main/helpers/bootrr-generic-tests
> >>>>
> >>>> Regression Reproduced :-
> >>>>
> >>>> Lava job after reverting the commit 5ee76c256e92
> >>>> https://lava.collabora.dev/scheduler/job/11292890
> >>>>
> >>>>
> >>>> Bisection report from KernelCI can be found at the bottom of the email.
> >>>>
> >>>> Thanks,
> >>>> Shreeya Patel
> >>>>
> >>>> #regzbot introduced: 5ee76c256e92
> >>>> #regzbot title: KernelCI: Multiple devices deferring on r8a77960-ulcb
> >>>>
> >>>> ---------------------------------------------------------------------------------------------------------------------------------------------------
> >>>>
> >>>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> >>>> * If you do send a fix, please include this trailer: *
> >>>> * Reported-by: "kernelci.org bot" <bot@...> *
> >>>> * *
> >>>> * Hope this helps! *
> >>>> * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
> >>>>
> >>>> stable-rc/linux-5.10.y bisection: baseline.bootrr.deferred-probe-empty on
> >>>> r8a77960-ulcb
> >>> You are testing 5.10.y, yet the subject says 5.17?
> >>>
> >>> Which is it here?
> >> Sorry, I accidentally used the lore link for 5.17 while reporting this
> >> issue,
> >> but this test does fail on all the stable releases from 5.10 onwards.
> >>
> >> stable 5.15 :-
> >> https://linux.kernelci.org/test/case/id/64dd156a5ac58d0cf335b1ea/
> >> mainline :-
> >> https://linux.kernelci.org/test/case/id/64dc13d55cb51357a135b209/
> >>
> > Shreeya, can you try the patch Geert suggested and let us know if it
> > helps? If not, then I can try to take a closer look.
>
> I tried to test the kernel with 9be4cbd09da8 but it didn't change the
> result.
> https://lava.collabora.dev/scheduler/job/11311615
>
> Also, I am not sure if this can change things but just FYI, KernelCI
> adds some kernel parameters when running these tests and one of the
> parameters is deferred_probe_timeout=60.

Ah, this is good to know.

> You can check this in the definition details given in the Lava job. I
> also tried to remove this parameter and rerun the test but again I got
> the same result.

How long does the test wait after boot before checking the deferred
devices list?
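
To illustrate why the timing matters, here is a minimal sketch in
Python. It is not the bootrr test itself. The function names are
hypothetical, the list format (one device name per line, optionally
followed by a tab and a reason string) is an assumption, and the start
of the deferred_probe_timeout timer is approximated as boot time, which
is only a rough simplification. The 17.476741 s timestamp and the 60 s
timeout are the values quoted earlier in this thread.

```python
def parse_deferred_devices(text):
    """Return device names from a deferred-devices listing.

    Assumed format: one device per line; some kernels may append a
    tab-separated reason after the name, which is dropped here.
    """
    devices = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        devices.append(line.split("\t", 1)[0])
    return devices


def check_is_premature(check_time_s, deferred_probe_timeout_s):
    """Return True if the list is read before the timeout could fire.

    Simplification: the timer is treated as starting at boot (t = 0).
    A timeout of 0 or less is treated as "nothing to wait for".
    """
    if deferred_probe_timeout_s <= 0:
        return False
    return check_time_s < deferred_probe_timeout_s


if __name__ == "__main__":
    # A few entries copied from the log quoted above.
    sample = "e6800000.ethernet\ne6700000.dma-controller\nsound\n"
    names = parse_deferred_devices(sample)
    print(len(names), "device(s) still deferred")
    # Test result logged at ~17.48 s with deferred_probe_timeout=60.
    print("premature check:", check_is_premature(17.476741, 60))
```

Under these assumptions, a non-empty list read well before the timeout
expires does not by itself show that the devices would stay deferred
forever, which is why the wait time of the test is relevant.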

> I will try to add 9be4cbd09da8 to mainline kernel and see what results I
> get.

Now I'm confused. What do you mean by mainline? Are you saying the tip
of Linus's tree is also hitting this issue?

-Saravana