Re: [PATCH v1] drm/ci: enable lockdep detection

From: Vignesh Raman
Date: Fri Sep 13 2024 - 08:19:29 EST


Hi Rob, Helen,

On 14/08/24 23:11, Rob Clark wrote:
On Wed, Aug 14, 2024 at 2:42 AM Vignesh Raman
<vignesh.raman@xxxxxxxxxxxxx> wrote:

Hi Helen,

On 14/08/24 01:44, Helen Mae Koike Fornazier wrote:




---- On Tue, 13 Aug 2024 02:26:48 -0300 Vignesh Raman wrote ---

> Hi Helen,
>
> On 13/08/24 01:47, Helen Mae Koike Fornazier wrote:
> >
> > Hi Vignesh,
> >
> > Thanks for your patch.
> >
> >
> > ---- On Mon, 12 Aug 2024 08:20:28 -0300 Vignesh Raman wrote ---
> >
> > > We have enabled PROVE_LOCKING (which enables LOCKDEP) in drm-ci.
> > > This will output warnings when kernel locking errors are encountered
> > > and will continue executing tests. To detect if lockdep has been
> > > triggered, check the debug_locks value in /proc/lockdep_stats after
> > > the tests have run. When debug_locks is 0, it indicates that lockdep
> > > has detected issues and turned itself off. So check this value and
> > > exit with an error if lockdep is detected.
> >
> > Should we exit with an error? Or with a warning? (GitLab-CI supports that).
> > Well, I guess it is serious enough.
>
> I think we can exit with an error since we check the status at the end
> of the tests.

I mean, we can exit with a specific error and configure this specific error in gitlab-ci to be a warning,
so the job will be yellow and not red.

But maybe the lockdep issue should be a strong error.

Yes agree. We can exit with an error for lockdep issue instead of a warning.

I think that is too strong, lockdep can warn about things which can
never happen in practice. (We've never completely solved some of the
things that lockdep complains about in runpm vs shrinker reclaim.)

Surfacing it as a warning is fine.

Will send another patch which will exit with an error if lockdep is detected and configure it as a warning in GitLab CI.

Regards,
Vignesh


BR,
-R


>
> >
> > Should we also track on the xfail folder? So we can annotate those errors as well?
>
> Do you mean reporting this error in expectation files?

I wonder if there will be cases were we are getting this error and we should ignore it, so in the code
we should check the xfail files to see if we should exit with an error or ignore it.

For instance, if we have a case where we are having this error, and it is flaky, we might want to add it
to the flakes file list.

Maybe this is not the case, I'm just wondering.


The tests are passing but log shows lockdep warning
(https://gitlab.freedesktop.org/vigneshraman/linux/-/jobs/62177711).

Moreover if the lockdep warning is emitted, lockdep will not continue to
run and there is no need to check this warning for each tests.
So added the check at the end of the tests.



>
> > Did you have an entire pipeline with this? To see if everything is still green?
>
> Yes. https://gitlab.freedesktop.org/vigneshraman/linux/-/jobs/62177711
>
> This is a test branch in which I reverted a fix for the lockdep issue.
> We see 'WARNING: bad unlock balance detected!' in logs and pipeline is
> still green.

But with your patch, it would red right?

Yes it would fail and the pipeline will be red.

With the current patch, is the pipeline still all green?

With this current patch, it will fail.
Pipeline link to show lockdep_stats before and after tests,
https://gitlab.freedesktop.org/vigneshraman/linux/-/pipelines/1246721

Regards,
Vignesh


Regards,
Helen

>
> Regards,
> Vignesh
>
> >
> > Helen
> >
> > >
> > > Signed-off-by: Vignesh Raman vignesh.raman@xxxxxxxxxxxxx>
> > > ---
> > >
> > > v1:
> > > - Pipeline link to show lockdep_stats before and after tests,
> > > https://gitlab.freedesktop.org/vigneshraman/linux/-/pipelines/1246721
> > >
> > > ---
> > > drivers/gpu/drm/ci/igt_runner.sh | 11 +++++++++++
> > > 1 file changed, 11 insertions(+)
> > >
> > > diff --git a/drivers/gpu/drm/ci/igt_runner.sh b/drivers/gpu/drm/ci/igt_runner.sh
> > > index f38836ec837c..d2c043cd8c6a 100755
> > > --- a/drivers/gpu/drm/ci/igt_runner.sh
> > > +++ b/drivers/gpu/drm/ci/igt_runner.sh
> > > @@ -85,6 +85,17 @@ deqp-runner junit \
> > > --limit 50 \
> > > --template "See https://$CI_PROJECT_ROOT_NAMESPACE.pages.freedesktop.org/-/$CI_PROJECT_NAME/-/jobs/$CI_JOB_ID/artifacts/results/{{testcase}}.xml";
> > >
> > > +# Check if /proc/lockdep_stats exists
> > > +if [ -f /proc/lockdep_stats ]; then
> > > + # If debug_locks is 0, it indicates lockdep is detected and it turns itself off.
> > > + debug_locks=$(grep 'debug_locks:' /proc/lockdep_stats | awk '{print $2}')
> > > + if [ "$debug_locks" -eq 0 ]; then
> > > + echo "LOCKDEP issue detected. Please check dmesg logs for more information."
> > > + cat /proc/lockdep_stats
> > > + ret=1
> > > + fi
> > > +fi
> > > +
> > > # Store the results also in the simpler format used by the runner in ChromeOS CI
> > > #sed -r 's/(dmesg-warn|pass)/success/g' /results/results.txt > /results/results_simple.txt
> > >
> > > --
> > > 2.43.0
> > >
> > >
>