Re: [Intel-gfx] alderlake crashes (random memory corruption?) with 6.0 i915 / ucode related
From: Ville Syrjälä
Date: Tue Oct 18 2022 - 06:33:12 EST

On Mon, Oct 17, 2022 at 04:32:28PM +0200, Hans de Goede wrote:
> Hi,
>
> On 10/17/22 15:35, Jani Nikula wrote:
> > On Mon, 17 Oct 2022, Hans de Goede <hdegoede@xxxxxxxxxx> wrote:
> >> Hi,
> >>
> >> On 10/17/22 13:19, Thorsten Leemhuis wrote:
> >>> CCing the regression mailing list, as it should be in the loop for all
> >>> regressions, as explained here:
> >>> https://www.kernel.org/doc/html/latest/admin-guide/reporting-issues.html
> >>
> >> Yes, sorry about that. I meant to Cc the regressions list, not you personally,
> >> but the auto-completion picked the wrong address-book entry
> >> (and I did not notice this).
> >>
> >>> On 17.10.22 12:48, Hans de Goede wrote:
> >>>> On 10/17/22 10:39, Jani Nikula wrote:
> >>>>> On Mon, 17 Oct 2022, Jani Nikula <jani.nikula@xxxxxxxxxxxxxxx> wrote:
> >>>>>> On Thu, 13 Oct 2022, Hans de Goede <hdegoede@xxxxxxxxxx> wrote:
> >>>>>>> With 6.0 the following WARN triggers:
> >>>>>>> drivers/gpu/drm/i915/display/intel_bios.c:477:
> >>>>>>>
> >>>>>>> drm_WARN(&i915->drm, min_size == 0,
> >>>>>>> "Block %d min_size is zero\n", section_id);
> >>>>>>
> >>>>>> What's the value of section_id that gets printed?
> >>>>>
> >>>>> I'm guessing this is [1] fixed by commit d3a7051841f0 ("drm/i915/bios:
> >>>>> Use hardcoded fp_timing size for generating LFP data pointers") in
> >>>>> v6.1-rc1.
> >>>>>
> >>>>> I don't think this is the root cause for your issues, but I wonder if
> >>>>> you could try v6.1-rc1 or drm-tip and see if we've fixed the other stuff
> >>>>> already too?
> >>>>
> >>>> 6.1-rc1 indeed does not trigger the drm_WARN and for now (couple of
> >>>> reboots, running for 5 minutes now) it seems stable. 6.0.0 usually
> >>>> crashed during boot (but not always).
> >>>>
> >>>> Do you think it would be worthwhile to try 6.0.0 with d3a7051841f0 ?
> >>
> >> So I have been trying 6.0.0 with d3a7051841f0 doing a whole bunch of
> >> reboots + general use and that seems stable, then I reverted it and
> >> the very first boot of the kernel with that broke again, so I'm
> >> pretty sure that d3a7051841f0 fixes things.
> >>
> >> So d3a7051841f0 seems to do more than just fix the WARN().
> >
> > Wow, so I guess we do screw up the parsing royally then. :o
>
> I'm running the kernel with lockdep + list-debugging enabled and
> I could not reproduce this (not easily at least) on a standard
> Fedora 6.0.0 build without that. So maybe the parsing just manages
> to write out of bounds a tiny bit which just happens to hit a list_head
> somewhere ... ?

We don't parse any of the LFP data stuff if we didn't manage
to generate the data ptrs. So can't really see how that would
happen. Another theory might be that something else gets
screwed up if we fail to parse anything, but can't really
think how that would lead to list corruption either.

>
> Either way things look stable with d3a7051841f0 and it turns out
> that Fedora already had that cherry-picked downstream in the
> 5.19.13 kernel which was stable for me too.
>
> >> So let's try to get d3a7051841f0 added to the official stable series
> >> ASAP (I just noticed that Mark Pearson from Lenovo has already added it
> >> to Fedora's 6.0.2 build).
> >
> > I think I'd also pick d3a7051841f0^ i.e. both commits:
> >
> > d3a7051841f0 ("drm/i915/bios: Use hardcoded fp_timing size for generating LFP data pointers")
> > 4e78d6023c15 ("drm/i915/bios: Validate fp_timing terminator presence")
> >
> > for stable.

Ack from me.

>
> That sounds good, can you take care of submitting these to gkh ?
>
> Regards,
>
> Hans

--
Ville Syrjälä
Intel