Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)

From: Karol Herbst
Date: Thu Jun 01 2023 - 13:20:05 EST


On Thu, Jun 1, 2023 at 6:54 PM Limonciello, Mario
<Mario.Limonciello@xxxxxxx> wrote:
>
> [AMD Official Use Only - General]
>
> > -----Original Message-----
> > From: Karol Herbst <kherbst@xxxxxxxxxx>
> > Sent: Thursday, June 1, 2023 11:33 AM
> > To: Limonciello, Mario <Mario.Limonciello@xxxxxxx>
> > Cc: Nick Hastings <nicholaschastings@xxxxxxxxx>; Lyude Paul
> > <lyude@xxxxxxxxxx>; Lukas Wunner <lukas@xxxxxxxxx>; Salvatore
> > Bonaccorso <carnil@xxxxxxxxxx>; 1036530@xxxxxxxxxxxxxxx; Rafael J.
> > Wysocki <rafael@xxxxxxxxxx>; Len Brown <lenb@xxxxxxxxxx>; linux-
> > acpi@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx;
> > regressions@xxxxxxxxxxxxxxx
> > Subject: Re: Regression from "ACPI: OSI: Remove Linux-Dell-Video _OSI
> > string"? (was: Re: Bug#1036530: linux-signed-amd64: Hard lock up of system)
> >
> > On Thu, Jun 1, 2023 at 6:18 PM Limonciello, Mario
> > <mario.limonciello@xxxxxxx> wrote:
> > >
> > > +Lyude, Lukas, Karol
> > >
> > > On 5/31/2023 6:40 PM, Nick Hastings wrote:
> > > > Hi,
> > > >
> > > > * Nick Hastings <nicholaschastings@xxxxxxxxx> [230530 16:01]:
> > > >> * Mario Limonciello <mario.limonciello@xxxxxxx> [230530 13:00]:
> > > > <snip>
> > > >>> As you're actually loading nouveau, can you please try
> > nouveau.runpm=0 on
> > > >>> the kernel command line?
> > > >> I'm not intentionally loading it. This machine also has intel graphics
> > > >> which is what I prefer. Checking my
> > > >> /etc/modprobe.d/blacklist-nvidia-nouveau.conf
> > > >> I see:
> > > >>
> > > >> blacklist nvidia
> > > >> blacklist nvidia-drm
> > > >> blacklist nvidia-modeset
> > > >> blacklist nvidia-uvm
> > > >> blacklist ipmi_msghandler
> > > >> blacklist ipmi_devintf
> > > >>
> > > >> So I thought I had blacklisted it but it seems I did not. Since I do not
> > > >> want to use it maybe it is better to check if the lock up occurs with
> > > >> nouveau blacklisted. I will try that now.
> > > > I blacklisted nouveau and booted into a 6.1 kernel:
> > > > % uname -a
> > > > Linux xps 6.1.0-9-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.27-1
> > (2023-05-08) x86_64 GNU/Linux
> > > >
> > > > It has been running without problems for nearly two days now:
> > > > % uptime
> > > > 08:34:48 up 1 day, 16:22, 2 users, load average: 1.33, 1.26, 1.27
> > > >
> > > > Regards,
> > > >
> > > > Nick.
> > >
> > > Thanks, that makes a lot more sense now.
> > >
> > > Nick, Can you please test if nouveau works with runtime PM in the
> > > latest 6.4-rc?
> > >
> > > If it works in 6.4-rc, there are probably nouveau commits that need
> > > to be backported to 6.1 LTS.
> > >
> > > If it's still broken in 6.4-rc, I believe you should file a bug:
> > >
> > > https://gitlab.freedesktop.org/drm/nouveau/
> > >
> > >
> > > Lyude, Lukas, Karol
> > >
> > > This thread is in relation to this commit:
> > >
> > > 24867516f06d ("ACPI: OSI: Remove Linux-Dell-Video _OSI string")
> > >
> > > Nick has found that runtime PM is *not* working for nouveau.
> > >
> >
> > keep in mind we have a list of PCIe controllers where we apply a
> > workaround:
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/drivers
> > /gpu/drm/nouveau/nouveau_drm.c?h=v6.4-rc4#n682
> >
> > And I suspect there might be one or two more IDs we'll have to add
> > there. Do we have any logs?
>
> There's some archived onto the distro bug. Search this page for "journalctl.log.gz"
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1036530
>

interesting.. It seems to be the same controller used here. I wonder
if the pci topology is different or if the workaround is applied at
all.

But yeah, I'd kinda love for somebody with better knowledge on all of
this to figure out what exactly is going wrong, but everytime this
gets investigated Intel says "our hardware has no bugs", the ACPI
folks dig for months and find nothing and I end up figuring out some
weirdo workaround I don't understand. And apparently also nobody is
able to hand out docs explaining in detail how that runtime
suspend/resume stuff is supposed to work.

I have a Dell XPS 9560 where the added workaround in nouveau fixed the
problem and I know it's fixed on a bunch of other systems. So if
anybody is willing to publish docs and/or actually debug it with
domain knowledge, please go ahead.

> > And could anybody test if adding the
> > controller in play here does resolve the problem?
> >
> > > If you recall we did 24867516f06d because 5775b843a619 was
> > > supposed to have fixed it.
> > >
>