RE: [PATCH] ACPI: don't show an error when we're not in charge of PCIe hotplug.

From: Mario_Limonciello
Date: Wed Jun 22 2016 - 15:43:35 EST


> -----Original Message-----
> From: rjwysocki@xxxxxxxxx [mailto:rjwysocki@xxxxxxxxx] On Behalf Of
> Rafael J. Wysocki
> Sent: Tuesday, June 21, 2016 5:51 PM
> To: Limonciello, Mario <Mario_Limonciello@xxxxxxxx>
> Cc: Peter Jones <pjones@xxxxxxxxxx>; Rafael J. Wysocki
> <rafael@xxxxxxxxxx>; ACPI Devel Maling List <linux-acpi@xxxxxxxxxxxxxxx>;
> Linux Kernel Mailing List <linux-kernel@xxxxxxxxxxxxxxx>; Len Brown
> <lenb@xxxxxxxxxx>; Rafael J. Wysocki <rjw@xxxxxxxxxxxxx>; Andy Lutomirski
> <luto@xxxxxxxxxxxxxx>
> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge of
> PCIe hotplug.
>
> On Tue, Jun 21, 2016 at 8:01 PM, <Mario_Limonciello@xxxxxxxx> wrote:
> >> -----Original Message-----
> >> From: Peter Jones [mailto:pjones@xxxxxxxxxx]
> >> Sent: Tuesday, June 21, 2016 10:19 AM
> >> To: Rafael J. Wysocki <rafael@xxxxxxxxxx>
> >> Cc: ACPI Devel Maling List <linux-acpi@xxxxxxxxxxxxxxx>; Limonciello,
> Mario
> >> <Mario_Limonciello@xxxxxxxx>; Linux Kernel Mailing List <linux-
> >> kernel@xxxxxxxxxxxxxxx>; Len Brown <lenb@xxxxxxxxxx>; Rafael J .
> Wysocki
> >> <rjw@xxxxxxxxxxxxx>; Andy Lutomirski <luto@xxxxxxxxxxxxxx>
> >> Subject: Re: [PATCH] ACPI: don't show an error when we're not in charge
> of
> >> PCIe hotplug.
> >>
> >> (Sorry for the slow response - it's deadline time over here.)
> >>
> >> On Thu, Jun 16, 2016 at 04:56:57PM +0200, Rafael J. Wysocki wrote:
> >> > On Thu, Jun 16, 2016 at 2:12 AM, Rafael J. Wysocki <rafael@xxxxxxxxxx>
> >> wrote:
> >> > > On Thu, Jun 16, 2016 at 12:15 AM, Peter Jones <pjones@xxxxxxxxxx>
> >> wrote:
> >> > >> Right now when booting, on many laptops the firmware manages
> the
> >> PCIe
> >> > >> bus. As a result, when we call the _OSC ACPI method, it returns an
> >> > >> error code. Unfortunately the errors are not very articulate.
> >> > >
> >> > > What exactly do you mean here?
> >> > >
> >> > >> As a result, we show:
> >> > >>
> >> > >> ACPI: PCI Root Bridge [PCI0] (domain 0000 [bus 00-fe])
> >> > >> acpi PNP0A08:00: _OSC: OS supports [ExtendedConfig ASPM
> ClockPM
> >> Segments MSI]
> >> > >> \_SB_.PCI0 (33DB4D5B-1FF7-401C-9657-7441C03DD766): _OSC invalid
> >> UUID
> >> > >> _OSC request data: 1 1f 0
> >> > >
> >> > > So _OSC told us that the UUID was invalid, didn't it?
> >> >
> >> > BTW, the above messages are KERN_DEBUG, so at least in theory they
> >> > shouldn't be visible in production runs.
> >> >
> >> > Maybe the bug to fix is that they show up when they aren't supposed
> to?
> >>
> >> No - the workflow that I am really trying to remedy is this:
> >>
> >> 1) S3 resume sometimes isn't working on some laptop you've got.
> >> 2) start looking at debug messages
> >> 3) this shows an error, so it looks like it's probably the problem
> >> 4) go fishing for red herring
> >> 5) if you happen to know who maintains the DSDT for the platform in
> >> question, eventually work out that this is working as intended and
> >> the bug is someplace else.
> >> 5b) if you don't know that person, eventually work out that it /might/
> >> be someplace else...
> >>
> >> So the idea was to make it look more like an indication of status, and
> >> less like an error that's causing unrelated problems.
> >>
> >> When I talked to Mario at Dell (Cc'd), it wasn't clear to us that
> >> there's a way to distinguish the between the UUID being
> >> invalid/malformed, being merely unsupported, or being supported in
> some
> >> configurations but not the current one. In this particular DSDT, the
> >> machine doesn't support the OS controlling any of this if USB-C /
> >> thunderbolt are enabled. The DSDT is clearly written with the belief
> >> that you have to completely disable the handling for that UUID in this
> >> case, and googling for this looks like it's not the only one written
> >> with that belief.
> >>
> >> Reading the spec (v6.1, sections 6.2.11.3 and 6.2.11.4), it seems
> >> plausible that you can express this instead by handling the UUID but
> >> choosing each individual query/status bit in the way that accomplishes
> >> the OS doing nothing with the response. So it may well be that that's
> >> just more code that vendors have thought wasn't necessary (or wasn't
> >> correct for some reason.)
> >>
> >> Mario, want to jump in on your thinking here?
> >>
> >> --
> >> Peter
> >
> > After talking to the team, I was told this particular implementation to not
> let
> > OS take control when acting on that specific UUID based upon a variable
> > (NEXP in this case) came from Intel RC code.
> >
> > That's probably why this is all across a lot of platforms, including non-Dell.
> >
> > At least in the context of the laptop Peter noticed this on (Dell XPS 13 9350)
> > NEXP is set in GNVS based upon Thunderbolt capability.
> >
> > As for why they return unrecognized UUID instead of just masking all the
> > capabilities bits? It's the same net functional result. If the vendor provided
> > RC code doesn't caused WCHK problems or functional problems it's hard to
> > make a case for why it needs to be changed by the OEM.
> >
> > I think that Peter's patch is appropriate to message this is specifically
> > what's going on.
>
> No, it may hide real (ie. non-intentional) bugs in _OSC, so it is not
> appropriate.
>
> Debug-level messages really should not hurt anyone (and should never
> show up in production anyway).
>
> We can slightly tone down the "_OSC failed (%s); disabling ASPM\n"
> message in negotiate_os_control() in drivers/acpi/pci_root.c if you
> think it's too strong and that's it.
>
> Thanks,
> Rafael

I think changing that would help communicate what's going on here and at
least let the user know the result will be that the firmware is still controlling
ASPM due to the _OSC failure.

Something else that I think Andy recommended a while back was at that
time try to evaluate NEXP and display its value and an associated message
in debug logs when _OSC fails. Would you be amenable to a change like that?