Re: [PATCH v2] PCI/MSI: Don't touch MSI bits when the PCI device is disconnected

From: Alex_Gagniuc
Date: Wed Nov 14 2018 - 14:22:14 EST


On 11/14/2018 12:00 AM, Bjorn Helgaas wrote:
> On Tue, Nov 13, 2018 at 10:39:15PM +0000, Alex_Gagniuc@xxxxxxxxxxxx wrote:
>> On 11/12/2018 11:02 PM, Bjorn Helgaas wrote:
>>>
>>> [EXTERNAL EMAIL]
>>> Please report any suspicious attachments, links, or requests for sensitive information.
>
> It looks like Dell's email system adds the above in such a way that the
> email quoting convention suggests that *I* wrote it, when I did not.

I was wondering why you thought I was suspicious. It's a recent
(server-side) change. You used to be able to disable these sorts of
notices. I'm told back in the day people were asked to delete emails
before reading them.

>> ...
>>> Do you think Linux observes the rule about not touching AER bits on
>>> FFS? I'm not sure it does. I'm not even sure what section of the
>>> spec is relevant.
>>
>> I haven't found any place where linux breaks this rule. I'm very
>> confident that, unless otherwise instructed, we follow this rule.
>
> Just to make sure we're on the same page, can you point me to this
> rule? I do see that OSPM must request control of AER using _OSC
> before it touches the AER registers. What I don't see is the
> connection between firmware-first and the AER registers.

ACPI 6.2, section 6.2.11.3, Table 6-197:

PCI Express Advanced Error Reporting control:
* The firmware sets this bit to 1 to grant control over PCI Express
Advanced Error Reporting. If firmware allows the OS control of this
feature, then in the context of the _OSC method it must ensure that
error messages are routed to device interrupts as described in the PCI
Express Base Specification[...]

Now I'm confused too:
* HEST -> __aer_firmware_first
  This is used for touching/not touching AER bits.
* _OSC -> bridge->native_aer
  This is used to enable/not enable the AER portdrv service.

Maybe Keith knows better why we're doing it this way. From the ACPI text, it
doesn't seem that control of AER would be tied to HEST entries, although
in practice, it is.
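
To make the split concrete, here is a rough user-space model of the two
gates described above. It is only a sketch of the behavior as I read it, not
kernel code: struct fake_host_bridge, struct fake_pci_dev, and all three
helpers are made-up names for illustration.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel structures; illustration only. */
struct fake_host_bridge {
	bool native_aer;	/* _OSC granted the OS control of AER */
};

struct fake_pci_dev {
	bool hest_firmware_first;	/* HEST entry has FIRMWARE_FIRST set */
	const struct fake_host_bridge *bridge;
};

/* Gate 1 (HEST): may the OS touch this device's AER bits? */
static bool os_may_touch_aer_bits(const struct fake_pci_dev *dev)
{
	return !dev->hest_firmware_first;
}

/* Gate 2 (_OSC): should the AER port service be enabled? */
static bool os_enables_aer_service(const struct fake_pci_dev *dev)
{
	return dev->bridge->native_aer;
}

/* True when the two ACPI sources lead to contradictory decisions. */
static bool gates_disagree(const struct fake_pci_dev *dev)
{
	return os_may_touch_aer_bits(dev) != os_enables_aer_service(dev);
}
```

If firmware sets FIRMWARE_FIRST in HEST but still grants AER control in
_OSC, gates_disagree() returns true: the service is enabled while the
register writes are skipped.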

> The closest I can find is the "Enabled" field in the HEST PCIe
> AER structures (ACPI v6.2, sec 18.3.2.4, .5, .6), where it says:
>
> If the field value is 1, indicates this error source is
> to be enabled.
>
> If the field value is 0, indicates that the error source
> is not to be enabled.
>
> If FIRMWARE_FIRST is set in the flags field, the Enabled
> field is ignored by the OSPM.
>
> AFAICT, Linux completely ignores the Enabled field in these
> structures.

I don't think ignoring the field is a problem:
* With FFS, OS should ignore it.
* Without FFS, we have control, and we get to make the decisions anyway.
In the latter case we decide whether to use AER, independent of the crap
in ACPI. I'm not even sure why "Enabled" matters in native AER handling.
Probably one of the check-boxes in "Binary table designer's handbook"?
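
Spelled out as code, the argument is roughly the following. This is a sketch
only: the enum and the function are hypothetical, and
HEST_FLAG_FIRMWARE_FIRST is my own name for bit 0 of the HEST flags field.
The "OS decides" branch reflects the opinion above, not anything the spec
mandates.

```c
#include <stdint.h>

/* Assumed name for bit 0 (FIRMWARE_FIRST) of the HEST AER flags field. */
#define HEST_FLAG_FIRMWARE_FIRST 0x1u

enum enabled_policy {
	ENABLED_IGNORED_PER_SPEC,	/* FFS: OSPM is told to ignore "Enabled" */
	ENABLED_OS_DECIDES,		/* native AER: the OS makes its own call */
};

/* Classify how the "Enabled" field is treated for a given flags value. */
static enum enabled_policy hest_enabled_policy(uint8_t flags)
{
	if (flags & HEST_FLAG_FIRMWARE_FIRST)
		return ENABLED_IGNORED_PER_SPEC;
	return ENABLED_OS_DECIDES;
}
```

Neither branch ever reads the "Enabled" value itself, which is the point.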

> These structures also contain values the OS is apparently supposed to
> write to Device Control and several AER registers (in struct
> acpi_hest_aer_common). Linux ignores these as well.
>
> These seem like fairly serious omissions in Linux.

I think HPX carries the same sort of information (except for Root
Command reg). FW is supposed to program those registers anyway, so even
if OS doesn't touch them, I'd expect things to just work.

>>> The whole issue of firmware-first, the mechanism by which firmware
>>> gets control, the System Error enables in Root Port Root Control
>>> registers, etc., is very murky to me. Jon has a sort of similar issue
>>> with VMD where he needs to leave System Errors enabled instead of
>>> disabling them as we currently do.
>>
>> Well, OS gets control via _OSC method, and based on that it should
>> touch/not touch the AER bits.
>
> I agree so far.
>
>> The bits that get set/cleared come from _HPX method,
>
> _HPX tells us about some AER registers, Device Control, Link Control,
> and some bridge registers. It doesn't say anything about the Root
> Control register that Jon is concerned with.

_HPX type 3 (yay!!!) got approved recently, and that will have more
fine-grained control. It will be able to handle root control reg.

> For firmware-first to work, firmware has to get control. How does it
> get control? How does OSPM know to either set up that mechanism or
> keep its mitts off something firmware set up before handoff?

My understanding is that, if FW keeps control of AER in _OSC, then it
will have set things up to get notified instead of the OS. OSPM not
touching AER bits is to make sure it doesn't mess up FW's setup. I think
there are some proprietary bits in the root port to route interrupts to
SMIs instead of the AER vectors.

> In Jon's
> VMD case, I think firmware-first relies on the System Error controlled
> by the Root Control register. Linux thinks it owns that, and I don't
> know how to learn otherwise.

Didn't Keith say the root port is not visible to the OS?

Alex