Re: [RESEND RFC 5/5] PCIe, AER: Update initial value of UC error mask

From: Chen, Gong
Date: Tue Sep 09 2014 - 03:46:47 EST


On Fri, Sep 05, 2014 at 05:34:21PM -0600, Bjorn Helgaas wrote:
> Date: Fri, 5 Sep 2014 17:34:21 -0600
> From: Bjorn Helgaas <bhelgaas@xxxxxxxxxx>
> To: "Chen, Gong" <gong.chen@xxxxxxxxxxxxxxx>
> Cc: rdunlap@xxxxxxxxxxxxx, bp@xxxxxxxxx, tony.luck@xxxxxxxxx,
> linux-pci@xxxxxxxxxxxxxxx, linux-kernel@xxxxxxxxxxxxxxx
> Subject: Re: [RESEND RFC 5/5] PCIe, AER: Update initial value of UC error
> mask
> User-Agent: Mutt/1.5.21 (2010-09-15)
>
> On Wed, Aug 13, 2014 at 02:22:41AM -0400, Chen, Gong wrote:
> > In PCI-e SPEC r3.0, BIT 0 of Uncorrectable Error Status Register
> > is redefined and it has an explicit requirement that when writing
> > this field, a value of 1b is the only choice. So change previous
> > initial maks from 0 to 1.
> >
> > Signed-off-by: Chen, Gong <gong.chen@xxxxxxxxxxxxxxx>
> > ---
> > NOTE: After scratching all use cases, this is the most obvious use
> > case to violate the SPEC. Most of use cases just read first and
> > then overwrite for clear purpose. Even so, such fix is obvious to
> > not compatiable with previous SPEC definition. Do we need a dirty
> > hack?
> >
> > arch/mips/pci/pci-octeon.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/arch/mips/pci/pci-octeon.c b/arch/mips/pci/pci-octeon.c
> > index 59cccd95688b..f1bfdc201297 100644
> > --- a/arch/mips/pci/pci-octeon.c
> > +++ b/arch/mips/pci/pci-octeon.c
> > @@ -134,7 +134,7 @@ int pcibios_plat_dev_init(struct pci_dev *dev)
> > dconfig);
> > /* Enable reporting of all uncorrectable errors */
> > /* Uncorrectable Error Mask - turned on bits disable errors */
> > - pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, 0);
> > + pci_write_config_dword(dev, pos + PCI_ERR_UNCOR_MASK, 1);
>
> I see the text in the spec that says we should only write 1 to bit 0 (sec
> 7.10.3, for anybody following along). It looks like that change was made
> between PCIe r1.0 and r1.1. It would really be nice to have more context
> about why the change was made, because if there's hardware in the field
> that implements r1.0 behavior, this patch will change the way it works, and
> I don't know how to verify that is safe.
>
> Does this actually change fix a problem? If it fixes a problem that
> happens on real hardware, that's a much better reason to make a change than
> just to comply with the spec.
>
> Sec 7.10.2 also says we should ignore the value of bit 0 in the
> Uncorrectable Error Status register, and I don't see any place where we
> follow that advice.
>
That's why I mark this patch as RFC. As you mentioned above, these are my
concerns, too. I submit such a patch not for merging but throwing a potential
issue. As I noted above, I don't know if it is deserved to fix all affected
placed to comply with spec change. After all, no one reports such an
issue (or maybe have happened :-))

Attachment: signature.asc
Description: Digital signature