RE: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS

From: Deucher, Alexander
Date: Tue Mar 28 2017 - 16:37:39 EST


> -----Original Message-----
> From: Joerg Roedel [mailto:jroedel@xxxxxxx]
> Sent: Tuesday, March 28, 2017 4:29 PM
> To: Deucher, Alexander
> Cc: 'Joerg Roedel'; Bjorn Helgaas; linux-pci@xxxxxxxxxxxxxxx;
> linux-kernel@xxxxxxxxxxxxxxx; Daniel Drake; Nath, Arindam
> Subject: Re: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
>
> On Tue, Mar 28, 2017 at 08:18:26PM +0000, Deucher, Alexander wrote:
> > > -----Original Message-----
> > > From: Joerg Roedel [mailto:joro@xxxxxxxxxx]
> > > Sent: Tuesday, March 28, 2017 8:17 AM
> > > To: Bjorn Helgaas
> > > Cc: linux-pci@xxxxxxxxxxxxxxx; linux-kernel@xxxxxxxxxxxxxxx; Joerg Roedel;
> > > Daniel Drake; Deucher, Alexander
> > > Subject: [PATCH] PCI: Blacklist AMD Stoney GPU devices for ATS
> > >
> > > From: Joerg Roedel <jroedel@xxxxxxx>
> > >
> > > ATS is broken on these devices. Under invalidation load, the
> > > GPU does not reply to invalidations anymore, causing
> > > Completion-wait loop timeouts on the AMD IOMMU driver side.
> > > Fix it by not enabling ATS on these devices.
> > >
> > > Note that below mentioned commit is not broken, it just
> > > triggers the issue because it might cause invalidation
> > > storms on devices.
> > >
> > > Fixes: b1516a14657a ('iommu/amd: Implement flush queue')
> > > Reported-by: Daniel Drake <drake@xxxxxxxxxxxx>
> > > Cc: Daniel Drake <drake@xxxxxxxxxxxx>
> > > Cc: Alexander Deucher <Alexander.Deucher@xxxxxxx>
> > > Signed-off-by: Joerg Roedel <jroedel@xxxxxxx>
> >
> > Did you see Arindam's patch from yesterday[1]? Not sure which is the
> > proper fix, maybe both?
>
> Arindam's patch makes sense on its own, but not as a fix for this issue.
> It lowers the invalidation load on the GPU, but there are still ways to
> trigger a high invalidation rate on the device. So it might hide the
> issue, but not fix it.
>
> We need to disable ATS on the device if it doesn't work reliably.

The question is whether the problem could stem from flushing an entity that didn't request it, or whether that should not matter. I guess it shouldn't matter; otherwise we'd see this on other platforms like Carrizo as well.

Alex