Re: [PATCH 5/5] powerpc: use the generic dma_ops_bypass mode

From: Cédric Le Goater
Date: Mon Aug 31 2020 - 03:28:02 EST


On 8/31/20 8:40 AM, Christoph Hellwig wrote:
> On Sun, Aug 30, 2020 at 11:04:21AM +0200, Cédric Le Goater wrote:
>> Hello,
>>
>> On 7/8/20 5:24 PM, Christoph Hellwig wrote:
>>> Use the DMA API bypass mechanism for direct window mappings. This uses
>>> common code and speeds up the direct mapping case by avoiding indirect
>>> calls when not using dma ops at all. It also fixes a problem where
>>> the sync_* methods were using the bypass check for DMA allocations, but
>>> those are part of the streaming ops.
>>>
>>> Note that this patch loses the DMA_ATTR_WEAK_ORDERING override, which
>>> has never been well defined, and is only used by a few drivers, which
>>> IIRC never showed up in the typical Cell blade setups that are affected
>>> by the ordering workaround.
>>>
>>> Fixes: efd176a04bef ("powerpc/pseries/dma: Allow SWIOTLB")
>>> Signed-off-by: Christoph Hellwig <hch@xxxxxx>
>>> ---
>>> arch/powerpc/Kconfig | 1 +
>>> arch/powerpc/include/asm/device.h | 5 --
>>> arch/powerpc/kernel/dma-iommu.c | 90 ++++---------------------------
>>> 3 files changed, 10 insertions(+), 86 deletions(-)
>>
>> I am seeing corruptions on a couple of POWER9 systems (boston) when
>> stressed with IO. stress-ng gives some results, but I first saw it
>> when compiling the kernel in a guest, and this is still the best way
>> to trigger the issue.
>>
>> These systems have a SAS Adaptec controller:
>>
>> 0003:01:00.0 Serial Attached SCSI controller: Adaptec Series 8 12G SAS/PCIe 3 (rev 01)
>>
>> When the failure occurs, the POWERPC EEH interrupt fires and dumps
>> low-level PHB4 registers, among which:
>>
>> [ 2179.251069490,3] PHB#0003[0:3]: phbErrorStatus = 0000028000000000
>> [ 2179.251117476,3] PHB#0003[0:3]: phbFirstErrorStatus = 0000020000000000
>>
>> The bits raised identify a PPC 'TCE' error, which means it is related
>> to DMAs. See below for more details.
>>
>>
>> Reverting this patch "fixes" the issue, but the root cause is probably
>> elsewhere, in some other layer or in the aacraid driver. How should I
>> proceed to get more information?
>
> The aacraid DMA masks look like a mess. Can you try the hack
> below and see if it helps?

No effect. The system crashes the same way. But Alexey spotted an issue
with swiotlb.

C.