Re: [PATCH 2/8] bus: fsl-mc: handle DMA config deferral in ACPI case

From: Laurentiu Tudor
Date: Wed Nov 17 2021 - 10:30:42 EST

On 11/17/2021 3:59 PM, Daniel Thompson wrote:
> On Wed, Nov 17, 2021 at 03:07:51PM +0200, Laurentiu Tudor wrote:
>> On 11/12/2021 7:31 PM, Daniel Thompson wrote:
>>> On Thu, Nov 11, 2021 at 06:36:58PM +0100, Jon Nettleton wrote:
>>>> On Thu, Nov 11, 2021 at 6:23 PM Daniel Thompson
>>>> <daniel.thompson@xxxxxxxxxx> wrote:
>>>>> Hi Laurentiu
>>>>>
>>>>> On Thu, Jul 15, 2021 at 05:07:12PM +0300, laurentiu.tudor@xxxxxxx wrote:
>>>>>> From: Laurentiu Tudor <laurentiu.tudor@xxxxxxx>
>>>>>>
>>>>>> ACPI DMA configure API may return a defer status code, so handle it.
>>>>>> On top of this, move the MC firmware resume after the DMA setup
>>>>>> is completed to avoid crashing due to DMA setup not being done yet or
>>>>>> being deferred.
>>>>>>
>>>>>> Signed-off-by: Laurentiu Tudor <laurentiu.tudor@xxxxxxx>
>>>>>
>>>>> I saw regressions on my Honeycomb LX2 (NXP LX2060A) when I switched to
>>>>> v5.15. It seems like it results in so many sMMU errors that the system
>>>>> cannot function correctly (it's only about a 75% chance the system will
>>>>> boot to GUI and even if it does boot successfully the system will hang
>>>>> up soon after).
>>>>>
>>>>> Bisect took me up a couple of blind alleys (mostly due to unrelated boot
>>>>> problems in v5.14-rc2) but eventually led me to this patch as the cause.
>>>>> Applying/unapplying this patch to a v5.14-rc3 tree will provoke/fix the
>>>>> problem and reverting it against v5.15 also resolves the problem.
>>>>>
>>>>> Is there some specific firmware version required for this patch to work
>>>>> correctly?
>>>>
>>>> This patch was merged as a requirement for operational on-board networking.
>>>> This was merged as a prerequisite to landing the patches to support MDIO and
>>>> phy initialization in general.
>>>
>>> Interesting.
>>>
>>> I assumed the change of behaviour comes from properly handling
>>> -EPROBE_DEFER (which can hardly be regarded as a fault with the patch).
>>>
>>> Having said that the patch does not seem to be mandatory to get the 1G
>>> networking working on Honeycomb LX2 (running ACPI). By taking v5.15 and
>>> reverting as I shared previously, I am still able to access the network
>>> using the 1G port on the back of the unit (although I didn't do any
>>> performance tests).
>>>
>>>
>>>> The correct solution for the problem you are seeing is the ACPI
>>>> maintainers figuring out how to land the IORT RMR patchset. Until
>>>> that is done the only workaround is setting "arm-smmu.disable_bypass=0
>>>> iommu.passthrough=1" on the kernel commandline. The latter option is
>>>> required since 5.15 and I haven't had time or energy to figure out
>>>> why. The proper solution is to just land the IORT RMR patchset and
>>>> let HoneyComb run with the SMMU enabled.
>>>
>>> Thanks for the update. I'll probably adopt iommu.passthrough=1 for now.
>>> That allows me to adopt a distro kernel when it updates to v5.15.
>>
>> The "iommu.passthrough=1" kernel arg shouldn't be needed. By chance, do
>> you remember what errors you were seeing? What was failing?
>
> For all testing of v5.15 I had "arm-smmu.disable_bypass=0" set because I
> was guided to enable that by the error messages in older kernels ;-) .
>
> Anyhow without "iommu.passthrough=1" (and without the patch from this thread
> reverted) then the logs are being massively spammed with error messages:
>
> ~~~
> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> arm-smmu arm-smmu.0.auto: Unhandled context fault: fsr=0x402, iova=0x23e0000100, fsynr=0x20040, cbfrsynra=0x4000, cb=0
> arm_smmu_context_fault: 1697259 callbacks suppressed
> ~~~
>
> This results in a relatively simple workstation (LX2 + nVidia GT-710 + USB
> for networking) becoming unresponsive. How long to fail is a little
> unpredictable. I assumed that the weight of such dense log messages
> eventually gets into a timing pattern that prevented any useful
> interrupts from being serviced... but that is only a guess.
>

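To see at a glance whether all the faults hit the same address and context
bank, a log like the one quoted above can be summarized with a small script.
This is only a sketch: the regular expression is written against the exact
line format shown in the quote, and the helper name summarize_faults is made
up for illustration.

```python
import re
import sys
from collections import Counter

# Matches "Unhandled context fault" lines in the format quoted above.
FAULT_RE = re.compile(
    r"Unhandled context fault: "
    r"fsr=(?P<fsr>0x[0-9a-fA-F]+), iova=(?P<iova>0x[0-9a-fA-F]+), "
    r"fsynr=(?P<fsynr>0x[0-9a-fA-F]+), cbfrsynra=(?P<cbfrsynra>0x[0-9a-fA-F]+), "
    r"cb=(?P<cb>\d+)"
)


def summarize_faults(log_text):
    """Count faults per (iova, cbfrsynra, cb) tuple, with values as ints."""
    counts = Counter()
    for line in log_text.splitlines():
        m = FAULT_RE.search(line)
        if m is None:
            continue  # skip other lines, e.g. "callbacks suppressed"
        key = (int(m["iova"], 16), int(m["cbfrsynra"], 16), int(m["cb"]))
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Reads a kernel log from stdin and prints the most frequent faults first.
    for (iova, cbfrsynra, cb), n in summarize_faults(sys.stdin.read()).most_common():
        print(f"{n:8d}  iova={iova:#x}  cbfrsynra={cbfrsynra:#x}  cb={cb}")
```

Something like "dmesg | python3 faults.py" (the file name is arbitrary) should
then show whether a single iova/cbfrsynra pair accounts for all of the spam.
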
A few comments:
- I suspect that the PCI video card is triggering the SMMU faults.
Would it be possible to give it a try with the card out and without
"iommu.passthrough=1"?
- The IOVAs look weird to me. They should look something like
0xffffxxxxxx or so. Maybe there are issues in the nvidia driver?
- Would it be possible to share a full boot log? It would be
interesting to see how the devices are allocated in iommu groups.
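
In case it is easier than digging through the boot log, the group allocation
can also be read back from sysfs on the running system. A minimal sketch,
assuming the standard /sys/kernel/iommu_groups/<N>/devices layout; the
function name iommu_groups is just for illustration:

```python
import os


def iommu_groups(root="/sys/kernel/iommu_groups"):
    """Return {group_number: sorted list of device names} read from sysfs."""
    groups = {}
    if not os.path.isdir(root):
        return groups  # nothing exposed, e.g. no IOMMU groups were created
    for entry in os.listdir(root):
        devices_dir = os.path.join(root, entry, "devices")
        if entry.isdigit() and os.path.isdir(devices_dir):
            # Each entry under devices/ is a symlink named after the device.
            groups[int(entry)] = sorted(os.listdir(devices_dir))
    return groups


if __name__ == "__main__":
    for group, devices in sorted(iommu_groups().items()):
        print(f"group {group}: {' '.join(devices)}")
```

The output lists one line per group, so it should be easy to spot which group
the video card ended up in and whether it shares it with anything else.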

---
Thanks & Best Regards, Laurentiu