Re: xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 2 comp_code 1

From: Mathias Nyman
Date: Fri Jul 24 2015 - 07:56:10 EST


On 22.07.2015 17:12, Arkadiusz Miskiewicz wrote:
>
> [sorry, resend from different email - vger postmaster team has stupid filters
> in place]
>
> On Tuesday 21 of July 2015, Mathias Nyman wrote:
>> On 20.07.2015 23:13, Arkadiusz Miskiewicz wrote:
>>> On Saturday 18 of July 2015, Arkadiusz Miskiewicz wrote:
>>>> Hi.
>>>>
>>>> I'm on 4.2.0-rc2-00077-gf760b87 kernel and while trying to copy some
>>>> file from usb storage (sata disk behind sata-usb bridge or pendrive;
>>>> hapens in
>>>
>>>> both cases) copying process hangs just early after start with:
>>> Looks like suspend & resume is enough. Reloading bluetooth firmware done
>>> by kernel triggers problem:
>>>
>>> [ 106.302783] rtc_cmos 00:02: System wakeup disabled by ACPI
>>> [ 106.313280] PM: resume of devices complete after 3003.032 msecs
>>> [ 106.314079] Restarting tasks ... done.
>>> [ 106.326434] Bluetooth: hci0: read Intel version: 370710018002030d00
>>> [ 106.330422] Bluetooth: hci0: Intel Bluetooth firmware file:
>>> intel/ibt-hw-37.7.10-fw-1.80.2.3.d.bseq [ 106.398223] xhci_hcd
>>> 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD
>>> ep_index 0 comp_code 1
>>
>> Looks like we get an event for a really old transfer for some reason, it
>> should probably be handled already.
>>
>> I got a patch that adds more paranoid checks for TRB cancel, which has been
>> one major reasons for the "Transfer event TRB DMA ptr not part of current
>> TD" Errors. It also adds some logging to show what's went wrong. (patch
>> attached, applies on 4.2-rc3) Can you see if it helps?
>
> It doesn't unfortunately. 4.2.0-rc3-00017-gd725e66 + that patch
> -> dmesg http://sprunge.us/PDIE
>
> around 91s - after resume from ram bluetooth driver reloads
>
> around 754s - tried to copy data from external usb disk
>
>
>> If it doesn't, then adding xhci debugging could give us some clue:
>> echo -n 'module xhci_hcd =p' > /sys/kernel/debug/dynamic_debug/control
>
> Ok, http://sprunge.us/GiHX
>

Thanks for the logs, They show that the error is related to transfer descriptors that wrap around
on the endpoint ring buffer by exactly one transfer block.

I don't know yet why this happens, and I might need some help running additional debug
patches to solve this. I'll take a more in depth look at the code one more time first.

A short explanation of the error, mostly for myself:

To transfer data we have a ring buffer that holds transfer requests blocks (TRBs).
The ring buffer is made up of two segments,
the last TRB in each segment is a special link TRB that points to the next segment.

Segment A: 0x3000 - 0x3ff0 (where link TRB at 0x3ff0 points to Segment B, 0x4000)
Segment B: 0x4000 - 0x4ff0 (where link TRB at 0x4ff0 points back at Segment A, 0x3000)

A tranfer descriptor (TD) can consist of many TRBs, in this case three TRBs.
When a TD is completed we will get an event telling where the last transferred TRB of the TD was.

So in this case the error:
xhci_hcd 0000:00:14.0: ERROR Transfer event TRB DMA ptr not part of current TD ep_index 0 comp_code 1
xhci_hcd 0000:00:14.0: Looking for event-dma 00000000fffd3000 trb-start 00000000fffd4fd0 trb-end 00000000fffd5000
seg-start 00000000fffd4000 seg-end 00000000fffd4ff0

tells us the first TRB of the TD (trb-start) is at 4fd0,
The second TRB is at 4fe0,
Then we have the special link TRB be at 4ff0, pointing us back to the first segment.
The third and final TRB should be back at the first segment at 3000.

We get an event for the last TRB at 3000 and all should be fine, but driver claims the TDs TRBs start at 4fd0,
and the last TRB is at 5000, missing the link TRB wrapping us back to the first segment.


-Mathias









--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/