Re: mmc0: Got data interrupt 0x04000000 even though no data operation was in progress.
From: Adrian Hunter
Date: Thu Aug 08 2024 - 05:47:49 EST
On 7/08/24 00:35, Gratian Crisan wrote:
>
> Adrian Hunter <adrian.hunter@xxxxxxxxx> writes:
>> On 6/08/24 00:33, Gratian Crisan wrote:
>>>
>>> We are getting the following splat on latest 6.11.0-rc2-00002-gc813111d19e6 (and
>>> older) kernel(s):
>>
>> Do you know a kernel version that does not get an error?
>>
>
> Sorry for not being more clear in my original email - this is not a new issue. I
> believe this Bay Trail hardware always had an issue with receiving "Tuning
> Error" interrupts with certain SD cards. At least as far back as 4.9.47.
>
> Up until commit b3855668d98c ("mmc: sdhci: Add support for "Tuning Error"
> interrupts") these resulted in a "mmc0: Unexpected interrupt 0x04000000" splat,
> which b3855668d98c fixed.
>
> However, now that "Tuning Error" interrupts are treated as data interrupts and
> handled in sdhci_data_irq() we are hitting a corner case where that tuning error
> interrupt comes in after a MMC_SEND_STATUS command with no 'host->data'
> associated resulting in the new splat.
Ok, thanks for clarifying.
> Hence the question in my previous email: Should the tuning error interrupts be
> handled in common code in sdhci_irq()?
>
>>>
>>> [ 4.792991] mmc0: new ultra high speed DDR50 SDHC card at address 0001
>>> [ 4.793550] with environment:
>>> [ 4.793786] HOME=/
>>> [ 4.793985] TERM=linux
>>> [ 4.794201] BOOT_IMAGE=/runmode/bzImage
>>> [ 4.794485] sys_reset=false
>>> [ 4.795791] mmcblk0: mmc0:0001 0016G 15.2 GiB
>>> [ 5.333153] mmc0: Got data interrupt 0x04000000 even though no data operation was in progress.
>>> [ 5.333676] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
>>> [ 5.334069] mmc0: sdhci: Sys addr: 0x12454200 | Version: 0x0000b502
>>> [ 5.334464] mmc0: sdhci: Blk size: 0x00007040 | Blk cnt: 0x00000001
>>> [ 5.334860] mmc0: sdhci: Argument: 0x00010000 | Trn mode: 0x00000010
>>> [ 5.335253] mmc0: sdhci: Present: 0x01ff0000 | Host ctl: 0x00000016
>>> [ 5.335648] mmc0: sdhci: Power: 0x0000000f | Blk gap: 0x00000000
>>> [ 5.336040] mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x00000107
>>> [ 5.336432] mmc0: sdhci: Timeout: 0x0000000a | Int stat: 0x00000000
>>> [ 5.336824] mmc0: sdhci: Int enab: 0x03ff008b | Sig enab: 0x03ff008b
>>> [ 5.337214] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>> [ 5.337605] mmc0: sdhci: Caps: 0x076864b2 | Caps_1: 0x00000004
>>> [ 5.337997] mmc0: sdhci: Cmd: 0x00000d1a | Max curr: 0x00000000
>>> [ 5.338389] mmc0: sdhci: Resp[0]: 0x00400900 | Resp[1]: 0x00000000
>>> [ 5.338780] mmc0: sdhci: Resp[2]: 0x00000000 | Resp[3]: 0x00000000
>>> [ 5.339170] mmc0: sdhci: Host ctl2: 0x0000000c
>>> [ 5.339468] mmc0: sdhci: ADMA Err: 0x00000003 | ADMA Ptr: 0x12454200
>>> [ 5.339859] mmc0: sdhci: ============================================
>>> [ 5.340293] I/O error, dev mmcblk0, sector 0 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
>>> [ 5.344663] Buffer I/O error on dev mmcblk0, logical block 0, async page read
>>> [ 5.346127] mmcblk0: p1 p2
>>>
>>> This is on an Intel Bay Trail based system: NI cRIO-9053 using an Atom E3805.
>>>
>>> The issue appears related to the one fixed by commit b3855668d98c ("mmc: sdhci:
>>> Add support for "Tuning Error" interrupts") and discussed here[1].
>>
>> Does reverting that commit help?
>>
>
> Reverting the commit brings back the original splat that commit fixed (albeit
> without the I/O error):
>
> [ 4.893032] mmc0: new ultra high speed DDR50 SDHC card at address 0001
> [ 4.896238] mmcblk0: mmc0:0001 0016G 15.2 GiB
> [ 4.905944] mmc0: Unexpected interrupt 0x04000000.
> [ 4.906272] mmc0: sdhci: ============ SDHCI REGISTER DUMP ===========
> [ 4.906664] mmc0: sdhci: Sys addr: 0x126e6200 | Version: 0x0000b502
> [ 4.907059] mmc0: sdhci: Blk size: 0x00007200 | Blk cnt: 0x00000008
> [ 4.907451] mmc0: sdhci: Argument: 0x00000000 | Trn mode: 0x0000003b
> [ 4.907842] mmc0: sdhci: Present: 0x01ff0206 | Host ctl: 0x00000016
> [ 4.908234] mmc0: sdhci: Power: 0x0000000f | Blk gap: 0x00000000
> [ 4.908625] mmc0: sdhci: Wake-up: 0x00000000 | Clock: 0x00000107
> [ 4.909015] mmc0: sdhci: Timeout: 0x0000000a | Int stat: 0x00000002
> [ 4.909408] mmc0: sdhci: Int enab: 0x03ff008b | Sig enab: 0x03ff008b
> [ 4.909800] mmc0: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000001
> [ 4.910193] mmc0: sdhci: Caps: 0x076864b2 | Caps_1: 0x00000004
> [ 4.910581] mmc0: sdhci: Cmd: 0x0000123a | Max curr: 0x00000000
> [ 4.910976] mmc0: sdhci: Resp[0]: 0x00000900 | Resp[1]: 0x00400900
> [ 4.911371] mmc0: sdhci: Resp[2]: 0x00000000 | Resp[3]: 0x00400900
> [ 4.911765] mmc0: sdhci: Host ctl2: 0x0000000c
> [ 4.912064] mmc0: sdhci: ADMA Err: 0x00000000 | ADMA Ptr: 0x126e6200
> [ 4.912456] mmc0: sdhci: ============================================
> [ 4.913301] mmcblk0: p1 p2
> [ 6.401855] EXT4-fs (mmcblk1p2): mounted filesystem d57a3d3c-a1f9-4f8e-8cbc-19dc5bb4fc4c r/w with ordered data mode. Quota mode: disabled.
>
>>> I am new to this area of the kernel so I would appreciate any suggestions on the
>>> direction to take here:
>>>
>>> - Should the tuning error interrupts be handled in common code in sdhci_irq()
>>> (or at least before the !host->data check in sdhci_data_irq())?
>>>
>>> - Is this more of an issue with tuning not happening when is expected or
>>> taking too long, since at first we do get the error during data transfer
>>> commands? Suggestions on what I should debug/trace next appreciated.
>>
>> SDHCI driver does not enable the "Tuning Error" interrupt, refer
>> the kernel messages above:
>>
>> Int enab: 0x03ff008b | Sig enab: 0x03ff008b
>>
>> but it happens anyway, so the "fix" was to handle it anyway.
>>
>> But it begs the question, wasn't the error happening already?
>>
>
> Kind of: Before we were getting "mmc0: Unexpected interrupt 0x04000000", but
> somehow it didn't result in a I/O error. That may be just lucky timing.
>
> Now we're getting "mmc0: Got data interrupt 0x04000000 even though no data operation was in
> progress." followed by an I/O error on READ.
>
> I appreciate your reply. I'm happy to work on a patch or test things if I'm
> pointed in the right direction.
Note that neither "Got data interrupt ... even though no data
operation was in progress" nor "Unexpected interrupt ..." messages
result in I/O errors directly, but indicate that the host controller
is not behaving as expected.
However, according to the spec. "Tuning Error" interrupt status is
an unrecoverable error, so commit b3855668d98c ("mmc: sdhci: Add
support for "Tuning Error" interrupts") began treating it that way
and causing a re-tune and retry, as Hans reported that it made things
work for his devices.
You should first get a better idea of what circumstances errors are
occurring. Enabling debug messages should help, but it will cause
a very large number of messages and you may also want to increase
the kernel message buffer size (CONFIG_LOG_BUF_SHIFT)
Dynamic debug for mmc:
Kernel must be configured:
CONFIG_DYNAMIC_DEBUG=y
To enable mmc debug via sysfs:
echo 'file drivers/mmc/core/* +p' > /sys/kernel/debug/dynamic_debug/control
echo 'file drivers/mmc/host/* +p' > /sys/kernel/debug/dynamic_debug/control
To enable mmc debug via kernel command line:
dyndbg="file drivers/mmc/core/* +p;file drivers/mmc/host/* +p"
To disable mmc debug:
echo 'file drivers/mmc/core/* -p' > /sys/kernel/debug/dynamic_debug/control
echo 'file drivers/mmc/host/* -p' > /sys/kernel/debug/dynamic_debug/control
More general information in kernel documentation in kernel tree:
Documentation/admin-guide/dynamic-debug-howto.rst