Re: ASMedia ASM1062 (AHCI) hang after "ahci 0000:28:00.0: Using 64-bit DMA addresses"

From: Niklas Cassel
Date: Wed Jan 24 2024 - 05:15:59 EST


On Tue, Jan 23, 2024 at 11:00:44PM +0200, Lennert Buytenhek wrote:
> On Wed, Jan 17, 2024 at 11:52:25PM +0100, Niklas Cassel wrote:

(snip)

> This all suggests to me that the ASM1061 drops the upper 21 bits of all
> DMA addresses. Going back to the original report, on the Asus Pro WS
> WRX80E-SAGE SE WIFI, we also see DMA addresses that seem to have been
> capped to 43 bits:
>
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: Using 64-bit DMA addresses
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00000 flags=0x0000]
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00300 flags=0x0000]
> > [Thu Jan 4 23:12:54 2024] ahci 0000:28:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0035 address=0x7fffff00380 flags=0x0000]
>
> Since in this test the X570 AHCI controller is inside the chipset and
> the ASM1061 in a PCIe slot, this doesn't 100% prove that the ASM1061 is
> at fault (e.g. the upstream IOMMUs for the X570 AHCI controller and the
> ASM1061 could be behaving differently), and to 100% prove this theory I
> would have to find a non-ASM1061 AHCI controller and put it in the same
> PCIe slot as the ASM1061 is currently in, and try to make it DMA to
> address 0xffffffff00000000, and verify that the I/O page faults on the
> host report 0xffffffff00000000 and not 0x7fffff00000 -- but I think that
> the current evidence is perhaps good enough?

It does indeed look like the same issue on both the onboard ASMedia ASM1061
on your Asus Pro WS WRX80E-SAGE SE WIFI and the standalone ASMedia ASM1061
PCIe card connected to your other, X570-based motherboard.

However, the ASMedia ASM1061 seems to be quite common, so I'm surprised
that no one has reported this problem before. What has changed?
Perhaps a recent kernel patch introduced this?

There is a change that was first introduced in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4bf7fda4dce22214c70c49960b1b6438e6260b67
then reverted in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=af3e9579ecfbe1796334bb25a2f0a6437983673a
and then reintroduced in a new form in:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=791c2b17fb4023f21c3cbf5f268af01d9b8cb7cc

I suppose that these commits might be recent enough that we have not yet
received any bug reports for the ASMedia ASM1061 since then.


If you can find another PCIe card (e.g. an AHCI controller or NVMe controller)
that you can plug into the same slot on the X570 motherboard,
I agree that it would confirm your theory.


If you don't have any other PCIe card, do you possibly have another system
with an IOMMU and a free PCIe slot, where you can plug in your ASMedia
ASM1061 PCIe card and perform the same test?

(Preferably something that is not AMD, to rule out an amd_iommu issue,
since both the Asus Pro WS WRX80E-SAGE SE WIFI and the X570 use amd_iommu.)

If we see the same behavior there, i.e. the device dropping the upper
21 bits when using the trick in your test patch, that would also confirm
your theory.
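
To make the arithmetic concrete, here is a small sketch in plain userspace C
(not kernel code) of what "dropping the upper 21 bits" would do to a 64-bit
DMA address. The helper name truncate_dma_addr() is made up for illustration,
and the 43-bit width is the theory under discussion, not a documented property
of the device.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Keep only the low `bits` bits of a DMA address, modelling a device
 * whose address path is narrower than the 64 bits it advertises.
 * Hypothetical helper, for illustration only.
 */
static inline uint64_t truncate_dma_addr(uint64_t addr, unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);

	/* Shifting a 64-bit value by 64 is undefined, so handle it separately. */
	if (bits == 64)
		return addr;

	return addr & ((UINT64_C(1) << bits) - 1);
}
```

With bits = 43, truncate_dma_addr(0xfffffffffff00000, 43) yields
0x7fffff00000, which is the first IO_PAGE_FAULT address in the log quoted
above. That the address actually programmed was 0xfffffffffff00000 is an
assumption here, since the log only shows the address the IOMMU saw.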


>
> There are two ways to handle this -- either set the DMA mask for ASM106x
> parts to 43 bits, or take the lazy route and just use AHCI_HFLAG_32BIT_ONLY
> for these parts. I feel that the former would be more appropriate, as
> there seem to be plenty of bits beyond bit 31 that do work, but I will
> defer to your judgement on this matter. What do you think the right way
> to handle this apparent hardware quirk is?

I've seen something similar for NVMe, where some NVMe controllers from
Amazon were violating the spec, and only supported 48-bit DMA addresses,
even though the NVMe spec requires you to support 64-bit DMA addresses, see:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4bdf260362b3be529d170b04662638fd6dc52241

It is possible that ASMedia ASM1061 has a similar problem (but for AHCI)
and only supports 43-bit DMA addresses, even though it sets AHCI CAP.S64A,
which says "Indicates whether the HBA can access 64-bit data structures.".

I think the best thing is to add a similar quirk, where we set the dma_mask
accordingly.
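
As a rough sketch of the idea, in plain userspace C rather than as a patch:
choose the DMA mask width from CAP.S64A plus per-device quirk flags.
DMA_BIT_MASK() is copied from the kernel's definition and CAP.S64A is bit 31
of the HBA capabilities register, but HFLAG_43BIT_ONLY, the flag values and
the sketch_*() helpers are all made up for illustration and are not existing
libata code.

```c
#include <assert.h>
#include <stdint.h>

/* Same definition as the kernel's DMA_BIT_MASK(). */
#define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* CAP.S64A: bit 31 of the AHCI HBA capabilities register. */
#define HOST_CAP_64		(1U << 31)

/* Illustrative flag values only; they do not match the kernel's. */
#define HFLAG_32BIT_ONLY	(1U << 0)
#define HFLAG_43BIT_ONLY	(1U << 1)	/* hypothetical quirk flag */

/* Pick the number of usable DMA address bits for a controller. */
static inline unsigned int sketch_dma_bits(uint32_t cap, unsigned int hflags)
{
	/* No S64A, or the existing 32-bit quirk: stay within 32 bits. */
	if (!(cap & HOST_CAP_64) || (hflags & HFLAG_32BIT_ONLY))
		return 32;

	/* S64A is set, but the device is assumed to honour only 43 bits. */
	if (hflags & HFLAG_43BIT_ONLY)
		return 43;

	return 64;
}

/* Turn the chosen width into the mask a driver would hand to the DMA API. */
static inline uint64_t sketch_dma_mask(uint32_t cap, unsigned int hflags)
{
	unsigned int bits = sketch_dma_bits(cap, hflags);

	assert(bits >= 1 && bits <= 64);
	return DMA_BIT_MASK(bits);
}
```

In a real driver the resulting width would be passed on through something
like dma_set_mask_and_coherent(dev, DMA_BIT_MASK(bits)). Compared to forcing
32-bit DMA, a 43-bit mask keeps the address bits above bit 31 that do appear
to work.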


Kind regards,
Niklas