RE: [PATCH] net: ftgmac100: Fix missing TX-poll issue

From: Dylan Hung
Date: Fri Oct 23 2020 - 09:08:39 EST


> -----Original Message-----
> From: Andrew Jeffery [mailto:andrew@xxxxxxxx]
> Sent: Wednesday, October 21, 2020 6:26 AM
> To: Benjamin Herrenschmidt <benh@xxxxxxxxxxxxxxxxxxx>; Arnd Bergmann
> <arnd@xxxxxxxx>; Dylan Hung <dylan_hung@xxxxxxxxxxxxxx>
> Cc: BMC-SW <BMC-SW@xxxxxxxxxxxxxx>; linux-aspeed
> <linux-aspeed@xxxxxxxxxxxxxxxx>; Po-Yu Chuang <ratbert@xxxxxxxxxxxxxxxx>;
> netdev <netdev@xxxxxxxxxxxxxxx>; OpenBMC Maillist
> <openbmc@xxxxxxxxxxxxxxxx>; Linux Kernel Mailing List
> <linux-kernel@xxxxxxxxxxxxxxx>; Jakub Kicinski <kuba@xxxxxxxxxx>; David
> Miller <davem@xxxxxxxxxxxxx>
> Subject: Re: [PATCH] net: ftgmac100: Fix missing TX-poll issue
>
>
>
> On Wed, 21 Oct 2020, at 08:40, Benjamin Herrenschmidt wrote:
> > On Tue, 2020-10-20 at 21:49 +0200, Arnd Bergmann wrote:
> > > On Tue, Oct 20, 2020 at 11:37 AM Dylan Hung
> <dylan_hung@xxxxxxxxxxxxxx> wrote:
> > > > > +1 @first is system memory from dma_alloc_coherent(), right?
> > > > >
> > > > > You shouldn't have to do this. Is coherent DMA memory broken on
> > > > > your platform?
> > > >
> > > > It is about the arbitration on the DRAM controller. There are two
> queues in the dram controller, one is for the CPU access and the other is for
> the HW engines.
> > > > When CPU issues a store command, the dram controller just
> acknowledges cpu's request and pushes the request into the queue. Then
> CPU triggers the HW MAC engine, the HW engine starts to fetch the DMA
> memory.
> > > > But since the cpu's request may still stay in the queue, the HW engine
> may fetch the wrong data.
> >
> > Actually, I take back what I said earlier, the above seems to imply
> > this is more generic.
> >
> > Dylan, please confirm, does this affect *all* DMA capable devices ? If
> > yes, then it's a really really bad design bug in your chips
> > unfortunately and the proper fix is indeed to make dma_wmb() do a
> > dummy read of some sort (what address though ? would any dummy
> > non-cachable page do ?) to force the data out as *all* drivers will
> > potentially be affected.
> >

The issue was found on our test chip (ast2600 version A0) which is just for testing and won't be mass-produced. This HW bug has been fixed on ast2600 A1 and later versions.

To verify the HW fix, I run overnight iperf and kvm tests on ast2600A1 without this patch, and get stable result without hanging.
So I think we can discard this patch.

> > I was under the impression that it was a specific timing issue in the
> > vhub and ethernet parts, but if it's more generic then it needs to be
> > fixed globally.
> >
>
> We see a similar issue in the XDMA engine where it can transfer stale data to
> the host. I think the driver ended up using memcpy_toio() to work around that
> despite using a DMA reserved memory region.