RE: using DMA-API on ARM

From: Hante Meuleman
Date: Fri Dec 05 2014 - 06:49:55 EST

Thank you for investigating. Your analysis matches what was intended to be done. Regarding the cast, you're right, I'll improve it. My idea was that long long is always 64bit, but when casting directly to long long I get a compiler (when using C=2) warning: "warning: right shift by bigger than source value". Probably I concluded wrongly that I should cast it to long first. On the targets used the high address was always 0, so I got away with it thusfar, but indeed it should and shall be fixed.


-----Original Message-----
From: Russell King - ARM Linux [mailto:linux@xxxxxxxxxxxxxxxx]
Sent: vrijdag 5 december 2014 12:11
To: Arnd Bergmann
Cc: linux-arm-kernel@xxxxxxxxxxxxxxxxxxx; Arend Van Spriel; linux-wireless; brcm80211-dev-list; David Miller; linux-kernel@xxxxxxxxxxxxxxx
Subject: Re: using DMA-API on ARM

On Fri, Dec 05, 2014 at 10:52:02AM +0100, Arnd Bergmann wrote:
> I'm still puzzled why you'd need a single dma_sync_single_for_cpu()
> after dma_alloc_coherent though, you should not need any. Is it possible
> that the driver accidentally uses __raw_readl() instead of readl()
> in some places and you are just lacking an appropriate barrier?

Digging into the driver, it looks like individual DMA buffers are
allocated (via brcmf_pcie_init_dmabuffer_for_device) and registered
into a "commonring" layer.

Whenever the buffer is written to, space is first allocated via a call
to brcmf_commonring_reserve_for_write() or
brcmf_commonring_reserve_for_write_multiple(), data written to the
buffer, followed by a call to brcmf_commonring_write_complete().

brcmf_commonring_write_complete() calls two methods at that point:
cr_write_wptr() and cr_ring_bell(), which will be
brcmf_pcie_ring_mb_write_wptr() and brcmf_pcie_ring_mb_ring_bell().

The first calls brcmf_pcie_write_tcm16(), which uses iowrite16(),
which contains the appropriate barrier. The bell ringing functions
also use ioread*/iowrite*().

So, on the write side, it looks fine from the barrier perspective.

On the read side, brcmf_commonring_get_read_ptr() is used before
a read access to the ring - which calls the cr_update_wptr() method,
which in turn uses an ioread16() call. After the CPU has read data
from the ring, brcmf_commonring_read_complete() is used, which uses

So, I don't see a barrier problem on the read side.

However, I did trip over this:

static void *
brcmf_pcie_init_dmabuffer_for_device(struct brcmf_pciedev_info *devinfo,
u32 size, u32 tcm_dma_phys_addr,
dma_addr_t *dma_handle)
void *ring;
long long address;

ring = dma_alloc_coherent(&devinfo->pdev->dev, size, dma_handle,
if (!ring)
return NULL;

address = (long long)(long)*dma_handle;

Casting to (long) will truncate the DMA handle to 32-bits on a 32-bit
architecture, even if it supports 64-bit DMA addresses. There's a couple
of other places where this same truncation occurs:

address = (long long)(long)devinfo->shared.scratch_dmahandle;


address = (long long)(long)devinfo->shared.ringupd_dmahandle;

In any case, wouldn't using a u64 type for "address" be better - isn't
"long long" 128-bit on 64-bit architectures?

FTTC broadband for 0.8mile line: currently at 9.5Mbps down 400kbps up
according to
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at
Please read the FAQ at