Re: [PATCH v6 1/1] dmaengine: switchtec-dma: Introduce Switchtec DMA engine PCI driver

From: Kelvin.Cao
Date: Tue Oct 10 2023 - 17:23:35 EST


On Mon, 2023-10-09 at 11:08 +0530, Vinod Koul wrote:
> On 06-10-23, 22:34, Kelvin.Cao@xxxxxxxxxxxxx wrote:
> > On Fri, 2023-10-06 at 16:00 +0530, Vinod Koul wrote:
> > > On 05-10-23, 18:35, Kelvin.Cao@xxxxxxxxxxxxx wrote:
> > >
> > > > > > > +static struct dma_async_tx_descriptor *
> > > > > > > +switchtec_dma_prep_wimm_data(struct dma_chan *c, dma_addr_t dma_dst, u64 data,
> > > > > > > +                          unsigned long flags)
> > > > > >
> > > > > > can you please explain what this wimm data refers to...
> > > > > >
> > > > > > I think adding imm callback was a mistake, we need a better
> > > > > > justification for another user for this, who programs this, what
> > > > > > gets programmed here
> > > > >
> > > > > Sure. I think it's an alternative method to prep_mem and would
> > > > > be more convenient to use when the write is 8-byte and the data
> > > > > to be moved is not in a DMA mapped memory location. For example,
> > > > > we write to a doorbell register with the value from a local
> > > > > variable which is not associated with a DMA address to notify
> > > > > the receiver to consume the data, after confirming that the
> > > > > previously initiated DMA transactions of the data have
> > > > > completed. I agree that the use scenario would be very limited.
> > >
> > > Can you please explain more about this 'value', where is it
> > > derived from? Who programs it and how...
> >
> > Sure. Think of a producer/consumer use case where the producer is a
> > host DMA client driver and the consumer is a PCIe EP. On the EP,
> > there
>
> What are the examples of DMA clients here?
We have a Non-Transparent Bridge (NTB) feature on the same switch, which
has scratchpad/doorbell registers that can be used for completion
notification of data movement (or any other notification). The NTB
feature is presented to the host as a PCIe EP, and the registers are
memory-mapped in a BAR. When used in combination with DMA, a DMA client
that moves data through an NTB memory window (also memory-mapped in a
BAR) to another host would typically move the bulk data first and then
notify its peer to consume it.
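
To make the flow concrete, here is a rough sketch of such a client (all
names are made up for illustration; chan is a Switchtec DMA channel,
src/dst/db are DMA addresses the client has already set up, and
dma_prep_wimm_data() stands for the callback proposed in this patch;
error handling omitted):

	struct dma_async_tx_descriptor *desc;

	/* 1) move the bulk data through the NTB memory window */
	desc = dmaengine_prep_dma_memcpy(chan, dst, src, len, 0);
	dmaengine_submit(desc);

	/* 2) ring the peer's doorbell with the transfer size */
	/* (ordering between 1) and 2) is discussed below) */
	desc = dma_prep_wimm_data(chan, db, len, 0);
	dmaengine_submit(desc);

	dma_async_issue_pending(chan);
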
>
> > is a memory-mapped data buffer for receiving data and a memory-mapped
> > doorbell buffer for triggering data consumption. For each bulk data
> > transfer, the DMA client driver first DMAs the data of size X to the
> > memory-mapped data buffer, then it DMAs the value X to the doorbell
> > buffer to trigger data consumption on the device. On receiving a
> > doorbell write, the device starts to consume the data of size X from
> > the data buffer.
> >
> > For the first DMA operation, the DMA client would use
> > dma_prep_memcpy() like most DMA clients do. However, for the second
> > transfer, value X is held in a local variable like below.
> >
> > u64 size_to_transfer;
>
> Why can't the client driver write to the doorbell, is there anything
> which prevents us from doing so?

I think the potential challenge for the client driver to ring the
doorbell is that the client driver (host RC) is a different requester in
the PCIe hierarchy than the DMA EP, in which case PCIe ordering needs to
be considered.

As PCIe ensures that reads don't pass writes, we can insert a read DMA
operation with the DMA_PREP_FENCE flag in between the two DMA writes
(one for the data transfer and one for the notification) to ensure the
ordering for the same requester, the DMA EP. I'm not sure the RC could
ensure the same ordering if the client driver issued the MMIO write to
the doorbell after the data DMA and read DMA complete, such that the
consumer is guaranteed the transferred data is in memory when the
doorbell is triggered by the client MMIO write. I guess it's still
doable with an MMIO write, but some special consideration is needed.
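
To illustrate the read-fence trick (again a sketch with made-up names;
readback_dma is any small DMA-mapped scratch buffer on the host):

	/* data write to the NTB window */
	desc = dmaengine_prep_dma_memcpy(chan, dst, src, len, 0);
	dmaengine_submit(desc);

	/*
	 * Small fenced read from the destination; since PCIe reads
	 * don't pass writes, this read completes only after the data
	 * write above has been pushed out.
	 */
	desc = dmaengine_prep_dma_memcpy(chan, readback_dma, dst, 4,
					 DMA_PREP_FENCE);
	dmaengine_submit(desc);

	/* doorbell write, fenced behind the read */
	desc = dma_prep_wimm_data(chan, db, len, DMA_PREP_FENCE);
	dmaengine_submit(desc);

	dma_async_issue_pending(chan);
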
>
> >
> > In this case, the client would use dma_prep_wimm_data() to DMA value
> > X to the doorbell buffer, like below.
> >
> > dma_prep_wimm_data(chan, dst_for_db_buffer, size_to_transfer, flag);
> >
> > Hope this example explains the thing. People would argue that the
> > client could use the same dma_prep_memcpy() for the doorbell ringing.
> > I would agree; this API just adds an alternative way to do so when
> > the data is as little as 64 bits, and it also saves a call site to
> > dma_alloc_coherent() to allocate a source DMA buffer just for holding
> > the 8-byte value, which is required by dma_prep_memcpy().
>
> I would prefer that (after the option of MMIO write from the client);
> there is nothing special about this, just another txn. You can queue
> up both and have dmaengine write to the doorbell immediately after the
> buffer completes
Yes, I agree, we can just queue another dma_prep_memcpy() for the
doorbell. dma_prep_wimm_data() is just an alternative in this case. We
can definitely do it with dma_prep_memcpy().
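
For reference, the memcpy-based doorbell would look roughly like this
(sketch; dev, db_src and db_val are made-up names):

	/*
	 * One-time setup: an 8-byte coherent buffer just to hold the
	 * doorbell value, since dma_prep_memcpy() needs a DMA-mapped
	 * source.
	 */
	dma_addr_t db_src;
	u64 *db_val = dma_alloc_coherent(dev, sizeof(*db_val), &db_src,
					 GFP_KERNEL);

	/* per transfer: stage the value, then queue the memcpy */
	*db_val = size_to_transfer;
	desc = dmaengine_prep_dma_memcpy(chan, dst_for_db_buffer, db_src,
					 sizeof(*db_val), 0);
	dmaengine_submit(desc);
	dma_async_issue_pending(chan);
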
>
> >
> >
> > > > > > > +     /* set sq/cq */
> > > > > > > +     writel(lower_32_bits(swdma_chan->dma_addr_sq), &chan_fw->sq_base_lo);
> > > > > > > +     writel(upper_32_bits(swdma_chan->dma_addr_sq), &chan_fw->sq_base_hi);
> > > > > > > +     writel(lower_32_bits(swdma_chan->dma_addr_cq), &chan_fw->cq_base_lo);
> > > > > > > +     writel(upper_32_bits(swdma_chan->dma_addr_cq), &chan_fw->cq_base_hi);
> > > > > > > +
> > > > > > > +     writew(SWITCHTEC_DMA_SQ_SIZE, &swdma_chan->mmio_chan_fw->sq_size);
> > > > > > > +     writew(SWITCHTEC_DMA_CQ_SIZE, &swdma_chan->mmio_chan_fw->cq_size);
> > > > > >
> > > > > > why is a write happening in the descriptor alloc callback?
> > > > > > that does not sound correct to me
> > > > >
> > > > > All the queue descriptors of a channel are pre-allocated, so I
> > > > > think it's proper to convey the queue address/size to hardware
> > > > > at this point. After this initialization, we only need to
> > > > > assign cookie in submit and update queue head to hardware in
> > > > > issue_pending.
> > >
> > > Sorry that is not right, you can prepare multiple descriptors and
> > > then submit. Only at submit is the cookie assigned which is in
> > > order, so this should be moved to when we start the txn and not in
> > > this call
> > >
> > The hardware assumes the SQ/CQ is a contiguous circular buffer of
> > fixed-size elements. And the above code passes the address and size
> > of the SQ/CQ to the hardware. At this point the hardware will do
> > nothing but take note of the SQ/CQ location/size.
> >
> > When dma_issue_pending() is called, the actual SQ head write will
> > occur to update hardware with the current SQ head, as below:
> >
> > writew(swdma_chan->head, &swdma_chan->mmio_chan_hw->sq_tail);
>
> But if the order of txn submission is changed you will issue dma txn
> in the wrong order, so it should be written only when the desc is
> submitted, not in the prep invocation!

The driver implements the SQ with a pre-allocated buffer, which means
the descriptor we are preparing has a determined slot in the queue when
we do dma_prep_memcpy() and before we submit it. To align with the way
cookies complete, I have to make sure that the descriptors are prepared
and submitted in order. I think it's also how some other DMA drivers
work, like ioat and plx.
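
In other words, the flow looks roughly like this (simplified sketch of
the approach, not the exact driver code; fill_hw_desc() is a stand-in
for the real descriptor setup):

	/* prep: the descriptor takes the next slot in the circular SQ */
	hw_desc = &swdma_chan->hw_sq[swdma_chan->head];
	fill_hw_desc(hw_desc, dma_dst, dma_src, len);
	swdma_chan->head = (swdma_chan->head + 1) % SWITCHTEC_DMA_SQ_SIZE;

	/* submit: only the cookie is assigned, in order */

	/* issue_pending: publish the new head so hardware starts fetching */
	writew(swdma_chan->head, &swdma_chan->mmio_chan_hw->sq_tail);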


Thanks,
Kelvin