Re: [PATCH 3/6] async_tx: Handle DMA devices having support for fewer PQ coefficients

From: Anup Patel
Date: Sun Feb 05 2017 - 22:55:43 EST

Next message: Derek Robson: "[PATCH] Staging: iio: addac: adt7316.c - style fix, octal permission"
Previous message: Chen-Yu Tsai: "Re: [PATCH] clk: sunxi-ng: select SUNXI_CCU_MULT for sun5i"
In reply to: Dan Williams: "Re: [PATCH 3/6] async_tx: Handle DMA devices having support for fewer PQ coefficients"
Next in thread: Anup Patel: "[PATCH 4/6] async_tx: Fix DMA_PREP_FENCE usage in do_async_gen_syndrome()"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

On Sat, Feb 4, 2017 at 12:12 AM, Dan Williams <dan.j.williams@xxxxxxxxx> wrote:
> On Fri, Feb 3, 2017 at 2:59 AM, Anup Patel <anup.patel@xxxxxxxxxxxx> wrote:
>>
>>
>> On Thu, Feb 2, 2017 at 11:31 AM, Dan Williams <dan.j.williams@xxxxxxxxx>
>> wrote:
>>>
>>> On Wed, Feb 1, 2017 at 8:47 PM, Anup Patel <anup.patel@xxxxxxxxxxxx>
>>> wrote:
>>> > The DMAENGINE framework assumes that if PQ offload is supported by a
>>> > DMA device then all 256 PQ coefficients are supported. This assumption
>>> > does not hold anymore because we now have BCM-SBA-RAID offload engine
>>> > which supports PQ offload with limited number of PQ coefficients.
>>> >
>>> > This patch extends async_tx APIs to handle DMA devices with support
>>> > for fewer PQ coefficients.
>>> >
>>> > Signed-off-by: Anup Patel <anup.patel@xxxxxxxxxxxx>
>>> > Reviewed-by: Scott Branden <scott.branden@xxxxxxxxxxxx>
>>> > ---
>>> > crypto/async_tx/async_pq.c | 3 +++
>>> > crypto/async_tx/async_raid6_recov.c | 12 ++++++++++--
>>> > include/linux/dmaengine.h | 19 +++++++++++++++++++
>>> > include/linux/raid/pq.h | 3 +++
>>> > 4 files changed, 35 insertions(+), 2 deletions(-)
>>>
>>> So, I hate the way async_tx does these checks on each operation, and
>>> it's ok for me to say that because it's my fault. Really it's md that
>>> should be validating engine offload capabilities once at the beginning
>>> of time. I'd rather we move in that direction than continue to pile
>>> onto a bad design.
>>
>>
>> Yes, indeed. All async_tx APIs have lot of checks and for high throughput
>> RAID offload engine these checks can add some overhead.
>>
>> I think doing checks in Linux md would be certainly better but this would
>> mean lot of changes in Linux md as well as remove checks in async_tx.
>>
>> Also, async_tx APIs should not find DMA channel on its own instead it
>> should rely on Linux md to provide DMA channel pointer as parameter.
>>
>> It's better to do checks cleanup in async_tx as separate patchset and
>> keep this patchset simple.
>
> That's been the problem with async_tx being broken like this for
> years. Once you get this "small / simple" patch upstream, that
> arguably makes async_tx a little bit worse, there is no longer any
> motivation to fix the underlying issues. If you care about the long
> term health of raid offload and are enabling new hardware support you
> should first tackle the known problems with it before adding new
> features.

Apart from the checks-related issue you pointed out, there are other
issues with the async_tx APIs, such as:

1. The mechanism for an update PQ (or RAID6 update) operation
in the current async_tx APIs is to call async_gen_syndrome() twice
with the ASYNC_TX_PQ_XOR_DST flag set. Also, async_gen_syndrome()
always prefers the SW approach when the ASYNC_TX_PQ_XOR_DST flag
is set. This means the async_tx API forces the SW approach for
update PQ and, in addition, requires two async_gen_syndrome()
calls to achieve it. These limitations of async_gen_syndrome()
reduce the performance of the async_tx APIs. Instead, we should
have a dedicated async_update_pq() API which would allow RAID
offload engine drivers (such as BCM-FS4-RAID) to implement
update PQ using HW offload. This new API would fall back to the
SW approach using async_gen_syndrome() if no DMA channel
provides update PQ HW offload.

2. In our stress testing, we have observed that the dma_map_page()
and dma_unmap_page() calls used in various async_tx APIs are the
major cause of overhead. If we directly call the DMA channel
callbacks with pre-DMA-mapped pages, we get very high throughput.
The async_tx APIs should provide a way to pass pre-DMA-mapped
pages so that Linux MD can exploit this for better performance.

3. We really don't have a test module to stress/benchmark all
async_tx APIs using multi-threading and batching a large number
of requests in each thread. This kind of test module is very much
required for performance benchmarking and for stressing high
throughput (hundreds of Gbps) RAID offload engines (such as
BCM-FS4-RAID).
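
To make point 1 concrete, here is a minimal user-space sketch of the
arithmetic behind an update PQ operation. It is not kernel code and not
the proposed async_update_pq() API. The helper names (gf_mul, gf_pow2,
gen_syndrome, update_pq, update_pq_selftest) are hypothetical and exist
only for illustration. It assumes the usual RAID6 construction over
GF(2^8) with polynomial 0x11d and generator 2, and shows that a single
pass over (old data XOR new data) yields the same P and Q as
regenerating the syndrome from all data blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multiply two GF(2^8) elements, reducing by the RAID6 polynomial 0x11d. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
	uint8_t r = 0;

	while (b) {
		if (b & 1)
			r ^= a;
		a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0));
		b >>= 1;
	}
	return r;
}

/* Q coefficient for data disk idx: 2^idx in GF(2^8). */
static uint8_t gf_pow2(unsigned int idx)
{
	uint8_t r = 1;

	while (idx--)
		r = gf_mul(r, 2);
	return r;
}

/* Full generation: P = XOR of all D_i, Q = XOR of 2^i * D_i. */
static void gen_syndrome(unsigned int disks, size_t len,
			 const uint8_t *const *data, uint8_t *p, uint8_t *q)
{
	memset(p, 0, len);
	memset(q, 0, len);
	for (unsigned int d = 0; d < disks; d++) {
		uint8_t c = gf_pow2(d);

		for (size_t i = 0; i < len; i++) {
			p[i] ^= data[d][i];
			q[i] ^= gf_mul(c, data[d][i]);
		}
	}
}

/* Hypothetical single-pass update of P and Q when only disk idx changes. */
static void update_pq(unsigned int idx, size_t len, const uint8_t *old_d,
		      const uint8_t *new_d, uint8_t *p, uint8_t *q)
{
	uint8_t c = gf_pow2(idx);

	assert(old_d && new_d && p && q);
	for (size_t i = 0; i < len; i++) {
		uint8_t delta = old_d[i] ^ new_d[i];

		p[i] ^= delta;
		q[i] ^= gf_mul(c, delta);
	}
}

/* Returns 0 if the updated P/Q match a full regeneration. */
int update_pq_selftest(void)
{
	enum { DISKS = 4, LEN = 16, CHANGED = 2 };
	uint8_t data[DISKS][LEN], new_d[LEN];
	uint8_t p[LEN], q[LEN], p_ref[LEN], q_ref[LEN];
	const uint8_t *ptrs[DISKS];

	for (unsigned int d = 0; d < DISKS; d++) {
		for (size_t i = 0; i < LEN; i++)
			data[d][i] = (uint8_t)(17 * d + 3 * i + 1);
		ptrs[d] = data[d];
	}
	gen_syndrome(DISKS, LEN, ptrs, p, q);

	/* Rewrite one data block and update P/Q in a single pass. */
	for (size_t i = 0; i < LEN; i++)
		new_d[i] = (uint8_t)(0xa5 ^ (7 * i));
	update_pq(CHANGED, LEN, data[CHANGED], new_d, p, q);

	/* Reference result: regenerate from the full, modified stripe. */
	memcpy(data[CHANGED], new_d, LEN);
	gen_syndrome(DISKS, LEN, ptrs, p_ref, q_ref);

	return memcmp(p, p_ref, LEN) || memcmp(q, q_ref, LEN);
}
```

Calling update_pq_selftest() from a small main() should return 0. In
this sketch the update touches only the changed block plus P and Q,
which is the kind of work a dedicated update PQ offload could take on.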

Of the above, we already have an async_tx_test module to
address point 3. We also plan to address point 1, but
this would also require changes in Linux MD to use the new
async_update_pq() API.

As you can see, this patchset is not the end of the story for us if we
want the best possible utilization of BCM-FS4-RAID.

Regards,
Anup
