Re: [PATCH] [RFC] drivers: dma-coherent: use MEMREMAP_WB instead of MEMREMAP_WC

From: Jaewon Kim
Date: Fri Nov 11 2016 - 04:54:11 EST


Hi

On 2016ë 11ì 10ì 18:51, Brian Starkey wrote:
> Hi Jaewon,
>
> On Thu, Nov 10, 2016 at 10:41:43AM +0900, Jaewon Kim wrote:
>> Hi
>>
>> On 2016ë 11ì 09ì 19:23, Brian Starkey wrote:
>>> Hi,
>>>
>>> On Wed, Nov 09, 2016 at 06:47:26PM +0900, Jaewon Kim wrote:
>>>>
>>>>
>>>> On 2016ë 11ì 09ì 18:27, Brian Starkey wrote:
>>>>> Hi Jaewon,
>>>>>
>>>>> On Wed, Nov 09, 2016 at 06:10:09PM +0900, Jaewon Kim wrote:
>>>>>> Commit 6b03ae0d42bf (drivers: dma-coherent: use MEMREMAP_WC for DMA_MEMORY_MA)
>>>>>> added MEMREMAP_WC for DMA_MEMORY_MAP. If, however, CPU cache can be used on
>>>>>> DMA_MEMORY_MAP, I think MEMREMAP_WC can be changed to MEMREMAP_WB. On my local
>>>>>> ARM device, memset in dma_alloc_from_coherent sometimes takes much longer with
>>>>>> MEMREMAP_WC compared to MEMREMAP_WB.
>>>>>>
>>>>>> Test results on AArch64 by allocating 4MB with putting trace_printk right
>>>>>> before and after memset.
>>>>>> MEMREMAP_WC : 11.0ms, 5.7ms, 4.2ms, 4.9ms, 5.4ms, 4.3ms, 3.5ms
>>>>>> MEMREMAP_WB : 0.7ms, 0.6ms, 0.6ms, 0.6ms, 0.6ms, 0.5ms, 0.4 ms
>>>>>>
>>>>>
>>>>> This doesn't look like a good idea to me. The point of coherent memory
>>>>> is to have it non-cached, however WB will make writes hit the cache.
>>>>>
>>>>> Writing to the cache is of course faster than writing to RAM, but
>>>>> that's not what we want to do here.
>>>>>
>>>>> -Brian
>>>>>
>>>> Hi Brian
>>>>
>>>> Thank you for your comment.
>>>> If allocated memory will be used by TZ side, however, I think cacheable
>>>> also can be used to be fast on memset in dma_alloc_from_coherent.
>>>
>>> Are you trying to share the buffer between the secure and non-secure
>>> worlds on the CPU? In that case, I don't think caching really helps
>>> you. I'm not a TZ expert, but I believe the two worlds can never
>>> share cached data.
>> I do not want memory sharing between the secure and non-secure worlds.
>> I just want faster allocation.
>
> So now I'm confused. Why did you mention TZ?
>
> Could you explain what your use-case for the buffer you are allocating
> is?
I wanted to send physically contiguous memory to TZapp size at Linux runtime.
And during discussion I realized I need to consider more if dma-coherent is proper approach only for TZapp.
But if another DMA master is joined, l think we can think this issue again.
Like secure HW device get memory from dma-coherent and it passes to TZapp.
>
>>
>> I am not a TZ expert, either. I also think they cannot share cached data.
>> As far as I know secure world can decide its cache policy with secure
>> world page table regardless of non-secure world.
>>> If you want the secure world to see the non-secure world's data, as
>>> far as I know you will need to clean the cache in the non-secure world
>>> to make sure the secure world can see it (and vice-versa). I'd expect
>>> this to remove most of the speed advantage of using WB in the first
>>> place, except for some possible speedup from more efficient bursting.
>> Yes I also think non-secure world need to clean the cache before secure world
>> access the memory region to avoid invalid data issue. But if other software
>> like Linux driver or hypervisor do the cache cleaning, or engineer confirm,
>> then we may be able to use MEMREMAP_WB or just skipping memset for faster
>> memory allocation in dma_alloc_from_coherent.
>
> Skipping the memset doesn't sound like a good plan, you'd potentially
> be leaking information to whatever device receives the memory. And
> adding WB mappings to an API which is intended to be used for sharing
> buffers with DMA masters doesn't sound like a good move either.
>
>>>
>>> If you're sharing the buffer with other DMA masters, regardless of
>>> secure/non-secure you're not going to want WB mappings.
>>>
>> If there is a scenario where another DMA master works on this memory,
>> an engineer, I think, need to consider cache clean if he/she uses WB.
>
> The whole point of dma-coherent memory is for use by DMA masters.
>
>>>> How do you think to add another flag to distinguish this case?
>>>
>>> You could look into the streaming DMA API. It will depend on the exact
>>> implementation, but at some point you're still going to have to pay
>>> the penalty of syncing the CPU and device.
>>>
>>> -Brian
>>>
>> I cannot find DMA API and flag for WB. So I am considering additional flag
>> to meet my request. In my opinion the flag can be either WB or non-zeroing.
>
> To me, it sounds like dma-coherent isn't the right tool to achieve
> what you want, but I'm not clear on exactly what it is you're trying
> to do (I know you want faster allocations, the point is what for?)
>
I actually looking into enum dma_attr which has DMA_ATTR_SKIP_ZEROING.
If dma_alloc_attrs in arch/arm64/include/asm/dma-mapping.h passes struct dma_attrs *attrs to dma_alloc_from_coherent,
I think I can do what I want.

Thank you for your comment.
> -Brian
>
>>
>> For case #1 - DMA_MEMORY_MAP_WB
>> --- a/drivers/base/dma-coherent.c
>> +++ b/drivers/base/dma-coherent.c
>> @@ -32,7 +32,9 @@ static bool dma_init_coherent_memory(
>> if (!size)
>> goto out;
>>
>> - if (flags & DMA_MEMORY_MAP)
>> + if (flags & DMA_MEMORY_MAP_WB)
>> + mem_base = memremap(phys_addr, size, MEMREMAP_WB);
>> + else if (flags & DMA_MEMORY_MAP)
>> mem_base = memremap(phys_addr, size, MEMREMAP_WC);
>> else
>> mem_base = ioremap(phys_addr, size);
>>
>> For case #2 - DMA_MEMORY_MAP_NOZEROING
>> --- a/drivers/base/dma-coherent.c
>> +++ b/drivers/base/dma-coherent.c
>> @@ -190,6 +190,8 @@ int dma_alloc_from_coherent(struct device *dev, ssize_t size,
>> *ret = mem->virt_base + (pageno << PAGE_SHIFT);
>> dma_memory_map = (mem->flags & DMA_MEMORY_MAP);
>> spin_unlock_irqrestore(&mem->spinlock, flags);
>> + if (mem->flags & DMA_MEMORY_MAP_NOZEROING)
>> + return 1;
>> if (dma_memory_map)
>> memset(*ret, 0, size);
>> else
>>
>> Can I get your comment?
>>
>> Thank you
>> Jaewon Kim
>>
>>>>>> Signed-off-by: Jaewon Kim <jaewon31.kim@xxxxxxxxxxx>
>>>>>> ---
>>>>>> drivers/base/dma-coherent.c | 2 +-
>>>>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/drivers/base/dma-coherent.c b/drivers/base/dma-coherent.c
>>>>>> index 640a7e6..0512a1d 100644
>>>>>> --- a/drivers/base/dma-coherent.c
>>>>>> +++ b/drivers/base/dma-coherent.c
>>>>>> @@ -33,7 +33,7 @@ static bool dma_init_coherent_memory(
>>>>>> goto out;
>>>>>>
>>>>>> if (flags & DMA_MEMORY_MAP)
>>>>>> - mem_base = memremap(phys_addr, size, MEMREMAP_WC);
>>>>>> + mem_base = memremap(phys_addr, size, MEMREMAP_WB);
>>>>>> else
>>>>>> mem_base = ioremap(phys_addr, size);
>>>>>> if (!mem_base)
>>>>>> --
>>>>>> 1.9.1
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>