Re: [PATCH for-next 2/4] RDMA/hns: Add IOMMU enable support in hip08

From: Wei Hu (Xavier)
Date: Wed Nov 08 2017 - 20:17:26 EST




On 2017/11/7 14:32, Leon Romanovsky wrote:
> On Tue, Nov 07, 2017 at 10:45:29AM +0800, Wei Hu (Xavier) wrote:
>>
>> On 2017/11/1 20:26, Robin Murphy wrote:
>>> On 01/11/17 07:46, Wei Hu (Xavier) wrote:
>>>> On 2017/10/12 20:59, Robin Murphy wrote:
>>>>> On 12/10/17 13:31, Wei Hu (Xavier) wrote:
>>>>>> On 2017/10/1 0:10, Leon Romanovsky wrote:
>>>>>>> On Sat, Sep 30, 2017 at 05:28:59PM +0800, Wei Hu (Xavier) wrote:
>>>>>>>> If the IOMMU is enabled, the length of sg obtained from
>>>>>>>> __iommu_map_sg_attrs is not 4kB. When the IOVA is set with the sg
>>>>>>>> dma address, the IOVA will not be page continuous. and the VA
>>>>>>>> returned from dma_alloc_coherent is a vmalloc address. However,
>>>>>>>> the VA obtained by the page_address is a discontinuous VA. Under
>>>>>>>> these circumstances, the IOVA should be calculated based on the
>>>>>>>> sg length, and record the VA returned from dma_alloc_coherent
>>>>>>>> in the struct of hem.
>>>>>>>>
>>>>>>>> Signed-off-by: Wei Hu (Xavier) <xavier.huwei@xxxxxxxxxx>
>>>>>>>> Signed-off-by: Shaobo Xu <xushaobo2@xxxxxxxxxx>
>>>>>>>> Signed-off-by: Lijun Ou <oulijun@xxxxxxxxxx>
>>>>>>>> ---
>>>>>>> Doug,
>>>>>>>
>>>>>>> I didn't invest time in reviewing it, but having "is_vmalloc_addr" in
>>>>>>> driver code to deal with dma_alloc_coherent is most probably wrong.
>>>>>>>
>>>>>>> Thanks
>>>>>> Hi, Leon & Doug
>>>>>> We refered the function named __ttm_dma_alloc_page in the kernel
>>>>>> code as below:
>>>>>> And there are similar methods in bch_bio_map and mem_to_page
>>>>>> functions in current 4.14-rcx.
>>>>>>
>>>>>> static struct dma_page *__ttm_dma_alloc_page(struct dma_pool *pool)
>>>>>> {
>>>>>> struct dma_page *d_page;
>>>>>>
>>>>>> d_page = kmalloc(sizeof(struct dma_page), GFP_KERNEL);
>>>>>> if (!d_page)
>>>>>> return NULL;
>>>>>>
>>>>>> d_page->vaddr = dma_alloc_coherent(pool->dev, pool->size,
>>>>>> &d_page->dma,
>>>>>> pool->gfp_flags);
>>>>>> if (d_page->vaddr) {
>>>>>> if (is_vmalloc_addr(d_page->vaddr))
>>>>>> d_page->p = vmalloc_to_page(d_page->vaddr);
>>>>>> else
>>>>>> d_page->p = virt_to_page(d_page->vaddr);
>>>>> There are cases on various architectures where neither of those is
>>>>> right. Whether those actually intersect with TTM or RDMA use-cases is
>>>>> another matter, of course.
>>>>>
>>>>> What definitely is a problem is if you ever take that page and end up
>>>>> accessing it through any virtual address other than the one explicitly
>>>>> returned by dma_alloc_coherent(). That can blow the coherency wide open
>>>>> and invite data loss, right up to killing the whole system with a
>>>>> machine check on certain architectures.
>>>>>
>>>>> Robin.
>>>> Hi, Robin
>>>> Thanks for your comment.
>>>>
>>>> We have one problem and the related code as below.
>>>> 1. call dma_alloc_coherent function serval times to alloc memory.
>>>> 2. vmap the allocated memory pages.
>>>> 3. software access memory by using the return virt addr of vmap
>>>> and hardware using the dma addr of dma_alloc_coherent.
>>> The simple answer is "don't do that". Seriously. dma_alloc_coherent()
>>> gives you a CPU virtual address and a DMA address with which to access
>>> your buffer, and that is the limit of what you may infer about it. You
>>> have no guarantee that the virtual address is either in the linear map
>>> or vmalloc, and not some other special place. You have no guarantee that
>>> the underlying memory even has an associated struct page at all.
>>>
>>>> When IOMMU is disabled in ARM64 architecture, we use virt_to_page()
>>>> before vmap(), it works. And when IOMMU is enabled using
>>>> virt_to_page() will cause calltrace later, we found the return
>>>> addr of dma_alloc_coherent is vmalloc addr, so we add the
>>>> condition judgement statement as below, it works.
>>>> for (i = 0; i < buf->nbufs; ++i)
>>>> pages[i] =
>>>> is_vmalloc_addr(buf->page_list[i].buf) ?
>>>> vmalloc_to_page(buf->page_list[i].buf) :
>>>> virt_to_page(buf->page_list[i].buf);
>>>> Can you give us suggestion? better method?
>>> Oh my goodness, having now taken a closer look at this driver, I'm lost
>>> for words in disbelief. To pick just one example:
>>>
>>> u32 bits_per_long = BITS_PER_LONG;
>>> ...
>>> if (bits_per_long == 64) {
>>> /* memory mapping nonsense */
>>> }
>>>
>>> WTF does the size of a long have to do with DMA buffer management!?
>>>
>>> Of course I can guess that it might be trying to make some tortuous
>>> inference about vmalloc space being constrained on 32-bit platforms, but
>>> still...
>>>
>>>> The related code as below:
>>>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>>>> GFP_KERNEL);
>>>> if (!buf->page_list)
>>>> return -ENOMEM;
>>>>
>>>> for (i = 0; i < buf->nbufs; ++i) {
>>>> buf->page_list[i].buf = dma_alloc_coherent(dev,
>>>> page_size, &t,
>>>> GFP_KERNEL);
>>>> if (!buf->page_list[i].buf)
>>>> goto err_free;
>>>>
>>>> buf->page_list[i].map = t;
>>>> memset(buf->page_list[i].buf, 0, page_size);
>>>> }
>>>>
>>>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>>>> GFP_KERNEL);
>>>> if (!pages)
>>>> goto err_free;
>>>>
>>>> for (i = 0; i < buf->nbufs; ++i)
>>>> pages[i] =
>>>> is_vmalloc_addr(buf->page_list[i].buf) ?
>>>> vmalloc_to_page(buf->page_list[i].buf) :
>>>> virt_to_page(buf->page_list[i].buf);
>>>>
>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>> PAGE_KERNEL);
>>>> kfree(pages);
>>>> if (!buf->direct.buf)
>>>> goto err_free;
>>> OK, this is complete crap. As above, you cannot assume that a struct
>>> page even exists; even if it does you cannot assume that using a
>>> PAGE_KERNEL mapping will not result in mismatched attributes,
>>> unpredictable behaviour and data loss. Trying to remap coherent DMA
>>> allocations like this is just egregiously wrong.
>>>
>>> What I do like is that you can seemingly fix all this by simply deleting
>>> hns_roce_buf::direct and all the garbage code related to it, and using
>>> the page_list entries consistently because the alternate paths involving
>>> those appear to do the right thing already.
>>>
>>> That is, of course, assuming that the buffers involved can be so large
>>> that it's not practical to just always make a single allocation and
>>> fragment it into multiple descriptors if the hardware does have some
>>> maximum length constraint - frankly I'm a little puzzled by the
>>> PAGE_SIZE * 2 threshold, given that that's not a fixed size.
>>>
>>> Robin.
>> HiïRobin
>>
>> We reconstruct the code as below:
>> It replaces dma_alloc_coherent with __get_free_pages and
>> dma_map_single
>> functions. So, we can vmap serveral ptrs returned by
>> __get_free_pages, right?
> Most probably not, you should get rid of your virt_to_page/vmap calls.
>
> Thanks
Hi, Leon
Thanks for your suggestion.
I will send a patch to fix it.

Regards
Wei Hu
>>
>> buf->page_list = kcalloc(buf->nbufs, sizeof(*buf->page_list),
>> GFP_KERNEL);
>> if (!buf->page_list)
>> return -ENOMEM;
>>
>> for (i = 0; i < buf->nbufs; ++i) {
>> ptr = (void *)__get_free_pages(GFP_KERNEL | __GFP_ZERO,
>> get_order(page_size));
>> if (!ptr) {
>> dev_err(dev, "Alloc pages error.\n");
>> goto err_free;
>> }
>>
>> t = dma_map_single(dev, ptr, page_size,
>> DMA_BIDIRECTIONAL);
>> if (dma_mapping_error(dev, t)) {
>> dev_err(dev, "DMA mapping error.\n");
>> free_pages((unsigned long)ptr,
>> get_order(page_size));
>> goto err_free;
>> }
>>
>> buf->page_list[i].buf = ptr;
>> buf->page_list[i].map = t;
>> }
>>
>> pages = kmalloc_array(buf->nbufs, sizeof(*pages),
>> GFP_KERNEL);
>> if (!pages)
>> goto err_free;
>>
>> for (i = 0; i < buf->nbufs; ++i)
>> pages[i] = virt_to_page(buf->page_list[i].buf);
>>
>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>> PAGE_KERNEL);
>> kfree(pages);
>> if (!buf->direct.buf)
>> goto err_free;
>>
>>
>> Regards
>> Wei Hu
>>>> Regards
>>>> Wei Hu
>>>>>> } else {
>>>>>> kfree(d_page);
>>>>>> d_page = NULL;
>>>>>> }
>>>>>> return d_page;
>>>>>> }
>>>>>>
>>>>>> Regards
>>>>>> Wei Hu
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_alloc.c | 5 ++++-
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.c | 30
>>>>>>>> +++++++++++++++++++++++++++---
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_hem.h | 6 ++++++
>>>>>>>> drivers/infiniband/hw/hns/hns_roce_hw_v2.c | 22 +++++++++++++++-------
>>>>>>>> 4 files changed, 52 insertions(+), 11 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> index 3e4c525..a69cd4b 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_alloc.c
>>>>>>>> @@ -243,7 +243,10 @@ int hns_roce_buf_alloc(struct hns_roce_dev
>>>>>>>> *hr_dev, u32 size, u32 max_direct,
>>>>>>>> goto err_free;
>>>>>>>>
>>>>>>>> for (i = 0; i < buf->nbufs; ++i)
>>>>>>>> - pages[i] = virt_to_page(buf->page_list[i].buf);
>>>>>>>> + pages[i] =
>>>>>>>> + is_vmalloc_addr(buf->page_list[i].buf) ?
>>>>>>>> + vmalloc_to_page(buf->page_list[i].buf) :
>>>>>>>> + virt_to_page(buf->page_list[i].buf);
>>>>>>>>
>>>>>>>> buf->direct.buf = vmap(pages, buf->nbufs, VM_MAP,
>>>>>>>> PAGE_KERNEL);
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> index 8388ae2..4a3d1d4 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.c
>>>>>>>> @@ -200,6 +200,7 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> gfp_t gfp_mask)
>>>>>>>> {
>>>>>>>> struct hns_roce_hem_chunk *chunk = NULL;
>>>>>>>> + struct hns_roce_vmalloc *vmalloc;
>>>>>>>> struct hns_roce_hem *hem;
>>>>>>>> struct scatterlist *mem;
>>>>>>>> int order;
>>>>>>>> @@ -227,6 +228,7 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> sg_init_table(chunk->mem, HNS_ROCE_HEM_CHUNK_LEN);
>>>>>>>> chunk->npages = 0;
>>>>>>>> chunk->nsg = 0;
>>>>>>>> + memset(chunk->vmalloc, 0, sizeof(chunk->vmalloc));
>>>>>>>> list_add_tail(&chunk->list, &hem->chunk_list);
>>>>>>>> }
>>>>>>>>
>>>>>>>> @@ -243,7 +245,15 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> if (!buf)
>>>>>>>> goto fail;
>>>>>>>>
>>>>>>>> - sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>>>> + if (is_vmalloc_addr(buf)) {
>>>>>>>> + vmalloc = &chunk->vmalloc[chunk->npages];
>>>>>>>> + vmalloc->is_vmalloc_addr = true;
>>>>>>>> + vmalloc->vmalloc_addr = buf;
>>>>>>>> + sg_set_page(mem, vmalloc_to_page(buf),
>>>>>>>> + PAGE_SIZE << order, offset_in_page(buf));
>>>>>>>> + } else {
>>>>>>>> + sg_set_buf(mem, buf, PAGE_SIZE << order);
>>>>>>>> + }
>>>>>>>> WARN_ON(mem->offset);
>>>>>>>> sg_dma_len(mem) = PAGE_SIZE << order;
>>>>>>>>
>>>>>>>> @@ -262,17 +272,25 @@ static struct hns_roce_hem
>>>>>>>> *hns_roce_alloc_hem(struct hns_roce_dev *hr_dev,
>>>>>>>> void hns_roce_free_hem(struct hns_roce_dev *hr_dev, struct
>>>>>>>> hns_roce_hem *hem)
>>>>>>>> {
>>>>>>>> struct hns_roce_hem_chunk *chunk, *tmp;
>>>>>>>> + void *cpu_addr;
>>>>>>>> int i;
>>>>>>>>
>>>>>>>> if (!hem)
>>>>>>>> return;
>>>>>>>>
>>>>>>>> list_for_each_entry_safe(chunk, tmp, &hem->chunk_list, list) {
>>>>>>>> - for (i = 0; i < chunk->npages; ++i)
>>>>>>>> + for (i = 0; i < chunk->npages; ++i) {
>>>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr)
>>>>>>>> + cpu_addr = chunk->vmalloc[i].vmalloc_addr;
>>>>>>>> + else
>>>>>>>> + cpu_addr =
>>>>>>>> + lowmem_page_address(sg_page(&chunk->mem[i]));
>>>>>>>> +
>>>>>>>> dma_free_coherent(hr_dev->dev,
>>>>>>>> chunk->mem[i].length,
>>>>>>>> - lowmem_page_address(sg_page(&chunk->mem[i])),
>>>>>>>> + cpu_addr,
>>>>>>>> sg_dma_address(&chunk->mem[i]));
>>>>>>>> + }
>>>>>>>> kfree(chunk);
>>>>>>>> }
>>>>>>>>
>>>>>>>> @@ -774,6 +792,12 @@ void *hns_roce_table_find(struct hns_roce_dev
>>>>>>>> *hr_dev,
>>>>>>>>
>>>>>>>> if (chunk->mem[i].length > (u32)offset) {
>>>>>>>> page = sg_page(&chunk->mem[i]);
>>>>>>>> + if (chunk->vmalloc[i].is_vmalloc_addr) {
>>>>>>>> + mutex_unlock(&table->mutex);
>>>>>>>> + return page ?
>>>>>>>> + chunk->vmalloc[i].vmalloc_addr
>>>>>>>> + + offset : NULL;
>>>>>>>> + }
>>>>>>>> goto out;
>>>>>>>> }
>>>>>>>> offset -= chunk->mem[i].length;
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> index af28bbf..62d712a 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hem.h
>>>>>>>> @@ -72,11 +72,17 @@ enum {
>>>>>>>> HNS_ROCE_HEM_PAGE_SIZE = 1 << HNS_ROCE_HEM_PAGE_SHIFT,
>>>>>>>> };
>>>>>>>>
>>>>>>>> +struct hns_roce_vmalloc {
>>>>>>>> + bool is_vmalloc_addr;
>>>>>>>> + void *vmalloc_addr;
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> struct hns_roce_hem_chunk {
>>>>>>>> struct list_head list;
>>>>>>>> int npages;
>>>>>>>> int nsg;
>>>>>>>> struct scatterlist mem[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>>>> + struct hns_roce_vmalloc vmalloc[HNS_ROCE_HEM_CHUNK_LEN];
>>>>>>>> };
>>>>>>>>
>>>>>>>> struct hns_roce_hem {
>>>>>>>> diff --git a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> index b99d70a..9e19bf1 100644
>>>>>>>> --- a/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> +++ b/drivers/infiniband/hw/hns/hns_roce_hw_v2.c
>>>>>>>> @@ -1093,9 +1093,11 @@ static int hns_roce_v2_write_mtpt(void
>>>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>>> {
>>>>>>>> struct hns_roce_v2_mpt_entry *mpt_entry;
>>>>>>>> struct scatterlist *sg;
>>>>>>>> + u64 page_addr = 0;
>>>>>>>> u64 *pages;
>>>>>>>> + int i = 0, j = 0;
>>>>>>>> + int len = 0;
>>>>>>>> int entry;
>>>>>>>> - int i;
>>>>>>>>
>>>>>>>> mpt_entry = mb_buf;
>>>>>>>> memset(mpt_entry, 0, sizeof(*mpt_entry));
>>>>>>>> @@ -1153,14 +1155,20 @@ static int hns_roce_v2_write_mtpt(void
>>>>>>>> *mb_buf, struct hns_roce_mr *mr,
>>>>>>>>
>>>>>>>> i = 0;
>>>>>>>> for_each_sg(mr->umem->sg_head.sgl, sg, mr->umem->nmap, entry) {
>>>>>>>> - pages[i] = ((u64)sg_dma_address(sg)) >> 6;
>>>>>>>> -
>>>>>>>> - /* Record the first 2 entry directly to MTPT table */
>>>>>>>> - if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>>>> - break;
>>>>>>>> - i++;
>>>>>>>> + len = sg_dma_len(sg) >> PAGE_SHIFT;
>>>>>>>> + for (j = 0; j < len; ++j) {
>>>>>>>> + page_addr = sg_dma_address(sg) +
>>>>>>>> + (j << mr->umem->page_shift);
>>>>>>>> + pages[i] = page_addr >> 6;
>>>>>>>> +
>>>>>>>> + /* Record the first 2 entry directly to MTPT table */
>>>>>>>> + if (i >= HNS_ROCE_V2_MAX_INNER_MTPT_NUM - 1)
>>>>>>>> + goto found;
>>>>>>>> + i++;
>>>>>>>> + }
>>>>>>>> }
>>>>>>>>
>>>>>>>> +found:
>>>>>>>> mpt_entry->pa0_l = cpu_to_le32(lower_32_bits(pages[0]));
>>>>>>>> roce_set_field(mpt_entry->byte_56_pa0_h, V2_MPT_BYTE_56_PA0_H_M,
>>>>>>>> V2_MPT_BYTE_56_PA0_H_S,
>>>>>>>> --
>>>>>>>> 1.9.1
>>>>>>>>
>>>>>> _______________________________________________
>>>>>> iommu mailing list
>>>>>> iommu@xxxxxxxxxxxxxxxxxxxxxxxxxx
>>>>>> https://lists.linuxfoundation.org/mailman/listinfo/iommu
>>>>> .
>>>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>
>>> .
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo@xxxxxxxxxxxxxxx
>> More majordomo info at http://vger.kernel.org/majordomo-info.html