Re: [PATCH 1/1] RDMA/odp: convert to use HMM for ODP

From: Haggai Eran
Date: Wed Feb 13 2019 - 01:28:18 EST


On 2/12/2019 6:11 PM, Jerome Glisse wrote:
> On Wed, Feb 06, 2019 at 08:44:26AM +0000, Haggai Eran wrote:
>> On 1/29/2019 6:58 PM, jglisse@xxxxxxxxxx wrote:
>> > Convert ODP to use HMM so that we can build on common infrastructure
>> > for different class of devices that want to mirror a process address
>> > space into a device. There are no functional changes.
>>
>> Thanks for sending this patch. I think in general it is a good idea to
>> use a common infrastructure for ODP.
>>
>> I have a couple of questions below.
>>
>>> -static void ib_umem_notifier_invalidate_range_end(struct mmu_notifier *mn,
>>> - const struct mmu_notifier_range *range)
>>> -{
>>> - struct ib_ucontext_per_mm *per_mm =
>>> - container_of(mn, struct ib_ucontext_per_mm, mn);
>>> -
>>> - if (unlikely(!per_mm->active))
>>> - return;
>>> -
>>> - rbt_ib_umem_for_each_in_range(&per_mm->umem_tree, range->start,
>>> - range->end,
>>> - invalidate_range_end_trampoline, true, NULL);
>>> up_read(&per_mm->umem_rwsem);
>>> + return ret;
>>> }
>> Previously the code held the umem_rwsem between range_start and
>> range_end calls. I guess that was in order to guarantee that no device
>> page faults take a reference to the pages being invalidated while the
>> invalidation is ongoing. I assume this is now handled by hmm instead,
>> correct?
>
> It is a mix of HMM and the driver; see pagefault_mr() in mlx5/odp.c:
> mutex_lock(&odp->umem_mutex);
> if (hmm_vma_range_done(range)) {
> ...
>
> This is what serializes programming the hw and any concurrent CPU page
> table invalidation. This is also one of the things I want to improve
> long term, as mlx5_ib_update_xlt() can do memory allocation and I would
> like to avoid that, i.e. make mlx5_ib_update_xlt() and its sub-functions
> as small and to the point as possible so that they can only fail if
> the hardware is in a bad state, not because of memory allocation issues.
I wonder if it would be possible to make use of the memory that is
already allocated (ib_umem_odp->dma_list) for that purpose. This would
probably mean that this area would need to be formatted according to the
device hardware requirements (e.g., big endian), and then you could
instruct the device to DMA the updated translations directly from there.

>
>
>>
>>> +
>>> +static uint64_t odp_hmm_flags[HMM_PFN_FLAG_MAX] = {
>>> + ODP_READ_BIT, /* HMM_PFN_VALID */
>>> + ODP_WRITE_BIT, /* HMM_PFN_WRITE */
>>> + ODP_DEVICE_BIT, /* HMM_PFN_DEVICE_PRIVATE */
>> It seems that the mlx5_ib code in this patch currently ignores the
>> ODP_DEVICE_BIT (e.g., in umem_dma_to_mtt). Is that okay? Or is it
>> handled implicitly by the HMM_PFN_SPECIAL case?
>
> This is because HMM expects a bit for device memory, as the same API is
> used for GPUs which have device memory. I can add a comment explaining
> that it is not used for ODP but is there just to comply with the HMM API.
>
>>
>>> @@ -327,9 +287,10 @@ void put_per_mm(struct ib_umem_odp *umem_odp)
>>> up_write(&per_mm->umem_rwsem);
>>>
>>> WARN_ON(!RB_EMPTY_ROOT(&per_mm->umem_tree.rb_root));
>>> - mmu_notifier_unregister_no_release(&per_mm->mn, per_mm->mm);
>>> + hmm_mirror_unregister(&per_mm->mirror);
>>> put_pid(per_mm->tgid);
>>> - mmu_notifier_call_srcu(&per_mm->rcu, free_per_mm);
>>> +
>>> + kfree(per_mm);
>>> }
>> Previously the per_mm struct was released through call srcu, but now it
>> is released immediately. Is it safe? I saw that hmm_mirror_unregister
>> calls mmu_notifier_unregister_no_release, so I don't understand what
>> prevents concurrently running invalidations from accessing the released
>> per_mm struct.
>
> Yes it is safe, the hmm struct has its own refcount and the mirror holds
> a reference on it, and the hmm struct itself has a reference on the mm
> struct. So neither structure can vanish before the other. However once
> the release callback happens you can no longer fault anything; it will
> -EFAULT if you try to (not to mention that by then all the vmas have been
> torn down). So even if some kernel thread races with destruction it will
> not be able to fault anything or use the mirror struct in any meaningful
> way.
>
> Note that in a regular teardown the ODP put_per_mm() will happen before
> the release callback, as IIRC files, including device files, get closed
> before the mm is torn down. But in any case it would work no matter what
> the order is.
I see. I was worried about concurrent invalidations and ibv_dereg_mr,
but I understand hmm protects against that internally through the
mirrors_sem semaphore.

>
>>
>>> @@ -578,11 +578,27 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
>>>
>>> next_mr:
>>> size = min_t(size_t, bcnt, ib_umem_end(&odp->umem) - io_virt);
>>> -
>>> page_shift = mr->umem->page_shift;
>>> page_mask = ~(BIT(page_shift) - 1);
>>> + off = (io_virt & (~page_mask));
>>> + size += (io_virt & (~page_mask));
>>> + io_virt = io_virt & page_mask;
>>> + off += (size & (~page_mask));
>>> + size = ALIGN(size, 1UL << page_shift);
>>> +
>>> + if (io_virt < ib_umem_start(&odp->umem))
>>> + return -EINVAL;
>>> +
>>> start_idx = (io_virt - (mr->mmkey.iova & page_mask)) >> page_shift;
>>>
>>> + if (odp_mr->per_mm == NULL || odp_mr->per_mm->mm == NULL)
>>> + return -ENOENT;
>>> +
>>> + ret = hmm_range_register(&range, odp_mr->per_mm->mm,
>>> + io_virt, io_virt + size, page_shift);
>>> + if (ret)
>>> + return ret;
>>> +
>>> if (prefetch && !downgrade && !mr->umem->writable) {
>>> /* prefetch with write-access must
>>> * be supported by the MR
>> Isn't there a mistake in the calculation of the variable size? It is
>> first set to the size of the page fault range, but then you add the
>> virtual address, so I guess it is actually the range end. Then you pass
>> io_virt + size to hmm_range_register. Doesn't it double the size of the
>> range?
>
> No, I think it is correct: bcnt is the byte count we are asked to fault,
> we clamp that to the maximum size the current mr covers (the min_t above)
> and then we align with the page size so that the fault address is page
> aligned.
There are three lines that update size above, not two:
>>> size = min_t(size_t, bcnt, ib_umem_end(&odp->umem) - io_virt);
>>> + size += (io_virt & (~page_mask));
>>> + size = ALIGN(size, 1UL << page_shift);
As you said, the first one clamps to the end of the MR and the third one
aligns to the page size, but the middle one seems to add in the start.

> hmm_range_register() takes a start address and an end address, which is
> the start address + size.
>
> off is the offset, i.e. the number of extra bytes we are faulting to
> align the start on the page size. If there is a bug it might be this
> line:
> off += (size & (~page_mask));

off is used only to calculate the number of bytes mapped in order to
proceed with the next MR in the page fault with an updated byte count. I
think it should be
off = start & ~page_mask;
meaning it shouldn't also add in (end & ~page_mask).

Regards,
Haggai