Re: [PATCH 06/36] HMM: add HMM page table v2.

From: Jerome Glisse
Date: Fri Jun 19 2015 - 14:07:36 EST


On Thu, Jun 18, 2015 at 07:06:08PM -0700, Mark Hairgrove wrote:
> On Thu, 21 May 2015, j.glisse@xxxxxxxxx wrote:

[...]
> > +
> > +static inline dma_addr_t hmm_pde_from_pfn(dma_addr_t pfn)
> > +{
> > + return (pfn << PAGE_SHIFT) | HMM_PDE_VALID;
> > +}
> > +
> > +static inline unsigned long hmm_pde_pfn(dma_addr_t pde)
> > +{
> > + return (pde & HMM_PDE_VALID) ? pde >> PAGE_SHIFT : 0;
> > +}
> > +
>
> Does hmm_pde_pfn return a dma_addr_t pfn or a system memory pfn?
>
> The types between these two functions don't match. According to
> hmm_pde_from_pfn, both the pde and the pfn are supposed to be dma_addr_t.
> But hmm_pde_pfn returns an unsigned long as a pfn instead of a dma_addr_t.
> If hmm_pde_pfn sometimes used to get a dma_addr_t pfn then shouldn't it
> also return a dma_addr_t, since as you pointed out in the commit message,
> dma_addr_t might be bigger than an unsigned long?
>

Yes, internally it uses dma_addr_t, but for device drivers that want to
use the physical system page address, aka pfn, I want them to use the
specialized helpers hmm_pte_from_pfn() and hmm_pte_pfn(), so the type
casting happens inside HMM. It also makes device drivers easier to
review, as each driver will be consistent, ie it either uses pfn or it
uses dma_addr_t, but it does not mix the two.

A later patch adds the hmm_pte_from_dma() and hmm_pte_dma_addr() helpers
for the dma case, so this patch only introduces the pfn version.
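
To make the intent concrete, a driver working with system pfns would
only ever touch the pfn helpers, roughly like this (illustrative sketch
only, the two wrappers below are made-up names and are not part of the
series):

static dma_addr_t dummy_pte_from_page(struct page *page)
{
	/* The unsigned long pfn -> dma_addr_t cast happens inside HMM. */
	return hmm_pte_from_pfn(page_to_pfn(page));
}

static struct page *dummy_pte_to_page(dma_addr_t pte)
{
	/* hmm_pte_pfn() returns 0 when the entry is not a valid pfn entry. */
	unsigned long pfn = hmm_pte_pfn(pte);

	return pfn ? pfn_to_page(pfn) : NULL;
}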

> > [...]
> > +
> > +static inline dma_addr_t hmm_pte_from_pfn(dma_addr_t pfn)
> > +{
> > + return (pfn << PAGE_SHIFT) | (1 << HMM_PTE_VALID_PFN_BIT);
> > +}
> > +
> > +static inline unsigned long hmm_pte_pfn(dma_addr_t pte)
> > +{
> > + return hmm_pte_test_valid_pfn(&pte) ? pte >> PAGE_SHIFT : 0;
> > +}
>
> Same question as hmm_pde_pfn above.

See above.

>
> > [...]
> > +/* struct hmm_pt_iter - page table iterator states.
> > + *
> > + * @ptd: Array of directory struct page pointer for each levels.
> > + * @ptdp: Array of pointer to mapped directory levels.
> > + * @dead_directories: List of directories that died while walking page table.
> > + * @cur: Current address.
> > + */
> > +struct hmm_pt_iter {
> > + struct page *ptd[HMM_PT_MAX_LEVEL - 1];
> > + dma_addr_t *ptdp[HMM_PT_MAX_LEVEL - 1];
>
> These are sized to be HMM_PT_MAX_LEVEL - 1 rather than HMM_PT_MAX_LEVEL
> because the iterator doesn't store the top level, correct? This results in
> a lot of "level - 1" and "level - 2" logic when dealing with the iterator.
> Have you considered keeping the levels consistent to get rid of all the
> extra offset-by-1 logic?

All of this should be optimized away by the compiler, though I have not
checked the assembly.

[...]
> > + *
> > + * @iter: Iterator states that currently protect the directory.
> > + * @level: Level of the directory to reference.
> > + *
> > + * This function will reference a directory but it is illegal for refcount to
> > + * be 0 as this helper should only be call when iterator is protecting the
> > + * directory (ie iterator hold a reference for the directory).
> > + *
> > + * HMM user will call this with level = pt.llevel any other value is supicious
> > + * outside of hmm_pt code.
> > + */
> > +static inline void hmm_pt_iter_directory_ref(struct hmm_pt_iter *iter,
> > + char level)
> > +{
> > + /* Nothing to do for root level. */
> > + if (!level)
> > + return;
> > +
> > + if (!atomic_inc_not_zero(&iter->ptd[level - 1]->_mapcount))
> > + /* Illegal this should not happen. */
> > + BUG();
> > +}
> > +
> > [...]
> > +
> > +int hmm_pt_init(struct hmm_pt *pt)
> > +{
> > + unsigned directory_shift, i = 0, npgd;
> > +
> > + pt->last &= PAGE_MASK;
> > + spin_lock_init(&pt->lock);
> > + /* Directory shift is the number of bits that a single directory level
> > + * represent. For instance if PAGE_SIZE is 4096 and each entry takes 8
> > + * bytes (sizeof(dma_addr_t) == 8) then directory_shift = 9.
> > + */
> > + directory_shift = PAGE_SHIFT - ilog2(sizeof(dma_addr_t));
> > + /* Level 0 is the root level of the page table. It might use less
> > + * bits than directory_shift but all sub-directory level will use all
> > + * directory_shift bits.
> > + *
> > + * For instance if hmm_pt.last == (1 << 48), PAGE_SHIFT == 12 and
>
> This example should say that hmm_pt.last == (1 << 48) - 1, since last is
> inclusive. Otherwise llevel will be 4.

Yes, correct. I switched from exclusive to inclusive and forgot to
update the comment.
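
To spell out the arithmetic (with PAGE_SHIFT == 12 and
sizeof(dma_addr_t) == 8, ie directory_shift == 9, as in the comment):

	last == (1 << 48) - 1: __fls(last >> 12) == 35, (35 / 9) * 9 + 12 == 39
		-> shift[] == {39, 30, 21, 12}, llevel == 3
	last == (1 << 48):     __fls(last >> 12) == 36, (36 / 9) * 9 + 12 == 48
		-> shift[] == {48, 39, 30, 21, 12}, llevel == 4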

>
> > + * sizeof(dma_addr_t) == 8 then :
> > + * directory_shift = 9
> > + * shift[0] = 39
> > + * shift[1] = 30
> > + * shift[2] = 21
> > + * shift[3] = 12
> > + * llevel = 3
> > + *
> > + * Note that shift[llevel] == PAGE_SHIFT because the last level
> > + * correspond to the page table entry level (ignoring the case of huge
> > + * page).
> > + */
> > + pt->shift[0] = ((__fls(pt->last >> PAGE_SHIFT) / directory_shift) *
> > + directory_shift) + PAGE_SHIFT;
> > + while (pt->shift[i++] > PAGE_SHIFT)
> > + pt->shift[i] = pt->shift[i - 1] - directory_shift;
> > + pt->llevel = i - 1;
> > + pt->directory_mask = (1 << directory_shift) - 1;
> > +
> > + for (i = 0; i <= pt->llevel; ++i)
> > + pt->mask[i] = ~((1UL << pt->shift[i]) - 1);
> > +
> > + npgd = (pt->last >> pt->shift[0]) + 1;
> > + pt->pgd = kzalloc(npgd * sizeof(dma_addr_t), GFP_KERNEL);
> > + if (!pt->pgd)
> > + return -ENOMEM;
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL(hmm_pt_init);
>
> Does this need to be EXPORT_SYMBOL? It seems like a driver would never
> need to call this, only core hmm. Same question for hmm_pt_fini.
>

Well, I wanted to use that in the hmm_dummy driver, but I have mixed
feelings: using it in the dummy driver allows testing it in more use
cases, but at the same time a bug might be hidden because the same code
is used in the dummy driver. I will probably just use it in the dummy
driver.

[...]
> > + *
> > + * @iter: Iterator states.
> > + *
> > + * This function will initialize iterator states. It must always be pair with a
> > + * call to hmm_pt_iter_fini().
> > + */
> > +void hmm_pt_iter_init(struct hmm_pt_iter *iter)
> > +{
> > + memset(iter->ptd, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
> > + memset(iter->ptdp, 0, sizeof(void *) * (HMM_PT_MAX_LEVEL - 1));
>
> The memset sizes can simply be sizeof(iter->ptd) and sizeof(iter->ptdp).

Yes, I should have.
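
ie the init simply becomes (sketch of the fix you suggest):

void hmm_pt_iter_init(struct hmm_pt_iter *iter)
{
	memset(iter->ptd, 0, sizeof(iter->ptd));
	memset(iter->ptdp, 0, sizeof(iter->ptdp));
	INIT_LIST_HEAD(&iter->dead_directories);
}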

> > + INIT_LIST_HEAD(&iter->dead_directories);
> > +}
> > +EXPORT_SYMBOL(hmm_pt_iter_init);
> > +
> > +/* hmm_pt_iter_directory_unref_safe() - unref a directory that is safe to free.
> > + *
> > + * @iter: Iterator states.
> > + * @pt: HMM page table.
> > + * @level: Level of the directory to unref.
> > + *
> > + * This function will unreference a directory and add it to dead list if
> > + * directory no longer have any reference. It will also clear the entry to
> > + * that directory into the upper level directory as well as dropping ref
> > + * on the upper directory.
> > + */
> > +static void hmm_pt_iter_directory_unref_safe(struct hmm_pt_iter *iter,
> > + struct hmm_pt *pt,
> > + unsigned level)
> > +{
> > + struct page *upper_ptd;
> > + dma_addr_t *upper_ptdp;
> > +
> > + /* Nothing to do for root level. */
> > + if (!level)
> > + return;
> > +
> > + if (!atomic_dec_and_test(&iter->ptd[level - 1]->_mapcount))
> > + return;
> > +
> > + upper_ptd = level > 1 ? iter->ptd[level - 2] : NULL;
> > + upper_ptdp = level > 1 ? iter->ptdp[level - 2] : pt->pgd;
> > + upper_ptdp = &upper_ptdp[hmm_pt_index(pt, iter->cur, level - 1)];
> > + hmm_pt_directory_lock(pt, upper_ptd, level - 1);
> > + /*
> > + * There might be race btw decrementing reference count on a directory
> > + * and another thread trying to fault in a new directory. To avoid
> > + * erasing the new directory entry we need to check that the entry
> > + * still correspond to the directory we are removing.
> > + */
> > + if (hmm_pde_pfn(*upper_ptdp) == page_to_pfn(iter->ptd[level - 1]))
> > + *upper_ptdp = 0;
> > + hmm_pt_directory_unlock(pt, upper_ptd, level - 1);
> > +
> > + /* Add it to delayed free list. */
> > + list_add_tail(&iter->ptd[level - 1]->lru, &iter->dead_directories);
> > +
> > + /*
> > + * The upper directory is not safe to unref as we have an extra ref and
>
> This should be "IS safe to unref", correct?

Yes, (s/not/now) -> "is NOW safe to unref". /me blames my fingers.

[...]
> > + /*
> > + * Some iterator may have dereferenced a dead directory entry and looked
> > + * up the struct page but haven't check yet the reference count. As all
> > + * the above happen in rcu read critical section we know that we need
> > + * to wait for grace period before being able to free any of the dead
> > + * directory page.
> > + */
> > + synchronize_rcu();
> > + list_for_each_entry_safe(ptd, tmp, &iter->dead_directories, lru) {
> > + list_del(&ptd->lru);
> > + atomic_set(&ptd->_mapcount, -1);
> > + __free_page(ptd);
> > + }
> > +}
>
> If I'm following this correctly, a migrate to the device will allocate HMM
> page tables and the subsequent migrate from the device will free them.
> Assuming that's the case, might thrashing of page allocations be a
> problem? What about keeping the HMM page tables around until the actual
> munmap() of the corresponding VA range?

The HMM page table is allocated any time a device mirrors a range, ie
migration to the device is not a special case. When migrating to and
from the device, the HMM page table is allocated prior to the migration
and outlives the migration back.

That said, the rationale here is that I want to free HMM resources as
early as possible, mostly to support using the GPU on a dataset one time
(ie the dataset is used once and only once by the GPU). I think that
will be a common and important use case, and making sure we free
resources early does not prevent other use cases, where datasets are
used for a longer time, from working properly and efficiently.

In a later patch I add a helper so that a device driver can discard a
range, ie tell HMM that it is no longer using a range of addresses,
allowing HMM to free the associated resources.
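
From the driver side the shape of it is roughly the following
(placeholder name and arguments, the real helper lives in that later
patch and may differ):

	/*
	 * Driver is done with [start, end), HMM can free the resources
	 * (page table directories included) backing that range.
	 */
	hmm_mirror_range_discard(mirror, start, end);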

However, you are correct that currently some MM events will lead to the
HMM page table being freed and then reallocated right away by the
device, which is obviously bad. But because I did not want to make this
patch or this series any more complex than it already is, I did not
include any mechanism to delay HMM page table directory reclaim. Such a
delayed reclaim mechanism is on my roadmap, and I think I shared that
roadmap with you. I think it is something we can optimize later on. The
important part here is that the device driver knows that the HMM page
table needs to be accessed carefully, so that when aggressive pruning of
the HMM page table happens it does not disrupt the device driver.

Cheers,
Jérôme