Re: [HMM v13 08/18] mm/hmm: heterogeneous memory management (HMM for short)

From: Anshuman Khandual
Date: Sun Nov 27 2016 - 21:59:04 EST


On 11/27/2016 06:40 PM, Jerome Glisse wrote:
> On Wed, Nov 23, 2016 at 09:33:35AM +0530, Anshuman Khandual wrote:
>> On 11/18/2016 11:48 PM, Jérôme Glisse wrote:
>
> [...]
>
>>> + *
>>> + * hmm_vma_migrate(vma, start, end, ops);
>>> + *
>>> + * With ops struct providing 2 callback alloc_and_copy() which allocated the
>>> + * destination memory and initialize it using source memory. Migration can fail
>>> + * after this step and thus last callback finalize_and_map() allow the device
>>> + * driver to know which page were successfully migrated and which were not.
>>
>> So we have page->pgmap->free_devpage() to release the individual page back
>> into the device driver management during migration and also we have this ops
>> based finalize_and_mmap() to check on the failed instances inside a single
>> migration context which can contain set of pages at a time.
>>
>>> + *
>>> + * This can easily be use outside of HMM intended use case.
>>
>> Where you think this can be used outside of HMM ?
>
> Well on the radar is new memory hierarchy that seems to be on every CPU designer
> roadmap. Where you have a fast small HBM like memory package with the CPU and then
> you have the regular memory.
>
> In the embedded world they want to migrate active process to fast CPU memory and
> shutdown the regular memory to save power.
>
> In the HPC world they want to migrate hot data of hot process to this fast memory.
>
> In both case we are talking about process base memory migration and in case of
> embedded they also have DMA engine they can use to offload the copy operation
> itself.
>
> This are the useful case i have in mind but other people might see that code and
> realise they could also use it for their own specific corner case.

If there are plans for HBM or some other specialized type of memory
packaged inside the CPU (with no other device such as a GPU or network
card accessing it), then I think using HMM is not ideal in that case.
The CPU will be the only thing accessing this memory; no other device
or context will ever access it from outside the CPU. Hence the role of
a device driver is redundant, and the memory should be initialized and
used as a basic platform component.

In that case what we need is core VM managed memory with certain
restrictions around allocation and a way to allocate into it
explicitly if required. Representing this memory as a CPU-less,
restrictive coherent device memory (CDM) node is a better solution
IMHO. The RFCs I have posted regarding CDM representation are efforts
in this direction.

[RFC Specialized Zonelists] https://lkml.org/lkml/2016/10/24/19
[RFC Restrictive mems_allowed] https://lkml.org/lkml/2016/11/22/339

I believe both HMM and CDM have their own use cases and will complement
each other.

>
> [...]
>
>>> +/*
>>> + * hmm_pfn_t - HMM use its own pfn type to keep several flags per page
>>> + *
>>> + * Flags:
>>> + * HMM_PFN_VALID: pfn is valid
>>> + * HMM_PFN_WRITE: CPU page table have the write permission set
>>> + */
>>> +typedef unsigned long hmm_pfn_t;
>>> +
>>> +#define HMM_PFN_VALID (1 << 0)
>>> +#define HMM_PFN_WRITE (1 << 1)
>>> +#define HMM_PFN_SHIFT 2
>>> +
>>> +static inline struct page *hmm_pfn_to_page(hmm_pfn_t pfn)
>>> +{
>>> + if (!(pfn & HMM_PFN_VALID))
>>> + return NULL;
>>> + return pfn_to_page(pfn >> HMM_PFN_SHIFT);
>>> +}
>>> +
>>> +static inline unsigned long hmm_pfn_to_pfn(hmm_pfn_t pfn)
>>> +{
>>> + if (!(pfn & HMM_PFN_VALID))
>>> + return -1UL;
>>> + return (pfn >> HMM_PFN_SHIFT);
>>> +}
>>> +
>>> +static inline hmm_pfn_t hmm_pfn_from_page(struct page *page)
>>> +{
>>> + return (page_to_pfn(page) << HMM_PFN_SHIFT) | HMM_PFN_VALID;
>>> +}
>>> +
>>> +static inline hmm_pfn_t hmm_pfn_from_pfn(unsigned long pfn)
>>> +{
>>> + return (pfn << HMM_PFN_SHIFT) | HMM_PFN_VALID;
>>> +}
>>
>> Hmm, so if we use last two bits on PFN as flags, it does reduce the number of
>> bits available for the actual PFN range. But given that we support maximum of
>> 64TB on POWER (not sure about X86) we can live with this two bits going away
>> from the unsigned long. But what is the purpose of tracking validity and write
>> flag inside the PFN ?
>
> So 2^46 so with 12bits PAGE_SHIFT we only need 34 bits for pfns value hence i
> should have enough place for my flag or is unsigned long not 64bits on powerpc ?

Yeah, unsigned long is 64 bits on POWER. We use a PAGE_SHIFT of 12 for
4K pages and a PAGE_SHIFT of 16 for 64K pages.
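
To make the arithmetic concrete, here is a minimal userspace sketch (not
kernel code) of the bit budget discussed above. It assumes a 64-bit word
and the two-flag layout from the quoted patch; every demo_/DEMO_ name is
hypothetical and exists only for this illustration.

```c
#include <stdint.h>

/* Same layout as the quoted patch: two low flag bits, pfn above them. */
#define DEMO_PFN_VALID (1ULL << 0)
#define DEMO_PFN_WRITE (1ULL << 1)
#define DEMO_PFN_SHIFT 2

/* Bits needed to number every page in a 2^phys_bits byte address space. */
static inline unsigned demo_pfn_bits(unsigned phys_bits, unsigned page_shift)
{
	return phys_bits - page_shift;
}

/* Bits left unused in a 64-bit word after the pfn and the flag bits. */
static inline unsigned demo_spare_bits(unsigned phys_bits, unsigned page_shift)
{
	return 64 - DEMO_PFN_SHIFT - demo_pfn_bits(phys_bits, page_shift);
}

/* Pack a pfn and its write permission into one flagged value. */
static inline uint64_t demo_encode(uint64_t pfn, int writable)
{
	return (pfn << DEMO_PFN_SHIFT) | DEMO_PFN_VALID |
	       (writable ? DEMO_PFN_WRITE : 0);
}

/* Recover the pfn, or all ones when the valid flag is clear. */
static inline uint64_t demo_decode(uint64_t value)
{
	if (!(value & DEMO_PFN_VALID))
		return UINT64_MAX;
	return value >> DEMO_PFN_SHIFT;
}
```

With 64TB (2^46 bytes) of memory, demo_pfn_bits(46, 12) gives 34 bits for
4K pages and demo_pfn_bits(46, 16) gives 30 bits for 64K pages, so the
two flag bits fit comfortably in either configuration.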