Re: [PATCH V3 4/4] arm64/mm: Enable memory hot remove

From: Anshuman Khandual
Date: Thu May 16 2019 - 23:17:35 EST

On 05/16/2019 04:27 PM, Mark Rutland wrote:
> On Thu, May 16, 2019 at 11:04:48AM +0530, Anshuman Khandual wrote:
>> On 05/15/2019 05:19 PM, Mark Rutland wrote:
>>> On Tue, May 14, 2019 at 02:30:07PM +0530, Anshuman Khandual wrote:
>>>> Memory removal from an arch perspective involves tearing down two different
>>>> kernel based mappings i.e. vmemmap and linear while releasing related page
>>>> table and any mapped pages allocated for given physical memory range to be
>>>> removed.
>>>> Define a common kernel page table tear down helper remove_pagetable() which
>>>> can be used to unmap given kernel virtual address range. In effect it can
>>>> tear down either vmemmap or kernel linear mappings. This new helper is called
>>>> from both vmemmap_free() and ___remove_pgd_mapping() during memory removal.
>>>> For linear mapping there are no actual allocated pages which are mapped to
>>>> create the translation. Any pfn on a given entry is derived from physical
>>>> address (__va(PA) --> PA) whose linear translation is to be created. They
>>>> need not be freed as they were never allocated in the first place. But for
>>>> vmemmap which is a real virtual mapping (like vmalloc) physical pages are
>>>> allocated either from buddy or memblock which get mapped in the kernel page
>>>> table. These allocated and mapped pages need to be freed during translation
>>>> tear down. But page table pages need to be freed in both these cases.
>>> As previously discussed, we should only hot-remove memory which was
>>> hot-added, so we shouldn't encounter memory allocated from memblock.
>> Right, not applicable any more. Will drop this word.
>>>> These mappings need to be differentiated while deciding if a mapped page at
>>>> any level i.e. [pte|pmd|pud]_page() should be freed or not. Callers for the
>>>> mapping tear down process should pass on 'sparse_vmap' variable identifying
>>>> kernel vmemmap mappings.
>>> I think that you can simplify the paragraphs above down to:
>>> The arch code for hot-remove must tear down portions of the linear map
>>> and vmemmap corresponding to memory being removed. In both cases the
>>> page tables mapping these regions must be freed, and when sparse
>>> vmemmap is in use the memory backing the vmemmap must also be freed.
>>> This patch adds a new remove_pagetable() helper which can be used to
>>> tear down either region, and calls it from vmemmap_free() and
>>> ___remove_pgd_mapping(). The sparse_vmap argument determines whether
>>> the backing memory will be freed.
>> The current one is a bit more descriptive on detail. Anyway, will replace it
>> with the above writeup if that is preferred.
> I would prefer the suggested form above, as it's easier to extract the
> necessary details from it.

Fair enough.
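
For illustration only, here is a small userspace sketch of the rule in the
suggested wording: the table page is always freed, while the mapped pages are
freed only when sparse_vmap is set. This is a toy model and not the patch
itself. Every name here (toy_table, toy_build_linear(), toy_build_vmemmap(),
toy_remove_table(), the counters) is hypothetical, and malloc()/free() stand
in for the kernel allocators.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_PTRS 4	/* entries per toy table; real tables are far larger */

/* Hypothetical leaf table: each slot points at one mapped "page" or is NULL. */
struct toy_table {
	void *page[TOY_PTRS];
};

int toy_pages_freed;	/* backing pages released so far */
int toy_tables_freed;	/* page table pages released so far */

/* Stand-in for the memory being removed; the linear map just points into it. */
char toy_ram[TOY_PTRS][16];

/* Linear map model: entries are derived from the memory itself, not allocated. */
struct toy_table *toy_build_linear(void)
{
	struct toy_table *tbl = calloc(1, sizeof(*tbl));

	if (!tbl)
		abort();
	for (size_t i = 0; i < TOY_PTRS; i++)
		tbl->page[i] = toy_ram[i];
	return tbl;
}

/* vmemmap model: every entry is backed by a separately allocated page. */
struct toy_table *toy_build_vmemmap(void)
{
	struct toy_table *tbl = calloc(1, sizeof(*tbl));

	if (!tbl)
		abort();
	for (size_t i = 0; i < TOY_PTRS; i++) {
		tbl->page[i] = malloc(16);
		if (!tbl->page[i])
			abort();
	}
	return tbl;
}

/* Clear every entry, free backing pages only for vmemmap, then free the table. */
void toy_remove_table(struct toy_table *tbl, bool sparse_vmap)
{
	assert(tbl != NULL);

	for (size_t i = 0; i < TOY_PTRS; i++) {
		if (!tbl->page[i])
			continue;
		if (sparse_vmap) {
			free(tbl->page[i]);
			toy_pages_freed++;
		}
		tbl->page[i] = NULL;
	}

	/* The table page itself is released in both cases. */
	free(tbl);
	toy_tables_freed++;
}
```

Calling toy_remove_table(toy_build_linear(), false) releases only the table,
whereas toy_remove_table(toy_build_vmemmap(), true) releases the table and
each page it mapped.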

> [...]
>>>> +static void
>>>> +remove_pagetable(unsigned long start, unsigned long end, bool sparse_vmap)
>>>> +{
>>>> + unsigned long addr, next;
>>>> + pud_t *pudp_base;
>>>> + pgd_t *pgdp;
>>>> +
>>>> + spin_lock(&init_mm.page_table_lock);
>>> It would be good to explain why we need to take the ptl here.
>> Will update both commit message and add an in-code comment here.
>>> IIUC that shouldn't be necessary for the linear map. Am I mistaken?
>> It's not absolutely necessary for the linear map right now because both memory
>> hotplug & ptdump, which modify or walk the page table ranges respectively,
>> take the memory hotplug lock. That apart, no other caller creates or destroys
>> the linear mapping at runtime.
>>> Is there a specific race when tearing down the vmemmap?
>> This is trickier than the linear map. vmemmap additions would be protected by
>> the memory hotplug lock, but this can potentially collide with vmalloc/IO
>> regions. Even if they don't right now, that will be because they don't share
>> intermediate page table levels.
> Sure; if we could just state something like:
> The vmemmap region may share levels of table with the vmalloc region.
> Take the ptl so that we can safely free potentially-shared tables.
> ... I think that would be sufficient.

Will do.
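
To make the reasoning behind that comment concrete, below is a userspace
sketch of freeing a potentially-shared table under a lock. It is an assumption
about the general technique, not the code from this patch: toy_ptl is a C11
atomic_flag spinlock standing in for init_mm.page_table_lock, and
toy_shared_table, toy_lock(), toy_unlock() and toy_clear_range() are
hypothetical names. The emptiness check and the "free" happen inside the same
critical section as the clearing, so another user of the table cannot populate
a slot in between.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES 8

/* Stand-in for init_mm.page_table_lock (hypothetical userspace spinlock). */
atomic_flag toy_ptl = ATOMIC_FLAG_INIT;

void toy_lock(void)
{
	/* Spin until the flag was previously clear, i.e. we now own the lock. */
	while (atomic_flag_test_and_set_explicit(&toy_ptl, memory_order_acquire))
		;
}

void toy_unlock(void)
{
	atomic_flag_clear_explicit(&toy_ptl, memory_order_release);
}

/* Toy intermediate table whose slots may belong to vmemmap or to vmalloc. */
struct toy_shared_table {
	bool present[TOY_ENTRIES];
	bool freed;	/* models the table page having been released */
};

/*
 * Clear slots [first, last) and release the table only if no slot is left.
 * Returns true if the table was released.
 */
bool toy_clear_range(struct toy_shared_table *t, size_t first, size_t last)
{
	bool empty = true;

	assert(t != NULL && first <= last);

	toy_lock();
	for (size_t i = first; i < last && i < TOY_ENTRIES; i++)
		t->present[i] = false;

	/* Another region may still own slots here, so check before freeing. */
	for (size_t i = 0; i < TOY_ENTRIES; i++) {
		if (t->present[i]) {
			empty = false;
			break;
		}
	}
	if (empty)
		t->freed = true;
	toy_unlock();

	return empty;
}
```

With slots 0-1 standing for vmemmap and slot 5 for vmalloc, clearing [0, 2)
leaves the table in place because slot 5 is still present. Only clearing the
last remaining slot releases it.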