Re: [RFC PATCH 1/3] mm, numa: rework do_pages_move

From: Anshuman Khandual
Date: Wed Jan 03 2018 - 04:37:12 EST


On 01/03/2018 02:28 PM, Michal Hocko wrote:
> On Wed 03-01-18 14:12:17, Anshuman Khandual wrote:
>> On 12/08/2017 09:45 PM, Michal Hocko wrote:
> [...]

[...]

>>
>> This reuses the existing page allocation helper from migrate_pages() system
>> call. But all these allocator helper names for migrate_pages() function are
>> really confusing. Even in this case alloc_new_node_page and the original
>> new_node_page() which is still getting used in do_migrate_range() sounds
>> similar even if their implementation is quite different. IMHO either all of
>> them should be moved to the header file with proper differentiating names
>> or let them be there in their respective files with these generic names and
>> clean them up later.
>
> I believe I've tried that but I couldn't make them into a single header
> file easily because of header file dependencies. I agree that their
> names are quite confusing. Feel free to send a patch to clean this up.

Sure. Will try once this one gets into mmotm.

[...]

>>
>>
>> Just a nit. new_page_node() and store_status() seems different. Then why
>> the git diff looks so clumsy.
>
> Kirill was suggesting to use --patience to general the diff which leads
> to a slightly better output. It has been posted as a separate email [1].
> Maybe you will find that one easier to review.
>
> [1] http://lkml.kernel.org/r/20171213143948.GM25185@xxxxxxxxxxxxxx

Yeah it does look better.

[...]

>>> - return thp;
>>> - } else
>>> - return __alloc_pages_node(pm->node,
>>> - GFP_HIGHUSER_MOVABLE | __GFP_THISNODE, 0);
>>> + err = migrate_pages(pagelist, alloc_new_node_page, NULL, node,
>>> + MIGRATE_SYNC, MR_SYSCALL);
>>> + if (err)
>>> + putback_movable_pages(pagelist);
>>> + return err;
>>> }
>>
>> Even this one. IIUC, do_move_pages_to_node() migrate a chunk of pages
>> at a time which belong to the same target node. Perhaps the name should
>> suggest so. All these helper page migration helper functions sound so
>> similar.
>
> What do you suggest? I find do_move_pages_to_node quite explicit on its
> purpose.

Sure. Not a big deal.

[...]

>>> {
>>> + struct vm_area_struct *vma;
>>> + struct page *page;
>>> + unsigned int follflags;
>>> int err;
>>> - struct page_to_node *pp;
>>> - LIST_HEAD(pagelist);
>>>
>>> down_read(&mm->mmap_sem);
>>
>> Holding mmap_sem for individual pages makes sense. Current
>> implementation is holding it for an entire batch.
>
> I didn't bother to optimize this path to be honest. It is true that lock
> batching can lead to improvements but that would complicate the code
> (how many patches to batch?) so I've left that for later if somebody
> actually sees any problem.
>
>>> + err = -EFAULT;
>>> + vma = find_vma(mm, addr);
>>> + if (!vma || addr < vma->vm_start || !vma_migratable(vma))
>>
>> While here, should not we add 'addr > vma->vm_end' into this condition ?
>
> No. See what find_vma does.
>

Right.

> [...]
>
> Please cut out the quoted reply to minimum

Sure will do.

>
>>> @@ -1593,79 +1556,80 @@ static int do_pages_move(struct mm_struct *mm, nodemask_t task_nodes,
>>> const int __user *nodes,
>>> int __user *status, int flags)
>>> {
>>> - struct page_to_node *pm;
>>> - unsigned long chunk_nr_pages;
>>> - unsigned long chunk_start;
>>> - int err;
>>> -
>>> - err = -ENOMEM;
>>> - pm = (struct page_to_node *)__get_free_page(GFP_KERNEL);
>>> - if (!pm)
>>> - goto out;
>>> + int chunk_node = NUMA_NO_NODE;
>>> + LIST_HEAD(pagelist);
>>> + int chunk_start, i;
>>> + int err = 0, err1;
>>
>> err init might not be required, its getting assigned to -EFAULT right away.
>
> No, nr_pages might be 0 AFAICS.

Right but there is another err = 0 after the for loop.

>
> [...]
>>> + if (chunk_node == NUMA_NO_NODE) {
>>> + chunk_node = node;
>>> + chunk_start = i;
>>> + } else if (node != chunk_node) {
>>> + err = do_move_pages_to_node(mm, &pagelist, chunk_node);
>>> + if (err)
>>> + goto out;
>>> + err = store_status(status, chunk_start, chunk_node, i - chunk_start);
>>> + if (err)
>>> + goto out;
>>> + chunk_start = i;
>>> + chunk_node = node;
>>> }

[...]

>>> + err = do_move_pages_to_node(mm, &pagelist, chunk_node);
>>> + if (err)
>>> + goto out;
>>> + if (i > chunk_start) {
>>> + err = store_status(status, chunk_start, chunk_node, i - chunk_start);
>>> + if (err)
>>> + goto out;
>>> + }
>>> + chunk_node = NUMA_NO_NODE;
>>
>> This block of code is bit confusing.
>
> I believe this is easier to grasp when looking at the resulting code.
>>
>> 1) Why attempt to migrate when just one page could not be isolated ?
>> 2) 'i' is always greater than chunk_start except the starting page
>> 3) Why reset chunk_node as NUMA_NO_NODE ?
>
> This is all about flushing the pending state on an error and
> distinguising a fresh batch.

Okay. Will test it out on a multi node system once I get hold of one.