Re: [PATCH v2 0/7] mm: process_vm_mmap() -- syscall for duplication a process mapping

From: Kirill Tkhai
Date: Tue May 21 2019 - 12:02:10 EST


On 21.05.2019 18:52, Kirill Tkhai wrote:
> On 21.05.2019 17:43, Andy Lutomirski wrote:
>> On Mon, May 20, 2019 at 7:01 AM Kirill Tkhai <ktkhai@xxxxxxxxxxxxx> wrote:
>>>
>>
>>> [Summary]
>>>
>>> New syscall, which allows cloning a remote process's VMA
>>> into the local process's VM. The remote process's page table
>>> entries related to the VMA are cloned into the local process's
>>> page table (at any desired address, which makes this different
>>> from what happens during fork()). Huge pages are handled
>>> appropriately.
>>>
>>> This allows a significant performance improvement, as
>>> shown in the example below.
>>>
>>> [Description]
>>>
>>> This patchset adds a new syscall, which makes it possible
>>> to clone a VMA from a process into the current process.
>>> The syscall supplements the functionality provided
>>> by the process_vm_writev() and process_vm_readv() syscalls,
>>> and it may be useful in many situations.
>>>
>>> For example, it allows making a zero-copy transfer of data
>>> where process_vm_writev() was previously used:
>>>
>>> struct iovec local_iov, remote_iov;
>>> void *buf;
>>>
>>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>> MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>> recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>> local_iov.iov_base = buf;
>>> local_iov.iov_len = n * PAGE_SIZE;
>>> remote_iov = ...;
>>>
>>> process_vm_writev(pid, &local_iov, 1, &remote_iov, 1, 0);
>>> munmap(buf, n * PAGE_SIZE);
>>>
>>> (Note that the above completely ignores error handling.)
>>>
>>> There are several problems with process_vm_writev() in this example:
>>>
>>> 1) it causes a page fault on remote process memory, and it forces
>>> allocation of a new page (if one was not preallocated);
>>
>> I don't see how your new syscall helps. You're writing to remote
>> memory. If that memory wasn't allocated, it's going to get allocated
>> regardless of whether you use a write-like interface or an mmap-like
>> interface.
>
> No, this is not about just another interface for copying memory.
> It is about borrowing the remote task's VMA and the corresponding
> page table content. The syscall allows copying part of a page table,
> with its preallocated pages, from the remote to the local process. See here:
>
> [task1] [task2]
>
> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
> MAP_PRIVATE|MAP_ANONYMOUS, ...);
>
> <task1 populates buf>
>
> buf = process_vm_mmap(pid_of_task1, addr, n * PAGE_SIZE, ...);
> munmap(buf, n * PAGE_SIZE);
>
>
> process_vm_mmap() copies the PTEs related to the memory of buf from task1 to task2,
> just the way we do during the fork() syscall.
>
> There is no copying of buf's memory content unless COW happens. This is
> the principal difference from process_vm_writev(), which just allocates
> pages in the remote VM.
>
>> Keep in mind that, on x86, just the hardware part of a
>> page fault is very slow -- populating the memory with a syscall
>> instead of a fault may well be faster.
>
> It is not as slow as disk IO. Just compare what happens when the anonymous
> pages related to buf of task1 are swapped out:
>
> 1)process_vm_writev() reads them back into memory;
>
> 2)process_vm_mmap() just copies swap PTEs from task1 page table
> to task2 page table.
>
> Also, for faster page faults one may use huge pages for the mappings.
> But really, it's funny to think about page faults when there are
> the disk IO problems I have shown.
>
>>>
>>> 2)amount of memory for this example is doubled in a moment --
>>> n pages in current and n pages in remote tasks are occupied
>>> at the same time;
>>
>> This seems disingenuous. If you're writing p pages total in chunks of
>> n pages, you will use a total of p pages if you use mmap and p+n if
>> you use write.
>
> I didn't understand this sentence because of the many ifs, sorry. Could you
> please explain your thought once again?
>
>> That only doubles the amount of memory if you let n
>> scale linearly with p, which seems unlikely.
>>
>>>
>>> 3) received data has no chance to be properly swapped for
>>> a long time.
>>
>> ...
>>
>>> a)kernel moves @buf pages into swap right after recv();
>>> b)process_vm_writev() reads the data back from swap to pages;
>>
>> If you're under that much memory pressure and thrashing that badly,
>> your performance is going to be awful no matter what you're doing. If
>> you indeed observe this behavior under normal loads, then this seems
>> like a VM issue that should be addressed in its own right.
>
> I don't think so. Imagine: a container migrates from one node to another.
> The nodes are the same; say, each of them has 4GB of RAM.
>
> Before the migration, the container's tasks used 4GB of RAM and 8GB of swap.
> After the page server on the second node has received the pages, we want these
> pages to be swapped out as soon as possible, and we don't want to read them from
> swap to pass a read consumer.

Should be "to pass a *real* consumer".

>
> The page server is task1 in the example. The real consumer is task2.
>
> This is a rather normal load, I think.
>
>>> buf = mmap(NULL, n * PAGE_SIZE, PROT_READ|PROT_WRITE,
>>> MAP_PRIVATE|MAP_ANONYMOUS, ...);
>>> recv(sock, buf, n * PAGE_SIZE, 0);
>>>
>>> [Task 2]
>>> buf2 = process_vm_mmap(pid_of_task1, buf, n * PAGE_SIZE, NULL, 0);
>>>
>>> This creates a copy of VMA related to buf from task1 in task2's VM.
>>> Task1's page table entries are copied into corresponding page table
>>> entries of VM of task2.
>>
>> You need to fully explain a whole bunch of details that you've
>> ignored.
>
> Yeah, it's not a problem :) I'm ready to explain and describe everything
> that may raise a question. Just ask ;)
>
>> For example, if the remote VMA is MAP_ANONYMOUS, do you get
>> a CoW copy of it? I assume you don't since the whole point is to
>> write to remote memory
>
> But, no, there *are* COW semantics. We do not copy memory. We copy
> page table content. This is just the same as what we have on fork(), when
> the child duplicates the parent's VMA and the related page table subset,
> and the parent's PTEs lose the _PAGE_RW flag.
>
> All of the copy_page_range() code is reused for that. Please see [3/7]
> for the details.
>
> I expect to get especially good performance with THP, where the number of entries
> to copy is smaller than in the PTE case.
>
> Copying several PMDs from one task's page table to another's is much, much faster
> than process_vm_writev() copying pages (not to mention its reading from swap).
>
>> ,but it's at the very least quite unusual in
>> Linux to have two different anonymous VMAs such that writing one of
>> them changes the other one.
> Writing to the new VMA does not affect the old VMA. The old VMA is just used to
> get vma->anon_vma and vma->vm_file from there. The two VMAs remain independent
> of each other.
>
>> But there are plenty of other questions.
>> What happens if the remote VMA is a gate area or other special mapping
>> (vDSO, vvar area, etc)? What if the remote memory comes from a driver
>> that wasn't expecting the mapping to get magically copied to a
>> different process?
>
> If someone wants to duplicate such mappings, we may consider
> that and extend the interface in the future for VMA types that are
> safe for that.
>
> But for now the logic is very overprotective, and all the unusual mappings
> like the ones you mentioned (also AIO, etc.) are prohibited. Please see [7/7]
> for the details.
>
>> This new API seems quite dangerous and complex to me, and I don't
>> think the value has been adequately demonstrated.
>
> I don't think it's dangerous and complex, because I haven't introduced
> any VMA concepts fundamentally different from what we have now. We just
> borrow vma->anon_vma and vma->vm_file from the remote process into the local one,
> like we do on fork() (borrowing vma->anon_vma does not mean blindly
> copying it, but an ordinary anon_vma_fork()).
>
> Maybe I should have focused the description more on the copying of PTEs/PMDs
> instead of VMA duplication. So it was unexpected for me that people
> think about simple memory copying after reading the example I gave.
> But I have given more explanation here, so I hope the situation has become
> clearer for the reader. Anyway, if you have any questions, please
> ask me.
>
> Thanks,
> Kirill
>