Re: [PATCH 18/26] mm: Convert process_vm_rw_pages() to use get_user_pages_unlocked()

From: KOSAKI Motohiro
Date: Mon Oct 07 2013 - 20:10:40 EST


(10/7/13 4:55 PM), Jan Kara wrote:
On Thu 03-10-13 18:40:06, KOSAKI Motohiro wrote:
(10/2/13 3:36 PM), Jan Kara wrote:
On Wed 02-10-13 12:32:33, KOSAKI Motohiro wrote:
(10/2/13 10:27 AM), Jan Kara wrote:
Signed-off-by: Jan Kara <jack@xxxxxxx>
---
mm/process_vm_access.c | 8 ++------
1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/mm/process_vm_access.c b/mm/process_vm_access.c
index fd26d0433509..c1bc47d8ed90 100644
--- a/mm/process_vm_access.c
+++ b/mm/process_vm_access.c
@@ -64,12 +64,8 @@ static int process_vm_rw_pages(struct task_struct *task,
*bytes_copied = 0;

/* Get the pages we're interested in */
- down_read(&mm->mmap_sem);
- pages_pinned = get_user_pages(task, mm, pa,
- nr_pages_to_copy,
- vm_write, 0, process_pages, NULL);
- up_read(&mm->mmap_sem);
-
+ pages_pinned = get_user_pages_unlocked(task, mm, pa, nr_pages_to_copy,
+ vm_write, 0, process_pages);
if (pages_pinned != nr_pages_to_copy) {
rc = -EFAULT;
goto end;

This is wrong because original code is wrong. In this function, page may
be pointed to anon pages. Then, you should keep to take mmap_sem until
finish to copying. Otherwise concurrent fork() makes nasty COW issue.
Hum, can you be more specific? I suppose you are speaking about situation
when the remote task we are copying data from/to does fork while
process_vm_rw_pages() runs. If we are copying data from remote task, I
don't see how COW could cause any problem. If we are copying to remote task
and fork happens after get_user_pages() but before copy_to_user() then I
can see we might be having some trouble. copy_to_user() would then copy
data into both original remote process and its child thus essentially
bypassing COW. If the child process manages to COW some of the pages before
copy_to_user() happens, it can even see only some of the pages. Is that what
you mean?

scenario 1: vm_write==0

Process P1 call get_user_pages(pa, process_pages) in process_vm_rw_pages
P1 unlock mmap_sem.
Process P2 call fork(). and make P3.
P2 write memory pa. now the "process_pages" is owned by P3 (the child process)
P3 write memory pa. and then the content of "process_pages" is changed.
P1 read process_pages as P2's page. but actually, it is P3's data. Then,
P1 observe garbage, at least unintended, data was read.
Yeah, this really looks buggy because P1 can see data in (supposedly)
P2's address space which P2 never wrote there.

scenario 2: vm_write==1

Process P1 call get_user_pages(pa, process_pages) in process_vm_rw_pages.
It makes COW break and any anon page sharing broke.
P1 unlock mmap_sem.
P2 call fork(). and make P3. And, now COW page sharing is restored.
P2 write memory pa. now the "process_pages" is owned by P3.
P3 write memory pa. it mean P3 changes "process_pages".
P1 write process_pages as P2's page. but actually, it is P3's. It
override above P3's write and then P3 observe data corruption.
Yep, this is a similar problem as above. Thanks for pointing this out.

The solution is to keep holding mmap_sem until finishing process_pages
access because mmap_sem prevent fork. and then race never be happen. I
mean you cann't use get_user_pages_unlock() if target address point to
anon pages.
Yeah, if you are accessing third party mm,

one correction. the condition is,

- third party mm, or
- current process have multiple threads and other threads makes fork() and COW break

you've convinced me you
currently need mmap_sem to avoid problems with COW on anon pages. I'm just
thinking that this "hold mmap_sem to prevent fork" is somewhat subtle
(definitely would deserve a comment) and if it would be needed in more
places we might be better off if we have a more explicit mechanism for that
(like a special lock, fork sequence count, or something like that).

Hmm..

Actually, I tried this several years ago. But MM people disliked it because
they prefer faster kernel than simple code. Any additional lock potentially
makes slower the kernel.

However, I fully agree the current mechanism is too complex and misleading,
or at least, very hard to understanding. So, any improvement idea is welcome.

Anyway
I'll have this in mind and if I see other places that need this, I'll try
to come up with some solution. Thanks again for explanation.

NP.

I tried to explain this at MM summit. but my English is too poor and I couldn't
explain enough to you. Sorry about that.


Anyway, I'd like to fix process_vm_rw_pages() before my memory will be flushed again.
Thank you for helping remember this bug.

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/