Re: [PATCH 10 of 20] ipath - support for userspace apps using coredriver
From: Hugh Dickins
Date: Thu Mar 16 2006 - 09:22:31 EST
On Wed, 15 Mar 2006, Andrew Morton wrote:
> "Bryan O'Sullivan" <bos@xxxxxxxxxxxxx> wrote:
> >
> > Bad page state at __free_pages_ok (in process 'mpi_hello', page ffff8100020e2f88)
> > flags:0x0100000000000804 mapping:0000000000000000 mapcount:0 count:0 (Not tainted)
>
> Someone left PG_private set on this page (!)
That's pretty sure to be from page_alloc.c's internal use of
PagePrivate, for the buddy system of free pages.
I think Bryan's problem is probably that he has allocated a
high-order page (via pci_alloc_consistent or dma_alloc_coherent),
and is then exposing the constituents of that high-order page to
userspace via nopage: when the final munmap or exit comes,
the constituent pages get freed without realizing they're
supposed to be part of one high-order page.
We had the same problem in sound/core/memalloc.c, and fixed it
by using a compound page (a high-order page in which each constituent
knows it's part of the larger whole): see the two uses of __GFP_COMP
in that file, and in particular the one where snd_malloc_dev_pages
calls dma_alloc_coherent.
Yes, as Roland suggested, it sounds sensible to use dma_alloc_coherent
throughout rather than pci_alloc_consistent, then you can pass __GFP_COMP
down with the other flags and it should work (without any PageReserved).
I haven't checked Bryan's get_pag'ing, but he seemed to have the right
idea of how it should work: was just being a bit diffident about it in
the face of the fact that it clearly wasn't working (through lack of
__GFP_COMP).
Of course, there is an alternative way of going about it, using
remap_pfn_range more (VM_PFNMAP keeping the VM away from refcounting
the pages) and nopage less. But I think Bryan should probably keep on
with the division he already has, no strong reason to change it around.
[ snipping lots of helpful explanation and advice from Andrew ]
> If you have pages which are created by ->nopage() but which are supposed to
> be shared across forks (the VMA should use VM_DONTCOPY|VM_SHARED) then each
> time a new process starts accessing these already-existing pages it'll take
> a minor fault.
No. That isn't how VM_DONTCOPY behaves, and I don't know what you were
thinking to bring it into the discussion there - if you have pages which
are supposed to be shared across forks, that happens without any special
flag, doesn't it? We only teach VM_DONTCOPY in the second year,
I think Bryan would best forget it for now ;)
Oh well, I'd better say a bit on it. VM_DONTCOPY excludes the whole vma
from being forked into the child. So if the child tries to access one
of those addresses, the child'll get a SIGSEGV not a minor fault. RDMA
is finding it a useful flag to get around a peculiarly frustrating issue
with get_user_pages: though that successfully pins pages, a fork followed
by parent writing to pinned private page will put a _copy_ of that page
into the parent's userspace (if child holds a reference to the original),
which likely defeats the purpose of pinning.
So far as Bryan is dealing with shared mappings, he won't be interested
in VM_DONTCOPY at all. If he's dealing with private mappings, he might
want to consider VM_DONTCOPY if get_user_pages and fork can occur
concurrently; but it's an advanced topic, right now he wants to get
the basis working without "Bad page state"s.
He shouldn't set VM_SHM, it means nothing. If he's dealing with real
allocated pages, he shouldn't set VM_IO (but remap_pfn_range will set
it for him if it's used). He shouldn't set VM_RESERVED (won't do any
harm, but we're quite likely to remove it soon). He can set VM_DONT-
EXPAND to avoid worrying about whether mremap to a larger extent,
followed by fault on that larger extent, would behave correctly.
He shouldn't set VM_LOCKED (it would make the locked_vm accounting
go wrong - unless he's careful to participate in that accounting,
as Infiniband's uverbs_mem.c creditably does).
Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/