Re: Question about memory mapping mechanism

From: Hugh Dickins
Date: Tue Mar 13 2007 - 16:49:39 EST


On Fri, 9 Mar 2007, Martin Drab wrote:
> On Fri, 9 Mar 2007, Martin Drab wrote:
> > On Thu, 8 Mar 2007, Martin Drab wrote:
> > > On Thu, 8 Mar 2007, Carsten Otte wrote:
> > > > On 3/8/07, Martin Drab <drab@xxxxxxxxxxxxxxxxxxx> wrote:
> > > > >
> > > > > The thing is that I'd like to prevent kernel to swap these pages out,
> > > > > because then I may loose some data when they are not available in time
> > > > > for the next round.
> > > >
> > > > One think you could do is grab a reference to the pages upfront.
> > >
> > > I'm not really sure what exactly do you mean by "grab a reference
> > > upfront"?
> >
> > I seem to get it now. So instead of setting PG_reserved upon allocation of
> > the buffer pages, I should increase the referrence of the pages by calling
> > get_page() on them and that would prevent the pages to get on the lru list
> > and thus preventing them to be swapped out. Is that it?
>
> Well, so I tried it. Truth is that the "Bad page" messages upon munmap(2)
> are gone. But whether the pages are really prevented from being swapped
> out? I don't know. I don't know how to find out.

Hi Martin, sorry for joining so late.

Most of your anxieties are unfounded: the pages your driver allocates
with __get_free_pages are not put on any LRU (until they're freed),
kernel pages are never swapped out, and making them visible to user-
space does not put them in any more danger of being swapped out;
but yes, if the refcounting goes wrong, that will cause premature
freeing, "Bad page" messages, trouble.

(Of course, filesystem pagecache pages are liable to be swapped
out, and anonymous userspace pages: but you'd have to take special
steps to go down those paths, which I'm confident you're not taking:
this code used to work, you say.)

Please disregard the suggestion to look at Infiniband, it will only
confuse you further: Infiniband core/uverbs_mem.c does provide a very
good example of how to handle the much more complex opposite of what
you're trying to do. You have a driver making its pages visible to
userspace, Infiniband shows how a driver should deal with userspace
buffers made available to it (which does involve worrying about
those issues which were concerning you).

Your problem is that PageReserved used to work for you, and now
(post-2.6.14) it doesn't: we do now rely entirely on the refcount,
and will report "Bad page state" if a pagecount goes down to zero
while still marked as PageReserved, which is what you saw.

My guess is that you're using __get_free_pages with non-0 order?
But mapping individual PAGE_SIZE pages from that into userspace
by a nopage method? That is liable to be a problem, yes, because
the refcount for the whole is kept in the first struct page, the
later struct pages showing refcount 0: the whole is supposed to
be dealt with all together, but userspace page accounting on exit
will treat each PAGE_SIZE separately. PageReserved used to
override the refcount, but it no longer does so.

There's a number of different solutions to that, and fiddling with
the reference counts of the constituent pages is certainly one of
them; though better is to use split_page() (see mm/page_alloc.c),
then free the constituent pages separately at the end; (and better
is to avoid >0-order allocations since they're harder to guarantee,
but presumably you don't want to reorganize your driver right now;)
but the simplest change is to __get_free_pages with the __GFP_COMP
flag set, which marks all the constituent pages as constituents of
a compound page, and thereby keeps the refcounting right - that's
the solution we used in sound/core/memalloc.c when this problem
first came up (but it won't work pre-2.6.15).

Maybe the solution you've already adopted, with additional
get_page()s, is correct: but you need to be careful when your
module is unloaded, there's a danger of leaking the memory if
you don't put_page() the first, and there's a danger of... I
forget what exactly if you don't reset the counts properly on
the subsequent constituents. __GFP_COMP to hold the compound
page properly together, or split_page() to split it properly
apart, are preferable.

Hugh
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/