Re: [RFC] memory defragmentation to satisfy high order allocations

From: Hirokazu Takahashi
Date: Sun Oct 03 2004 - 13:43:55 EST


Hi, Marcelo

> > > 2)
> > > At migrate_onepage you add anonymous pages which aren't swap allocated
> > > to the swap cache
> > > + /*
> > > + * Put the page in a radix tree if it isn't in the tree yet.
> > > + */
> > > +#ifdef CONFIG_SWAP
> > > + if (PageAnon(page) && !PageSwapCache(page))
> > > + if (!add_to_swap(page, GFP_KERNEL)) {
> > > + unlock_page(page);
> > > + return ERR_PTR(-ENOSPC);
> > > + }
> > > +#endif /* CONFIG_SWAP */
> > >
> > > Why's that? You can copy anonymous pages without adding them to swap (thats
> > > what the patch I posted does).
> >
> > The reason is to guarantee that any anonymous page can be migrated anytime.
> > I want to block newly occurred accesses to the page during the migration
> > because it can't be migrated if there remain some references on it by
> > system calls, direct I/O and page faults.
>
> It would be nice if we could block pte faults in a way such to not need
> adding each anonymous page to swap. It can be too costly if you have a lot memory
> and it makes the whole operation dependable on swap size (if you dont have enough
> swap, you're dead).
>
> Maybe hold mm->page_table_lock (might be too costly in terms of CPU time, but since
> migration is not a common operation anyway), or create a semaphore?

I think the problem of the holding mm->page_table_lock approach is
that it doesn't allow the migration code blocked. the semaphore
approach would be better.

I have another idea that each anonymous page can detach its swap entry
after its migration. It can be done by remove_exclusive_swap_page()
if the page is remapped to the same spaces forcibly by
touch_unmapped_address() I made.

> > Your approach will work fine on most of anonymous pages, which aren't
> > heavily accessed. I think it will be enough for memory defragmentation.
>
> Yes...
>
> > > 3) At migrate_page_common you assume additional page references
> > > (page_migratable returning -EAGAIN) means the code should try to writeout
> > > the page.
> > >
> > > Is that assumption always valid?
> >
> > -EAGAIN means that the page may require to be written back
>
> But why is it needed to writeout pages? We shouldnt need to. At least
> from what I can understand.

The migration code allows each filesystem to implement its own
migration code or just use migrate_page_buffer() or
migrate_page_common().

migrate_page_common() is a default function if filesystem doesn't
implement anything. The function is the most generic and it tries
to writeback pages only if they are dirty and have buffers.

> > or
> > just to wait for a while since the page is just referred by system call
> > or pagefault handler.
>
> I'm not sure if making that assumption is always valid.
>
> Kernel code can have an additional count on the page meaning "this page is pinned,
> dont move it". At least that should be valid.

Yes, I know. I have checked all of the code.

AIO event buffers are pinned, therefore the memory-hotplug team plans
to make pages for the event buffers assigned to non-hotpluggable
memory regions.

And pages in sendfile() might be pinned for a while in case of network
problems. I think there may be some workarounds. The easiest way
is just waiting its timeout, and another way is changing the mode
of sendfile() to copy pages in advance.

Pages for NFS also might be pinned with network problems.
One of the ideas is to restrict NFS to allocate pages from
specific memory region, sot that all memory except the region
can be hot-removed. And it's possible to implementing whole
migrate_page method, which may handled stuck pages.

If the migration code is used for memory defragmentation, pinned pages
must be avoided. I think it can be done with the non-blocking mode.

> Any piece of code which holds a reference on a page for a long
> time is going to be a pain for the algorithm right?
>

> > > 4)
> > > About implementing a nonblocking version of it. The easier way, it
> > > seems to me, is to pass a "block" argument to generic_migrate_page() and
> > > use that.
> >
> > Yes.
>
> OK. I'll try to implement it this week (plus the radix_tree_replace
> tag thingie).

Thank you for that.

Hirokazu Takahashi.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/