Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexitylinear

From: Brice Goglin
Date: Fri Oct 10 2008 - 17:00:48 EST

Next message: Parag Warudkar: "Re: [linux-pm] WARNING: at kernel/power/main.c:176 suspend_test_finish+0x70/0x80()"
Previous message: Daniel Lezcano: "Re: [RFC][PATCH 1/2] Track in-kernel when we expect checkpoint/restartto work"
In reply to: Christoph Lameter: "Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexitylinear"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Brice Goglin wrote:
> Andrew Morton wrote:
>
>> On Thu, 09 Oct 2008 14:32:26 +0200
>> Brice Goglin <Brice.Goglin@xxxxxxxx> wrote:
>>
>>
>>
>>> Add a radix-tree in do_move_pages() to associate each page with
>>> the struct page_to_node that describes its migration.
>>> new_page_node() can now easily find out the page_to_node of the
>>> given page instead of traversing the whole page_to_node array.
>>> So the overall complexity is linear instead of quadratic.
>>>
>>> We still need the page_to_node array since it is allocated by the
>>> caller (sys_move_page()) and used by do_pages_stat() when no target
>>> nodes are given by the application. And we need room to store all
>>> these page_to_node entries for do_move_pages() as well anyway.
>>>
>>> If a page is given twice by the application, the old code would
>>> return -EBUSY (failure from the second isolate_lru_page()). Now,
>>> radix_tree_insert() will return -EEXIST, and we convert it back
>>> to -EBUSY to keep the user-space ABI.
>>>
>>> The radix-tree is emptied at the end of do_move_pages() since
>>> new_page_node() doesn't know when an entry is used for the last
>>> time (unmap_and_move() could try another pass later).
>>> Marking pp->page as ZERO_PAGE(0) was actually never used. We now
>>> set it to NULL when pp is not in the radix-tree. It is faster
>>> than doing a loop of radix_tree_lookup_gang()+delete().
>>>
>>>
>> Any O(n*n) code always catches up with us in the end. But I do think
>> that to merge this code we'd need some description of the problem which
>> we fixed.
>>
>> Please send a description of the situation under which the current code
>> performs unacceptably. Some before-and-after quantitative measurements
>> would be good.
>>
>> Because it could be (as far as I know) that the problem is purely
>> theoretical, in which case we might not want the patch at all.
>>
>>
>
> Just try sys_move_pages() on a 10-100MB buffer, you'll get something
> like 50MB/s on a recent Opteron machine. This throughput decreases
> significantly with the number of pages. With this patch, we get about
> 350MB/s and the throughput is stable when the migrated buffer gets
> larger. I don't have detailled numbers at hand, I'll send them by monday.
>

Here's some quickyl-gathered numbers for the duration of move_pages().
It's between nodes #2 and #3 of a quad-quad-core opteron 2347HE with
2.6.27-rc5 + perfmon2:

buffer (kB) move_pages (us) move_pages with patch (us)
4000 12351 12580
40000 223975 123024

As you can see, with the patch applied, the migration time for 10x more
pages is 10x more. Without the patch, it's 18x.

I'll see if I can implement what Christoph's ideas.

Brice

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

Next message: Parag Warudkar: "Re: [linux-pm] WARNING: at kernel/power/main.c:176 suspend_test_finish+0x70/0x80()"
Previous message: Daniel Lezcano: "Re: [RFC][PATCH 1/2] Track in-kernel when we expect checkpoint/restartto work"
In reply to: Christoph Lameter: "Re: [PATCH] mm: use a radix-tree to make do_move_pages() complexitylinear"
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]