Re: [RFC 0/2] Rootmem: boot-time memory allocator

From: Johannes Weiner
Date: Sun May 04 2008 - 11:34:46 EST


Hi,

Johannes Weiner <hannes@xxxxxxxxxxxx> writes:

> Hi Yinghai,
>
> Johannes Weiner <hannes@xxxxxxxxxxxx> writes:
>
>> Hi,
>>
>> "Yinghai Lu" <yhlu.kernel@xxxxxxxxx> writes:
>>
>>> On Sat, May 3, 2008 at 10:54 AM, Ingo Molnar <mingo@xxxxxxx> wrote:
>>>>
>>>> * Johannes Weiner <hannes@xxxxxxxxxxxx> wrote:
>>>>
>>>> > I was spending some time and work on the bootmem allocator the last
>>>> > few weeks and came to the conclusion that its current design is not
>>>> > appropriate anymore.
>>>> >
>>>> > As Ingo said in another email, NUMA technologies will become weirder,
>>>> > nodes whose PFNs span other nodes for example and it makes bootmem
>>>> > code become an unreadable mess.
>>>> >
>>>> > So I sat down two days ago and rewrote the allocator, here is the
>>>> > result: rootmem!
>>>>
>>>> hehe :-)
>>>>
>>>>
>>>> > The biggest difference to the old design is that there is only one
>>>> > bitmap for all PFNs of all nodes together, so the overlapping PFN
>>>> > problems simply dissolve and fun like allocations crossing node
>>>> > boundaries work implicitly. The new API requires every node used by
>>>> > the allocator to be registered and after that the bitmap gets
>>>> > allocated and the allocator enabled.
>>>> >
>>>> > I chose to add a new allocator rather than replacing bootmem at once
>>>> > because that would have required all callsites to switch in one go,
>>>> > which would be a lot. The new allocator can be adopted more slowly
>>>> > and I added a compatibility API for everything besides actually
>>>> > setting up the allocator. When the last user dies, bootmem can be
>>>> > dropped completely (including pgdat->bdata, whee..)
>>>> >
>>>> > The main ideas from bootmem have been stolen^W preserved but the new
>>>> > design allowed me to shrink the code a lot and express things more
>>>> > simply and clearly:
>>>> >
>>>> > $ sloc.awk < mm/bootmem.c
>>>> > 455 lines of code, 65 lines of comments (520 lines total)
>>>> >
>>>> > $ sloc.awk < mm/rootmem.c
>>>> > 243 lines of code, 96 lines of comments (339 lines total)
>>>>
>>>> amazing!
>>>>
>>>> i'd still suggest to keep it all named bootmem though :-/ How about
>>>> bootmem2.c and then renaming it back to bootmem.c, once the last user is
>>>> gone? That would save people from having to rename whole chapters in
>>>> entire books ;-)
>>>
>>> spanning support, e.g. node0: 0-2g, 4-6g; node1: 2-4g, 6-8g, could have
>>> some problems.
>>
>> Could you elaborate on that?
>>
>>> +/*
>>> + * rootmem_register_node - register a node to rootmem
>>> + * @nid: node id
>>> + * @start: first pfn on the node
>>> + * @end: first pfn after the node
>>> + *
>>> + * This function must not be called anymore if the allocator
>>> + * is already up and running (rootmem_setup() has been called).
>>> + */
>>> +void __init rootmem_register_node(int nid, unsigned long start,
>>> +					unsigned long end)
>>> +{
>>> +	BUG_ON(rootmem_functional);
>>> +
>>> +	if (start < rootmem_min_pfn)
>>> +		rootmem_min_pfn = start;
>>> +	if (end > rootmem_max_pfn)
>>> +		rootmem_max_pfn = end;
>>> +
>>> +	rootmem_node_pages[nid] = end - start;
>>> +	rootmem_node_offsets[nid] = start;
>>> +	rootmem_nr_nodes++;
>>> +}
>>>
>>> could change rootmem_node_pages/offsets to be an array of structs with
>>> offset, pages, and nid, where every node could have several structs and
>>> the whole array is sorted by nid.

In the long term, this would have to be implemented whether or not
rootmem/bootmem2 gets merged, because bootmem suffers from the same
problem, right?

>> The whole point is to be agnostic about weird NUMA configs. Right now,
>> I am pretty proud of the simple data structures and I would avoid
>> blowing them up again unless there is a hard reason to do so.

That was unhelpful crap, please excuse me.

> One thing I have found is that __rootmem_alloc_node cannot guarantee
> that the memory it returns is on the requested node right now.

Hm, we have two choices: either we introduce a new API that requires the
arch code to register not only node ranges but also subranges on each
node, or we don't guarantee that all the memory you get is on the node
you specified. Correct?

The first option would be what you have proposed, I think.
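
To make that first option concrete, here is a minimal userspace sketch,
not kernel code and not what the patch does. The names struct
sketch_range, sketch_register_range() and sketch_pfn_to_nid() are
hypothetical, as is the fixed-size table; the assert() only stands in for
a BUG_ON(). It keeps one entry per contiguous PFN range, allows several
entries per node, and keeps the table sorted by nid:

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_RANGES 16	/* assumption: small static table */

/* Hypothetical: one entry per contiguous PFN range, several per node. */
struct sketch_range {
	int nid;		/* node id */
	unsigned long start;	/* first pfn in the range */
	unsigned long end;	/* first pfn after the range */
};

static struct sketch_range sketch_ranges[SKETCH_MAX_RANGES];
static size_t sketch_nr_ranges;

/*
 * Register one range; the table stays sorted by nid, then by start pfn.
 * Returns 0 on success, -1 if the table is full.
 */
static int sketch_register_range(int nid, unsigned long start,
				 unsigned long end)
{
	size_t i;

	assert(start < end);	/* stands in for BUG_ON() */

	if (sketch_nr_ranges == SKETCH_MAX_RANGES)
		return -1;

	/* Insertion sort step: shift larger entries up by one slot. */
	i = sketch_nr_ranges;
	while (i > 0 &&
	       (sketch_ranges[i - 1].nid > nid ||
		(sketch_ranges[i - 1].nid == nid &&
		 sketch_ranges[i - 1].start > start))) {
		sketch_ranges[i] = sketch_ranges[i - 1];
		i--;
	}

	sketch_ranges[i].nid = nid;
	sketch_ranges[i].start = start;
	sketch_ranges[i].end = end;
	sketch_nr_ranges++;
	return 0;
}

/* Return the node that owns @pfn, or -1 if no registered range has it. */
static int sketch_pfn_to_nid(unsigned long pfn)
{
	size_t i;

	for (i = 0; i < sketch_nr_ranges; i++)
		if (pfn >= sketch_ranges[i].start &&
		    pfn < sketch_ranges[i].end)
			return sketch_ranges[i].nid;
	return -1;
}
```

With Yinghai's layout (node0: 0-2g, 4-6g; node1: 2-4g, 6-8g) this would
mean four registrations, two per node, and a node-local allocation would
only have to consider the entries carrying the requested nid.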

> I will include the fix in the next version.

Wow, I promised too much there. Right now, I have no idea what the
correct solution would be.

Hannes