On Tue, 2006-11-28 at 13:24 +0000, Mel Gorman wrote:On Mon, 27 Nov 2006, Rohit Seth wrote:
Hi Mel,
On Mon, 2006-11-27 at 13:18 +0000, Mel Gorman wrote:On Wed, 22 Nov 2006, Rohit Seth wrote:
This patch provides a IO hole size in a given address range.
Hi,
This patch reintroduces a function that doubles up what
absent_pages_in_range(start_pfn, end_pfn). I recognise you do this because
you are interested in hole sizes before add_active_range() is called.
Right.
However, what is not clear is why these patches are so specific to x86_64.
Specifically in the fake numa case, we want to make sure that we don't
carve fake nodes that only have IO holes in it. Unlike the real NUMA
case, here we don't have SRAT etc. to know the memory layout beforehand.
It looks possible to do the work of functions like split_nodes_equal() in
an architecture-independent manner using early_node_map rather than
dealing with the arch-specific nodes array. That would open the
possibility of providing fake nodes on more than one architecture in the
future.
The functions like splti_nodes_equal etc. can be abstracted out to arch
independent part. I think the only API it needs from arch dependent
part is to find out how much real RAM is present in range without have
to first do add_active_range.
That is a problem because the ranges must be registered with
add_active_range() to work out how much real RAM is present.
Right. And that is why I need e820_hole_size functionality. BTW, what
is the concern in having that function?
That is precisely why I'm doing it :-)Though as a first step, let us fix the x86_64 (as it doesn't boot when
you have sizeable chunk of IO hole and nodes > 4).
Ok.
I'm also not sure if other archs actually want to have this
functionality.
It's possible that the containers people are interested in the possibility
of setting up fake nodes as part of a memory controller.
What I think can be done is that you register memory as normal and then
split up the nodes into fake nodes. This would remove the need for having
e820_hole_size() reintroduced.
Are you saying first let the system find out real numa topology and then
build fake numa on top of it?
Yes, there is nothing stopping you altering the early_node_map[] before
free_area_init_node() initialises the node_mem_map. If you do hit a
problem, it'll be because x86_64 allocates it's own node_mem_map with
CONFIG_FLAT_NODE_MEM_MAP is set. Is that set when setting up fake nodes?
I thought they both (real numa + fake numa) operate on same data
structures. I'll have to double check.
-rohit