Re: [PATCH] page_alloc: skip cpuset enforcement for lower zone allocations (v4)

From: David Rientjes
Date: Thu May 29 2014 - 19:02:04 EST


On Thu, 29 May 2014, Marcelo Tosatti wrote:

> Zone specific allocations, such as GFP_DMA32, should not be restricted
> to cpusets allowed node list: the zones which such allocations demand
> might be contained in particular nodes outside the cpuset node list.
>
> Necessary for the following usecase:
> - driver which requires zone specific memory (such as KVM, which
> requires root pagetable at paddr < 4GB).
> - user wants to limit allocations of application to nodeX, and nodeX has
> no memory < 4GB.
>
> Signed-off-by: Marcelo Tosatti <mtosatti@xxxxxxxxxx>
>
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 3d54c41..3bbc23f 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -2374,6 +2374,7 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
> * variable 'wait' is not set, and the bit ALLOC_CPUSET is not set
> * in alloc_flags. That logic and the checks below have the combined
> * affect that:
> + * gfp_zone(mask) < policy_zone - any node ok
> * in_interrupt - any node ok (current task context irrelevant)
> * GFP_ATOMIC - any node ok
> * TIF_MEMDIE - any node ok
> @@ -2392,6 +2393,10 @@ int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
>
> if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
> return 1;
> +#ifdef CONFIG_NUMA
> + if (gfp_zone(gfp_mask) < policy_zone)
> + return 1;
> +#endif
> might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
> if (node_isset(node, current->mems_allowed))
> return 1;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 5dba293..a0ce1ba 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2726,6 +2726,11 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order,
> retry_cpuset:
> cpuset_mems_cookie = read_mems_allowed_begin();
>
> +#ifdef CONFIG_NUMA
> + if (gfp_zone(gfp_mask) < policy_zone)
> + nodemask = &node_states[N_ONLINE];
> +#endif
> +
> /* The preferred zone is used for statistics later */
> first_zones_zonelist(zonelist, high_zoneidx,
> nodemask ? : &cpuset_current_mems_allowed,

There are still three issues with this, two of which are only minor and
one that needs more thought:

(1) this affects more than just cpusets, which is all the changelog
mentions: it also bypasses mempolicies for GFP_DMA and GFP_DMA32
allocations, since nodemask != NULL in the page allocator when there is
an effective mempolicy. That may be precisely what you're trying to do
(do the same for mempolicies as you're doing for cpusets), but the
comment now in the code specifically refers to cpusets. Can you make a
case for the mempolicies exception as well? Otherwise, we'll need to do

	if (!nodemask && gfp_zone(gfp_mask) < policy_zone)
		nodemask = &node_states[N_ONLINE];
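
To make the difference concrete, here is a minimal userspace sketch of that guard. It is not kernel code: the enum, the typedef and pick_nodemask() are hypothetical stand-ins, and a nodemask is modelled as a plain bitmask. It only shows that a mempolicy-supplied nodemask survives when the !nodemask test is present.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; illustration only. */
enum model_zone { MODEL_ZONE_DMA, MODEL_ZONE_DMA32, MODEL_ZONE_NORMAL };

typedef unsigned long nodemask_model_t;	/* one bit per node */

/*
 * Choose the nodemask for an allocation.
 *
 * policy_nodemask:  non-NULL when a mempolicy supplied a nodemask.
 * requested_zone:   models the result of gfp_zone(gfp_mask).
 * policy_zone:      models policy_zone; requests for a zone below it
 *                   are treated as zone-specific.
 * all_memory_nodes: mask used when the restriction is lifted.
 *
 * A NULL return means "no explicit nodemask"; the caller would then
 * fall back to the cpuset's allowed nodes.
 */
static const nodemask_model_t *
pick_nodemask(const nodemask_model_t *policy_nodemask,
	      enum model_zone requested_zone,
	      enum model_zone policy_zone,
	      const nodemask_model_t *all_memory_nodes)
{
	assert(all_memory_nodes != NULL);

	/* Widen only when no mempolicy nodemask is in effect. */
	if (!policy_nodemask && requested_zone < policy_zone)
		return all_memory_nodes;

	return policy_nodemask;
}
```

Without the !policy_nodemask test, the second argument alone would decide the outcome and an effective mempolicy would be silently overridden for low-zone requests.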

And the two minors:

(2) this should be &node_states[N_MEMORY], not &node_states[N_ONLINE],
since memoryless nodes should not be included. Note that
guarantee_online_mems() looks at N_MEMORY, and
cpuset_current_mems_allowed is defined as node_states[N_MEMORY] when
cpusets are not configured.

(3) this doesn't need to come after the "retry_cpuset" label: the gfp
mask doesn't change, so there is no reason to check it again each time
we need to relook at the allowed cpuset mask.
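
For (2), a rough userspace model of why the two node states differ. Again, nothing here is kernel API: struct node_model and both helpers are made up for illustration, with one function standing in for N_ONLINE and the other for N_MEMORY.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_MAX_NODES 8	/* hypothetical limit for this sketch */

/* Hypothetical per-node description; not a kernel structure. */
struct node_model {
	bool online;		/* node is online */
	unsigned long pages;	/* pages of memory present on the node */
};

/* One bit per online node (stands in for N_ONLINE). */
static unsigned long model_online_mask(const struct node_model *nodes, int n)
{
	unsigned long mask = 0;

	assert(n >= 0 && n <= MODEL_MAX_NODES);
	for (int i = 0; i < n; i++)
		if (nodes[i].online)
			mask |= 1UL << i;
	return mask;
}

/* One bit per online node that has memory (stands in for N_MEMORY). */
static unsigned long model_memory_mask(const struct node_model *nodes, int n)
{
	unsigned long mask = 0;

	assert(n >= 0 && n <= MODEL_MAX_NODES);
	for (int i = 0; i < n; i++)
		if (nodes[i].online && nodes[i].pages > 0)
			mask |= 1UL << i;
	return mask;
}
```

With one populated node and one online but memoryless node, the first mask has both bits set and the second only the populated node's bit, so handing the first to the allocator would include a node it can never satisfy a request from.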