Re: [PATCH -mm] slab: fix cpuset check in fallback_alloc
From: Vladimir Davydov
Date: Mon Aug 11 2014 - 08:17:53 EST
On Mon, Aug 11, 2014 at 04:37:15AM -0700, David Rientjes wrote:
> On Mon, 11 Aug 2014, Vladimir Davydov wrote:
>
> > > diff --git a/mm/slab.c b/mm/slab.c
> > > --- a/mm/slab.c
> > > +++ b/mm/slab.c
> > > @@ -3047,16 +3047,19 @@ retry:
> > > * from existing per node queues.
> > > */
> > > for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
> > > - nid = zone_to_nid(zone);
> > > + struct kmem_cache_node *n;
> > >
> > > - if (cpuset_zone_allowed_hardwall(zone, flags) &&
> > > - get_node(cache, nid) &&
> > > - get_node(cache, nid)->free_objects) {
> > > - obj = ____cache_alloc_node(cache,
> > > - flags | GFP_THISNODE, nid);
> > > - if (obj)
> > > - break;
> > > - }
> > > + nid = zone_to_nid(zone);
> > > + if (!cpuset_zone_allowed(zone, flags | __GFP_HARDWALL))
> >
> > We must use the softwall check here; otherwise we will proceed to
> > alloc_pages even if there are lots of free slabs on other nodes.
> > alloc_pages, in turn, may allocate from other nodes when
> > cpuset.mem_hardwall=0, because it uses the softwall check, so it may add
> > yet another free slab to another node's list even if that list isn't
> > empty. As a result, we may get free list bloating on other nodes. I've
> > seen a machine with one of its nodes almost completely filled with
> > inactive slabs for buffer_heads (dozens of GBs) with no way to drop
> > them. So, this is a bug that must be fixed.
> >
>
> Right, I understand, and my patch makes no attempt to fix that issue; it
> simply collapses the code down into a single cpuset_zone_allowed()
> function, and the context for the allocation is controlled by the gfp
> flags (hardwall behaviour is requested by setting __GFP_HARDWALL), as it
> should be. I understand the issue you face, but I can't combine a cleanup
> with a fix, and I would prefer that your patch keep your commit
> description.
Sorry, I misunderstood you.
> The diffstat for my proposal removes many more lines than it adds and I
> think it will avoid this type of issue in the future for new callers.
> Your patch could then be based on the single cpuset_zone_allowed()
> function where you would simply have to remove the __GFP_HARDWALL above.
> Or, your patch could be merged first and then my cleanup on top, but it
> seems like your one-liner would be clearer if it were based on mine.
Having one function instead of two doing similar things is usually better
IMO, but AFAIU your patch isn't a mere cleanup - it also slightly changes
how !__GFP_WAIT allocations interact with cpusets:
> @@ -2505,18 +2501,22 @@ static struct cpuset *nearest_hardwall_ancestor(struct cpuset *cs)
> * GFP_USER - only nodes in current tasks mems allowed ok.
> *
> * Rule:
> - * Don't call cpuset_node_allowed_softwall if you can't sleep, unless you
> + * Don't call __cpuset_node_allowed if you can't sleep, unless you
> * pass in the __GFP_HARDWALL flag set in gfp_flag, which disables
> * the code that might scan up ancestor cpusets and sleep.
> */
> -int __cpuset_node_allowed_softwall(int node, gfp_t gfp_mask)
> +int __cpuset_node_allowed(int node, const gfp_t gfp_mask)
> {
> struct cpuset *cs; /* current cpuset ancestors */
> int allowed; /* is allocation in zone z allowed? */
>
> - if (in_interrupt() || (gfp_mask & __GFP_THISNODE))
> + if (in_interrupt())
> return 1;
> might_sleep_if(!(gfp_mask & __GFP_HARDWALL));
> + if (gfp_mask & __GFP_THISNODE)
> + return 1;
> + if (!(gfp_mask & __GFP_WAIT))
> + return 1;
This means cpuset_zone_allowed will now always return true for
!__GFP_WAIT allocations.
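
To make that concrete, here is a toy userspace model of the check order in
the patched __cpuset_node_allowed() (this is not kernel code; the flag names
and the mems_allowed lookup are simplified stand-ins): an allocation without
__GFP_WAIT is declared allowed before the mems_allowed test is ever reached,
whereas the old softwall/hardwall helpers still filtered such an allocation
through node_isset(node, current->mems_allowed).

#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the real gfp flags (values are arbitrary). */
#define TOY_GFP_WAIT		0x1
#define TOY_GFP_HARDWALL	0x2
#define TOY_GFP_THISNODE	0x4

/*
 * Toy model of the check order in the proposed __cpuset_node_allowed():
 * the !__GFP_WAIT test fires before mems_allowed is consulted, so atomic
 * allocations never see the cpuset restriction at all.
 */
static bool toy_node_allowed(unsigned int flags, bool node_in_mems_allowed)
{
	if (flags & TOY_GFP_THISNODE)
		return true;
	if (!(flags & TOY_GFP_WAIT))
		return true;	/* atomic: cpuset is ignored entirely */
	if (node_in_mems_allowed)
		return true;
	if (flags & TOY_GFP_HARDWALL)
		return false;	/* hardwall: don't scan ancestor cpusets */
	return false;		/* softwall ancestor scan omitted in this model */
}

int main(void)
{
	/* Atomic (!__GFP_WAIT) allocation on a node outside mems_allowed: */
	printf("atomic, outside cpuset: %s\n",
	       toy_node_allowed(0, false) ? "allowed" : "denied");
	return 0;
}
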
> if (node_isset(node, current->mems_allowed))
> return 1;
> /*
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1963,7 +1963,7 @@ zonelist_scan:
>
> /*
> * Scan zonelist, looking for a zone with enough free.
> - * See also __cpuset_node_allowed_softwall() comment in kernel/cpuset.c.
> + * See __cpuset_node_allowed() comment in kernel/cpuset.c.
> */
> for_each_zone_zonelist_nodemask(zone, z, zonelist,
> high_zoneidx, nodemask) {
> @@ -1974,7 +1974,7 @@ zonelist_scan:
> continue;
> if (cpusets_enabled() &&
> (alloc_flags & ALLOC_CPUSET) &&
> - !cpuset_zone_allowed_softwall(zone, gfp_mask))
> + !cpuset_zone_allowed(zone, gfp_mask))
> continue;
So, this is get_page_from_freelist. It's called from
__alloc_pages_nodemask with alloc_flags always having the ALLOC_CPUSET
bit set, and from __alloc_pages_slowpath with the ALLOC_CPUSET bit set
only for __GFP_WAIT allocations. In other words, without your patch we
try to respect cpusets for all allocations, including atomic ones, and
only ignore cpusets for !__GFP_WAIT allocations when we are tight on
memory (the free lists are empty), while with your patch we always
ignore cpusets for !__GFP_WAIT allocations. Not sure if it really
matters though, because usually one uses cpuset.mems in conjunction with
cpuset.cpus, and then it won't make any difference. It also doesn't
conflict with any cpuset documentation.
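
To spell out the difference for the atomic case, here is another toy model
(again userspace only, not kernel code; it merely mirrors the
alloc_flags / __GFP_WAIT logic described above): on the fast path a
GFP_ATOMIC allocation used to be cpuset-filtered and only escaped the
cpuset once the free lists were exhausted, whereas with the patch it
escapes immediately.

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model (not kernel code) of whether the cpuset actually constrains an
 * allocation.  "slowpath" means get_page_from_freelist() found nothing and
 * we retried with the flags from gfp_to_alloc_flags().
 */
static bool cpuset_enforced_before(bool gfp_wait, bool slowpath)
{
	/* ALLOC_CPUSET is always set on the fast path and is dropped on the
	 * slow path only for atomic (!__GFP_WAIT) allocations. */
	bool alloc_cpuset = slowpath ? gfp_wait : true;

	/* cpuset_zone_allowed_softwall() then checks mems_allowed. */
	return alloc_cpuset;
}

static bool cpuset_enforced_after(bool gfp_wait, bool slowpath)
{
	bool alloc_cpuset = slowpath ? gfp_wait : true;

	/* Even with ALLOC_CPUSET set, the new __cpuset_node_allowed()
	 * returns 1 for !__GFP_WAIT, so the cpuset never bites. */
	return alloc_cpuset && gfp_wait;
}

int main(void)
{
	/* Atomic allocation while there is still plenty of free memory: */
	printf("fast path, atomic: before=%d after=%d\n",
	       cpuset_enforced_before(false, false),
	       cpuset_enforced_after(false, false));	/* prints 1 vs 0 */
	return 0;
}
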
> /*
> * Distribute pages in proportion to the individual