Re: [BUGFIX][PATCH] oom-kill: fix NUMA constraint check with nodemask v2

From: KAMEZAWA Hiroyuki
Date: Tue Nov 10 2009 - 03:19:54 EST


On Tue, 10 Nov 2009 17:03:38 +0900
Daisuke Nishimura <nishimura@xxxxxxxxxxxxxxxxx> wrote:

> On Tue, 10 Nov 2009 16:40:55 +0900, KAMEZAWA Hiroyuki <kamezawa.hiroyu@xxxxxxxxxxxxxx> wrote:
> > On Tue, 10 Nov 2009 16:39:02 +0900 (JST)
> > KOSAKI Motohiro <kosaki.motohiro@xxxxxxxxxxxxxx> wrote:
> >
> > > > > > +
> > > > > > +	/* Check this allocation failure is caused by cpuset's wall function */
> > > > > > +	for_each_zone_zonelist_nodemask(zone, z, zonelist,
> > > > > > +			high_zoneidx, nodemask)
> > > > > > +		if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > > > > > 			return CONSTRAINT_CPUSET;
> > > > >
> > > > > > If cpuset and MPOL_BIND are both used, CONSTRAINT_MEMORY_POLICY is
> > > > > > probably the better choice.
> > > >
> > > > > No. This memory allocation failed because of the limitation of
> > > > > cpuset's alloc mask, not because of mempolicy.
> > >
> > > > But CONSTRAINT_CPUSET doesn't help to free the necessary node memory.
> > > > It isn't your fault; the original code is wrong too. But I hope we can
> > > > fix it.
> > >
> I think so too.
>
> > Hmm, maybe fair enough.
> >
> > My 3rd version will use "always kill current (CONSTRAINT_MEMORY_POLICY
> > does this) if it uses mempolicy" logic.
> >
> "if it uses mempolicy"?
> You mean "always kill current if memory allocation has failed because of
> the limitation of cpuset's mask" (i.e. the CONSTRAINT_CPUSET case)?
>

No. I mean "always kill the current process if the memory allocation uses
mempolicy", regardless of cpuset. If the allocation doesn't use mempolicy,
the usual CONSTRAINT_CPUSET/CONSTRAINT_NONE oom handling will be invoked.

Now, without the patch, CONSTRAINT_MEMORY_POLICY is never returned at all.
I'd like to limit the scope of this patch to making it be returned. When
it is returned, current will be killed.
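
To make the intended ordering concrete, here is a rough userspace sketch
(not the patch itself). struct alloc_info, classify_oom() and
kills_current() are made-up names for illustration only; the enum just
mirrors the constraint names used in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Mirrors the constraint names from this thread; values are illustrative. */
enum oom_constraint {
    CONSTRAINT_NONE,
    CONSTRAINT_CPUSET,
    CONSTRAINT_MEMORY_POLICY,
};

/* Hypothetical summary of one failed allocation (not a kernel struct). */
struct alloc_info {
    bool uses_mempolicy;    /* allocation was restricted by a mempolicy */
    bool cpuset_restricted; /* some zone was refused by the cpuset check */
};

/* mempolicy is checked first, so it wins regardless of cpuset. */
static enum oom_constraint classify_oom(const struct alloc_info *a)
{
    assert(a != NULL);
    if (a->uses_mempolicy)
        return CONSTRAINT_MEMORY_POLICY;
    if (a->cpuset_restricted)
        return CONSTRAINT_CPUSET;
    return CONSTRAINT_NONE;
}

/* Only the mempolicy case kills current directly in this sketch. */
static bool kills_current(enum oom_constraint c)
{
    return c == CONSTRAINT_MEMORY_POLICY;
}
```

The point of checking mempolicy first is only that a task using both
cpuset and mempolicy ends up in the "kill current" case; the other two
cases keep whatever handling they have today.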

Eventually, we'll have to consider the "how to manage oom under cpuset"
problem again. It's not handled well now.

The main problems are:
- Cpuset allows groups to have overlapping sets of nodes.
- A task can be migrated to another cpuset without moving its memory.
- We don't have per-node rss information for each task.

As a result,
- We have to scan all tasks.
- We have to invoke Totally-Random-Innocent-Task-Killer and pray that
  someone bad will be killed.

IMHO, "finding the correct one" is too heavy for the kernel (under cpuset).
If we had a notifier to userland, some daemon could check numa_maps of all
tasks and do something reasonable.
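
As a rough idea of what such a daemon might do with numa_maps, the sketch
below sums the "N<node>=<pages>" fields of a single /proc/<pid>/numa_maps
line into a per-node array. add_numa_maps_line() and MAX_NODES are
hypothetical names, and the notifier itself does not exist; this only
shows the parsing side, under the assumption that per-node counts appear
as N<node>=<pages> tokens.

```c
#include <assert.h>
#include <stdio.h>

#define MAX_NODES 64 /* arbitrary bound chosen for this sketch */

/*
 * Add every "N<node>=<pages>" field found in one numa_maps line to
 * pages[]. Returns the number of such fields found on the line.
 */
static int add_numa_maps_line(const char *line, unsigned long pages[MAX_NODES])
{
    const char *p = line;
    int found = 0;

    assert(line != NULL && pages != NULL);
    while (*p) {
        unsigned int node;
        unsigned long count;

        /* Skip whitespace between fields. */
        while (*p == ' ' || *p == '\t' || *p == '\n')
            p++;
        if (!*p)
            break;
        /* Only fields shaped like N0=7 match this pattern. */
        if (sscanf(p, "N%u=%lu", &node, &count) == 2 && node < MAX_NODES) {
            pages[node] += count;
            found++;
        }
        /* Advance to the end of the current field. */
        while (*p && *p != ' ' && *p != '\t' && *p != '\n')
            p++;
    }
    return found;
}
```

A daemon could feed every line of every task's numa_maps through this to
get the per-node, per-task usage that the kernel doesn't track, and then
pick a victim on the exhausted node by whatever policy it likes.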


Thanks,
-Kame
