Re: [patch 2/2] cpusets: add interleave_over_allowed option

From: David Rientjes
Date: Mon Oct 29 2007 - 00:48:25 EST


On Sun, 28 Oct 2007, Paul Jackson wrote:

> And, unless someone in the know tells us otherwise, I have to assume
> that this could break them. Now, the odds are that they simply don't
> run that solution stack on any system making active use of cpusets,
> so the odds are this would be no problem for them. But I don't
> presently have enough knowledge of their situation to take that risk.
>

If we can't identify any applications that would be broken by this, why
not simply implement Choice B and then, if we hear complaints, add your
hack to revert to Choice A behavior based on the get_mempolicy() call
that you said is always part of libnuma?

The problem that I see with immediately offering both choices is that we
don't know if anybody is actually reverting back to Choice A behavior
because libnuma, by default, would use it. That's going to make it very
painful to remove later because we've supported both options and have made
libnuma and {get,set}_mempolicy() arguments ambiguous. We should only
support both choices if they will both be used and there's no hard
evidence to suggest that at this point.

> But dual support is pretty easy so far as the kernel code is concerned.
> It's just a few nodes_remap() calls optionally invoked at a few key
> spots in mm/mempolicy.c. Consequently there won't be a big hurry to
> remove Choice A.
>

You earlier insisted on ease of documentation for the MPOL_INTERLEAVE
case and now this dual support that you're proposing is going to make the
documentation very difficult to understand for anyone who simply wants to
use mempolicies.

Others even in this thread have had a hard enough time understanding the
difference between the two choices, even though you explained them very
thoroughly. It's going to be much more trouble than it's worth, I predict.

> There is no "_then_ attach the task to a cpuset." On systems with
> kernels configured with CONFIG_CPUSETS=y, all tasks are in a cpuset
> all the time. Moreover, from a practical point of view, on large
> systems managed with cpuset based mechanisms, almost all tasks are in
> cpusets that do not include all nodes, for the entire life of the task.
>

And that application would need to know which nodes it has access to
before it calls set_mempolicy(MPOL_PREFERRED) anyway, if it truly relies
on Choice A behavior. So unless these tasks either parse Mems_allowed
from /proc/pid/status and pick one of those nodes as the preferred node,
or are guaranteed to always be attached to a cpuset with a known set of
nodes so that they know in advance which node to prefer, Choice A can't
possibly be what they want.
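
For illustration, this is roughly what such a task would have to do
before picking a preferred node. It is only a sketch: parse_mems_allowed()
and first_allowed_node() are hypothetical helper names, and it assumes the
Mems_allowed value is printed as comma-separated 32-bit hex words, most
significant word first, and that 64 nodes are enough.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse one "Mems_allowed:" line (as read from /proc/<pid>/status) into
 * a 64-bit mask where bit n set means node n is allowed.
 * Assumption: the value is comma-separated 32-bit hex words, most
 * significant first; only the low 64 bits are kept.
 * Returns 0 on success, -1 if the line is something else or malformed.
 */
int parse_mems_allowed(const char *line, uint64_t *mask)
{
	static const char key[] = "Mems_allowed:";
	const char *p;
	uint64_t m = 0;

	assert(line != NULL && mask != NULL);
	if (strncmp(line, key, sizeof(key) - 1) != 0)
		return -1;

	p = line + sizeof(key) - 1;
	while (*p == ' ' || *p == '\t')
		p++;
	if (*p == '\0' || *p == '\n')
		return -1;

	while (*p != '\0' && *p != '\n') {
		char *end;
		unsigned long word = strtoul(p, &end, 16);

		if (end == p)
			return -1;	/* not a hex word */
		m = (m << 32) | (word & 0xffffffffUL);
		p = end;
		if (*p == ',')
			p++;
	}
	*mask = m;
	return 0;
}

/* Lowest-numbered allowed node, or -1 if the mask is empty. */
int first_allowed_node(uint64_t mask)
{
	for (int n = 0; n < 64; n++)
		if (mask & (UINT64_C(1) << n))
			return n;
	return -1;
}
```

A task would read its own status file, feed the matching line to
parse_mems_allowed(), and only then know which node number it can
sensibly pass as its preferred node.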

> > Yet the 'mems' file would still be system-wide; otherwise it would be
> > impossible to expand the memory your cpuset has access to.
>
> I had to read that a couple of times to make sense of it. I take that
> it means that the node numbering used in each cpuset's 'mems' file has
> to be system-wide. Yes, agreed.
>
> (Well, actually, the node numbering of each cpusets 'mems' file could
> be relative to its parent cpusets 'mem' numbers, but let's not go
> there, as this discussion is already sufficiently complicated ;)
>

I appreciate that very much.

> Would it meet the need that prompted your initial patch set if we
> added Choice B memory policy node numbering, but left Choice A as the
> kernel default, with a per-task option (perhaps invokable by a new
> option to one of the {get,set}_mempolicy() calls) to choose Choice B?
>

The need I was addressing with my initial patchset was that when a
cpuset is expanded, any MPOL_INTERLEAVE memory policy of attached tasks
should automatically be expanded as well. This discussion has somewhat diverged
from that, but I hope you still support what we earlier talked about in
terms of adding a field to struct mempolicy to remember the intended
nodemask the application asked to interleave over.

> This lets us get Choice B out there, and lets the two main libraries,
> libnuma and libcpuset, dynamically adapt to whichever Choice is active
> for the current task.
>
> Unchanged applications and existing binaries would simply continue with
> Choice A. With one additional line of code, a user application could
> get Choice B, with its ability for example to request MPOL_INTERLEAVE
> over all cpuset allowed nodes, where the kernel automatically adapts
> that to changing cpuset changes from larger 'mems' to smaller 'mems'
> and back to larger 'mems' again.
>

You don't actually need to choose between the two choices for adapting
MPOL_INTERLEAVE over _all_ allowed cpuset nodes.

I thought what we agreed upon and what you were going to implement was
adding a nodemask_t to struct mempolicy for the intended nodemask of the
memory policy and then AND it with pol->cpuset_mems_allowed. That
completely satisfies my needs and those of my applications that want to allocate
over all available nodes (by simply passing numa_all_nodes to
set_mempolicy(MPOL_INTERLEAVE)). If I wanted to interleave only over a
subset, the choices would matter.

David