Re: [RFC PATCH 0/4]: affinity-on-next-touch
From: Lee Schermerhorn
Date: Mon Jun 22 2009 - 09:50:21 EST
On Sat, 2009-06-20 at 09:24 +0200, Brice Goglin wrote:
> Lee Schermerhorn wrote:
> > My patches don't have per process enablement. Rather, I chose to use
> > per cpuset enablement. I view cpusets as sort of "numa control groups"
> > and thought this was an appropriate level at which to control this sort
> > of behavior--analogous to memory_spread_{page|slab}. That probably
> > needs to be discussed more widely, tho'.
> >
>
> Could you explain why you actually want to enable/disable
> migrate-on-fault on a cpuset (or process) basis? Why would an
> administrator want to disable it? Aren't the existing cpuset memory
> restriction abilities enough?
>
> Brice
>
Hello, Brice:
There are a couple of aspects to this question, I think:
1) Why enable/disable at all? Why not always enabled?
When I try out some new behavior such as migrate-on-fault, I start with
the assumption [right or wrong] that not all users will want this
behavior. For migrate-on-fault, one probably won't run into it all that
often unless the MPOL_MF_LAZY flag is used to forcibly unmap regions.
However, with swap read-ahead, one could end up with anon pages in the
swap cache with no pte references, and could experience unexpected
migrations. I've learned that some folks really don't like
surprises :). Now, when you consider the "automigration" feature
["auto" here means "self" more than "automatic"], I think it's more
important to be able to enable/disable it. I've not seen any
performance degradation when using it, but I feared that for some
workloads, thrashing could cause such degradation. Page migration isn't
free.
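To make the migrate-on-fault side concrete, here is a rough sketch of how
an application would request lazy migration with mbind(), assuming the
MPOL_MF_LAZY flag proposed in this patch set. The flag isn't in
<numaif.h>, so the value below is copied from the patches as an
assumption and could change in later revisions:

/*
 * Sketch: mark a buffer for lazy migration ("affinity on next touch")
 * with mbind().  MPOL_MF_LAZY is not in <numaif.h>; the value below is
 * an assumption taken from the RFC patches.
 * Build with:  gcc -o lazy lazy.c -lnuma
 */
#include <numaif.h>             /* mbind(), MPOL_* */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#ifndef MPOL_MF_LAZY
#define MPOL_MF_LAZY    (1 << 3)   /* assumed value from the patch set */
#endif

int main(void)
{
        size_t len = 64UL << 20;        /* 64 MB scratch buffer */
        void *buf = NULL;

        if (posix_memalign(&buf, 4096, len))
                return 1;

        /* fault the pages in on the allocating task's node */
        memset(buf, 0, len);

        /*
         * Keep the default (local allocation) policy, but ask the kernel
         * to unmap the range now so each page migrates toward the node
         * of the cpu that next touches it, instead of migrating eagerly.
         */
        if (mbind(buf, len, MPOL_DEFAULT, NULL, 0,
                  MPOL_MF_MOVE | MPOL_MF_LAZY) != 0) {
                perror("mbind(MPOL_MF_LAZY)");
                return 1;
        }

        /* ... hand the buffer to the worker threads that will touch it ... */
        free(buf);
        return 0;
}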
Also, because Linux runs on such a wide range of platforms, I don't want
to burden smaller, embedded systems with the additional code, so I also
try to make the feature source configurable. I know we worry about the
proliferation of config options, but it's easier to remove one after the
fact, I think, than to retrofit it.
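To illustrate the source-configurable part, the usual kernel idiom is to
stub the hook out when the option is off, something like the sketch
below. CONFIG_MIGRATE_ON_FAULT and check_migrate_misplaced_page() are
placeholder names here, not necessarily the names used in the patches:

#ifdef CONFIG_MIGRATE_ON_FAULT
extern struct page *check_migrate_misplaced_page(struct page *page,
                                                 struct vm_area_struct *vma,
                                                 unsigned long addr);
#else
static inline struct page *check_migrate_misplaced_page(struct page *page,
                        struct vm_area_struct *vma, unsigned long addr)
{
        /* feature configured out: no extra text or data on small systems */
        return page;
}
#endif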
2) Why a per-cpuset control?
I consider cpusets to be "numa control groups". They constrain
resources at numa node [and related cpu] granularity, and control
numa-related behavior, such as migration when changing cpusets,
spreading page cache and slab pages over nodes in the cpuset, ... In
fact, I think it would have been appropriate to call the cpuset control
group the "numa control group" when cgroups were introduced, but it's
too late for that now.
Finally, and not a reason to include the controls in the mainline, it's
REALLY useful during development. One can boot a test kernel, and only
enable the feature in a test cpuset, limiting the damage of, e.g., a
reference counting bug or such. It's also useful for measuring the
overhead of the patches absent any actual page migrations. However, if
this feature ever makes it to mainline, the community will have its say
on whether these controls should be included and how.
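For concreteness, the test workflow described above boils down to
something like the sketch below. It assumes the legacy cpuset
filesystem mounted at /dev/cpuset and a per-cpuset control file named
"migrate_on_fault"; the mount point, file name, and cpu/node numbers
are all placeholders, not names confirmed by the patches:

/*
 * Sketch: create a test cpuset, enable the hypothetical migrate-on-fault
 * control only there, and confine the current task to it before running
 * the workload under test.
 */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int write_str(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f) {
                perror(path);
                return -1;
        }
        fputs(val, f);
        if (fclose(f)) {
                perror(path);
                return -1;
        }
        return 0;
}

int main(void)
{
        char pid[32];

        /* child cpuset restricted to node 0 and a few of its cpus */
        mkdir("/dev/cpuset/mof_test", 0755);
        write_str("/dev/cpuset/mof_test/cpus", "0-3");
        write_str("/dev/cpuset/mof_test/mems", "0");

        /* hypothetical per-cpuset knob added by the patches */
        write_str("/dev/cpuset/mof_test/migrate_on_fault", "1");

        /* move this task (and its future children) into the test cpuset */
        snprintf(pid, sizeof(pid), "%d", getpid());
        write_str("/dev/cpuset/mof_test/tasks", pid);

        /* ... exec the workload under test here ... */
        return 0;
}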
Hope this helps,
Lee