Re: Cache Allocation Technology Design

From: Peter Zijlstra
Date: Thu Oct 30 2014 - 17:44:15 EST

In reply to: Tejun Heo: "Re: Cache Allocation Technology Design"
Next in thread: Tejun Heo: "Re: Cache Allocation Technology Design"

On Thu, Oct 30, 2014 at 01:03:31PM -0400, Tejun Heo wrote:
> Hey, Peter.
>
> On Thu, Oct 30, 2014 at 02:18:45PM +0100, Peter Zijlstra wrote:
> > On Thu, Oct 30, 2014 at 08:43:33AM -0400, Tejun Heo wrote:
> > > And that something shouldn't be disallowing task migration across
> > > cgroups. This simply doesn't work with co-mounting or unified
> > > hierarchy. cpuset automatically takes on the nearest ancestor's
> > > configuration which has enough execution resources. Maybe that can be
> > > an option for this too?
> >
> > It will give very random and nondeterministic behaviour and basically
> > destroy the entire purpose of the controller (which are the very same
> > reasons I detest that 'new' behaviour in cpusets).
>
> I agree with you that this is a corner case behavior which deviates
> from the usual behavior; however, the deviation is inherent. This
> stems from the fact that the kernel in general doesn't allow tasks
> which cannot be run. You say that you detest the new behaviors of
> cpuset; however, the old behaviors were just as sucky - bouncing tasks
> to an ancestor cgroup forcefully and without any indication or way to
> restore the previous configuration. What's different with the new
> behavior is that it explicitly distinguishes between the configured
> and effective configurations as the kernel isn't capable of actually
> enforcing a certain subset of configurations.

If a cpu bounces (by accident or whatever) then there is no trace left
behind that the system didn't in fact observe/obey its constraints. It
should have provided an error or failed the hotplug. But we digress,
let's not have this discussion (again :) and focus on the new thing.

> So, the inherent problem is always there no matter what we do and the
> question is that of a policy to deal with it. One of the main issues
> I see with failing cgroup-level operations for controller specific
> reasons is lack of visibility. All you can get out of a failed
> operation is a single error return and there's no good way to
> communicate why something isn't working, well not even who's the
> culprit. Having "effective" vs "configured" makes it explicit that
> the kernel isn't capable of honoring all configurations and makes the
> details of the situation visible.

Right, so that is a shortcoming of the co-mount idea. Your effective vs
configured thing is misleading and surprising though. Operations might
'succeed' and still have failed, without any clear
indication/notification of change.
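
A minimal sketch of this concern, in Python, assuming the cpuset fallback
described earlier in the thread (take the nearest ancestor that still has
usable CPUs). The names Group, configured and effective() are hypothetical
and are not kernel interfaces:

```python
# Illustrative model only: Group, configured and effective() are hypothetical
# names, not kernel interfaces.
class Group:
    def __init__(self, name, configured, parent=None):
        self.name = name
        self.configured = set(configured)  # what the admin asked for
        self.parent = parent

    def effective(self, online):
        """Return the CPUs of the nearest ancestor that has any online."""
        node = self
        while node is not None:
            usable = node.configured & online
            if usable:
                return usable
            node = node.parent
        return set(online)  # assumed last resort: everything that is online


root = Group("root", {0, 1, 2, 3})
child = Group("child", {2, 3}, parent=root)

before = child.effective(online={0, 1, 2, 3})
after = child.effective(online={0, 1})  # CPUs 2 and 3 went offline
```

Here `child.configured` is unchanged after the hotplug while `after` holds
the parent's CPUs: nothing failed and nothing was reported, yet the tasks no
longer run where they were told to.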

> Another part is inconsistencies across controllers. This sure is
> worse when there are multiple controllers involved but inconsistent
> behaviors across different hierarchies are annoying all the same with
> single controller multiple hierarchies. Userland often manages some
> of those hierarchies together and it can get horribly confusing. No
> matter what, we need to settle on a single policy and having effective
> configuration seems like the better one.

I'm not entirely sure I follow. Without co-mounting it's entirely obvious
which one is failing.

Also, per the previous point, since you need a notification channel
anyway, you might as well do the expected fail and report more details
through that.

> > > One of the problems is that we generally assume that a task can run at
> > > some point in time in a lot of places in the kernel and can't just not
> > > run a task indefinitely because it's in a cgroup configured certain
> > > way.
> >
> > Refusing tasks into a previously empty cgroup creates no such problems.
> > It's already in a cgroup (wherever its parent was) and it can run there,
> > failing to move it to another does not affect things.
>
> Yeah, sure, hard failing can work too. It didn't work well for cpuset
> because a runnable configuration may become not so if the system
> config changes afterwards but this probably doesn't have an issue like
> that. I'm not saying something like the above won't work. It would, but I
> don't think that's the right place to fail.

Right, this thing doesn't suffer that particular problem; if it's
good it stays good.

> This controller might not even require the distinction between
> configured and effective tho? Can't a new child just inherit the
> parent's configuration and never allow the config to become completely
> empty?

It can do that. But that still has a problem, there is a mapping in
hardware which restricts the number of active configurations. The total
configuration space is larger than the supported active configurations.

So _something_ must fail. The initial proposal was mkdir failing when
there were more than the hardware supported active config cgroup
directories. The alternative was on-demand activation where we only
allocate the hardware resource when the first task gets moved into the
group -- which then clearly can fail.
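
The on-demand alternative can be sketched as follows, in Python. This is an
illustration under assumptions, not the proposed implementation: HW_SLOTS,
SlotAllocator, CacheGroup and attach() are hypothetical names, the slot count
of 2 is made up, and ENOSPC is just one plausible error to return.

```python
import errno

# Illustrative model only: every name below is hypothetical, and the real
# hardware limit and kernel interface are not given in this thread.
HW_SLOTS = 2  # assumed number of active configurations the hardware supports


class SlotAllocator:
    def __init__(self, nslots):
        self.free = list(range(nslots))

    def alloc(self):
        if not self.free:
            # ENOSPC is an assumption; the thread does not name an error code.
            raise OSError(errno.ENOSPC, "no free hardware configuration slot")
        return self.free.pop(0)


class CacheGroup:
    def __init__(self, name, allocator):
        # "mkdir" always succeeds here: no hardware resource is taken yet.
        self.name = name
        self.allocator = allocator
        self.slot = None
        self.tasks = []

    def attach(self, pid):
        # On-demand activation: the first task triggers slot allocation,
        # so this is the operation that can fail.
        if self.slot is None:
            self.slot = self.allocator.alloc()
        self.tasks.append(pid)


alloc = SlotAllocator(HW_SLOTS)
groups = [CacheGroup(f"g{i}", alloc) for i in range(3)]  # 3 dirs, 2 slots
groups[0].attach(100)
groups[1].attach(101)
try:
    groups[2].attach(102)
    third_attach_error = None
except OSError as e:
    third_attach_error = e.errno
```

Creating the third directory succeeds; moving the first task into it is what
fails, and the task simply stays in the group it was already in. The
mkdir-fails variant would instead call alloc() in the constructor.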

> > Traditionally the cgroups were task based, but many controllers are
> > process based (simply because what they control is process wide, not per
> > task), and there was talk (2-3 years ago or so) about making the entire
> > cgroup thing per process, which obviously fails for all scheduler
> > related cgroups.
>
> Yeah, it needs to be a separate interface where a given userland task
> can access its own knobs in a race-free way (cgroup interface can't
> even do that) whether that's a pseudo filesystem, say,
> /proc/self/BLAHBLAH or new syscalls. This one is necessary regardless
> of what happens with cgroup. cgroup simply isn't a suitable mechanism
> to expose these types of knobs to individual userland threads.

I'm not sure what you're saying there. You want to replace the
task-controllers with another pseudo filesystem that does it differently
but still is a hierarchical controller? How is that different from just
not co-mounting the task and process based controllers? Either way you
end up with 2 separate hierarchies.

> > > Yeah, RT is one of the main items which is problematic, more so
> > > because it's currently coupled with the normal sched controller and
> > > the default config doesn't have any RT slice.
> >
> > Simply because you cannot give a slice on creation; or if you did that
> > would mean failing mkdir when a new cgroup would exceed the available
> > time.
> >
> > Also any !0 slice is wrong because it will not match the requirements of
> > the proposed workload, the administrator will have to set it to match
> > the workload.
> >
> > Therefore 0.
>
> As long as RT is separate from normal sched controller, this *could*
> be fine. The main problem now is that userland which wants to use the
> cpu controller but doesn't want to fully manage RT slices ends up
> disabling RT slices.

I don't get this. Who but the admin manages things, and how would you
accidentally have an RT app and not know about it? And if you're in that
situation you're screwed anyhow, since you've no f'ing clue how to
configure your system for it. At which point you're in deep.

> It might work if a new child can share the
> parent's slice till explicitly configured.

Principle of least surprise. That's surprising behaviour. Why move it in
the first place?

> Another problem is when
> you wanna change the configuration after the hierarchy is already
> populated.

We fail the configuration change. For RR/FIFO we won't allow you to set
the slice to 0 if there's tasks. For deadline we would fail everything
that tries to lower things below the utilization required by the tasks
(and child groups).
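
These rules can be sketched in Python as below. It is a simplified model
under assumptions: RTGroup, DeadlineGroup and their methods are hypothetical
names rather than the cpu controller's interface, the exception types are
arbitrary, utilization is taken as runtime / period, and child groups are
ignored for brevity.

```python
# Illustrative model only: all names and numbers here are hypothetical.
class RTGroup:
    def __init__(self, runtime_us=0):
        self.runtime_us = runtime_us  # RR/FIFO slice; 0 on creation
        self.rt_tasks = []

    def attach_rt(self, pid):
        # An RT task is not allowed into a group without a slice.
        if self.runtime_us == 0:
            raise PermissionError("group has no RT slice")
        self.rt_tasks.append(pid)

    def set_runtime(self, runtime_us):
        # Refuse to drop the slice to 0 while RT tasks are present.
        if runtime_us == 0 and self.rt_tasks:
            raise ValueError("RT tasks present, slice cannot be 0")
        self.runtime_us = runtime_us


class DeadlineGroup:
    def __init__(self, bandwidth):
        self.bandwidth = bandwidth  # fraction of a CPU granted to the group
        self.task_util = {}         # pid -> runtime / period

    def used(self):
        return sum(self.task_util.values())

    def attach(self, pid, runtime, period):
        util = runtime / period
        if self.used() + util > self.bandwidth:
            raise ValueError("task would exceed the group's bandwidth")
        self.task_util[pid] = util

    def set_bandwidth(self, bandwidth):
        # Refuse to go below what the current tasks already require.
        if bandwidth < self.used():
            raise ValueError("below utilization required by current tasks")
        self.bandwidth = bandwidth


g = RTGroup()
g.set_runtime(950_000)
g.attach_rt(200)
try:
    g.set_runtime(0)
    rt_result = "accepted"
except ValueError:
    rt_result = "rejected"

d = DeadlineGroup(bandwidth=0.5)
d.attach(300, runtime=1, period=4)  # utilization 0.25
try:
    d.set_bandwidth(0.125)
    dl_result = "accepted"
except ValueError:
    dl_result = "rejected"
```

In both cases the write is rejected and the old value stays in place, so the
failure is visible at the point where the admin made the change.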

> I don't know. I'd even be happy with cgroup not having
> anything to do with RT slice distribution. Do you have any ideas
> which can make RT slice distribution more palatable? If we can't
> decouple the two, we'd be effectively requiring whoever is managing
> the cpu controller to also become a full-fledged RT slice arbitrator,
> which might actually work too.

The admin you mean? He had better know what the heck he's doing if he's
running RT apps, great fail is otherwise fairly deterministic in his
future.

The thing is, you cannot arbitrate this stuff, RR/FIFO are horrible pieces
of shit interfaces, they don't describe nearly enough. People need to be
involved.

> > > Do we completely block RT task w/o slice? Is that okay?
> >
> > We will not allow an RT task in, the write to the tasks file will fail.
> >
> > The same will be true for deadline tasks, we'll fail entry into a cgroup
> > when the combined requirements of the tasks exceed the provisions of the
> > group.
> >
> > There is just no way around that and still provide sane semantics.
>
> Can't a task just lose RT / deadline properties when migrating into a
> different RT / deadline domain? We already modify task properties on
> migration for cpuset after all. It'd be far simpler that way.

Again, why move it in the first place? This all sounds like whoever is
doing this is clueless. You don't move RT tasks about if you're not
intimately aware of them and their requirements.