Re: [PATCH v4 09/14] cpuset: Don't use the cpu_possible_mask as a last resort for cgroup v1

From: Qais Yousef
Date: Mon Nov 30 2020 - 12:06:24 EST


On 11/27/20 13:32, Qais Yousef wrote:
> On 11/24/20 15:50, Will Deacon wrote:
> > If the scheduler cannot find an allowed CPU for a task,
> > cpuset_cpus_allowed_fallback() will widen the affinity to cpu_possible_mask
> > if cgroup v1 is in use.
> >
> > In preparation for allowing architectures to provide their own fallback
> > mask, just return early if we're not using cgroup v2 and allow
> > select_fallback_rq() to figure out the mask by itself.
>
> What about cpuset_attach()? When a task attaches to a new group its affinity
> could be reset to possible_cpu_mask if it moves to top_cpuset or the
> intersection of effective_cpus & cpu_online_mask.
>
> Probably handled with later patches.
>
> /me continue to look at the rest
>
> Okay so in patch 11 we make set_cpus_allowed_ptr() fail. cpuset_attach will
> just do WARN_ON_ONCE() and carry on.
>
> I think we can fix that by making sure cpuset_can_attach() will fail. Which can
> be done by fixing task_can_attach() to take into account the new arch
> task_cpu_possible_mask()?

I ran some experiments and indeed we have some problems here :(

I create 3 cpusets: 64bit, 32bit and mix. As the name indicates, 64bit contains
all 64bit-only cpus, 32bit contains 32bit-capable ones and mix has a mixture of
both.

If I try to move my test binary to 64bit cpuset, it moves there and I see the
WARN_ON_ONCE() triggered. The task has attached to the new cpuset but
set_allowed_cpus_ptr() has failed and we end up with whatever affinity we had
previously. Breaking cpusets effectively.

This can be easily fixable with my suggestion to educate task_can_attach() to
take into account the new arch_task_cpu_cpu_possible_mask(). The attachment
will fail and userspace will get an error then.

But this introduces another issue..

If I use cpumask_subset() in task_can_attach(), then an attempt to move to
'mix' cpuset will fail. In Android foreground cpuset usually contains a mixture
of cpus. So we want this to work.

Also 'background' tasks are usually bound to littles. If the littles are 64bit
only, then user space must ensure to add a 32bit capable cpu to the list. If we
can't attach to mixed cpuset, then this workaround won't work and Android will
have to create different cpusets for 32bit and 64bit apps. Which would be
difficult since 64bit apps do embed 32bit binaries and it'd be hard for
userspace to deal with that. Unless we extend execve syscall to take a cgroup
as an arg to attach to (like clone3 syscall), then potentially we can do
something about it, I think, by 'fixing' the libc wrapper.

If I use cpumask_intersects() to validate the attachment, then the 64bit
attachment will fail but the 'mix' one will succeed, as desired. BUT we'll
re-introduce the WARN_ON_ONCE() again since __set_cpus_allowed_ptr_locked()
validates against arch_task_cpu_possible_mask() with cpumask_subset() :-/
ie: cpuset_attach()::set_cpus_allowed_ptr() will fail with just a warning
printed once when attaching to 'mix'.

We can fix this by making __set_cpus_allowed_ptr_locked() perform cpumask_and()
like restrict_cpus_allowed_ptr() does.

Though this will not only truncate cpuset mask with the intersection of
arch_task_cpu_possible_mask(), but also truncate it for sched_setaffinity()
syscall. Something to keep in mind.

Finally there's the ugly problem that I'm not sure how to address of modifying
the cpuset.cpus of, for example 'mix' cpuset, such that we end up with 64bit
only cpus. If the 32bit task has already attached, then we'll end up with
a broken cpuset with a task attached having 'invalid' cpumask.

We can teach cpuset.c:update_cpumask()::validate_change() to do something about
it. By either walking the attached tasks and validating the new mask against
the arch one. Or by storing the arch mask in struct cpuset. Though for the
latter if we want to be truly generic, we need to ensure we handle the case of
2 tasks having different arch_task_cpu_possible_mask().

Long email. Hopefully it makes sense!

I can share my test script, binary/code and fixup patches if you like.

Thanks

--
Qais Yousef