Re: [RFC] Design proposal for upstream core-scheduling interface

From: Dhaval Giani
Date: Mon Aug 24 2020 - 16:32:00 EST


On Mon, Aug 24, 2020 at 4:32 AM Vineeth Pillai <vineethrp@xxxxxxxxx> wrote:
>
> > Let me know your thoughts and looking forward to a good LPC MC discussion!
> >
>
> Nice write up Joel, thanks for taking time to compile this with great detail!
>
> After going through the details of interface proposal using cgroup v2
> controllers,
> and based on our discussion offline, would like to note down this idea
> about a new
> pseudo filesystem interface for core scheduling. We could include
> this also for the
> API discussion during core scheduler MC.
>
> coreschedfs: pseudo filesystem interface for Core Scheduling
> ----------------------------------------------------------------------------------
>
> The basic requirement of core scheduling is simple - we need to group a set
> of tasks into a trust group that can share a core. So we don’t really
> need a nested
> hierarchy for the trust groups. Cgroups v2 follow a unified nested
> hierarchy model
> that causes a considerable confusion if the trusted tasks are in
> different levels of the
> hierarchy and we need to allow them to share the core. Cgroup v2's
> single hierarchy
> model makes it difficult to regroup tasks in different levels of
> nesting for core scheduling.
> As noted in this mail, we could use multi-file approach and other
> interfaces like prctl to
> overcome this limitation.
>
> The idea proposed here to overcome the above limitation is to come up with a new
> pseudo filesystem - “coreschedfs”. This filesystem is basically a flat
> filesystem with
> maximum nesting level of 1. That means, root directory can have
> sub-directories for
> sub-groups, but those sub-directories cannot have more sub-directories
> representing
> trust groups. Root directory is to represent the system wide trust
> group and sub-directories
> represent trusted groups. Each directory including the root directory
> has the following set
> of files/directories:
>
> - cookie_id: User exposed id for a cookie. This can be compared to a
> file descriptor.
> This could be used in programmatic API to join/leave a group
>
> - properties: This is an interface to specify how child tasks of this
> group should behave.
> Can be used for specifying future flag requirements as well.
> Current list of properties include:
> NEW_COOKIE_FOR_CHILD: All fork() for tasks in this group
> will result in
> creation of a new trust group
> SAME_COOKIE_FOR_CHILD: All fork() for tasks in this
> group will end up in
> this same group
> ROOT_COOKIE_FOR_CHILD: All fork() for tasks in this
> group goes to the root group
>
> - tasks: Lists the tasks in this group. Main interface for adding
> removing tasks in a group
>
> - <pid>: A directory per task who is am member of this trust group.
> - <pid>/properties: This file is same as the parent properties file
> but this is to override
> the group setting.
>
> This pseudo filesystem can be mounted any where in the root
> filesystem, I propose the default
> to be in “/sys/kernel/coresched”
>
> When coresched is enabled, kernel internally creates the framework for
> this filesystem.
> The filesystem gets mounted to the default location and admin can
> change this if needed.
> All tasks by default are in the root group. The admin or programs can
> then create trusted
> groups on top of this filesystem.
>
> Hooks will be placed in fork() and exit() to make sure that the
> filesystem’s view of tasks is
> up-to-date with the system. Also, APIs manipulating core scheduling
> trusted groups should
> also make sure that the filesystem's view is updated.
>
> Note: The above idea is very similar to cgroups v1. Since there is no
> unified hierarchy
> in cgroup v1, most of the features of coreschedfs could be implemented
> as a cgroup v1
> controller. As no new v1 controllers are allowed, I feel the best
> alternative to have
> a simple API is to come up with a new filesystem - coreschedfs.
>
> The advantages of this approach is:
>
> - Detached from cgroup unified hierarchy and hence the very simple requirement
> of core scheduling can be easily materialized.
> - Admin can have fine-grained control of groups using shell and scripting
> - Can have programmatic access to this using existing APIs like mkdir,rmdir,
> write, read. Or can come up with new APIs using the cookie_id which can wrap
> t he above linux apis or use a new systemcall for core scheduling.
> - Fine grained permission control using linux filesystem permissions and ACLs
>
> Disadvantages are
> - yet another psuedo filesystem.
> - very similar to cgroup v1 and might be re-implementing features
> that are already
> provided by cgroups v1.
>
> Use Cases
> -----------------
>
> Usecase 1: Google cloud
> ---------------------------------
>
> Since we no longer depend on cgroup v2 hierarchies, there will not be
> any issue of
> nesting and sharing. The main daemon can create trusted groups in the
> fileystem and
> provide required permissions for the group. Then the init processes
> for each job can
> be added to respective groups for them to create children tasks as
> needed. Multiple
> jobs under the same customer which needs to share the core can be
> housed in one group.
>
>
> Usecase 2: Chrome browser
> ------------------------
>
> We start with one group for the first task and then set properties to
> NEW_COOKIE_FOR_CHILD.
>
> Usecase 3: chrome VMs
> ---------------------
>
> Similar to chrome browser, the VM task can make each vcpu on its own group.
>
> Usecase 4: Oracle use case
> --------------------------
> This is also similar to use case 1 with this interface. All tasks that need to
> be in the root group can be easily added by the admin.
>
> Use case 5: General virtualization
> ----------------------------------
>
> The requirement is each VM should be isolated. This can be easily done
> by creating a
> new group per VM
>
>
> Please have a look at the above proposal and let us know your
> thoughts. We shall include
> this also during the interface discussion at core scheduling MC.
>

I am inclined to say no to this. Yet another FS interface :-(. We are
just reinventing the wheel here. Let's try to stick within cgroupfs
first and see if we can make it work there.

Dhaval