Re: [RFC PATCH 3/3 v0.2] sched/umcg: RFC: implement UMCG syscalls
From: Peter Oskolkov
Date: Mon Jul 12 2021 - 11:40:52 EST
On Sun, Jul 11, 2021 at 11:29 AM Thierry Delisle <tdelisle@xxxxxxxxxxxx> wrote:
>
> > Let's move the discussion to the new thread.
>
> I'm happy to start a new thread. I'm re-responding to my last post
> because many of my questions are still unanswered.
>
> > + * State transitions:
> > + *
> > + * RUNNING => IDLE: the current RUNNING task becomes IDLE by calling
> > + * sys_umcg_wait();
> >
> > [...]
> >
> > +/**
> > + * enum umcg_wait_flag - flags to pass to sys_umcg_wait
> > + * @UMCG_WAIT_WAKE_ONLY: wake @self->next_tid, don't put @self to sleep;
> > + * @UMCG_WF_CURRENT_CPU: wake @self->next_tid on the current CPU
> > + * (use WF_CURRENT_CPU); @UMCG_WAIT_WAKE_ONLY must be set.
> > + */
> > +enum umcg_wait_flag {
> > + UMCG_WAIT_WAKE_ONLY = 1,
> > + UMCG_WF_CURRENT_CPU = 2,
> > +};
>
> What is the purpose of using sys_umcg_wait without next_tid or with
> UMCG_WAIT_WAKE_ONLY? It looks like Java's park/unpark semantics to me,
> that is, worker threads can use this for synchronization and mutual
> exclusion. In this case, how do these compare to using
> FUTEX_WAIT/FUTEX_WAKE?
sys_umcg_wait without next_tid puts the task into UMCG_IDLE state; a
later wake brings it back to RUNNING. These are standard sched
operations. If they were emulated via futexes, fast context switching
would require something like FUTEX_SWAP, which was NACKed last year.
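
A minimal park/unpark sketch on top of these two operations is below.
The struct fields and flags are the ones from this patch; the raw
syscall wrapper, the __NR_umcg_wait number, and the exact memory
ordering are illustrative assumptions, not part of the patch:

#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Illustrative raw wrapper; assumes a (flags, abs_timeout) signature. */
static long sys_umcg_wait(uint32_t flags, uint64_t abs_timeout)
{
	return syscall(__NR_umcg_wait, flags, abs_timeout);
}

/* Park: mark self IDLE, then sleep (no next_tid, so nothing is woken). */
static void park(struct umcg_task *self)
{
	__atomic_store_n(&self->state, UMCG_TASK_IDLE, __ATOMIC_SEQ_CST);
	sys_umcg_wait(0, 0);
}

/* Unpark: wake the parked task without putting self to sleep. */
static void unpark(struct umcg_task *self, uint32_t parked_tid)
{
	__atomic_store_n(&self->next_tid, parked_tid, __ATOMIC_SEQ_CST);
	sys_umcg_wait(UMCG_WAIT_WAKE_ONLY, 0);
}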
>
>
> > +struct umcg_task {
> > [...]
> > + /**
> > + * @server_tid: the TID of the server UMCG task that should be
> > + * woken when this WORKER becomes BLOCKED. Can be zero.
> > + *
> > + * If this is a UMCG server, @server_tid should
> > + * contain the TID of @self - it will be used to find
> > + * the task_struct to wake when pulled from
> > + * @idle_servers.
> > + *
> > + * Read-only for the kernel, read/write for the userspace.
> > + */
> > + uint32_t server_tid; /* r */
> > [...]
> > + /**
> > + * @idle_servers_ptr: a single-linked list pointing to the list
> > + * of idle servers. Can be NULL.
> > + *
> > + * Readable/writable by both the kernel and the userspace: the
> > + * userspace adds items to the list, the kernel removes them.
> > + *
> > + * TODO: describe how the list works.
> > + */
> > + uint64_t idle_servers_ptr; /* r/w */
> > [...]
> > +} __attribute__((packed, aligned(8 * sizeof(__u64))));
>
> From the comments and by elimination, I'm guessing that idle_servers_ptr is
> somehow used by servers to block until some worker threads become idle.
> However, I do not understand how the userspace is expected to use it. I
> also do not understand whether these link fields form a stack or a
> queue, and where the head is.
When a server has nothing to do (no work to run), it is put into IDLE
state and added to the list. The kernel wakes an IDLE server if a
blocked worker unblocks.
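
The list protocol itself is still a TODO in the patch text, but the
userspace side could plausibly be a simple Treiber-style push, with the
kernel popping entries when it needs a server to wake. A sketch under
that assumption only (the shared head and the helper are illustrative,
not part of this patch):

/* Push self onto the idle-servers stack and sleep; illustrative only. */
static void server_go_idle(struct umcg_task *self, uint64_t *idle_head)
{
	uint64_t old = __atomic_load_n(idle_head, __ATOMIC_RELAXED);

	do {
		/* Link behind the current head before publishing self. */
		self->idle_servers_ptr = old;
	} while (!__atomic_compare_exchange_n(idle_head, &old,
				(uint64_t)(uintptr_t)self, false,
				__ATOMIC_RELEASE, __ATOMIC_RELAXED));

	/* Sleep until the kernel pops this server when a worker unblocks. */
	sys_umcg_wait(0, 0);
}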
>
>
> > +/**
> > + * sys_umcg_ctl: (un)register a task as a UMCG task.
> > + * @flags: ORed values from enum umcg_ctl_flag; see below;
> > + * @self: a pointer to struct umcg_task that describes this
> > + * task and governs the behavior of sys_umcg_wait if
> > + * registering; must be NULL if unregistering.
> > + *
> > + * @flags & UMCG_CTL_REGISTER: register a UMCG task:
> > + * UMCG workers:
> > + * - self->state must be UMCG_TASK_IDLE
> > + * - @flags & UMCG_CTL_WORKER
> > + *
> > + * If the conditions above are met, sys_umcg_ctl() immediately returns
> > + * if the registered task is a RUNNING server or basic task; an IDLE
> > + * worker will be added to idle_workers_ptr, and the worker put to
> > + * sleep; an idle server from idle_servers_ptr will be woken, if any.
>
> This approach to creating UMCG workers concerns me a little. My
> understanding is that, in general, the number of servers controls the
> amount of parallelism in the program. But in the case of creating new
> UMCG workers, the new threads only respect the M:N threading model
> after sys_umcg_ctl has blocked. What does this mean for applications
> that create thousands of short-lived tasks? Are users expected to
> create pools of reusable UMCG workers?
Yes: task/thread creation is not as lightweight as just posting work
items onto a preexisting pool of workers.
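
In other words, the expected pattern for that workload is a pool sized
up front, with the short-lived tasks queued as work items rather than
created as threads. A rough sketch, where everything except
sys_umcg_ctl() and its flags is a userspace construct invented for the
example:

struct work_item {
	void (*fn)(void *arg);
	void *arg;
	struct work_item *next;
};

/* Each pool thread registers as a UMCG worker once, then loops. */
static void *pool_worker(void *arg)
{
	struct worker_ctx *ctx = arg;	/* hypothetical per-thread state */

	sys_umcg_ctl(UMCG_CTL_REGISTER | UMCG_CTL_WORKER, &ctx->umcg_task);
	for (;;) {
		/* Hypothetical helper: dequeue an item, or park the
		 * worker via sys_umcg_wait() until one is posted. */
		struct work_item *w = pop_or_park(ctx);

		w->fn(w->arg);
	}
	return NULL;
}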
>
>
> I would suggest adding at least one uint64_t field to struct umcg_task
> that is left as-is by the kernel. This allows implementers of
> user-space schedulers to attach scheduler-specific data structures to
> the threads without needing some kind of table on the side.
This is usually achieved by embedding the kernel struct into a larger
userspace/TLS struct. For example:

struct umcg_task_user {
	struct umcg_task	umcg_task;
	extra_user_data		d1;
	extra_user_ptr		p1;
	/* etc. */
} __aligned(...);
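
The wrapper is then recovered from the kernel-visible pointer with the
usual container_of pattern (userspace spelling shown):

#include <stddef.h>

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static struct umcg_task_user *to_user_task(struct umcg_task *ut)
{
	return container_of(ut, struct umcg_task_user, umcg_task);
}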