[RESEND RFC PATCH 0/3] Provide fast access to thread specific data

From: Prakash Sangappa
Date: Wed Sep 08 2021 - 20:16:22 EST


Including linux-kernel..

Resending RFC. This patchset is not final. I am looking for feedback on
this proposal to share thread-specific data for use in latency-sensitive
code paths.

(patchset based on v5.14-rc7)

Cover letter previously sent:
----------------------------

Some applications, such as databases, need to read thread-specific stats
frequently from the kernel in latency-sensitive code paths. The overhead
of reading these stats through system calls affects performance.
One use case is reading a thread's scheduler stats from the /proc
schedstat file (/proc/pid/schedstat) to collect the time spent by the
thread executing on the cpu (sum_exec_runtime) and the time blocked
waiting on the runq (run_delay). These scheduler stats, read several
times per transaction in latency-sensitive code paths, are used to
measure the time taken by DB operations.
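
For reference, the existing /proc-based approach can be sketched as
follows. This is only an illustration and not code from the patchset: the
helper names parse_schedstat() and read_thread_schedstat() are
hypothetical, and the use of /proc/thread-self (rather than a
/proc/pid/task/tid path) is an assumption about how the application
locates its own thread. Every sample costs an open, a read and a close.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* The three fields of a schedstat line, in file order. */
struct schedstat_sample {
    uint64_t sum_exec_runtime; /* ns spent executing on a cpu */
    uint64_t run_delay;        /* ns spent waiting on a runq */
    uint64_t pcount;           /* number of timeslices run */
};

/* Parse one schedstat line. Returns 0 on success, -1 if malformed. */
int parse_schedstat(const char *line, struct schedstat_sample *out)
{
    int n = sscanf(line, "%" SCNu64 " %" SCNu64 " %" SCNu64,
                   &out->sum_exec_runtime, &out->run_delay, &out->pcount);
    return n == 3 ? 0 : -1;
}

/*
 * Read the calling thread's stats (hypothetical helper).
 * Assumes /proc is mounted and scheduler info accounting is enabled.
 * Each call performs several system calls, which is the overhead the
 * proposal wants to avoid.
 */
int read_thread_schedstat(struct schedstat_sample *out)
{
    char buf[128];
    FILE *f = fopen("/proc/thread-self/schedstat", "r");
    int ret = -1;

    if (!f)
        return -1;
    if (fgets(buf, sizeof(buf), f))
        ret = parse_schedstat(buf, out);
    fclose(f);
    return ret;
}
```

A caller would invoke read_thread_schedstat() before and after an
operation and subtract the two samples.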

This patchset proposes a mechanism for the kernel to share thread stats
through a per-thread structure shared between user space and the kernel.
The per-thread shared structure is allocated on a page that is mapped
shared between user space and the kernel, which provides a fast
communication channel between the two. The kernel publishes stats in this
shared structure, and the application thread reads them in user space
without requiring system calls.

Similarly, there can be other use cases for such a shared structure
mechanism.

Introduce 'off cpu' time:

The time spent executing on a cpu (sum_exec_runtime) by a thread,
currently available through the thread's schedstat file, can be shared
through the shared structure mentioned above. However, while a thread is
running on the cpu, this time is only updated periodically as part of
scheduler tick processing, which can take up to 1ms or more. If the
application has to measure cpu time consumed across some DB operations,
using 'sum_exec_runtime' will not be accurate. To address this, the
proposal is to introduce a thread's 'off cpu' time, which is measured at
context switch, similar to how time on the runq (i.e. run_delay in the
schedstat file) is measured, and should be more accurate. With that, the
application can determine cpu time consumed by taking the elapsed time
and subtracting the off cpu time. The off cpu time will be made available
through the shared structure along with the other schedstats from the
/proc/pid/schedstat file.

The elapsed time itself can be measured using clock_gettime, which is
vdso optimized and would be fast. The schedstats (runq time & off cpu
time) published in the shared structure will be accumulated times, the
same as what is available through the schedstat file, all in units of
nanoseconds. The application would take the difference of the values
from before and after the operation for measurement.
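
The before/after arithmetic described above might look like the sketch
below. The names struct cpu_sample, monotonic_ns(), take_sample() and
cpu_time_ns() are hypothetical and not part of the patchset. The sketch
assumes the off cpu counter is reachable through a pointer into the
shared structure, and that CLOCK_MONOTONIC is an acceptable clock for the
elapsed time.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Hypothetical snapshot: wall-clock time and accumulated off cpu time (ns). */
struct cpu_sample {
    uint64_t wall_ns;
    uint64_t off_cpu_ns;
};

/* Current CLOCK_MONOTONIC time in nanoseconds (served by the vdso on Linux). */
uint64_t monotonic_ns(void)
{
    struct timespec t;

    clock_gettime(CLOCK_MONOTONIC, &t);
    return (uint64_t)t.tv_sec * 1000000000ull + (uint64_t)t.tv_nsec;
}

/* Snapshot the clock and the kernel-published counter; no system call needed. */
struct cpu_sample take_sample(const volatile uint64_t *off_cpu)
{
    struct cpu_sample s;

    s.wall_ns = monotonic_ns();
    s.off_cpu_ns = *off_cpu;
    return s;
}

/*
 * cpu time = elapsed time - off cpu time, both taken as deltas.
 * The two values in a sample are not read atomically, so the result is
 * clamped at zero instead of being allowed to wrap around.
 */
uint64_t cpu_time_ns(struct cpu_sample before, struct cpu_sample after)
{
    uint64_t elapsed = after.wall_ns - before.wall_ns;
    uint64_t off = after.off_cpu_ns - before.off_cpu_ns;

    return elapsed > off ? elapsed - off : 0;
}
```

An application would call take_sample() on either side of a DB operation
and pass both samples to cpu_time_ns().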

Preliminary results from a simple cached-read database workload show a
performance benefit when the database uses the shared struct for reading
stats vs reading from /proc directly.

Implementation:

A new system call is added to request use of the shared structure by a
user thread. The kernel will allocate page(s), mapped shared with user
space, in which the per-thread shared structures will be allocated. These
structures are padded to 128 bytes. Each will contain struct members or
nested structures corresponding to supported stats, like the thread's
schedstats, published by the kernel for user space consumption. More
struct members can be added as new feature support is implemented.
Multiple such shared structures will be allocated from a page (up to 32
per 4k page), which avoids having to allocate one page per thread of a
process, although this will need optimizing for locality. Additional
pages will be allocated as needed to accommodate more threads requesting
use of shared structures. The aim is to not expose the layout of the
shared structure itself to the application, which will allow future
enhancements/changes without affecting the existing APIs.
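
To illustrate the packing arithmetic only (128-byte slots, 32 per 4k
page), here is a rough sketch. It is not the layout used by the patches:
union task_shared_slot, struct task_schedstat_example, slot_offset() and
the TS_* constants are hypothetical names, and a 4k page size is assumed.

```c
#include <stddef.h>
#include <stdint.h>

#define TS_PAGE_SIZE 4096u /* assumes 4k pages */
#define TS_SLOT_SIZE 128u  /* per-thread structure padded to 128 bytes */

/* Stand-in for the stats payload nested in each slot. */
struct task_schedstat_example {
    volatile uint64_t sum_exec_runtime;
    volatile uint64_t run_delay;
    volatile uint64_t pcount;
    volatile uint64_t off_cpu;
};

/* Hypothetical slot: the union pads the payload out to the slot size. */
union task_shared_slot {
    struct task_schedstat_example ts;
    unsigned char pad[TS_SLOT_SIZE];
};

_Static_assert(sizeof(union task_shared_slot) == TS_SLOT_SIZE,
               "slot must be exactly 128 bytes");

enum { TS_SLOTS_PER_PAGE = TS_PAGE_SIZE / TS_SLOT_SIZE };

/* Byte offset of slot idx within its page, or (size_t)-1 if out of range. */
size_t slot_offset(unsigned int idx)
{
    return idx < TS_SLOTS_PER_PAGE ? (size_t)idx * TS_SLOT_SIZE : (size_t)-1;
}
```

Padding to a fixed slot size leaves room to add members later without
changing how many structures fit in a page.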

The system call will return a pointer (user space mapped address) to the
per-thread shared structure members. The application would save this
per-thread pointer in a TLS variable and reference it.

The system call is of the form:
int task_getshared(int option, int flags, void __user *uaddr)

// Currently only TASK_SCHEDSTAT option is supported - returns pointer
// to struct task_schedstat. The struct task_schedstat is nested within
// the shared structure.

struct task_schedstat {
	volatile u64 sum_exec_runtime;
	volatile u64 run_delay;
	volatile u64 pcount;
	volatile u64 off_cpu;
};

Usage:

__thread struct task_schedstat *ts;
task_getshared(TASK_SCHEDSTAT, 0, &ts);

Subsequently, the thread accesses the stats through the 'ts' pointer.

Prakash Sangappa (3):
Introduce per thread user-kernel shared structure
Publish tasks's scheduler stats thru the shared structure
Introduce task's 'off cpu' time

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
include/linux/mm_types.h | 2 +
include/linux/sched.h | 9 +
include/linux/syscalls.h | 2 +
include/linux/task_shared.h | 92 ++++++++++
include/uapi/asm-generic/unistd.h | 5 +-
include/uapi/linux/task_shared.h | 23 +++
kernel/fork.c | 7 +
kernel/sched/deadline.c | 1 +
kernel/sched/fair.c | 1 +
kernel/sched/rt.c | 1 +
kernel/sched/sched.h | 1 +
kernel/sched/stats.h | 55 ++++--
kernel/sched/stop_task.c | 1 +
kernel/sys_ni.c | 3 +
mm/Makefile | 2 +-
mm/task_shared.c | 314 +++++++++++++++++++++++++++++++++
18 files changed, 501 insertions(+), 20 deletions(-)
create mode 100644 include/linux/task_shared.h
create mode 100644 include/uapi/linux/task_shared.h
create mode 100644 mm/task_shared.c

--
2.7.4