Re: [PATCH v6] io_uring: Statistics of the true utilization of sq threads.

From: Pavel Begunkov
Date: Wed Jan 17 2024 - 21:58:43 EST


On 1/17/24 08:37, Xiaobing Li wrote:
On 1/12/24 2:58 AM, Jens Axboe wrote:
On 1/11/24 6:12 PM, Xiaobing Li wrote:
On 1/10/24 16:15 AM, Jens Axboe wrote:
On 1/10/24 2:05 AM, Xiaobing Li wrote:
On 1/5/24 04:02 AM, Pavel Begunkov wrote:
On 1/3/24 05:49, Xiaobing Li wrote:
On 12/30/23 9:27 AM, Pavel Begunkov wrote:
Why it uses jiffies instead of some task run time?
Consequently, why it's fine to account irq time and other
preemption? (hint, it's not)

Why it can't be done with userspace and/or bpf? Why
can't it be estimated by checking and tracking
IORING_SQ_NEED_WAKEUP in userspace?

What's the use case in particular? Considering that
one of the previous revisions was uapi-less, something
is really fishy here. Again, it's a procfs file nobody
but a few would want to parse to use the feature.

Why it just keeps aggregating stats for the whole
life time of the ring? If the workload changes,
that would either totally screw the stats or would make
it too inert to be useful. That's especially relevant
for long running (days) processes. There should be a
way to reset it so it starts counting anew.

Hi, Jens and Pavel,
I carefully read the questions you raised.
First of all, as to why I use jiffies to statistics time, it
is because I have done some performance tests and found that
using jiffies has a relatively smaller loss of performance
than using task run time. Of course, using task run time is

How does taking a measure for task runtime looks like? I expect it to
be a simple read of a variable inside task_struct, maybe with READ_ONCE,
in which case the overhead shouldn't be realistically measurable. Does
it need locking?

The task runtime I am talking about is similar to this:
start = get_system_time(current);
do_io_part();
sq->total_time += get_system_time(current) - start;

Not sure what get_system_time() is, don't see that anywhere.

Currently, it is not possible to obtain the execution time of a piece of
code by a simple read of a variable inside task_struct.
Or do you have any good ideas?

I must be missing something, because it seems like all you need is to
read task->stime? You could possible even make do with just logging busy
loop time, as getrusage(RUSAGE_THREAD, &stat) from userspace would then
give you the total time.

stat.ru_stime would then be the total time, the thread ran, and
1 - (above_busy_stime / stat.ru_stime) would give you the time the
percentage of time the thread ran and did useful work (eg not busy
looping.

getrusage can indeed get the total time of the thread, but this
introduces an extra function call, which is relatively more
complicated than defining a variable. In fact, recording the total
time of the loop and the time of processing the IO part can achieve
our observation purpose. Recording only two variables will have less
impact on the existing performance, so why not choose a simpler and
effective method.

I'm not opposed to exposing both of them, it does make the API simpler.
If we can call it an API... I think the main point was using task->stime
for it rather than jiffies etc.

Hi, Jens and Pavel
I modified the code according to your opinions.

I got the total time of the sqpoll thread through getrusage.
eg:

fdinfo.c:
+long sq_total_time = 0;
+long sq_work_time = 0;
if (has_lock && (ctx->flags & IORING_SETUP_SQPOLL)) {
struct io_sq_data *sq = ctx->sq_data;

sq_pid = sq->task_pid;
sq_cpu = sq->sq_cpu;
+ struct rusage r;
+ getrusage(sq->thread, RUSAGE_SELF, &r);
+ sq_total_time = r.ru_stime.tv_sec * 1000000 + r.ru_stime.tv_usec;
+ sq_work_time = sq->work_time;
}

That's neat, but fwiw my concerns are mostly about what's
exposed to the user space.

seq_printf(m, "SqThread:\t%d\n", sq_pid);
seq_printf(m, "SqThreadCpu:\t%d\n", sq_cpu);
+seq_printf(m, "SqTotalTime:\t%ldus\n", sq_total_time);
+seq_printf(m, "SqWorkTime:\t%ldus\n", sq_work_time);
seq_printf(m, "UserFiles:\t%u\n", ctx->nr_user_files);

The working time of the sqpoll thread is obtained through ktime_get().
eg:

Just like with jiffies, ktime_get() is wall clock time, but
uncomfortably much more expensive. Why not stime Jens dug up
last time?

sqpoll.c:
+ktime_t start, diff;
+start = ktime_get();
list_for_each_entry(ctx, &sqd->ctx_list, sqd_list) {
int ret = __io_sq_thread(ctx, cap_entries);

if (!sqt_spin && (ret > 0 || !wq_list_empty(&ctx->iopoll_list)))
sqt_spin = true;
}
if (io_run_task_work())
sqt_spin = true;

+diff = ktime_sub(ktime_get(), start);
+if (sqt_spin == true)
+ sqd->work_time += ktime_to_us(diff);

The test results are as follows:
Every 2.0s: cat /proc/9230/fdinfo/6 | grep -E Sq
SqMask: 0x3
SqHead: 3197153
SqTail: 3197153
CachedSqHead: 3197153
SqThread: 9231
SqThreadCpu: 11
SqTotalTime: 92215321us
SqWorkTime: 15106578us

Do you think this solution work?

I'm curious, is the plan to leave it only accessible via
procfs? It's not machine readable well, so should be quite
an annoyance parsing it every time.

--
Pavel Begunkov