Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int

From: Thomas Gleixner
Date: Thu Feb 29 2024 - 14:54:01 EST


On Thu, Feb 29 2024 at 10:49, Jens Axboe wrote:
> On 2/29/24 10:42 AM, Thomas Gleixner wrote:
>> So but I just noticed that there is actually an issue with this:
>>
>>> unsigned int nr_iowait_cpu(int cpu)
>>> {
>>> - return atomic_read(&cpu_rq(cpu)->nr_iowait);
>>> + struct rq *rq = cpu_rq(cpu);
>>> +
>>> + return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
>>
>> The access to rq->nr_iowait is not protected by the runqueue lock and
>> therefore a data race when @cpu is not the current CPU.
>>
>> This needs to be properly annotated and explained why it does not
>> matter.
>
> But that was always racy before as well, if someone else is inc/dec'ing
> ->nr_iowait while it's being read, you could get either the before or
> after value. This doesn't really change that. I could've sworn I
> mentioned that in the commit message, but I did not.

There are actually two issues here:

1) atomic_read() vs. atomic_inc/dec() guarantees that the read value
   is consistent in itself.

   Non-atomic inc/dec does not guarantee that a concurrent read sees
   a consistent value, as the compiler is free to do store/load
   tearing. Unlikely, but not guaranteed to never happen.

   KCSAN will complain about it sooner rather than later and then
   someone has to go and do the analysis and the annotation (see the
   READ_ONCE() sketch after #2). I'd rather let you do the reasoning
   now than chase you down later :)

2) What's worse is that the result can be completely bogus, i.e.:

   CPU0                                CPU1                 CPU2
   a = rq(CPU1)->nr_iowait;        // 0
                                       rq->nr_iowait++;
                                                            rq(CPU1)->nr_iowait_remote++;
   b = rq(CPU1)->nr_iowait_remote; // 1

   r = a - b;                      // -1
   return (unsigned int) r;        // UINT_MAX

   The consumers of this interface might be upset. :)

   While with a single atomic_t it's guaranteed that the result is
   always greater than or equal to zero.
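
If the split counters were kept, the lockless read side would at least
need an annotation along these lines. A sketch only: READ_ONCE() cures
the load tearing from #1, but not the ordering problem from #2:

unsigned int nr_iowait_cpu(int cpu)
{
        struct rq *rq = cpu_rq(cpu);

        /*
         * rq->nr_iowait is a plain int updated under rq->lock, but
         * read locklessly here. READ_ONCE() forbids load tearing and
         * documents the intentional data race for KCSAN.
         */
        return READ_ONCE(rq->nr_iowait) - atomic_read(&rq->nr_iowait_remote);
}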

>> So s/Reviewed-by/Un-Reviewed-by/
>>
>> Though thinking about it some more. Is this split a real benefit over
>> always using the atomic? Do you have numbers to show?
>
> It was more on Peter's complaint that now we're trading a single atomic
> for two, hence I got to thinking about nr_iowait in general. I don't
> have numbers showing it matters, as mentioned in another email the most
> costly part about this seems to be fetching task->in_iowait and not the
> actual atomic.

On the write side (except for the remote case) the cache line is already
dirty on the current CPU and I doubt that the atomic will be
noticeable. If there is concurrent remote access to the runqueue then the
cache line is bouncing no matter what.

On the read side there is always an atomic operation required, so it's
not really different.

I assume Peter's complaint was about the extra nr_iowait_acct part. I
think that's solvable without the extra atomic_t member and with a
single atomic_add()/sub(). atomic_t is 32bit wide, so what about
splitting the thing and adding/subtracting both in one go?

While sketching this I noticed that prepare/finish can be written w/o
any conditionals.

int io_schedule_prepare(void)
{
        /* Pack both flags into one token: the acct flag goes into the
         * upper 16 bits. */
        int flags = current->in_iowait + (current->in_iowait_acct << 16);

        current->in_iowait = 1;
        current->in_iowait_acct = 1;
        blk_flush_plug(current->plug, true);
        return flags;
}

void io_schedule_finish(int flags)
{
        current->in_iowait = flags & 0x01;
        current->in_iowait_acct = flags >> 16;
}

Now __schedule():

        if (prev->in_iowait) {
                int x = 1 + (prev->in_iowait_acct << 16);

                atomic_add(x, &rq->nr_iowait);
                delayacct_blkio_start();
        }

and ttwu_do_activate():

        if (p->in_iowait) {
                int x = 1 + (p->in_iowait_acct << 16);

                delayacct_blkio_end(p);
                atomic_sub(x, &task_rq(p)->nr_iowait);
        }


and try_to_wake_up():

        if (p->in_iowait) {
                int x = 1 + (p->in_iowait_acct << 16);

                delayacct_blkio_end(p);
                atomic_sub(x, &task_rq(p)->nr_iowait);
        }

nr_iowait_acct_cpu() becomes:

        return atomic_read(&cpu_rq(cpu)->nr_iowait) >> 16;

and nr_iowait_cpu():

        return atomic_read(&cpu_rq(cpu)->nr_iowait) & ((1 << 16) - 1);

Obviously written with proper inline wrappers and defines, but you get
the idea.
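
Roughly something like this (names made up, not a definitive
implementation):

#define NR_IOWAIT_ACCT_SHIFT    16
#define NR_IOWAIT_MASK          ((1 << NR_IOWAIT_ACCT_SHIFT) - 1)

static inline unsigned int nr_iowait_cpu(int cpu)
{
        /* Tasks in iowait on @cpu: low 16 bits of the packed counter */
        return atomic_read(&cpu_rq(cpu)->nr_iowait) & NR_IOWAIT_MASK;
}

static inline unsigned int nr_iowait_acct_cpu(int cpu)
{
        /* Tasks in accounted iowait on @cpu: high 16 bits */
        return atomic_read(&cpu_rq(cpu)->nr_iowait) >> NR_IOWAIT_ACCT_SHIFT;
}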

Hmm?

Thanks,

tglx