Re: [PATCH 1/2] sched/core: switch struct rq->nr_iowait to a normal int

From: Jens Axboe
Date: Thu Feb 29 2024 - 17:30:16 EST


On 2/29/24 12:52 PM, Thomas Gleixner wrote:
> On Thu, Feb 29 2024 at 10:49, Jens Axboe wrote:
>> On 2/29/24 10:42 AM, Thomas Gleixner wrote:
>>> So but I just noticed that there is actually an issue with this:
>>>
>>>> unsigned int nr_iowait_cpu(int cpu)
>>>> {
>>>> - return atomic_read(&cpu_rq(cpu)->nr_iowait);
>>>> + struct rq *rq = cpu_rq(cpu);
>>>> +
>>>> + return rq->nr_iowait - atomic_read(&rq->nr_iowait_remote);
>>>
>>> The access to rq->nr_iowait is not protected by the runqueue lock and
>>> therefore a data race when @cpu is not the current CPU.
>>>
>>> This needs to be properly annotated and explained why it does not
>>> matter.
>>
>> But that was always racy before as well, if someone else is inc/dec'ing
>> ->nr_iowait while it's being read, you could get either the before or
>> after value. This doesn't really change that. I could've sworn I
>> mentioned that in the commit message, but I did not.
>
> There are actually two issues here:
>
> 1) atomic_read() vs. atomic_inc/dec() guarantees that the read value
> is consistent in itself.
>
> Non-atomic inc/dec is not guaranteeing that the concurrent read is a
> consistent value as the compiler is free to do store/load
> tearing. Unlikely but not guaranteed to never happen.
>
> KCSAN will complain about it sooner than later and then someone has
> to go and do the analysis and the annotation. I rather let you do
> the reasoning now than chasing you down later :)

Fair enough.

> 2) What's worse is that the result can be completely bogus:
>
> i.e.
>
> CPU0 CPU1 CPU2
> a = rq(CPU1)->nr_iowait; // 0
> rq->nr_iowait++;
> rq(CPU1)->nr_iowait_remote++;
> b = rq(CPU1)->nr_iowait_remote; // 1
>
> r = a - b; // -1
> return (unsigned int) r; // UINT_MAX
>
> The consumers of this interface might be upset. :)
>
> While with a single atomic_t it's guaranteed that the result is
> always greater or equal zero.

Yeah OK, this is a real problem...

>>> So s/Reviewed-by/Un-Reviewed-by/
>>>
>>> Though thinking about it some more. Is this split a real benefit over
>>> always using the atomic? Do you have numbers to show?
>>
>> It was more on Peter's complaint that now we're trading a single atomic
>> for two, hence I got to thinking about nr_iowait in general. I don't
>> have numbers showing it matters, as mentioned in another email the most
>> costly part about this seems to be fetching task->in_iowait and not the
>> actual atomic.
>
> On the write side (except for the remote case) the cache line is already
> dirty on the current CPU and I doubt that the atomic will be
> noticable. If there is concurrent remote access to the runqueue then the
> cache line is bouncing no matter what.

That was my exact thinking too, same cacheline and back-to-back atomics
don't really matter vs a single atomic on it.

> On the read side there is always an atomic operation required, so it's
> not really different.
>
> I assume Peter's complaint was about the extra nr_iowait_acct part. I
> think that's solvable without the extra atomic_t member and with a
> single atomic_add()/sub(). atomic_t is 32bit wide, so what about
> splitting the thing and adding/subtracting both in one go?
>
> While sketching this I noticed that prepare/finish can be written w/o
> any conditionals.
>
> int io_schedule_prepare(void)
> {
> int flags = current->in_iowait + current->in_iowait_acct << 16;
>
> current->in_iowait = 1;
> current->in_iowait_acct = 1;
> blk_flush_plug(current->plug, true);
> return flags;
> }
>
> void io_schedule_finish(int old_wait_flags)
> {
> current->in_iowait = flags & 0x01;
> current->in_iowait_acct = flags >> 16;
> }
>
> Now __schedule():
>
> if (prev->in_iowait) {
> int x = 1 + current->in_iowait_acct << 16;
>
> atomic_add(x, rq->nr_iowait);
> delayacct_blkio_start();
> }
>
> and ttwu_do_activate():
>
> if (p->in_iowait) {
> int x = 1 + current->in_iowait_acct << 16;
>
> delayacct_blkio_end(p);
> atomic_sub(x, task_rq(p)->nr_iowait);
> }
>
>
> and try_to_wake_up():
>
> delayacct_blkio_end(p);
>
> int x = 1 + current->in_iowait_acct << 16;
>
> atomic_add(x, task_rq(p)->nr_iowait);
>
> nr_iowait_acct_cpu() becomes:
>
> return atomic_read(&cpu_rq(cpu)->nr_iowait) >> 16;
>
> and nr_iowait_cpu():
>
> return atomic_read(&cpu_rq(cpu)->nr_iowait) & ((1 << 16) - 1);
>
> Obviously written with proper inline wrappers and defines, but you get
> the idea.

I'll play with this a bit, but do we want to switch to an atomic_long_t
for this? 2^16 in iowait seems extreme, but it definitely seems possible
to overflow it.

--
Jens Axboe