Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
From: Eric W. Biederman
Date: Mon Sep 09 2019 - 08:22:45 EST
Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
> On Thu, Sep 05, 2019 at 03:02:49PM -0500, Eric W. Biederman wrote:
>> Paul, what is the purpose of the barrier in rcu_assign_pointer?
>>
>> My intuition says it is the assignment half of rcu_dereference, and that
>> anything that rcu_dereference does not need is too strong.
>
> I see that Paul has also replied, but let me give another take on it.
I appreciate his reply as well. I don't think he picked up that we
are talking about my change to do rcu_assign_pointer(rq->curr, next)
in schedule.
> RCU does *two* different but related things. It provide object
> consistency and it provides object lifetime management.
>
> Object consistency is provided by rcu_assign_pointer() /
> rcu_dereference:
>
> - rcu_assign_pointer() is a PUBLISH operation, it is meant to expose an
> object to the world. In that respect it needs to ensure that all
> previous stores to this object are complete before it writes the
> pointer and exposes the object.
>
> To this purpose, rcu_assign_pointer() is an smp_store_release().
>
> - rcu_dereference() is a CONSUME operation. It matches the PUBLISH from
> above and guarantees that any further loads/stores to the observed
> object come after.
>
> Naturally this would be an smp_load_acquire(), _HOWEVER_, since we need
> the address of the object in order to construct a load/store from/to
> said object, we can rely on the address-dependency to provide this
> barrier (except for Alpha, which is fscking weird). After all if you
> haven't completed the load of the (object) base address, you cannot
> compute the object member address and issue any subsequent memops, now
> can you ;-)
I follow and agree till here.
I was definitely confused ealier on which half of the rcu_assign_pointer()
rcu_dereference was important on most cpus. As I currently understand
things, in a normal case we need a memory barrier between the assignment
of the data to a structure and the assignment of a pointer to that
structure to ensure there is an ordering between the pointer and
the data.
That memory barrier is usually provided by rcu_assign_pointer.
> Now the thing here is that the 'rq->curr = next' assignment does _NOT_
> expose the task (our object). The task is exposed the moment it enters
> the PID hash. It is this that sets us free of the PUBLISH requirement
> and thus we can use RCU_INIT_POINTER().
So I disagree about the PID hash here. I don't think exposure is
necessarily the right concept to think this through.
> The object lifetime thing is provided by
> rcu_read_load()/rcu_read_unlock() on the one hand and
> synchronize_rcu()/call_rcu() on the other. That ensures that if you
> observe an object inside the RCU read side section, the object storage
> must stay valid.
>
> And it is *that* properpy we wish to make use of.
I absolutely agree that object lifetime is something we want to make use
of.
> Does this make sense to you?
Not completely.
Let me see if I can explain my confusion in terms of task_numa_compare.
The function task_numa_comare now does:
rcu_read_lock();
cur = rcu_dereference(dst_rq->curr);
Then it proceeds to examine a few fields of cur.
cur->flags;
cur->cpus_ptr;
cur->numa_group;
cur->numa_faults[];
cur->total_numa_faults;
cur->se.cfs_rq;
My concern here is the ordering between setting rq->curr and and setting
those other fields. If we read values of those fields that were not
current when we set cur than I think task_numa_compare would give an
incorrect answer.
Now in __schedule the code does:
rq_lock(rq, &rf);
smp_mp__after_spinlock();
...
next = pick_next_task(rq, prev, &rf);
if (prev != next) {
rq->nr_switches++
rq->curr = next; // rcu_assign_pointer(rq->curr, next);
}
Which basically leads me back to Linus's analsysis where he pointed out
that it is the actions while the task is running before it calls
scheudle that in general cause the fields to be set.
Given that we have a full memory barrier for taking the previous task
off the cpu everything for the next task is guaranteed to be ordered
with the assignment to rq->curr except whatever fields pick_next_task()
decides to change.
>From what I can see pick_next_task_fair does not mess with any of
the fields that task_numa_compare uses. So the existing memory barrier
should be more than sufficient for that case.
So I think I just need to write up a patch 4/3 that replaces the my
"rcu_assign_pointer(rq->curr, next)" with "RCU_INIT_POINTER(rq->curr,
next)". And includes a great big comment to that effect.
Now maybe I am wrong. Maybe we can afford to see data that was changed
before the task became rq->curr in task_numa_compare. But I don't think
so. I really think it is that full memory barrier just a bit earlier
in __schedule that is keeping us safe.
Am I wrong?
Eric