Re: [PATCH 2/3] task: RCU protect tasks on the runqueue
From: Eric W. Biederman
Date: Tue Sep 03 2019 - 12:45:19 EST
Peter Zijlstra <peterz@xxxxxxxxxxxxx> writes:
> On Tue, Sep 03, 2019 at 09:41:17AM +0200, Peter Zijlstra wrote:
>> On Mon, Sep 02, 2019 at 11:52:01PM -0500, Eric W. Biederman wrote:
>>
>> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> > index 2b037f195473..802958407369 100644
>> > --- a/kernel/sched/core.c
>> > +++ b/kernel/sched/core.c
>>
>> > @@ -3857,7 +3857,7 @@ static void __sched notrace __schedule(bool preempt)
>> >
>> > if (likely(prev != next)) {
>> > rq->nr_switches++;
>> > - rq->curr = next;
>> > + rcu_assign_pointer(rq->curr, next);
>> > /*
>> > * The membarrier system call requires each architecture
>> > * to have a full memory barrier after updating
>>
>> This one is sad; it puts a (potentially) expensive barrier in here. And
>> I'm not sure I can explain the need for it. That is, we've not changed
>> @next before this and don't need to 'publish' it as such.
>>
>> Can we use RCU_INIT_POINTER() or simply WRITE_ONCE(), here?
>
> That is, I'm thinking we qualify for point 3 (both a and b) of
> RCU_INIT_POINTER().
I don't think point (b) is a concern on any widely visible architecture.

After taking a quick skim through the users, it does appear to me that
we almost definitely have changes to the task_struct since the last time
another cpu saw that structure (3 a), and that we have cases where
reading stale values in the task_struct will result in incorrect
operation of the code.

The concern of point (b) is the old alpha caching case where you could
dereference a pointer and get a stale copy of the data structure. This
is a concern when you are following the pointer from another cpu.

From my quick skim, the cases I can see where point (b) might apply are
as follows. In fair.c:task_numa_compare lots of fields in task_struct
are read; it looks like reading a stale (old/wrong) value of
cur->numa_group could give very inexplicable and weird results.
Similarly, in the membarrier code, reading a pre-exec version of cur->mm
could give completely inexplicable results. Finally, in rcuwait_wake_up,
reading a stale version of the process's cur->state could cause
incorrect or missed wake ups in wake_up_process.

There might already be enough barriers in the scheduler that the barrier
in rcu_assign_pointer is redundant. The comment about membarrier at
least suggests that for processes that return to userspace we have a
full memory barrier.

So with a big fat comment explaining why it is safe we could potentially
use RCU_INIT_POINTER. I currently don't see where the appropriate
barriers are, so I cannot write that comment or, with a clear
conscience, write the code to use RCU_INIT_POINTER instead of
rcu_assign_pointer.

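
As a rough userspace illustration of the distinction being discussed,
here is a sketch using C11 atomics. It is an analogy only, not kernel
code: struct task, curr, and the three function names are hypothetical
stand-ins, the release store plays the role of rcu_assign_pointer, the
relaxed store plays the role of RCU_INIT_POINTER or WRITE_ONCE, and the
acquire load is a conservative stand-in for rcu_dereference.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; one field is enough here. */
struct task {
	int state;
};

/* Hypothetical stand-in for rq->curr. */
static _Atomic(struct task *) curr;

/*
 * Analogue of rcu_assign_pointer(): the release store orders all
 * earlier writes to *next before the pointer becomes visible.
 */
static void publish_release(struct task *next)
{
	atomic_store_explicit(&curr, next, memory_order_release);
}

/*
 * Analogue of RCU_INIT_POINTER() / WRITE_ONCE(): the store itself
 * provides no ordering, so it is only safe if something else already
 * orders the earlier writes to *next against readers.
 */
static void publish_relaxed(struct task *next)
{
	atomic_store_explicit(&curr, next, memory_order_relaxed);
}

/*
 * Analogue of a reader on another cpu following the pointer.
 * Returns -1 if nothing has been published yet.
 */
static int read_curr_state(void)
{
	struct task *t = atomic_load_explicit(&curr, memory_order_acquire);

	return t ? t->state : -1;
}
```

In a single thread both publish variants behave identically; the
difference only matters when another cpu reads fields of the task
through curr, which is exactly the case the "big fat comment" would
have to argue is covered by existing barriers.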
Eric