Let's introduce task (Y) which has mm == src_task->mm and task (N) which has
mm != src_task->mm for the rest of the discussion.
Let's enumerate the state transitions that matter here.
There are two scheduler state transitions on context switch we care about:
(TSA) Store to rq->curr with transition from (N) to (Y)
(TSB) Store to rq->curr with transition from (Y) to (N)
On the migrate-from side, there is one transition we care about:
(TMA) cmpxchg to *pcpu_cid to set the LAZY flag
There is also a transition to UNSET state which can be performed from all
sides (scheduler, migrate-from). It is performed with a cmpxchg everywhere
which guarantees that only a single thread will succeed:
(TMB) cmpxchg to *pcpu_cid to mark UNSET
Just to be clear (at the risk of repeating myself), what we do _not_ want
to happen is a transition to UNSET when a thread is actively using the cid
(property (1)). And ideally we do not want to leak a cid after migrating the
last task from a cpu (property (2)).
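
To make the (TMA)/(TMB) transitions concrete, here is a minimal userspace sketch using C11 atomics rather than the kernel's cmpxchg(). The encoding (CID_UNSET, CID_LAZY) and the helper names are assumptions made up for illustration, not the names used by the patch. It only shows why a cmpxchg that expects the LAZY value lets a single caller win the transition to UNSET.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encoding, chosen for this sketch only. */
#define CID_UNSET	UINT32_MAX
#define CID_LAZY	(1u << 31)

/* (TMA): set the LAZY flag on a cid that is currently plain. */
static bool cid_set_lazy(_Atomic uint32_t *pcpu_cid, uint32_t cid)
{
	uint32_t expected = cid;

	return atomic_compare_exchange_strong(pcpu_cid, &expected, cid | CID_LAZY);
}

/* (TMB): move to UNSET, but only if the old value still has LAZY set. */
static bool cid_unset_if_lazy(_Atomic uint32_t *pcpu_cid, uint32_t cid)
{
	uint32_t expected = cid | CID_LAZY;

	return atomic_compare_exchange_strong(pcpu_cid, &expected, CID_UNSET);
}

/* Two callers attempt (TMB) after one (TMA); returns how many succeed. */
static int count_unset_winners(uint32_t cid)
{
	_Atomic uint32_t pcpu_cid;

	assert(cid < CID_LAZY);		/* cid must not collide with the flag bit */
	atomic_init(&pcpu_cid, cid);
	if (!cid_set_lazy(&pcpu_cid, cid))
		return -1;
	return cid_unset_if_lazy(&pcpu_cid, cid) + cid_unset_if_lazy(&pcpu_cid, cid);
}
```

Calling count_unset_winners(3) models the scheduler and migrate-from both trying to UNSET the same lazy cid: the second cmpxchg finds UNSET instead of the LAZY value and fails.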
Let's look at the relevant combinations of TSA/TSB and TMA transitions.
Scenario A) (TSA)+(TMA) (from next task perspective)

CPU0                                        CPU1

Context switch CS-1                         Migrate-from
  - store to rq->curr: (N)->(Y) (TSA)       - cmpxchg to *pcpu_cid to LAZY (TMA)
    *** missing barrier ?? ***                (implied barrier after cmpxchg)
  - prepare_task_switch()
    - switch_mm_cid()
      - mm_cid_get() (next)
        - READ_ONCE(*pcpu_cid)              - rcu_dereference(src_rq->curr)

This Dekker ensures that either task (Y) is observed by the rcu_dereference() or the LAZY
flag is observed by READ_ONCE(), or both are observed.
If task (Y) store is observed by rcu_dereference(), it means that there is still
an active task on the cpu. Migrate-from will therefore not transition to UNSET, which
fulfills property (1). That observed task will itself eventually need a migrate-from
to be migrated away from that cpu, which fulfills property (2).
If task (Y) is not observed, but the LAZY flag is observed by READ_ONCE(), the thread on
CPU0 moves the state to UNSET, which clears the percpu cid perhaps uselessly (which is not
an issue for correctness). Because task (Y) is not observed, CPU1 can also move ahead and
try to set the state to UNSET. Because moving the state to UNSET is done with a cmpxchg
expecting that the old state has the LAZY flag set, only one thread will successfully UNSET.
If both states (LAZY flag and task (Y)) are observed, the thread on CPU0 will observe
the LAZY flag and transition to UNSET (perhaps uselessly), and CPU1 will observe task
(Y) and do nothing more, which is fine.
What we are effectively preventing with this Dekker is a scenario where neither the LAZY
flag nor the store of (Y) is observed, which would violate property (1) because this would
UNSET a cid which is actively used.
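
The three allowed outcomes and the forbidden one can be checked with a small exhaustive model. This is only a sketch under a strong simplification: each CPU performs one store and one load, all interleavings are enumerated, and the missing barrier is modelled as CPU0's load being allowed to run before its store becomes visible. The function and constant names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Outcome bits: bit 0 = CPU0 saw the LAZY flag, bit 1 = CPU1 saw task (Y). */
enum { SAW_LAZY = 1, SAW_Y = 2 };

/*
 * Run one interleaving. order[i] names the CPU that executes its next
 * operation at step i. If cpu0_load_first is true, CPU0 loads *pcpu_cid
 * before its store to rq->curr is visible (no barrier in between).
 */
static int run_interleaving(const int order[4], bool cpu0_load_first)
{
	bool curr_is_y = false;	/* models rq->curr == task (Y) */
	bool lazy = false;	/* models the LAZY flag in *pcpu_cid */
	int step[2] = { 0, 0 };
	int outcome = 0;

	for (int i = 0; i < 4; i++) {
		int cpu = order[i];
		int s;

		assert(cpu == 0 || cpu == 1);
		s = step[cpu]++;
		if (cpu == 0) {
			bool is_store = cpu0_load_first ? (s == 1) : (s == 0);

			if (is_store)
				curr_is_y = true;	/* (TSA) */
			else if (lazy)
				outcome |= SAW_LAZY;	/* READ_ONCE(*pcpu_cid) */
		} else {
			if (s == 0)
				lazy = true;		/* (TMA), ordered before the load */
			else if (curr_is_y)
				outcome |= SAW_Y;	/* rcu_dereference(src_rq->curr) */
		}
	}
	return outcome;
}

/* Returns a bitmask where bit k is set if outcome k is reachable. */
static unsigned int reachable_outcomes(bool cpu0_load_first)
{
	unsigned int seen = 0;

	/* Every 4-bit mask with exactly two bits set is one interleaving. */
	for (unsigned int m = 0; m < 16; m++) {
		int order[4], ones = 0;

		for (int i = 0; i < 4; i++) {
			order[i] = (m >> i) & 1;
			ones += order[i];
		}
		if (ones == 2)
			seen |= 1u << run_interleaving(order, cpu0_load_first);
	}
	return seen;
}
```

With the store ordered before the load on both CPUs, reachable_outcomes(false) never has bit 0 set, meaning "neither observed" cannot happen. With reachable_outcomes(true), bit 0 becomes reachable, which is the case that would UNSET a cid in active use.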
Scenario B) (TSB)+(TMA) (from prev task perspective)

CPU0                                        CPU1

Context switch CS-1                         Migrate-from
  - store to rq->curr: (Y)->(N) (TSB)       - cmpxchg to *pcpu_cid to LAZY (TMA)
    *** missing barrier ?? ***                (implied barrier after cmpxchg)
  - prepare_task_switch()
    - switch_mm_cid()
      - cid_put_lazy() (prev)
        - READ_ONCE(*pcpu_cid)              - rcu_dereference(src_rq->curr)

This Dekker ensures that either task (N) is observed by the rcu_dereference() or the LAZY
flag is observed by READ_ONCE(), or both are observed.
If rcu_dereference() observes (N) but LAZY is not observed, migrate-from takes care of
advancing the state to UNSET, thus fulfilling property (2). Because (Y) is not running
anymore, property (1) is fulfilled.
If rcu_dereference does not observe (N), but LAZY is observed, migrate-from does not
advance to UNSET because it observes (Y), but LAZY flag will make the task on CPU0
take care of advancing the state to UNSET, thus fulfilling property (2).
If both (N) and LAZY are observed, both migrate-from and CPU0 will try to advance the
state to UNSET, but only one cmpxchg will succeed.
What we are effectively preventing with this Dekker is a scenario where neither the LAZY
flag nor the store of (N) is observed, which would violate property (2) because it would
leak a cid on a cpu that has no task left using the mm.
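
As a rough userspace analogue of Scenario B, the sketch below shows where the barrier would sit on each side, using C11 atomics in place of the kernel primitives. Everything here is an assumption for illustration: struct cpu_state, the CID_* encoding and the function names are invented, and the explicit seq_cst fence after the cmpxchg stands in for the full barrier that the kernel's cmpxchg() implies. It is not the actual scheduler code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CID_UNSET	UINT32_MAX	/* hypothetical encoding */
#define CID_LAZY	(1u << 31)

enum { TASK_N, TASK_Y };

struct cpu_state {
	_Atomic int curr;		/* stands in for rq->curr */
	_Atomic uint32_t pcpu_cid;	/* stands in for *pcpu_cid */
};

static bool cid_is_lazy(uint32_t cid)
{
	return cid != CID_UNSET && (cid & CID_LAZY);
}

/* CPU0: context switch (Y)->(N). Returns true if it performed the UNSET. */
static bool context_switch_to_n(struct cpu_state *c)
{
	uint32_t cid;

	atomic_store_explicit(&c->curr, TASK_N, memory_order_relaxed);	/* (TSB) */
	atomic_thread_fence(memory_order_seq_cst);	/* the barrier under discussion */
	cid = atomic_load_explicit(&c->pcpu_cid, memory_order_relaxed);
	if (!cid_is_lazy(cid))
		return false;
	return atomic_compare_exchange_strong(&c->pcpu_cid, &cid, CID_UNSET);
}

/* CPU1: migrate-from. Returns true if it performed the UNSET. */
static bool migrate_from(struct cpu_state *c)
{
	uint32_t cid = atomic_load_explicit(&c->pcpu_cid, memory_order_relaxed);
	uint32_t lazy;

	if (cid == CID_UNSET || cid_is_lazy(cid))
		return false;
	lazy = cid | CID_LAZY;
	if (!atomic_compare_exchange_strong(&c->pcpu_cid, &cid, lazy))	/* (TMA) */
		return false;
	atomic_thread_fence(memory_order_seq_cst);	/* models the implied barrier */
	if (atomic_load_explicit(&c->curr, memory_order_relaxed) == TASK_Y)
		return false;	/* still in use: CPU0 will see LAZY on its next switch */
	return atomic_compare_exchange_strong(&c->pcpu_cid, &lazy, CID_UNSET);
}

/* Run both sides back to back in the given order; return the UNSET count. */
static int scenario_b_unset_count(bool migrate_first, uint32_t *final_cid)
{
	struct cpu_state c;
	int unsets = 0;

	atomic_init(&c.curr, TASK_Y);
	atomic_init(&c.pcpu_cid, 5u);	/* arbitrary cid value */
	if (migrate_first) {
		unsets += migrate_from(&c);
		unsets += context_switch_to_n(&c);
	} else {
		unsets += context_switch_to_n(&c);
		unsets += migrate_from(&c);
	}
	*final_cid = atomic_load(&c.pcpu_cid);
	assert(unsets <= 1);
	return unsets;
}
```

Run sequentially in either order, exactly one side ends up performing the UNSET, which matches the two single-observer cases above. The helper does not exercise true concurrency; it only illustrates the barrier placement and the hand-off between the two sides.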
So based on my analysis, we are indeed missing a barrier between store to rq->curr and
load of the per-mm/cpu cid within context_switch().
Am I missing something? How can we provide this barrier with minimal overhead?