Re: [PATCH v9 30/32] timers: Implement the hierarchical pull model

From: Sebastian Siewior
Date: Thu Dec 07 2023 - 13:09:39 EST


On 2023-12-01 10:26:52 [+0100], Anna-Maria Behnsen wrote:
> diff --git a/kernel/time/timer_migration.c b/kernel/time/timer_migration.c
> new file mode 100644
> index 000000000000..05cd8f1bc45d
> --- /dev/null
> +++ b/kernel/time/timer_migration.c
> @@ -0,0 +1,1636 @@

> + * Protection of the tmigr group state information:
> + * ------------------------------------------------
> + *
> + * The state information with the list of active children and migrator needs to
> + * be protected by a sequence counter. It prevents a race when updates in a

s/a$//

> + * child groups are propagated in changed order. The following scenario
> + * describes what happens without updating the sequence counter:
> + *
> + * Therefore, let's take three groups and four CPUs (CPU2 and CPU3 as well
> + * as GRP0:1 will not change during the scenario):
> + *
> + * LVL 1 [GRP1:0]
> + * migrator = GRP0:1
> + * active = GRP0:0, GRP0:1
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * migrator = CPU0 migrator = CPU2
> + * active = CPU0 active = CPU2
> + * / \ / \
> + * CPUs 0 1 2 3
> + * active idle active idle
> + *
> + *
> + * 1. CPU0 goes idle (changes are updated in GRP0:0; afterwards the current
> + * states of GRP0:0 and GRP1:0 are stored in the data for walking the
> + * hierarchy):

CPU0 goes idle. The state update is performed lock less and group
wise. In the first step only GRP0:0 has been updated. The update of
GRP1:0 is pending, the CPU walks through the hierarchy.

> + *
> + * LVL 1 [GRP1:0]
> + * migrator = GRP0:1
> + * active = GRP0:0, GRP0:1
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * --> migrator = TMIGR_NONE migrator = CPU2
> + * --> active = active = CPU2
> + * / \ / \
> + * CPUs 0 1 2 3
> + * --> idle idle active idle

> + * 2. CPU1 comes out of idle (changes are update in GRP0:0; afterwards the
> + * current states of GRP0:0 and GRP1:0 are stored in the data for walking the
> + * hierarchy):

While CPU0 goes idle and continues to update the state, CPU1 comes
out of idle. CPU1 updates GRP0:0. The update for GRP1:0 is pending,
tge CPU walks through the hierarchy. Both CPUs now walk the hierarchy
to perform the needed update from their point of view.
The currently visible state:

> + *
> + * LVL 1 [GRP1:0]
> + * migrator = GRP0:1
> + * active = GRP0:0, GRP0:1
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * --> migrator = CPU1 migrator = CPU2
> + * --> active = CPU1 active = CPU2
> + * / \ / \
> + * CPUs 0 1 2 3
> + * idle --> active active idle
> + *
> + * 3. Here comes the change of the order: Propagating the changes of step 2
> + * through the hierarchy to GRP1:0 - nothing to be done, because GRP0:0
> + * is already up to date.

Here is the race condition: CPU1 managed to propagate its changes
through the hierarchy to GRP1:0 before CPU0 did. The active members
of GRP1:0 remain unchanged after the update since it is still valid
from CPU1 current point of view:

LVL 1 [GRP1:0]
--> migrator = GRP0:1
--> active = GRP0:0, GRP0:1
/ \
LVL 0 [GRP0:0] [GRP0:1]
migrator = CPU1 migrator = CPU2
active = CPU1 active = CPU2
/ \ / \
CPUs 0 1 2 3
idle active active idle

[ I take it as the migrator remains set to GRP0:1 by CPU1 but it could
be changed to GRP0:0. I assume that both fields (migrator+active) are
changed there via the propagation and the arrow in both fields denotes
this. ]

> + * 4. Propagating the changes of step 1 through the hierarchy to GRP1:0

Now CPU0 finally propagates its changes to GRP1:0.

> + *
> + * LVL 1 [GRP1:0]
> + * --> migrator = GRP0:1
> + * --> active = GRP0:1
> + * / \
> + * LVL 0 [GRP0:0] [GRP0:1]
> + * migrator = CPU1 migrator = CPU2
> + * active = CPU1 active = CPU2
> + * / \ / \
> + * CPUs 0 1 2 3
> + * idle active active idle
> + *
> + * Now there is a inconsistent overall state because GRP0:0 is active, but
> + * it is marked as idle in the GRP1:0. This is prevented by incrementing
> + * sequence counter whenever changing the state.

The race of CPU0 vs CPU1 led to an inconsistent state in GRP1:0.
CPU1 is active and is correctly listed as active in GRP0:0. However
GRP1:0 does not have GRP0:0 listed as active which is wrong.
The sequence counter has been added to avoid inconsistent states
during updates. The state is updated atomically only if all members,
including the sequence counter, match the expected value
(compare-and-exchange).
Looking back at the previous example with the addition of the
sequence number: The update as performed by CPU0 in step 4 will fail.
CPU1 changed the sequence number during the update in step 3 so the
expected old value (as seen by CPU0 before starting the walk) does
not match.


Sebastian