Re: [PATCH v10 18/20] timers: Implement the hierarchical pull model

From: Frederic Weisbecker
Date: Tue Jan 30 2024 - 10:38:01 EST


On Mon, Jan 15, 2024 at 03:37:41PM +0100, Anna-Maria Behnsen wrote:
> +/*
> + * Returns true if there is nothing to be propagated to the next level
> + *
> + * @data->firstexp is set to the expiry of the first global event of the (top
> + * level of the) hierarchy, but only when the hierarchy is completely idle.
> + *
> + * This is the only place where the group event expiry value is set.
> + */
> +static
> +bool tmigr_update_events(struct tmigr_group *group, struct tmigr_group *child,
> + struct tmigr_walk *data, union tmigr_state childstate,
> + union tmigr_state groupstate)
> +{
> + struct tmigr_event *evt, *first_childevt;
> + bool walk_done, remote = data->remote;
> + bool leftmost_change = false;
> + u64 nextexp;
> +
> + if (child) {
> + raw_spin_lock(&child->lock);
> + raw_spin_lock_nested(&group->lock, SINGLE_DEPTH_NESTING);
> +
> + if (childstate.active) {

Since you're going to do the atomic_read(&group->migr_state)
within the group->lock, you may as well do the atomic_read(&child->migr_state)
within the child->lock. It won't hurt and it simplifies the mental
picture. Then you can add the following comment to outline the ordering
expectations:

/*
* Observing child->migr_state.active means that:
*
* 1) Either the child is effectively active, then it's fine to stop here
*
* 2) Or we are racing with a CPU going inactive and this childstate is actually
* not active anymore but tmigr_inactive_up() hasn't yet called tmigr_update_event()
* on it. It's fine to stop here because that pending call will take care
* of the rest of the propagation.
*
* 3) In any case it's impossible to observe childstate.active when a racing
* CPU made it inactive and also called tmigr_update_event() on it. The
* group->lock enforces ordering such that ->migr_state changes
* in tmigr_inactive_up() are released by group->lock UNLOCK on the
* subsequent call to tmigr_update_event() and then acquired by
* child->lock LOCK in tmigr_new_timer() -> tmigr_update_event().
*/

> + walk_done = true;
> + goto unlock;
> + }
> +
> + first_childevt = tmigr_next_groupevt(child);
> + nextexp = child->next_expiry;
> + evt = &child->groupevt;
> + } else {
> + nextexp = data->nextexp;
> +
> + first_childevt = evt = data->evt;
> +
> + /*
> + * Walking the hierarchy is required in any case when a
> + * remote expiry was done before. This ensures that already
> + * queued events in non-active groups are not lost (see section
> + * "Required event and timerqueue update after a remote
> + * expiry" in the documentation at the top).
> + *
> + * The two call sites which are executed without a prior remote
> + * expiry are not prevented from propagating changes through
> + * the hierarchy by this return:
> + * - When entering this path by tmigr_new_timer(), @evt->ignore
> + * is never set.
> + * - tmigr_inactive_up() takes care of the propagation by
> + * itself and ignores the return value. But an immediate
> + * return is required because nothing has to be done in this
> + * level as the event could be ignored.
> + */
> + if (evt->ignore && !remote)
> + return true;
> +
> + raw_spin_lock(&group->lock);
> + }
> +
> + if (nextexp == KTIME_MAX) {
> + evt->ignore = true;
> +
> + /*
> + * When the next child event could be ignored (nextexp is
> + * KTIME_MAX) and there was no remote timer handling before or
> + * the group is already active, there is no need to walk the
> + * hierarchy even if there is a parent group.
> + *
> + * The other way round: even if the event could be ignored, when
> + * remote timer handling was executed before and the group is
> + * not active, walking the hierarchy is required to not miss an
> + * enqueued timer in the non-active group. The enqueued timer
> + * of the group needs to be propagated to a higher level to
> + * ensure it is handled.
> + */
> + if (!remote || groupstate.active) {

Same here, fetching group->migr_state.active from within the lock simplifies
the mental model.

Thanks.

> + walk_done = true;
> + goto unlock;
> + }
> + } else {
> + /*
> + * An update of @evt->cpu and of the @evt->ignore flag is
> + * required only when @child is set (the child is at lvl0 or
> + * higher), but it doesn't matter if it is written once more
> + * to the per-CPU event; make the update unconditional.
> + */
> + evt->cpu = first_childevt->cpu;
> + evt->ignore = false;
> + }
> +
> + walk_done = !group->parent;
> +
> + /*
> + * If the child event is already queued in the group, remove it
> + * from the queue only when the expiry time changed.
> + */
> + if (timerqueue_node_queued(&evt->nextevt)) {
> + if (evt->nextevt.expires == nextexp)
> + goto check_toplvl;
> +
> + leftmost_change = timerqueue_getnext(&group->events) == &evt->nextevt;
> + if (!timerqueue_del(&group->events, &evt->nextevt))
> + WRITE_ONCE(group->next_expiry, KTIME_MAX);
> + }
> +
> + evt->nextevt.expires = nextexp;
> +
> + if (timerqueue_add(&group->events, &evt->nextevt)) {
> + leftmost_change = true;
> + WRITE_ONCE(group->next_expiry, nextexp);
> + }
> +
> +check_toplvl:
> + if (walk_done && (groupstate.migrator == TMIGR_NONE)) {
> + /*
> + * Nothing to do when the first event didn't change and the
> + * update was done during remote timer handling.
> + */
> + if (remote && !leftmost_change)
> + goto unlock;
> + /*
> + * The top level group is idle, so it has to be ensured that
> + * global timers are handled in time. (This could be optimized
> + * by keeping track of the last global scheduled event and only
> + * arming it on the CPU if the new event is earlier. Not sure if
> + * it's worth the complexity.)
> + */
> + data->firstexp = tmigr_next_groupevt_expires(group);
> + }
> +
> +unlock:
> + raw_spin_unlock(&group->lock);
> +
> + if (child)
> + raw_spin_unlock(&child->lock);
> +
> + return walk_done;
> +}