Re: [PATCH v5 16/18] timer: Implement the hierarchical pull model

From: Anna-Maria Behnsen
Date: Tue Apr 04 2023 - 10:58:06 EST


On Thu, 23 Mar 2023, Peter Zijlstra wrote:

> On Wed, Mar 01, 2023 at 03:17:42PM +0100, Anna-Maria Behnsen wrote:
>
> > diff --git a/kernel/time/timer_migration.h b/kernel/time/timer_migration.h
> > new file mode 100644
> > index 000000000000..ceb336e705df
> > --- /dev/null
> > +++ b/kernel/time/timer_migration.h
> > @@ -0,0 +1,123 @@
> > +#ifndef _KERNEL_TIME_MIGRATION_H
> > +#define _KERNEL_TIME_MIGRATION_H
> > +
> > +/* Per group capacity. Must be a power of 2! */
> > +#define TMIGR_CHILDS_PER_GROUP 8
> > +
> > +/**
> > + * struct tmigr_event - a timer event associated to a CPU
> > + * @nextevt: The node to enqueue an event in the parent group queue
> > + * @cpu: The CPU to which this event belongs
> > + * @ignore: Hint whether the event could be ignored; it is set when
> > + * CPU or group is active;
> > + */
> > +struct tmigr_event {
> > + struct timerqueue_node nextevt;
> > + unsigned int cpu;
> > + int ignore;
> > +};
> > +
> > +/**
> > + * struct tmigr_group - timer migration hierarchy group
> > + * @lock: Lock protecting the event information
> > + * @cpus: Array with CPUs which are member of group; required for
> > + * sibling CPUs; used only when level == 0
>
> That's 32 bytes wasted for every group that isn't 0, maybe stick on the
> end and conditionally allocate? Also, afaict it is only ever used during
> setup, and as such should not be placed in a hot line, unless you've
> done that explicitly as padding, in which case it needs a comment to
> that effect.
>
> Also, since it's setup only, why can't you match against:
>
> per_cpu_ptr(&tmigr_cpu, cpu)->parent
>
> ?

This cpus array is currently used to make sure siblings will end up in the
same level 0 group. When matching against the per_cpu_ptr(&tmigr_cpu,
cpu)->parent, I would need to rely on the topology mask and have a look if
the sibling already has a parent.

My question here is: Is it required that siblings end up in the same group
in level 0, or is it enough if the numa node is the same?

> > + * @parent: Pointer to parent group
> > + * @list: List head that is added to per level tmigr_level_list
>
> Also setup only?

Jupp.

> > + * @level: Hierarchy level of group
> > + * @numa_node: Is set to numa node when level < tmigr_crossnode_level;
> > + * otherwise it is set to NUMA_NO_NODE; Required for setup
> > + * only
> > + * @num_childs: Counter of group childs; Required for setup only
> > + * @num_cores: Counter of cores per group; Required for setup only when
> > + * level == 0 and siblings exist
>
> Also setup only, move the end?

Same. Will move all the setup stuff to the end.

> > + * @migr_state: State of group (see struct tmigr_state)
> > + * @childmask: childmask of group in parent group; is set during setup
> > + * never changed; could be read lockless
> > + * @events: Timer queue for child events queued in the group
> > + * @groupevt: Next event of group; it is only reliable when group is
> > + * !active (ignore bit is set when group is active)
> > + * @next_expiry: Base monotonic expiry time of next event of group;
> > + * Used for racy lockless check whether remote expiry is
> > + * required; it is always reliable
>
> This is basically leftmost of @events? A racy lockless check sorta
> implies not reliable, comment is confusing.

It's always updated and contains the expiry value of the first event which
is enqueued into timer queue "events" - reliable is the wrong term here.

> Also, isn't this identical to @groupevt.nextevt.expiry ?

No, it is not identical. Because groupevt is only used and reliable, when
the group is idle to enqueue the first group event into the parent
group. But the migrator of the group needs to check whether timers needs to
be handled remote because some children are idle. Therefore I do not have to
update the whole event.

> > + */
> > +struct tmigr_group {
> > + raw_spinlock_t lock;
> > + unsigned int cpus[TMIGR_CHILDS_PER_GROUP];
> > + struct tmigr_group *parent;
> > + struct list_head list;
> > + unsigned int level;
> > + unsigned int numa_node;
> > + unsigned int num_childs;
> > + unsigned int num_cores;
> > + atomic_t *migr_state;
>
> Per the other email; why isn't this:
>
> union tmigr_state migr_state;
>
> ?

I will change this into a direct member. The reason for not being a direct
member is - because it grows like this...

Only for handling the tmigr_group->migr_state, atomic operations are used
the splitted members are never accessed. All other states are not handled
with atomic operations. If it is 'union tmigr_state' inside the
tmigr_group, then I would need an atomic_t inside the union and the union
gets more complex. I hope I was able to explain my point.

> > + u32 childmask;
> > + struct timerqueue_head events;
> > + struct tmigr_event groupevt;
> > + u64 next_expiry;
> > +};
> > +

I come back to your other remarks in a separate mail.

Thanks,

Anna-Maria