Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s period

From: Johannes Weiner
Date: Wed Mar 29 2023 - 15:05:48 EST


On Wed, Mar 29, 2023 at 08:32:12PM +0200, Domenico Cerasuolo wrote:
> On Wed, Mar 29, 2023 at 6:13 PM Johannes Weiner <hannes@xxxxxxxxxxx> wrote:
>
> > On Wed, Mar 29, 2023 at 05:33:27PM +0200, Domenico Cerasuolo wrote:
> > > PSI offers 2 mechanisms to get information about a specific resource
> > > pressure. One is reading from /proc/pressure/<resource>, which gives
> > > average pressures aggregated every 2s. The other is creating a pollable
> > > fd for a specific resource and cgroup.
> > >
> > > The trigger creation requires CAP_SYS_RESOURCE, and gives the
> > > possibility to pick specific time window and threshold, spawing an RT
> > > thread to aggregate the data.
> > >
> > > Systemd would like to provide containers the option to monitor pressure
> > > on their own cgroup and sub-cgroups. For example, if systemd launches a
> > > container that itself then launches services, the container should have
> > > the ability to poll() for pressure in individual services. But neither
> > > the container nor the services are privileged.
> > >
> > > This patch implements a mechanism to allow unprivileged users to create
> > > pressure triggers. The difference with privileged triggers creation is
> > > that unprivileged ones must have a time window that's a multiple of 2s.
> > > This is so that we can avoid unrestricted spawning of rt threads, and
> > > use instead the same aggregation mechanism done for the averages, which
> > > runs independently of any triggers.
> > >
> > > Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> > > Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@xxxxxxxxx>
> >
> > Overall it looks good to me. Thanks for adding the comment on the
> > privilege check, it's much easier to understand now.
> >
> > A few nitpicks below:
> >
> > > @@ -151,6 +151,9 @@ struct psi_trigger {
> > >
> > > /* Deferred event(s) from previous ratelimit window */
> > > bool pending_event;
> > > +
> > > + /* Used to differentiate destruction action*/
> > > + enum psi_aggregators aggregator;
> > > };
> >
> > The comment is a bit mysterious. How about
> >
> > /* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */
> >
> > > @@ -186,9 +186,9 @@ static void group_init(struct psi_group *group)
> > > seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> > > group->avg_last_update = sched_clock();
> > > group->avg_next_update = group->avg_last_update + psi_period;
> > > - INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> > > mutex_init(&group->avgs_lock);
> > > - /* Init trigger-related members */
> > > +
> > > + /* Init rtpoll trigger-related members */
> > > atomic_set(&group->rtpoll_scheduled, 0);
> > > mutex_init(&group->rtpoll_trigger_lock);
> > > INIT_LIST_HEAD(&group->rtpoll_triggers);
> > > @@ -197,6 +197,11 @@ static void group_init(struct psi_group *group)
> > > init_waitqueue_head(&group->rtpoll_wait);
> > > timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
> > > rcu_assign_pointer(group->rtpoll_task, NULL);
> > > +
> > > + /* Init avg trigger-related members */
> > > + INIT_LIST_HEAD(&group->avg_triggers);
> > > + memset(group->avg_nr_triggers, 0, sizeof(group->avg_nr_triggers));
> > > + INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> > > }
> >
> > Can you move those above the rtpoll inits?
> >
> > It helps with navigating the code and spotting missing inits when the
> > init sequence follows the order of the struct members.
> >
> > > @@ -430,21 +435,25 @@ static u64 window_update(struct psi_window *win,
> > u64 now, u64 value)
> > > return growth;
> > > }
> > >
> > > -static u64 update_triggers(struct psi_group *group, u64 now, bool
> > *update_total)
> > > +static u64 update_triggers(struct psi_group *group, u64 now, bool
> > *update_total,
> > > + enum psi_aggregators
> > aggregator)
> > > {
> > > struct psi_trigger *t;
> > > - u64 *total = group->total[PSI_POLL];
> > > + u64 *total = group->total[aggregator];
> > > + struct list_head *triggers = aggregator == PSI_AVGS ?
> > &group->avg_triggers
> > > + : &group->rtpoll_triggers;
> > > + u64 *aggregator_total = aggregator == PSI_AVGS ? group->avg_total
> > : group->rtpoll_total;
> > > *update_total = false;
> >
> > These lines are a bit too long. When the init part gets that long,
> > it's preferable to move it outside of the decl block:
> >
> > if (aggregator == PSI_AVGS) {
> > triggers = &group->avg_triggers;
> > aggregator_total = group->avg_total;
> > } else {
> > triggers = &group->rtpoll_triggers;
> > aggregator_total = group->rtpoll_total;
> > }
> >
> > > @@ -1357,22 +1389,26 @@ void psi_trigger_destroy(struct psi_trigger *t)
> > > u64 period = ULLONG_MAX;
> > >
> > > list_del(&t->node);
> > > - group->rtpoll_nr_triggers[t->state]--;
> > > - if (!group->rtpoll_nr_triggers[t->state])
> > > - group->rtpoll_states &= ~(1 << t->state);
> > > - /* reset min update period for the remaining triggers */
> > > - list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> > > - period = min(period, div_u64(tmp->win.size,
> > > - UPDATES_PER_WINDOW));
> > > - group->rtpoll_min_period = period;
> > > - /* Destroy rtpoll_task when the last trigger is destroyed
> > */
> > > - if (group->rtpoll_states == 0) {
> > > - group->rtpoll_until = 0;
> > > - task_to_destroy = rcu_dereference_protected(
> > > - group->rtpoll_task,
> > > -
> > lockdep_is_held(&group->rtpoll_trigger_lock));
> > > - rcu_assign_pointer(group->rtpoll_task, NULL);
> > > - del_timer(&group->rtpoll_timer);
> > > + if (t->aggregator == PSI_AVGS) {
> > > + group->avg_nr_triggers[t->state]--;
> > > + } else {
> > > + group->rtpoll_nr_triggers[t->state]--;
> > > + if (!group->rtpoll_nr_triggers[t->state])
> > > + group->rtpoll_states &= ~(1 << t->state);
> > > + /* reset min update period for the remaining
> > triggers */
> > > + list_for_each_entry(tmp, &group->rtpoll_triggers,
> > node)
> > > + period = min(period, div_u64(tmp->win.size,
> > > + UPDATES_PER_WINDOW));
> > > + group->rtpoll_min_period = period;
> > > + /* Destroy rtpoll_task when the last trigger is
> > destroyed */
> > > + if (group->rtpoll_states == 0) {
> > > + group->rtpoll_until = 0;
> > > + task_to_destroy =
> > rcu_dereference_protected(
> > > + group->rtpoll_task,
> > > +
> > lockdep_is_held(&group->rtpoll_trigger_lock));
> > > + rcu_assign_pointer(group->rtpoll_task,
> > NULL);
> > > + del_timer(&group->rtpoll_timer);
> >
> > These lines are quite long too, we usually shoot for a line length of
> > 80 characters. Can you do
> >
> > if (t->aggregator == PSI_AVGS) {
> > group->avg_nr_triggers[t->state]--;
> > return;
> > }
> >
> > /* Else, it's an rtpoll trigger */
> > group->rtpoll_nr_triggers[t->state]--;
> > ...
> >
> Can't return there I think, the function doesn't end after the else branch,
> should I put a `goto out` instead to jump the rtpoll code?

You're right, I missed the bottom part beyond the diff.

Looking closer, I think trigger_create and trigger_destroy are
actually buggy. They have to protect against update_trigger(), so both
creation and destruction must take the right lock - avgs_lock or
rtpoll_trigger_lock - before modifying the list. They're both taking
only the rtpoll_trigger_lock right now.

IOW the trigger type distinction needs to be higher up in general.

if (t->aggregator == PSI_AVGS) {
mutex_lock(&group->avgs_lock);
...
} else {
mutex_lock(&group->rtpoll_trigger_lock);
...
}