Re: [PATCH v4 4/4] sched/psi: allow unprivileged polling of N*2s period

From: Johannes Weiner
Date: Wed Mar 29 2023 - 12:14:34 EST


On Wed, Mar 29, 2023 at 05:33:27PM +0200, Domenico Cerasuolo wrote:
> PSI offers 2 mechanisms to get information about a specific resource
> pressure. One is reading from /proc/pressure/<resource>, which gives
> average pressures aggregated every 2s. The other is creating a pollable
> fd for a specific resource and cgroup.
>
> The trigger creation requires CAP_SYS_RESOURCE, and gives the
> possibility to pick specific time window and threshold, spawing an RT
> thread to aggregate the data.
>
> Systemd would like to provide containers the option to monitor pressure
> on their own cgroup and sub-cgroups. For example, if systemd launches a
> container that itself then launches services, the container should have
> the ability to poll() for pressure in individual services. But neither
> the container nor the services are privileged.
>
> This patch implements a mechanism to allow unprivileged users to create
> pressure triggers. The difference with privileged triggers creation is
> that unprivileged ones must have a time window that's a multiple of 2s.
> This is so that we can avoid unrestricted spawning of rt threads, and
> use instead the same aggregation mechanism done for the averages, which
> runs independently of any triggers.
>
> Suggested-by: Johannes Weiner <hannes@xxxxxxxxxxx>
> Signed-off-by: Domenico Cerasuolo <cerasuolodomenico@xxxxxxxxx>

Overall it looks good to me. Thanks for adding the comment on the
privilege check, it's much easier to understand now.

A few nitpicks below:

> @@ -151,6 +151,9 @@ struct psi_trigger {
>
> /* Deferred event(s) from previous ratelimit window */
> bool pending_event;
> +
> + /* Used to differentiate destruction action*/
> + enum psi_aggregators aggregator;
> };

The comment is a bit mysterious. How about

/* Trigger type - PSI_AVGS for unprivileged, PSI_POLL for RT */

> @@ -186,9 +186,9 @@ static void group_init(struct psi_group *group)
> seqcount_init(&per_cpu_ptr(group->pcpu, cpu)->seq);
> group->avg_last_update = sched_clock();
> group->avg_next_update = group->avg_last_update + psi_period;
> - INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> mutex_init(&group->avgs_lock);
> - /* Init trigger-related members */
> +
> + /* Init rtpoll trigger-related members */
> atomic_set(&group->rtpoll_scheduled, 0);
> mutex_init(&group->rtpoll_trigger_lock);
> INIT_LIST_HEAD(&group->rtpoll_triggers);
> @@ -197,6 +197,11 @@ static void group_init(struct psi_group *group)
> init_waitqueue_head(&group->rtpoll_wait);
> timer_setup(&group->rtpoll_timer, poll_timer_fn, 0);
> rcu_assign_pointer(group->rtpoll_task, NULL);
> +
> + /* Init avg trigger-related members */
> + INIT_LIST_HEAD(&group->avg_triggers);
> + memset(group->avg_nr_triggers, 0, sizeof(group->avg_nr_triggers));
> + INIT_DELAYED_WORK(&group->avgs_work, psi_avgs_work);
> }

Can you move those above the rtpoll inits?

It helps with navigating the code and spotting missing inits when the
init sequence follows the order of the struct members.

> @@ -430,21 +435,25 @@ static u64 window_update(struct psi_window *win, u64 now, u64 value)
> return growth;
> }
>
> -static u64 update_triggers(struct psi_group *group, u64 now, bool *update_total)
> +static u64 update_triggers(struct psi_group *group, u64 now, bool *update_total,
> + enum psi_aggregators aggregator)
> {
> struct psi_trigger *t;
> - u64 *total = group->total[PSI_POLL];
> + u64 *total = group->total[aggregator];
> + struct list_head *triggers = aggregator == PSI_AVGS ? &group->avg_triggers
> + : &group->rtpoll_triggers;
> + u64 *aggregator_total = aggregator == PSI_AVGS ? group->avg_total : group->rtpoll_total;
> *update_total = false;

These lines are a bit too long. When the init part gets that long,
it's preferable to move it outside of the decl block:

if (aggregator == PSI_AVGS) {
triggers = &group->avg_triggers;
aggregator_total = group->avg_total;
} else {
triggers = &group->rtpoll_triggers;
aggregator_total = group->rtpoll_total;
}

> @@ -1357,22 +1389,26 @@ void psi_trigger_destroy(struct psi_trigger *t)
> u64 period = ULLONG_MAX;
>
> list_del(&t->node);
> - group->rtpoll_nr_triggers[t->state]--;
> - if (!group->rtpoll_nr_triggers[t->state])
> - group->rtpoll_states &= ~(1 << t->state);
> - /* reset min update period for the remaining triggers */
> - list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> - period = min(period, div_u64(tmp->win.size,
> - UPDATES_PER_WINDOW));
> - group->rtpoll_min_period = period;
> - /* Destroy rtpoll_task when the last trigger is destroyed */
> - if (group->rtpoll_states == 0) {
> - group->rtpoll_until = 0;
> - task_to_destroy = rcu_dereference_protected(
> - group->rtpoll_task,
> - lockdep_is_held(&group->rtpoll_trigger_lock));
> - rcu_assign_pointer(group->rtpoll_task, NULL);
> - del_timer(&group->rtpoll_timer);
> + if (t->aggregator == PSI_AVGS) {
> + group->avg_nr_triggers[t->state]--;
> + } else {
> + group->rtpoll_nr_triggers[t->state]--;
> + if (!group->rtpoll_nr_triggers[t->state])
> + group->rtpoll_states &= ~(1 << t->state);
> + /* reset min update period for the remaining triggers */
> + list_for_each_entry(tmp, &group->rtpoll_triggers, node)
> + period = min(period, div_u64(tmp->win.size,
> + UPDATES_PER_WINDOW));
> + group->rtpoll_min_period = period;
> + /* Destroy rtpoll_task when the last trigger is destroyed */
> + if (group->rtpoll_states == 0) {
> + group->rtpoll_until = 0;
> + task_to_destroy = rcu_dereference_protected(
> + group->rtpoll_task,
> + lockdep_is_held(&group->rtpoll_trigger_lock));
> + rcu_assign_pointer(group->rtpoll_task, NULL);
> + del_timer(&group->rtpoll_timer);

These lines are quite long too, we usually shoot for a line length of
80 characters. Can you do

if (t->aggregator == PSI_AVGS) {
group->avg_nr_triggers[t->state]--;
return;
}

/* Else, it's an rtpoll trigger */
group->rtpoll_nr_triggers[t->state]--;
...

With that, please add

Acked-by: Johannes Weiner <hannes@xxxxxxxxxxx>