Cgroup memory barrier usage and call frequency from scheduler

From: Mel Gorman
Date: Thu Apr 09 2020 - 11:44:19 EST


Hi Tejun,

Commit 9a9e97b2f1f2 ("cgroup: Add memory barriers to plug
cgroup_rstat_updated() race window") introduced two full memory
barriers to close a race. The one in cgroup_rstat_updated can be
called at a high frequency from the scheduler from update_curr ->
cgroup_account_cputime. The patch has no cc's, acks or reviews so I'm
not sure how closely this was looked at. cgroup_rstat_updated shows up
in profiles of netperf UDP_STREAM accounting for about 1% of overhead
which doesn't sound a lot but that's about the same weight as some of
the critical network paths. I have three questions about the patch

1. Why were full barriers used?
2. Why was it important that the data race be closed when the inaccuracy
is temporary?
3. Why is it called from the context of update_curr()?

For 1, the use of a full barrier seems unnecessary when it appears that
you could have used a read barrier and a write barrier. The following
patch drops the profile overhead to 0.1%

diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
index ca19b4c8acf5..bc3125949b4b 100644
--- a/kernel/cgroup/rstat.c
+++ b/kernel/cgroup/rstat.c
@@ -36,7 +36,7 @@ void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
* Paired with the one in cgroup_rstat_cpu_pop_upated(). Either we
* see NULL updated_next or they see our updated stat.
*/
- smp_mb();
+ smp_rmb();

/*
* Because @parent's updated_children is terminated with @parent
@@ -139,7 +139,7 @@ static struct cgroup *cgroup_rstat_cpu_pop_updated(struct cgroup *pos,
* Either they see NULL updated_next or we see their
* updated stat.
*/
- smp_mb();
+ smp_wmb();

return pos;
}

For 2, the changelog says the barriers are necessary because "we plan to use
rstat to track counters which need to be accurate". That is a bit vague.
Under what circumstances is a transient inaccuracy a serious enough
problem to justify additional barriers in the scheduler?

For 3, update_curr() is called from a lot of places, some of which are
quite hot -- e.g. task enqueue/dequeue. This is necessary information from
the runqueue needs to be preserved. However, it's less clear that the cpu
accounting information needs to be up to date on this granularity although
it might be related to question 2. Why was the delta_exec not similarly
accumulated in cpuacct_change() and defer the hierarchical update to
be called from somewhere like entity_tick()? It would need tracking the
CPU time at the last update as delta_exec would be lost so it's not very
trivial but it does not look like it would be overly complicated.

Thanks

--
Mel Gorman
SUSE Labs