Re: [PATCH V4 2/2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes

From: Jesper Dangaard Brouer
Date: Mon Jul 08 2024 - 11:26:54 EST



On 28/06/2024 11.39, Jesper Dangaard Brouer wrote:


On 28/06/2024 01.34, Shakeel Butt wrote:
On Thu, Jun 27, 2024 at 11:18:56PM GMT, Jesper Dangaard Brouer wrote:
Avoid lock contention on the global cgroup rstat lock caused by kswapd
starting on all NUMA nodes simultaneously. At Cloudflare, we observed
massive issues due to kswapd and the specific mem_cgroup_flush_stats()
call inlined in shrink_node, which takes the rstat lock.

[...]
  static void cgroup_base_stat_flush(struct cgroup *cgrp, int cpu);
@@ -312,6 +315,45 @@ static inline void __cgroup_rstat_unlock(struct cgroup *cgrp, int cpu_in_loop)
      spin_unlock_irq(&cgroup_rstat_lock);
  }
+#define MAX_WAIT    msecs_to_jiffies(100)
+/* Trylock helper that also checks for an ongoing flusher */
+static bool cgroup_rstat_trylock_flusher(struct cgroup *cgrp)
+{
+    bool locked = __cgroup_rstat_trylock(cgrp, -1);
+    if (!locked) {
+        struct cgroup *cgrp_ongoing;
+
+    /* Lock is contended, let's check if an ongoing flusher is already
+     * taking care of this, if we are a descendant.
+         */
+        cgrp_ongoing = READ_ONCE(cgrp_rstat_ongoing_flusher);
+        if (cgrp_ongoing && cgroup_is_descendant(cgrp, cgrp_ongoing)) {

I wonder if the READ_ONCE() and cgroup_is_descendant() need to happen
within an RCU read-side section. On a preemptible kernel, say we get
preempted between the two: the flusher could be unrelated and get freed
before we get the CPU back. In that case we are accessing freed memory.


I have to think about this some more.


I don't think this is necessary. We are now waiting (for completion) rather than skipping the flush, because cgroup_rstat_exit() is called as part of the takedown path, and it in turn calls cgroup_rstat_flush().


void cgroup_rstat_exit(struct cgroup *cgrp)
{
	int cpu;

	cgroup_rstat_flush(cgrp);


+            wait_for_completion_interruptible_timeout(
+                &cgrp_ongoing->flush_done, MAX_WAIT);
+
+            return false;
+        }
+        __cgroup_rstat_lock(cgrp, -1, false);
+    }
+    /* Obtained lock, record this cgrp as the ongoing flusher */
+    if (!READ_ONCE(cgrp_rstat_ongoing_flusher)) {
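For readers following along, the trylock-or-wait logic in the hunk above can be modeled in userspace. The sketch below is illustrative only: pthreads stand in for the kernel primitives (pthread_mutex_trylock for __cgroup_rstat_trylock, a condition-variable broadcast for complete_all() on flush_done), the cgroup_is_descendant() check and the MAX_WAIT timeout are omitted for brevity, and all names here are hypothetical, not the kernel API.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static pthread_mutex_t flush_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t done_mtx   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  done_cv    = PTHREAD_COND_INITIALIZER;
static _Atomic int ongoing_flusher;  /* 0 = none, else flusher id */
static int flush_generation;         /* bumped on each completed flush */

/* Return true if we hold flush_lock and should flush ourselves;
 * false if an ongoing flusher covered us and we waited for it. */
static bool trylock_flusher(int id)
{
    if (pthread_mutex_trylock(&flush_lock) != 0) {
        if (atomic_load(&ongoing_flusher) != 0) {
            /* Lock contended and a flush is in flight: wait for it
             * to complete instead of queuing on the lock. */
            pthread_mutex_lock(&done_mtx);
            int gen = flush_generation;
            while (atomic_load(&ongoing_flusher) != 0 &&
                   flush_generation == gen)
                pthread_cond_wait(&done_cv, &done_mtx);
            pthread_mutex_unlock(&done_mtx);
            return false;
        }
        /* Contended but no ongoing flusher recorded: block on lock. */
        pthread_mutex_lock(&flush_lock);
    }
    /* Obtained lock: record ourselves as the ongoing flusher. */
    atomic_store(&ongoing_flusher, id);
    return true;
}

static void unlock_flusher(void)
{
    atomic_store(&ongoing_flusher, 0);
    pthread_mutex_lock(&done_mtx);
    flush_generation++;               /* complete_all() analogue */
    pthread_cond_broadcast(&done_cv);
    pthread_mutex_unlock(&done_mtx);
    pthread_mutex_unlock(&flush_lock);
}
```

Note one deliberate simplification: the kernel patch bounds the wait with wait_for_completion_interruptible_timeout() and MAX_WAIT (100 ms), whereas this sketch waits unbounded to keep the code short.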