That all being said, the semantics of the new 'high' limit in cgroup2
have allowed us to move reclaim/limit enforcement out of the
allocation context and into the userspace return path.
See the call to mem_cgroup_handle_over_high() from
tracehook_notify_resume(), and the comments in try_charge() around
set_notify_resume().
This already solves the free->alloc ordering problem by allowing the
allocation to exceed the limit temporarily until at least all locks
are dropped, we know we can sleep etc., before performing enforcement.
That means we may not need the timed sleeps anymore for that purpose,
and could bring back directed waits for freeing-events again.
What do you think? Any hazards around indefinite sleeps in that resume
path? It's called before __rseq_handle_notify_resume and the
arch-specific resume callback (which appears to be a no-op currently).
Chris, Michal, what are your thoughts? It would certainly be simpler
conceptually on the memcg side.