Re: [PATCH] memcg: use ratelimited stats flush in the reclaim

From: Yosry Ahmed
Date: Mon Jun 24 2024 - 17:42:01 EST

Next message: Yosry Ahmed: "Re: [PATCH V2] cgroup/rstat: Avoid thundering herd problem by kswapd across NUMA nodes"
Previous message: Ilya Maximets: "Re: [PATCH net-next v3 04/10] net: psample: allow using rate as probability"
In reply to: Shakeel Butt: "Re: [PATCH] memcg: use ratelimited stats flush in the reclaim"

On Mon, Jun 24, 2024 at 1:01 PM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
>
> On Mon, Jun 24, 2024 at 12:06:28PM GMT, Yosry Ahmed wrote:
> > On Mon, Jun 24, 2024 at 11:59 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> > >
> > > On Mon, Jun 24, 2024 at 10:15:38AM GMT, Yosry Ahmed wrote:
> > > > On Mon, Jun 24, 2024 at 10:02 AM Shakeel Butt <shakeel.butt@xxxxxxxxx> wrote:
> > > > >
> > > > > On Mon, Jun 24, 2024 at 05:57:51AM GMT, Yosry Ahmed wrote:
> > > > > > > > and I will explain why below. I know it may be a necessary
> > > > > > > > evil, but I would like us to make sure there is no other option before
> > > > > > > > going forward with this.
> > > > > > >
> > > > > > > > Instead of a necessary evil, I would call it a pragmatic approach, i.e.
> > > > > > > > resolve the ongoing pain with a good-enough solution and work on a
> > > > > > > > long-term solution later.
> > > > > >
> > > > > > It seems like there are a few ideas for solutions that may address
> > > > > > longer-term concerns, let's make sure we try those out first before we
> > > > > > fall back to the short-term mitigation.
> > > > > >
> > > > >
> > > > > Why? More specifically why try out other things before this patch? Both
> > > > > can be done in parallel. This patch has been running in production at
> > > > > Meta for several weeks without issues. Also I don't see how merging this
> > > > > would impact us on working on long term solutions.
> > > >
> > > > The problem is that once this is merged, it will be difficult to
> > > > change this back to a normal flush once other improvements land. We
> > > > don't have a test that reproduces the problem that we can use to make
> > > > sure it's safe to revert this change later, it's only using data from
> > > > prod.
> > > >
> > >
> > > I am pretty sure the work on a long-term solution will be iterative and
> > > will involve many reverts and redoing things differently. So, I think it
> > > is understandable that we may need to revert or revert the reverts.
> > >
> > > > Once this mitigation goes in, I think everyone will be less motivated
> > > > to get more data from prod about whether it's safe to revert the
> > > > ratelimiting later :)
> > >
> > > As I said I don't expect "safe in prod" as a strict requirement for a
> > > change.
> >
> > If everyone agrees that we can experiment with reverting this change
> > later without having to prove that it is safe, then I think it's fine.
> > Let's document this in the commit log though, so that whoever tries to
> > revert this in the future (if any) does not have to re-explain all of
> > this :)
>
> Sure.
>
> >
> > [..]
> > > > > > >
> > > > > > > For the cache trim mode, inactive file LRU size is read and the kernel
> > > > > > > scales it down based on the reclaim iteration (file >> sc->priority) and
> > > > > > > only checks if it is zero or not. Again precise information is not
> > > > > > > needed.
> > > > > >
> > > > > > It sounds like it is possible that we enter the cache trim mode when
> > > > > > we shouldn't if the stats are stale. Couldn't this lead to
> > > > > > over-reclaiming file memory?
> > > > > >
> > > > >
> > > > > Can you explain how this over-reclaiming file will happen?
> > > >
> > > > In one reclaim iteration, we could flush the stats, read the inactive
> > > > file LRU size, confirm that (file >> sc->priority) > 0 and enter the
> > > > cache trim mode, reclaiming file memory only. Let's assume that we
> > > > reclaimed enough file memory such that the condition (file >>
> > > > sc->priority) > 0 does not hold anymore.
> > > >
> > > > In a subsequent reclaim iteration, the flush could be skipped due to
> > > > ratelimiting. Now we will enter the cache trim mode again and reclaim
> > > > file memory only, even though the actual amount of file memory is low.
> > > > This will cause over-reclaiming from file memory and dismissing anon
> > > > memory that we should have reclaimed, which means that we will need
> > > > additional reclaim iterations to actually free memory.
> > > >
> > > > I believe this scenario would be possible with ratelimiting, right?
> > > >
> > >
> > > So, (old_file >> sc->priority) > 0 is true but (new_file >>
> > > sc->priority) > 0 is false. In the next iteration, (old_file >>
> > > (sc->priority-1)) > 0 will still be true but somehow (new_file >>
> > > (sc->priority-1)) > 0 is false. That can happen if, in the previous
> > > iteration, the kernel somehow reclaimed more than double what it was
> > > supposed to reclaim, or if there are concurrent reclaimers. In addition,
> > > nr_reclaimed is still less than nr_to_reclaim and there is no file
> > > deactivation request.
> > >
> > > Yeah, it can happen, but a lot of weird conditions need to hold
> > > concurrently for this to happen.
> >
> > Not necessarily sc->priority-1. Consider two separate sequential
> > reclaim attempts. At the same priority, the first reclaim attempt
> > could rightfully enter cache trim mode, while the second one
> > wrongfully enters cache trim mode due to stale stats, over-reclaims
> > file memory, and stalls longer before actually reclaiming anon memory.
> >
>
> For two different reclaim attempts even more things need to go wrong.
> Anyway, we are talking too much in the abstract here and focusing on the
> corner cases which almost all heuristics have. Unless there is a clear
> explanation that the corner case probability will be increased, I don't
> think spending time discussing it is useful.
>
> > I am sure such a scenario is not going to be common, but I am also
> > sure if it happens it will be a huge pain to debug.
> >
> > If others agree that this is fine, let's document this with a comment
> > and in the commit log. I am not sure how common the cache trim mode is
> > in practice to understand the potential severity of such problems.
> > There may also be other consequences that I am not aware of.
>
> What is your definition of "others" though?

I am just interested to hear more opinions. If others (e.g. people in
the CC) agree with you that this is the approach we should be taking
then I won't stand in the way. If others share my concerns, then maybe
we should not proceed. It seemed like at least Jesper had some
concerns as well.

If no one cares enough to voice their opinions then I suppose it's up to you :)
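
The stale-stats scenario from the quoted exchange above (two sequential
reclaim attempts at the same priority) can be made concrete with a small
userspace toy model in C. This is only a sketch under stated assumptions,
not kernel code: struct toy_memcg, every toy_*() function, the 4-tick
ratelimit window and the page counts are hypothetical and chosen purely
for illustration. The only detail taken from the thread is the check
itself, (file >> sc->priority) != 0.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy userspace model. All names below are hypothetical, not kernel APIs. */
struct toy_memcg {
    uint64_t actual_inactive_file; /* ground truth, in pages */
    uint64_t cached_inactive_file; /* what readers see since the last flush */
    uint64_t last_flush;           /* time of the last flush, arbitrary ticks */
};

#define TOY_FLUSH_INTERVAL 4 /* assumed ratelimit window, in ticks */

/* Unconditional flush: the cached value catches up with ground truth. */
void toy_flush(struct toy_memcg *m, uint64_t now)
{
    m->cached_inactive_file = m->actual_inactive_file;
    m->last_flush = now;
}

/* Ratelimited flush: skipped if the previous flush was too recent. */
bool toy_flush_ratelimited(struct toy_memcg *m, uint64_t now)
{
    if (now - m->last_flush < TOY_FLUSH_INTERVAL)
        return false; /* stats stay stale */
    toy_flush(m, now);
    return true;
}

/* The check discussed in the thread: scale by priority, test for nonzero. */
bool toy_cache_trim_mode(const struct toy_memcg *m, int priority)
{
    assert(priority >= 0 && priority < 64);
    return (m->cached_inactive_file >> priority) != 0;
}

/* Reclaim up to nr pages of file memory from the ground-truth counter. */
uint64_t toy_reclaim_file(struct toy_memcg *m, uint64_t nr)
{
    uint64_t got = nr < m->actual_inactive_file ? nr : m->actual_inactive_file;

    m->actual_inactive_file -= got;
    return got;
}

/*
 * Run two back-to-back reclaim attempts at the same priority and report
 * whether the second one still enters cache trim mode.
 */
bool toy_second_attempt_trims(bool ratelimited)
{
    const int priority = 12; /* assumed starting priority for illustration */
    struct toy_memcg m = {
        .actual_inactive_file = 5000,
        .cached_inactive_file = 0,
        .last_flush = 0,
    };
    uint64_t now = 10;

    /* First attempt: the flush happens either way (window has elapsed). */
    if (ratelimited)
        toy_flush_ratelimited(&m, now);
    else
        toy_flush(&m, now);
    if (toy_cache_trim_mode(&m, priority))
        toy_reclaim_file(&m, 2000); /* ground truth drops to 3000 pages */

    /* Second attempt, one tick later, inside the ratelimit window. */
    now += 1;
    if (ratelimited)
        toy_flush_ratelimited(&m, now);
    else
        toy_flush(&m, now);
    return toy_cache_trim_mode(&m, priority);
}
```

With these numbers, toy_second_attempt_trims(true) returns true because
the second attempt still sees the stale count of 5000 pages and
5000 >> 12 is 1, while toy_second_attempt_trims(false) returns false
because a fresh flush shows 3000 pages and 3000 >> 12 is 0. The model
says nothing about how likely this is in practice, which is the open
question here.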
