Re: [RESEND][PATCH v2] mm: don't call lru draining in the nested lru_cache_disable

From: Minchan Kim
Date: Tue Jan 25 2022 - 16:06:26 EST


On Tue, Jan 25, 2022 at 10:23:13AM +0100, Michal Hocko wrote:
> On Mon 24-01-22 14:22:03, Minchan Kim wrote:
> [...]
> >     CPU 0                                  CPU 1
> >
> >     lru_cache_disable                      lru_cache_disable
> >       ret = atomic_inc_return; (ret = 1)
> >                                              ret = atomic_inc_return; (ret = 2)
> >
> >       lru_add_drain_all(true);
> >                                              lru_add_drain_all(false)
> >                                              mutex_lock() is holding
> >       mutex_lock() is waiting
> >
> >                                              IPI with !force_all_cpus
> >                                              ...
> >                                              ...
> >                                              IPI done but it skipped some CPUs
> >
> >       ..
> >       ..
> >
> >
> > Thus, lru_cache_disable on CPU 1 doesn't drain every CPU, which
> > introduces a race on lru_disable_count: per-cpu caches on the cores
> > that didn't run the IPI could still accept upcoming pages.
>
> Yes, that is certainly possible but the question is whether it really
> matters all that much. The race would also require another racer to be
> adding a page to an _empty_ pcp list at the same time.
>
>	pagevec_add_and_need_flush
>	  1) pagevec_add	# add to pcp list
>	  2) lru_cache_disabled
>	       atomic_read(lru_disable_count) = 0
>	       # no flush but the page is on pcp
>
> There is no strong memory ordering between 1 and 2 and that is why we
> need an IPI to enforce it in general IIRC.

Correct.
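
For reference, the two steps above have roughly this shape in the source
(a simplified sketch of the v5.16-era mm/swap.c / swap.h helpers, not a
verbatim copy):

	/* Simplified sketch of the helpers referenced above; not verbatim. */
	static bool pagevec_add_and_need_flush(struct pagevec *pvec,
					       struct page *page)
	{
		bool ret = false;

		/* 1) publish the page on this CPU's pcp pagevec */
		if (!pagevec_add(pvec, page) || PageCompound(page) ||
				lru_cache_disabled())	/* 2) plain relaxed read */
			ret = true;

		return ret;
	}

	static inline bool lru_cache_disabled(void)
	{
		/* Nothing orders this read against the pagevec_add() above. */
		return atomic_read(&lru_disable_count);
	}

Only the forced drain (scheduling drain work on every CPU) guarantees
that a page published right before the disable actually gets flushed.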

>
> But lru_cache_disable is not a strong synchronization primitive. It aims
> at providing a best effort means to reduce false positives, right? IMHO

Nope. See commit d479960e44f27 ("mm: disable LRU pagevec during the
migration temporarily").

Originally, it was designed to close the race fundamentally.
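(For context, the callers that commit introduced bracket migration
roughly like this; a simplified sketch rather than the literal call
sites:

	lru_cache_disable();	/* drain pcp lists and keep them disabled */
	/* ... isolate and migrate the target page range ... */
	lru_cache_enable();	/* allow pcp caching again */

so pages cannot keep hiding in pcp pagevecs while the range is being
migrated.)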

> it doesn't make much sense to aim for perfection because all users of
> this interface already have to live with temporary failures and pcp
> caches are not the only reason to fail - e.g. short lived page pins.

Short-lived page pins are a real failure mode, but that doesn't mean we
shouldn't make the allocation faster. As I mentioned, the IPI can easily
take up to hundreds of milliseconds once the CPUs are fully booked. If we
reduce that cost, the time can be spent on more productive work. I am
working on making CMA more deterministic and this patch is one part of
that.

>
> That being said, I would rather live with a best effort and simpler
> implementation approach rather than aim for perfection in this case.
> The scheme is already quite complex and another lock in the mix doesn't

lru_add_drain_all already hides the whole complexity inside, and
lru_cache_disable adds a simple synchronization on top of it to keep the
ordering. That's a natural SW stack and I don't see much complication
here.

> make it any easier to follow. If others believe that another lock makes

Disagree. lru_cache_disable is designed to guarantee closing the race
you are opening again, so the rest of the allocator doesn't need to
consider the race at all once the per-cpu cache is disabled. That is
simpler and more deterministic, and we could build other things on top
of it (e.g., zone->pcp).
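
To make that concrete, the scheme is shaped roughly like this (a sketch
only, with a hypothetical lock name; not the literal v2 patch):

	static DEFINE_MUTEX(lru_disable_lock);	/* hypothetical name */

	void lru_cache_disable(void)
	{
		mutex_lock(&lru_disable_lock);
		/*
		 * Only the first disabler needs the forced drain; a nested
		 * caller blocks on the mutex until that drain has finished,
		 * so it can safely skip draining altogether.
		 */
		if (atomic_inc_return(&lru_disable_count) == 1)
			__lru_add_drain_all(true);
		mutex_unlock(&lru_disable_lock);
	}

The mutex provides the ordering, so lru_cache_disable never returns
before every pcp list has been drained at least once.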

> the implementation more straightforward I will not object but I would go
> with the following.
>
> diff --git a/mm/swap.c b/mm/swap.c
> index ae8d56848602..c140c3743b9e 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -922,7 +922,8 @@ atomic_t lru_disable_count = ATOMIC_INIT(0);
>   */
>  void lru_cache_disable(void)
>  {
> -	atomic_inc(&lru_disable_count);
> +	int count = atomic_inc_return(&lru_disable_count);
> +
>  #ifdef CONFIG_SMP
>  	/*
>  	 * lru_add_drain_all in the force mode will schedule draining on
> @@ -931,8 +932,28 @@ void lru_cache_disable(void)
>  	 * The atomic operation doesn't need to have stronger ordering
>  	 * requirements because that is enforeced by the scheduling
>  	 * guarantees.
> +	 * Please note that there is a potential for a race condition:
> +	 * CPU0                        CPU1                   CPU2
> +	 * pagevec_add_and_need_flush
> +	 *   pagevec_add # to the empty list
> +	 *   lru_cache_disabled
> +	 *     atomic_read # 0
> +	 *                             lru_cache_disable      lru_cache_disable
> +	 *                               atomic_inc_return (1)
> +	 *                                                      atomic_inc_return (2)
> +	 *                               __lru_add_drain_all(true)
> +	 *                                                      __lru_add_drain_all(false)
> +	 *                               mutex_lock
> +	 *                                                      mutex_lock
> +	 *                               # skip cpu0 (pagevec_add not visible yet)
> +	 *                               mutex_unlock
> +	 *                                                      # fail because of pcp(0) pin
> +	 *                                                      queue_work_on(0)
> +	 *
> +	 * but the scheme is a best effort and the above race is quite
> +	 * unlikely to matter in real life.
>  	 */
> -	__lru_add_drain_all(true);
> +	__lru_add_drain_all(count == 1);
>  #else
>  	lru_add_and_bh_lrus_drain();
>  #endif
> --
> Michal Hocko
> SUSE Labs
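
For comparison, with the count == 1 variant above a nested disable still
drains, just without forcing CPUs that have nothing pending (an
illustration of the caller pattern, not code from the patch):

	lru_cache_disable();	/* count == 1: __lru_add_drain_all(true) */
	lru_cache_disable();	/* count == 2: __lru_add_drain_all(false),
				 * only CPUs with pending pagevecs drain */

	lru_cache_enable();
	lru_cache_enable();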