Re: [PATCH v3] mm/vmscan: stop the loop if enough pages have been page_out
From: Barry Song
Date: Thu Oct 10 2024 - 12:36:10 EST
On Thu, Oct 10, 2024 at 4:59 PM Kefeng Wang <wangkefeng.wang@xxxxxxxxxx> wrote:
>
> Hi Ridong,
>
> This should be the first version for upstream, and the issue only
> occurs when a large folio is split.
>
> Adding more CCs to see if there's more feedback.
>
>
> On 2024/10/10 16:18, Chen Ridong wrote:
> > From: Chen Ridong <chenridong@xxxxxxxxxx>
> >
> > An issue was found with the following testing steps:
> > 1. Compile with CONFIG_TRANSPARENT_HUGEPAGE=y.
> > 2. Mount memcg v1, create a memcg named test_memcg, and set
> > memory.limit_in_bytes=2.1G and memory.memsw.limit_in_bytes=3G.
> > 3. Create a 1G swap file and allocate 2.2G of anon memory in test_memcg.
> >
> > It was found that:
> >
> > cat memory.usage_in_bytes
> > 2144940032
> > cat memory.memsw.usage_in_bytes
> > 2255056896
> >
> > free -h
> > total used free
> > Mem: 31Gi 2.1Gi 27Gi
> > Swap: 1.0Gi 618Mi 405Mi
> >
> > As shown above, test_memcg used only about 100M of swap
> > (memsw.usage_in_bytes minus usage_in_bytes), but 600M+ of swap was
> > consumed system-wide, which means roughly 500M may be wasted because
> > other memcgs cannot use that swap space.
> >
> > It can be explained as follows:
> > 1. When entering shrink_inactive_list, folios are isolated from the lru,
> > scanning from tail to head. For simplicity, suppose it takes only folioN:
> >
> > inactive lru: folio1<->folio2<->folio3...<->folioN-1
> > isolated list: folioN
> >
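Side note for context: the isolation in step 1 corresponds roughly to
isolate_lru_folios() in mm/vmscan.c, which always takes entries from the tail
of the inactive lru. A simplified sketch (paraphrased, not the actual kernel
code; the helper name is made up):

	/*
	 * Folios are detached from the *tail* of the inactive lru and
	 * collected on a private list that shrink_folio_list() will walk.
	 */
	static unsigned long isolate_tail_of_lru(struct lruvec *lruvec,
						 enum lru_list lru,
						 struct list_head *dst,
						 unsigned long nr_to_scan)
	{
		struct list_head *src = &lruvec->lists[lru];
		unsigned long nr_taken = 0;

		while (nr_taken < nr_to_scan && !list_empty(src)) {
			struct folio *folio = lru_to_folio(src);  /* tail entry */

			list_move(&folio->lru, dst);              /* isolate it */
			nr_taken += folio_nr_pages(folio);
		}
		return nr_taken;
	}
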
> > 2. In shrink_folio_list, if folioN is a THP, it may be split and added to
> > the swap cache folio by folio. After a folio is added to the swap cache,
> > io is submitted to write it back to swap, which is asynchronous.
> > When shrink_folio_list finishes, the isolated folio list is moved back
> > to the head of the inactive lru. The inactive lru may then look like
> > this, with 512 folios having been moved to the head of the inactive lru:
> >
> > folioN512<->folioN511<->...folioN1<->folio1<->folio2...<->folioN-1
> >
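The "moved back to the head" part is move_folios_to_lru(), called from
shrink_inactive_list() once shrink_folio_list() returns. Roughly, as a
simplified paraphrase rather than verbatim code:

	/*
	 * Everything shrink_folio_list() left on the list - including the
	 * sub-folios of a split THP that are merely waiting for writeback -
	 * is put back at the *head* of the lru.
	 */
	while (!list_empty(folio_list)) {
		struct folio *folio = lru_to_folio(folio_list);

		list_del(&folio->lru);
		lruvec_add_folio(lruvec, folio);  /* adds at the list head */
	}
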
> > 3. When a folio's writeback io completes, the folio may be rotated to the
> > tail of the lru. The following lru list is expected, with the folios that
> > were added to the swap cache rotated to the tail of the lru so that they
> > can be reclaimed as soon as possible:
> >
> > folio1<->folio2<->...<->folioN-1<->folioN1<->...folioN511<->folioN512
> >
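For reference, this rotation is the PG_reclaim path: pageout() sets PG_reclaim
before submitting the write, and on completion folio_end_writeback() calls
folio_rotate_reclaimable() to move the folio to the tail of its lru. Roughly,
as a simplified paraphrase of mm/filemap.c rather than verbatim code:

	void folio_end_writeback(struct folio *folio)
	{
		if (folio_test_reclaim(folio)) {
			folio_clear_reclaim(folio);
			folio_rotate_reclaimable(folio);  /* move to lru tail */
		}
		/* ... then clear PG_writeback and wake any waiters ... */
	}
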
> > 4. However, shrink_folio_list and folio writeback run asynchronously. If a
> > THP is split, shrink_folio_list loops at least 512 times, which means that
> > some folios can finish writeback while shrink_folio_list is still running.
> > Rotating those folios to the tail of the lru then fails because, being
> > isolated, they are not on the lru at that point. The lru may look like this:
I assume you’re referring to PMD-mapped THP, but your change also affects
mTHP, which might not be that large. For instance, it could be a 16KB mTHP.
> >
> > folioN50<->folioN49<->...folioN1<->folio1<->folio2...<->folioN-1<->
> > folioN51<->folioN52<->...folioN511<->folioN512
> >
> > Although those folios (N1-N50) have finished writeback, they are still
> > at the head of the lru. Since isolation scans the lru from tail to head,
> > it is difficult to reach those folios again any time soon.
> >
> > As a result, a large number of folios may sit in the swap cache without
> > being reclaimed in time, which reduces reclaim efficiency and prevents
> > other memcgs from using this swap space even if they trigger OOM.
> >
> > To fix this issue, stop looping once a large folio has been split and
> > nr_pageout is greater than nr_to_reclaim.
> >
> > Signed-off-by: Chen Ridong <chenridong@xxxxxxxxxx>
> > ---
> > mm/vmscan.c | 16 +++++++++++++++-
> > 1 file changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 749cdc110c74..fd8ad251eda2 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1047,7 +1047,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > LIST_HEAD(demote_folios);
> > unsigned int nr_reclaimed = 0;
> > unsigned int pgactivate = 0;
> > - bool do_demote_pass;
> > + bool do_demote_pass, splited = false;
> > struct swap_iocb *plug = NULL;
> >
> > folio_batch_init(&free_folios);
> > @@ -1065,6 +1065,16 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >
> > cond_resched();
> >
> > + /*
> > + * If a large folio has been split, many folios are added
> > + * to folio_list. Looping through the entire list takes
> > + * too much time, which may prevent folios that have completed
> > + * writeback from rotating to the tail of the lru. Just
> > + * stop looping if nr_pageout is greater than nr_to_reclaim.
> > + */
> > + if (unlikely(splited && stat->nr_pageout > sc->nr_to_reclaim))
> > + break;
I’m not entirely sure about the reasoning behind comparing stat->nr_pageout
with sc->nr_to_reclaim. Also, the condition can become true even if you’ve
only split a relatively small “large folio,” such as a 16kB one?
> > +
> > folio = lru_to_folio(folio_list);
> > list_del(&folio->lru);
> >
> > @@ -1273,6 +1283,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > if ((nr_pages > 1) && !folio_test_large(folio)) {
> > sc->nr_scanned -= (nr_pages - 1);
> > nr_pages = 1;
> > + splited = true;
> > }
> >
> > /*
> > @@ -1375,12 +1386,14 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > if (nr_pages > 1 && !folio_test_large(folio)) {
> > sc->nr_scanned -= (nr_pages - 1);
> > nr_pages = 1;
> > + splited = true;
> > }
> > goto activate_locked;
> > case PAGE_SUCCESS:
> > if (nr_pages > 1 && !folio_test_large(folio)) {
> > sc->nr_scanned -= (nr_pages - 1);
> > nr_pages = 1;
> > + splited = true;
> > }
> > stat->nr_pageout += nr_pages;
> >
> > @@ -1491,6 +1504,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> > if (nr_pages > 1) {
> > sc->nr_scanned -= (nr_pages - 1);
> > nr_pages = 1;
> > + splited = true;
> > }
> > activate_locked:
> > /* Not a candidate for swapping, so reclaim swap space. */
>