Re: (2) [PATCH v2] page_alloc: consider highatomic reserve in wmartermark fast

From: Jaewon Kim
Date: Tue Jun 16 2020 - 12:08:21 EST


.,

2020ë 6ì 16ì (í) ìí 11:17, Baoquan He <bhe@xxxxxxxxxx>ëì ìì:
>
> On 06/16/20 at 04:30pm, êìì wrote:
> > >>> > > <4>[ 6207.637627] [3: Binder:9343_3:22875] Normal free:10908kB min:6192kB low:44388kB high:47060kB active_anon:409160kB inactive_anon:325924kB active_file:235820kB inactive_file:276628kB unevictable:2444kB writepending:252kB present:3076096kB managed:2673676kB mlocked:2444kB kernel_stack:62512kB pagetables:105264kB bounce:0kB free_pcp:4140kB local_pcp:40kB free_cma:712kB
> > >>
> > >> Checked this mem info, wondering why there's no 'reserved_highatomic'
> > >> printing in all these examples.
> > >
> > >Yeah, it better be printed, especially after it's included in watermark
> > >calculation, so we're less confused by reports of allocation failure where
> > >watermarks are seemingly ok.
> > >
> >
> >
> > Hello Vlastimil and Baoquan
> >
> > The log in previous mail was captured from kernel based on v4.14.
> > After adding the reserved_highatomic log, I finally got a new log below
> > Let me change description in next patch.
> >
> > There seems be reserved_highatomic:32768KB and actually 14232kB free.
> >
> > [ 4738.329298] kswapd0: page allocation failure: order:0, mode:0x140000a(GFP_NOIO|__GFP_HIGHMEM|__GFP_MOVABLE), nodemask=(null)
> > [ 4738.329325] kswapd0 cpuset=/ mems_allowed=0
> > [ 4738.329339] CPU: 4 PID: 1221 Comm: kswapd0 Not tainted 4.14.113-18770262-userdebug #1
> > [ 4738.329350] Call trace:
> > [ 4738.329366] [<0000000000000000>] dump_backtrace+0x0/0x248
> > [ 4738.329377] [<0000000000000000>] show_stack+0x18/0x20
> > [ 4738.329390] [<0000000000000000>] __dump_stack+0x20/0x28
> > [ 4738.329398] [<0000000000000000>] dump_stack+0x68/0x90
> > [ 4738.329409] [<0000000000000000>] warn_alloc+0x104/0x198
> > [ 4738.329417] [<0000000000000000>] __alloc_pages_nodemask+0xdc0/0xdf0
> > [ 4738.329427] [<0000000000000000>] zs_malloc+0x148/0x3d0
> > [ 4738.329438] [<0000000000000000>] zram_bvec_rw+0x410/0x798
> > [ 4738.329446] [<0000000000000000>] zram_rw_page+0x88/0xdc
> > [ 4738.329455] [<0000000000000000>] bdev_write_page+0x70/0xbc
> > [ 4738.329463] [<0000000000000000>] __swap_writepage+0x58/0x37c
> > [ 4738.329469] [<0000000000000000>] swap_writepage+0x40/0x4c
> > [ 4738.329478] [<0000000000000000>] shrink_page_list+0xc30/0xf48
> > [ 4738.329486] [<0000000000000000>] shrink_inactive_list+0x2b0/0x61c
> > [ 4738.329494] [<0000000000000000>] shrink_node_memcg+0x23c/0x618
> > [ 4738.329501] [<0000000000000000>] shrink_node+0x1c8/0x304
> > [ 4738.329509] [<0000000000000000>] kswapd+0x680/0x7c4
> > [ 4738.329518] [<0000000000000000>] kthread+0x110/0x120
> > [ 4738.329527] [<0000000000000000>] ret_from_fork+0x10/0x18
> > [ 4738.329538] Mem-Info:
> > [ 4738.329574] active_anon:111826 inactive_anon:65557 isolated_anon:0\x0a active_file:44260 inactive_file:83422 isolated_file:0\x0a unevictable:4158 dirty:117 writeback:0 unstable:0\x0a slab_reclaimable:13943 slab_unreclaimable:43315\x0a mapped:102511 shmem:3299 pagetables:19566 bounce:0\x0a free:3510 free_pcp:553 free_cma:0
> > [ 4738.329593] Node 0 active_anon:447304kB inactive_anon:262228kB active_file:177040kB inactive_file:333688kB unevictable:16632kB isolated(anon):0kB isolated(file):0kB mapped:410044kB d irty:468kB writeback:0kB shmem:13196kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
> > [ 4738.329603] Normal free:14040kB min:7440kB low:94500kB high:98136kB reserved_highatomic:32768KB active_anon:447336kB inactive_anon:261668kB active_file:177572kB inactive_file:333768k B unevictable:16632kB writepending:480kB present:4081664kB managed:3637088kB mlocked:16632kB kernel_stack:47072kB pagetables:78264kB bounce:0kB free_pcp:2280kB local_pcp:720kB free_cma:0kB [ 4738.329607] lowmem_reserve[]: 0 0
> > [ 4738.329615] Normal: 860*4kB (H) 453*8kB (H) 180*16kB (H) 26*32kB (H) 34*64kB (H) 6*128kB (H) 2*256kB (H) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 14232kB
> >
> >
> > >...
> > >
> > >>> > > /*
> > >>> > > * Fast check for order-0 only. If this fails then the reserves
> > >>> > > @@ -3598,9 +3604,12 @@ static inline bool zone_watermark_fast(struct zone *z, unsigned int order,
> > >>> > > * the caller is !atomic then it'll uselessly search the free
> > >>> > > * list. That corner case is then slower but it is harmless.
> > >>> >
> > >>> > Do we need remove or adjust the code comment at this place? So Mel have
> > >>> > foreseen the corner case, just reclaiming to unreserve the highatomic
> > >>> > might be ignored.
> > >>> >
> > >>>
> > >>> Hello thank you for your comment.
> > >>>
> > >>> My previous mail to Vlastimil Babka seems to have html.
> > >>> Let me put again here because I also think the comment should be changed.
> > >>> I'd like to hear opinions from the open source community.
> > >>
> > >> Yeah, your replying mail to Vlastimil looks a little messy on format, I
> > >> didn't scroll down to check.
> > >>
> > >>>
> > >>> Additionally actually I think we need accurate counting of highatomic
> > >>> free after this patch.
> > >>>
> > >>> ----------------------------------------------------------------------------------------
> > >>>
> > >>> As Mel also agreed with me in v1 mail thread, this highatomic reserved should
> > >>> be considered even in watermark fast.
> > >>>
> > >>> The comment, I think, may need to be changed. Prior to this patch, non
> > >>> highatomic
> > >>> allocation may do useless search, but it also can take ALL non highatomic free.
> > >>>
> > >>> With this patch, non highatomic allocation will NOT do useless search. Rather,
> > >>> it may be required direct reclamation even when there are some non
> > >>> high atomic free.
> > >>>
> > >>> i.e)
> > >>> In following situation, watermark check fails (9MB - 8MB < 4MB) though there are
> > >>> enough free (9MB - 4MB > 4MB). If this is really matter, we need to
> > >>> count highatomic
> > >>> free accurately.
> > >>
> > >> Seems to make sense. We only use zone->nr_reserved_highatomic to account
> > >> the reserved highatomic, don't track how much have been used for
> > >> highatomic allocation. But not sure if this will happen often, or just a
> > >> very rare case, whether taking that into account will impact anything.
> > >
> > >Unfortunately there's a problem when trying to account free pages of a migrate
> > >type exactly, as e.g. during reserve_highatomic_pageblock(), some pages might be
> > >in pcplist of other cpu with other migratetype, and once they are freed, the
> > >buddy merging will merge the different migratetypes and distort the accounting.
>
> Yeah, the migratetype could have been cached in page->index before we
> finish the reserve_highatomic_pageblock(). Seems we take a very coarse
> grained way to do the highatomic reservation and accounting. When I went
> through the related code, found out we call
> reserve_highatomic_pageblock() if below condition is met. So what if
> order is 1, and all other pages on the page block have been used? Do we
> possibly have this kind of extreme case?

If I correctly understand your question, yes I think so.
If a hight-order free page was allocated on ALLOC_HARDER context, and the page
was the last order-1, then zone->nr_reserved_highatomic will be increased by
pageblock_nr_pages even though there was actually no free page moved to the
highatomic free page list.

The highatomic logic, I think, was originally designed to reserve in
that coarse grained
way to mitigate highatomic allocation failure.

>
> From Jaewon's captured information, we can see that the available free
> highatomic is even less than half the zone->nr_reserved_highatomic.
> Should we at least limit the reservation to the case with a bigger
> order. E.g 1/2 of pageblock_nr_pages? Please correct me if I am wrong or
> miss anyting.
>

I do not know well, but I think high-order lower than 1/2 of pageblock_nr_pages
also should be considered. i.e) a system using huge order-3 atomic allocation
in a short time.

> "reserved_highatomic:32768KB and actually 14232kB free."

I think this unwanted case might happen by freed pages. The pages allocated
for non-high-atomic also would be freed back into highatomic free
list. But I guess
there was sudden huge use of highatomic and partially freed.

>
> get_page_from_freelist
> {
> ...
> if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
> reserve_highatomic_pageblock(page, zone, order);
> ...
> }
> > >Fixing this for all migratetypes would have performance overhead, so we only do
> > >that for MIGRATE_ISOLATE which is not that frequent (and it took a while to
> > >eliminate all corner cases), and CMA which doesn't change pageblocks dynamically.
> >
> > AFAIK we do not account free cma in pcp either. But yes accurate check could be
> > overhead. For example, __mod_zone_freepage_state should account highatomic free
> > as cma free. And we may see some incorrect accounting issue.
> >
> > >
> > >So either we live with the possible overreclaim due to inaccurate counting per
> > >your example above, or we instead let order-0 atomic allocations use highatomic
> > >reserves.
> > >
> >
> > Additionally regarding existing Mel's comment, let me remove some of it if you
> > don't mind.
> >
> > /*
> > * Fast check for order-0 only. If this fails then the reserves
> > - * need to be calculated. There is a corner case where the check
> > - * passes but only the high-order atomic reserve are free. If
> > - * the caller is !atomic then it'll uselessly search the free
> > - * list. That corner case is then slower but it is harmless.
> > + * need to be calculated.
> > */
> >
> > I will prepare v3 patch.
> > Thank you again.
> >
>