Re: [PATCH] mm: make fault_around_bytes configurable

From: Minchan Kim
Date: Mon May 16 2016 - 10:19:10 EST


On Tue, May 10, 2016 at 11:48:42AM +0900, Minchan Kim wrote:
> On Mon, May 09, 2016 at 04:32:51PM +0900, Minchan Kim wrote:
> > Hello,
> >
> > On Mon, Apr 25, 2016 at 05:21:11PM +0530, Vinayak Menon wrote:
> > >
> > >
> > > On 4/22/2016 3:14 PM, Kirill A. Shutemov wrote:
> > > > On Fri, Apr 22, 2016 at 02:15:08PM +0530, Vinayak Menon wrote:
> > > >> On 04/22/2016 05:31 AM, Andrew Morton wrote:
> > > >>> On Mon, 18 Apr 2016 20:47:16 +0530 Vinayak Menon <vinmenon@xxxxxxxxxxxxxx> wrote:
> > > >>>
> > > >>>> Mapping pages around fault is found to cause performance degradation
> > > >>>> in certain use cases. The test performed here is launch of 10 apps
> > > >>>> one by one, doing something with the app each time, and then repeating
> > > >>>> the same sequence once more, on an ARM 64-bit Android device with 2GB
> > > >>>> of RAM. The time taken to launch the apps is found to be better when
> > > >>>> fault around feature is disabled by setting fault_around_bytes to page
> > > >>>> size (4096 in this case).
> > > >>> Well that's one workload, and a somewhat strange one. What is the
> > > >>> effect on other workloads (of which there are a lot!).
> > > >>>
> > > >> This workload emulates the way a user would use his mobile device, opening
> > > >> an application, using it for some time, switching to next, and then coming
> > > >> back to the same application later. Another stat which shows significant
> > > >> degradation on Android with fault_around is device boot up time. I have not
> > > >> tried any other workload other than these.
> > > >>
> > > >>>> The tests were done on 3.18 kernel. 4 extra vmstat counters were added
> > > >>>> for debugging. pgpgoutclean accounts the clean pages reclaimed via
> > > >>>> __delete_from_page_cache. pageref_activate, pageref_activate_vm_exec,
> > > >>>> and pageref_keep accounts the mapped file pages activated and retained
> > > >>>> by page_check_references.
> > > >>>>
> > > >>>> === Without swap ===
> > > >>>>                            3.18       3.18-fault_around_bytes=4096
> > > >>>> -----------------------------------------------------------------------
> > > >>>> workingset_refault         691100     664339
> > > >>>> workingset_activate        210379     179139
> > > >>>> pgpgin                     4676096    4492780
> > > >>>> pgpgout                    163967     96711
> > > >>>> pgpgoutclean               1090664    990659
> > > >>>> pgalloc_dma                3463111    3328299
> > > >>>> pgfree                     3502365    3363866
> > > >>>> pgactivate                 568134     238570
> > > >>>> pgdeactivate               752260     392138
> > > >>>> pageref_activate           315078     121705
> > > >>>> pageref_activate_vm_exec   162940     55815
> > > >>>> pageref_keep               141354     51011
> > > >>>> pgmajfault                 24863      23633
> > > >>>> pgrefill_dma               1116370    544042
> > > >>>> pgscan_kswapd_dma          1735186    1234622
> > > >>>> pgsteal_kswapd_dma         1121769    1005725
> > > >>>> pgscan_direct_dma          12966      1090
> > > >>>> pgsteal_direct_dma         6209       967
> > > >>>> slabs_scanned              1539849    977351
> > > >>>> pageoutrun                 1260       1333
> > > >>>> allocstall                 47         7
> > > >>>>
> > > >>>> === With swap ===
> > > >>>>                            3.18       3.18-fault_around_bytes=4096
> > > >>>> -----------------------------------------------------------------------
> > > >>>> workingset_refault         597687     878109
> > > >>>> workingset_activate        167169     254037
> > > >>>> pgpgin                     4035424    5157348
> > > >>>> pgpgout                    162151     85231
> > > >>>> pgpgoutclean               928587     1225029
> > > >>>> pswpin                     46033      17100
> > > >>>> pswpout                    237952     127686
> > > >>>> pgalloc_dma                3305034    3542614
> > > >>>> pgfree                     3354989    3592132
> > > >>>> pgactivate                 626468     355275
> > > >>>> pgdeactivate               990205     771902
> > > >>>> pageref_activate           294780     157106
> > > >>>> pageref_activate_vm_exec   141722     63469
> > > >>>> pageref_keep               121931     63028
> > > >>>> pgmajfault                 67818      45643
> > > >>>> pgrefill_dma               1324023    977192
> > > >>>> pgscan_kswapd_dma          1825267    1720322
> > > >>>> pgsteal_kswapd_dma         1181882    1365500
> > > >>>> pgscan_direct_dma          41957      9622
> > > >>>> pgsteal_direct_dma         25136      6759
> > > >>>> slabs_scanned              689575     542705
> > > >>>> pageoutrun                 1234       1538
> > > >>>> allocstall                 110        26
> > > >>>>
> > > >>>> Looks like with fault_around, there is more pressure on reclaim because
> > > >>>> of the presence of more mapped pages, resulting in more IO activity,
> > > >>>> more faults, more swapping, and allocstalls.
> > > >>> A few of those things did get a bit worse?
> > > >> I think some numbers (like workingset, pgpgin, pgpgoutclean etc) look
> > > >> better with fault_around because the increased number of mapped pages
> > > >> results in fewer file pages being reclaimed (pageref_activate,
> > > >> pageref_activate_vm_exec, pageref_keep above), at the cost of increased
> > > >> swapping. Latency numbers are far worse with fault_around_bytes + swap,
> > > >> possibly because of increased swapping, a decrease in kswapd efficiency
> > > >> and an increase in allocstalls.
> > > >> So the problem looks to be that unwanted pages are mapped around the fault
> > > >> and page_check_references is unaware of this.
> > > > Hm. It makes me think we should make the ptes set up by faultaround old.
> > > >
> > > > Although, it would defeat (to some extent) the purpose of faultaround on
> > > > architectures without a HW accessed bit :-/
> > > >
> > > > Could you check if the patch below changes the situation?
> > > > It would require some more work to avoid marking the pte we took the fault for as old.
> > >
> > > Column at the end shows the values with the patch
> > >
> > >                            3.18       3.18-fab=4096  3.18-Kirill's-fix
> > > ----------------------------------------------------------------------
> > > workingset_refault         597687     878109         790207
> > > workingset_activate        167169     254037         207912
> > > pgpgin                     4035424    5157348        4793116
> > > pgpgout                    162151     85231          85539
> > > pgpgoutclean               928587     1225029        1129088
> > > pswpin                     46033      17100          8926
> > > pswpout                    237952     127686         103435
> > > pgalloc_dma                3305034    3542614        3401000
> > > pgfree                     3354989    3592132        3457783
> > > pgactivate                 626468     355275         326716
> > > pgdeactivate               990205     771902         697392
> > > pageref_activate           294780     157106         138451
> > > pageref_activate_vm_exec   141722     63469          64585
> > > pageref_keep               121931     63028          65811
> > > pgmajfault                 67818      45643          34944
> > > pgrefill_dma               1324023    977192         874497
> > > pgscan_kswapd_dma          1825267    1720322        1577483
> > > pgsteal_kswapd_dma         1181882    1365500        1243968
> > > pgscan_direct_dma          41957      9622           9387
> > > pgsteal_direct_dma         25136      6759           7108
> > > slabs_scanned              689575     542705         618839
> > > pageoutrun                 1234       1538           1450
> > > allocstall                 110        26             13
> > >
> > > Everything seems to have improved except slabs_scanned, possibly because
> > > of this check, which Minchan pointed out, that results in higher pressure on slabs:
> > >
> > > if (page_mapped(page) || PageSwapCache(page))
> > >         sc->nr_scanned++;
> > >
> > > I had added some traces to monitor the vmpressure values. Those also seem to
> > > be high, possibly for the same reason.
> > >
> > > Should the pressure be doubled only if the page is mapped and referenced?
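To make the question above concrete, here is a minimal user-space model of that accounting. It is only a sketch: struct page_model and model_nr_scanned() are hypothetical names, not kernel code, and it assumes every scanned page is counted once before the quoted check counts mapped or swapcache pages a second time. The require_referenced flag models the variant Vinayak asks about; keeping the swapcache case unchanged in that variant is my assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the page state the check looks at. */
struct page_model {
	bool mapped;     /* at least one pte maps the page */
	bool swapcache;  /* page sits in the swap cache */
	bool referenced; /* page was accessed since the last scan */
};

/*
 * Model of the nr_scanned accounting: each page counts once, and a
 * mapped or swapcache page counts a second time. With require_referenced
 * set, a mapped page is only counted twice if it was also referenced
 * (the proposed variant, not existing behaviour).
 */
static unsigned long model_nr_scanned(const struct page_model *pages,
				      size_t n, bool require_referenced)
{
	unsigned long nr_scanned = 0;

	assert(pages != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		const struct page_model *p = &pages[i];
		bool doubled = p->swapcache ||
			(p->mapped && (!require_referenced || p->referenced));

		nr_scanned++;		/* the scan itself */
		if (doubled)
			nr_scanned++;	/* extra slab/vmpressure weight */
	}
	return nr_scanned;
}
```

In this model, pages that fault-around mapped but nobody touched are exactly the ones that stop being counted twice when require_referenced is set.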
> >
> > Yes, pte_mkold is not perfect at the moment.
> >
> > Anyway, the above heuristic has been there for a long time, maybe since I
> > was born :) (I don't want to argue about why it's there or whether it's
> > right.) So I'm really hesitant to change it, because that might bite some
> > workloads. (I don't mean I'm against it; I just don't want to make the
> > change myself, to avoid potential blame.) IOW, Kirill's fault_around broke
> > it too, so it could bite some workloads.
> >
> > At least, as Vinayak mentioned, it would change the vmpressure level, so
> > users of vmpressure can be affected. AFAIK, some vendors on the embedded
> > side rely on vmpressure to control memory management, so it will hurt them.
> > Slab shrinking behavior was changed as well. Unfortunately, I don't know
> > of any workload that depends on it.
> >
> > As another regression, in my company's product we snapshot a process
> > together with its workingset for fast resume later. For that, we have
> > treated pte-mapped pages as the workingset, but the snapshot started to
> > include non-workingset pages once fault-around was merged. That means the
> > snapshot image got bigger, so we need more storage space and things start
> > to slow down. I guess mincore(2) users will be affected.
> >
> > Additional note: in the embedded world there are lots of ARM products
> > without a HW access bit, although ARM started to support it recently, and
> > sequential file access workloads are not as important there as memory
> > reclaim. So fault_around's benefit could be highly limited compared to
> > server workloads on HW-access-bit architectures.
> >
> > I want to ask again.
> > I guess we could disable fault_around by kernel parameter, but does it
> > sound reasonable to enable fault_around by default for every arch
> > at the cost of the above regressions?
> >
> > I'm not against that. All I want is that some fixes for the
> > regressions go to -stable.
> >
> > >
> > > There is big improvement in avg latency, but still 5% higher than with fault_around
> > > disabled. I will try to debug this further.
>
> I did a quick test on my ARM machine.
>
> Sequential read of every word of an mmapped 512M file
>
> = vanilla fault_around=4096 =
> minor fault: 131291
> elapsed time(usec): 6686236
>
> = vanilla fault_around=65536 =
> minor fault: 12577
> elapsed time(usec): 6586959
>
> I tested 3 times and the results seemed to be stable.
> Minor faults were reduced by 90%. That's a huge win, but looking at elapsed
> time it's not: just about 1.5%.
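The two percentages above follow directly from the vanilla numbers. A small helper (percent_reduction() is just an illustrative name) shows the arithmetic:

```c
#include <assert.h>

/* Relative reduction of `after` versus `before`, in percent. */
static double percent_reduction(double before, double after)
{
	assert(before > 0.0);
	return (before - after) / before * 100.0;
}

/*
 * Minor faults:  percent_reduction(131291, 12577)     is about 90.4
 * Elapsed time:  percent_reduction(6686236, 6586959)  is about 1.48
 */
```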
>
> = pte_mkold applied fault_around=4096 =
> minor fault: 131291
> elapsed time(usec): 6608358
>
> = pte_mkold applied fault_around=65536 =
> minor fault: 143609
> elapsed time(usec): 6772520
>
> I tested 3 times and the results seemed to be stable.
> With fault_around, minor faults actually increased and elapsed time got
> slower.
> The gain is really not clear.
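
For anyone who wants to repeat the measurement, the test above can be sketched like this. It is a reconstruction under assumptions, not the exact program behind the numbers: the names (struct read_result, mmap_read_words()) are made up for illustration, and minor faults are taken from the ru_minflt delta reported by getrusage(2).

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE
#endif
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>

struct read_result {
	long minor_faults;	/* ru_minflt delta across the read loop */
	long long elapsed_usec;	/* wall-clock time of the read loop */
	unsigned long checksum;	/* sum of all words, keeps the loads alive */
};

/* Map `path` read-only and read every word once, front to back. */
static int mmap_read_words(const char *path, struct read_result *res)
{
	struct rusage before, after;
	struct timeval t0, t1;
	struct stat st;
	const unsigned long *map;
	unsigned long sum = 0;
	size_t nwords;
	int fd;

	assert(path != NULL && res != NULL);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0 || st.st_size == 0) {
		close(fd);
		return -1;
	}
	map = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (map == MAP_FAILED)
		return -1;
	nwords = (size_t)st.st_size / sizeof(*map);

	getrusage(RUSAGE_SELF, &before);
	gettimeofday(&t0, NULL);
	for (size_t i = 0; i < nwords; i++)
		sum += map[i];
	gettimeofday(&t1, NULL);
	getrusage(RUSAGE_SELF, &after);

	res->minor_faults = after.ru_minflt - before.ru_minflt;
	res->elapsed_usec = (t1.tv_sec - t0.tv_sec) * 1000000LL +
			    (t1.tv_usec - t0.tv_usec);
	res->checksum = sum;
	munmap((void *)map, (size_t)st.st_size);
	return 0;
}
```

Run it on a file that is already in the page cache, once per fault_around_bytes setting, and compare minor_faults and elapsed_usec between runs.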

Kirill,
You wanted a test on a system without a HW access bit, and I ran one.
What's your opinion?