Re: [PATCH 4/5] mm: compaction: Determine if dirty pages can bemigreated without blocking within ->migratepage

From: Mel Gorman
Date: Thu Nov 24 2011 - 07:21:52 EST


On Thu, Nov 24, 2011 at 02:19:43AM +0100, Andrea Arcangeli wrote:
> On Tue, Nov 22, 2011 at 12:59:06PM +0000, Mel Gorman wrote:
> > On Mon, Nov 21, 2011 at 11:45:45PM +0100, Andrea Arcangeli wrote:
> > > On Mon, Nov 21, 2011 at 11:17:26AM +0000, Mel Gorman wrote:
> > > > On Fri, Nov 18, 2011 at 10:35:30PM +0100, Andrea Arcangeli wrote:
> > > > > folks who wants low latency or no memory overhead should simply
> > > > > disable compaction.
> > > >
> > > > That strikes me as being somewhat heavy handed. Compaction should be as
> > > > low latency as possible.
> > >
> > > Yes I was meaning in the very short term. Optimizations are always
> > > possible :) we've just to sort out some issues (as previous part of
> > > the email discussed).
> > >
> >
> > To put some figures on the latency impacts, I added a test to MM
> > Tests similar to yours that is in three parts
> >
> > 1. A process reads /dev/sda and writes to /dev/null
>
> Yes also note, ironically this is likely to be a better test for this
> without the __GFP_MOVABLE in block_dev.c. Even if we want it fixed,
> maybe another source that reduces the non movable pages may be needed then.
>

I'm also running other tests to avoid tuning for just this test cases.
Right now, the list looks like;

1. postmark with something creating anonymous mappings in the background
2. plain USB writing while creating anonymous mappings
3. Same this as described here except instead of reading /dev/sda, it's
reading a large file off a filesystem

Unfortunately I had to restart all the tests yesterday evening as the
USB stick got trashed (to be fair, I had severely abused it). That
set things back a bit.

> > 2. A process reads from /dev/zero and writes to a filesystem on a USB stick
> > 3. When memory is full, a process starts that creates an anonymous
> > mapping
>
> Great.
>
> > Looking at the vfat figures, we see
> >
> > THPAVAIL
> > thpavail-3.0.0-vanilla-thpavailthpavail-3.1.0-vanilla-thpavail thpavail-3.2.0 3.2.0-rc2 3.2.0-rc2
> > 3.0.0-vanilla 3.1.0-vanilla rc2-vanilla migratedirty-v4r2 synclight-v4r2
> > System Time 1.89 ( 0.00%) 1.90 ( -0.21%) 3.83 ( -50.63%) 19.95 ( -90.52%) 71.91 ( -97.37%)
> > +/- 0.15 ( 0.00%) 0.05 ( 205.45%) 2.39 ( -93.87%) 27.59 ( -99.47%) 0.75 ( -80.49%)
> > User Time 0.10 ( 0.00%) 0.10 ( -4.00%) 0.10 ( -7.69%) 0.12 ( -17.24%) 0.11 ( -9.43%)
> > +/- 0.02 ( 0.00%) 0.02 ( -24.28%) 0.03 ( -38.31%) 0.02 ( 0.00%) 0.04 ( -56.57%)
> > Elapsed Time 986.53 ( 0.00%) 1189.77 ( -17.08%) 589.48 ( 67.36%) 506.45 ( 94.79%) 73.39 ( 1244.27%)
> > +/- 35.52 ( 0.00%) 90.09 ( -60.57%) 49.00 ( -27.51%) 213.56 ( -83.37%) 0.47 ( 7420.98%)
> > THP Active 118.80 ( 0.00%) 89.20 ( -33.18%) 35.40 ( -235.59%) 8.00 (-1385.00%) 44.00 ( -170.00%)
> > +/- 65.57 ( 0.00%) 35.37 ( -85.38%) 29.23 ( -124.33%) 12.51 ( -424.28%) 19.83 ( -230.65%)
> > Fault Alloc 308.80 ( 0.00%) 244.20 ( -26.45%) 59.40 ( -419.87%) 81.00 ( -281.23%) 95.80 ( -222.34%)
> > +/- 81.62 ( 0.00%) 80.53 ( -1.36%) 26.85 ( -203.98%) 67.85 ( -20.30%) 34.29 ( -138.05%)
> > Fault Fallback 697.20 ( 0.00%) 761.60 ( -8.46%) 946.40 ( -26.33%) 924.80 ( -24.61%) 909.80 ( -23.37%)
> > +/- 81.62 ( 0.00%) 80.52 ( 1.37%) 26.65 ( 206.28%) 67.93 ( 20.16%) 34.25 ( 138.33%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds) 541.58 645.79 1220.88 1131.04 976.54
> > Total Elapsed Time (seconds) 5066.05 6083.58 3068.08 2653.11 493.96
>
> Notice how the fault alloc goes down from 308-244 to 59 in
> rc2-vanilla, that's the vmscan.c patch I backed out in my series...

Yes, although that backout is nak'd already due to its agression. I
recognise the problem but I'm not considering the revert as a solution
at the moment. I know it helps THP allocation success rates but at
the cost of performance for other workloads if it reclaims too much.

> So
> that explains why I backed out that bit. Ok the elapsed time improved
> because reclaim is used less, but nothing close to x10 times faster
> thanks to it, even for this workload and even on vfat, and if you add
> __GFP_MOVABLE to block_dev.c things will likely change for the better
> too.

Have you checked that those allocations are really movable? I haven't
and it's relatively low on my list of priorities unfortunately.

> (It was x10 times slower if that lead to swapping, likely not the
> case here, but again the __GFP_MOVABLE should take care of it)
>
> > Straight off, 3.0, 3.1 and 3.2-rc2 have a number of THPs in use. 3.0
> > had a rougly 30% success rate with 10% of the mapping using THP but
> > look at the cost in time. It took about 16 minutes per iteration (5
> > iterations which is why total elapsed time looks large) to fault in a
> > mapping the size of physical memory versus 1.5 minutes with my series
> > (which is still very slow).
>
> I'd be curious if you tested my exact status too.
>

Not yet, these tests take time and the loss of the USB stick didn't
help but I'll run it over the weekend.

> > In my series system time is stupidly high but I haven't run oprofile
> > to see exactly where. It could be because it's running compaction
> > because defer_compaction() is not being called at the right times
> > but it could also be because direct reclaim scanning is through the
> > roof albeit better than 3.2-rc2 vanilla. I suspect direct reclaim
> > scanning could be because we are not calling ->writepage in direct
> > reclaim any more - 3.0 wrote 178215 pages while my series wrote 5.
> >
> > The story looks better for ext4 as it is not writing page pages in
> > fallback_migrate_page
>
> That certianly has an effect on the usb stick as they're all vfat.
>

I thought initially that the stupidly high scanning was due to large
numbers of pages under writeback but the counts are too low for that
to be the problem. The immediate reclaim logic for PageReclaim pages
appears to be working. It also is getting throttled although woken
quickly by the faster device. In itself is not a problem but is
the next direction I'm looking.

> > THPAVAIL
> > thpavail-3.0.0-vanilla-thpavailthpavail-3.1.0-vanilla-thpavail thpavail-3.2.0 3.2.0-rc2 3.2.0-rc2
> > 3.0.0-vanilla 3.1.0-vanilla rc2-vanilla migratedirty-v4r2 synclight-v4r2
> > System Time 2.64 ( 0.00%) 2.49 ( 6.10%) 3.09 ( -14.38%) 4.29 ( -38.40%) 12.20 ( -78.33%)
> > +/- 0.29 ( 0.00%) 0.61 ( -51.80%) 1.17 ( -74.90%) 1.33 ( -78.04%) 16.27 ( -98.20%)
> > User Time 0.12 ( 0.00%) 0.08 ( 51.28%) 0.10 ( 13.46%) 0.10 ( 13.46%) 0.09 ( 34.09%)
> > +/- 0.03 ( 0.00%) 0.02 ( 6.30%) 0.01 ( 94.49%) 0.01 ( 158.69%) 0.03 ( 0.00%)
> > Elapsed Time 107.76 ( 0.00%) 64.42 ( 67.29%) 60.89 ( 76.99%) 30.90 ( 248.69%) 50.72 ( 112.45%)
> > +/- 39.83 ( 0.00%) 17.03 ( 133.90%) 8.75 ( 355.08%) 17.20 ( 131.65%) 22.87 ( 74.14%)
> > THP Active 82.80 ( 0.00%) 84.20 ( 1.66%) 35.00 ( -136.57%) 53.00 ( -56.23%) 36.40 ( -127.47%)
> > +/- 81.42 ( 0.00%) 80.82 ( -0.74%) 20.89 ( -289.73%) 34.50 ( -136.01%) 51.51 ( -58.07%)
> > Fault Alloc 246.00 ( 0.00%) 292.60 ( 15.93%) 66.20 ( -271.60%) 129.40 ( -90.11%) 90.20 ( -172.73%)
> > +/- 173.64 ( 0.00%) 161.13 ( -7.76%) 21.30 ( -715.14%) 52.41 ( -231.32%) 104.14 ( -66.74%)
> > Fault Fallback 759.60 ( 0.00%) 712.40 ( 6.63%) 939.80 ( -19.17%) 876.20 ( -13.31%) 915.60 ( -17.04%)
> > +/- 173.49 ( 0.00%) 161.13 ( 7.67%) 21.30 ( 714.44%) 52.16 ( 232.59%) 104.00 ( 66.82%)
> > MMTests Statistics: duration
> > User/Sys Time Running Test (seconds) 63.41 50.64 39.79 58.8 171
> > Total Elapsed Time (seconds) 732.79 483.23 447.17 335.28 412.21
>
> Here again fault alloc goes down from 173-161 to 66 from 3.0-3.1 to
> rc2. And elapsed time with ext4 is the same for 3.1.0 and rc2 (thanks
> to ext4 instead of vfat). So again no big difference in the
> vmscan.c backout.
>
> I think we can keep a limit to how deep reclaim can go (maybe same
> percentage logic we use in kswapd for the wmarks), but what was
> implemented is too strig and just skipping reclaim when compaction
> fails, doesn't sound ok.

Neither is reclaiming way too much memory. I think the answer there
is limiting how deep reclaim goes but have not actually gotten around
to implementing anything yet. I've been distracted unfortunately.

> At least until compaction gets stronger in
> being able to compact memory even when few ram is free. Compaction
> can't free memory, it takes what is free, and creates "contigous free
> pages" from "fragmented free pages". So the more free memory the
> higher chance of compaction success.

Fully agreed and this is part of the reason why min_free_kbytes is
higher when THP is enabled - the other part being it helps fragmentation
avoidance.

> This is why not stopping relclaim
> (if compaction fails) is good and increases the success rate
> significantly (triple it in the above load) without actually slowing
> down the elapsed time at all as shown above.
>
> > THP availability is very variable between iterations and the stall
> > times are nowhere near as bad as with vfat but still, my series cuts
> > the time to fault in a mapping by half (and that is still very slow).
>
> The results you got in compaction success for the first three columns
> is similar to what I get. Of course this is an extreme load with flood
> of allocations and tons of ram dirty so it's normal the success rate
> of compaction is significantly lower than it would be in a light-VM
> condition. But if reclaim is completely stopped by the vmscan patch
> then even under light-VM condition the success rate of THP allocation
> goes down.
>

Fully understood and agreed.

> > If you want, the __GFP_NO_KSWAPD check patch can be dropped as that
> > will also keep David happy until this series is fully baked if we
> > agree in principal that the current stalls are unacceptably large. It
> > will be a stretch to complete it in time for 3.2 though meaning that
> > stalls in 3.2 will be severe with THP enabled.
>
> Well there's no risk that it will be worse than 3.1 at least.

Other the continued risk of hearing complaints about stalling of
course.

> Also
> with a more powerful async compaction the chance we end up in sync
> compaction diminishes.
>

That is a large part of the motivation for this series. Keeping sync
compaction may be necessary (which is why my newer series keeps it)
but we can reduce how often it is used.

> > > <SNIP>
> > > When compaction_suitable is happy about the
> > > wmarks, the compaction loop isn't still as reliable as it could be in
> > > the movable zone, and doing more reclaim helps a lot, the more free
> > > pages the more compaction goes ahead and has a change to compact more
> > > stuff before the two scans meets in the middle. With Rik we thought
> > > about a way to do just a no-swapping reclaim to shrink caches like
> > > slab.
> > >
> >
> > Ok, is there a patch? I'll investigate this later when I get the time
> > but in the shorter term, I'll be looking at why the direct reclaim
> > scan figures are so high.
>
> No patch for this one, just my quick backout of the vmscan.c part
> and your benchmark above explains why I did it.
>

And it's helpful for illustration but a revert is not going to fly.

> > In retrospect, the main difference may be between raw access to the
> > device and when the block device backs a filesystem that is currently
> > mounted. While mounted, the filesystem may pin pages for metadata
> > access which would prevent them ever being migrated and make them
> > unsuitable for use with __GFP_MOVABLE.
>
> The fs will search the bh from the hash and get access to the
> pagecache from there (to avoid the aliasing between blkdev pagecache
> and bh so when you write to /dev/sda on mounted fs it's immediately
> seen by the filesystem and the writes from pagecache won't get
> discared when the fs flushes its dirty buffers). The pin you mention
> should be in the bh, like the superblocks.
>
> But funny thing grow_dev_page already sets __GFP_MOVABLE. That's
> pretty weird and it's probably source of a few not movable pages in
> the movable block. But then many bh are movable... most of them are,
> it's just the superblock that isn't.
>
> But considering grow_dev_page sets __GFP_MOVABLE, any worry about pins
> from the fs on the block_dev.c pagecache shouldn't be a concern...
>

Except in quantity. We can cope with some pollution of MIGRATE_MOVABLE
but if it gets excessive, it will cause a lot of trouble. Superblock
bh's may not be movable but there are not many of them and they are
long lived.

> > In Rik's case, this might be less of an issue. It's using the
> > def_blk_aops as the address space operations that does not have a
> > ->migratepage handler so minimally if they are dirty they have to be
> > written out meaning it will behave more like vfat than ext4 in terms
> > of compaction performance. I did not spot anything that would actually
> > pin the buffers and prevent them from being released though but would
> > prefer it was double checked.
>
> Double checking is good idea.
>
> The reason Rik's case can be improved greatly by the __GFP_MOVABLE in
> block_dev.c, is that when there's very little movable memory, it's
> less likely we compact stuff. All pagecache (actually looks like
> buffercache in free but it's pagecache) goes into the non movable
> pageblocks, so they get polluted by mixture of slab and stuff. So the
> vast majority of memory becomes not compactable. Compaction works
> better if all ram is movable of course. With that missing
> __GFP_MOVABLE it meant async migration would fail always and invoke
> reclaim way more than it would have done it on a real fs. And that
> explains his swap storms I think. I havn't reproduced it but that's my
> theory at least, and it should be tested.
>

Testing to confirm will be necessary as well as checking if the pages
are really movable and what happens if the block device is backed by
a mounted filesystem.

> > If you're going to test a series, can you look at the V4 I posted that
> > attempts to reconcile our two series?
>
> I had to look into other things sorry. Hope to get around testing this
> too. I actually played on some other change in the compaction scan
> which is more a test and not related to the above but I've nothing
> conclusive yet.
>

Understood.

> Also what we have in 3.0/3.1 works _very_ well, there is no urgency to
> make too many changes which weren't benchmarked/tested too much yet
> and that will improve latency but reduce the allocation rate. If the
> allocations are very long lived, giving a better chance to allocate
> those pages is good even if there's write I/O. vfat + slow storage is
> a bit of a pathological case it doesn't actually apply to all
> hardware/fs combinations,

True, but it's an interesting case at the same time. I reckon if we
can make it work well, it'll improve a number of other stall-related
issues at the same time.

> __GFP_MOVABLE missing block_dev also was not
> so common and it most certainly contributed to a reclaim more
> aggressive than it would have happened with that fix. I think you can
> push things one at time without urgency here, and I'd prefer maybe if
> block_dev patch is applied and the other reversed in vmscan.c or
> improved to start limiting only if we're above 8*high or some
> percentage check to allow a little more reclaim than rc2 allows

The limiting is my current preferred option - at least until it is
confirmed that it really is ok to mark block_dev pages movable and that
Rik is ok with the revert.

> (i.e. no reclaim at all which likely results in a failure in hugepage
> allocation). Not unlimited as 3.1 is ok with me but if kswapd can free
> a percentage I don't see why reclaim can't (consdiering more free
> pages in movable pageblocks are needed to succeed compaction). The
> ideal is to improve the compaction rate and at the same time reduce
> reclaim aggressiveness. Let's start with the parts that are more
> obviously right fixes and that don't risk regressions, we don't want
> compaction regressions :).
>

I don't think there are any "obviously right fixes" right now until the
block_dev patch is proven to be ok and that reverting does not regress
Rik's workload. Going to take time.

--
Mel Gorman
SUSE Labs
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/