Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"
From: Mel Gorman
Date: Fri Feb 19 2010 - 10:19:54 EST
On Fri, Feb 19, 2010 at 12:19:27PM +0100, Christian Ehrhardt wrote:
> >>>>
> >>>> PAGES-FREED   fast path   slow path
> >>>> GOOD CASE     ~62         ~1.46
> >>>> BAD CASE      ~62         ~37
> >>> 5f8dcc21 introduced per-migratetype pcp lists; is it possible that
> >>> we run into a scenario where try_to_free frees a lot of pages, but
> >>> of the wrong migrate type?
> >>
> >> It's possible but the window is small. When a threshold is reached on
> >> the PCP lists, they get drained to the buddy lists and later picked up
> >> again by __rmqueue_fallback(). I had considered the possibility of
> >> pages of the wrong type being on the PCP lists, which led to the patch
> >> "page-allocator: Fallback to other per-cpu lists when the target list
> >> is empty and memory is low", but you reported it made no difference
> >> even when fallback was allowed with high watermarks.
> >> [...]
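For reference, the fallback behaviour that patch aimed for can be modelled in plain userspace C. This is a simplified sketch under invented names (`pcp_model`, `pcp_alloc`); the real per-CPU lists live in struct per_cpu_pages and the fallback order is defined in mm/page_alloc.c:

```c
#define MIGRATE_TYPES 3  /* UNMOVABLE, RECLAIMABLE, MOVABLE - simplified */

/* Toy model: each migratetype has its own free list (count only). */
struct pcp_model {
	int count[MIGRATE_TYPES];
};

/*
 * Allocate one page: try the requested migratetype first, then fall
 * back to the other lists, mirroring the idea behind the
 * "fallback to other per-cpu lists" patch. Returns the migratetype
 * actually used, or -1 if every list is empty.
 */
int pcp_alloc(struct pcp_model *pcp, int migratetype)
{
	int mt;

	if (pcp->count[migratetype] > 0) {
		pcp->count[migratetype]--;
		return migratetype;	/* fast path: target list had pages */
	}
	for (mt = 0; mt < MIGRATE_TYPES; mt++) {
		if (mt != migratetype && pcp->count[mt] > 0) {
			pcp->count[mt]--;
			return mt;	/* fallback: borrow from another list */
		}
	}
	return -1;			/* all lists empty: refill from buddy */
}
```

Without the fallback loop, an allocation of the "wrong" migratetype fails even though pages sit on the other lists, which is the scenario Christian asked about.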
> >
> > Today I created rather huge debug logs - I'll spare everyone the
> > excessive detail.
> > In the end it comes down to a /proc/zoneinfo-like output, extended to
> > list everything per migrate type as well.
> >
> > From that I still think there should be plenty of pages to get the
> > allocation through, as suggested by the high number of pages freed
> > by try_to_free.
> [...]
>
> > More about that tomorrow,
>
> Well tomorrow is now, and I think I got some important new insights.
>
> As mentioned before, I realized that a second call still fails most of the
> time (>99%). Therefore I added a "debugme" parameter to get_page_from_freelist
> and buffered_rmqueue to see where exactly the allocation fails (patch
> attached).
>
> Now with debugme active in a second call, after it had progress && !page in
> direct_reclaim, I saw the following repeating pattern in most of the cases:
> get_page_from_freelist - zone loop - current zone 'DMA'
> get_page_from_freelist - watermark check due to !(alloc_flags & ALLOC_NO_WATERMARKS)
> get_page_from_freelist - goto zone_full due to zone_reclaim_mode==0
> get_page_from_freelist - return page '(null)'
>
> Ok - now we at least know exactly why it gets no page.
> Remember, there are plenty of pages, as shown in my zoneinfo-like report in the last mail.
> I didn't expect that, but it is actually the watermarks that stop the allocations here.
That is somewhat expected. We also don't want to go underneath them
because that can lead to system deadlock.
> The zone_watermark_ok check reports that there are not enough pages for the
> current watermark, and finally it ends up at zone_reclaim_mode, which is
> always zero on s390 as we are not CONFIG_NUMA.
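The check in question is zone_watermark_ok(); its core logic can be sketched in userspace C roughly as follows (simplified from mm/page_alloc.c of that era - lowmem_reserve and the ALLOC_HIGH/ALLOC_HARDER adjustments are elided, and the function name is my own):

```c
#define MAX_ORDER 11

/*
 * Simplified model of zone_watermark_ok(): the zone must keep 'mark'
 * pages free after this allocation, and at each order below the
 * requested one there must still be enough higher-order pages left.
 */
int watermark_ok(long free_pages, unsigned int order,
		 long mark, const long nr_free[MAX_ORDER])
{
	unsigned int o;

	free_pages -= (1L << order) - 1;	/* pages this request takes */
	if (free_pages <= mark)
		return 0;			/* below watermark: refuse */

	for (o = 0; o < order; o++) {
		/* pages at this order cannot satisfy a larger request */
		free_pages -= nr_free[o] << o;
		mark >>= 1;			/* halve requirement per order */
		if (free_pages <= mark)
			return 0;
	}
	return 1;
}
```

So even with "plenty of pages" in absolute terms, an order-0 page cache allocation is refused once free_pages dips under the per-zone mark, which matches the trace above.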
>
> Ok, remember my scenario - several parallel iozone sequential read processes
> - there's not much allocation going on except for the page cache readahead
> related to that read workload.
> The allocations for page cache seem to select no special watermarks via
> their GFP flags and therefore get stalled in congestion_wait - which, with
> no writes in flight, consumes its full timeout.
>
Which I'd expect to some extent, but it still stuns me that e084b makes
a difference to any of this. The one-list-per-migratetype patch would
make some sense except restoring something similar to the old behaviour
didn't help either.
> As I see significant impacts on iozone throughput, and not just
> e.g. bad counters in direct_reclaim, the congestion_wait stalling seems to
> happen often/long enough to stall the application I/O itself.
>
> That means that from the time VFS starts a readahead, the page allocation
> seems to stall long enough that the data is not ready when the application
> tries to read it, while it would be if the allocation were fast enough.
>
> A simple test for this theory was to allow those failing allocations a
> second chance without watermark restrictions before putting them to sleep
> for such a long time.
>
> Index: linux-2.6-git/mm/page_alloc.c
> ===================================================================
> --- linux-2.6-git.orig/mm/page_alloc.c 2010-02-19 09:53:14.000000000 +0100
> +++ linux-2.6-git/mm/page_alloc.c 2010-02-19 09:56:26.000000000 +0100
> @@ -1954,7 +1954,22 @@
>
> if (should_alloc_retry(gfp_mask, order, pages_reclaimed)) {
> /* Wait for some write requests to complete then retry */
> - congestion_wait(BLK_RW_ASYNC, HZ/50);
> + /*
> + * If it gets here try it without watermarks before going
> + * to sleep.
> + *
> + * This will end up in alloc_high_priority and if that fails
> + * once more direct_reclaim but this time without watermark
> + * checks.
> + *
> + * FIXME: that is just for verification - a real fix needs to
> + * ensure e.g. page cache allocations don't drain all pages
> + * under watermark
> + */
> + if (!(alloc_flags & ALLOC_NO_WATERMARKS))
> + alloc_flags &= ALLOC_NO_WATERMARKS;
> + else
> + congestion_wait(BLK_RW_ASYNC, HZ/50);
> goto rebalance;
> }
>
> This fixes all issues I have, but as stated in the FIXME it is unfortunately
> no fix for the real world.
It's possible to deadlock a system with this patch. It's also not acting
as you intended. I think you meant either |= or = there but anyway.
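The mix-up is easy to demonstrate in isolation: ANDing with a bit that is not currently set clears the whole mask rather than granting the exemption. A hypothetical userspace demo (the flag values below are illustrative, not the kernel's actual definitions):

```c
/* Illustrative flag values - not the kernel's actual definitions. */
#define ALLOC_WMARK_LOW     0x02
#define ALLOC_NO_WATERMARKS 0x04

/* The hunk as posted: AND with a bit that is not currently set. */
int flags_as_posted(int alloc_flags)
{
	alloc_flags &= ALLOC_NO_WATERMARKS;	/* clears every other bit... */
	return alloc_flags;			/* ...and yields 0 here */
}

/* What was presumably intended: set the bit, keep the rest. */
int flags_as_intended(int alloc_flags)
{
	alloc_flags |= ALLOC_NO_WATERMARKS;
	return alloc_flags;
}
```

With the `&=` version the retry neither gains ALLOC_NO_WATERMARKS nor keeps its previous flags, so whatever effect the patch had did not come from the intended second chance.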
> Unfortunately, even now, knowing the place of the issue so well, I don't see
> the connection to commits e084b+5f8dcc21
Still a mystery.
> - I couldn't find anything, but did they change the accounting somewhere,
> or e.g. change the timing/order of watermark updates and allocations?
>
Not that I can think of.
> Eventually it might come down to a discussion of allocation priorities, and
> we might even keep them as is and accept this issue - but I would still prefer
> a good second-chance implementation, different page cache allocation flags, or
> something else that explicitly solves this issue.
>
Along those lines, the patch that replaced congestion_wait() with a
waitqueue makes some sense.
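The idea behind that patch can be modelled with a condition variable: instead of sleeping for a fixed HZ/50 regardless of what happens, the allocator blocks until the watermark is actually met and is woken as soon as pages are freed. A userspace sketch under invented names (`toy_zone` etc.; the real patch works on a per-zone waitqueue in the kernel):

```c
#include <pthread.h>

/* Toy zone: free-page counter plus a waitqueue-like condvar. */
struct toy_zone {
	long free_pages;
	long watermark;
	pthread_mutex_t lock;
	pthread_cond_t wait;	/* stands in for the zone's waitqueue */
};

/* Free pages and wake anyone waiting for the watermark. */
void toy_free_pages(struct toy_zone *z, long nr)
{
	pthread_mutex_lock(&z->lock);
	z->free_pages += nr;
	if (z->free_pages > z->watermark)
		pthread_cond_broadcast(&z->wait);
	pthread_mutex_unlock(&z->lock);
}

/*
 * Instead of congestion_wait(HZ/50), block until the watermark is
 * actually met - the sleeper wakes as soon as enough memory is freed,
 * not after an arbitrary timeout.
 */
void toy_wait_for_watermark(struct toy_zone *z)
{
	pthread_mutex_lock(&z->lock);
	while (z->free_pages <= z->watermark)
		pthread_cond_wait(&z->wait, &z->lock);
	pthread_mutex_unlock(&z->lock);
}
```

The key property for the readahead case above is that a waiter no longer burns the full timeout when reclaim has already replenished the zone.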
> Mel's patch that replaces congestion_wait with a wait for the zone
> watermarks to be met again is definitely a step in the right direction and
> should go into upstream and the long-term support branches.
I'll need to do a number of tests before I can move that upstream but I
don't think it's a merge candidate. Unfortunately, I'll be offline for a
week starting tomorrow so I won't be able to do the testing.
When I get back, I'll revisit those patches with the view to pushing
them upstream. I hate to treat symptoms here without knowing the
underlying problem but this has been spinning in circles for ages with
little forward progress :(
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab