Re: Performance regression in scsi sequential throughput (iozone) due to "e084b - page-allocator: preserve PFN ordering when __GFP_COLD is set"

From: Christian Ehrhardt
Date: Thu Jan 14 2010 - 07:30:36 EST


Mel Gorman wrote:
On Fri, Dec 18, 2009 at 02:38:15PM +0100, Christian Ehrhardt wrote:
Christian Ehrhardt wrote:
[...]
You mention that the "GOOD" kernel is one commit before e084b but can
you test with just that patch reverted please? As the patch is only
about list manipulation, I'm failing to see how it can affect
congestion_wait().

Unfortunately I don't understand memory management well enough to find the relation between e084b actually changing the order and position of freed pages in the LRU list and __alloc_pages_direct_reclaim not getting pages as effectively as before.

Off-hand, neither can I. I regret that I'm going off-line for the
Christmas very soon and I'll be in the middle of nowhere with no
internet connection or test machines in the interim. It's going to be
the new year before I get to look at this properly.

Even if this patch is not the real problem, your instrumentation patches
will help pin down the real problem. Thanks and sorry I'll be delayed in
getting back to you properly.

Back to business :-)
First - yes, it is reproducible in 2.6.32 final and fixable by reverting e084b.
But I don't think reverting it is really a fix, as no one really understands how e084b causes this regression (it might just work around whatever goes on in the background, and someone could still hit the hidden issue).

Let's summarize what we know so far:
- the patch e084b causes this regression
- it causes it only under very rare, if not merely theoretical, high memory constraints
- time is lost between the read system call and entry into the block device layer
- via debugging and discussion we found that the lost time is spent in congestion_wait

I did not yet check if it is possible, but we might have found a spot that could fix the issue.
congestion_wait() is documented as: "Waits for up to @timeout jiffies for a backing_dev (any backing_dev) to exit write congestion. If no backing_devs are congested then just wait for the next write to be completed."
Now in some cases (and also in my test) there are absolutely no writes to do or in flight. So maybe it would be possible to either detect this in page_alloc.c before calling congestion_wait, or to let congestion_wait return an error code or something similar.

As far as I can see, the kernel currently does a loop like this (pseudo code, simplified):
1. get a page <- fail
2. try_to_free page <- returns >=1 pages freed
3. get_page_from_freelist <- fails
4. should_alloc_retry <- true
5. congestion_wait <- FUTILE - just losing time, nothing will be freed by writes
6. goto 1
-> So maybe we should detect the futility of that congestion_wait call and not do it at all - retry instantly, go to OOM, ..., whatever (that's open for discussion).

If we can't fix it that way, I would propose we keep the code as is, hoping that it is a theoretical case that never hits a customer system. But in that case we should definitely add a userspace-readable counter to congestion_wait, similar to my debug patches. That would allow anyone who suspects an issue might be related to this to verify it by checking the congestion_wait counter.
This should not be too hard technically, and it is not performance critical either, as congestion_wait is going to wait anyway. On the other hand, everyone hates introducing new kernel interfaces that then have to be kept compatible until nirvana - especially for bug tracking it's not the best idea :-) So it should be our very last resort.

Comments/opinions on these ideas?

--

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, Open Virtualization
