On Wed, Feb 17, 2010 at 10:55:08AM +0100, Christian Ehrhardt wrote:Christian Ehrhardt wrote:Mel Gorman wrote:THe reason here is probably the the fact that in the bad case a lot of processes are waiting on congestion_wait and are therefore not runnnable and that way not scheduled via cond_resched.On Mon, Feb 15, 2010 at 04:46:53PM +0100, Christian Ehrhardt wrote:[...][...]The differences in asm are pretty much the same, as before rmqueue_bulk was already inlined the actually intended change to its parameters was negligible.
I wondered if it would be important if that is a constant value (-1) or if the reason was caused by that shift. So I tried:
23 @@ -965,7 +965,7 @@
24 set_page_private(page, migratetype);
25 list = &page->lru;
26 }
27 - __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
28 + __mod_zone_page_state(zone, NR_FREE_PAGES, -i);
29 spin_unlock(&zone->lock);
30 return i;
31 }
It "fixes" it only by not calling direct reclaim when it should :(yeah as we both realized -1 was not right so it was more a crazy workaround :-)
Anyway after that being a dead end again I dug even deeper into the details of direct_reclaim - I think we can agree that out of the counters we already know the race between try_to_free making progress and get_page not getting a page causing the congestion_wait is source of the issue.
So what I tried to bring some more light into all that was extending my perf counters to track a few more details in direct_reclaim.
Two patches are attached and apply after the other three already available in that thread.
The intention is
a) to track the time
a1) spent in try_to_free_pages
a2) consumed after try_to_free_pages until get_page_from_freelist
a3) spent in get_pages_from_freelist
b1) after seeing that order!=0 -> drain_all_pages I wondered if that might differ even all calls look like they have zero
b2) tracking the average amount of pages freed by try_to_free_pages for fast path and slow path (progres&!page)
Naming convention (otherwise it might get confusing here)
Good case - the scenario e.g. with e084b and 5f8dcc21 reverted resulting in high throughput and a low ratio of direct_reclaim running into progress&!page
Bad case - the scenario e.g. on a clean 2.6.32
Fast path - direct reclaim calls that did not run into progress&!page
Slow path - direct reclaim calls that ran into progress&!page ending up in a long congestion_wait and therefore called "slow" path
Mini summary of what we had before in huge tables:
fast path slow path
GOOD CASE ~98% ~1-3%
BAD CASE ~70% ~30%
-> leading to throughput impact of e.g. 600 mb/s with 16 iozone threads (worse with even more threads)
Out of the numbers I got the following things might help us to create a new approach to a solution.
The timings showed that that the so called slow case is actually much faster passing though direct_reclaim in bad case.
GOOD CASE duration
a1 Fast-avg-duration_pre_ttf_2_post_ttf 164099
a2 Fast-avg-duration_post_ttf_2_pre_get_page 459
a3 Fast-avg-duration_pre_get_page_2_post_get_page 346
a1 Slow-avg-duration_pre_ttf_2_post_ttf 127621
a2 Slow-avg-duration_post_ttf_2_pre_get_page 1957
a3 Slow-avg-duration_pre_get_page_2_post_get_page 256
BAD CASE duration deviation to good case in %
a1 Fast-avg-duration_pre_ttf_2_post_ttf 122921 -25.09%
a2 Fast-avg-duration_post_ttf_2_pre_get_page 521 13.53%
a3 Fast-avg-duration_pre_get_page_2_post_get_page 244 -29.55%
a1 Slow-avg-duration_pre_ttf_2_post_ttf 109740 -14.01%
a2 Slow-avg-duration_post_ttf_2_pre_get_page 250 -87.18%
a3 Slow-avg-duration_pre_get_page_2_post_get_page 117 -54.16%
That means that in the bad case the execution is much faster. Especially in the case that eventually runs into the slow path try_to_free is 14% faster, more important the time between try_to_free and get_pages is 87%! faster => less than a fifth and finally get_page is 54% faster, but that is probably just failing in an almost empty list which is fast^^.
As I checked order which always was zero the time is not spent in drain_all_pages and the only other thing left might be cond_resched ?!
Everything else are a few assignments so it can't be much else.
But why would e.g. not running into schedule in cond_resched cause get_pages to not find anything - I don't know and I would expect it should be the other way around - the faster you get from free to get the more pages should be left.
I'll test this theory today or tomorrow with cond_resched in direct_reclaim commented out and expect almost no difference.
I thought the progress try_to_free_pages is doing might be important as well so I took numbers for that too.5f8dcc21 introduced per migrate type pcp lists, is it possible that we run in a scenario where try_to_free frees a lot of pages via, but of the wrong migrate type?
From those I see that the good case as well as the bad case has an average of 62 pages freed in fast path.
But in slow path they start to differ - while the good case that is running only seldom in that path anyway frees an average of 1.46 pages (that might be the thing causing it not getting a page eventually) in the bad case it makes a progress of avg 37 pages even in slow path.
PAGES-FREED fast path slow path
GOOD CASE ~62 ~1.46
BAD CASE ~62 ~37
It's possible but the window is small. When a threshold is reached on the
PCP lists, they get drained to the buddy lists and later picked up again
by __rmqueue_fallback(). I had considered the possibility of pages of the
wrong type being on the PCP lists which led to the patch "page-allocator:
Fallback to other per-cpu lists when the target list is empty and memory is
low" but you reported it made no difference even when fallback was allowed
with high watermarks.
[...]