Nick,
On Mon, Mar 07, 2005 at 04:34:18PM +1100, Nick Piggin wrote:
Active balancing should only kick in after the prescribed number
of rebalancing failures - can_migrate_task will see this, and
will allow the balancing to take place.
We are resetting the nr_balance_failed to cache_nice_tries after kicking active balancing. But can_migrate_task will succeed only if
nr_balance_failed > cache_nice_tries.
That said, we currently aren't doing _really_ well for SMT on
some workloads, however with this patch we are heading in the
right direction I think.
Lets take an example of three packages with two logical threads each. Assume P0 is loaded with two processes(one in each logical thread), P1 contains only one process and P2 is idle.
In this example, active balance will be kicked on one of the threads(assume
thread 0) in P0, which then should find an idle package and move it to one of the idle threads in P2.
With your current patch, idle package check in active_load_balance has disappeared, and we may endup moving the process from thread 0 to thread 1 in P0. I can't really make logic out of the active_load_balance code after your patch 10/13
I have been mainly looking at tuning CMP Opterons recently (they
are closer to a "traditional" SMP+NUMA than SMT, when it comes
to the scheduler's point of view). However, in earlier revisions
of the patch I had been looking at SMT performance and was able
to get it much closer to perfect:
I am reasonably sure that the removal of cpu_and_siblings_are_idle check
from active_load_balance will cause HT performance regressions.
I was working on a 4 socket x440 with HT. The problem area is
usually when the load is lower than the number of logical CPUs.
So on tbench, we do say 450MB/s with 4 or more threads without
HT, and 550MB/s with 8 or more threads with HT, however we only
do 300MB/s with 4 threads.
Are you saying 2.6.11 has this problem?
Those aren't the exact numbers, but that's basically what they
look like. Now I was able to bring the 4 thread + HT case much
closer to the 4 thread - HT numbers, but with earlier patchsets.
When I get a chance I will do more tests on the HT system, but
the x440 is infuriating for fine tuning performance, because it
is a NUMA system, but it doesn't tell the kernel about it, so
it will randomly schedule things on "far away" CPUs, and results
vary.
Why don't you use any other simple HT+SMP system?
I will also do some performance analysis with your other patches
on some of the systems that I have access to.