Re: [PATCH 0/4] Really lazy fpu

From: Avi Kivity
Date: Wed Jun 16 2010 - 05:29:17 EST


On 06/16/2010 11:39 AM, Ingo Molnar wrote:
(Cc:-ed various performance/optimization folks)

* Avi Kivity <avi@xxxxxxxxxx> wrote:

On 06/16/2010 10:32 AM, H. Peter Anvin wrote:
On 06/16/2010 12:24 AM, Avi Kivity wrote:
Ingo, Peter, any feedback on this?
Conceptually, this makes sense to me. However, I have a concern about what
happens when a task is scheduled on another CPU, while its FPU state is
still in registers in the original CPU. That would seem to require
expensive IPIs to spill the state in order for the rescheduling to
proceed, and this could really damage performance.
Right, this optimization isn't free.

I think the tradeoff is favourable since task migrations are much
less frequent than context switches within the same cpu. Can the
scheduler experts comment?
This cannot be stated categorically without precise measurements of
known-good, known-bad, average FPU usage and average CPU usage scenarios. All
these workloads have different characteristics.

I can imagine bad effects across all sorts of workloads: tcpbench, AIM7,
various lmbench components, X benchmarks, tiobench - you name it. Combine that
with the fact that most micro-benchmarks won't be using the FPU, while in the
long run most processes will be using the FPU due to SIMD instructions, and
even a positive result might be skewed in practice. This has to be measured
carefully IMO - and I haven't seen a _single_ performance measurement in the
submission mail. This is really essential.

I have really no idea what to measure. Which would you most like to see?

So this does not look like a patch-set we could apply without gathering a
_ton_ of hard data about advantages and disadvantages.

I agree (not to mention that I'm not really close to having a patchset that can be applied).

Note some of the advantages will not be in throughput but in latency (making kernel_fpu_begin() preemptible, and reducing context switch time for event threads).

We can also mitigate some of the IPIs if we know that we're migrating on the
cpu we're migrating from (i.e. we're pushing tasks to another cpu, not
pulling them from their cpu). Is that a common case, and if so, where can I
hook a call to unlazy_fpu() (or its new equivalent)?
When the system goes from idle to less idle then most of the 'fast' migrations
happen on a 'push' model - on a busy CPU we wake up a new task and push it out
to a known-idle CPU. At that point we can indeed unlazy the FPU with probably
little cost.

Can you point me to the code which does this?

But on busy servers where most wakeups are IRQ based the chance of being on
the right CPU is 1/nr_cpus - i.e. decreasing with every new generation of
CPUs.

But don't we usually avoid pulls due to NUMA and cache considerations?

If there's some sucky corner case in theory we could approach it statistically
and measure the ratio of fast vs. slow migration vs. local context switches -
but that looks a bit complex.


I certainly wouldn't want to start with it.

Dunno.
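
To make the push vs. pull distinction discussed above concrete, here is a toy user-space model in C. It is only a sketch under stated assumptions, not kernel code and not taken from the patchset: every name in it (fpu_model, use_fpu(), migrate_task(), the counters) is hypothetical. It assumes that a "really lazy" scheme leaves a task's FPU state in the registers of the last CPU that used it, that a migration run on that same CPU can spill the state locally (the push case), and that a migration run on any other CPU needs an IPI to the holder (the pull case). The three counters correspond to the ratio of local context switches vs. fast vs. slow migrations mentioned above.

```c
#include <assert.h>

#define NR_CPUS 4
#define NO_TASK (-1)
#define NO_CPU  (-1)

/* Toy model (hypothetical): whose FPU state is live in each CPU's registers. */
struct fpu_model {
    int  fpu_owner[NR_CPUS];  /* task id held in this CPU's registers, or NO_TASK */
    long local_switches;      /* context switches that left FPU state untouched */
    long fast_migrations;     /* no IPI: nothing to spill, or spilled locally */
    long slow_migrations;     /* IPI needed: state was live on a remote CPU */
};

static void model_init(struct fpu_model *m)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        m->fpu_owner[cpu] = NO_TASK;
    m->local_switches = 0;
    m->fast_migrations = 0;
    m->slow_migrations = 0;
}

/* Return the CPU whose registers hold the task's state, or NO_CPU. */
static int fpu_cpu_of(const struct fpu_model *m, int task)
{
    assert(task >= 0);
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (m->fpu_owner[cpu] == task)
            return cpu;
    return NO_CPU;
}

/* The task touches the FPU on this CPU and becomes the register owner. */
static void use_fpu(struct fpu_model *m, int cpu, int task)
{
    int holder = fpu_cpu_of(m, task);

    assert(cpu >= 0 && cpu < NR_CPUS);
    /* The state must not still be live on a different CPU. */
    assert(holder == NO_CPU || holder == cpu);
    m->fpu_owner[cpu] = task;
}

/* Context switch within one CPU: the lazy scheme saves nothing here. */
static void context_switch(struct fpu_model *m, int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    m->local_switches++;
}

/*
 * Migrate a task; running_cpu is the CPU executing the migration code.
 * Returns 1 if an IPI to a remote CPU was needed, 0 otherwise.
 * The destination CPU is omitted: after the spill the state is in memory.
 */
static int migrate_task(struct fpu_model *m, int task, int running_cpu)
{
    int holder = fpu_cpu_of(m, task);

    assert(running_cpu >= 0 && running_cpu < NR_CPUS);
    if (holder == NO_CPU || holder == running_cpu) {
        if (holder != NO_CPU)
            m->fpu_owner[holder] = NO_TASK;  /* push: spill locally */
        m->fast_migrations++;
        return 0;
    }
    m->fpu_owner[holder] = NO_TASK;          /* pull: remote spill via IPI */
    m->slow_migrations++;
    return 1;
}
```

In this model, a task that used the FPU on cpu 0 and is then migrated by code running on cpu 0 counts as a fast migration, while the same task migrated by code running on cpu 2 after it used the FPU on cpu 1 counts as a slow one. Feeding a real trace of wakeups and migrations through counters like these would be one way to estimate how often the IPI path is taken, but that is a suggestion, not a measurement.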

--
error compiling committee.c: too many arguments to function
