Re: Enhancement for PLE handler in KVM

From: Paolo Bonzini
Date: Wed Mar 05 2014 - 09:49:41 EST


On 05/03/2014 15:17, Li, Bin (Bin) wrote:
Hello, Paolo,

We are using a customized embedded SMP OS as the guest OS, so it is not meaningful to post the guest OS code.
Also, there are no "performance numbers for common workloads", since there is no common workload to compare against.
In our OS, a big kernel lock still protects the kernel.

Does this mean that the average spinning time for the spinlock is relatively high compared to Linux or Windows?

- when the incorrect boosting happens, the spinning vCPU runs for a
longer time on the pCPU, leaving the lock-holder vCPU less time to
run, since the two are sharing the same pCPU.

Correct. This is an unfortunate problem in the current implementation of PLE.

Adding a hypercall on every kernel entry and kernel exit is not that
expensive: from a trace log collected on an i7 running at 3.0 GHz, the
cost per hypercall is < 1 us.

Right, it is around 1500 cycles, i.e. 0.4-0.5 us per hypercall, or approximately 1 us for enter and exit together.

This is not too bad for a kernel with a big lock, but not acceptable if you do not have it (as is the case for Linux and Windows).
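
For reference, here is a minimal sketch of how such a number can be
measured from inside the guest using the TSC. KVM_HC_NOP is a made-up
hypercall number used purely for illustration; a real guest would time
whatever hypercall it actually issues on kernel entry and exit, and on
Intel hardware the trap instruction is VMCALL:

#include <stdint.h>

#define KVM_HC_NOP 9999			/* hypothetical no-op hypercall */

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
	return ((uint64_t)hi << 32) | lo;
}

static inline long hypercall0(unsigned int nr)
{
	long ret;
	/* VMCALL exits to the hypervisor; nr in rax, result back in rax. */
	asm volatile("vmcall" : "=a"(ret) : "a"(nr) : "memory");
	return ret;
}

uint64_t hypercall_cycles(void)
{
	uint64_t t0 = rdtsc();
	hypercall0(KVM_HC_NOP);
	return rdtsc() - t0;	/* ~1500 cycles, i.e. ~0.5 us at 3 GHz */
}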

Regarding the "paravirtual ticketlock", we did try the same idea in our embedded guest OS.
We got the following results:

a) We implemented an approach similar to the Linux "paravirtual
ticketlock". The system clock jitter does get reduced a lot, but it
still happens at a lower rate: in a stress test running for a few
hours, we still see the big jitter a few times.

Did you find out why? It could happen if the virtual CPU is scheduled out for a relatively long time: a small number of spinning iterations can then account for a relatively large time. For example, if the vCPU is descheduled for 10 ms in the middle of the spin loop, even a handful of iterations spans 10 ms of wall-clock time.

My impression is that you're implementing a paravirtual spinlock, except that you're relying on PLE to decide when to go to sleep. PLE is implemented using the TSC. Can you assume the host TSC is of good quality? If so, perhaps you can try to modify the pv ticketlock algorithm, and use a threshold based on TSC instead of an iteration count?
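
Concretely, the modified slow path could look something like the sketch
below. SPIN_TSC_THRESHOLD, pv_wait() and the ticket layout are all
assumptions for illustration, not the actual Linux pv-ticketlock code;
the point is only the time-based check:

#include <stdint.h>

#define SPIN_TSC_THRESHOLD (1u << 15)	/* cycles; needs per-system tuning */

static inline void cpu_relax(void)
{
	asm volatile("pause");
}

extern uint64_t rdtsc(void);		/* as in the earlier sketch */
/* Hypothetical hypervisor-assisted sleep: halt until the unlocker kicks us. */
extern void pv_wait(volatile uint16_t *head, uint16_t ticket);

void ticket_spin_slowpath(volatile uint16_t *head, uint16_t my_ticket)
{
	uint64_t start = rdtsc();

	while (*head != my_ticket) {
		cpu_relax();
		/* Measure elapsed TSC rather than counting iterations: if
		 * the vCPU was scheduled out mid-spin, this check fires as
		 * soon as it runs again, even after few iterations. */
		if (rdtsc() - start > SPIN_TSC_THRESHOLD) {
			pv_wait(head, my_ticket);
			start = rdtsc();	/* re-arm after wakeup */
		}
	}
}

This keeps the ticket-lock fairness while making the sleep decision
depend on real time rather than on how many loop iterations happened to
execute.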

b) When using the "paravirtual ticketlock", the threshold that decides
"are we spinning too much" becomes an important factor that needs to be
tuned to the final system case by case. What we found in testing is
that different applications running in our guest OS require different
threshold settings.

Did you also find out here why this is the case?

Paolo