On Wed, Nov 30, 2016 at 12:52:28PM +0100, Nicolai Hähnle wrote:
> On 30.11.2016 10:40, Chris Wilson wrote:
>> On Mon, Nov 28, 2016 at 01:20:01PM +0100, Nicolai Hähnle wrote:
>>> I've included timings from a contention-heavy stress test with some of
>>> the patches. The stress test performs actual GPU operations, which take
>>> up a good chunk of the wall time, but even so, the series still manages
>>> to improve the wall time quite a bit.
>>
>> In looking at your contention scenarios, what were the average and
>> maximum wait-list sizes? Just wondering if it makes sense to use an
>> rbtree + first_waiter instead of a sorted list from the start.
>
> I haven't measured this with the new series. Previously, while I was
> debugging the deadlock on older kernels, I occasionally saw wait lists
> of up to ~20 tasks. Spit-balling, I'd say the average over all the
> deadlock cases was no more than ~5. The average _without_ deadlocks
> should be lower, if anything.

Right, I wasn't expecting the list to be large, certainly no larger than
the typical number of cores. That is on the borderline of where a more
complex tree starts to pay off.

> I saw that your test cases go quite a bit higher, but even the rather
> extreme load I was testing with (not quite a load from an actual
> application, though it is related to one) has 40 threads, and so a
> theoretical maximum of 40.

The stress loads were just values plucked out of nowhere to try and have
a reasonable stab at hitting the deadlock. Certainly, if we were to wrap
that up in a microbenchmark, we would want wider coverage (so that the
graph against contention is more useful).

Do you have a branch I can pull the patches from (or what did you use as
the base)?
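To put the wait-list sizes mentioned in this thread in perspective (an average of ~5, ~20 observed, 40 as a theoretical maximum), here is a rough sketch of the sorted-list versus tree trade-off. It is written in Python for brevity, and every name in it (`SortedWaitList`, `add_waiter`, `worst_case_insert`, `balanced_tree_descent`) is hypothetical. This is not the ww_mutex code: it only counts key comparisons for a linear insertion into a stamp-sorted list and compares that with one descent of an ideally balanced tree, ignoring cache behaviour and rebalancing costs.

```python
import math


class SortedWaitList:
    """Hypothetical wait list kept sorted by ascending stamp (oldest first).

    Not the kernel's ww_mutex code: just enough to count the comparisons
    a linear insertion walk performs.
    """

    def __init__(self):
        self.stamps = []
        self.comparisons = 0

    def add_waiter(self, stamp):
        # Walk from the head until we find a younger (larger) stamp.
        # Equal stamps go after existing ones (FIFO among ties).
        pos = 0
        while pos < len(self.stamps):
            self.comparisons += 1
            if stamp < self.stamps[pos]:
                break
            pos += 1
        self.stamps.insert(pos, stamp)

    def first_waiter(self):
        # O(1): the oldest waiter is always at the head of the list.
        return self.stamps[0] if self.stamps else None


def worst_case_insert(n):
    """Comparisons to insert the youngest waiter into a list of n - 1."""
    wl = SortedWaitList()
    for stamp in range(n - 1):
        wl.add_waiter(stamp)
    before = wl.comparisons
    wl.add_waiter(n)  # younger than every existing waiter: walks the whole list
    return wl.comparisons - before


def balanced_tree_descent(n):
    """Comparisons for one descent of a perfectly balanced tree of n nodes.

    A red-black tree can be up to about twice this deep, so this is a
    best case for the tree.
    """
    return math.ceil(math.log2(n + 1))


if __name__ == "__main__":
    for n in (5, 20, 40):
        print(n, worst_case_insert(n), balanced_tree_descent(n))
```

At 40 waiters the worst-case list walk is 39 comparisons against about 6 for an ideally balanced tree. At the ~5 average it is 4 against 3, which is consistent with the list sizes here sitting near the point where a tree starts to pay off.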