Re: [PATCH] x86 rwsem optimization extreme

From: Zachary Amsden
Date: Wed Feb 17 2010 - 23:26:31 EST



On 02/17/2010 05:53 PM, Linus Torvalds wrote:
>> - but adc _throughput_ is also typically much higher, which indicates
>> that even if you do flag renaming, the 'adc' quite likely only
>> schedules in a single ALU unit.
>>
>> For example, on a Pentium, adc/sbb can only go in the U pipe, and I think
>> the same is true of 'stc'. Now, nobody likely cares about Pentiums any
>> more, but the point is, 'adc' does often have constraints that a regular
>> 'add' does not, and there's an example where a 'stc+adc' pair would at the
>> very least have to be scheduled with an instruction in between.
>
> No doubt. I doubt it much matters in this context, but either way I
> think the patch is probably a bad idea... much for the same reason as my
> incl hack was - since the code isn't actually inline, saving a handful of
> bytes is not the right tradeoff.
>
> -hpa


Incidentally, the cost of putting all the rwsem code inline, using the straightforward approach, is 3565 bytes out of 20971778 bytes total, or about 0.017%. That is for git-tip with defconfig on x86_64, built with gcc 4.4.3.

That's small enough to actually consider it.
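
For readers who want a picture of what "inline" means here, the following is a minimal sketch of the general pattern: a fast path that is a single atomic operation at each call site, with the contended case pushed out of line. Everything named sketch_* is hypothetical and made up for illustration, the counter layout is a simplification, and the GCC/Clang __atomic builtins stand in for the kernel's hand-written x86 assembly. It is not the kernel's rwsem code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified semaphore; NOT the kernel's struct rw_semaphore. */
struct sketch_rwsem {
    int64_t count;    /* > 0: active readers, < 0: a writer holds the lock */
    int slow_calls;   /* illustration only: how often the slow path ran */
};

/* Assumed bias value: large enough that readers cannot make count >= 0. */
#define SKETCH_WRITE_BIAS (-((int64_t)1 << 32))

/*
 * Out-of-line slow path. A real implementation would queue the caller and
 * sleep until the writer releases; this sketch only records that it ran.
 */
static __attribute__((noinline)) void sketch_down_read_slow(struct sketch_rwsem *sem)
{
    sem->slow_calls++;
}

/* Inline fast path: one atomic increment plus a sign test per call site. */
static inline void sketch_down_read(struct sketch_rwsem *sem)
{
    int64_t now = __atomic_add_fetch(&sem->count, 1, __ATOMIC_ACQUIRE);

    if (now < 0)                      /* a writer is present: contended */
        sketch_down_read_slow(sem);
}

static inline void sketch_up_read(struct sketch_rwsem *sem)
{
    __atomic_sub_fetch(&sem->count, 1, __ATOMIC_RELEASE);
}

/* Succeeds only when there are no readers and no writer. Returns 1 or 0. */
static inline int sketch_down_write_trylock(struct sketch_rwsem *sem)
{
    int64_t expected = 0;

    return __atomic_compare_exchange_n(&sem->count, &expected,
                                       SKETCH_WRITE_BIAS, 0,
                                       __ATOMIC_ACQUIRE, __ATOMIC_RELAXED);
}

static inline void sketch_up_write(struct sketch_rwsem *sem)
{
    __atomic_sub_fetch(&sem->count, SKETCH_WRITE_BIAS, __ATOMIC_RELEASE);
}
```

The size question is then how many bytes each call site grows by when the increment and test are expanded in place, versus a plain call to one shared copy.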

Even smaller if you leave trylock as a function... actually no, that didn't work, size increased. I'm guessing many call sites also end up calling the explicit form as a fallback.

If you inline only read_lock functions and write release, nope, that didn't work either.

If you inline only read_lock functions, that still isn't it. Many other permutations are possible, but I've wasted enough time.

Although, with a more clever inline implementation, if some of the constraints to %rdx go away...

Zach