Re: [PATCH] x86: Turn off GCC branch probability heuristics

From: Jan Hubicka
Date: Sun Apr 12 2015 - 17:15:24 EST


Hello,
for me as a GCC developer, this is definitely an interesting observation. Let
me briefly explain what happens here.

-fguess-branch-probability does a lot more than just BB reordering. The way
GCC works is that it first guesses the probability of every branch and then
uses the probabilities to estimate a profile (that is, the frequency at which
every BB is executed). This profile is then used throughout the optimization
queue to make decisions that either bloat the code (such as loop header
copying) or that pessimize one code path to improve another (such as spill
code placement).

This is based on
T Ball, JR Larus: Branch prediction for free
and
Y Wu, JR Larus: Static branch frequency and program profile analysis.
In GCC it was implemented in 2000
(http://www.ucw.cz/~hubicka/papers/proj/index.html), and the earlier ad-hoc
heuristics (which, for example, placed spill code depending on the loop depth
of the basic blocks) were slowly replaced by profile-driven ones. (Most
compilers nowadays work this way.) This however also means that the compiler
is quite dependent on the profile.

-fno-guess-branch-probability will leave all frequencies at 0 and all
edge probabilities at 0%. This will make most of these heuristics shut
themselves off or possibly behave funny. Disabling all the heuristics except
for builtin_expect will likely result in quite funny profiles, too, where
frequencies drop by 50% after every branch.

I am very interested in getting -O2 code size down (GCC 5 has quite a few
improvements in this area, mostly motivated by C++ code and LTO). We
should try to identify which optimizations cause most of the bloat and start
from those.

http://www.ucw.cz/~hubicka/papers/amd64/node4.html has 13-year-old data on
performance and code size with this flag. Back then branch prediction
improved performance by 4.1% on SPECint at the expense of a 5.66% code size
cost. A very large portion of the bloat was the basic block reordering. I
will try to re-run the benchmarks.

In 2003 the largest portion of the growth came from basic block reordering
(-freorder-blocks), which does some tail code duplication; that helped the
AMD K7/K8 generation chips it was tested on because of their decoder
limitations. The basic block reordering pass was implemented in 2000 based on
http://dl.acm.org/citation.cfm?id=305178, and while it has been improved in
the meantime (primarily by Google, adding special cases and hot/cold
partitioning), it is still the same software trace algorithm. So it may be
time to rewrite it, which should not be too hard.

Would it be possible to identify the functions that change the most and look into those?

Honza

