Make GCC specify which templates are taking the longest - c++

GCC can profile itself, which lets you see which part of the compilation process takes the longest.
A sample output:
Time variable usr sys wall GGC
phase setup : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 1369 kB ( 0%)
phase parsing : 5.76 ( 72%) 2.38 ( 87%) 9.27 ( 78%) 554966 kB ( 80%)
phase lang. deferred : 0.50 ( 6%) 0.16 ( 6%) 0.67 ( 6%) 62109 kB ( 9%)
phase opt and generate : 1.58 ( 20%) 0.18 ( 7%) 1.78 ( 15%) 66512 kB ( 10%)
phase last asm : 0.14 ( 2%) 0.02 ( 1%) 0.15 ( 1%) 4587 kB ( 1%)
phase finalize : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
|name lookup : 0.90 ( 11%) 0.36 ( 13%) 1.71 ( 14%) 17506 kB ( 3%)
|overload resolution : 0.78 ( 10%) 0.24 ( 9%) 1.17 ( 10%) 68510 kB ( 10%)
garbage collection : 0.58 ( 7%) 0.00 ( 0%) 0.79 ( 7%) 0 kB ( 0%)
dump files : 0.07 ( 1%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
callgraph construction : 0.31 ( 4%) 0.02 ( 1%) 0.29 ( 2%) 26559 kB ( 4%)
callgraph optimization : 0.03 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 10 kB ( 0%)
ipa function summary : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 412 kB ( 0%)
ipa inlining heuristics : 0.01 ( 0%) 0.01 ( 0%) 0.01 ( 0%) 282 kB ( 0%)
ipa pure const : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 26 kB ( 0%)
cfg cleanup : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 1 kB ( 0%)
trivially dead code : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 2 kB ( 0%)
df scan insns : 0.00 ( 0%) 0.01 ( 0%) 0.01 ( 0%) 2 kB ( 0%)
df live regs : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 24 kB ( 0%)
df reg dead/unused notes : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 172 kB ( 0%)
register information : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
alias stmt walking : 0.03 ( 0%) 0.01 ( 0%) 0.00 ( 0%) 241 kB ( 0%)
rebuild jump labels : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
preprocessing : 0.66 ( 8%) 0.61 ( 22%) 1.53 ( 13%) 45104 kB ( 7%)
parser (global) : 0.78 ( 10%) 0.59 ( 22%) 1.47 ( 12%) 107059 kB ( 16%)
parser struct body : 0.77 ( 10%) 0.18 ( 7%) 1.00 ( 8%) 64460 kB ( 9%)
parser enumerator list : 0.05 ( 1%) 0.02 ( 1%) 0.07 ( 1%) 2628 kB ( 0%)
parser function body : 0.20 ( 3%) 0.10 ( 4%) 0.35 ( 3%) 9952 kB ( 1%)
parser inl. func. body : 0.35 ( 4%) 0.19 ( 7%) 0.62 ( 5%) 25224 kB ( 4%)
parser inl. meth. body : 1.20 ( 15%) 0.28 ( 10%) 1.49 ( 13%) 110313 kB ( 16%)
template instantiation : 1.60 ( 20%) 0.48 ( 18%) 2.55 ( 21%) 172942 kB ( 25%)
constant expression evaluation : 0.10 ( 1%) 0.05 ( 2%) 0.08 ( 1%) 1091 kB ( 0%)
early inlining heuristics : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 292 kB ( 0%)
inline parameters : 0.01 ( 0%) 0.01 ( 0%) 0.03 ( 0%) 2592 kB ( 0%)
integration : 0.14 ( 2%) 0.08 ( 3%) 0.11 ( 1%) 8382 kB ( 1%)
tree gimplify : 0.01 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 3581 kB ( 1%)
tree eh : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 373 kB ( 0%)
tree CFG construction : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 1628 kB ( 0%)
tree CFG cleanup : 0.01 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 10 kB ( 0%)
tree SSA other : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 183 kB ( 0%)
tree SSA incremental : 0.00 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 100 kB ( 0%)
tree operand scan : 0.07 ( 1%) 0.00 ( 0%) 0.07 ( 1%) 2924 kB ( 0%)
tree CCP : 0.01 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 118 kB ( 0%)
tree FRE : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 363 kB ( 0%)
tree forward propagate : 0.01 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 112 kB ( 0%)
tree aggressive DCE : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 152 kB ( 0%)
tree DSE : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 28 kB ( 0%)
PHI merge : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 4 kB ( 0%)
dominance computation : 0.00 ( 0%) 0.01 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
expand vars : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 153 kB ( 0%)
expand : 0.03 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 3103 kB ( 0%)
varconst : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 6 kB ( 0%)
forward prop : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 176 kB ( 0%)
CSE : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 15 kB ( 0%)
dead store elim1 : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 117 kB ( 0%)
dead store elim2 : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 149 kB ( 0%)
loop init : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 647 kB ( 0%)
branch prediction : 0.01 ( 0%) 0.02 ( 1%) 0.02 ( 0%) 229 kB ( 0%)
combiner : 0.03 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 96 kB ( 0%)
integrated RA : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 1862 kB ( 0%)
LRA non-specific : 0.01 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 62 kB ( 0%)
LRA create live ranges : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 10 kB ( 0%)
reload CSE regs : 0.03 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 101 kB ( 0%)
thread pro- & epilogue : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 131 kB ( 0%)
hard reg cprop : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 8 kB ( 0%)
machine dep reorg : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 7 kB ( 0%)
reg stack : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
final : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 1906 kB ( 0%)
symout : 0.44 ( 6%) 0.07 ( 3%) 0.49 ( 4%) 87737 kB ( 13%)
variable tracking : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 901 kB ( 0%)
var-tracking dataflow : 0.08 ( 1%) 0.00 ( 0%) 0.08 ( 1%) 34 kB ( 0%)
var-tracking emit : 0.03 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 1180 kB ( 0%)
initialize rtl : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 12 kB ( 0%)
rest of compilation : 0.05 ( 1%) 0.00 ( 0%) 0.01 ( 0%) 289 kB ( 0%)
remove unused locals : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 1 kB ( 0%)
address taken : 0.00 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%)
TOTAL : 7.98 2.74 11.90 689554 kB
However, I would like to know more. For example, which files take the longest to compile? Which functions? In particular, my compilation bottleneck seems to be template instantiation, and I would like to know exactly which templates take the longest.
I tried looking this up, but all I found was documentation on how to generate the above table.
The table is generated by adding -ftime-report to the g++ flags.
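Since the report is plain text, one way to see at a glance which timers dominate is to parse the rows and sort them by wall time. The following is a minimal sketch, assuming the four-column layout shown above (usr, sys, wall, GGC); `TimeRow`, `parse_time_report_row`, and `slowest_by_wall` are hypothetical names, not part of GCC. It only ranks the timers GCC already prints, so it cannot break "template instantiation" down per template.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <exception>
#include <optional>
#include <sstream>
#include <string>
#include <vector>

// One row of a -ftime-report table (hypothetical helper type).
struct TimeRow {
    std::string name;
    double usr = 0, sys = 0, wall = 0;
    long ggc_kb = 0;
};

// Parses a row such as
//   "template instantiation : 1.60 ( 20%) 0.48 ( 18%) 2.55 ( 21%) 172942 kB ( 25%)"
// Returns std::nullopt for the header line or for any other layout.
std::optional<TimeRow> parse_time_report_row(const std::string& line) {
    const auto colon = line.find(':');
    if (colon == std::string::npos) return std::nullopt;

    // The timer name is everything before the colon, with spaces trimmed.
    std::string name = line.substr(0, colon);
    const auto first = name.find_first_not_of(' ');
    if (first == std::string::npos) return std::nullopt;
    const auto last = name.find_last_not_of(' ');
    name = name.substr(first, last - first + 1);

    // Collect the plain numbers, skipping "( 20%)" groups and the "kB" unit.
    std::istringstream rest(line.substr(colon + 1));
    std::vector<double> values;
    std::string token;
    while (rest >> token) {
        if (token.find_first_of("()") != std::string::npos || token == "kB") continue;
        try {
            std::size_t used = 0;
            const double v = std::stod(token, &used);
            if (used != token.size()) return std::nullopt;
            values.push_back(v);
        } catch (const std::exception&) {
            return std::nullopt;  // unexpected token, e.g. a different report layout
        }
    }
    if (values.size() != 4) return std::nullopt;

    TimeRow row;
    row.name = name;
    row.usr = values[0];
    row.sys = values[1];
    row.wall = values[2];
    row.ggc_kb = static_cast<long>(values[3]);
    return row;
}

// Returns up to top_n rows with the largest wall time, excluding the TOTAL row.
std::vector<TimeRow> slowest_by_wall(const std::vector<std::string>& lines,
                                     std::size_t top_n) {
    std::vector<TimeRow> rows;
    for (const std::string& line : lines) {
        if (auto row = parse_time_report_row(line); row && row->name != "TOTAL")
            rows.push_back(*row);
    }
    std::sort(rows.begin(), rows.end(),
              [](const TimeRow& a, const TimeRow& b) { return a.wall > b.wall; });
    if (rows.size() > top_n) rows.resize(top_n);
    assert(rows.size() <= top_n);
    return rows;
}
```

Feeding it the captured report lines and asking for, say, the top ten rows gives a quick ranking. The `phase ...` and `|...` rows overlap with the individual timers, so they are best read separately from the rest.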

Related

How to understand the output of the -ftime-report flag of gcc?

I profiled the compilation of my code with g++ -ftime-report to try to find a way to speed it up.
Here is the output:
Time variable usr sys wall GGC
phase setup : 0.00 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 1353 kB ( 0%)
phase parsing : 2.06 ( 5%) 1.13 ( 50%) 3.30 ( 8%) 565836 kB ( 30%)
phase lang. deferred : 0.30 ( 1%) 0.06 ( 3%) 0.36 ( 1%) 65727 kB ( 4%)
phase opt and generate : 37.96 ( 94%) 1.07 ( 47%) 39.03 ( 91%) 1224911 kB ( 66%)
|name lookup : 0.23 ( 1%) 0.06 ( 3%) 0.34 ( 1%) 18602 kB ( 1%)
|overload resolution : 0.36 ( 1%) 0.10 ( 4%) 0.41 ( 1%) 83103 kB ( 4%)
garbage collection : 0.42 ( 1%) 0.00 ( 0%) 0.43 ( 1%) 0 kB ( 0%)
dump files : 0.02 ( 0%) 0.01 ( 0%) 0.06 ( 0%) 0 kB ( 0%)
callgraph construction : 0.18 ( 0%) 0.01 ( 0%) 0.17 ( 0%) 12930 kB ( 1%)
callgraph optimization : 0.10 ( 0%) 0.01 ( 0%) 0.05 ( 0%) 371 kB ( 0%)
ipa function summary : 0.07 ( 0%) 0.00 ( 0%) 0.08 ( 0%) 1110 kB ( 0%)
ipa dead code removal : 0.03 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
ipa devirtualization : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 134 kB ( 0%)
ipa cp : 0.10 ( 0%) 0.01 ( 0%) 0.12 ( 0%) 8595 kB ( 0%)
ipa inlining heuristics : 3.18 ( 8%) 0.00 ( 0%) 3.20 ( 7%) 19108 kB ( 1%)
ipa function splitting : 0.21 ( 1%) 0.00 ( 0%) 0.19 ( 0%) 286 kB ( 0%)
ipa reference : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
ipa pure const : 0.05 ( 0%) 0.01 ( 0%) 0.06 ( 0%) 17 kB ( 0%)
ipa icf : 0.12 ( 0%) 0.00 ( 0%) 0.12 ( 0%) 1 kB ( 0%)
ipa SRA : 0.33 ( 1%) 0.03 ( 1%) 0.27 ( 1%) 23892 kB ( 1%)
ipa free inline summary : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
cfg construction : 0.04 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 2185 kB ( 0%)
cfg cleanup : 0.63 ( 2%) 0.02 ( 1%) 0.65 ( 2%) 4734 kB ( 0%)
trivially dead code : 0.16 ( 0%) 0.00 ( 0%) 0.09 ( 0%) 0 kB ( 0%)
df scan insns : 0.24 ( 1%) 0.01 ( 0%) 0.22 ( 1%) 8 kB ( 0%)
df multiple defs : 0.21 ( 1%) 0.01 ( 0%) 0.23 ( 1%) 0 kB ( 0%)
df reaching defs : 0.75 ( 2%) 0.00 ( 0%) 0.74 ( 2%) 0 kB ( 0%)
df live regs : 2.00 ( 5%) 0.00 ( 0%) 2.05 ( 5%) 0 kB ( 0%)
df live&initialized regs : 0.69 ( 2%) 0.00 ( 0%) 0.76 ( 2%) 0 kB ( 0%)
df must-initialized regs : 0.61 ( 2%) 0.24 ( 11%) 0.83 ( 2%) 0 kB ( 0%)
df use-def / def-use chains : 0.25 ( 1%) 0.00 ( 0%) 0.26 ( 1%) 0 kB ( 0%)
df reg dead/unused notes : 0.87 ( 2%) 0.00 ( 0%) 0.79 ( 2%) 14516 kB ( 1%)
register information : 0.10 ( 0%) 0.00 ( 0%) 0.15 ( 0%) 0 kB ( 0%)
alias analysis : 0.40 ( 1%) 0.00 ( 0%) 0.34 ( 1%) 28831 kB ( 2%)
alias stmt walking : 0.72 ( 2%) 0.07 ( 3%) 0.64 ( 1%) 5194 kB ( 0%)
register scan : 0.05 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 217 kB ( 0%)
rebuild jump labels : 0.08 ( 0%) 0.00 ( 0%) 0.10 ( 0%) 0 kB ( 0%)
preprocessing : 0.60 ( 1%) 0.59 ( 26%) 1.38 ( 3%) 194467 kB ( 10%)
parser (global) : 0.22 ( 1%) 0.23 ( 10%) 0.38 ( 1%) 102668 kB ( 6%)
parser struct body : 0.27 ( 1%) 0.06 ( 3%) 0.35 ( 1%) 62614 kB ( 3%)
parser function body : 0.35 ( 1%) 0.09 ( 4%) 0.38 ( 1%) 70207 kB ( 4%)
parser inl. func. body : 0.06 ( 0%) 0.04 ( 2%) 0.07 ( 0%) 7795 kB ( 0%)
parser inl. meth. body : 0.16 ( 0%) 0.04 ( 2%) 0.22 ( 1%) 32985 kB ( 2%)
template instantiation : 0.64 ( 2%) 0.14 ( 6%) 0.78 ( 2%) 160006 kB ( 9%)
constant expression evaluation : 0.01 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 348 kB ( 0%)
early inlining heuristics : 0.12 ( 0%) 0.01 ( 0%) 0.09 ( 0%) 50683 kB ( 3%)
inline parameters : 0.16 ( 0%) 0.00 ( 0%) 0.16 ( 0%) 9128 kB ( 0%)
integration : 1.01 ( 3%) 0.13 ( 6%) 1.20 ( 3%) 272019 kB ( 15%)
tree gimplify : 0.09 ( 0%) 0.02 ( 1%) 0.10 ( 0%) 43912 kB ( 2%)
tree eh : 0.15 ( 0%) 0.00 ( 0%) 0.17 ( 0%) 49453 kB ( 3%)
tree CFG construction : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 24163 kB ( 1%)
tree CFG cleanup : 0.77 ( 2%) 0.02 ( 1%) 0.86 ( 2%) 570 kB ( 0%)
tree tail merge : 0.10 ( 0%) 0.00 ( 0%) 0.09 ( 0%) 1409 kB ( 0%)
tree VRP : 0.68 ( 2%) 0.00 ( 0%) 0.76 ( 2%) 30167 kB ( 2%)
tree Early VRP : 0.08 ( 0%) 0.00 ( 0%) 0.08 ( 0%) 4515 kB ( 0%)
tree copy propagation : 0.19 ( 0%) 0.00 ( 0%) 0.20 ( 0%) 286 kB ( 0%)
tree PTA : 0.65 ( 2%) 0.00 ( 0%) 0.69 ( 2%) 5326 kB ( 0%)
tree PHI insertion : 0.01 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 5166 kB ( 0%)
tree SSA rewrite : 0.29 ( 1%) 0.02 ( 1%) 0.27 ( 1%) 28108 kB ( 2%)
tree SSA other : 0.04 ( 0%) 0.01 ( 0%) 0.05 ( 0%) 357 kB ( 0%)
tree SSA incremental : 0.38 ( 1%) 0.02 ( 1%) 0.39 ( 1%) 13003 kB ( 1%)
tree operand scan : 0.27 ( 1%) 0.05 ( 2%) 0.21 ( 0%) 41554 kB ( 2%)
dominator optimization : 0.62 ( 2%) 0.03 ( 1%) 0.70 ( 2%) 26865 kB ( 1%)
backwards jump threading : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 1082 kB ( 0%)
tree SRA : 0.07 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 757 kB ( 0%)
isolate eroneous paths : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
tree CCP : 0.38 ( 1%) 0.01 ( 0%) 0.38 ( 1%) 6460 kB ( 0%)
tree split crit edges : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 6860 kB ( 0%)
tree reassociation : 0.07 ( 0%) 0.00 ( 0%) 0.06 ( 0%) 95 kB ( 0%)
tree PRE : 0.49 ( 1%) 0.07 ( 3%) 0.59 ( 1%) 29233 kB ( 2%)
tree FRE : 0.31 ( 1%) 0.06 ( 3%) 0.37 ( 1%) 5463 kB ( 0%)
tree code sinking : 0.05 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 1998 kB ( 0%)
tree linearize phis : 0.06 ( 0%) 0.00 ( 0%) 0.06 ( 0%) 235 kB ( 0%)
tree backward propagate : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
tree forward propagate : 0.15 ( 0%) 0.00 ( 0%) 0.19 ( 0%) 3598 kB ( 0%)
tree phiprop : 0.00 ( 0%) 0.01 ( 0%) 0.01 ( 0%) 162 kB ( 0%)
tree conservative DCE : 0.17 ( 0%) 0.04 ( 2%) 0.18 ( 0%) 121 kB ( 0%)
tree aggressive DCE : 0.09 ( 0%) 0.00 ( 0%) 0.18 ( 0%) 3761 kB ( 0%)
tree buildin call DCE : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 38 kB ( 0%)
tree DSE : 0.10 ( 0%) 0.01 ( 0%) 0.13 ( 0%) 2485 kB ( 0%)
PHI merge : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 202 kB ( 0%)
complete unrolling : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 242 kB ( 0%)
tree slp vectorization : 0.14 ( 0%) 0.00 ( 0%) 0.16 ( 0%) 40876 kB ( 2%)
tree iv optimization : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 484 kB ( 0%)
tree SSA uncprop : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%)
tree switch conversion : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
gimple CSE sin/cos : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 3 kB ( 0%)
gimple widening/fma detection : 0.04 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%)
tree strlen optimization : 0.05 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 192 kB ( 0%)
dominance frontiers : 0.06 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%)
dominance computation : 0.70 ( 2%) 0.00 ( 0%) 0.78 ( 2%) 0 kB ( 0%)
control dependences : 0.03 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
out of ssa : 0.09 ( 0%) 0.00 ( 0%) 0.09 ( 0%) 1047 kB ( 0%)
expand vars : 0.61 ( 2%) 0.00 ( 0%) 0.59 ( 1%) 11361 kB ( 1%)
expand : 0.24 ( 1%) 0.02 ( 1%) 0.27 ( 1%) 110705 kB ( 6%)
post expand cleanups : 0.11 ( 0%) 0.00 ( 0%) 0.09 ( 0%) 11138 kB ( 1%)
varconst : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 9 kB ( 0%)
forward prop : 0.28 ( 1%) 0.00 ( 0%) 0.32 ( 1%) 7463 kB ( 0%)
CSE : 0.47 ( 1%) 0.01 ( 0%) 0.50 ( 1%) 6406 kB ( 0%)
dead code elimination : 0.15 ( 0%) 0.00 ( 0%) 0.16 ( 0%) 0 kB ( 0%)
dead store elim1 : 0.25 ( 1%) 0.00 ( 0%) 0.25 ( 1%) 7807 kB ( 0%)
dead store elim2 : 0.15 ( 0%) 0.01 ( 0%) 0.13 ( 0%) 12268 kB ( 1%)
loop init : 0.30 ( 1%) 0.00 ( 0%) 0.32 ( 1%) 3678 kB ( 0%)
loop invariant motion : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 1 kB ( 0%)
loop fini : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
CPROP : 0.49 ( 1%) 0.00 ( 0%) 0.49 ( 1%) 12212 kB ( 1%)
PRE : 3.67 ( 9%) 0.04 ( 2%) 3.69 ( 9%) 17514 kB ( 1%)
CSE 2 : 0.28 ( 1%) 0.01 ( 0%) 0.29 ( 1%) 2791 kB ( 0%)
branch prediction : 0.13 ( 0%) 0.00 ( 0%) 0.13 ( 0%) 823 kB ( 0%)
combiner : 0.48 ( 1%) 0.00 ( 0%) 0.51 ( 1%) 19027 kB ( 1%)
if-conversion : 0.08 ( 0%) 0.00 ( 0%) 0.07 ( 0%) 3838 kB ( 0%)
integrated RA : 1.09 ( 3%) 0.00 ( 0%) 1.14 ( 3%) 72103 kB ( 4%)
LRA non-specific : 0.52 ( 1%) 0.01 ( 0%) 0.47 ( 1%) 3373 kB ( 0%)
LRA virtuals elimination : 0.14 ( 0%) 0.00 ( 0%) 0.11 ( 0%) 9546 kB ( 1%)
LRA reload inheritance : 0.04 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 822 kB ( 0%)
LRA create live ranges : 0.30 ( 1%) 0.00 ( 0%) 0.39 ( 1%) 1330 kB ( 0%)
LRA hard reg assignment : 0.01 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%)
LRA rematerialization : 0.10 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 0 kB ( 0%)
reload CSE regs : 0.49 ( 1%) 0.00 ( 0%) 0.53 ( 1%) 15987 kB ( 1%)
load CSE after reload : 2.73 ( 7%) 0.00 ( 0%) 2.72 ( 6%) 11924 kB ( 1%)
ree : 0.07 ( 0%) 0.00 ( 0%) 0.10 ( 0%) 74 kB ( 0%)
thread pro- & epilogue : 0.18 ( 0%) 0.00 ( 0%) 0.22 ( 1%) 446 kB ( 0%)
if-conversion 2 : 0.08 ( 0%) 0.00 ( 0%) 0.06 ( 0%) 23 kB ( 0%)
split paths : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 16 kB ( 0%)
combine stack adjustments : 0.01 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%)
peephole 2 : 0.09 ( 0%) 0.00 ( 0%) 0.11 ( 0%) 878 kB ( 0%)
hard reg cprop : 0.19 ( 0%) 0.00 ( 0%) 0.15 ( 0%) 11 kB ( 0%)
scheduling 2 : 1.08 ( 3%) 0.02 ( 1%) 1.13 ( 3%) 8034 kB ( 0%)
machine dep reorg : 0.09 ( 0%) 0.00 ( 0%) 0.09 ( 0%) 2338 kB ( 0%)
reorder blocks : 0.33 ( 1%) 0.00 ( 0%) 0.31 ( 1%) 4597 kB ( 0%)
shorten branches : 0.09 ( 0%) 0.00 ( 0%) 0.10 ( 0%) 0 kB ( 0%)
final : 0.19 ( 0%) 0.00 ( 0%) 0.20 ( 0%) 21653 kB ( 1%)
tree if-combine : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 386 kB ( 0%)
straight-line strength reduction : 0.04 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 26 kB ( 0%)
store merging : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 868 kB ( 0%)
initialize rtl : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 12 kB ( 0%)
address lowering : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
early local passes : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
rest of compilation : 0.59 ( 1%) 0.00 ( 0%) 0.55 ( 1%) 7917 kB ( 0%)
remove unused locals : 0.17 ( 0%) 0.00 ( 0%) 0.16 ( 0%) 416 kB ( 0%)
address taken : 0.11 ( 0%) 0.01 ( 0%) 0.14 ( 0%) 0 kB ( 0%)
rebuild frequencies : 0.05 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 138 kB ( 0%)
repair loop structures : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
TOTAL : 40.32 2.26 42.69 1857845 kB
My problem is that I don't understand any of the terms in this report (ipa cp, tree eh, ...). I would at least like to understand what the phase opt and generate stage is, because it takes 94% of the compile time, so it is clearly what I should tackle.
The GCC documentation has almost no information about this option: https://gcc.gnu.org/onlinedocs/gcc/Developer-Options.html#Developer-Options
-ftime-report
Makes the compiler print some statistics about the time consumed by each pass when it finishes.
It is a bit surprising to have such a terse description for such a complicated option.

Compilation time profiling: What is the "phase opt and generate" stage and how can I speed it up (-ftime-report)

I am profiling the compilation of my code to determine why it is so slow. I am using gcc (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0 and have added the compiler flag -ftime-report.
What I notice is that the compilation units that are slow to compile spend the majority of their time in the phase opt and generate stage. What exactly is this stage? How can I reduce the time taken by this phase?
For reference, this is what the output for one of the compilation units looks like.
Time variable usr sys wall GGC
phase setup : 0.00 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 1579 kB ( 0%)
phase parsing : 1.74 ( 20%) 0.71 ( 44%) 2.46 ( 24%) 311927 kB ( 36%)
phase lang. deferred : 1.33 ( 15%) 0.34 ( 21%) 1.67 ( 16%) 259524 kB ( 30%)
phase opt and generate : 5.68 ( 65%) 0.58 ( 36%) 6.26 ( 60%) 301021 kB ( 34%)
phase last asm : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 2 kB ( 0%)
|name lookup : 0.44 ( 5%) 0.12 ( 7%) 0.49 ( 5%) 15499 kB ( 2%)
|overload resolution : 0.76 ( 9%) 0.22 ( 13%) 0.92 ( 9%) 130607 kB ( 15%)
garbage collection : 0.33 ( 4%) 0.01 ( 1%) 0.34 ( 3%) 0 kB ( 0%)
dump files : 0.18 ( 2%) 0.04 ( 2%) 0.10 ( 1%) 0 kB ( 0%)
callgraph construction : 0.12 ( 1%) 0.03 ( 2%) 0.14 ( 1%) 6318 kB ( 1%)
callgraph optimization : 0.16 ( 2%) 0.04 ( 2%) 0.19 ( 2%) 82 kB ( 0%)
ipa function summary : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 2289 kB ( 0%)
ipa dead code removal : 0.01 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%)
ipa inheritance graph : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 29 kB ( 0%)
ipa virtual call target : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 3 kB ( 0%)
ipa cp : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 1140 kB ( 0%)
ipa inlining heuristics : 0.04 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 2438 kB ( 0%)
ipa function splitting : 0.00 ( 0%) 0.01 ( 1%) 0.01 ( 0%) 451 kB ( 0%)
ipa profile : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
ipa pure const : 0.02 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 40 kB ( 0%)
ipa icf : 0.01 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 4 kB ( 0%)
ipa SRA : 0.10 ( 1%) 0.00 ( 0%) 0.05 ( 0%) 9838 kB ( 1%)
cfg cleanup : 0.08 ( 1%) 0.01 ( 1%) 0.08 ( 1%) 1621 kB ( 0%)
trivially dead code : 0.03 ( 0%) 0.00 ( 0%) 0.06 ( 1%) 0 kB ( 0%)
df scan insns : 0.02 ( 0%) 0.01 ( 1%) 0.05 ( 0%) 18 kB ( 0%)
df multiple defs : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%)
df reaching defs : 0.06 ( 1%) 0.00 ( 0%) 0.04 ( 0%) 0 kB ( 0%)
df live regs : 0.19 ( 2%) 0.01 ( 1%) 0.25 ( 2%) 0 kB ( 0%)
df live&initialized regs : 0.05 ( 1%) 0.00 ( 0%) 0.06 ( 1%) 0 kB ( 0%)
df use-def / def-use chains : 0.03 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
df reg dead/unused notes : 0.08 ( 1%) 0.00 ( 0%) 0.07 ( 1%) 2152 kB ( 0%)
register information : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
alias analysis : 0.03 ( 0%) 0.00 ( 0%) 0.09 ( 1%) 5413 kB ( 1%)
alias stmt walking : 0.08 ( 1%) 0.00 ( 0%) 0.13 ( 1%) 738 kB ( 0%)
register scan : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 167 kB ( 0%)
rebuild jump labels : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
preprocessing : 0.15 ( 2%) 0.21 ( 13%) 0.39 ( 4%) 11918 kB ( 1%)
parser (global) : 0.29 ( 3%) 0.21 ( 13%) 0.51 ( 5%) 105494 kB ( 12%)
parser struct body : 0.18 ( 2%) 0.04 ( 2%) 0.22 ( 2%) 39504 kB ( 5%)
parser enumerator list : 0.01 ( 0%) 0.01 ( 1%) 0.00 ( 0%) 1305 kB ( 0%)
parser function body : 0.18 ( 2%) 0.04 ( 2%) 0.15 ( 1%) 9096 kB ( 1%)
parser inl. func. body : 0.27 ( 3%) 0.02 ( 1%) 0.39 ( 4%) 33105 kB ( 4%)
parser inl. meth. body : 0.21 ( 2%) 0.06 ( 4%) 0.25 ( 2%) 23541 kB ( 3%)
template instantiation : 1.61 ( 18%) 0.43 ( 26%) 2.05 ( 20%) 346006 kB ( 40%)
constant expression evaluation : 0.05 ( 1%) 0.03 ( 2%) 0.02 ( 0%) 1470 kB ( 0%)
early inlining heuristics : 0.00 ( 0%) 0.01 ( 1%) 0.03 ( 0%) 3751 kB ( 0%)
inline parameters : 0.06 ( 1%) 0.02 ( 1%) 0.05 ( 0%) 12991 kB ( 1%)
integration : 0.12 ( 1%) 0.04 ( 2%) 0.26 ( 3%) 53810 kB ( 6%)
tree gimplify : 0.06 ( 1%) 0.02 ( 1%) 0.11 ( 1%) 20691 kB ( 2%)
tree eh : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 2821 kB ( 0%)
tree CFG construction : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 8987 kB ( 1%)
tree CFG cleanup : 0.11 ( 1%) 0.02 ( 1%) 0.13 ( 1%) 208 kB ( 0%)
tree tail merge : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 880 kB ( 0%)
tree VRP : 0.17 ( 2%) 0.00 ( 0%) 0.18 ( 2%) 7001 kB ( 1%)
tree Early VRP : 0.05 ( 1%) 0.00 ( 0%) 0.05 ( 0%) 7256 kB ( 1%)
tree copy propagation : 0.00 ( 0%) 0.00 ( 0%) 0.05 ( 0%) 104 kB ( 0%)
tree PTA : 0.13 ( 1%) 0.05 ( 3%) 0.25 ( 2%) 1906 kB ( 0%)
tree PHI insertion : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 736 kB ( 0%)
tree SSA rewrite : 0.06 ( 1%) 0.01 ( 1%) 0.04 ( 0%) 6289 kB ( 1%)
tree SSA other : 0.00 ( 0%) 0.02 ( 1%) 0.03 ( 0%) 940 kB ( 0%)
tree SSA incremental : 0.08 ( 1%) 0.00 ( 0%) 0.03 ( 0%) 1717 kB ( 0%)
tree operand scan : 0.08 ( 1%) 0.00 ( 0%) 0.08 ( 1%) 19096 kB ( 2%)
dominator optimization : 0.18 ( 2%) 0.01 ( 1%) 0.15 ( 1%) 5240 kB ( 1%)
backwards jump threading : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 244 kB ( 0%)
tree SRA : 0.03 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 1712 kB ( 0%)
tree CCP : 0.10 ( 1%) 0.02 ( 1%) 0.10 ( 1%) 1097 kB ( 0%)
tree reassociation : 0.00 ( 0%) 0.01 ( 1%) 0.00 ( 0%) 50 kB ( 0%)
tree PRE : 0.15 ( 2%) 0.01 ( 1%) 0.18 ( 2%) 4977 kB ( 1%)
tree FRE : 0.13 ( 1%) 0.02 ( 1%) 0.12 ( 1%) 2498 kB ( 0%)
tree linearize phis : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 563 kB ( 0%)
tree forward propagate : 0.09 ( 1%) 0.00 ( 0%) 0.10 ( 1%) 1071 kB ( 0%)
tree phiprop : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 11 kB ( 0%)
tree conservative DCE : 0.04 ( 0%) 0.01 ( 1%) 0.02 ( 0%) 133 kB ( 0%)
tree aggressive DCE : 0.04 ( 0%) 0.01 ( 1%) 0.04 ( 0%) 7238 kB ( 1%)
tree DSE : 0.00 ( 0%) 0.01 ( 1%) 0.03 ( 0%) 254 kB ( 0%)
tree loop invariant motion : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 17 kB ( 0%)
scev constant prop : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 112 kB ( 0%)
tree loop unswitching : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 349 kB ( 0%)
complete unrolling : 0.01 ( 0%) 0.01 ( 1%) 0.03 ( 0%) 1141 kB ( 0%)
tree slp vectorization : 0.01 ( 0%) 0.02 ( 1%) 0.03 ( 0%) 5032 kB ( 1%)
tree iv optimization : 0.02 ( 0%) 0.00 ( 0%) 0.04 ( 0%) 2110 kB ( 0%)
predictive commoning : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 302 kB ( 0%)
gimple CSE reciprocals : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
dominance computation : 0.14 ( 2%) 0.03 ( 2%) 0.16 ( 2%) 0 kB ( 0%)
out of ssa : 0.05 ( 1%) 0.00 ( 0%) 0.01 ( 0%) 55 kB ( 0%)
expand vars : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 1422 kB ( 0%)
expand : 0.03 ( 0%) 0.01 ( 1%) 0.10 ( 1%) 14790 kB ( 2%)
post expand cleanups : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 1273 kB ( 0%)
varconst : 0.00 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 8 kB ( 0%)
jump : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
forward prop : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 1330 kB ( 0%)
CSE : 0.13 ( 1%) 0.00 ( 0%) 0.08 ( 1%) 664 kB ( 0%)
dead code elimination : 0.00 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 0 kB ( 0%)
dead store elim1 : 0.02 ( 0%) 0.00 ( 0%) 0.06 ( 1%) 1230 kB ( 0%)
dead store elim2 : 0.05 ( 1%) 0.00 ( 0%) 0.03 ( 0%) 1584 kB ( 0%)
loop init : 0.11 ( 1%) 0.02 ( 1%) 0.07 ( 1%) 8638 kB ( 1%)
loop versioning : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 40 kB ( 0%)
loop invariant motion : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 8 kB ( 0%)
CPROP : 0.12 ( 1%) 0.00 ( 0%) 0.06 ( 1%) 3321 kB ( 0%)
PRE : 0.08 ( 1%) 0.00 ( 0%) 0.05 ( 0%) 935 kB ( 0%)
CSE 2 : 0.07 ( 1%) 0.00 ( 0%) 0.08 ( 1%) 333 kB ( 0%)
branch prediction : 0.02 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 1178 kB ( 0%)
combiner : 0.21 ( 2%) 0.00 ( 0%) 0.15 ( 1%) 7070 kB ( 1%)
if-conversion : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 464 kB ( 0%)
integrated RA : 0.25 ( 3%) 0.01 ( 1%) 0.30 ( 3%) 20626 kB ( 2%)
LRA non-specific : 0.10 ( 1%) 0.00 ( 0%) 0.09 ( 1%) 1243 kB ( 0%)
LRA virtuals elimination : 0.02 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 834 kB ( 0%)
LRA reload inheritance : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 195 kB ( 0%)
LRA create live ranges : 0.11 ( 1%) 0.01 ( 1%) 0.13 ( 1%) 234 kB ( 0%)
LRA hard reg assignment : 0.02 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
LRA rematerialization : 0.04 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 0 kB ( 0%)
reload : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 0 kB ( 0%)
reload CSE regs : 0.09 ( 1%) 0.00 ( 0%) 0.06 ( 1%) 2212 kB ( 0%)
load CSE after reload : 0.06 ( 1%) 0.00 ( 0%) 0.05 ( 0%) 559 kB ( 0%)
ree : 0.00 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 71 kB ( 0%)
thread pro- & epilogue : 0.03 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 939 kB ( 0%)
peephole 2 : 0.01 ( 0%) 0.00 ( 0%) 0.02 ( 0%) 170 kB ( 0%)
hard reg cprop : 0.00 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 15 kB ( 0%)
scheduling 2 : 0.15 ( 2%) 0.00 ( 0%) 0.16 ( 2%) 894 kB ( 0%)
machine dep reorg : 0.00 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 502 kB ( 0%)
reorder blocks : 0.04 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 1015 kB ( 0%)
shorten branches : 0.02 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 0 kB ( 0%)
final : 0.04 ( 0%) 0.00 ( 0%) 0.03 ( 0%) 3408 kB ( 0%)
straight-line strength reduction : 0.00 ( 0%) 0.00 ( 0%) 0.01 ( 0%) 21 kB ( 0%)
tree loop if-conversion : 0.01 ( 0%) 0.00 ( 0%) 0.00 ( 0%) 203 kB ( 0%)
rest of compilation : 0.10 ( 1%) 0.01 ( 1%) 0.13 ( 1%) 3241 kB ( 0%)
remove unused locals : 0.02 ( 0%) 0.00 ( 0%) 0.08 ( 1%) 3 kB ( 0%)
address taken : 0.04 ( 0%) 0.01 ( 1%) 0.04 ( 0%) 0 kB ( 0%)
TOTAL : 8.75 1.63 10.40 874064 kB
Edit
I had a few people comment asking for the compiler flags, here they are:
-std=c++17 -Wall -Ofast -DNDEBUG -Wno-deprecated-declarations

Unexpected minflt (minor page faults)

A minor page fault means that a virtual page has to be mapped to physical memory. However, I found that my test code incurs some minor page faults when it accesses memory that has already been used.
My test code:
#include <iostream>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
using namespace std;
int main() {
    uint64_t len = 1024ll * 128; //128k
    uint32_t times = 1000000000;
    char *p = new char[len];
    for (uint32_t t = 0; t < times; ++t) {
        for (uint64_t i = 0; i < len; ++i) {
            *(p + i) = 100;
        }
    }
    delete[] p;
    return 0;
}
pidstat:
03:40:05 PM UID PID minflt/s majflt/s VSZ RSS %MEM Command
03:40:06 PM 0 42379 34.00 0.00 12672 1196 0.00 main
03:40:07 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:08 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:09 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:10 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:11 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:12 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:13 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:14 PM 0 42379 34.00 0.00 12672 1196 0.00 main
03:40:15 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:16 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:17 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:18 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:19 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:20 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:21 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:22 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:23 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:24 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:25 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:26 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:27 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:28 PM 0 42379 34.00 0.00 12672 1196 0.00 main
03:40:29 PM 0 42379 0.00 0.00 12672 1196 0.00 main
03:40:30 PM 0 42379 0.00 0.00 12672 1196 0.00 main
A minflt/s of 34 corresponds to 34 * 4K = 136K,
while the array in my test code is 128K.
Why does my test code produce minor page faults when it accesses memory that is already in use?
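To cross-check the arithmetic above and to count faults from inside the process rather than with pidstat, something like the following could be used. It is only a sketch: it assumes a Linux/POSIX system where `getrusage` reports `ru_minflt`, and it assumes the 4K page size used in the calculation above. The helper names `pages_for`, `minor_faults_so_far`, and `faults_while_touching` are made up for illustration. It does not explain the periodic bursts; it only makes the numbers easier to reproduce.

```cpp
#include <cassert>
#include <cstddef>
#include <sys/resource.h>

// Number of pages needed to back `bytes` bytes, rounding up.
std::size_t pages_for(std::size_t bytes, std::size_t page_size) {
    assert(page_size > 0);
    return (bytes + page_size - 1) / page_size;
}

// Minor page faults charged to this process so far (ru_minflt from getrusage).
long minor_faults_so_far() {
    rusage usage{};
    getrusage(RUSAGE_SELF, &usage);
    return usage.ru_minflt;
}

// Minor faults observed while writing once to every byte of a fresh buffer.
// The pointer is volatile so the compiler does not drop the stores.
long faults_while_touching(std::size_t bytes) {
    const long before = minor_faults_so_far();
    volatile char* p = new char[bytes];
    for (std::size_t i = 0; i < bytes; ++i) {
        p[i] = 100;
    }
    const long after = minor_faults_so_far();
    delete[] const_cast<char*>(p);
    return after - before;
}
```

With a 4K page size, `pages_for(128 * 1024, 4096)` is 32, so a burst of 34 faults covers two pages more than the array itself. Calling `minor_faults_so_far()` before and after a later pass over the same buffer would show whether that pass faults at all.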

C++: Compilation of a huge number of expressions fails with -O2 optimization?

I was using the Eigen library to do some matrix calculations. I have to define a large matrix (actually not that large, just 300x300) in which each element is composed of long complex exponential expressions.
To give an impression of what I mean, here is a small part of my matrix definition:
#include <iostream>
#include <complex>
#include <Eigen/Dense>
using namespace Eigen;
int main()
{
    typedef std::complex<double> cd;
    MatrixXcd h(300,300);
    double kx,ky;
    kx=1.;
    ky=1.;
    h.setZero(300,300);
    h(0,0)=cd(6.942755,0.) + 0.043986/exp(cd(0,1)*(0. - 2.0238820899708214*kx - 7.55323078829979*ky)) - 0.010802/exp(cd(0,1)*(0. + 5.529348698328969*kx - 5.529348698328969*ky)) + 0.043986/exp(cd(0,1)*(0. - 7.55323078829979*kx - 2.0238820899708214*ky)) + 0.043986/exp(cd(0,1)*(0. + 7.55323078829979*kx + 2.0238820899708214*ky)) - 0.010802/exp(cd(0,1)*(0. - 5.529348698328969*kx + 5.529348698328969*ky)) + 0.043986/exp(cd(0,1)*(0. + 2.0238820899708214*kx + 7.55323078829979*ky));
    h(0,2)=cd(0.,0.) + 0.095916/exp(cd(0,1)*(0. - 7.55323078829979*kx - 2.0238820899708214*ky)) - 0.131689/exp(cd(0,1)*(0. + 7.55323078829979*kx + 2.0238820899708214*ky));
    h(0,3)=cd(-0.10825,0.) - 0.011519/exp(cd(0,1)*(0. - 7.55323078829979*kx - 2.0238820899708214*ky));
    ...
    ...//6000 more lines omitted here
}
I am using mingw-w64 on Windows, and the compiler is set up correctly. But when I compile the above code with
g++ -O2 code.cpp
the compilation fails with a popup dialog!
If I watch the task manager carefully, the compilation stops at a memory usage of about 1GB.
However, if I compile the code again with the -O0 option, that is, with all optimization disabled, the compilation succeeds, even though the memory usage peaks at close to 2GB. So the failure is definitely not due to memory.
What is more, I can confirm that this behavior has nothing to do with the Eigen library. It also happens if I don't use Eigen and replace every assignment with an assignment to the same variable, like this:
#include <iostream>
#include <complex>
int main()
{
    typedef std::complex<double> cd;
    cd tmp;
    double kx,ky;
    kx=1.;
    ky=1.;
    tmp=cd(6.942755,0.) + 0.043986/exp(cd(0,1)*(0. - 2.0238820899708214*kx - 7.55323078829979*ky)) - 0.010802/exp(cd(0,1)*(0. + 5.529348698328969*kx - 5.529348698328969*ky)) + 0.043986/exp(cd(0,1)*(0. - 7.55323078829979*kx - 2.0238820899708214*ky)) + 0.043986/exp(cd(0,1)*(0. + 7.55323078829979*kx + 2.0238820899708214*ky)) - 0.010802/exp(cd(0,1)*(0. - 5.529348698328969*kx + 5.529348698328969*ky)) + 0.043986/exp(cd(0,1)*(0. + 2.0238820899708214*kx + 7.55323078829979*ky));
    tmp=cd(0.,0.) + 0.095916/exp(cd(0,1)*(0. - 7.55323078829979*kx - 2.0238820899708214*ky)) - 0.131689/exp(cd(0,1)*(0. + 7.55323078829979*kx + 2.0238820899708214*ky));
    tmp=cd(-0.10825,0.) - 0.011519/exp(cd(0,1)*(0. - 7.55323078829979*kx - 2.0238820899708214*ky));
    ... //6000 more lines omitted
}
The compilation also fails with the -O2 option.
The problem is not limited to the mingw compiler either. I also tried icl.exe from Intel Parallel Studio. The situation is even worse: the compilation takes more than 30 minutes and seems to go on and on. I did not have the patience to wait for it to finish, and it may well fail in the end too.
So my question is: what causes the compilation to fail with -O2? How can I make -O2 work for my code (which has a huge number of expressions)? What also surprises me is that, although there are many expressions, they are composed only of the elementary exp function, so why does the compilation take so much time and memory? Is there any trick to make the compilation faster?
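One workaround sometimes used for generated code of this shape is to store the coefficients as data and evaluate them in a loop, so the optimizer sees one small function instead of thousands of long statements in main. The sketch below assumes every term has the form amplitude / exp(i * (ax*kx + ay*ky)), as in the lines shown above. The types `Term` and `Entry` and the functions `evaluate` and `build_matrix` are hypothetical, and a flat row-major std::vector stands in for the Eigen matrix. Whether this avoids the -O2 failure is not confirmed by anything in this post.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// One term of the form amplitude / exp(i * (ax*kx + ay*ky)).
struct Term {
    double amplitude;
    double ax;
    double ay;
};

// One matrix element: a constant part plus a list of exponential terms.
struct Entry {
    int row;
    int col;
    cd constant;
    std::vector<Term> terms;
};

// Evaluates a single element for the given kx, ky.
cd evaluate(const Entry& e, double kx, double ky) {
    cd sum = e.constant;
    for (const Term& t : e.terms) {
        sum += t.amplitude / std::exp(cd(0, 1) * (t.ax * kx + t.ay * ky));
    }
    return sum;
}

// Fills an n x n row-major matrix; elements without an entry stay zero.
std::vector<cd> build_matrix(const std::vector<Entry>& entries, int n,
                             double kx, double ky) {
    std::vector<cd> h(static_cast<std::size_t>(n) * n, cd(0, 0));
    for (const Entry& e : entries) {
        assert(e.row >= 0 && e.row < n && e.col >= 0 && e.col < n);
        h[static_cast<std::size_t>(e.row) * n + e.col] = evaluate(e, kx, ky);
    }
    return h;
}
```

For example, the h(0,3) line above would become an `Entry` with row 0, column 3, constant cd(-0.10825, 0.) and a single `Term` {-0.011519, -7.55323078829979, -2.0238820899708214}. The table could also be read from a file at run time, which would keep the coefficients out of the compiled source entirely.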
Update
Following Marc Glisse's suggestion, I ran the following. -O1 works, but I need at least -O2, because the code is for scientific computing, where speed is important.
R:\>g++ -O1 -ftime-report eigen.cpp
Execution times (seconds)
phase setup : 0.01 ( 0%) usr 1540 kB ( 0%) ggc
phase parsing : 6.06 ( 5%) usr 412774 kB (25%) ggc
phase lang. deferred : 0.18 ( 0%) usr 6491 kB ( 0%) ggc
phase opt and generate : 122.65 (95%) usr 1203926 kB (74%) ggc
|name lookup : 0.61 ( 0%) usr 39968 kB ( 2%) ggc
|overload resolution : 2.18 ( 2%) usr 151685 kB ( 9%) ggc
garbage collection : 1.48 ( 1%) usr 0 kB ( 0%) ggc
callgraph construction : 0.65 ( 1%) usr 28545 kB ( 2%) ggc
callgraph optimization : 0.41 ( 0%) usr 6 kB ( 0%) ggc
ipa dead code removal : 0.02 ( 0%) usr 0 kB ( 0%) ggc
ipa inlining heuristics : 0.58 ( 0%) usr 6172 kB ( 0%) ggc
ipa reference : 0.02 ( 0%) usr 0 kB ( 0%) ggc
ipa profile : 0.11 ( 0%) usr 0 kB ( 0%) ggc
ipa pure const : 0.20 ( 0%) usr 0 kB ( 0%) ggc
cfg cleanup : 0.04 ( 0%) usr 0 kB ( 0%) ggc
trivially dead code : 0.05 ( 0%) usr 0 kB ( 0%) ggc
df scan insns : 0.09 ( 0%) usr 0 kB ( 0%) ggc
df multiple defs : 0.03 ( 0%) usr 0 kB ( 0%) ggc
df live regs : 0.13 ( 0%) usr 0 kB ( 0%) ggc
df live&initialized regs: 0.04 ( 0%) usr 0 kB ( 0%) ggc
df reg dead/unused notes: 0.17 ( 0%) usr 2440 kB ( 0%) ggc
register information : 0.01 ( 0%) usr 0 kB ( 0%) ggc
alias analysis : 0.05 ( 0%) usr 1546 kB ( 0%) ggc
alias stmt walking : 27.43 (21%) usr 19006 kB ( 1%) ggc
rebuild jump labels : 0.03 ( 0%) usr 0 kB ( 0%) ggc
preprocessing : 0.63 ( 0%) usr 8732 kB ( 1%) ggc
parser (global) : 0.30 ( 0%) usr 80513 kB ( 5%) ggc
parser struct body : 0.36 ( 0%) usr 20184 kB ( 1%) ggc
parser enumerator list : 0.03 ( 0%) usr 1004 kB ( 0%) ggc
parser function body : 3.52 ( 3%) usr 253532 kB (16%) ggc
parser inl. func. body : 0.16 ( 0%) usr 6243 kB ( 0%) ggc
parser inl. meth. body : 0.24 ( 0%) usr 12261 kB ( 1%) ggc
template instantiation : 0.75 ( 1%) usr 36791 kB ( 2%) ggc
early inlining heuristics: 0.74 ( 1%) usr 78738 kB ( 5%) ggc
inline parameters : 0.60 ( 0%) usr 3273 kB ( 0%) ggc
integration : 34.96 (27%) usr 421223 kB (26%) ggc
tree gimplify : 0.93 ( 1%) usr 78917 kB ( 5%) ggc
tree eh : 1.81 ( 1%) usr 147729 kB ( 9%) ggc
tree CFG construction : 0.26 ( 0%) usr 47487 kB ( 3%) ggc
tree CFG cleanup : 0.92 ( 1%) usr 0 kB ( 0%) ggc
tree copy propagation : 0.03 ( 0%) usr 0 kB ( 0%) ggc
tree PTA : 1.80 ( 1%) usr 167 kB ( 0%) ggc
tree PHI insertion : 0.07 ( 0%) usr 519 kB ( 0%) ggc
tree SSA rewrite : 1.63 ( 1%) usr 97983 kB ( 6%) ggc
tree SSA other : 0.13 ( 0%) usr 17 kB ( 0%) ggc
tree SSA incremental : 28.75 (22%) usr 5 kB ( 0%) ggc
tree operand scan : 2.13 ( 2%) usr 65917 kB ( 4%) ggc
dominator optimization : 0.08 ( 0%) usr 2043 kB ( 0%) ggc
tree SRA : 2.65 ( 2%) usr 56210 kB ( 3%) ggc
tree CCP : 2.42 ( 2%) usr 37765 kB ( 2%) ggc
tree split crit edges : 0.11 ( 0%) usr 2953 kB ( 0%) ggc
tree reassociation : 0.04 ( 0%) usr 0 kB ( 0%) ggc
tree FRE : 3.35 ( 3%) usr 35524 kB ( 2%) ggc
tree code sinking : 0.01 ( 0%) usr 0 kB ( 0%) ggc
tree linearize phis : 0.01 ( 0%) usr 6 kB ( 0%) ggc
tree backward propagate : 0.02 ( 0%) usr 0 kB ( 0%) ggc
tree forward propagate : 0.38 ( 0%) usr 8 kB ( 0%) ggc
tree conservative DCE : 0.13 ( 0%) usr 1 kB ( 0%) ggc
tree aggressive DCE : 0.33 ( 0%) usr 2 kB ( 0%) ggc
tree DSE : 0.45 ( 0%) usr 4 kB ( 0%) ggc
tree SSA uncprop : 0.01 ( 0%) usr 0 kB ( 0%) ggc
dominance frontiers : 0.06 ( 0%) usr 0 kB ( 0%) ggc
dominance computation : 0.65 ( 1%) usr 0 kB ( 0%) ggc
out of ssa : 0.09 ( 0%) usr 1 kB ( 0%) ggc
expand vars : 0.02 ( 0%) usr 765 kB ( 0%) ggc
expand : 0.13 ( 0%) usr 13796 kB ( 1%) ggc
post expand cleanups : 0.03 ( 0%) usr 2868 kB ( 0%) ggc
forward prop : 0.08 ( 0%) usr 156 kB ( 0%) ggc
CSE : 0.08 ( 0%) usr 304 kB ( 0%) ggc
dead code elimination : 0.03 ( 0%) usr 0 kB ( 0%) ggc
dead store elim1 : 0.09 ( 0%) usr 763 kB ( 0%) ggc
dead store elim2 : 0.08 ( 0%) usr 613 kB ( 0%) ggc
loop init : 0.15 ( 0%) usr 65 kB ( 0%) ggc
branch prediction : 0.12 ( 0%) usr 19 kB ( 0%) ggc
combiner : 0.10 ( 0%) usr 216 kB ( 0%) ggc
if-conversion : 0.01 ( 0%) usr 0 kB ( 0%) ggc
integrated RA : 0.43 ( 0%) usr 9659 kB ( 1%) ggc
LRA non-specific : 0.26 ( 0%) usr 305 kB ( 0%) ggc
LRA virtuals elimination: 0.03 ( 0%) usr 304 kB ( 0%) ggc
LRA create live ranges : 0.03 ( 0%) usr 152 kB ( 0%) ggc
LRA hard reg assignment : 0.02 ( 0%) usr 0 kB ( 0%) ggc
reload CSE regs : 0.19 ( 0%) usr 916 kB ( 0%) ggc
thread pro- & epilogue : 0.04 ( 0%) usr 14 kB ( 0%) ggc
hard reg cprop : 0.07 ( 0%) usr 0 kB ( 0%) ggc
shorten branches : 0.08 ( 0%) usr 0 kB ( 0%) ggc
final : 0.16 ( 0%) usr 279 kB ( 0%) ggc
initialize rtl : 0.01 ( 0%) usr 12 kB ( 0%) ggc
rest of compilation : 0.31 ( 0%) usr 879 kB ( 0%) ggc
remove unused locals : 2.24 ( 2%) usr 0 kB ( 0%) ggc
address taken : 1.00 ( 1%) usr 37564 kB ( 2%) ggc
rebuild frequencies : 0.02 ( 0%) usr 0 kB ( 0%) ggc
TOTAL : 128.90 1624743 kB
I see some redundancy in the expressions, for example the term:
exp(cd(0,1)*(0. - 7.55323078829979*kx - 2.0238820899708214*ky)) seen in h(0,2) and h(0,3).
-O2 makes the compiler try to detect and reuse such repeated patterns, and with 6k lines of expressions the complexity seems to be too high. You could help gcc by storing the repeated terms in temporary variables. This is equivalent to building a dependency graph yourself and then generating the code from it.
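As a rough sketch of what "helping gcc with temporaries" could look like: the shared exponential is computed once and reused. I am assuming here that cd is an alias for std::complex<double>; the function names, the output parameters h02/h03 and the factors 2.0 and 3.0 are made up for illustration and are not taken from the original code.

```cpp
#include <cassert>
#include <complex>

// Assumption: cd in the question is a typedef for std::complex<double>.
using cd = std::complex<double>;

// Naive form: the same exponential is spelled out twice, so the
// optimizer has to discover the repetition on its own.
// (The factors 2.0 and 3.0 are placeholders, not from the question.)
void fill_naive(double kx, double ky, cd& h02, cd& h03) {
    h02 = 2.0 * std::exp(cd(0, 1) * (0. - 7.55323078829979 * kx - 2.0238820899708214 * ky));
    h03 = 3.0 * std::exp(cd(0, 1) * (0. - 7.55323078829979 * kx - 2.0238820899708214 * ky));
}

// Hoisted form: the shared term is evaluated once into a named temporary.
void fill_hoisted(double kx, double ky, cd& h02, cd& h03) {
    const cd phase = std::exp(cd(0, 1) * (0. - 7.55323078829979 * kx - 2.0238820899708214 * ky));
    h02 = 2.0 * phase;
    h03 = 3.0 * phase;
}

// Sanity check that both forms agree for a given (kx, ky).
void check_same(double kx, double ky) {
    cd a02, a03, b02, b03;
    fill_naive(kx, ky, a02, a03);
    fill_hoisted(kx, ky, b02, b03);
    assert(std::abs(a02 - b02) < 1e-12);
    assert(std::abs(a03 - b03) < 1e-12);
}
```

If the expressions are machine-generated, it is probably easier to have the generator emit the temporaries than to edit 6k lines by hand. Whether this is enough to get -O2 through on the real file is something you would have to measure.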

Code slower despite what gprof says

I've been given some C++ code to optimize, and the first step is to introduce parallelism with OpenMP. I was able to identify several functions that badly needed optimization, so I focused on them.
The problem is that the execution time has roughly doubled, while the profiles seem to tell me that it should be much faster.
Here is the gprof profile I get without OpenMP:
38.07 5.55 5.55 __tcf_0
20.99 8.61 3.06 86196302 0.04 0.04 is_neighbor(int, int, int, int, double)
13.24 10.54 1.93 425940 4.53 4.53 Ellips::data_fiting(double*, int, int, double) const
9.05 11.86 1.32 _fu51___ZSt4cout
5.90 12.72 0.86 5645243 0.15 0.15 Ellips::Ellips()
3.70 13.26 0.54 4013067 0.13 0.13 intersect(Ellips&, Ellips&)
2.40 13.61 0.35 dgemv_
1.44 13.82 0.21 ddot_
1.23 14.00 0.18 141257881 0.00 0.00 Configuration::get_position(int)
1.03 14.15 0.15 __tcf_0
0.82 14.27 0.12 594893 0.20 0.20 Ellips::Ellips(double, double, int, int)
0.41 14.33 0.06 7099 8.45 400.75 Configuration::Configuration(double, double, int, int, int, int, double*, double)
0.34 14.38 0.05 3203279 0.02 0.02 Ellips::operator=(Ellips const&)
0.34 14.43 0.05 ceil
0.21 14.46 0.03 dnrm2_
0.14 14.48 0.02 _fu32___ZSt4cout
0.14 14.50 0.02 dcopy_
0.14 14.52 0.02 dscal_
0.07 14.53 0.01 7775127 0.00 0.00 Configuration::get_Ellips(int)
0.07 14.54 0.01 6239588 0.00 0.00 Ellips::~Ellips()
0.07 14.55 0.01 4349523 0.00 0.00 Configuration::get_data_fit(int)
0.07 14.56 0.01 7097 1.41 1.41 Graph<float, float, float>::maxflow(bool, Block<int>*)
0.07 14.57 0.01 _fu53___ZNSs4_Rep20_S_empty_rep_storageE
0.07 14.58 0.01 floor
0.00 14.58 0.00 432232036 0.00 0.00 Configuration::save_config(std::string)
0.00 14.58 0.00 1180034 0.00 0.00 Ellips::data_fiting(double, double*, double*, double, int, int, double) const
0.00 14.58 0.00 1173980 0.00 0.00 Ellips::get_cx() const
0.00 14.58 0.00 1164513 0.00 0.02 Configuration::add_Ellips(Ellips const&, int, double)
0.00 14.58 0.00 1157360 0.00 0.00 Ellips::get_cy() const
0.00 14.58 0.00 425940 0.00 0.00 shift_cost_exp1(double, double)
0.00 14.58 0.00 23625 0.00 0.00 Graph<float, float, float>::augment(Graph<float, float, float>::arc*)
0.00 14.58 0.00 22504 0.00 0.00 Graph<float, float, float>::process_sink_orphan(Graph<float, float, float>::node*)
0.00 14.58 0.00 21293 0.00 27.35 Configuration::operator=(Configuration const&)
0.00 14.58 0.00 14203 0.00 0.23 Configuration::~Configuration()
0.00 14.58 0.00 14196 0.00 0.00 Configuration::get_nb_Ellipses()
0.00 14.58 0.00 7097 0.00 34.30 Configuration::Configuration(Ellips const&, int, double, int)
0.00 14.58 0.00 7097 0.00 0.00 Graph<float, float, float>::maxflow_init()
0.00 14.58 0.00 7097 0.00 0.00 Graph<float, float, float>::reset()
0.00 14.58 0.00 2406 0.00 0.00 Ellips::get_a() const
0.00 14.58 0.00 2406 0.00 0.00 Ellips::get_b() const
0.00 14.58 0.00 2406 0.00 0.00 Ellips::get_theta() const
0.00 14.58 0.00 1137 0.00 0.00 Graph<float, float, float>::process_source_orphan(Graph<float, float, float>::node*)
0.00 14.58 0.00 7 0.00 38.00 Configuration::Configuration(Configuration const&)
0.00 14.58 0.00 3 0.00 0.32 Configuration::Configuration()
0.00 14.58 0.00 2 0.00 0.00 min_max_val(_IplImage*, double&, double&)
0.00 14.58 0.00 1 0.00 0.00 convert_char_to_double(_IplImage*, double*)
0.00 14.58 0.00 1 0.00 0.00 Graph<float, float, float>::reallocate_nodes(int)
0.00 14.58 0.00 1 0.00 0.00 Graph<float, float, float>::Graph(int, int, void (*)(char*))
And here is the one I get with OpenMP (the code is a recursive algorithm with no real "ending"; both profiles were obtained after about 7000 iterations of the main loop):
36.57 4.45 4.45 __tcf_0
25.72 7.58 3.13 86434458 0.04 0.04 is_neighbor(int, int, int, int, double)
12.41 9.09 1.51 _fu51___ZSt4cout
7.97 10.06 0.97 5646276 0.17 0.17 Ellips::Ellips()
4.35 10.59 0.53 4020048 0.13 0.13 intersect(Ellips&, Ellips&)
2.47 10.89 0.30 dgemv_
1.73 11.10 0.21 ddot_
1.64 11.30 0.20 141852099 0.00 0.00 Configuration::get_position(int)
1.15 11.44 0.14 7038 19.89 164.95 Configuration::Configuration(double, double, int, int, int, int, double*, double)
1.07 11.57 0.13 589659 0.22 0.22 Ellips::Ellips(double, double, int, int)
0.99 11.69 0.12 __tcf_0
0.90 11.80 0.11 422280 0.26 0.33 Ellips::data_fiting(double*, int, int, double) const
0.74 11.89 0.09 3208793 0.03 0.03 Ellips::operator=(Ellips const&)
0.41 11.94 0.05 ceil
0.25 11.97 0.03 422280 0.07 0.07 shift_cost_exp1(double, double)
0.25 12.00 0.03 GOMP_parallel_end
0.25 12.03 0.03 _fu53___ZNSs4_Rep20_S_empty_rep_storageE
0.16 12.05 0.02 21110 0.95 32.56 Configuration::operator=(Configuration const&)
0.16 12.07 0.02 7036 2.84 2.84 Graph<float, float, float>::maxflow(bool, Block<int>*)
0.16 12.09 0.02 _fu32___ZSt4cout
0.16 12.11 0.02 daxpy_
0.16 12.13 0.02 dnrm2_
0.08 12.14 0.01 1171018 0.01 0.04 Configuration::add_Ellips(Ellips const&, int, double)
0.08 12.15 0.01 GOMP_parallel_start
0.08 12.16 0.01 dcopy_
0.08 12.17 0.01 dgemm_
0.00 12.17 0.00 432088679 0.00 0.00 Configuration::save_config(std::string)
0.00 12.17 0.00 7813683 0.00 0.00 Configuration::get_Ellips(int)
0.00 12.17 0.00 6235383 0.00 0.00 Ellips::~Ellips()
0.00 12.17 0.00 4360587 0.00 0.00 Configuration::get_data_fit(int)
0.00 12.17 0.00 1187310 0.00 0.00 Ellips::data_fiting(double, double*, double*, double, int, int, double) const
0.00 12.17 0.00 1163572 0.00 0.00 Ellips::get_cx() const
0.00 12.17 0.00 1147536 0.00 0.00 Ellips::get_cy() const
0.00 12.17 0.00 35748 0.00 0.00 Graph<float, float, float>::augment(Graph<float, float, float>::arc*)
0.00 12.17 0.00 33436 0.00 0.00 Graph<float, float, float>::process_sink_orphan(Graph<float, float, float>::node*)
0.00 12.17 0.00 14081 0.00 0.00 Configuration::~Configuration()
0.00 12.17 0.00 14074 0.00 0.00 Configuration::get_nb_Ellipses()
0.00 12.17 0.00 7036 0.00 39.10 Configuration::Configuration(Ellips const&, int, double, int)
0.00 12.17 0.00 7036 0.00 0.00 Graph<float, float, float>::maxflow_init()
0.00 12.17 0.00 7036 0.00 0.00 Graph<float, float, float>::reset()
0.00 12.17 0.00 2424 0.00 0.00 Ellips::get_a() const
0.00 12.17 0.00 2424 0.00 0.00 Ellips::get_b() const
0.00 12.17 0.00 2424 0.00 0.00 Ellips::get_theta() const
0.00 12.17 0.00 2355 0.00 0.00 Graph<float, float, float>::process_source_orphan(Graph<float, float, float>::node*)
0.00 12.17 0.00 7 0.00 44.91 Configuration::Configuration(Configuration const&)
0.00 12.17 0.00 3 0.00 0.37 Configuration::Configuration()
0.00 12.17 0.00 2 0.00 0.00 min_max_val(_IplImage*, double&, double&)
0.00 12.17 0.00 1 0.00 0.00 convert_char_to_double(_IplImage*, double*)
0.00 12.17 0.00 1 0.00 0.00 Graph<float, float, float>::reallocate_nodes(int)
0.00 12.17 0.00 1 0.00 0.00 Graph<float, float, float>::Graph(int, int, void (*)(char*))
Is there a problem with how I'm using the profiler? Or does this come from the code itself? It takes about 12 seconds to complete 1000 iterations without OpenMP, whereas it takes about 31 seconds with OpenMP (measured with omp_get_wtime(), not clock()).