I use the SPEC 2006 benchmarks to measure the performance of a few private passes and of the generated code. So far, all my work has been based on LLVM 3.1.
Last week I wanted to move forward with LLVM, so I rebased all of my passes onto LLVM 3.4. This required adjusting only a few #include lines. After that change, however, I measured a huge performance regression. I then disabled all of my custom passes, but performance was still down. Here are the numbers for a few SPEC benchmarks compiled with LLVM 3.1 and 3.4:
Benchmark   LLVM 3.1   LLVM 3.4
---------   --------   --------
401.bzip2   7.50       17.5
429.mcf     4.72       8.10
456.hmmer   4.18       10.1
458.sjeng   4.85       8.86
433.milc    13.4       26.0
470.lbm     13.4       12.6
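To put the regression in perspective, the figures can be turned into ratios. This is a minimal sketch in Python, assuming the numbers are run times (lower is better), which the post does not state explicitly. The names RESULTS and slowdown are made up for illustration.

```python
# Figures copied from the table above: (LLVM 3.1, LLVM 3.4).
# Assumption: these are run times, so a ratio above 1.0 means a slowdown.
RESULTS = {
    "401.bzip2": (7.50, 17.5),
    "429.mcf": (4.72, 8.10),
    "456.hmmer": (4.18, 10.1),
    "458.sjeng": (4.85, 8.86),
    "433.milc": (13.4, 26.0),
    "470.lbm": (13.4, 12.6),
}


def slowdown(old, new):
    """Return new/old; values above 1.0 mean the newer build is slower."""
    return new / old


if __name__ == "__main__":
    for name, (v31, v34) in RESULTS.items():
        print(f"{name:<10} {slowdown(v31, v34):.2f}x")
```

Under that assumption, five of the six benchmarks slow down by a factor of roughly 1.7 to 2.4, while 470.lbm gets slightly faster.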
I also noticed that the number of functions called by, e.g., 456.hmmer went from 6 to 45009; most other benchmarks show similar numbers. Did inlining fail?
I compile the Spec benchmarks with
CC=clang -g -std=c89 -D_GNU_SOURCE -c -emit-llvm
and then call opt with
opt -simplifycfg -mem2reg -break-constgeps <my-passes> benchmark.bc -o benchmark.schnufte.bc
Once opt is done, the bitcode is lowered with llc and linked into an executable.
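To compare the two releases like for like, the same steps can be driven from one script that only swaps the toolchain directory. The following is a rough sketch, not the setup actually used in the question. The install path, the single-file benchmark.c input, the assembly output name and the final clang link step are all assumptions, and the custom passes are left as an empty placeholder.

```python
import subprocess


def pipeline(bindir, name, extra_passes=()):
    """Return the argv lists for the clang -> opt -> llc -> link steps.

    bindir       -- directory holding one LLVM release's tools (hypothetical)
    name         -- benchmark base name; a single name.c is assumed here
    extra_passes -- flags for the private passes (placeholder, empty by default)
    """
    def tool(t):
        return f"{bindir}/{t}"

    return [
        [tool("clang"), "-g", "-std=c89", "-D_GNU_SOURCE", "-c", "-emit-llvm",
         name + ".c", "-o", name + ".bc"],
        [tool("opt"), "-simplifycfg", "-mem2reg", "-break-constgeps",
         *extra_passes, name + ".bc", "-o", name + ".schnufte.bc"],
        # Assumed lowering and link steps; the post only says "llc" and "linked".
        [tool("llc"), name + ".schnufte.bc", "-o", name + ".s"],
        [tool("clang"), name + ".s", "-o", name],
    ]


if __name__ == "__main__":
    # Hypothetical install location; adjust to the local setup.
    for cmd in pipeline("/opt/llvm-3.4/bin", "benchmark"):
        subprocess.run(cmd, check=True)
```

Running the script once per release directory keeps every flag identical, so any remaining difference comes from the toolchain itself.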
The performance regression is very noticeable, and I doubt that 3.4 would have been released with a regression as obvious as that. So my question is: did I miss something? Do optimizations now have to be invoked more explicitly? I added -std-compile-opts to my opt call after my passes, but that didn't help at all.
EDIT: As suggested, I posted the question to the LLVMdev mailing list, but the thread has received little attention.
Problem
llc is giving me the following error:
LLVM ERROR: unsupported relocation on symbol
Detailed compilation flow
I am implementing an LLVM frontend for a middle-level IR (MIR) of a compiler. After converting various methods into many bitcode files, I link them (llvm-link), optimize them (opt), convert them to machine code (llc), turn them into a shared library (clang, for its linker wrapper), and load them dynamically.
The llc step fails for some of the methods that I am compiling!
Step 1: llvm-link: Merge many bitcode files
I may have many functions calling each other, so I llvm-link the different bitcode files that might interact with each other. This step has no issues. Example:
llvm-link function1.bc function2.bc -o lnk.bc
Step 2: opt: Run optimization passes
For now I am using the following:
opt -O3 lnk.bc -o opt.bc
This step proceeds with no issues, but it is the one that CAUSES the problem!
It is also necessary, because in the future I will need this step to run extra passes, e.g. loop-unroll.
Step 3: llc: Generate machine code (PIC)
I am using the following command:
llc -march=thumb -arm-reserve-r9 -mcpu=cortex-a9 -filetype=obj -relocation-model pic opt.bc -o obj.o
I have kept the arch-specific flags I've set, just in case they contribute to the issue. I am using position-independent code because in the next step I will be building a shared object.
This command fails with the error quoted at the top of this question.
Step 4: clang: Generate Shared Object
For the cases where Step 3 fails, this step isn't reached.
If llc succeeds, this step succeeds too!
Additional information
Configuration
All of the following runs on LLVM 3.6, on an ARM device.
Things I've noticed
If I omit -O3 (or any other level) from the opt step, then llc works.
If I keep it on opt and instead omit optimization flags from llc, llc still fails, which makes me think that opt -O<level> is causing the issue.
If I run llc directly (skipping opt), it works, but then I can't run the specific passes that opt allows, so this is not an option for me.
I've faced this issue ONLY with 2 of the functions that I've compiled so far (from their original MIR), both of which use loops. The others produce working code!
If I don't use the pic model with llc, it can generate obj.o, but then I have problems creating an .so from it!
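The observations above suggest checking each opt level in turn to see where llc starts to fail. Here is a small sketch of such a check, reusing the file names and llc flags from the steps above. The helper names are hypothetical, and the level list is only an assumption about what is worth trying.

```python
import subprocess

# llc flags copied from Step 3.
LLC_FLAGS = ["-march=thumb", "-arm-reserve-r9", "-mcpu=cortex-a9",
             "-filetype=obj", "-relocation-model=pic"]


def commands_for_level(level):
    """Build the opt and llc argv lists for one optimization level.

    level is a flag such as "-O3", or None to run opt without any level.
    """
    opt_cmd = ["opt"] + ([level] if level else []) + ["lnk.bc", "-o", "opt.bc"]
    llc_cmd = ["llc"] + LLC_FLAGS + ["opt.bc", "-o", "obj.o"]
    return opt_cmd, llc_cmd


def llc_succeeds(level):
    """Run opt at the given level, then report whether llc exits cleanly."""
    opt_cmd, llc_cmd = commands_for_level(level)
    subprocess.run(opt_cmd, check=True)
    return subprocess.run(llc_cmd).returncode == 0


if __name__ == "__main__":
    for level in (None, "-O1", "-O2", "-O3"):
        status = "ok" if llc_succeeds(level) else "llc failed"
        print(level or "(no level)", status)
```

Knowing the lowest failing level narrows down which group of passes to look at when reducing the problem for a bug report.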
Questions
Why is this happening?
Why does opt have a -relocation-model option? Isn't that supposed to be just an llc thing? I've tried setting it to pic on both opt and llc, but it still fails.
I am using clang because it wraps a linker that produces the .so. Is there a way to do this step with an LLVM tool instead?
First of all, use neither llc nor opt. These are developer-side tools that should never be used in a production environment. Instead, implement your own proper optimization and code generation runtime on top of the LLVM libraries.
As for this particular bug: the Thumb code generator might contain some bugs. Please reduce the problem and report it. Or don't use Thumb mode at all :)
The Situation
We have a board with a TI DM3730 processor (also known from the Beagleboard) with a Cortex A8 core (r3p2) in use with the following parameters:
Beagleboard Reference Design: Beagleboard-xM Rev-C
Kernel version: 3.2.8
Open CV library: 2.4.6
U-Boot: uboot-2013.04
Toolchain: Sourcery CodeBench ARM 2011.03
Buildroot: 2012.02
The setup is derived from this blog
We have written a program (in C++, compiled with GCC 4.5.2) that uses the OpenCV library (to calculate some scores using support vector machines) and that behaves strangely:
The program runs on the board in its own process using defined test data: It produces repeatedly correct results.
The program runs in two or more processes (with the same defined test data): The results start to become wrong for each process, processes die with segfaults. The last remaining process runs correctly again.
The program runs in its own process (with the same defined test data again). Additionally, another process changes some exposure settings of an attached camera: The program starts to produce wrong results.
So we assume this is a very low level floating point problem.
What we tried
The complete system (all libraries, kernel, boot loader, etc.) has been compiled with the compiler flags suggested on pandorawiki.org regarding Floating_Point_Optimization:
-O3 -mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp
-ffast-math -fsingle-precision-constant
We tried to enable L1NEON in Cortex-A8 aux ctrl register according to the Beagle board FAQ and tried the other options mentioned there as well, but unfortunately to no avail.
All three different behaviors are reproducible, but not in the form of a minimal working example.
The same program source and the first and second scenario run correctly on Windows (using Visual Studio) and on a desktop running Linux (GCC), so it's probably not something our code does.
So the questions are now:
Are there any other known bugs with this setup and floating point operations which we are not aware of?
Are there any known compiler options which should be set or omitted which can lead to the observed results?
If an MWE would be helpful, we will look into providing one.
Any clues are welcome.
OK, we now use an up-to-date buildroot (2014.08) with the included toolchain (arm-buildroot-linux-uclibcgnueabi-), Linux kernel 3.9.11, Boost 1.55, Qt 4.8.6, and still OpenCV 2.4.6.
When compiling, we optimize for size (-Os), and the only other flag we pass is -pipe.
The following compiler flags are no longer used:
-mcpu=cortex-a8 -mfpu=neon -ftree-vectorize -mfloat-abi=softfp -ffast-math -fsingle-precision-constant
Unfortunately, we still don't know the exact reason for the original problem, but we are quite happy that the problem went away with this setup.
So maybe this answer helps some poor soul in the future... ;)
I know that the frontend (such as llvm-clang or llvm-gcc) also performs some optimizations when lowering source code to the IR level.
But which optimizations does the frontend perform? Is there a list or a document I can check?
Thanks.
You can print remarks from the optimization passes that the code goes through by using:
clang -O2 -Rpass='.*' code.cc -o code
The quotes keep the shell from expanding the pattern. This prints the remarks emitted by each optimization pass that transformed the code when, for example, -O2 is used with clang.
See this link for more details: http://clang.llvm.org/docs/UsersManual.html#options-to-emit-optimization-reports
Is it possible to generate interleaved source and assembly from clang?
I am looking for something equivalent to this gcc command (as demonstrated at http://www.fclose.com/240/generate-a-mixed-source-and-assembly-listing-using-gcc/):
gcc -Wa,-adhln -g source_code.c > assembly_list.s
I have read "How do you get assembler output from C/C++ source in gcc?", but it only gets as far as listing the assembly, with no interleaving.
Visual Studio also gives you pretty nice interleaved assembly output; details are in "How to view the assembly behind the code using Visual C++?"
Thank you for all the help.
Sarang
There seems to be a bug, reported sometime last year, stating exactly this: http://llvm.org/bugs/show_bug.cgi?id=16647
Bug 16647 - No option to produce mixed source + assembly listing?
Since it is still marked NEW, I guess clang does not support this yet.
As an alternative, how about compiling your code and then using objdump -S? The output format is somewhat similar.
As of August 2016, the bug that @dragosht mentioned is still open. However, the linked bug 17465 offers a workaround: clang -no-integrated-as -Xassembler -adhln. It disables clang's integrated assembler and calls an external assembler, which hopefully supports the listing-generating options.
That works OK in Linux, but it doesn't work in Mac OS X (as of 10.11.6). The problem is that even the external assembler in OS X does not support the listing-generating options - you can check that with man as.
objdump -S is an alternative that also works well in Linux, but Mac OS X's alternative to objdump is otool, which does provide disassembly but not source interlacing. Hopefully that will change soon-ish, because otool seems to be on its way out while llvm grows its own objdump. See man llvm-otool.
Finally, for OS X the best option seems to be using gobjdump -S, from binutils. It can be installed with MacPorts or brew.
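The platform split described above can be captured in a small helper. This is a sketch only: the function name is made up, it assumes gobjdump has already been installed on OS X, and the object file must have been compiled with -g for objdump -S to find any source lines.

```python
import platform
import subprocess


def interleave_command(object_file, system=None):
    """Return the argv list that prints source interleaved with disassembly.

    system defaults to platform.system(); "Darwin" selects gobjdump from
    binutils, anything else falls back to plain objdump.
    """
    system = system or platform.system()
    tool = "gobjdump" if system == "Darwin" else "objdump"
    return [tool, "-S", object_file]


if __name__ == "__main__":
    # Hypothetical object file, built beforehand with e.g. clang -g -c.
    subprocess.run(interleave_command("source_code.o"), check=True)
```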
You can generate assembly code from a .cc/.cpp source file with this command (-S already stops after the assembly stage, so -c is not needed):
clang++ -S test-function.cc
Which information does GCC collect when I enable -fprofile-generate, and which optimizations actually use the collected information (when setting the -fprofile-use flag)?
I need citations here. I've searched for a while but didn't find anything documented.
Information regarding link-time optimization (LTO) would be a plus! =D
-fprofile-generate enables -fprofile-arcs, -fprofile-values and -fvpt.
-fprofile-use enables -fbranch-probabilities, -fvpt, -funroll-loops, -fpeel-loops and -ftracer
Source: http://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Optimize-Options.html#Optimize-Options
PS. Information about LTO also on that page.
"What Every Programmer Should Know About Memory" by Ulrich Drepper
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
http://www.akkadia.org/drepper/cpumemory.pdf
In section 7.4
Compiling with -fprofile-generate instruments each object file with the same arc counters that gcov coverage reports use (the .gcno notes file itself comes from -ftest-coverage).
Then you must run a few tests; at run time, the program records the counts into .gcda files.
Recompile with -fprofile-use: the compiler reads the collected data and infers whether a branch is likely (__builtin_expect(.., 1)) or unlikely (__builtin_expect(.., 0)).
The result should run faster as it should be better at prefetching code into the processor instruction cache.
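The three phases (instrumented build, training run, optimized rebuild) can be sketched as a small driver. This is an illustration under assumptions: the file names, the -O2 level and the helper name are hypothetical. The object file keeps the same name in both builds so that the second compile can find the matching .gcda file.

```python
import subprocess


def pgo_commands(source, obj, binary, workload_args=()):
    """Return the argv lists for a profile-guided build with gcc.

    source, obj and binary are hypothetical file names; workload_args are
    the arguments of a representative training run.
    """
    return [
        # 1. Instrumented build (compile and link both need the flag).
        ["gcc", "-O2", "-fprofile-generate", "-c", source, "-o", obj],
        ["gcc", "-fprofile-generate", obj, "-o", binary],
        # 2. Training run; the profile data is written when the program exits.
        ["./" + binary, *workload_args],
        # 3. Rebuild using the collected profile.
        ["gcc", "-O2", "-fprofile-use", "-c", source, "-o", obj],
        ["gcc", obj, "-o", binary],
    ]


if __name__ == "__main__":
    for cmd in pgo_commands("prog.c", "prog.o", "prog", ["test-input.txt"]):
        subprocess.run(cmd, check=True)
```

The training run should resemble real usage, since the branch probabilities are only as good as the workload that produced them.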