Using gprof with LULESH benchmark - c++

I've been trying to compile and run LULESH benchmark
with gprof but I always get a segmentation fault. I updated these instructions in the Makefile:
CXXFLAGS = -g -pg -O3 -I. -Wall
LDFLAGS = -g -pg -O3
[andrestoga@n01 lulesh2.0.3]$ mpirun -np 8 ./lulesh2.0 -s 16 -p -i 10
mpirun noticed that process rank 2 with PID 30557 on node n01 exited on signal 11 (Segmentation fault).
The gprof web page says the following:
If you are running the program on a system which supports shared
libraries you may run into problems with the profiling support code in
a shared library being called before that library has been fully
initialised. This is usually detected by the program encountering a
segmentation fault as soon as it is run. The solution is to link
against a static version of the library containing the profiling
support code, which for gcc users can be done via the '-static' or
'-static-libgcc' command line option. For example:
gcc -g -pg -static-libgcc myprog.c utils.c -o myprog
I added the -static command line option and still got a segmentation fault.
I found a PDF in which they profiled LULESH by adding the -pg command line option to the Makefile, although they didn't say exactly which changes they made.
Page 11
Make sure all libraries are loaded:
openmpi (which you have done already)
You can try parameters that help you determine whether the problem is your machine's resources. If the machine does not support that number of processes, check how many processes (MPI or not) it supports by looking at the architecture topology. This tells you the correct number of jobs/processes you can launch on the system.
Very quick run:
mpirun -np 1 ./lulesh2.0 -s 1 -p -i 1
Running problem size 1^3 per domain until completion
Num processors: 1
Num threads: 2
Total number of elements: 1
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
cycle = 1, time = 1.000000e-02, dt=1.000000e-02
Run completed:
Problem size = 1
MPI tasks = 1
Iteration count = 1
Final Origin Energy = 4.333329e+02
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 0.000000e+00
TotalAbsDiff = 0.000000e+00
MaxRelDiff = 0.000000e+00
Elapsed time = 0.00 (s)
Grind time (us/z/c) = 518 (per dom) ( 518 overall)
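The advice above about matching the process count to the machine can be sketched in shell. This is a minimal sketch, assuming (as I understand LULESH 2.0) that the number of MPI ranks must be a perfect cube (1, 8, 27, ...); the helper name largest_cube_le is hypothetical and not part of LULESH or Open MPI.

```shell
#!/bin/sh
# Hypothetical helper: print the largest perfect cube that is <= $1.
# Assumes $1 is an integer >= 1.
largest_cube_le() {
    n=$1
    k=1
    # Grow k while the next cube still fits.
    while [ $(( (k + 1) * (k + 1) * (k + 1) )) -le "$n" ]; do
        k=$((k + 1))
    done
    echo $((k * k * k))
}
```

One possible use is np=$(largest_cube_le "$(nproc)") followed by mpirun -np "$np" ./lulesh2.0 -s 16 -p -i 10, so that the rank count never exceeds the number of available cores.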
I'm trying to compile a large model using ifort with the -qopenmp flag:
FC = mpif90
FCFLAGS = -g -m64 -qopenmp -O3 -xHost -fp-model \
precise -convert big_endian -traceback -r8
LFLAGS = -qopenmp
CC = mpicc
CFLAGS = -g -O3
ulimit -s unlimited
mpprun -n 192 master.exe -e "exp1" -f d1 -t 2700
However, when I try and run the model I get:
mpprun info: Starting impi run on 12 node ( 192 rank X 1 th ) for 22
= PID 19619 RUNNING AT n457
(signal 11)
mpprun info: Job terminated with error
Now the thing is, if I compile this model without the OpenMP flag and run it with TotalView, the model executes without error.
I'm trying to find a way to track down what is going wrong. Does anyone have any ideas? Where do I start? How can I do simple tests to see why the OpenMP build exits with a segmentation fault?
I have C code that compiles and runs properly on my local machine.
But when I compile it with icc and the -mmic flag and test it on an Intel Xeon Phi, I get the following message:
/cm/local/apps/sge/current/spool/node079/job_scripts/5438755: line 14: ./sequential.mic: cannot execute binary file
I run all my tests on a cluster that uses the SGE job submission system.
My makefile contains these lines:
sequential: Makefile
icc -mmic -o sequential.mic sequential.c
qsub sequential.job
The job file for submitting the job is:
#$ -S /bin/sh
#$ -l h_rt=00:10:00
#$ -j y
#$ -l fat,accel=XeoPhi
#$ -cwd
. /etc/bashrc
module load intel/compiler/64/13.3/2013.3.163
If I compile it with gcc and submit it to a regular node (XEON 5620),
everything works as expected.
Also, I ran the file command on the MIC executable, and the output is: sequential.mic: ELF 64-bit LSB executable, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.4.0, not stripped
Any suggestions are more than welcome.
As the code needs to run natively on the Intel Xeon Phi, the binary also needs to be loaded onto that machine before execution.
The simplest way to do that is with the following command, which loads the binary and then executes it.
I have written a websocket++ server on Ubuntu 13.10 and am trying to execute it on Linux Mint 16.
I have installed all dependencies, and the first line under main is a cout which never fires.
This is the compile command:
g++ -o Dgn Dgn.cpp ed25519-donna-master/ed25519.o
-lboost_regex -lboost_system -L/usr/lib -lssl -lcrypto -pthread -lpqxx
-lboost_thread -ljson_spirit -lgmp -lgmpxx
If I execute it with sudo to use restricted ports, it fails immediately and returns to the command line without an error.
If I execute it without sudo, it prints Segmentation Fault and returns immediately to the command line.
The directories in ~/Dgn are present on the new system.
I did a quick, simple test and checked to see if a basic websocket++ example could compile and execute normally, and it was successful.
Both systems are 64-bit. The only difference is the distro, but Linux Mint 16 is based on Ubuntu 13.10, and all setup commands were identical.
How can this be compiled so that it can execute on another system?
As a further test, I compiled it on the new system, and it works.
Is it not possible to compile on one system and run on another?
Program received signal SIGSEGV, Segmentation fault.
0x00007ffff7de58da in ?? () from /lib64/
Has anyone tried to use gold instead of ld?
gold promises to be much faster than ld, so it may help speeding up test cycles for large C++ applications, but can it be used as drop-in replacement for ld?
Can gcc/g++ directly call gold?
Are there any known bugs or problems?
Although gold has been part of GNU binutils for a while, I have found almost no "success stories" or even "Howtos" on the Web.
(Update: added links to gold and blog entry explaining it)
At the moment it is compiling bigger projects on Ubuntu 10.04. Here you can install and integrate it easily with the binutils-gold package (if you remove that package, you get your old ld). Gcc will automatically use gold then.
Some experiences:
gold doesn't search in /usr/local/lib
gold doesn't assume libs like pthread or rt, had to add them by hand
it is faster and needs less memory (the latter is important on big C++ projects with a lot of boost etc.)
What does not work: it cannot compile kernel stuff and therefore no kernel modules. Ubuntu does this automatically via DKMS when it updates proprietary drivers like fglrx. This fails with ld-gold (you have to remove gold, restart DKMS, and reinstall ld-gold).
As it took me a little while to find out how to selectively use gold (i.e. not system-wide using a symlink), I'll post the solution here. It's based on .
Make a directory where you can put a gold glue script. I am using ~/bin/gold/.
Put the following glue script there and name it ~/bin/gold/ld:
#!/bin/bash
gold "$@"
Obviously, make it executable: chmod a+x ~/bin/gold/ld.
Change your calls to gcc to gcc -B$HOME/bin/gold which makes gcc look in the given directory for helper programs like ld and thus uses the glue script instead of the system-default ld.
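The steps above can be collected into one small shell function. This is a sketch rather than the answerer's exact script: the function name make_gold_wrapper is hypothetical, and it uses exec so that the wrapper process is replaced by gold instead of waiting on it.

```shell
#!/bin/sh
# Hypothetical helper: create a directory containing an "ld" glue script
# that forwards every argument to gold.
make_gold_wrapper() {
    dir=$1
    mkdir -p "$dir"
    # The quoted heredoc delimiter keeps "$@" literal in the generated file.
    cat > "$dir/ld" <<'EOF'
#!/bin/sh
exec gold "$@"
EOF
    chmod a+x "$dir/ld"
}
```

After calling make_gold_wrapper "$HOME/bin/gold", a build would pass -B"$HOME/bin/gold" to gcc as described above. Builds that omit the -B option keep using the system default ld.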
Can gcc/g++ directly call gold?
Just to complement the answers: there is a gcc option, -fuse-ld=gold (see the gcc documentation). However, AFAIK, gcc can be configured at build time in such a way that the option has no effect.
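Because the option may silently have no effect, it can be worth checking which linker gcc actually invoked. The sketch below assumes that passing -Wl,--version makes the linker print its version banner, and that the banners start with "GNU gold", "GNU ld" or "LLD" respectively; both function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: map the first line of a linker version banner to a short name.
classify_linker() {
    case "$1" in
        *"GNU gold"*) echo "gold" ;;
        *"LLD"*)      echo "lld" ;;
        *"GNU ld"*)   echo "bfd" ;;
        *)            echo "unknown" ;;
    esac
}

# Hypothetical helper: ask gcc which linker it runs with the given extra flags,
# for example: detect_linker -fuse-ld=gold
detect_linker() {
    classify_linker "$(gcc "$@" -Wl,--version 2>/dev/null | head -n 1)"
}
```

If detect_linker -fuse-ld=gold prints bfd, the flag was ignored or gold is not installed.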
Minimal synthetic benchmark: LD vs gold vs LLVM LLD
gold was about 3x to 4x faster for all values I've tried when using -Wl,--threads -Wl,--thread-count=$(nproc) to enable multithreading
LLD was about 2x faster than gold!
Tested on:
Ubuntu 20.04, GCC 9.3.0, binutils 2.34, LLD 10 (installed with sudo apt install lld)
Lenovo ThinkPad P51 laptop, Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB), Samsung MZVLB512HAJQ-000L7 SSD (3,000 MB/s).
Simplified description of the benchmark parameters:
1: number of object files providing symbols
2: number of symbols per symbol provider object file
3: number of object files using all provided symbols
Results for different benchmark parameters:
10000 10 10
nogold: wall=4.35s user=3.45s system=0.88s 876820kB
gold: wall=1.35s user=1.72s system=0.46s 739760kB
lld: wall=0.73s user=1.20s system=0.24s 625208kB
1000 100 10
nogold: wall=5.08s user=4.17s system=0.89s 924040kB
gold: wall=1.57s user=2.18s system=0.54s 922712kB
lld: wall=0.75s user=1.28s system=0.27s 664804kB
100 1000 10
nogold: wall=5.53s user=4.53s system=0.95s 962440kB
gold: wall=1.65s user=2.39s system=0.61s 987148kB
lld: wall=0.75s user=1.30s system=0.25s 704820kB
10000 10 100
nogold: wall=11.45s user=10.14s system=1.28s 1735224kB
gold: wall=4.88s user=8.21s system=0.95s 2180432kB
lld: wall=2.41s user=5.58s system=0.74s 2308672kB
1000 100 100
nogold: wall=13.58s user=12.01s system=1.54s 1767832kB
gold: wall=5.17s user=8.55s system=1.05s 2333432kB
lld: wall=2.79s user=6.01s system=0.85s 2347664kB
100 1000 100
nogold: wall=13.31s user=11.64s system=1.62s 1799664kB
gold: wall=5.22s user=8.62s system=1.03s 2393516kB
lld: wall=3.11s user=6.26s system=0.66s 2386392kB
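The speedup factors quoted earlier can be recomputed from the wall times in these results. A small sketch, with a hypothetical helper name, using awk for the floating-point division:

```shell
#!/bin/sh
# Hypothetical helper: print baseline_time / new_time with two decimals.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", base / new }'
}

speedup 4.35 1.35   # GNU ld vs gold, parameters 10000 10 10: prints 3.22
speedup 1.35 0.73   # gold vs lld, same parameters: prints 1.85
```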
This is the script that generates all the objects for the link tests:
#!/usr/bin/env bash
set -eu
# CLI args. The default values of 10 are assumptions; the originals are not shown.
# Each of those files contains n_ints_per_file ints.
n_int_files="${1:-10}"
n_ints_per_file="${2:-10}"
# Each function adds all ints from all files.
# This leads to n_int_files x n_ints_per_file x n_funcs relocations.
n_funcs="${3:-10}"
# Do a debug build, since it is for debug builds that link time matters the most,
# as the user will be recompiling often.
cflags='-ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic'
# Cleanup previous generated files objects.
# (Assumed command; the original cleanup step is not shown.)
rm -f ./*.o ./[0-9]*.c ./f_*.c main.c
# Generate 0.c, 1.c, ..., ints.h and int_sum.h
rm -f ints.h
echo 'return' > int_sum.h
int_file_i=0
while [ "$int_file_i" -lt "$n_int_files" ]; do
  int_i=0
  int_file="${int_file_i}.c"
  rm -f "$int_file"
  while [ "$int_i" -lt "$n_ints_per_file" ]; do
    echo "${int_file_i} ${int_i}"
    int_sym="i_${int_file_i}_${int_i}"
    echo "unsigned int ${int_sym} = ${int_file_i};" >> "$int_file"
    echo "extern unsigned int ${int_sym};" >> ints.h
    echo "${int_sym} +" >> int_sum.h
    int_i=$((int_i + 1))
  done
  int_file_i=$((int_file_i + 1))
done
echo '1;' >> int_sum.h
# Generate funcs.h and main.c.
rm -f funcs.h
cat <<EOF >main.c
#include "funcs.h"
int main(void) {
return
EOF
i=0
while [ "$i" -lt "$n_funcs" ]; do
  func_sym="f_${i}"
  echo "${func_sym}() +" >> main.c
  echo "int ${func_sym}(void);" >> funcs.h
  cat <<EOF >"${func_sym}.c"
#include "ints.h"
int ${func_sym}(void) {
#include "int_sum.h"
}
EOF
  i=$((i + 1))
done
cat <<EOF >>main.c
1;
}
EOF
# Generate *.o
ls | grep -E '\.c$' | parallel --halt now,fail=1 -t --will-cite "gcc $cflags -c -o '{.}.o' '{}'"
GitHub upstream.
Note that the object file generation can be quite slow, since each C file can be quite large.
Given an input of type:
./generate-objects [n_int_files [n_ints_per_file [n_funcs]]]
it generates:
main.c:
#include "funcs.h"
int main(void) {
    return f_0() + f_1() + ... + f_<n_funcs>();
}
f_0.c, f_1.c, ..., f_<n_funcs>.c:
extern unsigned int i_0_0;
extern unsigned int i_0_1;
...
extern unsigned int i_1_0;
extern unsigned int i_1_1;
...
extern unsigned int i_<n_int_files>_<n_ints_per_file>;
int f_0(void) {
    return
    i_0_0 +
    i_0_1 +
    ...
    i_1_0 +
    i_1_1 +
    ...
    1;
}
0.c, 1.c, ..., <n_int_files>.c:
unsigned int i_0_0 = 0;
unsigned int i_0_1 = 0;
...
unsigned int i_0_<n_ints_per_file> = 0;
which leads to:
n_int_files x n_ints_per_file x n_funcs
relocations on the link.
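As a quick check of that formula against the benchmark parameters listed in the results, here is a small arithmetic sketch (the helper name is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: relocation count for one benchmark configuration,
# computed as n_int_files x n_ints_per_file x n_funcs.
relocations() {
    echo $(( $1 * $2 * $3 ))
}

relocations 10000 10 10    # prints 1000000
relocations 100 1000 10    # prints 1000000
relocations 10000 10 100   # prints 10000000
```

So the first three parameter sets in the results all produce one million relocations, and the last three produce ten million. They differ only in how the symbols are spread across object files.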
Then I compared:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main *.o
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -fuse-ld=gold -Wl,--threads -Wl,--thread-count=`nproc` -o main *.o
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -fuse-ld=lld -o main *.o
Some limits I've been trying to mitigate when selecting the test parameters:
at 100k C files, both methods get failed mallocs occasionally
GCC cannot compile a function with 1M additions
I have also observed a 2x in the debug build of gem5:
Similar question:
Phoronix benchmarks
Phoronix did some benchmarking in 2017 for some real world projects, but for the projects they examined, the gold gains were not so significant: (archive).
Known incompatibilities
gold:
gold failed if I do a partial link with LD and then try the final link with gold. lld worked on the same test case.
my debug symbols appeared broken in some places
LLD benchmarks
The LLD project gives build times for a few well-known projects, with results similar to my synthetic benchmarks. Unfortunately, project and linker versions are not given. In their results:
gold was about 3x/4x faster than LD
LLD was 3x/4x faster than gold, so a greater speedup than in my synthetic benchmark
They comment:
This is a link time comparison on a 2-socket 20-core 40-thread Xeon E5-2680 2.80 GHz machine with an SSD drive. We ran gold and lld with or without multi-threading support. To disable multi-threading, we added -no-threads to the command lines.
and results look like:
Program | Size | GNU ld | gold -j1 | gold | lld -j1 | lld
ffmpeg dbg | 92 MiB | 1.72s | 1.16s | 1.01s | 0.60s | 0.35s
mysqld dbg | 154 MiB | 8.50s | 2.96s | 2.68s | 1.06s | 0.68s
clang dbg | 1.67 GiB | 104.03s | 34.18s | 23.49s | 14.82s | 5.28s
chromium dbg | 1.14 GiB | 209.05s | 64.70s | 60.82s | 27.60s | 16.70s
As a Samba developer, I have been using the gold linker almost exclusively on Ubuntu, Debian, and Fedora for several years now. My assessment:
gold is many times faster than the classical linker (it feels like 5-10 times).
Initially, there were a few problems, but they have been gone since roughly Ubuntu 12.04.
The gold linker even found some dependency problems in our code, since it seems to be more correct than the classical one with respect to some details. See, e.g. this Samba commit.
I have not used gold selectively, but have been using symlinks or the alternatives mechanism if the distribution provides it.
You could link ld to gold (in a local binary directory if you have ld installed to avoid overwriting):
ln -s `which gold` ~/bin/ld
ln -s `which gold` /usr/local/bin/ld
Some projects seem to be incompatible with gold because of differences in behavior between ld and gold. Example: OpenFOAM, see .
DragonFlyBSD switched over to gold as their default linker. So it seems to be ready for a variety of tools.
I just built a cross compiler using crosstools "mips-unknown-linux-gnu-gcc" and I compiled a hello world program. The compilation went fine using the command: "mips-unknown-linux-gnu-g++ hello.cpp -o hello" but when I run the command "./hello" I get the following error:
babbage-dasnyder 50% mips-unknown-linux-gnu-g++ hello.cpp -o hello
babbage-dasnyder 51% ./hello
./hello: Exec format error. Wrong Architecture.
Why is this? Did I make the wrong cross-compiler? I'm running this on a linux machine.
Just as a note, crosstools did say it could run a trivial program:
+ /home/seas/grad/dasnyder/opt/crosstool/gcc-3.4.5-glibc-2.3.6/mips-unknown-linux-gnu/bin/mips-unknown-linux-gnu-gcc -static hello.c -o mips-unknown-linux-gnu-hello-static
+ /home/seas/grad/dasnyder/opt/crosstool/gcc-3.4.5-glibc-2.3.6/mips-unknown-linux-gnu/bin/mips-unknown-linux-gnu-gcc hello.c -o mips-unknown-linux-gnu-hello
+ test -x /home/seas/grad/dasnyder/opt/crosstool/gcc-3.4.5-glibc-2.3.6/mips-unknown-linux-gnu/bin/mips-unknown-linux-gnu-g++
+ cat
+ /home/seas/grad/dasnyder/opt/crosstool/gcc-3.4.5-glibc-2.3.6/mips-unknown-linux-gnu/bin/mips-unknown-linux-gnu-g++ -static -o mips-unknown-linux-gnu-hello2-static
+ /home/seas/grad/dasnyder/opt/crosstool/gcc-3.4.5-glibc-2.3.6/mips-unknown-linux-gnu/bin/mips-unknown-linux-gnu-g++ -o mips-unknown-linux-gnu-hello2
+ echo testhello: C compiler can in fact build a trivial program.
testhello: C compiler can in fact build a trivial program.
+ test '' = 1
+ test '' = 1
+ test '' = 1
+ test 1 = ''
+ echo Done.
When you cross-compile for a different architecture, you generate instructions for that architecture, so you generally cannot run the resulting binary on your current machine. The point of cross-compiling is to build the code on a more powerful machine and then transfer it to the target device for testing. If you want to test the code directly on your build machine, you need to compile it with your native architecture's compiler.
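One way to see this mismatch before trying to run a binary is to read the target architecture from its ELF header. The sketch below is a simplified, hypothetical stand-in for what the file command does: it reads the e_machine field (bytes 18 and 19 of the header) with od and maps a few well-known values. The function name and the short list of machine codes are my own choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the target architecture recorded in an ELF header.
elf_machine() {
    # First 20 header bytes as unsigned decimal numbers, one per positional parameter.
    set -- $(od -An -v -tu1 -N20 "$1" 2>/dev/null)
    # An ELF file starts with the bytes 0x7f 'E' 'L' 'F'.
    if [ "$#" -lt 20 ] || [ "$1 $2 $3 $4" != "127 69 76 70" ]; then
        echo "not-elf"
        return 1
    fi
    # EI_DATA is header byte 5 (here $6): 1 = little-endian, 2 = big-endian.
    if [ "$6" = 2 ]; then
        machine=$(( ${19} * 256 + ${20} ))
    else
        machine=$(( ${19} + ${20} * 256 ))
    fi
    case "$machine" in
        3)   echo "i386" ;;
        8)   echo "mips" ;;
        40)  echo "arm" ;;
        62)  echo "x86_64" ;;
        181) echo "k1om" ;;
        183) echo "aarch64" ;;
        *)   echo "unknown($machine)" ;;
    esac
}
```

Running elf_machine ./hello on the cross-compiled binary and comparing the result with uname -m on the build machine shows whether the kernel can execute the file directly.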