I'm processing RGB images and doing the same thing for each channel (R, G, and B), so I've been looking for parallel functions that could help me improve my code and run it (3x?) faster. Right now I'm using the forEach function like so:
unsigned char lut[256];
for (int i = 0; i < 256; i++)
    lut[i] = cv::saturate_cast<uchar>(pow((float)(i / 255.0), fGamma) * 255.0f); // pow: raise to the gamma exponent

dst.forEach<cv::Vec3b>( // dst is an RGB image
    [&lut](cv::Vec3b &pixel, const int *position) -> void
    {
        pixel[0] = lut[pixel[0]];
        pixel[1] = lut[pixel[1]];
        pixel[2] = lut[pixel[2]];
    }
);
But when I use htop to see the number of threads running, I only find one or two threads active.
Am I doing something wrong, or isn't forEach supposed to run multithreaded? Do you have any resources that could help me get started with multithreaded computation?
I'm running my code on ubuntu with this:
g++ -std=c++1z -Wall -Ofast -march=native test3.cpp -o test3 `pkg-config --cflags --libs opencv`
Have you taken a look at TBB already? Threading Building Blocks is an Apache-licensed library for parallel computing which you can use just by compiling OpenCV with the flag -D WITH_TBB=ON
See this example of parallel_for: http://www.jayrambhia.com/blog/opencv-with-tbb
If you decide to adopt TBB follow these steps:
1 - Rebuild OpenCV with TBB support. If you are running on a Linux machine just do:
cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D WITH_TBB=ON -D BUILD_TBB=ON ..
2 - Rewrite your program to use TBB
See the answers to Simplest TBB example, focusing on the most recent ones.
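Before rebuilding OpenCV, you can sanity-check that the workload parallelizes at all. The row-splitting idea behind parallel_for can be sketched with plain std::thread; this is a minimal sketch, not OpenCV's API: the raw byte buffer stands in for cv::Mat, and applyLutParallel is a made-up helper name.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Apply a 256-entry LUT to an interleaved RGB byte buffer, splitting
// rows across hardware threads. The buffer stands in for a cv::Mat;
// applyLutParallel is a made-up name for illustration.
void applyLutParallel(std::vector<std::uint8_t> &img, int rows, int cols,
                      const std::uint8_t lut[256])
{
    const unsigned nThreads =
        std::max(1u, std::thread::hardware_concurrency());
    const int rowsPerThread = (rows + int(nThreads) - 1) / int(nThreads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        const int r0 = int(t) * rowsPerThread;
        const int r1 = std::min(rows, r0 + rowsPerThread);
        if (r0 >= r1)
            break;
        // Each worker owns a disjoint band of rows, so no locking is needed.
        workers.emplace_back([&img, cols, r0, r1, lut] {
            for (int i = r0 * cols * 3; i < r1 * cols * 3; ++i)
                img[i] = lut[img[i]];
        });
    }
    for (auto &w : workers)
        w.join();
}
```

If htop shows all cores busy with a sketch like this but not with forEach, the OpenCV build most likely lacks a parallel backend (TBB, OpenMP, etc.), which the cmake flags above enable.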
Related
I'm trying to use packages that require Rcpp in R on my M1 Mac, which I was never able to get up and running after purchasing this computer. I updated it to Monterey in the hope that this would fix some installation issues but it hasn't. I tried running the Rcpp check from this page but I get the following error:
> Rcpp::sourceCpp("~/github/helloworld.cpp")
ld: warning: directory not found for option '-L/opt/R/arm64/gfortran/lib/gcc/aarch64-apple-darwin20.2.0/11.0.0'
ld: warning: directory not found for option '-L/opt/R/arm64/gfortran/lib'
ld: library not found for -lgfortran
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make: *** [sourceCpp_4.so] Error 1
clang++ -arch arm64 -std=gnu++14 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -I../inst/include -I"/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/library/RcppArmadillo/include" -I"/Users/afredston/github" -I/opt/R/arm64/include -fPIC -falign-functions=64 -Wall -g -O2 -c helloworld.cpp -o helloworld.o
clang++ -arch arm64 -std=gnu++14 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/opt/R/arm64/lib -o sourceCpp_4.so helloworld.o -L/Library/Frameworks/R.framework/Resources/lib -lRlapack -L/Library/Frameworks/R.framework/Resources/lib -lRblas -L/opt/R/arm64/gfortran/lib/gcc/aarch64-apple-darwin20.2.0/11.0.0 -L/opt/R/arm64/gfortran/lib -lgfortran -lemutls_w -lm -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
Error in Rcpp::sourceCpp("~/github/helloworld.cpp") :
Error 1 occurred building shared library.
I get that it can't "find" gfortran. I installed this release of gfortran for Monterey. When I type which gfortran into Terminal, it returns /opt/homebrew/bin/gfortran. (Maybe this version of gfortran requires Xcode tools that are too new—it says something about 13.2 and when I run clang --version it says 13.0—but I don't see another release of gfortran for Monterey?)
I also appended /opt/homebrew/bin: to PATH in R so it looks like this now:
> Sys.getenv("PATH")
[1] "/opt/homebrew/bin:/usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/Library/TeX/texbin:/Applications/RStudio.app/Contents/MacOS/postback"
Other things I checked:
Xcode command line tools are installed (which clang returns /usr/bin/clang).
Files ~/.R/Makevars and ~/.Renviron don't exist.
Here's my session info:
R version 4.1.1 (2021-08-10)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.1.1 tools_4.1.1 RcppArmadillo_0.10.7.5.0
[4] Rcpp_1.0.7
Background
Currently (2023-02-20), CRAN builds R 4.2 binaries for Apple silicon using Apple Clang from Command Line Tools for Xcode 13.1 and using an experimental fork of GNU Fortran 12.
If you obtain R from CRAN (i.e., here), then you need to replicate CRAN's compiler setup on your system before building R packages that contain C/C++/Fortran code from their sources (and before using Rcpp, etc.). This requirement ensures that your package builds are compatible with R itself.
A further complication is the fact that Apple Clang doesn't support OpenMP, so you need to do even more work to compile programs that make use of multithreading. You could circumvent the issue by building R itself, all R packages, and all external libraries from sources with LLVM Clang, which does support OpenMP, but that approach is onerous and "for experts only".
There is another approach that has been tested by a few people, including Simon Urbanek, the maintainer of R for macOS. It is experimental and also "for experts only", but it works on my machine and is much simpler than learning to build R and other libraries yourself.
Instructions for obtaining a working toolchain
Warning: These come with no warranty and could break at any time. Some level of familiarity with C/C++/Fortran program compilation, Makefile syntax, and Unix shells is assumed. Everyone is encouraged to consult official documentation, which is more likely to be maintained than answers on SO. As usual, sudo at your own risk.
I will try to address compilers and OpenMP support at the same time. I am going to assume that you are starting from nothing. Feel free to skip steps you've already taken, though you might find a fresh start helpful.
I've tested these instructions on a machine running Big Sur, but they should also work on Monterey and Ventura.
Download an R 4.2 binary from CRAN here and install. Be sure to select the binary built for Apple silicon.
Run
$ sudo xcode-select --install
in Terminal to install the latest release version of Apple's Command Line Tools for Xcode, which includes Apple Clang. You can obtain earlier versions from your browser here. However, the version that you install should not be older than the one that CRAN used to build your R binary.
Download the GNU Fortran binary provided here and install by unpacking to root:
$ curl -LO https://mac.r-project.org/tools/gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz
$ sudo tar xvf gfortran-12.0.1-20220312-is-darwin20-arm64.tar.xz -C /
$ sudo ln -sfn $(xcrun --show-sdk-path) /opt/R/arm64/gfortran/SDK
The last command updates a symlink inside of the installation so that it points to the SDK inside of your Command Line Tools installation.
Download an OpenMP runtime suitable for your Apple Clang version here and install by unpacking to root. You can query your Apple Clang version with clang --version. For example, I have version 1300.0.29.3, so I did:
$ curl -LO https://mac.r-project.org/openmp/openmp-12.0.1-darwin20-Release.tar.gz
$ sudo tar xvf openmp-12.0.1-darwin20-Release.tar.gz -C /
After unpacking, you should find these files on your system:
/usr/local/lib/libomp.dylib
/usr/local/include/ompt.h
/usr/local/include/omp.h
/usr/local/include/omp-tools.h
Add the following lines to $(HOME)/.R/Makevars, creating the file if necessary.
CPPFLAGS += -I/usr/local/include -Xclang -fopenmp
LDFLAGS += -L/usr/local/lib -lomp
Test that you are able to use R to compile a C or C++ program with OpenMP support while linking relevant libraries from the GNU Fortran installation (indicated by the -l flags in the output of R CMD CONFIG FLIBS).
The most transparent approach is to use R CMD SHLIB directly. In a temporary directory, create a source file omp_test.c containing the following lines:
#ifdef _OPENMP
# include <omp.h>
#endif
#include <Rinternals.h>
SEXP omp_test(void)
{
#ifdef _OPENMP
Rprintf("OpenMP threads available: %d\n", omp_get_max_threads());
#else
Rprintf("OpenMP not supported\n");
#endif
return R_NilValue;
}
Compile it:
$ R CMD SHLIB omp_test.c $(R CMD CONFIG FLIBS)
Then call the compiled C function from R:
$ R -e 'dyn.load("omp_test.so"); invisible(.Call("omp_test"))'
OpenMP threads available: 8
If the compiler or linker throws an error, or if you find that OpenMP is still not supported, then one of us has made a mistake. Please report any issues.
Note that you can implement the same test using Rcpp, if you don't mind installing it:
library(Rcpp)
registerPlugin("flibs", Rcpp.plugin.maker(libs = "$(FLIBS)"))
sourceCpp(code = '
#ifdef _OPENMP
# include <omp.h>
#endif
#include <Rcpp.h>
// [[Rcpp::plugins(flibs)]]
// [[Rcpp::export]]
void omp_test()
{
#ifdef _OPENMP
Rprintf("OpenMP threads available: %d\\n", omp_get_max_threads());
#else
Rprintf("OpenMP not supported\\n");
#endif
return;
}
')
omp_test()
OpenMP threads available: 8
References
Everything is a bit scattered:
R Installation and Administration manual [link]
Writing R Extensions manual [link]
R for macOS Developers web page [link]
I resolved this issue by adding a path to the homebrew installation of gfortran to my ~/.R/Makevars following these instructions: https://pat-s.me/transitioning-from-x86-to-arm64-on-macos-experiences-of-an-r-user/#gfortran
I just avoided the issue until macOS had things working more smoothly, so I either use a Windows Developer Virtual Machine (VM) or run my code development in another environment. I'm not too impressed with the updated and "faster" chipset, given that it doesn't work with much yet. It has been slow to support, and workarounds are often a must.
I tested the following process for making multithreaded data.table work on an M2 MacBook Pro (macOS Monterey).
The steps are mostly the same as in this answer by the user inferator.
Download and install R from CRAN
Download and install RStudio with developer tools
Run the following commands in terminal to install OpenMP
curl -O https://mac.r-project.org/openmp/openmp-12.0.1-darwin20-Release.tar.gz
sudo tar fvxz openmp-12.0.1-darwin20-Release.tar.gz -C /
Add compiler flags to connect clang with OpenMP. In terminal, write the following:
cd ~
mkdir .R
nano .R/Makevars
Inside the opened Makevars file paste the following lines. Once finished, hit control+O and then Enter to save. Do a control+X to close the editor.
CPPFLAGS += -Xclang -fopenmp
LDFLAGS += -lomp
Download and run the installer for gfortran by downloading gfortran-ARM-12.1-Monterey.dmg from the respective GitHub repo
This concludes the steps for enabling OpenMP and (hopefully) Rcpp in R under an M2 chip system.
Now, for testing that everything works with data.table I did the following
Open RStudio and run
install.packages("data.table", type = "source")
If everything is done correctly, the package should compile without any errors and return the following when running getDTthreads(verbose = TRUE):
OpenMP version (_OPENMP) 201811
omp_get_num_procs() 8
R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
R_DATATABLE_NUM_THREADS unset
R_DATATABLE_THROTTLE unset (default 1024)
omp_get_thread_limit() 2147483647
omp_get_max_threads() 8
OMP_THREAD_LIMIT unset
OMP_NUM_THREADS unset
RestoreAfterFork true
data.table is using 4 threads with throttle==1024. See ?setDTthreads.
[1] 4
I am running OpenCV 2.4.13.
I am using CvRTrees.train() and CvRTrees.predict() for a classification application.
Currently, the CvRTrees.train() function only uses one CPU core. I can see this in htop: one CPU is at 100% and the others hover around 3% when train() is called.
The docs say you have to explicitly enable parallelization support, so I have rebuilt the libraries with OpenMP support and reinstalled. However, the behavior is the same: only one CPU is utilized by train().
Does anyone know what I am doing wrong? Are my expectations inaccurate?
Here are the steps I use to build and install:
Enabled OpenMP in cmake
$ mkdir release && cd release
$ cmake -D CMAKE_BUILD_TYPE=RELEASE -D CMAKE_INSTALL_PREFIX=/usr/local -D WITH_OPENMP=ON -D WITH_FFMPEG=OFF ..
Make and install
$ make && sudo make install
Run ldconfig so that the linker cache gets updated
$ sudo ldconfig
Rebuild my program and run ldd on the executable to ensure that it's using my newly installed libraries in /usr/local/lib
$ ldd ./build/release/classify | grep opencv
libopencv_core.so.2.4 => /usr/local/lib/libopencv_core.so.2.4 (0x00007f6e9d525000)
libopencv_ml.so.2.4 => /usr/local/lib/libopencv_ml.so.2.4 (0x00007f6e9d29e000)
Another note: If I add syntax errors around line 261 below, the build breaks, so I know I have the correct compiler flags set.
modules/core/src/parallel.cpp
-----------------------------
259 #elif defined HAVE_OPENMP
260
261 #pragma omp parallel for schedule(dynamic)
262 for (int i = stripeRange.start; i < stripeRange.end; ++i)
263 pbody(Range(i, i + 1));
Another note: ldd tells me that my application links libgomp.so, so I am reasonably sure that OpenMP is enabled.
I was trying to do numerical linear algebra computation in C++. I used Python NumPy for quick prototyping, and I would like to find a C++ linear algebra package for some further speed-up. Eigen seems to be quite a good starting point.
I wrote a small performance test using large dense matrix multiplication to test the processing speed. In Numpy I was doing this:
import numpy as np
import time
a = np.random.uniform(size = (5000, 5000))
b = np.random.uniform(size = (5000, 5000))
start = time.time()
c = np.dot(a, b)
print (time.time() - start) * 1000, 'ms'
In C++ Eigen I was doing this:
#include <time.h>
#include "Eigen/Dense"
using namespace std;
using namespace Eigen;
int main() {
MatrixXf a = MatrixXf::Random(5000, 5000);
MatrixXf b = MatrixXf::Random(5000, 5000);
time_t start = clock();
MatrixXf c = a * b;
cout << (double)(clock() - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;
return 0;
}
I have done some searching in the documentation and on Stack Overflow about compiler optimization flags. I tried to compile the program using this command:
g++ -g test.cpp -o test -Ofast -msse2
The C++ executable compiled with -Ofast optimization flags runs about 30x or more faster than a simple no optimization compilation. It will return the result in roughly 10000ms on my 2015 macbook pro.
Meanwhile Numpy will return the result in about 1800ms.
I was expecting a performance boost from using Eigen compared with NumPy; however, it fell short of my expectation.
Are there any compile flags I missed that would further boost Eigen's performance here? Or is there a multithreading switch that can be turned on to give me an extra performance gain? I am just curious about this.
Thank you very much!
Edit on April 17, 2016:
After doing some searching based on @ggael's answer, I have come up with the answer to this question.
The best solution is to compile with Intel MKL linked as a backend for Eigen. For an OS X system the library can be found here. With MKL installed, I tried to use the Intel MKL link line advisor to enable MKL backend support for Eigen.
I compile in this manner to enable everything MKL:
g++ -DEIGEN_USE_MKL_ALL -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl -m64 -I${MKLROOT}/include -I. -Ofast -DNDEBUG test.cpp -o test
If there is any environment variable error for MKLROOT, just run the environment setup script provided in the MKL package, which is installed by default at /opt/intel/mkl/bin on my device.
With MKL as the Eigen backend, the multiplication of two 5000x5000 matrices finishes in about 900ms on my 2.5GHz MacBook Pro. This is much faster than Python NumPy on my device.
To answer on the OS X side: first of all, recall that on OS X g++ is actually an alias to clang++, and Apple's current version of clang does not support OpenMP. Nonetheless, using Eigen 3.3-beta-1 and the default clang++, I get on a 2.6GHz MacBook Pro:
$ clang++ -mfma -I ../eigen so_gemm_perf.cpp -O3 -DNDEBUG && ./a.out
2954.91ms
Then, to get support for multithreading, you need a recent clang or gcc compiler, for instance from Homebrew or MacPorts. Here, using gcc 5 from MacPorts, I get:
$ g++-mp-5 -mfma -I ../eigen so_gemm_perf.cpp -O3 -DNDEBUG -fopenmp -Wa,-q && ./a.out
804.939ms
and with clang 3.9:
$ clang++-mp-3.9 -mfma -I ../eigen so_gemm_perf.cpp -O3 -DNDEBUG -fopenmp && ./a.out
806.16ms
Note that gcc on OS X does not know how to properly assemble AVX/FMA instructions, so you need to tell it to use the native assembler with the -Wa,-q flag.
Finally, with the devel branch, you can also tell Eigen to use any BLAS as a backend, for instance Apple's Accelerate, as follows:
$ g++ -framework Accelerate -DEIGEN_USE_BLAS -O3 -DNDEBUG so_gemm_perf.cpp -I ../eigen && ./a.out
802.837ms
Compiling your little program with VC2013:
/fp:precise - 10.5s
/fp:strict - 10.4s
/fp:fast - 10.3s
/fp:fast /arch:AVX2 - 6.6s
/fp:fast /arch:AVX2 /openmp - 2.7s
So using AVX/AVX2 and enabling OpenMP is going to help a lot. You can also try linking against MKL (http://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html).
I've been trying to compile and run LULESH benchmark
https://codesign.llnl.gov/lulesh.php
https://codesign.llnl.gov/lulesh/lulesh2.0.3.tgz
with gprof but I always get a segmentation fault. I updated these instructions in the Makefile:
CXXFLAGS = -g -pg -O3 -I. -Wall
LDFLAGS = -g -pg -O3
[andrestoga#n01 lulesh2.0.3]$ mpirun -np 8 ./lulesh2.0 -s 16 -p -i 10
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 30557 on node n01 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The gprof web page says the following:
If you are running the program on a system which supports shared
libraries you may run into problems with the profiling support code in
a shared library being called before that library has been fully
initialised. This is usually detected by the program encountering a
segmentation fault as soon as it is run. The solution is to link
against a static version of the library containing the profiling
support code, which for gcc users can be done via the `-static' or
`-static-libgcc' command line option. For example:
gcc -g -pg -static-libgcc myprog.c utils.c -o myprog
I added the -static command line option and I still got a segmentation fault.
I found a PDF where they profiled LULESH by adding the -pg command line option to the Makefile, although they didn't describe the exact changes they made.
http://periscope.in.tum.de/releases/latest/pdf/PTF_Best_Practices_Guide.pdf
Page 11
Could someone help me out please?
Best,
Make sure all libraries are loaded:
openmpi (which you have done already)
gcc
You can try with parameters that will let you identify whether the problem is your machine's resources. If the machine does not support that number of processes, check how many processes (MPI or not) it supports by looking at the architecture topology. This will let you identify the correct number of jobs/processes you can launch on the system.
Very quick run:
mpirun -np 1 ./lulesh2.0 -s 1 -p -i 1
Running problem size 1^3 per domain until completion
Num processors: 1
Num threads: 2
Total number of elements: 1
To run other sizes, use -s <integer>.
To run a fixed number of iterations, use -i <integer>.
To run a more or less balanced region set, use -b <integer>.
To change the relative costs of regions, use -c <integer>.
To print out progress, use -p
To write an output file for VisIt, use -v
See help (-h) for more options
cycle = 1, time = 1.000000e-02, dt=1.000000e-02
Run completed:
Problem size = 1
MPI tasks = 1
Iteration count = 1
Final Origin Energy = 4.333329e+02
Testing Plane 0 of Energy Array on rank 0:
MaxAbsDiff = 0.000000e+00
TotalAbsDiff = 0.000000e+00
MaxRelDiff = 0.000000e+00
Elapsed time = 0.00 (s)
Grind time (us/z/c) = 518 (per dom) ( 518 overall)
FOM = 1.9305019 (z/s)
Has anyone tried to use gold instead of ld?
gold promises to be much faster than ld, so it may help speeding up test cycles for large C++ applications, but can it be used as drop-in replacement for ld?
Can gcc/g++ directly call gold?
Are there any known bugs or problems?
Although gold has been part of the GNU binutils for a while now, I have found almost no "success stories" or even "howtos" on the Web.
(Update: added links to gold and blog entry explaining it)
At the moment I am using it to compile bigger projects on Ubuntu 10.04, where you can install and integrate it easily with the binutils-gold package (if you remove that package, you get your old ld back). gcc will then automatically use gold.
Some experiences:
gold doesn't search in /usr/local/lib
gold doesn't implicitly link libs like pthread or rt; I had to add them by hand
it is faster and needs less memory (the latter is important on big C++ projects with a lot of boost etc.)
What does not work: it cannot link kernel stuff and therefore no kernel modules. Ubuntu does this automatically via DKMS if it updates proprietary drivers like fglrx. This fails with ld-gold (you have to remove gold, restart DKMS, reinstall ld-gold).
As it took me a little while to find out how to selectively use gold (i.e. not system-wide using a symlink), I'll post the solution here. It's based on http://code.google.com/p/chromium/wiki/LinuxFasterBuilds#Linking_using_gold .
Make a directory where you can put a gold glue script. I am using ~/bin/gold/.
Put the following glue script there and name it ~/bin/gold/ld:
#!/bin/bash
gold "$@"
Obviously, make it executable, chmod a+x ~/bin/gold/ld.
Change your calls to gcc to gcc -B$HOME/bin/gold which makes gcc look in the given directory for helper programs like ld and thus uses the glue script instead of the system-default ld.
Can gcc/g++ directly call gold?
Just to complement the answers: there is a gcc's option -fuse-ld=gold (see gcc doc). Though, AFAIK, it is possible to configure gcc during the build in a way that the option will not have any effect.
Minimal synthetic benchmark: LD vs gold vs LLVM LLD
Outcome:
gold was about 3x to 4x faster for all values I've tried when using -Wl,--threads -Wl,--thread-count=$(nproc) to enable multithreading
LLD was about 2x faster than gold!
Tested on:
Ubuntu 20.04, GCC 9.3.0, binutils 2.34, sudo apt install lld LLD 10
Lenovo ThinkPad P51 laptop, Intel Core i7-7820HQ CPU (4 cores / 8 threads), 2x Samsung M471A2K43BB1-CRC RAM (2x 16GiB), Samsung MZVLB512HAJQ-000L7 SSD (3,000 MB/s).
Simplified description of the benchmark parameters:
1: number of object files providing symbols
2: number of symbols per symbol provider object file
3: number of object files using all provided symbols
Results for different benchmark parameters:
10000 10 10
nogold: wall=4.35s user=3.45s system=0.88s 876820kB
gold: wall=1.35s user=1.72s system=0.46s 739760kB
lld: wall=0.73s user=1.20s system=0.24s 625208kB
1000 100 10
nogold: wall=5.08s user=4.17s system=0.89s 924040kB
gold: wall=1.57s user=2.18s system=0.54s 922712kB
lld: wall=0.75s user=1.28s system=0.27s 664804kB
100 1000 10
nogold: wall=5.53s user=4.53s system=0.95s 962440kB
gold: wall=1.65s user=2.39s system=0.61s 987148kB
lld: wall=0.75s user=1.30s system=0.25s 704820kB
10000 10 100
nogold: wall=11.45s user=10.14s system=1.28s 1735224kB
gold: wall=4.88s user=8.21s system=0.95s 2180432kB
lld: wall=2.41s user=5.58s system=0.74s 2308672kB
1000 100 100
nogold: wall=13.58s user=12.01s system=1.54s 1767832kB
gold: wall=5.17s user=8.55s system=1.05s 2333432kB
lld: wall=2.79s user=6.01s system=0.85s 2347664kB
100 1000 100
nogold: wall=13.31s user=11.64s system=1.62s 1799664kB
gold: wall=5.22s user=8.62s system=1.03s 2393516kB
lld: wall=3.11s user=6.26s system=0.66s 2386392kB
This is the script that generates all the objects for the link tests:
generate-objects
#!/usr/bin/env bash
set -eu
# CLI args.
# Each of those files contains n_ints_per_file ints.
n_int_files="${1:-10}"
n_ints_per_file="${2:-10}"
# Each function adds all ints from all files.
# This leads to n_int_files x n_ints_per_file x n_funcs relocations.
n_funcs="${3:-10}"
# Do a debug build, since it is for debug builds that link time matters the most,
# as the user will be recompiling often.
cflags='-ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic'
# Cleanup previous generated files objects.
./clean
# Generate i_*.c, ints.h and int_sum.h
rm -f ints.h
echo 'return' > int_sum.h
int_file_i=0
while [ "$int_file_i" -lt "$n_int_files" ]; do
int_i=0
int_file="${int_file_i}.c"
rm -f "$int_file"
while [ "$int_i" -lt "$n_ints_per_file" ]; do
echo "${int_file_i} ${int_i}"
int_sym="i_${int_file_i}_${int_i}"
echo "unsigned int ${int_sym} = ${int_file_i};" >> "$int_file"
echo "extern unsigned int ${int_sym};" >> ints.h
echo "${int_sym} +" >> int_sum.h
int_i=$((int_i + 1))
done
int_file_i=$((int_file_i + 1))
done
echo '1;' >> int_sum.h
# Generate funcs.h and main.c.
rm -f funcs.h
cat <<EOF >main.c
#include "funcs.h"
int main(void) {
return
EOF
i=0
while [ "$i" -lt "$n_funcs" ]; do
func_sym="f_${i}"
echo "${func_sym}() +" >> main.c
echo "int ${func_sym}(void);" >> funcs.h
cat <<EOF >"${func_sym}.c"
#include "ints.h"
int ${func_sym}(void) {
#include "int_sum.h"
}
EOF
i=$((i + 1))
done
cat <<EOF >>main.c
1;
}
EOF
# Generate *.o
ls | grep -E '\.c$' | parallel --halt now,fail=1 -t --will-cite "gcc $cflags -c -o '{.}.o' '{}'"
GitHub upstream.
Note that the object file generation can be quite slow, since each C file can be quite large.
Given an input of type:
./generate-objects [n_int_files [n_ints_per_file [n_funcs]]]
it generates:
main.c
#include "funcs.h"
int main(void) {
return f_0() + f_1() + ... + f_<n_funcs>();
}
f_0.c, f_1.c, ..., f_<n_funcs>.c
extern unsigned int i_0_0;
extern unsigned int i_0_1;
...
extern unsigned int i_1_0;
extern unsigned int i_1_1;
...
extern unsigned int i_<n_int_files>_<n_ints_per_file>;
int f_0(void) {
return
i_0_0 +
i_0_1 +
...
i_1_0 +
i_1_1 +
...
i_<n_int_files>_<n_ints_per_file>
}
0.c, 1.c, ..., <n_int_files>.c
unsigned int i_0_0 = 0;
unsigned int i_0_1 = 0;
...
unsigned int i_0_<n_ints_per_file> = 0;
which leads to:
n_int_files x n_ints_per_file x n_funcs
relocations on the link.
Then I compared:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main *.o
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -fuse-ld=gold -Wl,--threads -Wl,--thread-count=`nproc` -o main *.o
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -fuse-ld=lld -o main *.o
Some limits I've been trying to mitigate when selecting the test parameters:
at 100k C files, both methods occasionally get failed mallocs
GCC cannot compile a function with 1M additions
I have also observed a 2x in the debug build of gem5: https://gem5.googlesource.com/public/gem5/+/fafe4e80b76e93e3d0d05797904c19928587f5b5
Similar question: https://unix.stackexchange.com/questions/545699/what-is-the-gold-linker
Phoronix benchmarks
Phoronix did some benchmarking in 2017 for some real world projects, but for the projects they examined, the gold gains were not so significant: https://www.phoronix.com/scan.php?page=article&item=lld4-linux-tests&num=2 (archive).
Known incompatibilities
gold
https://sourceware.org/bugzilla/show_bug.cgi?id=23869 gold failed if I do a partial link with LD and then try the final link with gold. lld worked on the same test case.
https://github.com/cirosantilli/linux-kernel-module-cheat/issues/109 my debug symbols appeared broken in some places
LLD benchmarks
At https://lld.llvm.org/ they give build times for a few well-known projects, with similar results to my synthetic benchmarks. Unfortunately, project/linker versions are not given. In their results:
gold was about 3x/4x faster than LD
LLD was 3x/4x faster than gold, so a greater speedup than in my synthetic benchmark
They comment:
This is a link time comparison on a 2-socket 20-core 40-thread Xeon E5-2680 2.80 GHz machine with an SSD drive. We ran gold and lld with or without multi-threading support. To disable multi-threading, we added -no-threads to the command lines.
and results look like:
Program | Size | GNU ld | gold -j1 | gold | lld -j1 | lld
-------------|----------|---------|----------|---------|---------|-------
ffmpeg dbg | 92 MiB | 1.72s | 1.16s | 1.01s | 0.60s | 0.35s
mysqld dbg | 154 MiB | 8.50s | 2.96s | 2.68s | 1.06s | 0.68s
clang dbg | 1.67 GiB | 104.03s | 34.18s | 23.49s | 14.82s | 5.28s
chromium dbg | 1.14 GiB | 209.05s | 64.70s | 60.82s | 27.60s | 16.70s
As a Samba developer, I have been using the gold linker almost exclusively on Ubuntu, Debian, and Fedora for several years now. My assessment:
gold is many times (felt: 5-10 times) faster than the classical linker.
Initially, there were a few problems, but they have been gone since roughly Ubuntu 12.04.
The gold linker even found some dependency problems in our code, since it seems to be more correct than the classical one with respect to some details. See, e.g. this Samba commit.
I have not used gold selectively, but have been using symlinks or the alternatives mechanism if the distribution provides it.
You could link ld to gold (in a local binary directory if you have ld installed to avoid overwriting):
ln -s `which gold` ~/bin/ld
or
ln -s `which gold` /usr/local/bin/ld
Some projects seem to be incompatible with gold, because of some incompatible differences between ld and gold. Example: OpenFOAM, see http://www.openfoam.org/mantisbt/view.php?id=685 .
DragonFlyBSD switched over to gold as their default linker. So it seems to be ready for a variety of tools.
More details:
http://phoronix.com/scan.php?page=news_item&px=DragonFlyBSD-Gold-Linker