Consider the following setup. We have three files:
main.cpp:
#include <cstdio>
#include <ctime>
#include "class.h"

int main() {
    clock_t begin = clock();
    int a = 0;
    for (int i = 0; i < 1000000000; ++i) {
        a += i;
    }
    clock_t end = clock();
    printf("Number: %d, Elapsed time: %f\n",
           a, double(end - begin) / CLOCKS_PER_SEC);

    begin = clock();
    C b(0);
    for (int i = 0; i < 1000000000; ++i) {
        b += C(i);
    }
    end = clock();
    printf("Number: %d, Elapsed time: %f\n",
           b.m_number, double(end - begin) / CLOCKS_PER_SEC);
    return 0;
}
class.h:
#pragma once

struct C {
    int m_number;
    C(int number);
    void operator+=(const C & rhs);
};
class.cpp:
#include "class.h"

C::C(int number)
    : m_number(number)
{
}

void C::operator+=(const C & rhs) {
    m_number += rhs.m_number;
}
The files are compiled with clang++ using the flags -std=c++11 -O3.
I expected very similar performance results for the two loops, since I thought the compiler would inline the operator rather than emit real function calls. The reality, though, was a bit different. Here is the result:
Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 5.375751
I played around a bit and found that if I paste all of the code from class.* into main.cpp, the speed improves dramatically and the results are very similar:
Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 0.000003
Then I realized that this behavior is probably caused by the fact that main.cpp and class.cpp are compiled completely separately, so the compiler is unable to perform adequate optimizations.
My question: is there any way of keeping the three-file scheme and still achieving the same level of optimization as if the files were merged into one and then compiled? I have read something about 'unity builds', but that seems like overkill.
Short answer
What you want is link time optimization. Try the answer from this question. I.e., try:
clang++ -O4 -emit-llvm main.cpp -c -o main.bc
clang++ -O4 -emit-llvm class.cpp -c -o class.bc
llvm-link main.bc class.bc -o all.bc
opt -std-compile-opts -std-link-opts -O3 all.bc -o optimized.bc
clang++ optimized.bc -o yourExecutable
You should see that your performance reaches the one that you had when pasting everything into main.cpp.
Long answer
The problem is that the compiler cannot inline your overloaded operator during linking, because it no longer has its definition in a form which it can use to inline it (it cannot inline bare machine code). Thus, the operator call in main.cpp will stay a real function call to the function declared in class.cpp. A function call is very expensive in comparison to a simple inlined addition which can be optimized further (e.g., vectorized).
When you enable link time optimization, the compiler is able to inline across translation units. As you see above, you first create LLVM intermediate representation bitcode (the .bc files, which I will simply call llvm code hereinafter) instead of machine code.
You then link these files into a new .bc file which still contains llvm code instead of machine code. In contrast to machine code, the compiler is able to perform inlining on llvm code. opt is the llvm optimizer (be sure to install llvm), which performs the inlining and further link time optimizations. Then, we call clang++ a final time to generate executable machine code from the optimized llvm code.
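As an aside, on more recent clang versions the separate llvm-link/opt steps can usually be replaced by passing -flto at both compile and link time. A minimal sketch, assuming a toolchain whose linker supports LTO (e.g. lld or gold):
clang++ -O3 -flto -c main.cpp
clang++ -O3 -flto -c class.cpp
clang++ -O3 -flto main.o class.o -o yourExecutable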
For People with GCC
The answer above is only for clang. GCC (g++) users must use the -flto flag both during compilation and during linking to enable link time optimization. It is simpler than with clang: simply add -flto everywhere:
g++ -c -O2 -flto main.cpp
g++ -c -O2 -flto class.cpp
g++ -o myprog -flto -O2 main.o class.o
The technique you are looking for is called Link Time Optimization.
From the timing data, it is obvious that the compiler doesn't just generate better code for the trivial case: it executes no code at all to sum up a billion numbers, because the whole loop is folded into a compile-time constant. That doesn't happen in real life. You are not performing a useful benchmark. You want to test code that is at least complicated enough to keep the compiler from doing stupid/clever things like this.
I'd re-run the test, but change the loop to
for (int i = 0; i < 1000000000; ++i) if (i != 1000000) {
// ...
}
so that the compiler is forced to actually add up the numbers.
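For reference, the adjusted loop in full, with the elided body filled in from the original benchmark (the condition exists only to prevent the sum from being folded into a compile-time constant):
for (int i = 0; i < 1000000000; ++i) {
    if (i != 1000000)
        a += i;
}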
Related
I am writing a modulo function and want to minimize the number of instructions executed. Currently it looks like this:
#include <cstdlib>

constexpr long long mod = 1e9 + 7;

static __attribute__((always_inline)) long long modulo(long long x) noexcept
{
    return lldiv(x, mod).rem;
}
and even though I am compiling with -static there is a call to lldiv that isn't being inlined. I want to get around this by adding inline assembly to my function to call the idiv instruction directly. How can I do this?
I am using g++ 9.4.0 and I'm targeting x86.
Since this is related to a competitive programming contest, the code will be compiled with g++ -std=c++17 -O3 -static.
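A minimal sketch of two possible approaches, assuming x86-64 and GCC extended inline asm (the function names are illustrative; note that plain x % mod already avoids the library call, and since mod is a constant, gcc compiles it to a multiply/shift sequence that is usually faster than an actual idiv):
constexpr long long mod = 1000000007LL;

// Option 1: the % operator inlines with no call to lldiv; with a constant
// divisor, gcc emits a multiply/shift sequence instead of a division.
static inline long long modulo_simple(long long x) noexcept {
    return x % mod;
}

// Option 2: issue idiv directly via extended asm (x86-64 only).
// idivq divides rdx:rax by the operand: quotient in rax, remainder in rdx.
static inline long long modulo_idiv(long long x) noexcept {
    long long q, r;
    __asm__("cqto\n\t"   // sign-extend rax into rdx:rax
            "idivq %[d]"
            : "=a"(q), "=d"(r)
            : "a"(x), [d] "r"(mod)
            : "cc");
    (void)q; // quotient unused
    return r;
}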
I have encountered a weird problem with LLVM coverage when using constant expressions in an if-statement:
#include <cstring>
#include <type_traits>

template<typename T>
int foo(const T &val)
{
    int idx = 0;
    if constexpr(std::is_trivially_copyable<T>::value && sizeof(T) <= sizeof(int))
    {
        memcpy(&idx, &val, sizeof(T));
    }
    else
    {
        // store val and assign its index to idx
    }
    return idx;
}
The instantiations executed:
int idx1 = foo<int>(10);
int idx2 = foo<long long>(10);
int idx3 = foo<std::string>(std::string("Hello"));
int idx4 = foo<std::vector<int>>(std::vector<int>{1,2,3,4,5});
In none of these is the sizeof(T) <= sizeof(int) part ever shown as executed. And yet for the first instantiation (int), the body of the first branch is indeed executed, as it should be; in none of the others is it shown as executed.
Relevant part of the compilation command line:
/usr/bin/clang++ -g -O0 -Wall -Wextra -fprofile-instr-generate -fcoverage-mapping -target x86_64-pc-linux-gnu -pipe -fexceptions -fvisibility=default -fPIC -DQT_CORE_LIB -DQT_TESTLIB_LIB -I(...) -std=c++17 -o test.o -c test.cpp
Relevant part of the linker command line:
/usr/bin/clang++ -Wl,-m,elf_x86_64,-rpath,/home/michael/Qt/5.11.2/gcc_64/lib -L/home/michael/Qt/5.11.2/gcc_64/lib -fprofile-instr-generate -fcoverage-mapping -target x86_64-pc-linux-gnu -o testd test.o -lpthread -fuse-ld=lld
When the condition is extracted into its own function, both the int and long long instantiations are shown correctly in the coverage as executing the sizeof(T) <= sizeof(int) part. What might be causing this behaviour, and how can it be solved? Is it a bug in Clang/llvm-cov?
Any ideas?
EDIT: This seems to be a known bug in LLVM (not yet clear whether in llvm-cov or Clang):
https://bugs.llvm.org/show_bug.cgi?id=36086
https://bugs.chromium.org/p/chromium/issues/detail?id=845575
First of all, sizeof(T) <= sizeof(int) is evaluated at compile time in your code, so chances are compilation is simply not profiled for coverage.
Next, of the remaining three types only long long is trivially copyable, but its size is (highly likely) greater than that of int, so the then-clause is not executed for it, nor even compiled: since everything happens inside a templated function, the discarded branch of if constexpr is never instantiated.
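A sketch of the workaround mentioned in the question, extracting the condition into its own constexpr helper so the coverage instrumentation has a distinct function to attribute the evaluation to (the helper name is illustrative, not from the original code; includes as in the snippet above):
template<typename T>
constexpr bool fits_in_int()
{
    return std::is_trivially_copyable<T>::value && sizeof(T) <= sizeof(int);
}

template<typename T>
int foo(const T &val)
{
    int idx = 0;
    if constexpr(fits_in_int<T>())
    {
        memcpy(&idx, &val, sizeof(T));
    }
    else
    {
        // store val and assign its index to idx
    }
    return idx;
}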
We're in the process of porting our codebase over to Eigen 3.3 (quite an undertaking, with all the 32-byte alignment issues). However, there are a few places where performance seems to have been badly affected, contrary to expectations (I was looking forward to some speedup, given the extra support for FMA and AVX...). These include eigenvalue decomposition and matrix*matrix.transpose()*vector products. I've written two minimal working examples to demonstrate.
All tests were run on an up-to-date Arch Linux system, using an Intel Core i7-4930K CPU (3.40 GHz), and compiled with g++ version 6.2.1.
1. Eigenvalue decomposition:
A straightforward self-adjoint eigenvalue decomposition takes twice as long with Eigen 3.3.0 as it does with 3.2.10.
File test_eigen_EVD.cpp:
#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#include <Eigen/Eigenvalues>
#define SIZE 200
using namespace Eigen;
int main (int argc, char* argv[])
{
    MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
    SelfAdjointEigenSolver<MatrixXf> eig;
    for (int n = 0; n < 1000; ++n)
        eig.compute (mat);
    return 0;
}
Test results:
eigen-3.2.10:
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD
real 0m5.136s
user 0m5.133s
sys 0m0.000s
eigen-3.3.0:
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD
real 0m11.008s
user 0m11.007s
sys 0m0.000s
Not sure what might be causing this, but if anyone can see a way of maintaining performance with Eigen 3.3, I'd like to know about it!
2. matrix*matrix.transpose()*vector product:
This particular example takes a whopping 200× longer with Eigen 3.3.0...
File test_eigen_products.cpp:
#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#define SIZE 200
using namespace Eigen;
int main (int argc, char* argv[])
{
    MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
    VectorXf vec = VectorXf::Random(SIZE);
    for (int n = 0; n < 50; ++n)
        vec = mat * mat.transpose() * VectorXf::Random(SIZE);
    return vec[0] == 0.0;
}
Test results:
eigen-3.2.10:
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
real 0m0.040s
user 0m0.037s
sys 0m0.000s
eigen-3.3.0:
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
real 0m8.112s
user 0m7.700s
sys 0m0.410s
Adding brackets to the line in the loop like this:
vec = mat * ( mat.transpose() * VectorXf::Random(SIZE) );
makes a huge difference, with both Eigen versions then performing equally well (actually 3.3.0 is slightly better), and faster than the unbracketed 3.2.10 case. So there is a fix. Still, it's odd that 3.3.0 would struggle so much with this.
I don't know whether this is a bug, but I guess it's worth reporting in case this is something that needs to be fixed. Or maybe I was just doing it wrong...
Any thoughts appreciated.
Cheers,
Donald.
EDIT
As pointed out by ggael, the EVD in Eigen 3.3 is faster if compiled using clang++, or with -O3 with g++. So that's problem 1 fixed.
Problem 2 isn't really a problem, since I can just add brackets to force the most efficient order of operations. But for completeness: there does seem to be a flaw somewhere in the evaluation of these operations. Eigen is an incredible piece of software; I think this probably deserves to be fixed. Here's a modified version of the MWE, just to show that it's unlikely to be related to the first temporary product being hoisted out of the loop (at least as far as I can tell):
#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#include <iostream>
#define SIZE 200
using namespace Eigen;
int main (int argc, char* argv[])
{
    VectorXf vec (SIZE), vecsum = VectorXf::Zero(SIZE); // zero-initialise the accumulator
    MatrixXf mat (SIZE,SIZE);
    for (int n = 0; n < 50; ++n) {
        mat = MatrixXf::Random(SIZE,SIZE);
        vec = VectorXf::Random(SIZE);
        vecsum += mat * mat.transpose() * VectorXf::Random(SIZE);
    }
    std::cout << vecsum.norm() << std::endl;
    return 0;
}
In this example, the operands are all initialised within the loop and the results accumulated in vecsum, so there's no way the compiler can precompute anything or optimise away unnecessary computations. This shows the exact same behaviour (this time testing with clang++ -O3, version 3.9.0):
$ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
5467.82
real 0m0.060s
user 0m0.057s
sys 0m0.000s
$ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
5467.82
real 0m4.225s
user 0m3.873s
sys 0m0.350s
So, same result, but vastly different execution times. Thankfully, this is easily resolved by placing brackets in the right places, but there does seem to be a regression somewhere in Eigen 3.3's evaluation of operations. With brackets around the mat.transpose() * VectorXf::Random(SIZE) part, the execution times are reduced for both Eigen versions to around 0.020s (so Eigen 3.2.10 clearly also benefits in this case). At least this means we can keep getting awesome performance out of Eigen!
In the meantime, I'll accept ggael's answer, it's all I needed to know to move forward.
For the EVD, I cannot reproduce this with clang. With gcc, you need -O3 to avoid an inlining issue. Then, with both compilers, Eigen 3.3 delivers a 33% speedup.
EDIT: my previous answer regarding the matrix*matrix*vector product was wrong; this is a shortcoming in Eigen 3.3.0 that will be fixed in Eigen 3.3.1. For the record, I leave my previous analysis here, as it is still partly valid:
As you noticed, you should really add the parentheses to perform two
matrix*vector products instead of one big matrix*matrix product.
The speed difference is then easily explained by the fact that in 3.2
the nested matrix*matrix product is immediately evaluated (at
nesting time), whereas in 3.3 it is evaluated at evaluation time, that
is, in operator=. This means that in 3.2 the loop is equivalent to:
for (int n = 0; n < 50; ++n) {
    MatrixXf tmp = mat * mat.transpose();
    vec = tmp * VectorXf::Random(SIZE);
}
and thus the compiler can move tmp out of the loop. Production code
should not rely on the compiler for this kind of task, but rather
explicitly move constant expressions outside loops.
This is true, except that in practice the compiler was not smart enough to move the temporary out of the loop.
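To illustrate that last point, a minimal sketch of the explicit hoisting the quoted analysis recommends, reusing the names from the question's MWE:
const MatrixXf mm = mat * mat.transpose(); // computed once, outside the loop
for (int n = 0; n < 50; ++n)
    vec = mm * VectorXf::Random(SIZE);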
I've been reading Scott Meyers' Effective C++ again, more specifically Item 30, about inlining.
So I wrote the following, trying to induce that optimization with gcc 4.6.3:
// test.h
class test {
public:
    inline int max(int i) { return i > 5 ? 1 : -1; }
    int foo(int);
private:
    int d;
};
// test.cpp
#include "test.h"
int test::foo(int i) { max(i); } // note: the result of max() is discarded

// main.cpp
#include "test.h"
int main(int argc, const char *argv[]) {
    test t;
    return t.foo(argc);
}
and produced the relevant assembly using each of the following alternatives:
g++ -S -I. test.cpp main.cpp
g++ -finline-functions -S -I. test.cpp main.cpp
Both commands produced the same assembly as far as the inlined method is concerned;
I can see both the max() method body (complete with a cmpl statement and the relevant jumps) and its call from foo().
Am I missing something terribly obvious? I can't say that I combed through the gcc man page, but couldn't find anything relevant standing out.
So, I just increased the optimization level to -O3 which has the inline optimizations on by default, according to:
g++ -c -Q -O3 --help=optimizers | grep inline
-finline-functions [enabled]
-finline-functions-called-once [enabled]
-finline-small-functions [enabled]
Unfortunately, this optimized (as expected) the above code fragment almost out of existence:
max() is no longer there (at least as an explicitly tagged assembly block), and foo() has been reduced to:
_ZN4test3fooEi:
.LFB7:
.cfi_startproc
rep
ret
.cfi_endproc
which I cannot fully explain at the moment (and which is outside the scope of this question).
Ideally, what I would like to see is the assembly code for max() inside the foo() block.
Is there a way (either through command-line options or using a different, non-trivial code fragment) to produce such an output?
The compiler is entirely free to inline functions even if you don't ask it to, whether you use the inline keyword or not, and whether you pass -finline-functions or not (although probably not if you use -fno-inline-functions: that would be contrary to what you asked for, and although the C++ standard doesn't say so, the flag becomes pretty pointless if it doesn't do something like what it says).
Next, the compiler is also not always certain that your function won't be used "somewhere else", so it will produce an out-of-line copy of most inline functions, unless it's entirely clear that it "can not possibly be called from somewhere else" (for example, when the class is declared such that it can't be reached elsewhere).
And if you don't use the result of a function, and the function doesn't have side effects (e.g. writing to a global variable, performing I/O, or calling a function the compiler "doesn't know what it does"), then the compiler will eliminate that code as "dead", because you don't really want unnecessary code, do you? Adding a return in front of max(i) in your foo function should help.
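A minimal sketch of that fix (foo's return value now depends on max, so the comparison can no longer be eliminated as dead code; at -O3 the comparison sequence should then appear inlined into main, rather than foo compiling to an empty body):
// test.cpp
#include "test.h"
int test::foo(int i) { return max(i); }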
I was writing some templated code to benchmark a numeric algorithm using both floats and doubles, in order to compare against a GPU implementation.
I discovered that my floating point code was slower, and after investigating with Intel's VTune Amplifier I found that g++ was generating extra x86 instructions (cvtps2pd/cvtpd2ps and unpcklps/unpcklpd) to convert some intermediate results from float to double and back again. The performance degradation is almost 10% for this application.
After compiling with the flag -Wdouble-promotion (which BTW is not included with -Wall or -Wextra), sure enough g++ warned me that the results were being promoted.
I reduced this to the simple test case shown below. Note that the ordering of the C++ code affects the generated code: the compound statement (T d1 = log(r)/r;) produces a warning, whilst the separated version (T d = log(r); d/=r;) does not.
The following was compiled with both g++-4.6.3-1ubuntu5 and g++-4.7.3-2ubuntu1~12.04 with the same results.
Compile flags are:
g++-4.7 -O2 -Wdouble-promotion -Wextra -Wall -pedantic -Werror -std=c++0x test.cpp -o test
#include <cstdlib>
#include <iostream>
#include <cmath>

template <typename T>
T f()
{
    T r = static_cast<T>(0.001);

    // Gives no double promotion warning
    T d = log(r);
    d/=r;

    // Promotes to double
    T d1 = log(r)/r;

    return d+d1;
}

int main()
{
    float f1 = f<float>();
    std::cout << f1 << std::endl;
}
I realise that the C++11 standard allows the compiler some discretion here. But why does the order matter?
Can I explicitly instruct g++ to use floats only for this calculation?
EDIT: SOLVED by Mike Seymour. I needed to use std::log to ensure the overloaded version of log was picked up, instead of calling the C function double log(double). The warning was not generated for the separated statements because that is a conversion and not a promotion.
The problem is
log(r)
In this implementation, it seems that the only log in the global namespace is the C library function, double log(double). Remember that it's not specified whether or not the C-library headers in the C++ library dump their definitions into the global namespace as well as namespace std.
You want
std::log(r)
to ensure that the extra overloads defined by the C++ library are available.
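A minimal sketch of the corrected function from the question, assuming only the log calls change (with std::log, the float overload is selected when T is float, so nothing is promoted to double and the conversion instructions disappear):
#include <cmath>

template <typename T>
T f()
{
    T r = static_cast<T>(0.001);
    T d = std::log(r);      // float overload chosen when T is float
    d /= r;
    T d1 = std::log(r) / r; // no longer promotes to double
    return d + d1;
}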