performance regression with Eigen 3.3.0 vs. 3.2.10?

We're just in the process of porting our codebase over to Eigen 3.3 (quite an undertaking with all the 32-byte alignment issues). However, there are a few places where performance seems to have been badly affected, contrary to expectations (I was looking forward to some speedup given the extra support for FMA and AVX...). These include eigenvalue decomposition and matrix*matrix.transpose()*vector products. I've written two minimal working examples to demonstrate.
All tests were run on an up-to-date Arch Linux system, using an Intel Core i7-4930K CPU (3.40GHz), and compiled with g++ version 6.2.1.
1. Eigenvalue decomposition:
A straightforward self-adjoint eigenvalue decomposition takes twice as long with Eigen 3.3.0 as it does with 3.2.10.
File test_eigen_EVD.cpp:
#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#include <Eigen/Eigenvalues>
#define SIZE 200
using namespace Eigen;
int main (int argc, char* argv[])
{
  MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
  SelfAdjointEigenSolver<MatrixXf> eig;
  for (int n = 0; n < 1000; ++n)
    eig.compute (mat);
  return 0;
}
Test results:
eigen-3.2.10:
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD
real 0m5.136s
user 0m5.133s
sys 0m0.000s
eigen-3.3.0:
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD
real 0m11.008s
user 0m11.007s
sys 0m0.000s
Not sure what might be causing this, but if anyone can see a way of maintaining performance with Eigen 3.3, I'd like to know about it!
2. matrix*matrix.transpose()*vector product:
This particular example takes a whopping 200× longer with Eigen 3.3.0...
File test_eigen_products.cpp:
#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#define SIZE 200
using namespace Eigen;
int main (int argc, char* argv[])
{
  MatrixXf mat = MatrixXf::Random(SIZE,SIZE);
  VectorXf vec = VectorXf::Random(SIZE);
  for (int n = 0; n < 50; ++n)
    vec = mat * mat.transpose() * VectorXf::Random(SIZE);
  return vec[0] == 0.0;
}
Test results:
eigen-3.2.10:
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
real 0m0.040s
user 0m0.037s
sys 0m0.000s
eigen-3.3.0:
g++ -march=native -O2 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
real 0m8.112s
user 0m7.700s
sys 0m0.410s
Adding brackets to the line in the loop like this:
vec = mat * ( mat.transpose() * VectorXf::Random(SIZE) );
makes a huge difference, with both Eigen versions then performing equally well (actually 3.3.0 is slightly better), and faster than the unbracketed 3.2.10 case. So there is a fix. Still, it's odd that 3.3.0 would struggle so much with this.
I don't know whether this is a bug, but I guess it's worth reporting in case this is something that needs to be fixed. Or maybe I was just doing it wrong...
Any thoughts appreciated.
Cheers,
Donald.
EDIT
As pointed out by ggael, the EVD in Eigen 3.3 is faster if compiled using clang++, or with -O3 with g++. So that's problem 1 fixed.
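For reference, that's simply the original command with -O3 in place of -O2:
g++ -march=native -O3 -DNDEBUG -isystem eigen-3.3.0 test_eigen_EVD.cpp -o test_eigen_EVD && time ./test_eigen_EVD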
Problem 2 isn't really a problem, since I can just add brackets to force the most efficient order of operations. But just for completeness: there does seem to be a flaw somewhere in the evaluation of these operations. Eigen is an incredible piece of software; I think this probably deserves to be fixed. Here's a modified version of the MWE, just to show that it's unlikely to be related to the first temporary product being taken out of the loop (at least as far as I can tell):
#define EIGEN_DONT_PARALLELIZE
#include <Eigen/Dense>
#include <iostream>
#define SIZE 200
using namespace Eigen;
int main (int argc, char* argv[])
{
  VectorXf vec (SIZE);
  VectorXf vecsum = VectorXf::Zero(SIZE);  // the accumulator must start at zero
  MatrixXf mat (SIZE,SIZE);
  for (int n = 0; n < 50; ++n) {
    mat = MatrixXf::Random(SIZE,SIZE);
    vec = VectorXf::Random(SIZE);
    vecsum += mat * mat.transpose() * VectorXf::Random(SIZE);
  }
  std::cout << vecsum.norm() << std::endl;
  return 0;
}
In this example, the operands are all initialised within the loop, and the results accumulated in vecsum, so there's no way the compiler can precompute anything or optimise away unnecessary computations. This shows the exact same behaviour, this time testing with clang++ -O3 (version 3.9.0):
$ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.2.10 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
5467.82
real 0m0.060s
user 0m0.057s
sys 0m0.000s
$ clang++ -march=native -O3 -DNDEBUG -isystem eigen-3.3.0 test_eigen_products.cpp -o test_eigen_products && time ./test_eigen_products
5467.82
real 0m4.225s
user 0m3.873s
sys 0m0.350s
So: same result, but vastly different execution times. Thankfully, this is easily resolved by placing brackets in the right places, but there does seem to be a regression somewhere in Eigen 3.3's evaluation of operations. With brackets around the mat.transpose() * VectorXf::Random(SIZE) part, the execution times for both Eigen versions drop to around 0.020s (so Eigen 3.2.10 clearly also benefits in this case). At least this means we can keep getting awesome performance out of Eigen!
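For completeness, the bracketed form of the accumulation line is:
vecsum += mat * ( mat.transpose() * VectorXf::Random(SIZE) );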
In the meantime, I'll accept ggael's answer; it's all I needed to know to move forward.

For the EVD, I cannot reproduce the slowdown with clang. With gcc, you need -O3 to avoid an inlining issue. Then, with both compilers, Eigen 3.3 delivers a 33% speedup.
EDIT: my previous answer regarding the matrix*matrix*vector product was wrong. This was a shortcoming in Eigen 3.3.0, and it will be fixed in Eigen 3.3.1. For the record, I leave my previous analysis here, as it is still partly valid:
As you noticed, you should really add the parentheses to perform two matrix*vector products instead of one big matrix*matrix product.
Then the speed difference is easily explained by the fact that in 3.2, the nested matrix*matrix product is immediately evaluated (at nesting time), whereas in 3.3 it is evaluated at evaluation time, that is, in operator=. This means that in 3.2, the loop is equivalent to:
for (int n = 0; n < 50; ++n) {
  MatrixXf tmp = mat * mat.transpose();
  vec = tmp * VectorXf::Random(SIZE);
}
and thus the compiler can move tmp out of the loop. Production code should not rely on the compiler for this kind of task, and should instead explicitly move constant expressions outside loops.
This is true, except that in practice the compiler was not smart enough to move the temporary out of the loop.
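For illustration, hoisting the temporary explicitly in the original MWE would look something like this (a sketch along the lines of ggael's advice, not benchmarked):
MatrixXf tmp = mat * mat.transpose(); // constant across iterations, so compute it once
for (int n = 0; n < 50; ++n)
  vec = tmp * VectorXf::Random(SIZE);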

Related

Trivial Eigen3 Tensor program does not build without -On

I'm trying to build a piece of software with the Tensor module provided as unsupported from eigen3. I've written a simple piece of code that builds with a simple use of VectorXd (just printing it to stdout), and also builds with an analogous use of Tensor in place of the VectorXd, but WILL NOT build when I do not pass an optimization flag (-On). Note that my build is from within a conda environment that uses conda-forge compilers, so the g++ in what follows is the g++ obtained from conda-forge for Ubuntu. It says its name in the error messages following, if that is perceived to be the issue.
I have a feeling this is not about the program I'm trying to write, but just in case I've included an mwe.cpp that seems to produce the error. The code follows:
#include <eigen3/Eigen/Dense>
#include <eigen3/unsupported/Eigen/CXX11/Tensor>
#include <iostream>
using namespace Eigen;
using namespace std;
int main(int argc, char const *argv[])
{
  VectorXd v(6);
  v << 1, 2, 3, 4, 5, 6;
  cout << v.cwiseSqrt() << "\n";
  Tensor<double, 1> t(6);
  for (auto i = 0; i < v.size(); i++) {
    t(i) = v(i);
  }
  cout << "\n";
  for (auto i = 0; i < t.size(); i++) {
    cout << t(i) << " ";
  }
  cout << "\n";
  return 0;
}
If the above code is compiled without any optimizations, like:
g++ -I ~/miniconda3/envs/myenv/include/ mwe.cpp -o mwe
I get the following compiler error:
/home/myname/miniconda3/envs/myenv/bin/../lib/gcc/x86_64-conda_cos6-linux-gnu/7.3.0/../../../../x86_64-conda_cos6-linux-gnu/bin/ld: /tmp/cc2q8gj4.o: in function `Eigen::internal::(anonymous namespace)::get_random_seed()':
mwe.cpp:(.text+0x15): undefined reference to `clock_gettime'
collect2: error: ld returned 1 exit status
If instead I ask for optimization level 'n', like the following:
g++ -I ~/miniconda3/envs/loos/include/ -On mwe.cpp -o mwe
The program builds without complaint and I get expected output:
$ ./mwe
1
1.41421
1.73205
2
2.23607
2.44949
1 2 3 4 5 6
I have no clue why this little program, or the real program I'm trying to write, would be trying to get a random seed for anything. Any advice would be appreciated. The reason I would like to build without optimization is to make debugging easier. I initially thought all this was caused by debug flags, but then realized that my build tool's debug setting simply didn't ask for optimization, and narrowed that down as the apparent cause. If I add -g -O1 I do not see the error.
Obviously, if one were to comment out all the code that has to do with the Tensor module, that is, everything in main above 'return' and below the cwiseSqrt() line, and also the include statement, the code builds and produces the expected output.
Technically, this is a linker error (g++ invokes the compiler as well as the linker, depending on the command line arguments). And you get linker errors if an externally defined function is referenced from somewhere, even if the code is never reached.
When compiling with optimizations enabled, g++ will optimize away uncalled functions (outside the global namespace), thus you get no linker errors. You may want to try -Og instead of -O1 for better debugging experience.
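For example, reusing the question's own command with -Og in place of -On:
g++ -I ~/miniconda3/envs/myenv/include/ -Og mwe.cpp -o mwe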
The following code should produce similar behavior:
int foo(); // externally defined

namespace { // anonymous namespace
  // defined inside this module, but never called
  int bar() {
    return foo();
  }
}

int main() {
  // if you un-comment this line, the
  // optimized version will fail as well:
  // ::bar();
}
According to man clock_gettime you need to link with -lrt if your glibc version is older than 2.17 -- maybe that is the case for your setup:
g++ -I ~/miniconda3/envs/myenv/include/ mwe.cpp -o mwe -lrt

32b multiplication in 8b uC with avr-g++ vs 32b multiplication on X86 with gcc

PROBLEM:
I'm writing a fixed-point C++ class to implement a closed-loop control system on an 8b microcontroller.
I wrote a C++ class to encapsulate the PID and tested the algorithm on an X86 desktop with a modern gcc compiler. All good.
When I compiled the same code on an 8b microcontroller with a modern avr-g++ compiler, I got weird artefacts. After some debugging, the problem was that the 16b*16b multiplication was truncated to 16b. Below is some minimal code to show what I'm trying to do.
I used -O2 optimization on the desktop system and -Os optimization on the embedded system, with no other compiler flags.
#include <cstdio>
#include <stdint.h>
#define TEST_16B true
#define TEST_32B true
int main( void )
{
  if (TEST_16B)
  {
    int16_t op1 = 9000;
    int16_t op2 = 9;
    int32_t res;
    // This operation gives the correct result on X86 gcc (81000)
    // This operation gives the wrong result on AVR avr-g++ (15464)
    res = (int32_t)0 + op1 * op2;
    printf("op1: %d | op2: %d | res: %d\n", op1, op2, res);
  }
  if (TEST_32B)
  {
    int16_t op1 = 9000;
    int16_t op2 = 9;
    int32_t res;
    // Promote first operand
    int32_t promoted_op1 = op1;
    // This operation gives the correct result on X86 gcc (81000)
    // This operation gives the correct result on AVR avr-g++ (81000)
    res = promoted_op1 * op2;
    printf("op1: %d | op2: %d | res: %d\n", promoted_op1, op2, res);
  }
  return 0;
}
SOLUTION:
Just promoting one operand to 32b with a local variable is enough to solve the problem.
My expectation was that C++ would guarantee that a math operation would be performed at the same width as the first operand, so in my mind res = (int32_t)0 +... should have told the compiler that whatever came after should be performed at int32_t resolution.
This is not what happened: the (int16_t)*(int16_t) operation got truncated to (int16_t).
gcc has an internal word width of at least 32b on an X86 machine, so that might be the reason I didn't see artefacts on my desktop.
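A quick way to sanity-check that explanation on a given target (my own hedged sketch, not part of the original test) is to print the width of int:
#include <cstdio>
#include <climits>

int main( void )
{
  // int is typically 32 bits with x86 gcc but 16 bits with avr-g++,
  // so integer promotion of int16_t operands stops at 16 bits on AVR
  printf("int is %u bits\n", (unsigned)(sizeof(int) * CHAR_BIT));
  return 0;
}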
AVR Command Line
E:\Programs\AVR\7.0\toolchain\avr8\avr8-gnu-toolchain\bin\avr-g++.exe$(QUOTE) -funsigned-char -funsigned-bitfields -DNDEBUG -I"E:\Programs\AVR\7.0\Packs\atmel\ATmega_DFP\1.3.300\include" -Os -ffunction-sections -fdata-sections -fpack-struct -fshort-enums -Wall -pedantic -mmcu=atmega4809 -B "E:\Programs\AVR\7.0\Packs\atmel\ATmega_DFP\1.3.300\gcc\dev\atmega4809" -c -std=c++11 -fno-threadsafe-statics -fkeep-inline-functions -v -MD -MP -MF "$(#:%.o=%.d)" -MT"$(#:%.o=%.d)" -MT"$(#:%.o=%.o)" -o "$#" "$<"
QUESTION:
Is this the expected behaviour of a compliant C++ compiler (meaning I did it wrong), or is this a quirk of the avr-g++ compiler?
UPDATE:
Debugger output of various solutions
This is expected behavior of the compiler.
When you write A + B * C, that is equivalent to A + (B * C) because of operator precedence. The B * C term is evaluated on its own, without regard to how it is going to be used later. (Otherwise, it would be really hard to look at C/C++ code and understand what is actually going to happen.)
There are integer promotion rules in the C/C++ standards that sometimes help you out by promoting B and C to be of type int or maybe unsigned int before performing the multiplication. That is why you get the expected result on x86 gcc, where an int has 32 bits. However, since an int in avr-gcc only has 16 bits, the integer promotion is not good enough for you. So you need to cast either B or C to an int32_t to ensure the result of the multiplication will be an int32_t as well. For example, you can do:
A + (int32_t)B * C
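Applied to the code in the question, a minimal version of that fix (no extra local variable needed) would be:
res = (int32_t)op1 * op2; // the multiplication is now performed at 32b width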

Why is std::vector<char> faster than std::string?

I have written a small test where I'm trying to compare the run speed of resizing a container and then using std::generate_n to fill it up. I'm comparing std::string and std::vector<char>. Here is the program:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <random>
#include <vector>
int main()
{
  std::random_device rd;
  std::default_random_engine rde(rd());
  std::uniform_int_distribution<int> uid(0, 25);
#define N 100000
#ifdef STRING
  std::cout << "String.\n";
  std::string s;
  s.resize(N);
  std::generate_n(s.begin(), N,
                  [&]() { return (char)(uid(rde) + 65); });
#endif
#ifdef VECTOR
  std::cout << "Vector.\n";
  std::vector<char> v;
  v.resize(N);
  std::generate_n(v.begin(), N,
                  [&]() { return (char)(uid(rde) + 65); });
#endif
  return 0;
}
And my Makefile:
test_string:
    g++ -std=c++11 -O3 -Wall -Wextra -pedantic -pthread -o test test.cpp -DSTRING
    valgrind --tool=callgrind --log-file="test_output" ./test
    cat test_output | grep "refs"

test_vector:
    g++ -std=c++11 -O3 -Wall -Wextra -pedantic -pthread -o test test.cpp -DVECTOR
    valgrind --tool=callgrind --log-file="test_output" ./test
    cat test_output | grep "refs"
And the comparisons for certain values of N:
N=10000
String: 1,865,367
Vector: 1,860,906
N=100000
String: 5,295,213
Vector: 5,290,757
N=1000000
String: 39,593,564
Vector: 39,589,108
std::vector<char> comes out ahead every time. Since it seems to be more performant, what is even the point of using std::string?
I used #define N 100000000 and tested 3 times for each scenario; in all scenarios string is faster. I did not use Valgrind, since benchmarking under it does not make sense.
OS: Ubuntu 14.04. Arch: x86_64. CPU: Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz.
$COMPILER -std=c++11 -O3 -Wall -Wextra -pedantic -pthread -o test x.cc -DVECTOR
$COMPILER -std=c++11 -O3 -Wall -Wextra -pedantic -pthread -o test x.cc -DSTRING
Times:
compiler/variant | time(1) | time(2) | time(3)
---------------------------+---------+---------+--------
g++ 4.8.2/vector Times: | 1.724s | 1.704s | 1.669s
g++ 4.8.2/string Times: | 1.675s | 1.678s | 1.674s
clang++ 3.5/vector Times: | 1.929s | 1.934s | 1.905s
clang++ 3.5/string Times: | 1.616s | 1.612s | 1.619s
std::vector comes out ahead every time. Since it seems to be more performant, what is even the point of using std::string?
Even if we suppose that your observation holds true for a wide range of different systems and different application contexts, it would still make sense to use std::string for various reasons, which are all rooted in the fact that a string has different semantics than a vector. A string is a piece of text (at least simple, non-internationalised English text), while a vector is a collection of characters.
Two things come to mind:
Ease of use. std::string can be constructed from string literals, has a lot of convenient operators and can be subject to string-specific algorithms. Try std::string x = "foo" + ("bar" + boost::algorithm::replace_all_copy(f(), "abc", "ABC")).substr(0, 10) with a std::vector<char>...
std::string is implemented with Small-String Optimization (SSO) in MSVC, eliminating heap allocation entirely in many cases. SSO is based on the observation that strings are often very short, which certainly cannot be said about vectors.
Try the following:
#include <iostream>
#include <vector>
#include <string>
int main()
{
  char const array[] = "short string";
#ifdef STRING
  std::cout << "String.\n";
  for (int i = 0; i < 10000000; ++i) {
    std::string s = array;
  }
#endif
#ifdef VECTOR
  std::cout << "Vector.\n";
  for (int i = 0; i < 10000000; ++i) {
    std::vector<char> v(std::begin(array), std::end(array));
  }
#endif
}
The std::string version should outperform the std::vector version, at least with MSVC. The difference is about 2-3 seconds on my machine. For longer strings, the results should be different.
Of course, this does not really prove anything either, except two things:
Performance tests depend a lot on the environment.
Performance tests should test what will realistically be done in a real program. In the case of strings, your program may deal with many small strings rather than a single huge one, so test small strings.

C++ operator overload performance issue

Consider the following scheme. We have 3 files:
main.cpp:
#include <cstdio>   // printf
#include <ctime>    // clock, CLOCKS_PER_SEC
#include "class.h"  // struct C

int main() {
  clock_t begin = clock();
  int a = 0;
  for (int i = 0; i < 1000000000; ++i) {
    a += i;
  }
  clock_t end = clock();
  printf("Number: %d, Elapsed time: %f\n",
         a, double(end - begin) / CLOCKS_PER_SEC);

  begin = clock();
  C b(0);
  for (int i = 0; i < 1000000000; ++i) {
    b += C(i);
  }
  end = clock();
  printf("Number: %d, Elapsed time: %f\n",
         a, double(end - begin) / CLOCKS_PER_SEC);
  return 0;
}
class.h:
#include <iostream>
struct C {
public:
  int m_number;
  C(int number);
  void operator+=(const C & rhs);
};
class.cpp:
#include "class.h"  // declaration of struct C

C::C(int number)
  : m_number(number)
{
}

void
C::operator+=(const C & rhs) {
  m_number += rhs.m_number;
}
Files are compiled using clang++ with flags -std=c++11 -O3.
What I expected were very similar performance results, since I thought that the compiler would optimize the operator so that it is not called as a function. The reality, though, was a bit different. Here is the result:
Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 5.375751
I played around a bit and found out that if I paste all of the code from class.* into main.cpp, the speed dramatically improves and the results are very similar.
Number: -1243309312, Elapsed time: 0.000003
Number: -1243309312, Elapsed time: 0.000003
Then I realized that this behavior is probably caused by the fact that the compilation of main.cpp and class.cpp is completely separate, and therefore the compiler is unable to perform adequate optimizations.
My question: is there any way of keeping the 3-file scheme and still achieving the optimization level as if the files were merged into one and then compiled? I have read something about 'unity builds', but that seems like overkill.
Short answer
What you want is link time optimization. Try the answer from this question. I.e., try:
clang++ -O4 -emit-llvm main.cpp -c -o main.bc
clang++ -O4 -emit-llvm class.cpp -c -o class.bc
llvm-link main.bc class.bc -o all.bc
opt -std-compile-opts -std-link-opts -O3 all.bc -o optimized.bc
clang++ optimized.bc -o yourExecutable
You should see that your performance reaches the one that you had when pasting everything into main.cpp.
Long answer
The problem is that the compiler cannot inline your overloaded operator during linking, because it no longer has its definition in a form which it can use to inline it (it cannot inline bare machine code). Thus, the operator call in main.cpp will stay a real function call to the function declared in class.cpp. A function call is very expensive in comparison to a simple inlined addition which can be optimized further (e.g., vectorized).
When you enable link time optimization, the compiler is able to do this. As you see above, you first create llvm intermediate representation byte code (the .bc files, which I will simply call llvm code hereinafter) instead of machine code.
You then link these files into a new .bc file which still contains llvm code instead of machine code. In contrast to machine code, the compiler is able to perform inlining on llvm code. opt is the llvm optimizer (be sure to install llvm), which performs the inlining and further link time optimizations. Then, we call clang++ a final time to generate executable machine code from the optimized llvm code.
For People with GCC
The answer above is only for clang. GCC (g++) users must use the -flto flag during compilation and during linking to enable link time optimization. It is simpler than with clang: simply add -flto everywhere:
g++ -c -O2 -flto main.cpp
g++ -c -O2 -flto class.cpp
g++ -o myprog -flto -O2 main.o class.o
The technique you are looking for is called Link Time Optimization.
From the timing data, it is obvious that the compiler doesn't just generate better code for the trivial case; it doesn't execute any code at all to sum up a billion numbers. That doesn't happen in real life. You are not performing a useful benchmark. You want to test code that is at least complicated enough to avoid stupid/clever things like this.
I'd re-run the test, but change the loop to
for (int i = 0; i < 1000000000; ++i) if (i != 1000000) {
  // ...
}
so that the compiler is forced to actually add up the numbers.

Why does g++ (4.6 and 4.7) promote the result of this division to a double? Can I stop it?

I was writing some templated code to benchmark a numeric algorithm using both floats and doubles, in order to compare against a GPU implementation.
I discovered that my floating point code was slower and after investigating using Vtune Amplifier from Intel I discovered that g++ was generating extra x86 instructions (cvtps2pd/cvtpd2ps and unpcklps/unpcklpd) to convert some intermediate results from float to double and then back again. The performance degradation is almost 10% for this application.
After compiling with the flag -Wdouble-promotion (which BTW is not included with -Wall or -Wextra), sure enough g++ warned me that the results were being promoted.
I reduced this to a simple test case, shown below. Note that the ordering of the C++ code affects the generated code. The compound statement (T d1 = log(r)/r;) produces a warning, whilst the separated version does not (T d = log(r); d/=r;).
The following was compiled with both g++-4.6.3-1ubuntu5 and g++-4.7.3-2ubuntu1~12.04 with the same results.
Compile flags are:
g++-4.7 -O2 -Wdouble-promotion -Wextra -Wall -pedantic -Werror -std=c++0x test.cpp -o test
#include <cstdlib>
#include <iostream>
#include <cmath>

template <typename T>
T f()
{
  T r = static_cast<T>(0.001);
  // Gives no double promotion warning
  T d = log(r);
  d/=r;
  // Promotes to double
  T d1 = log(r)/r;
  return d + d1;
}

int main()
{
  float f1 = f<float>();
  std::cout << f1 << std::endl;
}
I realise that the C++11 standard allows the compiler discretion here. But why does the order matter?
Can I explicitly instruct g++ to use floats only for this calculation?
EDIT: SOLVED by Mike Seymour. I needed to use std::log to ensure picking up the overloaded version of log instead of calling the C double log(double). The warning was not generated for the separated statement because that is a conversion and not a promotion.
The problem is
log(r)
In this implementation, it seems that the only log in the global namespace is the C library function, double log(double). Remember that it's not specified whether or not the C-library headers in the C++ library dump their definitions into the global namespace as well as namespace std.
You want
std::log(r)
to ensure that the extra overloads defined by the C++ library are available.
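For reference, a sketch of the fixed template function from the question, where the only change is qualifying the calls:
template <typename T>
T f()
{
  T r = static_cast<T>(0.001);
  // std::log has a float overload, so for T = float the whole
  // expression stays in float and no promotion warning is emitted
  T d = std::log(r);
  d/=r;
  T d1 = std::log(r)/r;
  return d + d1;
}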