Matrix multiplication performance: NumPy vs. Eigen (C++)

I am trying to compare the matrix multiplication performance of Eigen (C++) and NumPy.
Here is the C++ code for the matrix multiplication:
#include <iostream>
#include <Eigen/Dense>
#include <ctime>
#include <iomanip>

using namespace Eigen;
using namespace std;

int main()
{
    time_t begin, end;
    double difference = 0;
    time(&begin);
    for (int i = 0; i < 500; ++i)
    {
        MatrixXd m1 = MatrixXd::Random(500, 500);
        MatrixXd m2 = MatrixXd::Random(500, 500);
        MatrixXd m3 = MatrixXd::Zero(500, 500);
        m3 = m1 * m2;
    }
    time(&end);
    difference = difftime(end, begin);
    std::cout << "time = " << std::setprecision(10) << (difference / 500.) << " seconds" << std::endl;
    return 0;
}
Compiled with: g++ -Wall -Wextra -I "path-to-eigen-directory" prog5.cpp -o prog5 -O3 -std=gnu++0x
Output:
time = 0.116 seconds
Here is the Python code:
import timeit
import numpy as np
start_time = timeit.default_timer()
for i in range(500):
    m1 = np.random.rand(500, 500)
    m2 = np.random.rand(500, 500)
    m3 = np.zeros((500, 500))
    m3 = np.dot(m1, m2)
stop_time = timeit.default_timer()
print('Time = {} seconds'.format((stop_time-start_time)/500))
Output:
Time = 0.01877937281645333 seconds
It looks like the C++ code is about 6 times slower than the Python code. Can someone give me insight into whether I am missing anything here?
I am using Eigen 3.3.4, g++ (MinGW.org GCC-6.3.0-1) 6.3.0, Python 3.6.1, and NumPy 1.11.3. The Python code is run from the Spyder IDE, on Windows.
Update:
As per the answer and comments, I updated the code.
The C++ code is now compiled with g++ -Wall -Wextra -I "path-to-eigen-directory" prog5.cpp -o prog5 -O3 -std=gnu++0x -march=native. I couldn't get -fopenmp to work: there seems to be no output if I use this flag.
#include <iostream>
#include <Eigen/Dense>
#include <ctime>
#include <iomanip>

using namespace Eigen;
using namespace std;

int main()
{
    time_t begin, end;
    double difference = 0;
    time(&begin);
    for (int i = 0; i < 10000; ++i)
    {
        MatrixXd m1 = MatrixXd::Random(500, 500);
        MatrixXd m2 = MatrixXd::Random(500, 500);
        MatrixXd m3 = MatrixXd::Zero(500, 500);
        m3 = m1 * m2;
    }
    time(&end); // note time after execution
    difference = difftime(end, begin);
    std::cout << "Total time = " << difference << " seconds" << std::endl;
    std::cout << "Average time = " << std::setprecision(10) << (difference / 10000.) << " seconds" << std::endl;
    return 0;
}
Output:
Total time = 328 seconds
Average time = 0.0328 seconds
Python code:
import timeit
import numpy as np
start_time = timeit.default_timer()
for i in range(10000):
    m1 = np.random.rand(500, 500)
    m2 = np.random.rand(500, 500)
    m3 = np.zeros((500, 500))
    m3 = np.dot(m1, m2)
stop_time = timeit.default_timer()
print('Total time = {} seconds'.format(stop_time-start_time))
print('Average time = {} seconds'.format((stop_time-start_time)/10000))
Run with the runfile('filename.py') command in the Spyder IDE.
Output:
Total time = 169.35587796526667 seconds
Average time = 0.016935587796526666 seconds
Now the performance with Eigen is better, but still not equal to or faster than NumPy. Maybe -fopenmp can do the trick, but I am not sure. However, I am not using any parallelization in NumPy, unless it is doing so implicitly.
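A quick way to check whether -fopenmp actually took effect is to query Eigen's thread count at run time. This is a minimal sketch (not part of the original benchmark, and it assumes Eigen was configured with OpenMP support); if OpenMP is active, nbThreads() should report more than one thread on a multi-core machine:

#include <Eigen/Core>
#include <iostream>

int main()
{
    // With OpenMP enabled, Eigen parallelizes large matrix products and
    // nbThreads() reports how many threads it will use.
    std::cout << "Eigen is using " << Eigen::nbThreads() << " thread(s)\n";
    // Eigen::setNbThreads(4);  // optionally set the thread count explicitly
    return 0;
}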

There are several issues with your benchmark:
You are benchmarking the system rand() function, which is very costly!
You're missing the compiler flag -march=native to get AVX/FMA speedups.
You're missing -fopenmp to enable multithreading.
On my quad i7 2.6GHz CPU I get:
initial code: 0.024s
after replacing `Random` by `Ones`: 0.018s
adding `-march=native`: 0.006s
adding `-fopenmp`: 0.003s
The matrix is a bit too small to get good multithreading benefits.
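To illustrate the first point, here is a minimal sketch (a variant of the original benchmark, not code from this answer) that generates the random matrices once, outside the timed region, so that only the product itself is measured:

#include <Eigen/Dense>
#include <chrono>
#include <iostream>

int main()
{
    using namespace Eigen;
    using Clock = std::chrono::steady_clock;

    MatrixXd m1 = MatrixXd::Random(500, 500);  // generated once, outside the timing
    MatrixXd m2 = MatrixXd::Random(500, 500);
    MatrixXd m3(500, 500);

    const int reps = 100;
    auto t0 = Clock::now();
    for (int i = 0; i < reps; ++i)
        m3.noalias() = m1 * m2;                // only the multiplication is timed
    auto t1 = Clock::now();

    std::chrono::duration<double> elapsed = t1 - t0;
    std::cout << "Average time = " << elapsed.count() / reps << " seconds\n";
    std::cout << m3(0, 0) << "\n";             // use the result so the work is not optimized away
    return 0;
}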


How to print the current time with fractional seconds, using the fmt library

This code
#include <chrono>
#include <fmt/format.h>
#include <fmt/chrono.h>
...
auto now = std::chrono::system_clock::now();
fmt::print("The time is: {:%Y-%m-%d %H:%M:%S}\n", now);
Prints this:
The time is: 2023-01-02 15:51:23
How do I get it to print sub-second precision, for example, milliseconds? Something like:
The time is: 2023-01-02 15:51:23.753
I think you have to do this in two steps -- YMD and HM followed by the high-res seconds. Here is a worked example, using fmt version 9.0 on Ubuntu:
#include <chrono>
#include <fmt/format.h>
#include <fmt/chrono.h>
int main() {
    auto now = std::chrono::system_clock::now();
    auto sse = now.time_since_epoch();
    fmt::print("{:%FT%H:%M:}{:%S}\n", now, sse);
    exit(0);
}
for which I get
$ g++ -o answer answer.cpp -lfmt; ./answer
2023-01-02T16:26:17.009359309
$
(An initial, earlier, slightly off attempt to reason from spdlog follows; spdlog wraps fmt but overlays its own format strings.)
It is on the format page: you want something like %Y-%m-%d %H:%M:%S.%e. Note that %e, %f, and %F give you milli-, micro-, and nanoseconds. Edit: that was the spdlog format, though.
Here is a quick example, for simplicity from my R package RcppSpdlog accessing fmt via spdlog:
> library(RcppSpdlog)
> RcppSpdlog::log_set_pattern("[%Y-%m-%d %H:%M:%S.%f] [%L] %v")
> RcppSpdlog::log_warn("Hello")
[2023-01-02 15:00:49.746031] [W] Hello
> RcppSpdlog::log_set_pattern("[%Y-%m-%d %H:%M:%S.%e] [%L] %v")
> RcppSpdlog::log_warn("World")
[2023-01-02 15:01:02.137] [W] World
>
Note the display of micro- and milliseconds as stated.
Edit: There is something else going on with your example. Here is a simplified answer, printing only seconds; by default it comes out fractionally, down to nanoseconds (for me, on Linux):
$ cat answer.cpp
#include <chrono>
#include <fmt/format.h>
#include <fmt/chrono.h>
int main() {
    fmt::print("{:%S}\n", std::chrono::system_clock::now().time_since_epoch());
    exit(0);
}
$ g++ -o answer answer.cpp -lfmt
$ ./answer
02.461241690
$
Just to share my way of outputting fractional seconds in logging:
std::string Timestamp() {
    namespace c = std::chrono;
    auto tp = c::time_point_cast<c::microseconds>(c::system_clock::now());
    auto us = tp.time_since_epoch().count() % std::micro::den;
    return fmt::format("{:%Y-%m-%d_%H:%M:%S}.{:06d}", tp, us);
}
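The helper can then be dropped into any log line; a trivial usage sketch (it assumes the Timestamp() definition above and the fmt headers are in scope):

int main() {
    fmt::print("[{}] job started\n", Timestamp());
}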
There is a pull request to output fractional seconds when formatting a time_point (now closed; the author did not continue the work):
https://github.com/fmtlib/fmt/pull/2292
My concern is that changing %S to include the fractional part is a somewhat breaking change (strftime defines it as seconds from 00 to 60). I hope fmtlib will consider some other way to tackle this requirement.

How to measure the execution time of C math.h library functions?

By using the time.h header, I'm getting the execution time of cbrt() as 2 nanoseconds (with the gcc command in a Linux terminal) and 44 nanoseconds (with the g++ command in an Ubuntu terminal). Can anyone tell me another method to measure the execution time of the math.h library functions?
Below is the code:
#include <time.h>
#include <stdio.h>
#include <math.h>

int main()
{
    time_t begin, end;                    // time_t is a datatype to store time values.
    time(&begin);                         // note time before execution
    for (int i = 0; i < 1000000000; i++)  // loop 10^9 times so the total time in seconds equals the time per call in nanoseconds
    {
        cbrt(9999999);                    // calling the cube root function from the math library
    }
    time(&end);                           // note time after execution
    double difference = difftime(end, begin);
    printf("time taken for function() %.2lf in Nanoseconds.\n", difference);
    printf(" cube root is :%f \t", cbrt(9999999));
    return 0;
}
OUTPUT:
by using **gcc**: time taken for function() 2.00 in Nanoseconds.
cube root is :215.443462
by using **g++**: time taken for function() 44.00 in Nanoseconds.
cube root is :215.443462
Linux terminal result
Give or take the length of the prompt:
$ g++ t1.c
$ ./a.out
time taken for function() 44.00 in Nanoseconds.
cube root is :215.443462
$ gcc t1.c
$ ./a.out
time taken for function() 2.00 in Nanoseconds.
cube root is :215.443462
$
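For finer resolution than time(), which only ticks in whole seconds, one common alternative (not what the answers below use; they rely on a small timing library instead) is POSIX clock_gettime with CLOCK_MONOTONIC. A minimal sketch, which keeps and prints the sum so the loop is not optimized away:

#include <math.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec t0, t1;
    double sum = 0.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < 1000000000; i++)
        sum += cbrt(i);                     /* use the result so the loop is kept */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("about %.2f ns per call (sum = %f)\n", elapsed, sum);  /* 1e9 calls: seconds == ns per call */
    return 0;
}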
how to measure the execution time of c math.h library functions?
C compilers are often allowed to analyze well-known standard library functions and replace fixed code like cbrt(9999999); with 215.443462.... Further, since dropping the call inside the loop does not affect what the code does, the whole loop may be optimized out.
Use of volatile prevents much of this, as the compiler can no longer assume there is no impact when the call is replaced or removed.
for (int i = 0; i < 1000000000; i++) {
    // cbrt(9999999);
    volatile double x = 9999999.0;
    volatile double y = cbrt(x);
}
The granularity of time() is often only 1 second, so if the billion loops result in only a few seconds, consider more loops.
The code below could be used to factor out the loop overhead:
time_t begin, middle, end;
time(&begin);
for (int i = 0; i < 1000000000; i++) {
    volatile double x = 9999999.0;
    volatile double y = x;
}
time(&middle);
for (int i = 0; i < 1000000000; i++) {
    volatile double x = 9999999.0;
    volatile double y = cbrt(x);
}
time(&end);
double difference = difftime(end, middle) - difftime(middle, begin);
Timing code is an art, and one part of the art is making sure that the compiler doesn't optimize your code away. For standard library functions, the compiler may well be aware of what the function does and be able to evaluate it to a constant at compile time. In your example, the call cbrt(9999999); gives two opportunities for optimization. First, the value from cbrt() can be evaluated at compile time because the argument is a constant. Secondly, the return value is not used, and the standard function has no side effects, so the compiler can drop the call altogether. You can avoid those problems by capturing the result (for example, by evaluating the sum of the cube roots from 0 to one billion (minus one) and printing that value after the timing code).
tm97.c
When I compiled your code, shorn of comments, I got:
$ cat tm97.c
#include <time.h>
#include <stdio.h>
#include <math.h>
int main(void)
{
    time_t begin, end;
    time(&begin);
    for (int i = 0; i < 1000000000; i++)
    {
        cbrt(9999999);
    }
    time(&end);
    double difference = difftime(end, begin);
    printf("time taken for function() %.2lf in Nanoseconds.\n", difference);
    printf(" cube root is :%f \t", cbrt(9999999));
    return 0;
}
$ make tm97
gcc -O3 -g -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm97.c -o tm97 -L../lib -lsoq
tm97.c: In function ‘main’:
tm97.c:11:9: error: statement with no effect [-Werror=unused-value]
11 | cbrt(9999999);
| ^~~~
cc1: all warnings being treated as errors
rmk: error code 1
$
I'm using GCC 9.3.0 on a 2017 MacBook Pro running macOS Mojave 10.14.6 with XCode 11.3.1 (11C504). (XCode 11.4 requires Catalina 10.15.2, but work hasn't got around to organizing support for that yet.) Interestingly, when the same code is compiled by g++, it compiles without warnings (errors):
$ ln -s tm97.c tm89.cpp
make tm89 SXXFLAGS=-std=c++17 CXX=g++
g++ -O3 -g -I../inc -std=c++17 -Wall -Wextra -Werror -L../lib tm89.cpp -lsoq -o tm89
$
I routinely use some timing code that is available in my SOQ (Stack Overflow Questions) repository on GitHub as files timer.c and timer.h in the src/libsoq sub-directory. The code is only compiled as C code in my library, so I created a simple wrapper header, timer2.h, so that the programs below could use #include "timer2.h" and it would work OK with both C and C++ compilations:
#ifndef TIMER2_H_INCLUDED
#define TIMER2_H_INCLUDED
#ifdef __cplusplus
extern "C" {
#endif
#include "timer.h"
#ifdef __cplusplus
}
#endif
#endif /* TIMER2_H_INCLUDED */
tm29.cpp and tm31.c
This code uses the sqrt() function for testing. It accumulates the sum of the square roots. It uses the timing code from timer.h/timer.c around your timing code — type Clock and functions clk_init(), clk_start(), clk_stop(), and clk_elapsed_us() to evaluate the elapsed time in microseconds between when the clock was started and last stopped.
The source code can be compiled by either a C compiler or a C++ compiler.
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"
int main(void)
{
    time_t begin, end;
    double sum = 0.0;
    int i;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (i = 0; i < 1000000000; i++)
    {
        sum += sqrt(i);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for sqrt() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Sum of square roots from 0 to %d is: %f\n", i, sum);
    return 0;
}
tm41.c and tm43.cpp
This code is almost identical to the previous code, but the tested function is the cbrt() (cube root) function.
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"
int main(void)
{
    time_t begin, end;
    double sum = 0.0;
    int i;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (i = 0; i < 1000000000; i++)
    {
        sum += cbrt(i);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for cbrt() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Sum of cube roots from 0 to %d is: %f\n", i, sum);
    return 0;
}
tm59.c and tm61.cpp
This code uses fabs() instead of either sqrt() or cbrt(). It's still a function call, but it might be inlined. It invokes the conversion from int to double explicitly; without that cast, GCC complains that it should be using the integer abs() function instead.
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"
int main(void)
{
    time_t begin, end;
    double sum = 0.0;
    int i;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (i = 0; i < 1000000000; i++)
    {
        sum += fabs((double)i);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for fabs() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Sum of absolute values from 0 to %d is: %f\n", i, sum);
    return 0;
}
tm73.cpp
This file uses the original code, wrapped in my timing code as well. The C version doesn't compile (it hits the unused-value error shown earlier); the C++ version does:
#include <time.h>
#include <stdio.h>
#include <math.h>
#include "timer2.h"
int main(void)
{
    time_t begin, end;
    Clock clk;
    clk_init(&clk);
    clk_start(&clk);
    time(&begin);
    for (int i = 0; i < 1000000000; i++)
    {
        cbrt(9999999);
    }
    time(&end);
    clk_stop(&clk);
    double difference = difftime(end, begin);
    char buffer[32];
    printf("Time taken for cbrt() is %.2lf nanoseconds (%s ns).\n",
           difference, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
    printf("Cube root is: %f\n", cbrt(9999999));
    return 0;
}
Timing
Using a command timecmd which reports start and stop time, and PID, of programs as well as the timing code built into the various commands (it's a variant on the theme of the time command), I got the following results. (rmk is just an alternative implementation of make.)
$ for prog in tm29 tm31 tm41 tm43 tm59 tm61 tm73
> do rmk $prog && timecmd -ur -- $prog
> done
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm29.cpp -o tm29 -L../lib -lsoq
2020-03-28 08:47:50.040227 [PID 19076] tm29
Time taken for sqrt() is 1.00 nanoseconds (1.700296 ns).
Sum of square roots from 0 to 1000000000 is: 21081851051977.781250
2020-03-28 08:47:51.747494 [PID 19076; status 0x0000] - 1.707267s - tm29
gcc -O3 -g -I../inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm31.c -o tm31 -L../lib -lsoq
2020-03-28 08:47:52.056021 [PID 19088] tm31
Time taken for sqrt() is 1.00 nanoseconds (1.679867 ns).
Sum of square roots from 0 to 1000000000 is: 21081851051977.781250
2020-03-28 08:47:53.742383 [PID 19088; status 0x0000] - 1.686362s - tm31
gcc -O3 -g -I../inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm41.c -o tm41 -L../lib -lsoq
2020-03-28 08:47:53.908285 [PID 19099] tm41
Time taken for cbrt() is 7.00 nanoseconds (6.697999 ns).
Sum of cube roots from 0 to 1000000000 is: 749999999499.628418
2020-03-28 08:48:00.613357 [PID 19099; status 0x0000] - 6.705072s - tm41
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm43.cpp -o tm43 -L../lib -lsoq
2020-03-28 08:48:00.817975 [PID 19110] tm43
Time taken for cbrt() is 7.00 nanoseconds (6.614539 ns).
Sum of cube roots from 0 to 1000000000 is: 749999999499.628418
2020-03-28 08:48:07.438298 [PID 19110; status 0x0000] - 6.620323s - tm43
gcc -O3 -g -I../inc -std=c11 -Wall -Wextra -Werror -Wmissing-prototypes -Wstrict-prototypes tm59.c -o tm59 -L../lib -lsoq
2020-03-28 08:48:07.598344 [PID 19121] tm59
Time taken for fabs() is 1.00 nanoseconds (1.114822 ns).
Sum of absolute values from 0 to 1000000000 is: 499999999067108992.000000
2020-03-28 08:48:08.718672 [PID 19121; status 0x0000] - 1.120328s - tm59
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm61.cpp -o tm61 -L../lib -lsoq
2020-03-28 08:48:08.918745 [PID 19132] tm61
Time taken for fabs() is 2.00 nanoseconds (1.117780 ns).
Sum of absolute values from 0 to 1000000000 is: 499999999067108992.000000
2020-03-28 08:48:10.042134 [PID 19132; status 0x0000] - 1.123389s - tm61
g++ -O3 -g -I../inc -std=c++11 -Wall -Wextra -Werror tm73.cpp -o tm73 -L../lib -lsoq
2020-03-28 08:48:10.236899 [PID 19143] tm73
Time taken for cbrt() is 0.00 nanoseconds (0.000004 ns).
Cube root is: 215.443462
2020-03-28 08:48:10.242322 [PID 19143; status 0x0000] - 0.005423s - tm73
$
I've run the programs many times; the times above are representative of what I got each time. There are a number of conclusions that can be drawn:
sqrt() (1.7 ns) is quicker than cbrt() (6.7 ns).
fabs() (1.1 ns) is quicker than sqrt() (1.7 ns).
However, fabs() gives a moderate approximation of the time taken by the loop overhead and the conversion from int to double.
When the result of cbrt() is not used, the compiler eliminates the loop.
When compiled with the C++ compiler, the code from the question removes the loop altogether, leaving only the calls to time() to be measured. The result printed by clk_elapsed_us() is the time taken to execute the code between clk_start() and clk_stop(), in seconds with microsecond resolution: 0.000004 is 4 microseconds of elapsed time. The value is marked in ns because when the loop executes one billion times, the elapsed time in seconds also represents the time in nanoseconds for one loop (there are a billion nanoseconds in a second).
The times reported by timecmd are consistent with the times reported by the programs. There is the overhead of starting the process (fork() and exec()) and the I/O in the process that is included in the times reported by timecmd.
Although not shown, the timings with clang and clang++ (instead of GCC 9.3.0) are very comparable, though the cbrt() code takes about 7.5 ns per iteration instead of 6.7 ns. The timing differences for the others are basically noise.
The number suffixes are all 2-digit primes. They have no other significance except to keep the different programs separate.
As @Jonathan Leffler commented, the compiler can optimize your C/C++ code. If the C code just loops from 0 to 1000 without doing anything with the counter i (I mean, without printing it or using the intermediate values in any other operation, as indexes, etc.), the compiler may not even generate the assembly code that corresponds to that loop. Possible arithmetic operations may even be pre-computed. For the code below:
int foo(int x) {
    return x * 5;
}

int main() {
    int x = 3;
    int y = foo(x);
    ...
    ...
}
it would not be surprising for the compiler to generate just two lines of assembly code for the function foo (the compiler may even bypass calling foo entirely and generate an inline instruction):
mov $15, %eax
; compiler will not bother multiplying 5 by 3
; but just move the pre-computed '15' to register
ret
; and then return

Exception throw when using Armadillo and qpOASES

I have a quadratic programming optimization problem that I am solving with qpOASES. In it there is a matrix X that I need to precondition, so I am using Armadillo and its arma::pinv routine to calculate the Moore-Penrose pseudoinverse.
The problem: I write the matrix X to a file, and then I read it in a separate program (say test.cpp) that does not depend in any way on qpOASES. There, the pinv routine runs fine.
#include <iostream>
#include <fstream>
#include <armadillo>
#include <string>

using namespace std;
using namespace arma;

int main(){
    // Read design matrix.
    int NRows = 199;
    int NFields = 26;
    string flname_in = "chol_out_2_data";
    ifstream myin(flname_in);   // input stream for the data file (declaration missing in the posted snippet)
    mat A(NRows, NFields);
    for (int i = 0; i < NRows; ++i)
        for (int j = 0; j < NFields; ++j)
            myin >> A(i,j);

    // Calculate pseudoinverse
    mat M;
    pinv(M, A);   // <========= THIS fails when I use flag: -lqpOASES
}
When I include the same routine in the file where I perform the QP optimization (say true_QP.cpp), I get a runtime error because pinv is not able to calculate the pseudoinverse. I've done extensive tests; the file is read in OK and the values are the same.
I've tracked the problem down to a conflict in the following way: I compiled the program that does not depend in any way on qpOASES (test.cpp, as described above) with the flag -lqpOASES as well, and then that code also gives a runtime error.
That is, compiling with
g++ test.cpp -o test.xxx -larmadillo
runs fine:
./test.xxx
while compiling with
g++ test.cpp -o test.xxx -larmadillo -lqpOASES
throws an exception (due to the failure to calculate pinv):
./test.xxx
Therefore I suspect some conflict: it seems that using -lqpOASES also affects some flag in Armadillo? Any ideas? Is there some dependency on LAPACK/BLAS or some internal flag that may change the setup of Armadillo? Thank you for your time.
Here is the documentation for the arma::pinv function:
http://arma.sourceforge.net/docs.html#pinv
I resolved the issue by calculating pinv with Eigen instead of Armadillo.
The function definition I used for Eigen, based on this bug report:
http://eigen.tuxfamily.org/bz/show_bug.cgi?id=257
is:
template<typename _Matrix_Type_>
Eigen::MatrixXd pinv(const _Matrix_Type_ &a, double epsilon = std::numeric_limits<double>::epsilon())
{
    Eigen::JacobiSVD<_Matrix_Type_> svd(a, Eigen::ComputeThinU | Eigen::ComputeThinV);
    double tolerance = epsilon * std::max(a.cols(), a.rows()) * svd.singularValues().array().abs()(0);
    return svd.matrixV()
           * (svd.singularValues().array().abs() > tolerance)
                 .select(svd.singularValues().array().inverse(), 0)
                 .matrix()
                 .asDiagonal()
           * svd.matrixU().adjoint();
}
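For completeness, a minimal usage sketch of the template above (it assumes the pinv<>() definition is in scope; the matrix shape matches the 199 x 26 design matrix from the question, but it is filled with random values here just for illustration):

#include <Eigen/Dense>
#include <iostream>
#include <limits>

// ... pinv<>() template as defined above ...

int main()
{
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(199, 26);  // stand-in for the design matrix
    Eigen::MatrixXd M = pinv(A);                           // Moore-Penrose pseudoinverse
    std::cout << "pinv is " << M.rows() << " x " << M.cols() << "\n";  // 26 x 199
    return 0;
}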

Efficiency of STL algorithms with fixed size arrays

In general, I assume that the STL implementation of any algorithm is at least as efficient as anything I can come up with (with the additional benefit of being error free). However, I came to wonder whether the STL's focus on iterators might be harmful in some situations.
Let's assume I want to calculate the inner product of two fixed-size arrays. My naive implementation would look like this:
std::array<double, 100000> v1;
std::array<double, 100000> v2;
// fill with arbitrary numbers

double sum = 0.0;
for (size_t i = 0; i < v1.size(); ++i) {
    sum += v1[i] * v2[i];
}
As the number of iterations and the memory layout are known at compile time and all operations can be mapped directly to native processor instructions, the compiler should easily be able to generate the "optimal" machine code from this (loop unrolling, vectorization/FMA instructions, ...).
The STL version
double sum = std::inner_product(cbegin(v1), cend(v1), cbegin(v2), 0.0);
on the other hand adds some additional indirection, and even if everything is inlined, the compiler still has to deduce that it is working on a contiguous memory region and where that region lies. While this is certainly possible in principle, I wonder whether a typical C++ compiler will actually do it.
So my question is: do you think there can be a performance benefit to implementing standard algorithms that operate on fixed-size arrays on my own, or will the STL version always outperform a manual implementation?
As suggested, I did some measurements. For the code below, compiled with VS2013 for x64 in release mode and executed on a Windows 8.1 machine with an i7-2640M, the algorithm version is consistently slower by about 20% (15.6-15.7 s vs. 12.9-13.1 s). The relative difference also stays roughly constant over two orders of magnitude for N and REPS.
So I guess the answer is: using standard library algorithms CAN hurt performance.
It would still be interesting to know whether this is a general problem or specific to my platform, compiler, and benchmark. You are welcome to post your own results or comment on the benchmark.
#include <iostream>
#include <numeric>
#include <array>
#include <chrono>
#include <cstdlib>

#define USE_STD_ALGORITHM

using namespace std;
using namespace std::chrono;

static const size_t N = 10000000;  // size of the arrays
static const size_t REPS = 1000;   // number of repetitions

array<double, N> a1;
array<double, N> a2;

int main(){
    srand(10);
    for (size_t i = 0; i < N; ++i) {
        a1[i] = static_cast<double>(rand()) * 0.01;
        a2[i] = static_cast<double>(rand()) * 0.01;
    }

    double res = 0.0;
    auto start = high_resolution_clock::now();
    for (size_t z = 0; z < REPS; z++) {
#ifdef USE_STD_ALGORITHM
        res = std::inner_product(a1.begin(), a1.end(), a2.begin(), res);
#else
        for (size_t t = 0; t < N; ++t) {
            res += a1[t] * a2[t];
        }
#endif
    }
    auto end = high_resolution_clock::now();

    std::cout << res << " ";  // <-- necessary, so that loop isn't optimized away
    std::cout << duration_cast<milliseconds>(end - start).count() << " ms" << std::endl;
}
/*
* Update: Results (ubuntu 14.04 , haswell)
* STL: algorithm
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3551 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3567 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9378 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8505 ms
*
* loop:
* g++-4.8-2 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 3543 ms
* g++-4.8-2 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 3551 ms
* clang++-3.5 -march=native -std=c++11 -O3 main.cpp : 1.15287e+24 9613 ms
* clang++-3.5 -march=native -std=c++11 -ffast-math -O3 main.cpp : 1.15287e+24 8642 ms
*/
EDIT:
I did a quick check with g++-4.9.2 and clang++-3.5 with -O3 and -std=c++11 on a Fedora 21 VirtualBox VM on the same machine, and apparently those compilers don't have the same problem (the time is almost the same for both versions). However, gcc's version is about twice as fast as clang's (7.5 s vs 14 s).
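One variation that might be worth adding to the benchmark (a suggestion, not something measured above) is to hand std::inner_product raw pointers instead of std::array iterators; data() exposes the underlying contiguous buffer, so there is no iterator type left for the compiler to see through:

#include <array>
#include <numeric>

double dot(const std::array<double, 100000>& v1,
           const std::array<double, 100000>& v2)
{
    // Equivalent to the iterator version, but expressed directly on double*.
    return std::inner_product(v1.data(), v1.data() + v1.size(), v2.data(), 0.0);
}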

std::chrono multiplies durations

Consider this example of code:
#include <chrono>
#include <iostream>

int main()
{
    using namespace std::chrono;
    system_clock::time_point s = system_clock::now();
    for (int i = 0; i < 1000000; ++i)
        std::cout << duration_cast<duration<double>>(system_clock::now() - s).count() << "\n";
}
I expect this to print the elapsed time in seconds, but it actually prints the time in thousands of seconds (the expected result multiplied by 0.001). Am I doing something wrong?
Edit
Since seconds is equivalent to duration<some-int-type>, duration_cast<seconds> gives the same result.
I used gcc-4.7.3-r1
Your program works as expected using both gcc 4.8.2 and VS2013. I think it might be a compiler bug in your old gcc 4.7.
maverik guessed this right: the problem was a binary incompatibility inside std::chrono; the gcc-4.8 version of libstdc++ and gcc-4.7 did not agree on the internal units of time.
You could use duration_cast<seconds> instead of duration_cast<duration<double>>.
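For reference, that would look like this (a minimal sketch; note that duration_cast<seconds> truncates, so the fractional part is lost):

#include <chrono>
#include <iostream>

int main()
{
    using namespace std::chrono;
    system_clock::time_point s = system_clock::now();
    // ... work to be timed ...
    auto elapsed = duration_cast<seconds>(system_clock::now() - s);
    std::cout << elapsed.count() << " s\n";  // whole seconds only
}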