This is a rather theoretical question, but I'm quite interested in it and would be glad if someone with expert knowledge on it is willing to share.
I have a matrix of floats with 2000 rows and 600 cols and want to subtract the mean of the columns from each row. I have tested the following two lines and compared their runtime:
MatrixXf centered = data.rowwise() - (data.colwise().sum() / data.rows());
MatrixXf centered = data.rowwise() - data.colwise().mean();
I thought mean() would do nothing different from dividing the sum of each column by the number of rows, but while the execution of the first line takes 12.3 seconds on my computer, the second line finishes in 0.09 seconds.
I'm using Eigen version 3.2.6, which currently is the latest version, and my matrices are stored in row-major order.
Does someone know something about the internals of Eigen which could explain this huge performance difference?
Edit: I should add that data in the code above is actually of type Eigen::Map< Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> > and maps Eigen's functionality onto a raw buffer.
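For illustration, such a map might be set up like this (a sketch; the buffer name is hypothetical, and in the real code the memory is owned elsewhere):

#include <Eigen/Core>

using RowMajorMatrixXf = Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>;

int main()
{
    float* rawBuffer = new float[2000 * 600];                 // stands in for the external buffer
    Eigen::Map<RowMajorMatrixXf> data(rawBuffer, 2000, 600);  // wraps the buffer, no copy is made
    data.setRandom();
    delete[] rawBuffer;
    return 0;
}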
Edit 2: As suggested by GuyGreer, I'll provide some sample code to reproduce my findings:
#include <iostream>
#include <chrono>
#include <Eigen/Core>
using namespace std;
using namespace std::chrono;
using namespace Eigen;
int main(int argc, char* argv[])
{
    MatrixXf data(10000, 1000), centered;
    data.setRandom();
    auto start = high_resolution_clock::now();
    if (argc > 1)
        centered = data.rowwise() - data.colwise().mean();
    else
        centered = data.rowwise() - (data.colwise().sum() / data.rows());
    auto stop = high_resolution_clock::now();
    cout << duration_cast<milliseconds>(stop - start).count() << " ms" << endl;
    return 0;
}
Compile with:
g++ -O3 -std=c++11 -o test test.cc
Running the resulting program without arguments, so that it uses sum(), takes 126 seconds on my machine, while running test 1, which uses mean(), takes only 0.03 seconds!
Edit 3: As it turned out (see comments), it is not sum() which takes so long, but the division of the resulting vector by the number of rows. So the new question is: Why does Eigen take more than 2 minutes to divide a vector with 1000 columns by a single scalar?
Somehow, both the partial reduction (sum) and the division are recomputed every time, because some crucial information about the evaluation cost of the partial reduction is wrongly lost by operator/. Explicitly evaluating the mean fixes the issue:
centered = data.rowwise() - (data.colwise().sum() / data.rows()).eval();
Of course, this evaluation should be done by Eigen for you, as fixed by changeset 42ab43a. This fix will be part of the upcoming 3.2.7 and 3.3 releases.
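On 3.2.6, an equivalent workaround is to evaluate the mean into a plain vector first and broadcast from that (a small sketch with the same semantics as the one-liner above):

RowVectorXf mu = data.colwise().mean();   // forces one evaluation into a concrete vector
MatrixXf centered = data.rowwise() - mu;  // the broadcast now reuses mu instead of recomputing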
So I like to make my life hard. I've got a task to calculate the sum of
1 + 1/2 + 1/3 + 1/4 + ... + 1/n.
The condition is to not use iterations but a closed formula. In this post: https://math.stackexchange.com/questions/3367037/sum-of-1-1-2-1-3-1-n
I've found a pretty neat-looking solution: 1 + 1/2 + 1/3 + ⋯ + 1/n = γ + ψ(n+1),
where γ is Euler's constant and ψ is the digamma function.
For the digamma function I'm using the Boost C++ libraries, and I calculate Euler's constant using exp(1.0).
The problem is that I don't get the right answer. Here is my code:
#include <iostream>
#include <cmath>
#include <boost/math/special_functions/digamma.hpp>
int main() {
    int x;
    const double g = std::exp(1.0);
    std::cin >> x;
    std::cout << g + boost::math::digamma(x + 1);
    return 0;
}
Thanks in advance!
Euler is known for having a lot of things named for him.
That can easily become confusing, as seems to be the case here.
What you are adding to the digamma function result is Euler's number, e ≈ 2.71828. You are supposed to add Euler's constant (the Euler–Mascheroni constant), γ ≈ 0.57722, which is a different number named after Euler.
You can find the correct number in boost as boost::math::constants::euler, e.g.:
const double g = boost::math::constants::euler<double>();
(Thanks @Eljay)
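For completeness, a minimal corrected program might look like this (a sketch, assuming Boost is installed; γ comes from boost/math/constants):

#include <iostream>
#include <boost/math/constants/constants.hpp>
#include <boost/math/special_functions/digamma.hpp>

int main() {
    int n;
    std::cin >> n;
    const double g = boost::math::constants::euler<double>();  // γ ≈ 0.57722, not e ≈ 2.71828
    std::cout << g + boost::math::digamma(n + 1.0);            // harmonic number H_n
    return 0;
}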
For some context on just how much is named after Leonhard Euler and how confusing it can get, here is the Wikipedia page's section on just the numbers named after him, counting 11 different items: https://en.wikipedia.org/wiki/List_of_things_named_after_Leonhard_Euler#Numbers
Preamble
Some time ago I asked a question about the performance of Matlab vs Python (Performance: Matlab vs Python). I was surprised that Matlab is faster than Python, especially in meshgrid. In the discussion of that question, it was pointed out to me that I should use a wrapper in Python to call my C++ code, because the C++ code is also available to me. I have the same code in C++, Matlab and Python.
While doing that, I was surprised once again to find that Matlab is faster than C++ in matrix assembly and computation. I have a slightly larger code, from which I am investigating a segment of matrix-vector multiplication. The larger code performs such multiplications at multiple instances. Overall the C++ code is much faster than Matlab (because function calls in Matlab have an overhead, etc.), but Matlab seems to be outperforming C++ in the matrix-vector multiplication (code snippet at the bottom).
Results
The table below shows the comparison of the time (in seconds) it takes to assemble the kernel matrix and the time it takes to multiply the matrix with the vector. The results are compiled for a matrix of size NxN, where N varies from 10,000 to 40,000, which is not that large. The interesting thing is that Matlab outperforms C++ more the larger N gets: Matlab is 3.8 to 5.8 times faster in total time, and it is also faster in both matrix assembly and computation.
 ______________________________________________
| N=10,000      Assembly  Computation  Total   |
| MATLAB        0.3387    0.031        0.3697  |
| C++           1.15      0.24         1.4     |
| Times faster                         3.8     |
|______________________________________________|
| N=20,000      Assembly  Computation  Total   |
| MATLAB        1.089     0.0977       1.187   |
| C++           5.1       1.03         6.13    |
| Times faster                         5.2     |
|______________________________________________|
| N=40,000      Assembly  Computation  Total   |
| MATLAB        4.31      0.348        4.655   |
| C++           23.25     3.91         27.16   |
| Times faster                         5.8     |
 ----------------------------------------------
Question
Is there a faster way of doing this in C++? Am I missing something? I understand that C++ uses for loops, but my understanding is that Matlab must also be doing something similar inside meshgrid.
Code Snippets
Matlab Code:
%% GET INPUT DATA FROM DATA FILES ------------------------------------------- %
% Read data from input file
Data = load('Input/input.txt');
location = Data(:,1:2);
charges = Data(:,3:end);
N = length(location);
m = size(charges,2);
%% EXACT MATRIX VECTOR PRODUCT ---------------------------------------------- %
kex1=ex1;
tic
Q = kex1.kernel_2D(location , location);
fprintf('\n Assembly time: %f ', toc);
tic
potential_exact = Q * charges;
fprintf('\n Computation time: %f \n', toc);
Class (Using meshgrid):
classdef ex1
    methods
        function [kernel] = kernel_2D(obj, x, y)
            [i1, j1] = meshgrid(y(:,1), x(:,1));
            [i2, j2] = meshgrid(y(:,2), x(:,2));
            kernel = sqrt((i1 - j1).^2 + (i2 - j2).^2);
        end
    end
end
C++ Code:
EDIT: Compiled using a makefile with the following flags:
CC=g++
CFLAGS=-c -fopenmp -w -Wall -DNDEBUG -O3 -march=native -ffast-math -ffinite-math-only -I header/ -I /usr/include
LDFLAGS=-g -fopenmp
LIB_PATH=
SOURCESTEXT=src/read_Location_Charges.cpp
SOURCESF=examples/matvec.cpp
OBJECTSF=$(SOURCESF:.cpp=.o) $(SOURCESTEXT:.cpp=.o)
EXECUTABLEF=./exec/mykernel

mykernel: $(SOURCESF) $(SOURCESTEXT) $(EXECUTABLEF)

$(EXECUTABLEF): $(OBJECTSF)
	$(CC) $(LDFLAGS) $(KERNEL) $(INDEX) $(OBJECTSF) -o $@ $(LIB_PATH)

.cpp.o:
	$(CC) $(CFLAGS) $(KERNEL) $(INDEX) $< -o $@
#include "environment.hpp"
using namespace std;
using namespace Eigen;
class ex1
{
public:
    void kernel_2D(const unsigned long M, double*& x, const unsigned long N, double*& y, MatrixXd& kernel) {
        kernel = MatrixXd::Zero(M, N);
        for (unsigned long i = 0; i < M; ++i) {
            for (unsigned long j = 0; j < N; ++j) {
                double X = x[0*N + i] - y[0*N + j];
                double Y = x[1*N + i] - y[1*N + j];
                kernel(i, j) = sqrt(X*X + Y*Y);
            }
        }
    }
};
int main()
{
    /* Input ----------------------------------------------------------------------------- */
    unsigned long N = 40000; unsigned m = 1;
    double* charges; double* location;
    charges = new double[N * m](); location = new double[N * 2]();
    clock_t start; clock_t end;
    double exactAssemblyTime; double exactComputationTime;
    read_Location_Charges("input/test_input.txt", N, location, m, charges);
    MatrixXd charges_ = Map<MatrixXd>(charges, N, m);
    MatrixXd Q;
    ex1 Kex1;
    /* Process ------------------------------------------------------------------------ */
    // Matrix assembly
    start = clock();
    Kex1.kernel_2D(N, location, N, location, Q);
    end = clock();
    exactAssemblyTime = double(end - start) / double(CLOCKS_PER_SEC);
    // Computation
    start = clock();
    MatrixXd QH = Q * charges_;
    end = clock();
    exactComputationTime = double(end - start) / double(CLOCKS_PER_SEC);
    cout << endl << "Assembly time: " << exactAssemblyTime << endl;
    cout << endl << "Computation time: " << exactComputationTime << endl;
    // Clean up
    delete[] charges;
    delete[] location;
    return 0;
}
As said in the comments, Matlab relies on Intel's MKL library for matrix products, which is the fastest library for this kind of operation. Nonetheless, Eigen alone should be able to deliver similar performance. To this end, make sure to use the latest Eigen (e.g. 3.4), and proper compilation flags to enable AVX/FMA if available and multithreading:
-O3 -DNDEBUG -march=native
Since charges_ is a vector, better to use a VectorXd so that Eigen knows you want a matrix-vector product and not a matrix-matrix one.
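For instance, the corresponding line in main could become (a sketch, assuming m == 1 as in the posted run):

VectorXd charges_ = Map<VectorXd>(charges, N);  // a true vector, so Q * charges_ dispatches to a matrix-vector product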
If you have Intel's MKL, then you can also let Eigen use it to get exactly the same performance as Matlab for this precise operation.
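Enabling that amounts to one macro before including Eigen, plus linking against MKL (a sketch; the exact link line, e.g. -lmkl_rt, depends on your MKL installation):

#define EIGEN_USE_MKL_ALL   // must appear before any Eigen header
#include <Eigen/Dense>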
Regarding the assembly, better to swap the two loops to enable vectorization, then enable multithreading with OpenMP (add -fopenmp to the compiler flags) to make the outermost loop run in parallel, and finally you can simplify your code using Eigen:
void kernel_2D(const unsigned long M, double* x, const unsigned long N, double* y, MatrixXd& kernel) {
    kernel.resize(M, N);
    auto x0 = ArrayXd::Map(x, M);
    auto x1 = ArrayXd::Map(x + M, M);
    auto y0 = ArrayXd::Map(y, N);
    auto y1 = ArrayXd::Map(y + N, N);
    #pragma omp parallel for
    for (unsigned long j = 0; j < N; ++j)
        kernel.col(j) = sqrt((x0 - y0(j)).abs2() + (x1 - y1(j)).abs2());
}
With multithreading you need to measure the wall-clock time. Here (Haswell with 4 physical cores running at 2.6GHz) the assembly time drops to 0.36s for N=20,000, and the matrix-vector product takes 0.24s, so 0.6s in total, which is faster than Matlab even though my CPU seems to be slower than yours.
You might be interested in the MATLAB Central contribution mtimesx.
Mtimesx is a MEX function that optimizes matrix multiplications using the BLAS library, OpenMP and other methods. In my experience, when it was originally posted it could beat stock MATLAB by 3 orders of magnitude in some cases. (Somewhat embarrassing for MathWorks, I presume.) These days MATLAB has improved its own methods (I suspect borrowing from this) and the differences are less severe; MATLAB sometimes outperforms it.
I am working on speeding up software from my dissertation by utilizing Rcpp and RcppEigen. I have been very impressed with Rcpp and RcppEigen, as the speed of my software has increased by upwards of 100 times. This is quite exciting to me because my R code had been parallelized using snow/doSNOW and the foreach package, so the actual speed gain is probably somewhere around 400x. However, the last time I attempted to run my program in its entirety to assess overall speed gains after translating some gradient/Hessian calculations into C++, I saw that the new Hessian matrix calculated using my C++ code differs from the old, much slower version which was calculated strictly in R. I had been very careful to check my results line by line, slowly increasing the complexity of my calculations while making sure the results were identical in R and C++. I realize now that I was only checking the first 11 or so digits.
The code for optimization has been very robust in R, but was dreadfully slow. All of the calculations in C++ have been checked and were virtually identical to previous versions in R (this was checked to 11 digits via specifying options(digits = 11) at the beginning of each session). However, deviations in long vectors or matrices representing particular quantities begin at 15 or so digits past the decimal point in some cells/elements. These differences become problematic when using matrix multiplication and summing over risk sets, as a small difference can lead to a large error (is it an error?) in the overall precision of the final estimate.
After looking back over my code and finding the first point of deviation in results between R and C++, I observed that this first occurs after taking the exponential of a matrix or vector in my Rcpp code. This led me to work out the examples below, which I hope illustrate the issue I am seeing. Has anyone observed this before, and is there a way to utilize the R exponential function within C++, or to change the routine used within C++?
## A small example to illustrate issues with Rcppsugar exponentiate function
library(RcppEigen)
library(inline)
RcppsugarexpC <-
"
using Eigen::MatrixXd;
typedef Eigen::Map<Eigen::MatrixXd> MapMatd;
MapMatd A(as<MapMatd>(AA));
MatrixXd B = exp(A.array());
return wrap(B);
"
RcppexpC <-
"
using Eigen::MatrixXd;
using Eigen::VectorXd;
typedef Eigen::Map<Eigen::MatrixXd> MapMatd;
MapMatd A(as<MapMatd>(AA));
MatrixXd B = A.array().exp().matrix();
return wrap(B);
"
Rcppsugarexp <- cxxfunction(signature(AA = "NumericMatrix"), RcppsugarexpC, plugin = "RcppEigen")
Rcppexp <- cxxfunction(signature(AA = "NumericMatrix"), RcppexpC, plugin = "RcppEigen")
mat <- matrix(seq(-5.25, 10.25, by = 1), ncol = 4, nrow = 4)
RcppsugarC <- Rcppsugarexp(mat)
RcppexpC <- Rcppexp(mat)
exp <- exp(mat)
I then tested whether these exponentiated matrices were actually equal beyond the print precision (the default is 7 digits) that R uses via:
exp == RcppexpC ## inequalities in 3 cells
exp == RcppsugarC ## inequalities in 3 cells
RcppsugarC == RcppexpC ## these are equal!
sprintf("%.22f", exp)
Please forgive me if this is a dense question - my computer science skills are not as strong as they should be, but I am eager to learn how to do better. I appreciate any and all help or advice that can be given to me. Special thanks to the creators of Rcpp, and all of the wonderful moderators/contributors at this site - your previous answers have saved me from posting questions on here well over a hundred times!
Edit:
It turns out that I didn't know what I was doing. I wanted to apply the Rcpp sugar function to the MatrixXd or VectorXd, which I was attempting by using the .array() method; however, calling exp(A.array()) or A.exp() computes what is referred to as the matrix exponential, rather than computing exp(A_ij) element by element. My friend pointed this out to me when he worked out a simple example using std::exp() on each element in a nested for loop and found that his result was identical to what was reported in R. I thus needed to use the .unaryExpr functionality of Eigen, which meant changing the compiler settings to -std=c++0x (for lambda support). I was able to do this by specifying the following in R:
settings$env$PKG_CXXFLAGS='-std=c++0x'
I then made a file called Rcpptesting.cpp which is below:
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]

using Eigen::Map;       // 'maps' rather than copies
using Eigen::MatrixXd;  // variable size matrix, double precision
using Eigen::VectorXd;  // variable size vector, double precision

// [[Rcpp::export]]
MatrixXd expCorrect(Map<MatrixXd> M) {
    MatrixXd M2 = M.unaryExpr([](double e) { return std::exp(e); });
    return M2;
}
After this, I was able to call the function with sourceCpp() in R as follows. (Note that I used the options verbose = TRUE and rebuild = TRUE because this prints information about the build settings; I was trying to make sure that -std=c++0x was actually being used.)
sourceCpp("~/testingRcpp.cpp", verbose = TRUE, rebuild = TRUE)
Then the following R code worked like a charm:
mat <- matrix(seq(-5.25, 10.25, by = 1), ncol = 4, nrow = 4)
exp(mat) == expCorrect(mat)
Pretty cool!
When I ran the code from this page (high_precision_timer), I learned that my system only supports microsecond precision.
As per the document,
cout << chrono::high_resolution_clock::period::den << endl;
Note that there isn't a guarantee how many ticks per second the clock
has, only that it's the highest available. Hence, the first thing we
do is to get the precision, by printing how many times a second
the clock ticks. My system provides 1000000 ticks per second, which is
microsecond precision.
I am also getting exactly the same value, 1000000 ticks per second. That means my system also supports microsecond precision.
Every time I run any program, I always get a value of xyz microseconds and xyz000 nanoseconds. I think the lack of nanosecond support on my system, as described above, may be the reason.
Is there any way to make my system support nanosecond precision?
This is not an answer, but it is too long to post as a comment.
I just tested your example.
My system's output was:
chrono::high_resolution_clock::period::den = 1000000000.
My system provides 1000000000 ticks per second, which is nanosecond precision.
Not 1000000 (microseconds).
Your system provides 1000000 ticks per second, which is microsecond precision.
So, I don't know how to help you. Sorry.
#include <iostream>
#include <chrono>
using namespace std;

int main()
{
    cout << chrono::high_resolution_clock::period::den << endl;
    auto start_time = chrono::high_resolution_clock::now();
    int temp = 0;  // initialized to avoid undefined behaviour; the loop is just busy work
    for (int i = 0; i < 242000000; i++)
        temp += temp;
    auto end_time = chrono::high_resolution_clock::now();
    cout << "sec = " << chrono::duration_cast<chrono::seconds>(end_time - start_time).count() << ":" << std::endl;
    cout << "micro = " << chrono::duration_cast<chrono::microseconds>(end_time - start_time).count() << ":" << std::endl;
    cout << "nano = " << chrono::duration_cast<chrono::nanoseconds>(end_time - start_time).count() << ":" << std::endl;
    return 0;
}
Consider this: most processors today operate at a frequency of about 1 to 3 GHz, i.e. say 2 * 10^9 Hz,
which means 1 tick every 0.5 nanoseconds at the processor level. So I would guess your chances are very, very slim.
Edit:
Though the documentation is still sparse on this, I remember reading that it accesses the RTC of the CPU (not sure), whose frequency is fixed.
Also, as a piece of advice, I think measuring performance in nanoseconds has little advantage compared to measuring in microseconds (unless it's for medical use ;) ).
Also take a look at this question and its answer; I think it may make more sense:
HPET's frequency vs CPU frequency for measuring time
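For reference, a quick way to inspect the advertised tick period is to print the full num/den ratio rather than den alone (a small sketch):

#include <iostream>
#include <chrono>

int main()
{
    using period = std::chrono::high_resolution_clock::period;
    // e.g. prints 1/1000000000 on a system whose high_resolution_clock counts nanoseconds
    std::cout << period::num << "/" << period::den << " seconds per tick" << std::endl;
    return 0;
}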
Is there a better way of doing this?
http://projecteuler.net/problem=8
I added a condition to check if each digit is > 6 (this eliminates small products and 0s):
#include <iostream>
#include <math.h>
#include "bada.h"
using namespace std;
int main()
{
    int badanum[]{ DATA };
    int pro = 0, highest = 0;
    for (int i = 0; i <= 995; ++i)  // 995 is the last valid start index of a 5-digit window
    {
        if (badanum[i] > 6 and badanum[i+1] > 6 and badanum[i+2] > 6 and badanum[i+3] > 6 and badanum[i+4] > 6)
        {
            pro = badanum[i] * badanum[i+1] * badanum[i+2] * badanum[i+3] * badanum[i+4];
            if (pro > highest)
            {
                cout << pro << " " << badanum[i] << badanum[i+1] << badanum[i+2] << badanum[i+3] << badanum[i+4] << endl;
                highest = pro;
            }
            pro = 0;
        }
    }
}
bada.h is just a file containing the 1000-digit number:
#define DATA <1000 digit number>
That if actually slows things down: it causes branching in the CPU's parallel execution pipeline.
Also, as mentioned before, it will invalidate the result.
It does not matter that your solution happens to match the correct one (for other digits it might not).
On the algorithmic side, if you have fast enough division you can lower the number of computations:
char a[]="7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450\0";
int i = 0, s = 0, m = 1, q;
for (i = 0; i < 4; i++)
{
    q = a[i] - '0'; if (q) m *= q;
}
for (i = 0; i < 996; i++)
{
    q = a[i+4] - '0'; if (q) m *= q;
    if (s < m) s = m;
    q = a[i] - '0'; if (q) m /= q;
}
You can also build lookup tables for the mul/div operations for speed (though that is not faster in all cases):

int mul_5digits[9*9*9*9*9+1][10] = { 0*0, 0*1, 0*2, ... , 9*9*9*9*9*9 };
int div_5digits[9*9*9*9*9+1][10] = { 0/0, 0/1, 0/2, ... , 9*9*9*9*9/9 };
// so a = b*c; is rewritten as a = mul_5digits[b][c];
// so a = b/c; is rewritten as a = div_5digits[b][c];

Of course, instead of the entries 0*c you have to store the neutral value 1, and instead of the entries i/0 you have to store the neutral value i!
int i = 0, s = 0, t = 1;
for (i = 0; i < 4; i++)
{
    t = mul_5digits[t][a[i] - '0'];
}
for (i = 0; i < 996; i++)
{
    t = mul_5digits[t][a[i+4] - '0'];
    if (s < t) s = t;
    t = div_5digits[t][a[i] - '0'];
}
Run-time measurements on AMD 3.2GHz, 64-bit Win7, 32-bit app, BDS2006 C++:
0.022 ms classic approach
0.013 ms single mul/div per step (produces false output if no product > 0 is present)
0.054 ms tabled single mul/div per step (slower on my setup)
PS.
All code improvements should be measured, so you can see whether you actually sped things up or not,
because what is faster for one compiler/platform/computer can be slower for another.
Use at least 0.1 ms resolution.
I prefer to use RDTSC or PerformanceCounter for that.
Apart from the errors pointed out in the comments, that many multiplications aren't necessary. If you start with the product of [0] * [1] * [2] * [3] * [4] for index 0, what would be the product starting at [1]? The old result divided by [0] and multiplied by [5]. One division and one multiplication can be faster than four multiplications.
You don't need to store all the digits at once. Just keep the current five of them (use an array with cyclic overwriting), one variable to store the current problem result and one to store the latest multiplication result (see below). If the number of digits in the input grows, you won't run into trouble with memory.
You should also check whether the oldest read digit equals zero. If it does, then you will really have to multiply all five current digits; if not, a better way is to divide the previous multiplication result by the oldest digit and multiply it by the latest read digit. A minimal sketch of that scheme follows.
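A sketch of the idea (the function name, the long long result type and the test digits are my own choices for illustration; the test digits are a prefix of the 1000-digit number above):

#include <iostream>

long long maxWindowProduct(const char* digits, int n)  // returns 0 for fewer than 5 digits
{
    const int W = 5;
    int win[W];                     // cyclic buffer holding the current five digits
    long long prod = 1, best = 0;   // latest multiplication result and current problem result
    for (int i = 0; i < n; ++i)
    {
        int d = digits[i] - '0';
        if (i < W) prod *= d;       // still filling the first window
        else
        {
            int oldest = win[i % W];
            if (oldest == 0)        // a zero leaves the window: remultiply the four kept digits
            {
                prod = d;
                for (int k = 1; k < W; ++k) prod *= win[(i - k) % W];
            }
            else prod = prod / oldest * d;  // divide out the oldest, multiply in the newest
        }
        win[i % W] = d;             // overwrite the slot the oldest digit occupied
        if (i >= W - 1 && prod > best) best = prod;
    }
    return best;
}

int main()
{
    const char digits[] = "73167176531330624919225119674426574742355349194934";
    std::cout << maxWindowProduct(digits, sizeof(digits) - 1) << std::endl;
    return 0;
}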