Preamble
Some time ago I asked a question about the performance of Matlab vs Python (Performance: Matlab vs Python). I was surprised that Matlab is faster than Python, especially in meshgrid. In the discussion of that question it was pointed out to me that I should use a wrapper in Python to call my C++ code, because the same code is also available to me in C++. I have the same code in C++, Matlab and Python.
While doing that, I was surprised once again to find that Matlab is faster than C++ in matrix assembly and computation. I have a slightly larger code, from which I am investigating a segment of matrix-vector multiplication; the larger code performs such multiplications in multiple places. Overall the C++ code is much, much faster than Matlab (because function calls in Matlab have an overhead, etc.), but Matlab seems to outperform C++ in the matrix-vector multiplication itself (code snippets at the bottom).
Results
The table below compares the time it takes to assemble the kernel matrix and the time it takes to multiply that matrix with the vector (all times in seconds). The results are compiled for a matrix of size NxN, where N varies from 10,000 to 40,000, which is not that large. The interesting thing is that Matlab outperforms C++ by a wider margin the larger N gets: it is 3.8 to 5.8 times faster in total time, and it is also faster in both matrix assembly and computation.
N = 10,000      Assembly    Computation    Total
MATLAB          0.3387      0.031          0.3697
C++             1.15        0.24           1.4
Times faster                               3.8

N = 20,000      Assembly    Computation    Total
MATLAB          1.089       0.0977         1.187
C++             5.1         1.03           6.13
Times faster                               5.2

N = 40,000      Assembly    Computation    Total
MATLAB          4.31        0.348          4.655
C++             23.25       3.91           27.16
Times faster                               5.8
Question
Is there a faster way of doing this in C++? Am I missing something? I understand that C++ uses for loops, but my understanding is that Matlab is also doing something similar inside meshgrid.
Code Snippets
Matlab Code:
%% GET INPUT DATA FROM DATA FILES ------------------------------------------- %
% Read data from input file
Data = load('Input/input.txt');
location = Data(:,1:2);
charges = Data(:,3:end);
N = length(location);
m = size(charges,2);
%% EXACT MATRIX VECTOR PRODUCT ---------------------------------------------- %
kex1=ex1;
tic
Q = kex1.kernel_2D(location , location);
fprintf('\n Assembly time: %f ', toc);
tic
potential_exact = Q * charges;
fprintf('\n Computation time: %f \n', toc);
Class (Using meshgrid):
classdef ex1
    methods
        function [kernel] = kernel_2D(obj, x, y)
            [i1, j1] = meshgrid(y(:,1), x(:,1));
            [i2, j2] = meshgrid(y(:,2), x(:,2));
            kernel   = sqrt((i1 - j1).^2 + (i2 - j2).^2);
        end
    end
end
C++ Code:
EDIT
Compiled using a makefile with the following flags:
CC=g++
CFLAGS=-c -fopenmp -w -Wall -DNDEBUG -O3 -march=native -ffast-math -ffinite-math-only -I header/ -I /usr/include
LDFLAGS= -g -fopenmp
LIB_PATH=
SOURCESTEXT= src/read_Location_Charges.cpp
SOURCESF=examples/matvec.cpp
OBJECTSF= $(SOURCESF:.cpp=.o) $(SOURCESTEXT:.cpp=.o)
EXECUTABLEF=./exec/mykernel
mykernel: $(SOURCESF) $(SOURCESTEXT) $(EXECUTABLEF)

$(EXECUTABLEF): $(OBJECTSF)
	$(CC) $(LDFLAGS) $(KERNEL) $(INDEX) $(OBJECTSF) -o $@ $(LIB_PATH)

.cpp.o:
	$(CC) $(CFLAGS) $(KERNEL) $(INDEX) $< -o $@
#include "environment.hpp"

using namespace std;
using namespace Eigen;

class ex1
{
public:
    void kernel_2D(const unsigned long M, double*& x, const unsigned long N, double*& y, MatrixXd& kernel) {
        kernel = MatrixXd::Zero(M, N);
        // x holds M x-coordinates followed by M y-coordinates (likewise y, with N of each)
        for (unsigned long i = 0; i < M; ++i) {
            for (unsigned long j = 0; j < N; ++j) {
                double X = x[i]     - y[j];
                double Y = x[M + i] - y[N + j];
                kernel(i, j) = sqrt((X * X) + (Y * Y));
            }
        }
    }
};
int main()
{
    /* Input ----------------------------------------------------------------- */
    unsigned long N = 40000; unsigned m = 1;
    double* charges; double* location;
    charges  = new double[N * m]();
    location = new double[N * 2]();
    clock_t start; clock_t end;
    double exactAssemblyTime; double exactComputationTime;

    read_Location_Charges("input/test_input.txt", N, location, m, charges);
    MatrixXd charges_ = Map<MatrixXd>(charges, N, m);
    MatrixXd Q;
    ex1 Kex1;

    /* Process --------------------------------------------------------------- */
    // Matrix assembly
    start = clock();
    Kex1.kernel_2D(N, location, N, location, Q);
    end = clock();
    exactAssemblyTime = double(end - start) / double(CLOCKS_PER_SEC);

    // Computation
    start = clock();
    MatrixXd QH = Q * charges_;
    end = clock();
    exactComputationTime = double(end - start) / double(CLOCKS_PER_SEC);

    cout << endl << "Assembly time: " << exactAssemblyTime << endl;
    cout << endl << "Computation time: " << exactComputationTime << endl;

    // Clean up
    delete[] charges;
    delete[] location;
    return 0;
}
As said in the comments, MatLab relies on Intel's MKL library for matrix products, which is the fastest library for this kind of operation. Nonetheless, Eigen alone should be able to deliver similar performance. To this end, make sure to use the latest Eigen (e.g. 3.4) and proper compilation flags to enable AVX/FMA if available, plus multithreading:
-O3 -DNDEBUG -march=native
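For instance, an illustrative full invocation (the Eigen include path is an assumption on my part, adjust to your setup; the source and output names are those from the makefile above, and -fopenmp is the flag used for the multithreaded assembly discussed below):

g++ -O3 -DNDEBUG -march=native -fopenmp -I /usr/include/eigen3 examples/matvec.cpp src/read_Location_Charges.cpp -o exec/mykernel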
Since charges_ is a vector, better to use a VectorXd so that Eigen knows you want a matrix-vector product and not a matrix-matrix one.
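A minimal sketch of that change (assuming m == 1, as in the main() above):

VectorXd charges_ = Map<VectorXd>(charges, N);   // a true column vector instead of an N x 1 matrix
VectorXd QH = Q * charges_;                      // Eigen now dispatches a matrix-vector (GEMV) kernel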
If you have Intel's MKL, then you can also let Eigen use it to get exactly the same performance as MatLab for this precise operation.
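A minimal sketch of how that is enabled (assumption: MKL is installed and the link line is set up for it; the macro must appear before any Eigen header):

#define EIGEN_USE_MKL_ALL   // route Eigen's supported dense products through MKL
#include <Eigen/Dense>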
Regarding the assembly, better to swap the two loops to enable vectorization, then enable multithreading with OpenMP (add -fopenmp to the compiler flags) to make the outermost loop run in parallel. Finally, you can simplify your code using Eigen:
void kernel_2D(const unsigned long M, double* x, const unsigned long N, double* y, MatrixXd& kernel) {
    kernel.resize(M, N);
    auto x0 = ArrayXd::Map(x,     M);
    auto x1 = ArrayXd::Map(x + M, M);
    auto y0 = ArrayXd::Map(y,     N);
    auto y1 = ArrayXd::Map(y + N, N);
    #pragma omp parallel for
    for (unsigned long j = 0; j < N; ++j)
        kernel.col(j) = sqrt((x0 - y0(j)).abs2() + (x1 - y1(j)).abs2());
}
With multi-threading you need to measure the wall-clock time. Here (Haswell with 4 physical cores running at 2.6 GHz) the assembly time drops to 0.36 s for N=20000 and the matrix-vector product takes 0.24 s, so 0.6 s in total, which is faster than MatLab, even though my CPU seems to be slower than yours.
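As a minimal sketch of what that means for the main() above (an assumption on my part: using OpenMP's omp_get_wtime() in place of clock(), since clock() sums CPU time across all threads):

#include <omp.h>   // provides omp_get_wtime()
// ... replacing the clock() calls around the assembly:
double t0 = omp_get_wtime();
Kex1.kernel_2D(N, location, N, location, Q);
double t1 = omp_get_wtime();
cout << "Assembly wall time: " << (t1 - t0) << " s" << endl;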
You might be interested to look at the MATLAB Central contribution mtimesx.
Mtimesx is a mex function that optimizes matrix multiplications using the BLAS library, OpenMP and other methods. In my experience, when it was originally posted it could beat stock MATLAB by 3 orders of magnitude in some cases. (Somewhat embarrassing for MathWorks, I presume.) These days MATLAB has improved its own methods (I suspect borrowing from this) and the differences are less severe; MATLAB sometimes outperforms it.
Related
I have a simple piece of C++ code that applies the standard sin function across a vector of values.
#include <algorithm>
#include <cmath>
#include <functional>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>
using namespace std;

static void BM_sin() {
    int data_size = 100000000;
    double lower_bound = 0;
    double upper_bound = 1;
    random_device device;
    mt19937 engine(device());
    uniform_real_distribution<double> distribution(lower_bound, upper_bound);
    auto generator = bind(distribution, engine);
    vector<double> data(data_size);
    generate(begin(data), end(data), generator);

    #pragma omp parallel for
    for (int i = 0; i < data_size; ++i) {
        data[i] = sin(data[i]);
    }
    cout << accumulate(data.begin(), data.end(), 0) << endl;
}
I get the same time when I run this function with export OMP_NUM_THREADS set to 1 and to 8, on a machine with 8 cores. Commenting out the #pragma omp parallel for line does not help either. So I wonder why applying sine to a vector from all threads is as fast as applying it from one thread.
(I compile with -Ofast -fopenmp on gcc-4.8)
Simple answer is simple:
Not all things scale well. I don't know fast_sin, but it's possible it's mainly memory-bandwidth limited. In that case, you'll win nothing by distributing the workload across cores.
Also, I doubt your measuring method. If your generator is the mt19937, it's a lot more complex than your sine, so parallelizing your sine doesn't do much, because most of the time is spent generating random numbers.
You are measuring something incorrectly. The generator loop is slow, but not so slow that it completely overshadows the sine loop. Here are the results of measuring the execution speed of several code parts on two different Intel architectures:
Code part | WM (x64) | WM (x86) | SB (x64) | SB (x86)
-----------------------+----------+----------+----------+----------
generate() | 1,45 s | 2,44 s | 1,28 s | 2,18 s
sine loop (serial) | 2,17 s | 2,88 s | 1,80 s | 2,91 s
sine loop (6 threads) | 0,37 s | 0,51 s | 0,31 s | 0,52 s
accumulate() | 0,31 s | 0,70 s | 0,33 s | 0,67 s
-----------------------+----------+----------+----------+----------
speed-up: overall | 1,85x | 1,65x | 1,78x | 1,71x
speed-up: sine loop | 5,86x | 5,65x | 5,81x | 5,60x
speed-up: Amdahl | 2,23x | 1,92x | 2,12x | 2,02x
In the above table, WM stands for Intel X5675, a Westmere CPU, while SB stands for Intel E5-2650, a Sandy Bridge CPU. x64 stands for 64-bit mode and x86 - for 32-bit mode. GCC 4.8.5 was used with -Ofast -fopenmp -mtune=native (-m32 for 32-bit mode). Both systems are running CentOS 7.2. The execution times are only approximate, as I haven't done proper timing by taking the average of multiple executions. Timing was done using the portable omp_get_wtime() timer routine.
As you can see, the overall speed-up with 6 threads ranges from 1,65x to 1,85x, while the speed-up for the sine loop alone ranges from 5,60x to 5,86x. Both the generator loop and the accumulator loop are performed in serial, which caps the parallel speed-up (see Amdahl's law).
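For example, reading the WM (x64) column: the serial parts, generate() and accumulate(), take 1,45 s + 0,31 s = 1,76 s, and the parallelisable sine loop takes 2,17 s, so even with an unlimited number of threads the runtime cannot drop below 1,76 s. That bounds the overall speed-up by (1,45 + 2,17 + 0,31) / (1,45 + 0,31) ≈ 2,23x, which is the value in the "speed-up: Amdahl" row.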
Two things to note here. The first is that the generator loop could be a tad faster if the memory for the vector is pre-faulted. That basically means sweeping over the vector and touching every memory page that backs it; running the generator loop twice and only timing the second invocation will also do the trick. On my systems this brings no noticeable advantage (the savings are on the same order as the measurement error), most likely because CentOS's kernel has transparent huge pages turned on by default.
The second thing is that the last parameter to accumulate() is an integer 0, therefore the algorithm is forced to perform an integer conversion on every step, which slows it down considerably and also gives the wrong result at the end (0). accumulate(data.begin(), data.end(), 0.0) executes ten times faster and produces the correct result.
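A minimal sketch of both remedies (names are from the question's code; the extra warm-up pass is the pre-faulting trick mentioned above and is left out of the timing):

generate(begin(data), end(data), generator);   // warm-up pass: touches every page backing the vector
generate(begin(data), end(data), generator);   // the pass that should actually be timed
// sine loop unchanged ...
cout << accumulate(data.begin(), data.end(), 0.0) << endl;   // 0.0 keeps the sum in double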
This is a rather theoretical question, but I'm quite interested in it and would be glad if someone has some expert knowledge on this which he or she is willing to share.
I have a matrix of floats with 2000 rows and 600 cols and want to subtract the mean of the columns from each row. I have tested the following two lines and compared their runtime:
MatrixXf centered = data.rowwise() - (data.colwise().sum() / data.cols());
MatrixXf centered = data.rowwise() - data.colwise().mean();
I thought mean() would not do anything different from dividing the sum of each column by the number of rows, but while the execution of the first line takes 12.3 seconds on my computer, the second line finishes in 0.09 seconds.
I'm using Eigen version 3.2.6, which currently is the latest version, and my matrices are stored in row-major order.
Does someone know something about the internals of Eigen which could explain this huge performance difference?
Edit: I should add that data in the code above is actually of type Eigen::Map< Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor> > and maps Eigen's functionality onto a raw buffer.
Edit 2: As suggested by GuyGreer, I'll provide some sample code to reproduce my findings:
#include <iostream>
#include <chrono>
#include <Eigen/Core>

using namespace std;
using namespace std::chrono;
using namespace Eigen;

int main(int argc, char * argv[])
{
    MatrixXf data(10000, 1000), centered;
    data.setRandom();
    auto start = high_resolution_clock::now();
    if (argc > 1)
        centered = data.rowwise() - data.colwise().mean();
    else
        centered = data.rowwise() - (data.colwise().sum() / data.rows());
    auto stop = high_resolution_clock::now();
    cout << duration_cast<milliseconds>(stop - start).count() << " ms" << endl;
    return 0;
}
Compile with:
g++ -O3 -std=c++11 -o test test.cc
Running the resulting program without arguments, so that it uses sum(), takes 126 seconds on my machine, while running "test 1", which uses mean(), takes only 0.03 seconds!
Edit 3: As it turned out (see comments), it is not sum() which takes so long, but the division of the resulting vector by the number of rows. So the new question is: Why does Eigen take more than 2 minutes to divide a vector with 1000 columns by a single scalar?
Somehow, both the partial reduction (sum) and the division are recomputed every time, because some crucial information about the evaluation cost of the partial reduction is wrongly lost by operator/. Explicitly evaluating the mean fixes the issue:
centered = data.rowwise() - (data.colwise().sum() / data.cols()).eval();
Of course, this evaluation should be done by Eigen for you, as fixed in changeset 42ab43a. This fix will be part of the next 3.2.7 and 3.3 releases.
I am working on speeding up software from my dissertation by utilizing Rcpp and RcppEigen. I have been very impressed with both, as the speed of my software has increased more than 100-fold. This is quite exciting to me because my R code had been parallelized using snow/doSNOW and the foreach package, so the actual speed gain is probably somewhere around 400x. However, the last time I attempted to run my program in its entirety to assess overall speed gains, after translating some gradient/Hessian calculations into C++, I saw that the new Hessian matrix calculated with my C++ code differs from the old, much slower version calculated strictly in R. I had been very careful to check my results line by line, slowly increasing the complexity of my calculations while assuring the results were identical in R and C++. I realize now that I was only checking the first 11 or so digits.
The code for the optimization has been very robust in R, but was dreadfully slow. All of the calculations in C++ have been checked and were virtually identical to the previous R versions (this was checked to 11 digits via specifying options(digits = 11) at the beginning of each session). However, deviations in long vectors or matrices representing particular quantities begin at 15 or so digits past the decimal point in some cells/elements. These differences become problematic when using matrix multiplication and summing over risk sets, as a small difference can lead to a large error (is it an error?) in the overall precision of the final estimate.
After looking back over my code and finding the first point of deviation between the R and C++ results, I observed that it first occurs after taking the exponential of a matrix or vector in my Rcpp code. This led me to work out the examples below, which I hope illustrate the issue I am seeing. Has anyone observed this before, and is there a way to utilize the R exponential function within C++, or to change the routine used within C++?
## A small example to illustrate issues with Rcppsugar exponentiate function
library(RcppEigen)
library(inline)
RcppsugarexpC <-
"
using Eigen::MatrixXd;
typedef Eigen::Map<Eigen::MatrixXd> MapMatd;
MapMatd A(as<MapMatd>(AA));
MatrixXd B = exp(A.array());
return wrap(B);
"
RcppexpC <-
"
using Eigen::MatrixXd;
using Eigen::VectorXd;
typedef Eigen::Map<Eigen::MatrixXd> MapMatd;
MapMatd A(as<MapMatd>(AA));
MatrixXd B = A.array().exp().matrix();
return wrap(B);
"
Rcppsugarexp <- cxxfunction(signature(AA = "NumericMatrix"), RcppsugarexpC, plugin = "RcppEigen")
Rcppexp <- cxxfunction(signature(AA = "NumericMatrix"), RcppexpC, plugin = "RcppEigen")
mat <- matrix(seq(-5.25, 10.25, by = 1), ncol = 4, nrow = 4)
RcppsugarC <- Rcppsugarexp(mat)
RcppexpC <- Rcppexp(mat)
exp <- exp(mat)
I then tested whether these exponentiated matrices were actually equal beyond the print precision (default is 7 digits) that R uses:
exp == RcppexpC ## inequalities in 3 cells
exp == RcppsugarC ## inequalities in 3 cells
RcppsugarC == RcppexpC ## these are equal!
sprintf("%.22f", exp)
Please forgive me if this is a dense question - my computer science skills are not as strong as they should be, but I am eager to learn how to do better. I appreciate any and all help or advice that can be given me. Special thanks to the creators of Rcpp, and all of the wonderful moderators/contributors at this site - your previous answers have saved me from posting questions on here well over a hundred times!
Edit:
It turns out that I didn't know what I was doing. I wanted to apply the exponential element-wise to the MatrixXd or VectorXd, which I was attempting by using the .array() method; however, calling exp(A.array()) or A.exp() computes what is referred to as the matrix exponential, rather than computing exp(A_ij) element by element. My friend pointed this out to me when he worked out a simple example using std::exp() on each element in a nested for loop and found that this result was identical to what was reported in R. I thus needed to use the .unaryExpr functionality of Eigen, which meant changing the compiler settings to -std=c++0x. I was able to do this by specifying the following in R:
settings$env$PKG_CXXFLAGS='-std=c++0x'
I then made a file called Rcpptesting.cpp which is below:
#include <RcppEigen.h>
// [[Rcpp::depends(RcppEigen)]]
using Eigen::Map; // 'maps' rather than copies
using Eigen::MatrixXd; // variable size matrix, double precision
using Eigen::VectorXd; // variable size vector, double precision
// [[Rcpp::export]]
MatrixXd expCorrect(Map<MatrixXd> M) {
    MatrixXd M2 = M.unaryExpr([](double e){ return std::exp(e); });
    return M2;
}
After this, I was able to call the function with sourceCpp() in R as follows (note that I used the options verbose = TRUE and rebuild = TRUE because this seems to print information about the settings being used; I was trying to make sure that -std=c++0x was actually applied):
sourceCpp("~/testingRcpp.cpp", verbose = TRUE, rebuild = TRUE)
Then the following R code worked like a charm:
mat <- matrix(seq(-5.25, 10.25, by = 1), ncol = 4, nrow = 4)
exp(mat) == expCorrect(mat)
Pretty cool!
I'm learning Haskell. My interest is to use it for personal computer experimentation. Right now, I'm trying to see how fast Haskell can get. Many claim parity with C(++), and if that is true, I would be very happy (I should note that I will be using Haskell whether or not it's fast, but fast is still a good thing).
My test program implements π(x) with a very simple algorithm: prime numbers add 1 to the result, and a number x is prime if it has no integer divisors between 1 and √x. This is not an algorithm battle; it is purely about compiler performance.
Haskell seems to be about 6x slower on my computer, which is fine (still 100x faster than pure Python), but that could just be because I'm a Haskell newbie.
Now, my question: How, without changing the algorithm, can I optimize the Haskell implementation? Is Haskell really on performance parity with C?
Here is my Haskell code:
import System.Environment
-- a simple integer square root
isqrt :: Int -> Int
isqrt = floor . sqrt . fromIntegral
-- primality test
prime :: Int -> Bool
prime x = null [x | q <- [3, 5..isqrt x], rem x q == 0]
main = do
    n <- fmap (read . head) getArgs
    print $ length $ filter prime (2:[3, 5..n])
Here is my C++ code:
#include <iostream>
#include <cmath>
#include <cstdlib>
using namespace std;

bool isPrime(int);

int main(int argc, char* argv[]) {
    int primes = 10000, count = 0;
    if (argc > 1) {
        primes = atoi(argv[1]);
    }
    if (isPrime(2)) {
        count++;
    }
    for (int i = 3; i <= primes; i += 2) {
        if (isPrime(i)) {
            count++;
        }
    }
    cout << count << endl;
    return 0;
}

bool isPrime(int x) {
    for (int i = 2; i <= floor(sqrt(x)); i++) {
        if (x % i == 0) {
            return false;
        }
    }
    return true;
}
Your Haskell version is constructing a lazy list in prime only to test whether it is null. This does indeed seem to be the bottleneck. The following version runs just as fast as the C++ version on my machine:
prime :: Int -> Bool
prime x = go 3
  where
    go q | q <= isqrt x = if rem x q == 0 then False else go (q+2)
    go _ = True
3.31 s when compiled with -O2, vs. 3.18 s for the C++ version built with gcc 4.8 and -O3, for n = 5000000.
Of course, 'guessing' where the program is slow in order to optimize it is not a very good approach. Fortunately, Haskell has good profiling tools on board.
Compiling and running with
$ ghc --make primes.hs -O2 -prof -auto-all -fforce-recomp && ./primes 5000000 +RTS -p
gives
# primes.prof
Thu Feb 20 00:49 2014 Time and Allocation Profiling Report (Final)
primes +RTS -p -RTS 5000000
total time  =        5.71 secs   (5710 ticks @ 1000 us, 1 processor)
total alloc = 259,580,976 bytes (excludes profiling overheads)
COST CENTRE MODULE %time %alloc
prime.go Main 96.4 0.0
main Main 2.0 84.6
isqrt Main 0.9 15.4
individual inherited
COST CENTRE MODULE no. entries %time %alloc %time %alloc
MAIN MAIN 45 0 0.0 0.0 100.0 100.0
main Main 91 0 2.0 84.6 100.0 100.0
prime Main 92 2500000 0.7 0.0 98.0 15.4
prime.go Main 93 326103491 96.4 0.0 97.3 15.4
isqrt Main 95 0 0.9 15.4 0.9 15.4
--- >8 ---
which clearly shows that prime is where things get hot. For more information on profiling, I'll refer you to Real World Haskell, Chap 25.
To really understand what is going on, you can look at one of GHC's intermediate languages, Core, which will show you what the code looks like after optimization. Some good info is on the Haskell wiki. I would not recommend doing that unless necessary, but it is good to know that the possibility exists.
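For example, the optimized Core for the program above can be dumped with a command like the following (the file names are the ones used earlier; check your GHC version's documentation for the exact flags):

$ ghc -O2 -fforce-recomp -ddump-simpl primes.hs > primes.core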
To your other questions:
1) How, without changing the algorithm, can I optimize the Haskell implementation?
Profile, and try to write inner loops so that they don't do any memory allocations and can be made strict by the compiler. Doing so can take some practice and experience.
2) Is Haskell really on performance parity with C?
That depends. GHC is amazing and can often optimize your program very well. If you know what you're doing you can usually get close to the performance of optimized C (100% - 200% of C's speed). That said, these optimizations are not always easy or pretty to the eye and high level Haskell can be slower. But don't forget that you're gaining amazing expressiveness and high level abstractions when using Haskell. It will usually be fast enough for all but the most performance critical applications and even then you can often get pretty close to C with some profiling and performance optimizations.
I don't think the Haskell versions (the original and the one improved by the first answer) are equivalent to the C++ version.
The reason is this: both only consider every second candidate divisor (in the prime function), while the C++ version scans every one (only i++ in the isPrime() function).
When I fix this (changing i++ to i+=2 in the C++ isPrime() function), I get down to almost 1/3 of the runtime of the optimized Haskell version (2.1s C++ vs 6s Haskell).
The output remains the same for both (of course).
Note that this is no specific optimization of the C++ version, just an adaptation of the trick already applied in the Haskell version.
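A minimal sketch of that adaptation (my reading of the fix: start the trial division at 3 and keep relying on the caller to handle 2 and even numbers, exactly as the Haskell version does):

bool isPrime(int x) {
    // assumes the caller only passes 2 and odd numbers >= 3, as main() above does
    for (int i = 3; i <= floor(sqrt(x)); i += 2) {
        if (x % i == 0) {
            return false;
        }
    }
    return true;
}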
I've recently been writing some code for a research project I'm working on, where efficiency is very important. I've been considering scrapping some of the usual ways I do things and using bitwise XORs instead. What I'm wondering is whether this will make a difference (if I'm performing this operation, say, several million times) or whether it's the same once I use -O3 in g++.
The two examples that come to mind:
I had an instance where (I'm working with purely positive ints) I needed to change n to n-1 if n was odd, or n to n+1 if n was even. I figured I had a few options:
if(n%2) // or (n%2==0) and flip the order
n=n-1
else
n=n+1
or
n=n+2*n%2-1; //This of course was silly, but was the only non-bitwise 1 line I could come up with
Finally:
n=n^1;
All of the methods clearly do the same thing, but my feeling was that the third one would be the most efficient.
The next example is more general. Say I'm comparing two positive integers; will one of these perform better than the others, or will the difference really not be noticeable, even if I perform this operation several million times:
if (n_1 == n_2)
if (!(n_1 ^ n_2))
if (n_1 ^ n_2) { /* nothing */ } else { /* do work here */ }
Will the compiler just do the same operation in all of these instances? I'm just curious if there is an instance when I should use bitwise operations and not trust the compiler to do the work for me.
Fixed: incorrect statement of the problem.
It's easy enough to check, just fire up your disassembler. Take a look:
f.c:
unsigned int f1(unsigned int n)
{
    n ^= 1;
    return n;
}

unsigned int f2(unsigned int n)
{
    if (n % 2)
        n = n - 1;
    else
        n = n + 1;
    return n;
}
Build and disassemble:
$ cc -O3 -c f.c
$ otool -tV f.o
f.o:
(__TEXT,__text) section
_f1:
00 pushq %rbp
01 movq %rsp,%rbp
04 xorl $0x01,%edi
07 movl %edi,%eax
09 leave
0a ret
0b nopl _f1(%rax,%rax)
_f2:
10 pushq %rbp
11 movq %rsp,%rbp
14 leal 0xff(%rdi),%eax
17 leal 0x01(%rdi),%edx
1a andl $0x01,%edi
1d cmovel %edx,%eax
20 leave
21 ret
It looks like f1() is a bit shorter, whether or not that matters in reality is up to some benchmarking.
I needed to change n to n-1 if n was even or n to (n+1) if n was odd.
In that case, regardless of efficiency, n = n ^ 1 is wrong.
For your second case, == will be just as efficient (if not more so) than any of the others.
In general, when it comes to optimization, you should benchmark it yourself. If a potential optimization is not worth benchmarking, it's not really worth making.
I kind of disagree with most of the answers here, which is why I still find myself replying to a question from 2010 :-)
XOR is practically speaking the fastest operation a CPU can possibly do, and the good part is that all CPUs support it. The reason for this is quite simple: an XOR gate can be built with only 4 NAND gates or 5 NOR gates, which means it's easy to create using the fabric of your silicon. Unsurprisingly, all CPUs that I know of can execute your XOR operation in one clock tick (or even less).
If you need to XOR multiple items in an array, modern x64 CPUs also support XORing multiple items at once, e.g. via the SIMD instructions on Intel.
The alternative solution you opted for uses an if-then-else. True, most compilers are able to figure this simple case out... but why take any chances, and what is the consequence?
The consequence of your compiler not figuring it out is a branch misprediction. A single branch misprediction will easily cost 17 clock ticks. If you take one look at the execution speeds of processor instructions, you'll find that branches are quite bad for your performance, especially when dealing with random data.
Note that this also means that if you construct your test incorrectly, the data will mess up your performance measurements.
So to conclude: first think, then program, then profile; not the other way around. And use XOR.
About the only way to know for sure is to test. I'd have to agree that it would take a fairly clever compiler to produce output as efficient for:
if(n%2) // or (n%2==0) and flip the order
n=n-1
else
n=n+1
as it could for n ^= 1;, but I haven't checked anything similar recently enough to say with any certainty.
As for your second question, I doubt it makes any difference: an equality comparison is going to be fast with any of these methods. If you want speed, the main thing is to avoid having a branch involved at all, e.g. something like:
if (a == b)
c += d;
can be written as c += d * (a==b);. Looking at the assembly language, the second will often look a bit messy (with ugly cruft to get the result of the comparison from the flags into a normal register), but it will still often perform better by avoiding any branches.
Edit: At least the compilers I have handy (gcc and MSVC) do not generate a cmov for the if, but they do generate a sete for the * (a==b). I expanded the code into something testable.
Edit 2: Since Potatoswatter brought up another possibility, using bitwise AND instead of multiplication, I decided to test that along with the others. Here's the code with that added:
#include <time.h>
#include <iostream>
#include <stdlib.h>

int addif1(int a, int b, int c, int d) {
    if (a == b)
        c += d;
    return c;
}

int addif2(int a, int b, int c, int d) {
    return c += d * (a == b);
}

int addif3(int a, int b, int c, int d) {
    return c += d & -(a == b);
}

int main() {
    const int iterations = 50000;
    int x = rand();
    unsigned tot1 = 0;
    unsigned tot2 = 0;
    unsigned tot3 = 0;

    clock_t start1 = clock();
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < iterations; j++)
            tot1 += addif1(i, j, i, x);
    }
    clock_t stop1 = clock();

    clock_t start2 = clock();
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < iterations; j++)
            tot2 += addif2(i, j, i, x);
    }
    clock_t stop2 = clock();

    clock_t start3 = clock();
    for (int i = 0; i < iterations; i++) {
        for (int j = 0; j < iterations; j++)
            tot3 += addif3(i, j, i, x);
    }
    clock_t stop3 = clock();

    std::cout << "Ignore: " << tot1 << "\n";
    std::cout << "Ignore: " << tot2 << "\n";
    std::cout << "Ignore: " << tot3 << "\n";
    std::cout << "addif1: " << stop1 - start1 << "\n";
    std::cout << "addif2: " << stop2 - start2 << "\n";
    std::cout << "addif3: " << stop3 - start3 << "\n";
    return 0;
}
Now the really interesting part: the results for the third version. For MS VC++, we get roughly what most of us would probably expect:
Ignore: 2682925904
Ignore: 2682925904
Ignore: 2682925904
addif1: 4814
addif2: 3504
addif3: 3021
Using the & instead of the *, gives a definite improvement -- almost as much of an improvement as * gives over if. With gcc the result is quite a bit different though:
Ignore: 2680875904
Ignore: 2680875904
Ignore: 2680875904
addif1: 2901
addif2: 2886
addif3: 7675
In this case, the code using if is much closer to the speed of the code using *, but the code using & is slower than either one -- a lot slower! In case anybody cares, I found this surprising enough that I re-compiled a couple of times with different flags, re-ran a few times with each, and so on and the result was entirely consistent -- the code using & was consistently considerably slower.
The poor result with the third version of the code compiled with gcc gets us back to what I said to start with [and ends this edit]:
As I said to start with, "the only way to know for sure is to test" -- but at least in this limited testing, the multiplication consistently beats the if. There may be some combination of compiler, compiler flags, CPU, data pattern, iteration count, etc., that favors the if over the multiplication -- there's no question that the difference is small enough that a test going the other direction is entirely believable. Nonetheless, I believe that it's a technique worth knowing; for mainstream compilers and CPUs, it seems reasonably effective (though it's certainly more helpful with MSVC than with gcc).
[resumption of edit2:] the result with gcc using & demonstrates the degree to which 1) micro-optimizations can be/are compiler specific, and 2) how much different real-life results can be from expectations.
Is n^=1 faster than if ( n%2 ) --n; else ++n;? Yes. I wouldn't expect a compiler to optimize that. Since the bitwise operation is so much more succinct, it might be worthwhile to familiarize yourself with XOR and maybe add a comment on that line of code.
If it's really critical to the functionality of your program, it could also be considered a portability issue: if you test on your compiler and it's fast, you would likely be in for a surprise when trying on another compiler. Usually this isn't an issue for algebraic optimizations.
Is x^y faster than x==y? No. Doing things in roundabout ways is generally not good.
A good compiler will optimize n%2 but you can always check the assembly produced to see. If you see divides, start optimizing it yourself because divide is about as slow as it gets.
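For instance, a throwaway check (not part of the question's code), much like the f.c example in the earlier answer: put both variants in a file, compile with g++ -O3 -S, and inspect the output; with unsigned operands you should see an and/test instruction for n % 2 rather than a divide.

unsigned parity_branch(unsigned n) { return (n % 2) ? n - 1 : n + 1; }
unsigned parity_xor(unsigned n)    { return n ^ 1u; }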
You should trust your compiler. gcc/g++ is the product of years of development and it's capable of doing any optimizations you're probably thinking of doing. And it's likely that if you start playing around, you'll tamper with its efforts to optimize your code.
n ^= 1 and n_1 == n_2 are probably the best you can do, but really, if you are after maximum efficiency, don't eyeball the code looking for little things like that.
Here's an example of how to really tune for performance.
Don't expect low level optimizations to help much until sampling has proven they are where you should focus.