In general (or from your experience), is there difference in performance between for and while loops?
What if they are doubly/triply nested?
Is vectorization (SSE) affected by loop variant in g++ or Intel compilers?
Thank you
Here is a nice paper on the subject.
Any intelligent compiler won't really show a difference between them; a for loop is really just syntactic sugar for a certain form of while loop anyway.
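To illustrate (a rough sketch, assuming optimizations are enabled), one would expect these two fragments to compile to essentially the same code:

int sum1 = 0;
for (int i = 0; i < 10; ++i)
    sum1 += i;          // for form

int sum2 = 0;
{
    int i = 0;
    while (i < 10) {    // hand-desugared while form
        sum2 += i;
        ++i;
    }
}

The one well-known difference is continue: inside a for loop it still executes the increment step, so the mechanical while rewrite is not identical when the body contains continue.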
VS2015, Intel Xeon CPU
long long n = 1000000000;
int *v = new int[n];
int *v1 = new int[2 * n];
clock_t start, end;
int *p, *pe, *p1;

// Indexed for loop: copy every other element of v1 into v
start = clock();
for (long long i = 0, j = 0; i < n; i++, j += 2)
    v[i] = v1[j];
end = clock();
std::cout << "for1 - CPU time = " << (double)(end - start) / CLOCKS_PER_SEC << std::endl;

// Pointer-walking while loop doing the same copy
p = v; pe = p + n; p1 = v1;
start = clock();
while (p < pe)
{
    *p++ = *p1;
    p1 += 2;
}
end = clock();
std::cout << "while3 - CPU time = " << (double)(end - start) / CLOCKS_PER_SEC << std::endl;
for1 - CPU time = 4.055
while3 - CPU time = 1.271
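Note that the two snippets above don't differ only in the loop keyword: the for version maintains two induction variables and indexes into the arrays, while the while version walks raw pointers. A fairer comparison (an untested sketch, reusing n, v, v1, start and end from the code above; the "for2" label is made up) would express the same pointer walk as a for loop:

start = clock();
for (int *p = v, *p1 = v1, *pe = v + n; p < pe; ++p, p1 += 2)
    *p = *p1;
end = clock();
std::cout << "for2 - CPU time = " << (double)(end - start) / CLOCKS_PER_SEC << std::endl;

If that still times differently from while3, the loop keyword really is the cause; otherwise the difference was the loop body, not the keyword.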
This is something easily ascertained by looking at disassembly. For most loops, they will be the same assuming you do the same work.
int i = 0;
while (i < 10)
    ++i;
is the same as
for (int i = 0; i < 10; ++i)
    ;
As for nesting, it really depends on how you configure it but same setups should yield same code.
Should be zero difference, but do check: I've seen really crappy, older versions of GCC create different ARM/Thumb code between the two. One optimized away a compare after a subtract that set the zero flag, whereas the other did not. Was very lame.
Nesting again should make no difference. Not sure on SSE/Vectorization stuff, but again I'd expect there to be no difference.
It should be negligible; an optimizing compiler should make the distinction nonexistent.
Whenever working on a specific problem, I may come across different solutions. I'm not sure how to choose the better of the two options. The first idea is to compute the complexity of the two solutions, but sometimes they may share the same complexity, or they may differ but the range of the input is small that the constant factor matters.
The second idea is to benchmark both solutions. However, I'm not sure how to time them using c++. I have found this question:
How to Calculate Execution Time of a Code Snippet in C++, but I don't know how to properly deal with compiler optimizations or processor inconsistencies.
In short: is the code provided in the question above sufficient for everyday tests? Are there any options I should enable in the compiler before I run the tests? (I'm using Visual C++.) How many tests should I do, and how much of a time difference between the two benchmarks actually matters?
Here is an example of a code I want to test. Which of these is faster? How can I calculate that myself?
unsigned long long fiborecursion(int rank){
    if (rank == 0) return 1;
    else if (rank < 0) return 0;
    return fiborecursion(rank-1) + fiborecursion(rank-2);
}

double sq5 = sqrt(5);
unsigned long long fiboconstant(int rank){
    return pow((1 + sq5) / 2, rank + 1) / sq5 + 0.5;
}
Using the clock from this answer
#include <iostream>
#include <chrono>

class Timer
{
public:
    Timer() : beg_(clock_::now()) {}
    void reset() { beg_ = clock_::now(); }
    double elapsed() const {
        return std::chrono::duration_cast<second_>
            (clock_::now() - beg_).count();
    }

private:
    typedef std::chrono::high_resolution_clock clock_;
    typedef std::chrono::duration<double, std::ratio<1> > second_;
    std::chrono::time_point<clock_> beg_;
};
You can write a program to time both of your functions.
int main() {
    const int N = 10000;

    Timer tmr;

    tmr.reset();
    for (int i = 0; i < N; i++) {
        auto value = fiborecursion(i % 50);
    }
    double time1 = tmr.elapsed();

    tmr.reset();
    for (int i = 0; i < N; i++) {
        auto value = fiboconstant(i % 50);
    }
    double time2 = tmr.elapsed();

    std::cout << "Recursion"
              << "\n\tTotal: " << time1
              << "\n\tAvg: " << time1 / N
              << "\n"
              << "\nConstant"
              << "\n\tTotal: " << time2
              << "\n\tAvg: " << time2 / N
              << "\n";
}
I would try compiling with no compiler optimizations (-O0) and with maximum optimizations (-O3), just to see what the differences are. At maximum optimization the compiler may well eliminate the loops entirely, since their results are never used.
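If the optimizer does delete the loops, one common workaround (sketched here; it reuses N, tmr and time1 from the program above) is to make the results observable, for example by accumulating them into a volatile variable:

volatile unsigned long long sink = 0;   // the compiler must keep writes to this

tmr.reset();
for (int i = 0; i < N; i++) {
    sink = sink + fiborecursion(i % 50);   // result is now "used"
}
double time1 = tmr.elapsed();

The same change applies to the second loop.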
This question spawned from a separate question, which turned out to have some apparently machine specific quirks. When I run the C++ code listed below for recording the timing differences between tanh and exp, I see the following result:
tanh: 5.22203
exp: 14.9393
tanh runs ~3x as fast as exp. This is somewhat surprising considering the mathematical definition of tanh (and being ignorant of the algorithmic definition implemented).
What's more is that this happens on my laptop (Ubuntu 16.04, Intel Core i7-3517U CPU @ 1.90GHz × 4), but does not occur on my desktop (same OS, not sure about CPU specs right now).
I compiled the code below with g++. The above times were with no compiler optimization, although the trend remains if I use -On for each n. I also fiddled with a and b values to see if the range of values being evaluated was having an effect. This doesn't seem to matter.
What would cause tanh to be faster than exp on different machines?
#include <iostream>
#include <cmath>
#include <ctime>
using namespace std;

int main() {
    double a = -5;
    double b = 5;
    int N = 10001;
    double x[10001];
    double y[10001];
    double h = (b-a) / (N-1);

    clock_t begin, end;

    for(int i=0; i < N; i++)
        x[i] = a + i*h;

    begin = clock();
    for(int i=0; i < N; i++)
        for(int j=0; j < N; j++)
            y[i] = tanh(x[i]);
    end = clock();
    cout << "tanh: " << double(end - begin) / CLOCKS_PER_SEC << "\n";

    begin = clock();
    for(int i=0; i < N; i++)
        for(int j=0; j < N; j++)
            y[i] = exp(x[i]);
    end = clock();
    cout << "exp: " << double(end - begin) / CLOCKS_PER_SEC << "\n";

    return 0;
}
edit: some assembly output
This is the output when I compile the simplified code below with g++ -g -O -Wa,-aslh nothing2.cpp > stuff.txt.
#include <cmath>

int main() {
    double x = 0.0;
    double y, z;
    y = tanh(x);
    z = exp(x);
    return 0;
}
edit: another update
Assume nothing2.cpp contains the simplified code in the previous edit. I run:
g++ -o nothing2.so -shared -fPIC nothing2.cpp
objdump -d nothing2.so > stuff.txt
Here is the contents of stuff.txt
There are various possible explanations, and which one applies in your case depends on which platform you're using and exactly which math library is in use. But one possible explanation is:
First of all, the calculation of tanh does not rely on the standard definition of tanh; instead it is expressed in terms of exp(-2*x) or expm1(2*x), which means only one exponential has to be calculated, and that is probably the heavy operation (in addition there's a division and some additions).
Second, and this may be the trick: for largish values of x this reduces to (exp(2*x)-1)/(exp(2*x)+1) = 1 - 2/(expm1(2*x)+2). The advantage here is that since the second term is smallish, it doesn't have to be calculated to the same relative accuracy as the final result, so the full accuracy of expm1 isn't needed here the way it is in general.
Also, for smallish (that is, large negative) values of x there's a similar trick in rewriting it as (1-exp(-2*x))/(1+exp(-2*x)) = 2/(expm1(-2*x)+2) - 1, which again means we can take advantage of the factor exp(-2*x) being large and not have to calculate it to the same accuracy. In practice you don't calculate it that way; you use the expression -expm1(-2*x)/(expm1(-2*x)+2) instead, with the same relaxed accuracy requirement on expm1.
In addition, there are other optimizations of basically the same origin that are available for large |x| and that aren't possible for exp. With large x the term expm1(2*x) becomes so large that the correction can simply be discarded entirely, while for exp the exponential still has to be calculated (this is even the case for large negative x). For those values tanh can immediately be decided to be ±1, while exp must actually be computed.
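To make the idea concrete, here is a minimal sketch of tanh built on expm1 along these lines (an illustration only, not the actual libm code):

#include <cmath>

// Illustrative sketch only, not the actual library algorithm.
double tanh_sketch(double x)
{
    if (x >= 20.0)  return 1.0;    // correction term is below double precision
    if (x <= -20.0) return -1.0;
    if (x >= 0.0) {
        double e = std::expm1(2.0 * x);   // the single exponential-type call
        return e / (e + 2.0);             // == 1 - 2/(e + 2)
    } else {
        double e = std::expm1(-2.0 * x);
        return -e / (e + 2.0);            // == 2/(e + 2) - 1
    }
}

The cutoff of 20 is just an example: beyond it the correction term is below double precision, so no exponential needs to be evaluated at all.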
Consider the following code:
#include <iostream>
#include <string>
#include <chrono>
#include <cstdlib>   // rand, srand
using namespace std;

int main()
{
    int iter = 1000000;
    int loops = 10;

    while (loops)
    {
        int a = 0, b = 0, c = 0, f = 0, m = 0, q = 0;

        auto begin = chrono::high_resolution_clock::now();
        auto end = chrono::high_resolution_clock::now();
        auto deltaT = end - begin;
        auto accumT = end - begin;
        accumT = accumT - accumT;
        auto controlT = accumT;

        srand(chrono::duration_cast<chrono::nanoseconds>(begin.time_since_epoch()).count());

        for (int i = 0; i < iter; i++) {
            begin = chrono::high_resolution_clock::now();
            // No arithmetic operation
            end = chrono::high_resolution_clock::now();
            deltaT = end - begin;
            accumT += deltaT;
        }

        controlT = accumT;        // Control duration
        accumT = accumT - accumT; // Reset to zero

        for (int i = 0; i < iter; i++) {
            auto n1 = rand() % 100;
            auto n2 = rand() % 100;

            begin = chrono::high_resolution_clock::now();
            c += i * 2 * n1 * n2; // Some arbitrary arithmetic operation
            end = chrono::high_resolution_clock::now();

            deltaT = end - begin;
            accumT += deltaT;
        }

        // Print the difference in time between the loop with no arithmetic operation and the loop with one
        cout << " c = " << c << "\t\t" << " | ";
        cout << "difference between the 1st and 2nd loop: "
             << chrono::duration_cast<chrono::nanoseconds>(accumT - controlT).count()
             << endl;

        loops--;
    }

    return 0;
}
It tries to isolate the time measurement of an operation. The first loop is a control to establish a baseline and the second loop has an arbitrary arithmetic operation.
Then it outputs to the console. Here's sample output:
c = 2116663282 | difference between 1st and 2nd loop: -8620916
c = 112424882 | difference between 1st and 2nd loop: -1197927
c = -1569775878 | difference between 1st and 2nd loop: -5226990
c = 1670984684 | difference between 1st and 2nd loop: 4394706
c = -1608171014 | difference between 1st and 2nd loop: 676683
c = -1684897180 | difference between 1st and 2nd loop: 2868093
c = 112418158 | difference between 1st and 2nd loop: 5846887
c = 2019014070 | difference between 1st and 2nd loop: -951609
c = 656490372 | difference between 1st and 2nd loop: 997815
c = 263579698 | difference between 1st and 2nd loop: 2371088
Here's the very interesting part: sometimes the loop with the arithmetic operation finishes faster than the loop with no arithmetic operation (negative difference), which means that the operation of recording the current time is slower than the arithmetic operation, and thus not negligible.
Is there a way around this?
PS: Yes, I understand you can wrap the whole loop between begin and end.
Setup machine: Core i7 architecture, Windows 10 64 bit, and Visual Studio 2015
Your problem is that you measure time and not the number of instructions processed. Time can be influenced by a lot of things that are not really what you would expect, or wish to measure.
Instead, you should measure the number of clock cycles. There exists a library for this which can be found on Agner Fog's website. He has a lot of useful information about optimizations:
http://www.agner.org/optimize/#manuals
Even using clock cycles, you can still experience peculiarities in the results. This could happen if the processor uses out-of-order execution which enables the processor to optimize the order of execution of the operations.
If you have compiled your code with debugging symbols, the compiler may have injected additional code, which may impact the result. When performing tests like this, you should always compile without debugging information.
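For example (the file name and exact flags are only illustrative), a release-style build:
g++ -O2 -DNDEBUG bench.cpp -o bench
cl /O2 bench.cpp
rather than a debug build such as g++ -O0 -g or cl /Od /Zi.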
You should use a steady clock, std::steady_clock.
std::system_clock / std::high_resolution_clock can get corrected (adjusted) by the OS.
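For example, a minimal sketch (not part of the answer above) that times the whole loop with std::steady_clock instead of timestamping every iteration:

#include <chrono>
#include <cstdio>

int main() {
    volatile long long sink = 0;   // keeps the work from being optimized away
    const int iter = 1000000;

    auto begin = std::chrono::steady_clock::now();
    for (int i = 0; i < iter; i++)
        sink = sink + static_cast<long long>(i) * 2;   // the work being measured
    auto end = std::chrono::steady_clock::now();

    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count();
    std::printf("total: %lld ns, per iteration: %.3f ns\n",
                static_cast<long long>(ns), double(ns) / iter);
    return 0;
}

A steady clock is monotonic, so it cannot be adjusted backwards while the measurement is running.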
According to my professor, loops are faster and more efficient than recursion, yet I came up with this C++ code that calculates the Fibonacci series using both recursion and loops, and the results show they are very similar. So I maxed out the possible input to see if there was a difference in performance, and for some reason recursion clocked in better than the loop. Does anyone know why? Thanks in advance.
Here's the code:
#include "stdafx.h"
#include "iostream"
#include <time.h>
using namespace std;
double F[200000000];
//double F[5];
/*int Fib(int num)
{
if (num == 0)
{
return 0;
}
if (num == 1)
{
return 1;
}
return Fib(num - 1) + Fib(num - 2);
}*/
double FiboNR(int n) // array of size n
{
for (int i = 2; i <= n; i++)
{
F[i] = F[i - 1] + F[i - 2];
}
return (F[n]);
}
double FibMod(int i,int n) // array of size n
{
if (i==n)
{
return F[i];
}
F[i] = F[i - 1] + F[i - 2];
return (F[n]);
}
int _tmain(int argc, _TCHAR* argv[])
{
    /*cout << "----------------Recursion--------------" << endl;
    for (int i = 0; i < 36; i = i + 5)
    {
        clock_t tStart = clock();
        cout << Fib(i);
        printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
        cout << " : Fib(" << i << ")" << endl;
    }*/

    cout << "----------------Linear--------------" << endl;
    for (int i = 0; i < 200000000; i = i + 20000000)
    //for (int i = 0; i < 50; i = i + 5)
    {
        clock_t tStart = clock();
        F[0] = 0; F[1] = 1;
        cout << FiboNR(i);
        printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
        cout << " : Fib(" << i << ")" << endl;
    }

    cout << "----------------Recursion Modified--------------" << endl;
    for (int i = 0; i < 200000000; i = i + 20000000)
    //for (int i = 0; i < 50; i = i + 5)
    {
        clock_t tStart = clock();
        F[0] = 0; F[1] = 1;
        cout << FibMod(0, i);
        printf("Time taken: %.2fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
        cout << " : Fib(" << i << ")" << endl;
    }

    std::cin.ignore();
    return 0;
}
If you go by the conventional programming approach, loops are faster. But there is a category of languages called functional programming languages which do not contain loops. I am a big fan of functional programming and I am an avid Haskell user; Haskell is one such functional programming language. In these languages, instead of loops you use recursion.

To implement fast recursion there is something known as tail recursion. Basically, to avoid pushing a lot of extra information onto the system stack, you write the function in such a way that all the computation is carried along in the function parameters, so that nothing needs to be stored on the stack other than the function call pointer. Once the final recursive call has been made, instead of unwinding the whole stack the program just needs to return to the first call's stack entry. Compilers for functional programming languages are designed to handle this, and nowadays even non-functional languages implement tail-call optimization.

For example, consider the recursive solution for computing the factorial of a positive number. The basic implementation in C would be
int fact(int n)
{
    if (n == 1 || n == 0)
        return 1;
    return n * fact(n - 1);
}
In the above approach, each time the function is called, n is stored on the stack so that it can be multiplied with the result of fact(n-1); that multiplication happens during stack unwinding. Now check out the following implementation.
int fact(int n, int result)
{
    if (n == 1 || n == 0)
        return result;
    return fact(n - 1, n * result);
}
In this approach we pass the partial computation along in the parameter result, so at the end the answer is already in result. The only thing you have to do is pass an initial value of 1 for result in the first call. The stack can then be unwound directly to its first entry. Of course, I am not sure whether C or C++ compilers detect tail recursion, but functional programming languages do.
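For completeness, here is what the accumulator style looks like for Fibonacci itself (a sketch of mine; whether a C++ compiler actually turns the tail call into a jump depends on the compiler and optimization level):

// Accumulator-style ("tail recursive") Fibonacci: a carries fib(k), b carries fib(k+1).
double fib_tail(int n, double a = 0.0, double b = 1.0)
{
    if (n == 0) return a;
    return fib_tail(n - 1, b, a + b);   // tail call: nothing happens after it returns
}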
Your "recursion modified" version doesn't have recursion at all.
In fact, the only thing enabling a non-recursive version that fills in exactly one new entry of the array is the for-loop in your main function -- so it is actually a solution using iteration also (props to immibis and BlastFurnace for noticing that).
But your version doesn't even do that correctly. Rather, since it is always called with i == 0, it illegally reads F[-1] and F[-2]. You are lucky (?)1 the program didn't crash.
The reason you are getting correct results is that the entire F array is prefilled by the correct version.
Your attempt to calculate Fib(2000....) isn't successful anyway, since you overflow a double. Did you even try running that code?
Here's a version that works correctly (to the precision of double, anyway) and doesn't use a global array (it really is iteration vs recursion and not iteration vs memoization).
#include <cstdio>
#include <ctime>
#include <utility>

double FiboIterative(int n)
{
    double a = 0.0, b = 1.0;
    if (n <= 0) return a;
    for (int i = 2; i <= n; i++)
    {
        b += a;
        a = b - a;
    }
    return b;
}

std::pair<double,double> FiboRecursive(int n)
{
    if (n <= 0) return {};
    if (n == 1) return {0, 1};
    auto rec = FiboRecursive(n-1);
    return {rec.second, rec.first + rec.second};
}

int main(void)
{
    const int repetitions = 1000000;
    const int n = 100;
    volatile double result;

    std::puts("----------------Iterative--------------");
    std::clock_t tStart = std::clock();
    for( int i = 0; i < repetitions; ++i )
        result = FiboIterative(n);
    std::printf("[%d] = %f\n", n, result);
    std::printf("Time taken: %.2f us\n", (std::clock() - tStart) / 1.0 / CLOCKS_PER_SEC);

    std::puts("----------------Recursive--------------");
    tStart = std::clock();
    for( int i = 0; i < repetitions; ++i )
        result = FiboRecursive(n).second;
    std::printf("[%d] = %f\n", n, result);
    std::printf("Time taken: %.2f us\n", (std::clock() - tStart) / 1.0 / CLOCKS_PER_SEC);

    return 0;
}
--
1Arguably anything that hides a bug is actually unlucky.
I don't think this is a good question, but maybe the answer to why is somewhat interesting.
At first let me say that generally the statement is probably true. But ...
Questions about the performance of C++ programs are very localized; it's never possible to give a good general answer. Every example should be profiled and analyzed separately, and it involves lots of technicalities. C++ compilers are allowed to modify the program practically as they wish, as long as they don't produce visible side effects (whatever precisely that means). So as long as your computation gives the same result, it is fine. This technically allows transforming one version of your program into an equivalent one, even from the recursive version into a loop-based one and vice versa. So it depends on compiler optimizations and compiler effort.
Also, to compare one version to another you would need to prove that the versions you compare are actually equivalent.
It might also happen that a recursive implementation of an algorithm is faster than a loop-based one if it's easier for the compiler to optimize. Usually iterative versions are more complex, and generally the simpler the code is, the easier it is for the compiler to optimize, because it can make assumptions about invariants, etc.
I have a huge vector<vector<int>> (18M x 128). Frequently I want to take 2 rows of this vector and compare them by this function:
int getDiff(int indx1, int indx2) {
    int result = 0;
    int pplus, pminus, tmp;

    for (int k = 0; k < 128; k += 2) {
        pplus = nodeL[indx2][k] - nodeL[indx1][k];
        pminus = nodeL[indx1][k + 1] - nodeL[indx2][k + 1];

        tmp = max(pplus, pminus);
        if (tmp > result) {
            result = tmp;
        }
    }
    return result;
}
As you can see, the function loops through the two row vectors, does some subtraction, and at the end returns the maximum. This function will be used a million times, so I was wondering if it can be accelerated through SSE instructions. I use Ubuntu 12.04 and gcc.
Of course it is a micro-optimization, but it would be helpful if you could provide some help, since I know nothing about SSE. Thanks in advance.
Benchmark:
int nofTestCases = 10000000;

vector<int> nodeIds(nofTestCases);
vector<int> goalNodeIds(nofTestCases);
vector<int> results(nofTestCases);

for (int l = 0; l < nofTestCases; l++) {
    nodeIds[l] = randomNodeID(18000000);
    goalNodeIds[l] = randomNodeID(18000000);
}

double time, result;

time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
    results[l] = getDiff2(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;

time = timestamp();
for (int l = 0; l < nofTestCases; l++) {
    results[l] = getDiff(nodeIds[l], goalNodeIds[l]);
}
result = timestamp() - time;
cout << result / nofTestCases << "s" << endl;
where
int randomNodeID(int n) {
    return (int) (rand() / (double) (RAND_MAX + 1.0) * n);
}

/** Returns a timestamp ('now') in seconds (incl. a fractional part). */
inline double timestamp() {
    struct timeval tp;
    gettimeofday(&tp, NULL);
    return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
FWIW I put together a pure SSE version (SSE4.1) which seems to run around 20% faster than the original scalar code on a Core i7:
#include <smmintrin.h>

int getDiff_SSE(int indx1, int indx2)
{
    int result[4] __attribute__ ((aligned(16))) = { 0 };

    const int * const p1 = &nodeL[indx1][0];
    const int * const p2 = &nodeL[indx2][0];

    // XOR with -1 followed by subtracting -1 is a two's-complement negate,
    // so vke negates the even lanes of v1 and vko negates the odd lanes of v2.
    const __m128i vke = _mm_set_epi32(0, -1, 0, -1);
    const __m128i vko = _mm_set_epi32(-1, 0, -1, 0);

    __m128i vresult = _mm_set1_epi32(0);

    for (int k = 0; k < 128; k += 4)
    {
        __m128i v1, v2, vmax;

        v1 = _mm_loadu_si128((__m128i *)&p1[k]);
        v2 = _mm_loadu_si128((__m128i *)&p2[k]);

        v1 = _mm_xor_si128(v1, vke);
        v2 = _mm_xor_si128(v2, vko);
        v1 = _mm_sub_epi32(v1, vke);
        v2 = _mm_sub_epi32(v2, vko);

        // even lanes: p2[k] - p1[k] (pplus), odd lanes: p1[k+1] - p2[k+1] (pminus)
        vmax = _mm_add_epi32(v1, v2);

        vresult = _mm_max_epi32(vresult, vmax);
    }
    _mm_store_si128((__m128i *)result, vresult);
    return max(max(max(result[0], result[1]), result[2]), result[3]);
}
You can probably get the compiler to use SSE for this. Will it make the code quicker? Probably not. The reason is that there is a lot of memory access compared to computation. The CPU is much faster than the memory, and a trivial implementation of the above will already have the CPU stalling while it waits for data to arrive over the system bus. Making the CPU faster will just increase the amount of waiting it does.
The declaration of nodeL can have an effect on the performance, so it's important to choose an efficient container for your data.
There is a threshold where optimising does have a benefit, and that's when you're doing more computation between memory reads, i.e. when the time between memory reads is much greater. The point at which this occurs depends a lot on your hardware.
It can be helpful, however, to optimise the code if you've got non-memory-constrained tasks that can run in parallel, so that the CPU is kept busy while waiting for the data.
This will be faster. The double dereference of a vector of vectors is expensive; caching one of the dereferences will help. I know it's not answering the posted question, but I think it will be a more helpful answer.
int getDiff(int indx1, int indx2) {
    int result = 0;
    int pplus, pminus, tmp;

    const vector<int>& nodetemp1 = nodeL[indx1];
    const vector<int>& nodetemp2 = nodeL[indx2];

    for (int k = 0; k < 128; k += 2) {
        pplus = nodetemp2[k] - nodetemp1[k];
        pminus = nodetemp1[k + 1] - nodetemp2[k + 1];

        tmp = max(pplus, pminus);
        if (tmp > result) {
            result = tmp;
        }
    }
    return result;
}
A couple of things to look at. One is the amount of data you are passing around. That will cause a bigger issue than the trivial calculation.
I've tried to rewrite it using SSE/AVX instructions, using the library here.
The original code on my system ran in 11.5s
With Neil Kirk's optimisation, it went down to 10.5s
EDIT: Tested the code with a debugger rather than in my head!
int getDiff(std::vector<std::vector<int>>& nodeL, int row1, int row2) {
    Vec4i result(0);
    const std::vector<int>& nodetemp1 = nodeL[row1];
    const std::vector<int>& nodetemp2 = nodeL[row2];

    Vec8i mask(-1, 0, -1, 0, -1, 0, -1, 0);
    for (int k = 0; k < 128; k += 8) {
        Vec8i nodeA(nodetemp1[k], nodetemp1[k+1], nodetemp1[k+2], nodetemp1[k+3], nodetemp1[k+4], nodetemp1[k+5], nodetemp1[k+6], nodetemp1[k+7]);
        Vec8i nodeB(nodetemp2[k], nodetemp2[k+1], nodetemp2[k+2], nodetemp2[k+3], nodetemp2[k+4], nodetemp2[k+5], nodetemp2[k+6], nodetemp2[k+7]);

        // even lanes: nodeB - nodeA (pplus), odd lanes: nodeA - nodeB (pminus)
        Vec8i tmp = select(mask, nodeB - nodeA, nodeA - nodeB);
        Vec4i tmp_a(tmp[0], tmp[2], tmp[4], tmp[6]);
        Vec4i tmp_b(tmp[1], tmp[3], tmp[5], tmp[7]);
        Vec4i max_tmp = max(tmp_a, tmp_b);
        result = select(max_tmp > result, max_tmp, result);
    }
    // Reduce with a maximum over the four lanes; a horizontal_add here would
    // sum the per-lane maxima, which is not what the scalar version returns.
    return std::max(std::max(result[0], result[1]), std::max(result[2], result[3]));
}
The lack of branching speeds it up to 9.5s, but the data is still the biggest factor.
If you want to speed it up more, try changing the data structure to a single flat array/vector rather than a 2D one (i.e. a vector of vectors), as that will reduce cache pressure.
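As a sketch of the flat-layout idea (names like flatL and getDiffFlat are made up for illustration):

#include <vector>
#include <algorithm>

std::vector<int> flatL;   // sized to rows * 128 elsewhere; row r starts at r * 128

int getDiffFlat(int indx1, int indx2) {
    const int* p1 = &flatL[static_cast<size_t>(indx1) * 128];
    const int* p2 = &flatL[static_cast<size_t>(indx2) * 128];
    int result = 0;
    for (int k = 0; k < 128; k += 2) {
        int pplus  = p2[k] - p1[k];
        int pminus = p1[k + 1] - p2[k + 1];
        result = std::max(result, std::max(pplus, pminus));
    }
    return result;
}

All 18M rows then live in one contiguous allocation, so comparing two rows touches two dense 512-byte stretches instead of chasing separately allocated heap blocks.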
EDIT
I thought of something: you could add a custom allocator to ensure you allocate the 2*18M vectors in a contiguous block of memory, which would let you keep the data structure and still go through it quickly. But you'd need to profile it to be sure.
EDIT 2: Tested the code with a debugger rather than in my head!
Sorry Alex, this should be better. I'm not sure it will be faster than what the compiler can do. I still maintain that it's memory access that's the issue, so I would still try the single-array approach. Give this a go though.