OpenMP code which gives wrong answers when I start using 12 threads - c++

I have this piece of OpenMP code which performs an integration of the function 4.0/(1+x^2) on the interval [0,1]. The analytical answer is pi = 3.14159...
The method of integration is just a plain approximating Riemann sum. The code gives me the correct answer with anywhere from 1 up to 11 OpenMP threads.
However, it starts giving increasingly wrong answers once I use 12 OpenMP threads or more.
Why could this be happening? First, here is the C++ code. I am using gcc in an Ubuntu 10.10 environment. The code is compiled with g++ -fopenmp integration_OpenMP.cpp
// f(x) = 4/(1+x^2)
// Domain of integration: [0,1]
// Integral over the domain = pi =(approx) 3.14159
#include <iostream>
#include <omp.h>
#include <vector>
#include <algorithm>
#include <functional>
#include <numeric>
int main (void)
{
    //Information common to serial and parallel computation.
    int num_steps = 2e8;
    double dx = 1.0/num_steps;

    //Serial computation: method of integration is just a plain Riemann sum
    double start = omp_get_wtime();
    double serial_sum = 0;
    double x = 0;
    for (int i = 0; i < num_steps; ++i)
    {
        serial_sum += 4.0*dx/(1.0+x*x);
        x += dx;
    }
    double end = omp_get_wtime();
    std::cout << "Time taken for the serial computation: " << end-start << " seconds";
    std::cout << "\t\tPi serial: " << serial_sum << std::endl;

    //OpenMP computation. Method of integration, just a plain Riemann sum
    std::cout << "How many OpenMP threads do you need for parallel computation? ";
    int t; //number of openmp threads
    std::cin >> t;

    start = omp_get_wtime();
    double parallel_sum = 0; //will be modified atomically
    #pragma omp parallel num_threads(t)
    {
        int threadIdx = omp_get_thread_num();
        int begin = threadIdx * num_steps/t; //integer index of left end point of subinterval
        int end = begin + num_steps/t; // integer index of right-endpoint of sub-interval
        double dx_local = dx;
        double temp = 0;
        double x = begin*dx;

        for (int i = begin; i < end; ++i)
        {
            temp += 4.0*dx_local/(1.0+x*x);
            x += dx_local;
        }
        #pragma omp atomic
        parallel_sum += temp;
    }
    end = omp_get_wtime();
    std::cout << "Time taken for the parallel computation: " << end-start << " seconds";
    std::cout << "\tPi parallel: " << parallel_sum << std::endl;

    return 0;
}
Here is the output for different numbers of threads, starting with 11 threads.
OpenMP: ./a.out
Time taken for the serial computation: 1.27744 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 11
Time taken for the parallel computation: 0.366467 seconds Pi parallel: 3.14159
OpenMP: ./a.out
Time taken for the serial computation: 1.28167 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 12
Time taken for the parallel computation: 0.351284 seconds Pi parallel: 3.16496
OpenMP: ./a.out
Time taken for the serial computation: 1.28178 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 13
Time taken for the parallel computation: 0.434283 seconds Pi parallel: 3.21112
OpenMP: ./a.out
Time taken for the serial computation: 1.2765 seconds Pi serial: 3.14159
How many OpenMP threads do you need for parallel computation? 14
Time taken for the parallel computation: 0.375078 seconds Pi parallel: 3.27163

Why not just use a parallel for with static partitioning instead?
#pragma omp parallel shared(dx) num_threads(t)
{
    double x = omp_get_thread_num() * 1.0 / t;
    #pragma omp for schedule(static) reduction(+ : parallel_sum) // explicit static schedule so each thread gets one contiguous block, matching x above
    for (int i = 0; i < num_steps; ++i)
    {
        parallel_sum += 4.0*dx/(1.0+x*x);
        x += dx;
    }
}
Then you won't need to manage all the partitioning and atomic collection of results by yourself.
In order to correctly initialize x, we notice that x = (begin * dx) = (threadIdx * num_steps/t) * (1.0 / num_steps) = (threadIdx * 1.0) / t.
Edit: Just tested this final version on my machine and it seems to work correctly.
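As an aside, a slightly simpler variant (a sketch using the variables from your code, not tested against your timings) computes x from the loop index instead of carrying it across iterations, which removes the dependence on the iteration-to-thread mapping entirely:
double parallel_sum = 0;
#pragma omp parallel for reduction(+ : parallel_sum) num_threads(t)
for (int i = 0; i < num_steps; ++i)
{
    double x = i * dx;                        // left endpoint of the i-th subinterval
    parallel_sum += 4.0 * dx / (1.0 + x * x);
}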

The problem is in calculating begin:
Since you set num_steps = 2e8, when threadIdx == 11 the product threadIdx * num_steps overflows a 32-bit integer, so begin is calculated incorrectly.
I advise you to use long long int for threadIdx, begin and end.
EDIT:
Also note that your method of calculating begin and end can cause steps (and precision) to be lost. For example, with 313 threads you lose 199 steps.
The right way to calculate begin and end would be:
long long int begin = threadIdx * num_steps/t;
long long int end = (threadIdx + 1) * num_steps/t;
For the same reason, you cannot avoid the overflow simply by adding parentheses (threadIdx * (num_steps/t)), because that truncates num_steps/t and loses steps again; you have to use long long.
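Putting the two fixes together, a minimal sketch of the corrected parallel region might look like this (using the variables from the question's code; the widening to long long and the (threadIdx + 1) form of end are the only changes):
#pragma omp parallel num_threads(t)
{
    long long int threadIdx = omp_get_thread_num();
    long long int begin = threadIdx * num_steps / t;       // no 32-bit overflow now
    long long int end   = (threadIdx + 1) * num_steps / t; // last thread ends exactly at num_steps

    double temp = 0;
    double x = begin * dx;
    for (long long int i = begin; i < end; ++i)
    {
        temp += 4.0 * dx / (1.0 + x * x);
        x += dx;
    }
    #pragma omp atomic
    parallel_sum += temp;
}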

Related

OpenMP parallelized C++ code exhibits higher execution times on more threads?

I'm trying to learn parallelization of C++ using OpenMP, and I'm working with the following example. But for some reason, when I increase the number of threads the code runs slower. I'm compiling it with the -fopenmp flag. It would be nice if I could get your expert opinion.
#include <omp.h>
#include <iostream>

static long num_steps = 100000000;
#define NUM_THREADS 4
double step;

int main(){
    int i, nthreads;
    double pi, sum[NUM_THREADS]; // should be shared : hence promoted scalar sum into an array
    step = 1.0/(double) num_steps;
    omp_set_num_threads(NUM_THREADS);

    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        int i, id, nthrds;
        double x;
        id = omp_get_thread_num();
        nthrds = omp_get_num_threads();
        //if(id==0) nthreads = nthrds; // This is done because the number of threads can be different
                                       // ie the environment can give you a different number of threads
                                       // than requested
        for(i=id, sum[id] = 0.0; i<num_steps; i=i+nthrds){
            x = (i+0.5)*step;
            sum[id] += 4.0/(1.0+x*x);
        }
    }
    double t2 = omp_get_wtime();

    std::cout << "Time : ";
    double ms_double = t2 - t1;
    std::cout << ms_double << "ms\n";

    for(i=0, pi=0.0; i < nthreads; i++){
        pi += sum[i]*step;
    }
}
Minor complaints aside, your big problem is the loop update i = i + nthrds. This means that each cache line will be accessed by all 4 of your threads. (Btw, use the OMP_NUM_THREADS environment variable to set the number of threads; do not hardcode it.) This is called false sharing and it is really bad for performance: you want each cache line to live exclusively in one core's cache.
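If you do want to keep explicit per-thread partial sums, the classic workaround (a sketch, assuming the num_steps, step and NUM_THREADS from your code) is to pad each thread's slot so that two threads never write to the same cache line:
constexpr int PAD = 8;                  // 8 doubles = 64 bytes, one cache line on most x86 CPUs
double sum[NUM_THREADS][PAD];           // each thread only ever touches sum[id][0]

#pragma omp parallel
{
    int id = omp_get_thread_num();
    int nthrds = omp_get_num_threads();
    sum[id][0] = 0.0;
    for (long i = id; i < num_steps; i += nthrds) {
        double x = (i + 0.5) * step;
        sum[id][0] += 4.0 / (1.0 + x * x);
    }
}
// accumulate pi from sum[0][0] .. sum[nthreads-1][0] afterwards, as before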
The main advantage of OpenMP, though, is that you do not have to do the reduction manually. You just have to add an extra line to the serial code. So your code should be something like this (which is free from false sharing):
double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(unsigned long i=0; i<num_steps; ++i){
    const double x = (i+0.5)*step;
    sum += 4.0/(1.0+x*x);
}
double pi = sum*step;
Note that your code had an uninitialized variable (pi) and did not properly handle the case where you get fewer threads than requested.
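For reference, a minimal way to record how many threads you actually got (a sketch; the reduction version above does not need it, since OpenMP handles the partial sums for you):
int nthreads = 0;
#pragma omp parallel
{
    #pragma omp single
    nthreads = omp_get_num_threads();   // the size of the team you actually received
}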
What @Victor Eijkhout called "minor complaints" might not be so minor. It is only normal that using a new API (OpenMP) for the first time can be confusing, and that more often than not reflects on the coding style of the application code as well. But especially in such cases, special attention should be paid to readability.
The code below is the "prettied-up" version of your attempt. Next to the OpenMP parallel integration it also has a single-threaded and a multi-threaded (using std::thread) version, so you can compare them to each other.
#include <omp.h>
#include <iostream>
#include <thread>

constexpr int MAX_PARALLEL_THREADS = 4; // long is wrong - is it an isize_t or a int32_t or an int64_t???

// the function we want to integrate
double f(double x) {
    return 4.0 / (1.0 + x * x);
}

// performs the summation of function values on the interval [left,right[
double sum_interval(double left, double right, double step) {
    double sum = 0.0;
    for (double x = left; x < right; x += step) {
        sum += f(x);
    }
    return sum;
}

double integrate_single_threaded(double left, double right, double step) {
    return sum_interval(left, right, step) / (right - left);
}

double integrate_multi_threaded(double left, double right, double step) {
    double sums[MAX_PARALLEL_THREADS];
    std::thread threads[MAX_PARALLEL_THREADS];
    for (int i = 0; i < MAX_PARALLEL_THREADS; i++) {
        threads[i] = std::thread( [&sums,left,right,step,i] () {
            double ileft = left + (right - left) / MAX_PARALLEL_THREADS * i;
            double iright = left + (right - left) / MAX_PARALLEL_THREADS * (i + 1);
            sums[i] = sum_interval(ileft, iright, step);
        });
    }
    double total_sum = 0.0;
    for (int i = 0; i < MAX_PARALLEL_THREADS; i++) {
        threads[i].join();
        total_sum += sums[i];
    }
    return total_sum / (right - left);
}

double integrate_parallel(double left, double right, double step) {
    double sums[MAX_PARALLEL_THREADS];
    int thread_count = 0;
    omp_set_num_threads(MAX_PARALLEL_THREADS);
    #pragma omp parallel
    {
        thread_count = omp_get_num_threads(); // 0 is impossible, there is always 1 thread minimum...
        int interval_index = omp_get_thread_num();
        double ileft = left + (right - left) / thread_count * interval_index;
        double iright = left + (right - left) / thread_count * (interval_index + 1);
        sums[interval_index] = sum_interval(ileft, iright, step);
    }
    double total_sum = 0.0;
    for (int i = 0; i < thread_count; i++) {
        total_sum += sums[i];
    }
    return total_sum / (right - left);
}

int main (int argc, const char* argv[]) {
    double left = -1.0;
    double right = 1.0;
    double step = 1.0E-9;

    // run single threaded calculation
    std::cout << "single" << std::endl;
    double tstart = omp_get_wtime();
    double i_single = integrate_single_threaded(left, right, step);
    double tend = omp_get_wtime();
    double st_time = tend - tstart;

    // run multi threaded calculation
    std::cout << "multi" << std::endl;
    tstart = omp_get_wtime();
    double i_multi = integrate_multi_threaded(left, right, step);
    tend = omp_get_wtime();
    double mt_time = tend - tstart;

    // run omp calculation
    std::cout << "omp" << std::endl;
    tstart = omp_get_wtime();
    double i_omp = integrate_parallel(left, right, step);
    tend = omp_get_wtime();
    double omp_time = tend - tstart;

    std::cout
        << "i_single: " << i_single
        << " st_time: " << st_time << std::endl
        << "i_multi: " << i_multi
        << " mt_time: " << mt_time << std::endl
        << "i_omp: " << i_omp
        << " omp_time: " << omp_time << std::endl;
    return 0;
}
When I compile this on my Debian with g++ --std=c++17 -Wall -O3 -lpthread -fopenmp -o para para.cpp -pthread, I get the following results:
single
multi
omp
i_single: 3.14159e+09 st_time: 2.37662
i_multi: 3.14159e+09 mt_time: 0.635195
i_omp: 3.14159e+09 omp_time: 0.660593
So my conclusion, at least, is that it is not worth the effort to learn OpenMP, given that the (more generally useful) std::thread version looks just as nice and performs at least as well.
I am not really trusting the computed integral result in either case, though. But I did not really focus on that. They all produce the same value. That is the important part.

Why doesn't my OpenMP program scale with number of threads?

I wrote a program to calculate the sum of an array of 1M numbers, where all elements equal 1. I use OpenMP for multithreading. However, the run time doesn't scale with the number of threads. Here is the code:
#include <iostream>
#include <omp.h>

#define SIZE 1000000
#define N_THREADS 4

using namespace std;

int main() {
    int* arr = new int[SIZE];
    long long sum = 0;
    int n_threads = 0;

    omp_set_num_threads(N_THREADS);
    double t1 = omp_get_wtime();

    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            n_threads = omp_get_num_threads();
        }

        #pragma omp for schedule(static, 16)
        for (int i = 0; i < SIZE; i++) {
            arr[i] = 1;
        }

        #pragma omp for schedule(static, 16) reduction(+:sum)
        for (int i = 0; i < SIZE; i++) {
            sum += arr[i];
        }
    }

    double t2 = omp_get_wtime();

    cout << "n_threads " << n_threads << endl;
    cout << "time " << (t2 - t1)*1000 << endl;
    cout << sum << endl;
}
The run time (in milliseconds) for different values of N_THREADS is as follows:
n_threads 1
time 3.6718
n_threads 2
time 2.5308
n_threads 3
time 3.4383
n_threads 4
time 3.7427
n_threads 5
time 2.4621
I used schedule(static, 16) so that each thread works on chunks of 16 iterations, to avoid the false sharing problem. I thought the performance issue was related to false sharing, but now I think it's not. What could the problem be?
Your code is memory bound, not compute bound. Its speed depends on the speed of memory access (cache utilization, number of memory channels, etc.), and therefore it is not expected to scale well with the number of threads.
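To put rough numbers on that (a back-of-the-envelope estimate, not a measurement): with SIZE = 1,000,000 the array is only about 4 MB of ints, so one write pass plus one read pass over it takes a few milliseconds at most; at that scale, the cost of creating and synchronizing the OpenMP team is a sizeable fraction of the total, which is why adding threads barely moves the numbers. Once the array is made large enough that it no longer fits in cache, the loops are limited by how fast RAM can stream the data, and a couple of threads are usually enough to saturate the available memory channels.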
UPDATE: I ran this code using a 1000x bigger SIZE (i.e. #define SIZE 100000000), compiled with g++ -fopenmp -O3 -mavx2.
Here are the results; it still scales badly with the number of threads:
n_threads 1
time 652.656
time 657.207
time 608.838
time 639.168
1000000000
n_threads 2
time 422.621
time 373.995
time 425.819
time 386.511
time 466.632
time 394.198
1000000000
n_threads 3
time 394.419
time 391.283
time 470.925
time 375.833
time 442.268
time 449.611
time 370.12
time 458.79
1000000000
n_threads 4
time 421.89
time 402.363
time 424.738
time 414.368
time 491.843
time 429.757
time 431.459
time 497.566
1000000000
n_threads 8
time 414.426
time 430.29
time 494.899
time 442.164
time 458.576
time 449.313
time 452.309
1000000000
Five threads contending for the same reduction accumulator, or a chunk size of only 16, must be inhibiting efficient pipelining of the loop iterations. Try a coarser region per thread.
Maybe more importantly, you need to repeat the benchmark multiple times programmatically, both to get an average and to warm up the CPU caches and let the cores ramp up to higher frequencies, for a better measurement.
The benchmark results work out to about 1 MB/s. Surely even the worst RAM will do 1000 times better than that, so memory is not the bottleneck (for now). One million elements in about 4 seconds looks like locking contention or an un-warmed benchmark; normally even a Pentium 1 would manage more bandwidth than that. Are you sure you are compiling with -O3 optimization?
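A minimal hand-rolled version of that repeat-and-warm-up suggestion might look like the sketch below (run_once() is a hypothetical wrapper around the two parallel loops from the question; the Google Benchmark rewrite further down does all of this more rigorously):
constexpr int REPEATS = 10;
double total = 0.0;
for (int r = 0; r < REPEATS; ++r) {
    double t1 = omp_get_wtime();
    run_once();                      // hypothetical wrapper: fill the array, then sum it
    double t2 = omp_get_wtime();
    if (r > 0)                       // discard the first run: cold caches, low clock frequency
        total += (t2 - t1) * 1000.0;
}
std::cout << "avg time " << total / (REPEATS - 1) << " ms\n";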
I have reimplemented the test as a Google Benchmark with different values:
#include <benchmark/benchmark.h>
#include <memory>
#include <omp.h>

constexpr int SCALE{32};
constexpr int ARRAY_SIZE{1000000};
constexpr int CHUNK_SIZE{16};

void original_benchmark(benchmark::State& state)
{
    const int num_threads{state.range(0)};
    const int array_size{state.range(1)};
    const int chunk_size{state.range(2)};

    auto arr = std::make_unique<int[]>(array_size);
    long long sum = 0;
    int n_threads = 0;

    omp_set_num_threads(num_threads);
    // double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 0) {
            n_threads = omp_get_num_threads();
        }

        #pragma omp for schedule(static, chunk_size)
        for (int i = 0; i < array_size; i++) {
            arr[i] = 1;
        }

        #pragma omp for schedule(static, chunk_size) reduction(+:sum)
        for (int i = 0; i < array_size; i++) {
            sum += arr[i];
        }
    }
    // double t2 = omp_get_wtime();

    // cout << "n_threads " << n_threads << endl;
    // cout << "time " << (t2 - t1)*1000 << endl;
    // cout << sum << endl;
    state.counters["n_threads"] = n_threads;
}

static void BM_original_benchmark(benchmark::State& state) {
    for (auto _ : state) {
        original_benchmark(state);
    }
}

BENCHMARK(BM_original_benchmark)
    ->Args({1, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({1, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({1, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({2, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({2, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({2, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({4, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({4, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({4, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({8, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({8, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({8, ARRAY_SIZE, SCALE * CHUNK_SIZE})
    ->Args({16, ARRAY_SIZE, CHUNK_SIZE})
    ->Args({16, SCALE * ARRAY_SIZE, CHUNK_SIZE})
    ->Args({16, ARRAY_SIZE, SCALE * CHUNK_SIZE});

BENCHMARK_MAIN();
I only have access to Compiler Explorer at the moment which will not execute the complete suite of benchmarks. However, it looks like increasing the chunk size will improve the performance. Obviously, benchmark and optimize for your own system.

Why is my C++ code so much slower than R?

I have written the following code in R and C++, which performs the same algorithm:
a) To simulate the random variable X 500 times. (X has value 0.9 with prob 0.5 and 1.1 with prob 0.5)
b) Multiply these 500 simulated values together to get a value. Save that value in a container
c) Repeat 10000000 times such that the container has 10000000 values
R:
ptm <- proc.time()
steps <- 500
MCsize <- 10000000
a <- rbinom(MCsize,steps,0.5)
b <- rep(500,times=MCsize) - a
result <- rep(1.1,times=MCsize)^a*rep(0.9,times=MCsize)^b
proc.time()-ptm
C++
#include <numeric>
#include <vector>
#include <iostream>
#include <random>
#include <thread>
#include <mutex>
#include <cmath>
#include <algorithm>
#include <chrono>

const size_t MCsize = 10000000;
std::mutex mutex1;
std::mutex mutex2;
unsigned seed_;
std::vector<double> cache;

void generatereturns(size_t steps, int RUNS){
    mutex2.lock();
    // setting seed
    try{
        std::mt19937 tmpgenerator(seed_);
        seed_ = tmpgenerator();
        std::cout << "SEED : " << seed_ << std::endl;
    }catch(int exception){
        mutex2.unlock();
    }
    mutex2.unlock();

    // Creating generator
    std::binomial_distribution<int> distribution(steps,0.5);
    std::mt19937 generator(seed_);
    for(int i = 0; i != RUNS; ++i){
        double power;
        double returns;
        power = distribution(generator);
        returns = pow(0.9,power) * pow(1.1,(double)steps - power);
        std::lock_guard<std::mutex> guard(mutex1);
        cache.push_back(returns);
    }
}

int main(){
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    size_t steps = 500;
    seed_ = 777;
    unsigned concurentThreadsSupported = std::max(std::thread::hardware_concurrency(),(unsigned)1);
    int remainder = MCsize % concurentThreadsSupported;
    std::vector<std::thread> threads;

    // starting sub-thread simulations
    if(concurentThreadsSupported != 1){
        for(int i = 0; i != concurentThreadsSupported - 1; ++i){
            if(remainder != 0){
                threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported + 1));
                remainder--;
            }else{
                threads.push_back(std::thread(generatereturns,steps,MCsize / concurentThreadsSupported));
            }
        }
    }

    // starting main thread simulation
    if(remainder != 0){
        generatereturns(steps, MCsize / concurentThreadsSupported + 1);
        remainder--;
    }else{
        generatereturns(steps, MCsize / concurentThreadsSupported);
    }

    for (auto& th : threads) th.join();

    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    typedef std::chrono::duration<int,std::milli> millisecs_t;
    millisecs_t duration( std::chrono::duration_cast<millisecs_t>(end-start) );
    std::cout << "Time elapsed : " << duration.count() << " milliseconds.\n";
    return 0;
}
I can't understand why my R code is so much faster than my C++ code (3.29 s vs 12 s), even though I have used four threads in the C++ code. Can anyone enlighten me, please? How should I improve my C++ code to make it run faster?
EDIT:
Thanks for all the advice! I reserved capacity for my vectors and reduced the amount of locking in my code. The crucial update in the generatereturns() function is:
std::vector<double> cache(MCsize);
std::vector<double>::iterator currit = cache.begin();
//.....

// Creating generator
std::binomial_distribution<int> distribution(steps,0.5);
std::mt19937 generator(seed_);

std::vector<double> tmpvec(RUNS);
for(int i = 0; i != RUNS; ++i){
    double power;
    double returns;
    power = distribution(generator);
    returns = pow(0.9,power) * pow(1.1,(double)steps - power);
    tmpvec[i] = returns;
}
std::lock_guard<std::mutex> guard(mutex1);
std::move(tmpvec.begin(),tmpvec.end(),currit);
currit += RUNS;
Instead of locking every time, I created a temporary vector and then used std::move to shift the elements of that tmpvec into cache. Now the elapsed time has been reduced to 1.9 seconds.
First of all, are you running it in release mode?
Switching from debug to release reduced the running time from ~15 s to ~4.5 s on my laptop (Windows 7, i5-3210M).
Also, reducing the number of threads from 4 to 2 in my case (I only have 2 cores, but with hyperthreading) further reduced the running time to ~2.4 s.
Changing the variable power to int (as jimifiki also suggested) offered a slight boost as well, reducing the time to ~2.3 s.
I really enjoyed your question and I tried the code at home. I tried to change the random number generator, my implementation of std::binomial_distribution requires on average about 9.6 calls of generator().
I know the question is more about comparing R with C++ performance, but since you ask "How should I improve my C++ code to make it run faster?" I will insist on the pow optimization. You can easily avoid half of the pow calls by precomputing either 0.9^steps or 1.1^steps before the for loop. This makes your code run a bit faster:
double power1 = pow(0.9,steps);   // 0.9^steps, computed once before the loop
double ratio = 1.1/0.9;
for(int i = 0; i != RUNS; ++i){
    ...
    returns = power1 * pow(ratio, (double)steps - power);  // == 0.9^power * 1.1^(steps-power)
}
Analogously you can improve the R code:
...
ratio <- 1.1/0.9
pow1 <- 0.9^steps
result <- rep(pow1,times=MCsize) * rep(ratio,times=MCsize)^a
...
This probably doesn't help you that much, but start by using pow(double, int) when your exponent is an int:
int power;
returns = pow(0.9,power) * pow(1.1,(int)steps - power);
Can you see any improvement?

Timing performance of multithreading

I wrote a small program to check the performance of threading, and I have a couple of questions about the results I obtained
(the CPU of my laptop is an i5-3220M).
1) The time required jumps up for 2 threads every time I run the program. Is it because of the omp timer I use, or do I have a logic error in the program?
2) Also, would it be better if I used CPU cycles to measure the performance instead?
3) The time continues to decrease as the number of threads increases. I know my program is simple enough that it probably requires no context switching, but where does the extra performance come from? Is it because the CPU adjusts itself to the turbo frequency? (Normal 2.6 GHz, turbo 3.3 GHz according to Intel's website.)
Thanks!
Output
Adding 1 for 1000 million times
Average Time Elapsed for 1 threads = 3.11565(Check = 5000000000)
Average Time Elapsed for 2 threads = 4.54309(Check = 5000000000)
Average Time Elapsed for 4 threads = 2.19321(Check = 5000000000)
Average Time Elapsed for 8 threads = 2.48927(Check = 5000000000)
Average Time Elapsed for 16 threads = 1.84427(Check = 5000000000)
Average Time Elapsed for 32 threads = 1.30958(Check = 5000000000)
Average Time Elapsed for 64 threads = 1.08472(Check = 5000000000)
Average Time Elapsed for 128 threads = 0.996898(Check = 5000000000)
Average Time Elapsed for 256 threads = 1.01366(Check = 5000000000)
Average Time Elapsed for 512 threads = 0.951436(Check = 5000000000)
Average Time Elapsed for 1024 threads = 0.973331(Check = 4999997440)
Program
#include <iostream>
#include <thread>
#include <algorithm> // for_each
#include <vector>
#include <omp.h>     // omp_get_wtime

class Adder{
public:
    long sum;
    Adder(){};
    void operator()(long endVal_i){
        sum = 0;
        for (long i = 1; i <= endVal_i; i++)
            sum++;
    };
};

int main()
{
    long totalCount = 1000000000;
    int maxThread = 1025;
    int numSample = 5;
    std::vector<std::thread> threads;
    Adder adderArray[maxThread];

    std::cout << "Adding 1 for " << totalCount/1000000 << " million times\n\n";

    for (int numThread = 1; numThread <= maxThread; numThread = numThread*2){
        double avgTime = 0;
        long check = 0;
        for (int i = 1; i <= numSample; i++){
            double startTime = omp_get_wtime();
            long loop = totalCount/numThread;
            for (int i = 0; i < numThread; i++)
                threads.push_back(std::thread(std::ref(adderArray[i]), loop));
            std::for_each(threads.begin(), threads.end(), std::mem_fn(&std::thread::join));
            double endTime = omp_get_wtime();
            for (int i = 0; i < numThread; i++)
                check += adderArray[i].sum;
            threads.erase(threads.begin(), threads.end());
            avgTime += endTime - startTime;
        }
        std::cout << "Average Time Elapsed for " << numThread << " threads = " << avgTime/numSample << "(Check = " << check << ")\n";
    }
}

Performance swapping integers vs double

For some reason my code is able to perform swaps on doubles faster than on integers. I have no idea why this is happening.
On my machine the double swap loop completes 11 times faster than the integer swap loop. What property of doubles/integers makes them perform this way?
Test setup
Visual Studio 2012 x64
cpu core i7 950
Build as Release and run exe directly, VS Debug hooks skew things
Output:
Process time for ints 1.438 secs
Process time for doubles 0.125 secs
#include <iostream>
#include <ctime>

using namespace std;

#define N 2000000000

void swap_i(int *x, int *y) {
    int tmp = *x;
    *x = *y;
    *y = tmp;
}

void swap_d(double *x, double *y) {
    double tmp = *x;
    *x = *y;
    *y = tmp;
}

int main () {
    int a = 1, b = 2;
    double d = 1.0, e = 2.0, iTime, dTime;
    clock_t c0, c1;

    // Time int swaps
    c0 = clock();
    for (int i = 0; i < N; i++) {
        swap_i(&a, &b);
    }
    c1 = clock();
    iTime = (double)(c1-c0)/CLOCKS_PER_SEC;

    // Time double swaps
    c0 = clock();
    for (int i = 0; i < N; i++) {
        swap_d(&d, &e);
    }
    c1 = clock();
    dTime = (double)(c1-c0)/CLOCKS_PER_SEC;

    cout << "Process time for ints " << iTime << " secs" << endl;
    cout << "Process time for doubles " << dTime << " secs" << endl;
}
It seems that VS only optimized one of the loops, as Blastfurnace explained.
When I disabled all compiler optimizations and had my swap code inline inside the loops, I got the following results (I also switched my timer to std::chrono::high_resolution_clock):
Process time for ints 1449 ms
Process time for doubles 1248 ms
You can find the answer by looking at the generated assembly.
Using Visual C++ 2012 (32-bit Release build) the body of swap_i is three mov instructions but the body of swap_d is completely optimized away to an empty loop. The compiler is smart enough to see that an even number of swaps has no visible effect. I don't know why it doesn't do the same with the int loop.
Just changing #define N 2000000000 to #define N 2000000001 and rebuilding causes the swap_d body to perform actual work. The final times are close on my machine with swap_d being about 3% slower.
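If the goal is to force both loops to do real work, one blunt but effective option (a sketch of the general technique, not what the original code did, and it does change what is being measured) is to make the swapped variables volatile so the compiler may not elide their loads and stores:
#include <iostream>
#include <ctime>
using namespace std;

#define N 2000000000

// volatile-qualified pointers: the compiler must perform every load and store
void swap_i(volatile int *x, volatile int *y) {
    int tmp = *x; *x = *y; *y = tmp;
}
void swap_d(volatile double *x, volatile double *y) {
    double tmp = *x; *x = *y; *y = tmp;
}

int main() {
    volatile int a = 1, b = 2;
    volatile double d = 1.0, e = 2.0;

    clock_t c0 = clock();
    for (int i = 0; i < N; i++) swap_i(&a, &b);
    clock_t c1 = clock();
    cout << "ints    " << (double)(c1 - c0) / CLOCKS_PER_SEC << " secs" << endl;

    c0 = clock();
    for (int i = 0; i < N; i++) swap_d(&d, &e);
    c1 = clock();
    cout << "doubles " << (double)(c1 - c0) / CLOCKS_PER_SEC << " secs" << endl;
}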