Parallel execution taking more time than serial - c++

I am basically writing code to count how many pair sums are even (among all pairs from 1 to 100000). I wrote the code both with and without pthreads, but the version with pthreads is taking more time than the serial one. Here is my serial code:
#include<bits/stdc++.h>
using namespace std;
int main()
{
    long long sum = 0, count = 0, n = 100000;
    auto start = chrono::high_resolution_clock::now();
    for(int i = 1; i <= n; i++)
        for(int j = i-1; j >= 0; j--)
        {
            sum = i + j;
            if(sum%2 == 0)
                count++;
        }
    cout<<"count is "<<count<<endl;
    auto end = chrono::high_resolution_clock::now();
    double time_taken = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
    time_taken *= 1e-9;
    cout << "Time taken by program is : " << fixed << time_taken << setprecision(9)<<" secs"<<endl;
    return 0;
}
and here is my parallel code
#include<bits/stdc++.h>
using namespace std;
#define MAX_THREAD 3

long long cnt[5] = {0};
long long n = 100000;
int work_per_thread;
int start[] = {1, 60001, 83001, 100001};

void *count_array(void* arg)
{
    int t = *((int*)arg);
    long long sum = 0;
    for(int i = start[t]; i < start[t+1]; i++)
        for(int j = i-1; j >= 0; j--)
        {
            sum = i + j;
            if(sum%2 == 0)
                cnt[t]++;
        }
    cout<<"thread"<<t<<" finished work "<<cnt[t]<<endl;
    return NULL;
}

int main()
{
    pthread_t threads[MAX_THREAD];
    int arr[] = {0,1,2};
    long long total_count = 0;
    work_per_thread = n/MAX_THREAD;
    auto start = chrono::high_resolution_clock::now();
    for(int i = 0; i < MAX_THREAD; i++)
        pthread_create(&threads[i], NULL, count_array, &arr[i]);
    for(int i = 0; i < MAX_THREAD; i++)
        pthread_join(threads[i], NULL);
    for(int i = 0; i < MAX_THREAD; i++)
        total_count += cnt[i];
    cout << "count is " << total_count << endl;
    auto end = chrono::high_resolution_clock::now();
    double time_taken = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
    time_taken *= 1e-9;
    cout << "Time taken by program is : " << fixed << time_taken << setprecision(9)<<" secs"<<endl;
    return 0;
}
In the parallel code I am creating three threads: the 1st thread does its computation from 1 to 60000, the 2nd thread from 60001 to 83000, and so on. I chose these numbers so that each thread gets approximately the same number of computations. The parallel execution takes 10.3 secs whereas the serial one takes 7.7 secs. I have 6 cores with 2 threads per core. I also used the htop command to check whether the required number of threads are running, and that seems fine. I don't understand where the problem is.

All the threads in the parallel version compete for cnt[]: the counters sit next to each other in memory, so every cnt[t]++ invalidates the cache line the other cores are working on (false sharing).
Use a local counter inside the loop and copy the result into cnt[t] once the loop has finished.
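A minimal sketch of that fix, reusing the globals cnt[] and start[] from the question (only the thread function changes):

void *count_array(void* arg)
{
    int t = *((int*)arg);
    long long local_count = 0;                 // private to this thread
    for(int i = start[t]; i < start[t+1]; i++)
        for(int j = i-1; j >= 0; j--)
        {
            long long sum = i + j;
            if(sum%2 == 0)
                local_count++;                 // no shared memory touched here
        }
    cnt[t] = local_count;                      // single write to the shared array
    cout<<"thread"<<t<<" finished work "<<cnt[t]<<endl;
    return NULL;
}

With this change each thread writes to shared memory only once, so the cores no longer invalidate each other's cache lines on every iteration.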

Related

Using OpenMP multithread is much slower than single thread

I am trying to parallelize my C++ neural network training process using OpenMP, but it won't work.
I then used a simple C++ program with nested loops to test OpenMP.
But it is much slower with OpenMP multithreading than with a single thread.
Did I do something wrong to make it slower? Or did I miss something?
System: macOS, 4 cores
Language: C++
Time functions: I used both high_resolution_clock::now() and omp_get_wtime().
With std::chrono::high_resolution_clock::now():
single thread cost time: 0.00000000000000
2 threads cost time: 0.00010013580322
4 threads cost time: 0.00016403198242
6 threads cost time: 0.00017309188843
8 threads cost time: 0.00112605094910
10 threads cost time: 0.00013613700867
12 threads cost time: 0.00082898139954
With omp_get_wtime():
single thread cost time: 0.00000005900000
2 threads cost time: 0.00009907600000
4 threads cost time: 0.00018207300000
6 threads cost time: 0.00014479500000
8 threads cost time: 0.00070604400000
10 threads cost time: 0.00057277700000
12 threads cost time: 0.00074358000000
Code
#include <iostream>
#include <omp.h>
#include <chrono>
#include <iomanip>
using namespace std;
void test() {
int j = 0;
for (int i = 0; i < 100000; i++) {
// do something to kill time...
j++;
}
};
int main()
{
auto startTime = chrono::high_resolution_clock::now();
auto endTime = chrono::high_resolution_clock::now();
// without openMp
startTime = chrono::high_resolution_clock::now();
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
chrono::duration<double> diff = endTime - startTime;
cout << setprecision(14) << fixed;
cout << "single thread cost time: " << diff.count() << endl;
// 2 threads
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(2)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "2 threads cost time: " << diff.count() << endl;
// 4 threads
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "4 threads cost time: " << diff.count() << endl;
// 6 threads
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(6)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "6 threads cost time: " << diff.count() << endl;
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(8)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "8 threads cost time: " << diff.count() << endl;
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(10)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "10 threads cost time: " << diff.count() << endl;
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(12)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "12 threads cost time: " << diff.count() << endl;
// system("pause");
return 0;
}
How I compile the code
clang++ -std=c++11 -Xpreprocessor -fopenmp parallel.cpp -O3 -o parallel -lomp
Update
Hi guys, the previous problem has been solved; I think I should not use num_threads.
But when I use OpenMP to accelerate my neural network, it takes longer.
Data size: MNIST dataset, 60000 samples per epoch
Time function: omp_get_wtime()
Single thread result
***** train epoch 1.
Batch count: 6000.
batch size: 10.
Progress: 5999/6000.
train time is ... 64.7082.
Accuracy: 97.72% 9772/10000.
predict time is ... 3.51836.
Releasing Data Samples...
Releasing Neural Network...
Result with OpenMP
***** train epoch 1.
Batch count: 6000.
batch size: 10.
Progress: 5999/6000.
train time is: 247.615.
Accuracy: 97.72% 9772/10000.
predict time is: 30.739.
Code using parallel for
#pragma omp parallel for
for (int k = 0; k < size; k++) {
layer->map[i].data[k] = activation_func::tan_h(layer->map_common[k] + layer->map[i].b);
// cout << "current thread: " << omp_get_thread_num() << endl;
}
Code using parallel for and omp critical
for (int k = 0; k < layer->map_count; k++) {
    for (int i = 0; i < map_h; i++) {
        for (int j = 0; j < map_w; j++) {
            double max_value = prev_layer->map[k].data[2*i*upmap_w + 2*j];
            for (int n = 2*i; n < 2*(i + 1); n++) {
                #pragma omp parallel for
                for (int m = 2*j; m < 2*(j + 1); m++) {
                    #pragma omp critical
                    max_value = MAX(max_value, prev_layer->map[k].data[n*upmap_w + m]);
                }
            }
            layer->map[k].data[i*map_w + j] = activation_func::tan_h(max_value);
        }
    }
}
I am trying to parallelize my C++ neural network training process using OpenMP, but it won't work. I then used a simple C++ program with nested loops to test OpenMP.
I see this quite often: introducing OpenMP into a code, or parallelism for that matter, will not magically make your code faster.
Why? Because of a lot of factors, but (in your context) mainly because the work done in parallel has to be big enough to overcome the overhead of the parallelism (e.g., thread creation, synchronization, and so on). To achieve that you need to increase the size/number of the parallel tasks.
Another issue is with the way you are benchmarking the code:
Your parallel task:
void test() {
    int j = 0;
    for (int i = 0; i < 100000; i++) {
        // do something to kill time...
        j++;    // <---- Not enough work done in parallel
    }
};
In the sequential version the compiler can easily deduce that j ends up equal to 100000. Moreover, because you are not doing anything with that value (i.e., j), the compiler can actually optimize the entire call to the test() function away. Hence, as pointed out in the comments:
Your test loop doesn't really do anything, so the compiler might be
removing it. Then the time you get would be mostly the time spent
creating threads. – 1201ProgramAlarm
and
The test function should return the value and your code should print it somewhere. As @1201ProgramAlarm has said, the compiler might detect that you're just wasting compute time and remove the loop. – Michael Klemm
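For illustration, a minimal sketch along the lines of those comments (my own adaptation, not the commenters' code): test() returns a value that depends on its input, the caller accumulates the results with a reduction, and the total is printed, so the compiler cannot drop the work.

#include <iostream>
#include <omp.h>

// test() now does data-dependent work and returns the result.
long long test(int seed) {
    long long j = seed;
    for (int i = 0; i < 100000; i++) {
        j += (j >> 1) ^ i;          // serially dependent work, hard to fold away
    }
    return j;
}

int main() {
    double t0 = omp_get_wtime();
    long long sink = 0;
    #pragma omp parallel for reduction(+ : sink)
    for (int i = 0; i < 100000; i++) {
        sink += test(i);            // every result feeds into the printed total
    }
    double t1 = omp_get_wtime();
    std::cout << "time: " << (t1 - t0) << " s, sink = " << sink << std::endl;
    return 0;
}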
Furthermore, instead of having the following block of code:
// 2 threads
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(2)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "2 threads cost time: " << diff.count() << endl;
replicated a bunch of times, it would have been better to have it a single time and to change the number of threads from the outside using the environment variable OMP_NUM_THREADS, for example:
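(A sketch of that workflow, assuming the num_threads(...) clauses are removed so the runtime default applies; the compile line is the one from the question.)

clang++ -std=c++11 -Xpreprocessor -fopenmp parallel.cpp -O3 -o parallel -lomp
OMP_NUM_THREADS=1 ./parallel
OMP_NUM_THREADS=2 ./parallel
OMP_NUM_THREADS=4 ./parallel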
Regarding your update:
for (int k = 0; k < layer->map_count; k++) {
    for (int i = 0; i < map_h; i++) {
        for (int j = 0; j < map_w; j++) {
            double max_value = prev_layer->map[k].data[2*i*upmap_w + 2*j];
            for (int n = 2*i; n < 2*(i + 1); n++) {
                #pragma omp parallel for
                for (int m = 2*j; m < 2*(j + 1); m++) {
                    #pragma omp critical
                    max_value = MAX(max_value, prev_layer->map[k].data[n*upmap_w + m]);
                }
            }
            layer->map[k].data[i*map_w + j] = activation_func::tan_h(max_value);
        }
    }
}
That critical section basically makes the code sequential; in fact it is even worse than sequential, because there is the additional overhead of the locking mechanism.
Instead of #pragma omp critical you should use an OpenMP reduction, which is meant exactly for this kind of situation. Moreover, you can try to parallelize the for (int n = 2*i; n < 2*(i + 1); n++) loop instead:
for (int k = 0; k < layer->map_count; k++) {
    for (int i = 0; i < map_h; i++) {
        for (int j = 0; j < map_w; j++) {
            double max_value = prev_layer->map[k].data[2*i*upmap_w + 2*j];
            #pragma omp parallel for reduction(max: max_value)
            for (int n = 2*i; n < 2*(i + 1); n++) {
                for (int m = 2*j; m < 2*(j + 1); m++) {
                    max_value = MAX(max_value, prev_layer->map[k].data[n*upmap_w + m]);
                }
            }
            layer->map[k].data[i*map_w + j] = activation_func::tan_h(max_value);
        }
    }
}
As a side note, and please don't take it the wrong way, but personally I think you should spend more time learning the basics of multithreading and OpenMP first, before trying to blindly parallelize code.
Please don't keep adding updates with new questions to the original question. Just create a new question instead.

Measuring time with chrono changes after printing

I want to measure the execution time of a program in ns in C++. For that purpose I am using the chrono library.
int main() {
const int ROWS = 200;
const int COLS = 200;
double input[ROWS][COLS];
int i,j;
auto start = std::chrono::steady_clock::now();
for (i = 0; i < ROWS; i++) {
for (j = 0; j < COLS; j++)
input[i][j] = i + j;
}
auto end = std::chrono::steady_clock::now();
auto res=std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
std::cout << "Elapsed time in nanoseconds : "
<< res
<< " ns" << std::endl;
return 0;
}
I measured the time and it executed in 90 ns. However, when I add printing afterwards, the time changes.
int main() {
const int ROWS = 200;
const int COLS = 200;
double input[ROWS][COLS];
int i,j;
auto start = std::chrono::steady_clock::now();
for (i = 0; i < ROWS; i++) {
for (j = 0; j < COLS; j++)
input[i][j] = i + j;
}
auto end = std::chrono::steady_clock::now();
auto res=std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
std::cout << "Elapsed time in nanoseconds : "
<< res
<< " ns" << std::endl;
for (i = 0; i < ROWS; i++) {
for (j = 0; j < COLS; j++)
std::cout<<input[i][j];
}
return 0;
}
The time changes to 89700 ns. What could be the problem? I only want to measure the execution time of the for loop.

Determining CPU time required to execute loop

I've done some SO searching and found this and that outlining timing methods.
My problem is that I need to determine the CPU time (in milliseconds) required to execute the following loop:
for (int i = 0, temp = 0; i < 10000; i++)
{
if (i % 2 == 0)
{
temp = (i / 2) + 1;
}
else
{
temp = 2 * i;
}
}
I've looked at two methods, clock() and steady_clock::now(). Per the docs, I know that clock() returns "ticks", so I can get the time in seconds by dividing the difference by CLOCKS_PER_SEC. The docs also mention that steady_clock is designed for interval timing, but you have to call duration_cast<milliseconds> to change its unit.
What I've done to time the two (since doing both in the same run may lead to one taking longer since the other was called first) is run them each by themselves:
clock_t t = clock();
for (int i = 0, temp = 0; i < 10000; i++)
{
if (i % 2 == 0)
{
temp = (i / 2) + 1;
}
else
{
temp = 2 * i;
}
}
t = clock() - t;
cout << (float(t)/CLOCKS_PER_SEC) * 1000 << "ms taken" << endl;
chrono::steady_clock::time_point p1 = chrono::steady_clock::now();
for (int i = 0, temp = 0; i < 10000; i++)
{
if (i % 2 == 0)
{
temp = (i / 2) + 1;
}
else
{
temp = 2 * i;
}
}
chrono::steady_clock::time_point p2 = chrono::steady_clock::now();
cout << chrono::duration_cast<chrono::milliseconds>(p2-p1).count() << "ms taken" << endl;
Output:
0ms taken
0ms taken
Do both these methods floor the result? Surely some fraction of a millisecond elapsed?
So which is ideal (or rather, more appropriate) for determining the CPU time required to execute the loop? At first glance, I would argue for clock(), since the docs specifically tell me that it's for determining CPU time.
For context, my CLOCKS_PER_SEC holds a value of 1000.
Edit/Update:
Tried the following:
clock_t t = clock();
for (int j = 0; j < 1000000; j++) {
volatile int temp = 0;
for (int i = 0; i < 10000; i++)
{
if (i % 2 == 0)
{
temp = (i / 2) + 1;
}
else
{
temp = 2 * i;
}
}
}
t = clock() - t;
cout << (float(t) * 1000.0f / CLOCKS_PER_SEC / 1000000.0f) << "ms taken" << endl;
Outputs: 0.019953ms taken
clock_t start = clock();
volatile int temp = 0;
for (int i = 0; i < 10000; i++)
{
if (i % 2 == 0)
{
temp = (i / 2) + 1;
}
else
{
temp = 2 * i;
}
}
clock_t end = clock();
cout << fixed << setprecision(2) << 1000.0 * (end - start) / CLOCKS_PER_SEC << "ms taken" << endl;
Outputs: 0.00ms taken
chrono::high_resolution_clock::time_point p1 = chrono::high_resolution_clock::now();
volatile int temp = 0;
for (int i = 0; i < 10000; i++)
{
if (i % 2 == 0)
{
temp = (i / 2) + 1;
}
else
{
temp = 2 * i;
}
}
chrono::high_resolution_clock::time_point p2 = chrono::high_resolution_clock::now();
cout << (chrono::duration_cast<chrono::microseconds>(p2 - p1).count()) / 1000.0 << "ms taken" << endl;
Outputs: 0.072ms taken
chrono::steady_clock::time_point p1 = chrono::steady_clock::now();
volatile int temp = 0;
for (int i = 0; i < 10000; i++)
{
if (i % 2 == 0)
{
temp = (i / 2) + 1;
}
else
{
temp = 2 * i;
}
}
chrono::steady_clock::time_point p2 = chrono::steady_clock::now();
cout << (chrono::duration_cast<chrono::microseconds>(p2 - p1).count()) / 1000.0f << "ms taken" << endl;
Outputs: 0.044ms
So the question becomes: which is valid? The second method seems invalid to me, because I think the loop completes in less than a millisecond.
I understand the first method (it simply makes the workload run long enough to measure), but the last two methods produce drastically different results.
One thing I've noticed is that after compiling the program, the first run may give 0.073ms (for the high_resolution_clock) and 0.044ms (for the steady_clock), but all subsequent runs are within the range of 0.019 - 0.025ms.
You can do the loop a million times, and divide. You can also add the volatile keyword to avoid some compiler optimizations.
clock_t t = clock();
for (int j = 0; j < 1000000; j++) {
volatile int temp = 0;
for (int i = 0; i < 10000; i++)
{
if (i % 2 == 0)
{
temp = (i / 2) + 1;
}
else
{
temp = 2 * i;
}
}
}
t = clock() - t;
cout << (float(t) * 1000.0f / CLOCKS_PER_SEC / 1000000.0f) << "ms taken" << endl;
Well, using GetTickCount() seems to be a solution, I hope:
double start_s = GetTickCount();
for (int i = 0, temp = 0; i < 10000000; i++)
{
if (i % 2 == 0)
{
temp = (i / 2) + 1;
}
else
{
temp = 2 * i;
}
}
double stop_s = GetTickCount();
cout << (stop_s - start_s) / double(CLOCKS_PER_SEC) * 1000 << "ms taken" << endl;
For me it returns between 16 and 31 ms.

OpenMP implementation increasingly slow with thread count increase

I have been trying to learn to use OpenMP. However, my code seemed to be running more quickly in serial than in parallel.
Indeed, the more threads used, the longer the computation took.
To illustrate this I ran an experiment. I am trying to do the following operation:
long int C[num], D[num];
for (i=0; i<num; i++) C[i] = i;
for (i=0; i<num; i++){
for (j=0; j<N; j++) {
D[i] = pm(C[i]);
}
}
where the function pm is simply
int pm(int val) {
val++;
val--;
return val;
}
I implemented the inner loop in parallel and compared the run times as a function of the number of iterations on the inner loop (N) and the number of threads used. The code for the experiment is below.
#include <stdio.h>
#include <iostream>
#include <time.h>
#include "omp.h"
#include <fstream>
#include <cstdlib>
#include <cmath>
static long num = 1000;
using namespace std;
int pm(int val) {
val++;
val--;
return val;
}
int main() {
int i, j, k, l;
int iter = 8;
int iterT = 4;
long inum[iter];
for (i=0; i<iter; i++) inum[i] = pow(10, i);
double serial[iter][iterT], parallel[iter][iterT];
ofstream outdata;
outdata.open("output.dat");
if (!outdata) {
std::cerr << "Could not open file." << std::endl;
exit(1);
}
"""Experiment Start"""
for (l=1; l<iterT+1; l++) {
for (k=0; k<iter; k++) {
clock_t start = clock();
long int A[num], B[num];
omp_set_num_threads(l);
for (i=0; i<num; i++) A[i] = i;
for (i=0; i<num; i++){
#pragma omp parallel for schedule(static)
for (j=0; j<inum[k]; j++) {
B[i] = pm(A[i]);
}
}
clock_t finish = clock();
parallel[k][l-1] = (double) (finish - start) /\
CLOCKS_PER_SEC * 1000.0;
start = clock();
long int C[num], D[num];
for (i=0; i<num; i++) C[i] = i;
for (i=0; i<num; i++){
for (j=0; j<inum[k]; j++) {
D[i] = pm(C[i]);
}
}
finish = clock();
serial[k][l-1] = (double) (finish - start) /\
CLOCKS_PER_SEC * 1000.0;
}
}
"""Experiment End"""
for (j=0; j<iterT; j++) {
for (i=0; i<iter; i++) {
outdata << inum[i] << " " << j + 1 << " " << serial[i][j]\
<< " " << parallel[i][j]<< std::endl;
}
}
outdata.close();
return 0;
}
The link below is a plot of log(T) against log(N) for each thread count.
A plot of the run times for varying number of threads and magnitude of computational task.
(I just noticed that the legend labels for serial and parallel are the wrong way around).
As you can see, using more than one thread increases the time greatly, and adding more threads increases the time taken roughly linearly with the number of threads.
Can anyone tell me what's going on?
Thanks!
Freakish above was correct about the pm() function doing nothing, so the compiler was optimizing the work away.
It also turns out that the rand() function does not play well within OpenMP for loops.
By adding the function sqrt(i) (i being the loop index) as real work inside the loop, I achieved the expected speedup in my code; a sketch of that kind of change is shown below.
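A minimal sketch of that kind of change, adapted from the experiment loop above (my own adaptation, with a reduction added so the result of the real work is kept; variable names are illustrative):

#include <cmath>
#include <cstdio>
#include <omp.h>

int main() {
    const long num = 1000;
    const long N = 100000;              // inner iteration count, i.e. inum[k]
    long B[num];
    double t0 = omp_get_wtime();
    for (long i = 0; i < num; i++) {
        double acc = 0.0;
        #pragma omp parallel for schedule(static) reduction(+ : acc)
        for (long j = 0; j < N; j++) {
            acc += sqrt((double)j);     // real work the compiler cannot discard
        }
        B[i] = (long)acc;               // the result is stored, not thrown away
    }
    double t1 = omp_get_wtime();
    printf("parallel time: %f s, B[0] = %ld\n", t1 - t0, B[0]);
    return 0;
}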

Speed of Comparison operators in C++

Today I ran into a problem:
I have to read data from a file; the file contains a lot of test cases and looks like
N
N lines followed..
...
...
So I used while(scanf("%d", &n) && n!=-1), but it took me more than 5 s to read all the data. However, when I changed it to while(scanf("%d", &n) && n>-1), it took just 800 ms to read all the data. So I suppose there is a difference in speed between comparison operators in C++; can anyone give me the order?
PS: my compiler is GCC 5.1.0
OK, let me show more details of this problem.
The problem is here: http://acm.hdu.edu.cn/showproblem.php?pid=1171
Code with not equal is here: https://github.com/kimixuchen/codesnap/blob/master/greater
Code with greater is here: https://github.com/kimixuchen/codesnap/blob/master/not_equal
The question is about comparison, not reading files or badly formulated conditions, so let's test the comparison only. Update: tested with the /O2 optimization option.
#include <ctime>
#include <cstdlib>
#include <iostream>
int main()
{
const int testCases = 10000000;
const int iterations = 100;
srand(time(NULL));
int * A = new int[testCases];
bool *B = new bool[testCases];
freopen("output.txt", "w", stdout);
for (int i = 0; i < testCases; i++)
{
A[i] = rand() % 100;
}
clock_t begin = clock();
for (int j = 0; j < iterations; j++)
for (int i = 0; i < testCases; i++)
{
B[i] = A[i] != -1;
}
clock_t end = clock();
double elapsed_secs = end - begin;
std::cout << "Elapsed time using != - " << elapsed_secs << std::endl;
//Getting new random numbers for clean test
for (int i = 0; i < testCases; i++)
{
A[i] = rand() % 100;
}
begin = clock();
for (int j = 0; j < iterations; j++)
for (int i = 0; i < testCases; i++)
{
B[i] = A[i] > -1;
}
end = clock();
elapsed_secs = end - begin;
std::cout << "Elapsed time using > - " << elapsed_secs << std::endl;
return 0;
}
Results for 5 tests (in ticks):
'!=': 1005 994 1015 1009 1019
'>': 1006 1004 1004 1005 1035
Conclusion: there is no significant difference in a program optimized for speed.