I am trying to parallelize my C++ neural network training with OpenMP, but it isn't working.
So I wrote a simple C++ program with nested loops to test OpenMP.
However, it is much slower with OpenMP multithreading than with a single thread.
Did I do something wrong that makes it slower, or am I missing something?
System
macOS, 4 cores
Language
C++
Time functions
I used both high_resolution_clock::now() and omp_get_wtime().
std::chrono::high_resolution_clock::now();
single thread cost time: 0.00000000000000
2 threads cost time: 0.00010013580322
4 threads cost time: 0.00016403198242
6 threads cost time: 0.00017309188843
8 threads cost time: 0.00112605094910
10 threads cost time: 0.00013613700867
12 threads cost time: 0.00082898139954
omp_get_wtime();
single thread cost time: 0.00000005900000
2 threads cost time: 0.00009907600000
4 threads cost time: 0.00018207300000
6 threads cost time: 0.00014479500000
8 threads cost time: 0.00070604400000
10 threads cost time: 0.00057277700000
12 threads cost time: 0.00074358000000
Code
#include <iostream>
#include <omp.h>
#include <chrono>
#include <iomanip>
using namespace std;
void test() {
int j = 0;
for (int i = 0; i < 100000; i++) {
// do something to kill time...
j++;
}
}
int main()
{
auto startTime = chrono::high_resolution_clock::now();
auto endTime = chrono::high_resolution_clock::now();
// without openMp
startTime = chrono::high_resolution_clock::now();
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
chrono::duration<double> diff = endTime - startTime;
cout << setprecision(14) << fixed;
cout << "single thread cost time: " << diff.count() << endl;
// 2 threads
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(2)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "2 threads cost time: " << diff.count() << endl;
// 4 threads
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(4)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "4 threads cost time: " << diff.count() << endl;
// 6 threads
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(6)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "6 threads cost time: " << diff.count() << endl;
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(8)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "8 threads cost time: " << diff.count() << endl;
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(10)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "10 threads cost time: " << diff.count() << endl;
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(12)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "12 threads cost time: " << diff.count() << endl;
// system("pause");
return 0;
}
How I compile the code
clang++ -std=c++11 -Xpreprocessor -fopenmp parallel.cpp -O3 -o parallel -lomp
Update
Hi guys, the previous problem has been solved; I think I should not hard-code the thread count with num_threads().
But when I use OpenMP to accelerate my neural network, it takes longer.
Data size
MNIST dataset, 60000 each epoch
Time Function
omp_get_wtime()
Single thread result
***** train epoch 1.
Batch count: 6000.
batch size: 10.
Progress: 5999/6000.
train time is ... 64.7082.
Accuracy: 97.72% 9772/10000.
predict time is ... 3.51836.
Releasing Data Samples...
Releasing Neural Network...
Result with OpenMP
***** train epoch 1.
Batch count: 6000.
batch size: 10.
Progress: 5999/6000.
train time is: 247.615.
Accuracy: 97.72% 9772/10000.
predict time is: 30.739.
Code using parallel for
#pragma omp parallel for
for (int k = 0; k < size; k++) {
layer->map[i].data[k] = activation_func::tan_h(layer->map_common[k] + layer->map[i].b);
// cout << "current thread: " << omp_get_thread_num() << endl;
}
Code using parallel for and omp critical
for (int k = 0; k < layer->map_count; k++) {
for (int i = 0; i < map_h; i++) {
for (int j = 0; j < map_w; j++) {
double max_value = prev_layer->map[k].data[2*i*upmap_w + 2*j];
for (int n = 2*i; n < 2*(i + 1); n++) {
#pragma omp parallel for
for (int m = 2*j; m < 2*(j + 1); m++) {
#pragma omp critical
max_value = MAX(max_value, prev_layer->map[k].data[n*upmap_w + m]);
}
}
layer->map[k].data[i*map_w + j] = activation_func::tan_h(max_value);
}
}
}
I am trying to parallelize my C++ neural network training with OpenMP. But
it isn't working. So I wrote a simple C++ program with
nested loops to test OpenMP.
I see this quite often: introducing OpenMP into a code, or parallelism for that matter, will not magically make your code faster.
Why? Because of many factors, but (in your context) mainly because the work done in parallel must be large enough to overcome the overhead of the parallelism (e.g., thread creation and synchronization). To do that, you need to increase the size/number of the parallel tasks.
Another issue is with the way you are benchmarking the code:
Your parallel task:
void test() {
int j = 0;
for (int i = 0; i < 100000; i++) {
// do something to kill time...
j++; <---- Not enough work done in parallel
}
};
In the sequential version the compiler can easily deduce that j = 100000;. Moreover, because you are not doing anything with that value (i.e., j), the compiler can actually optimize the entire call to the test() function away. Hence, as pointed out in the comments:
Your test loop doesn't really do anything, so the compiler might be
removing it. Then the time you get would be mostly the time spent
creating threads. – 1201ProgramAlarm
and
The test function should return the value and your code should print
it somewhere. As @1201ProgramAlarm has said, the compiler might detect
that you're just wasting compute time and remove the loop. – Michael
Klemm
Furthermore, instead of having the following block of code:
// 2 threads
startTime = chrono::high_resolution_clock::now();
#pragma omp parallel for num_threads(2)
for (int i = 0; i < 100000; i++) {
test();
}
endTime = chrono::high_resolution_clock::now();
diff = endTime - startTime;
cout << "2 threads cost time: " << diff.count() << endl;
replicated a bunch of times, it would have been better to write it once and change the number of threads from the outside via the environment variable OMP_NUM_THREADS.
Regarding your update:
for (int k = 0; k < layer->map_count; k++) {
for (int i = 0; i < map_h; i++) {
for (int j = 0; j < map_w; j++) {
double max_value = prev_layer->map[k].data[2*i*upmap_w + 2*j];
for (int n = 2*i; n < 2*(i + 1); n++) {
#pragma omp parallel for
for (int m = 2*j; m < 2*(j + 1); m++) {
#pragma omp critical
max_value = MAX(max_value, prev_layer->map[k].data[n*upmap_w + m]);
}
}
layer->map[k].data[i*map_w + j] = activation_func::tan_h(max_value);
}
}
}
that critical section basically makes the code sequential; actually worse than sequential, because there is the additional overhead of the locking mechanism.
Instead of #pragma omp critical you should use an OpenMP reduction, which is meant exactly for this kind of situation. Moreover, you can try to parallelize the for (int n = 2*i; n < 2*(i + 1); n++) loop instead:
for (int k = 0; k < layer->map_count; k++) {
for (int i = 0; i < map_h; i++) {
for (int j = 0; j < map_w; j++) {
double max_value = prev_layer->map[k].data[2*i*upmap_w + 2*j];
#pragma omp parallel for reduction(max: max_value)
for (int n = 2*i; n < 2*(i + 1); n++) {
for (int m = 2*j; m < 2*(j + 1); m++) {
max_value = MAX(max_value, prev_layer->map[k].data[n*upmap_w + m]);
}
}
layer->map[k].data[i*map_w + j] = activation_func::tan_h(max_value);
}
}
}
A side note; personally, and please don't take it the wrong way, I think you should spend more time learning the basics of multithreading and OpenMP before trying to parallelize code blindly.
Please don't keep appending updates with new questions to the original question; just create a new question instead.
Related
I am basically writing code to count whether a pair sum is even (among all pairs from 1 to 100000). I wrote one version using pthreads and one without, but the pthreads version takes more time than the serial one. Here is my serial code:
#include<bits/stdc++.h>
using namespace std;
int main()
{
long long sum = 0, count = 0, n = 100000;
auto start = chrono::high_resolution_clock::now();
for(int i = 1; i <= n; i++)
for(int j = i-1; j >= 0; j--)
{
sum = i + j;
if(sum%2 == 0)
count++;
}
cout<<"count is "<<count<<endl;
auto end = chrono::high_resolution_clock::now();
double time_taken = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
time_taken *= 1e-9;
cout << "Time taken by program is : " << fixed << time_taken << setprecision(9)<<" secs"<<endl;
return 0;
}
and here is my parallel code
#include<bits/stdc++.h>
using namespace std;
#define MAX_THREAD 3
long long cnt[5] = {0};
long long n = 100000;
int work_per_thread;
int start[] = {1, 60001, 83001, 100001};
void *count_array(void* arg)
{
int t = *((int*)arg);
long long sum = 0;
for(int i = start[t]; i < start[t+1]; i++)
for(int j = i-1; j >=0; j--)
{
sum = i + j;
if(sum%2 == 0)
cnt[t]++;
}
cout<<"thread"<<t<<" finished work "<<cnt[t]<<endl;
return NULL;
}
int main()
{
pthread_t threads[MAX_THREAD];
int arr[] = {0,1,2};
long long total_count = 0;
work_per_thread = n/MAX_THREAD;
auto start = chrono::high_resolution_clock::now();
for(int i = 0; i < MAX_THREAD; i++)
pthread_create(&threads[i], NULL, count_array, &arr[i]);
for(int i = 0; i < MAX_THREAD; i++)
pthread_join(threads[i], NULL);
for(int i = 0; i < MAX_THREAD; i++)
total_count += cnt[i];
cout << "count is " << total_count << endl;
auto end = chrono::high_resolution_clock::now();
double time_taken = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
time_taken *= 1e-9;
cout << "Time taken by program is : " << fixed << time_taken << setprecision(9)<<" secs"<<endl;
return 0;
}
In the parallel code I create three threads: the 1st thread does its computation from 1 to 60000, the 2nd from 60001 to 83000, and so on. I chose these split points so that each thread does approximately the same number of computations. The parallel execution takes 10.3 s, whereas the serial one takes 7.7 s. I have 6 cores with 2 threads per core. I also used the htop command to check whether the required number of threads was running, and that seems fine. I don't understand where the problem is.
All the cores in the threaded version compete for cnt[].
Use a local counter inside the loop and copy the result into cnt[t] after the loop finishes.
I have several matrices that I want to multiply in C++ while allowing vectorization. However, the following code has a large execution time, ~858146125 ns. How do I modify the code so that the matrix multiplication is vectorized and the execution time drops to around 100 ns?
I am using the -O3 flag.
const int ROWS = 1000;
const int COLS = 1000;
const int ROWS1 = 1000;
const int COLS1 = 1000;
const int l = 1000;
double random_matrix[ROWS][COLS];
double random_matrix1[ROWS1][COLS1];
double mult[l][l];
int i;
int j;
/* generate number: */
for (i = 0; i < ROWS; i++) {
for (j = 0; j < COLS; j++)
random_matrix[i][j] = i + j;
}
for (i = 0; i < ROWS1; i++) {
for (j = 0; j < COLS1; j++)
random_matrix1[i][j] = i + j;
}
auto start = std::chrono::steady_clock::now();
for (size_t row = 0; row < ROWS; ++row) {
for (size_t tmp = 0; tmp < COLS1; ++tmp) {
mult[row][tmp] = random_matrix[row][0]*random_matrix1[0][tmp];
for (size_t col = 1; col < COLS; ++col) {
mult[row][tmp] += random_matrix[row][col] * random_matrix1[col][tmp];
}
}
}
auto end = std::chrono::steady_clock::now();
std::cout << "Elapsed time in nanoseconds : "
<< std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count()
<< " ns" << std::endl;
std::cout<<"\n";
for (i=0;i<ROWS;i++)
{
for (j=0;j<COLS1;j++)
std::cout << mult[i][j] <<std::endl; //display table
std::cout<<"\n";
}
I'm afraid you'll never get to 100 ns total execution time with these matrix sizes, with or without vectorization. Multiplying two 1000 x 1000 matrices takes on the order of 1000^3 = 1,000,000,000 multiply-add operations. That is one billion operations.
Secondly, if performance matters that much to you, you should NOT write your own code for these low-level mathematical primitives. There are optimized C++ libraries that will perform these operations for you, such as Eigen or a BLAS implementation (Intel MKL is one package that implements BLAS).
By using one of these packages you not only get much better performance, but you also avoid the potential pitfalls and bugs you would likely run into otherwise.
I have been trying to learn to use OpenMP. However, my code seems to run more quickly in serial than in parallel.
Indeed, the more threads I use, the slower the computation.
To illustrate this I ran an experiment. I am trying to do the following operation:
long int C[num], D[num];
for (i=0; i<num; i++) C[i] = i;
for (i=0; i<num; i++){
for (j=0; j<N; j++) {
D[i] = pm(C[i]);
}
}
where the function pm is simply
int pm(int val) {
val++;
val--;
return val;
}
I implemented the inner loop in parallel and compared the run times as a function of the number of iterations on the inner loop (N) and the number of threads used. The code for the experiment is below.
#include <stdio.h>
#include <iostream>
#include <time.h>
#include "omp.h"
#include <fstream>
#include <cstdlib>
#include <cmath>
static long num = 1000;
using namespace std;
int pm(int val) {
val++;
val--;
return val;
}
int main() {
int i, j, k, l;
int iter = 8;
int iterT = 4;
long inum[iter];
for (i=0; i<iter; i++) inum[i] = pow(10, i);
double serial[iter][iterT], parallel[iter][iterT];
ofstream outdata;
outdata.open("output.dat");
if (!outdata) {
std::cerr << "Could not open file." << std::endl;
exit(1);
}
// Experiment Start
for (l=1; l<iterT+1; l++) {
for (k=0; k<iter; k++) {
clock_t start = clock();
long int A[num], B[num];
omp_set_num_threads(l);
for (i=0; i<num; i++) A[i] = i;
for (i=0; i<num; i++){
#pragma omp parallel for schedule(static)
for (j=0; j<inum[k]; j++) {
B[i] = pm(A[i]);
}
}
clock_t finish = clock();
parallel[k][l-1] = (double) (finish - start) /\
CLOCKS_PER_SEC * 1000.0;
start = clock();
long int C[num], D[num];
for (i=0; i<num; i++) C[i] = i;
for (i=0; i<num; i++){
for (j=0; j<inum[k]; j++) {
D[i] = pm(C[i]);
}
}
finish = clock();
serial[k][l-1] = (double) (finish - start) /\
CLOCKS_PER_SEC * 1000.0;
}
}
// Experiment End
for (j=0; j<iterT; j++) {
for (i=0; i<iter; i++) {
outdata << inum[i] << " " << j + 1 << " " << serial[i][j]\
<< " " << parallel[i][j]<< std::endl;
}
}
outdata.close();
return 0;
}
The link below is a plot of log(T) against log(N) for each thread count.
A plot of the run times for varying number of threads and magnitude of computational task.
(I just noticed that the legend labels for serial and parallel are the wrong way around).
As you can see, using more than one thread increases the time greatly. Adding more threads increases the time taken roughly linearly in the number of threads.
Can anyone tell me what's going on?
Thanks!
Freakish above was correct that the pm() function does nothing and the compiler was getting confused.
It also turns out that the rand() function does not play well within OpenMP for loops.
After adding the function sqrt(i) (i being the loop index), I achieved the expected speedup in my code.
Today I ran into a problem:
I have to read data from a file that contains a lot of test cases; it looks like
N
N lines followed..
...
...
So I used while(scanf("%d", &n) && n != -1), but it took more than 5 s to read all the data. However, when I changed it to while(scanf("%d", &n) && n > -1), it took only 800 ms. So I suppose there is a difference in speed between comparison operators in C++; can anyone give me the order?
PS: my compiler is GCC 5.1.0
OK, let me show more details of this problem.
The problem is here: http://acm.hdu.edu.cn/showproblem.php?pid=1171
Code with not-equal is here: https://github.com/kimixuchen/codesnap/blob/master/greater
Code with greater is here: https://github.com/kimixuchen/codesnap/blob/master/not_equal
The question is about the comparison, not about reading files or badly formulated conditions, so let's test the comparison only. Update: tested with the /O2 optimization option.
#include <ctime>
#include <cstdlib>
#include <iostream>
int main()
{
const int testCases = 10000000;
const int iterations = 100;
srand(time(NULL));
int * A = new int[testCases];
bool *B = new bool[testCases];
freopen("output.txt", "w", stdout);
for (int i = 0; i < testCases; i++)
{
A[i] = rand() % 100;
}
clock_t begin = clock();
for (int j = 0; j < iterations; j++)
for (int i = 0; i < testCases; i++)
{
B[i] = A[i] != -1;
}
clock_t end = clock();
double elapsed_secs = end - begin;
std::cout << "Elapsed time using != - " << elapsed_secs << std::endl;
//Getting new random numbers for clean test
for (int i = 0; i < testCases; i++)
{
A[i] = rand() % 100;
}
begin = clock();
for (int j = 0; j < iterations; j++)
for (int i = 0; i < testCases; i++)
{
B[i] = A[i] > -1;
}
end = clock();
elapsed_secs = end - begin;
std::cout << "Elapsed time using > - " << elapsed_secs << std::endl;
return 0;
}
Results for 5 tests (in ticks):
'!=': 1005 994 1015 1009 1019
'>': 1006 1004 1004 1005 1035
Conclusion: there is no significant difference in a program optimized for speed.
I've got a problem and a question.
I tried to do some matrix multiplication with OMP.
If I create the matrices a, b, and c with more than one thread, the column sizes aren't equal.
The problem remains even if I use critical for push_back.
I thought OMP divides the for loop into equal-sized pieces, so every thread should have its own column. Is the problem in push_back?
What is a good way to give every thread a vector?
And what is a good way to avoid shared-memory problems without critical and atomic, e.g., if I'm generating data and want to save it somewhere?
Thanks.
P.S. I am working on my English. It's far from perfect, so please bear with me.
#include "stdafx.h"
#include <omp.h>
#include <iostream>
#include <ctime>
#include <vector>
#define NRA 300 /* number of rows in matrix A */
#define NCA 300 /* number of columns in matrix A */
#define NCB 300 /* number of columns in matrix B */
int main(int argc, char *argv[])
{
int i, j, k, chunk;
std::vector < std::vector<int> > a;
a.resize(NRA);
std::vector < std::vector<int> > b;
b.resize(NCA);
std::vector < std::vector<int> > c;
c.resize(NRA);
/*
double a[NRA][NCA];
double b[NCA][NCB];
double c[NRA][NCB];
*/
chunk = 10;
std::clock_t start; // timing
double duration; // duration of the parallelized run
omp_set_num_threads(4);
#pragma omp parallel
{
#pragma omp for schedule (static, chunk)
for (i = 0; i < NRA; i++)
for (j = 0; j < NCA; j++)
a[i].push_back(i + j);
#pragma omp for schedule (static, chunk)
for (i = 0; i < NCA; i++)
for (j = 0; j < NCB; j++)
b[i].push_back(i*j);
#pragma omp for ordered schedule(static, chunk)
for (i = 0; i < NRA; i++)
for (j = 0; j < NCB; j++)
c[i].push_back(0);
}
for (int nthreads = 1; nthreads < 40; nthreads++)
{
start = std::clock();
omp_set_dynamic(0);
#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,j,k) num_threads(nthreads)
{
#pragma omp for schedule (static, chunk)
for ( i = 0; i < NRA; i++)
for (j = 0; j < NCB; j++)
c[i][j] = 0;
#pragma omp for ordered schedule (static, chunk)
for (i = 0; i < NRA; i++)
{
for ( j = 0; j < NCB; j++)
for (k = 0; k < NCA; k++)
c[i][j] += a[i][k] * b[k][j];
}
}
duration = (std::clock() - start) / (double)CLOCKS_PER_SEC;
// Time n threads needed
std::cout << "Time needed for " << nthreads << " threads was " << duration << " seconds." << std::endl;
}
std::cin.get();
}
push_back() definitely modifies the vector's metadata, especially its size, and it is not thread-safe. Try to resize() the inner vectors the way you do the outer ones (a, b, c), and then just assign the elements (a[i][j] = i + j; etc.) in the parallel run.
Since you know the final count of elements from the beginning, you can use plain arrays instead of vectors to minimize overhead.
int a[NRA][NCA];
int b[NCA][NCB];
int c[NRA][NCB];
I wonder why you’ve commented out the similar part of your code. ;-)