Why is serial execution taking less time than parallel? [duplicate] - c++

This question already has answers here:
OpenMP time and clock() give two different results
(3 answers)
Closed 3 years ago.
I have to add two vectors and compare serial performance against parallel performance.
However, my parallel code seems to take longer to execute than the serial code.
Could you please suggest changes to make the parallel code faster?
#include <iostream>
#include <time.h>
#include "omp.h"

#define ull unsigned long long
using namespace std;

void parallelAddition (ull N, const double *A, const double *B, double *C)
{
    ull i;

    #pragma omp parallel for shared (A,B,C,N) private(i) schedule(static)
    for (i = 0; i < N; ++i)
    {
        C[i] = A[i] + B[i];
    }
}

int main(){
    ull n = 100000000;
    double* A = new double[n];
    double* B = new double[n];
    double* C = new double[n];
    double time_spent = 0.0;

    for(ull i = 0; i < n; i++)
    {
        A[i] = 1;
        B[i] = 1;
    }

    //PARALLEL
    clock_t begin = clock();
    parallelAddition(n, &A[0], &B[0], &C[0]);
    clock_t end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    cout << "time elapsed in parallel : " << time_spent << endl;

    //SERIAL
    time_spent = 0.0;
    for(ull i = 0; i < n; i++)
    {
        A[i] = 1;
        B[i] = 1;
    }
    begin = clock();
    for (ull i = 0; i < n; ++i)
    {
        C[i] = A[i] + B[i];
    }
    end = clock();
    time_spent += (double)(end - begin) / CLOCKS_PER_SEC;
    cout << "time elapsed in serial : " << time_spent;
    return 0;
}
These are the results:
time elapsed in parallel : 0.824808
time elapsed in serial : 0.351246
I've read on another thread that there are factors like the spawning of threads and allocation of resources, but I don't know what to do to get the expected result.
EDIT:
Thanks! @zulan's and @Daniel Langr's answers actually helped!
I used omp_get_wtime() instead of clock().
It turns out that clock() measures the cumulative CPU time of all threads, whereas omp_get_wtime() measures the wall-clock time elapsed between two arbitrary points.
This answer too answers this query pretty well: https://stackoverflow.com/a/10874371/4305675
Here's the fixed code:
void parallelAddition (ull N, const double *A, const double *B, double *C)
{
    ....
}

int main(){
    ....

    //PARALLEL
    double begin = omp_get_wtime();
    parallelAddition(n, &A[0], &B[0], &C[0]);
    double end = omp_get_wtime();
    time_spent += (double)(end - begin);
    cout << "time elapsed in parallel : " << time_spent << endl;

    ....

    //SERIAL
    begin = omp_get_wtime();
    for (ull i = 0; i < n; ++i)
    {
        C[i] = A[i] + B[i];
    }
    end = omp_get_wtime();
    time_spent += (double)(end - begin);
    cout << "time elapsed in serial : " << time_spent;
    return 0;
}
RESULT AFTER CHANGES:
time elapsed in parallel : 0.204763
time elapsed in serial : 0.351711

There are multiple factors that influence your measurements:
Use omp_get_wtime() as @zulan suggested; otherwise you may actually be measuring combined CPU time instead of wall time.
Threading has some overhead and typically does not pay off for short calculations. You may want to use a higher n.
"Touch" the data in the C array before running parallelAddition. Otherwise, the memory pages are only allocated by the OS inside parallelAddition. An easy fix since C++11: double* C = new double[n]{};.
I tried your program for n being 1G and the last change reduced runtime of parallelAddition from 1.54 to 0.94 [s] for 2 threads. Serial version took 1.83 [s], therefore, the speedup with 2 threads was 1.95, which was pretty close to ideal.
Other considerations:
Generally, if you profile something, make sure that the program has some observable effect. Otherwise, a compiler may optimize a lot of code away. Your array addition has no observable effect.
Add some form of restrict qualifier to the C parameter (a compiler extension such as __restrict__; standard C++ has no restrict keyword). Without it, a compiler might not be able to apply vectorization.
If you are on a multi-socket system, take care about affinity of threads and NUMA effects. On my dual-socket system, runtime of a parallel version for 2 threads took 0.94 [s] (as mentioned above) when restricting threads to a single NUMA node (numactl -N 0 -m 0). Without numactl, it took 1.35 [s], thus 1.44 times more.

Related

How to add two arrays using openMP?

I am trying to parallelize this: c[i]=a[i]+b[i]
By using C program I am getting:
Elapsed time = 1667417 nanoseconds
with OpenMP I get:
Elapsed time = 8673966 nanoseconds
I don't clearly understand why this is happening and what needs to be done to parallelize this code. I assume that since it is a very simple addition, parallelism is not being exploited here, but I would like to know the correct reason and any other way to parallelize this addition effectively. I also tried dynamic and guided schedules with various chunk sizes, but the results are more or less the same.
#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <omp.h>

#define N 100
#define BILLION 1000000000L

int main (int argc, char *argv[])
{
    int i;
    float a[N], b[N], c[N];
    uint64_t diff; /* Elapsed time */
    struct timespec start, end;

    /* Some initializations */
    #pragma omp parallel for schedule(static,10) num_threads(4)
    for (i = 0; i < N; i++){
        a[i] = b[i] = i * 1.0;
    }

    /* add two arrays */
    clock_gettime(CLOCK_MONOTONIC, &start); /* mark start time */
    #pragma omp parallel for schedule(static) num_threads(4)
    for (i = 0; i < N; i++){
        c[i] = a[i] + b[i];
        printf("Thread number:%d,c[%d]= %f\n", omp_get_thread_num(), i, c[i]);
    }
    clock_gettime(CLOCK_MONOTONIC, &end); /* mark the end time */

    diff = BILLION * (end.tv_sec - start.tv_sec) + end.tv_nsec - start.tv_nsec;
    printf("\nElapsed time = %llu nanoseconds\n", (long long unsigned int) diff);
    return 0;
}

c++ thread creation big overhead

I have the following code, which confuses me a lot:
float OverlapRate(cv::Mat& model, cv::Mat& img) {
    if ((model.rows != img.rows) || (model.cols != img.cols)) {
        return 0;
    }
    cv::Mat bgr[3];
    cv::split(img, bgr);
    int counter = 0;
    float b_average = 0, g_average = 0, r_average = 0;
    for (int i = 0; i < model.rows; i++) {
        for (int j = 0; j < model.cols; j++) {
            if (model.at<uchar>(i,j) == 255) {
                counter++;
                b_average += bgr[0].at<uchar>(i, j);
                g_average += bgr[1].at<uchar>(i, j);
                r_average += bgr[2].at<uchar>(i, j);
            }
        }
    }
    b_average = b_average / counter;
    g_average = g_average / counter;
    r_average = r_average / counter;

    counter = 0;
    float b_stde = 0, g_stde = 0, r_stde = 0;
    for (int i = 0; i < model.rows; i++) {
        for (int j = 0; j < model.cols; j++) {
            if (model.at<uchar>(i,j) == 255) {
                counter++;
                b_stde += std::pow((bgr[0].at<uchar>(i, j) - b_average), 2);
                g_stde += std::pow((bgr[1].at<uchar>(i, j) - g_average), 2);
                r_stde += std::pow((bgr[2].at<uchar>(i, j) - r_average), 2);
            }
        }
    }
    b_stde = std::sqrt(b_stde / counter);
    g_stde = std::sqrt(g_stde / counter);
    r_stde = std::sqrt(r_stde / counter);
    return (b_stde + g_stde + r_stde) / 3;
}

void work(cv::Mat& model, cv::Mat& img, int index, std::map<int, float>& results){
    // Note: concurrent insertion into a shared std::map from several threads
    // is a data race; the map needs a mutex or pre-inserted keys.
    results[index] = OverlapRate(model, img);
}

int OCR(cv::Mat& a, std::map<int,cv::Mat>& b, const std::vector<int>& possible_values)
{
    int recog_value = -1;
    clock_t start = clock();
    std::thread threads[10];
    std::map<int, float> results;
    for (int i = 0; i < 10; i++)
    {
        threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
    }
    for (int i = 0; i < 10; i++)
        threads[i].join();

    float min_score = 1000;
    int min_index = -1;
    for (auto& it : results)
    {
        if (it.second < min_score) {
            min_score = it.second;
            min_index = it.first;
        }
    }
    clock_t end = clock();
    clock_t t = end - start;
    printf("It took me %ld clicks (%f seconds).\n", (long)t, ((float)t)/CLOCKS_PER_SEC);
    recog_value = min_index;
    return recog_value;  // was missing: OCR is declared to return int
}
What the above code does is just simple optical character recognition. I have one optical character as an input and compare it with 0 - 9 ten standard character models to get the most similar one, and then output the recognized value.
When I execute the above code without the ten threads running at the same time, it takes 7 ms. But when I use ten threads, it takes 1 or 2 seconds for a single optical character recognition.
What is the reason? The debug information shows that thread creation consumes most of the time, in this line:
threads[i] = std::thread(work, std::ref(b[i]), std::ref(a), i, std::ref(results));
Why? Thanks.
Running multiple threads is useful in only two contexts: you have multiple hardware cores (so the threads can run simultaneously), or each thread is waiting for I/O (so one thread can run while another waits for a disk load or network transfer).
Your code is not I/O bound, so I hope you have 10 cores to run it on. If you don't, then each thread will be competing for scarce resources, and the scarcest resource of all is L1 cache space. If all 10 threads are fighting over 1 or 2 cores and their cache space, the caches will "thrash" and give you 10-100x slower performance.
Try benchmarking your code 10 different times, with N = 1 to 10 threads, and see how it performs.
(There is one more reason to have multiple threads: when the cores support hyper-threading. The OS will "pretend" that 1 core has 2 virtual processors, but you don't get 2x performance from this; you get something between 1x and 2x. To get this partial boost, you have to run 2 threads per core.)
It is not always efficient to use threads. On a small problem, managing the threads costs more time and resources than solving the problem itself. You must have enough work for the threads and manage its distribution well.
If you want to know how many threads you can use on a problem, or how big the problem must be, look up the isoefficiency functions (psi1, psi2, psi3) from the theory of parallel computers.

openMp optimisation of dynamic array access

I am trying to measure the speedup in parallel section using one or four threads. As my parallel section is relatively simple, I expect a near-to-fourfold speedup. ( This is following my question:
openMp: severe perfomance loss when calling shared references of dynamic arrays )
As my parallel section runs only twice as fast on four cores compared to one, I believe I have still not found the reason for the performance loss.
I want to parallelise my function iter as well as possible. The function uses entries of dynamic arrays and private quantities to change the entries of other dynamic arrays. Because every iteration step only uses the array entries of the respective loop step, different threads never access the same array entry. Furthermore, I put some thought into false sharing due to accessing entries in the same cache line. My guess is that this is a minor effect, as my double arrays are 5*10^5 long and, by choosing a reasonable chunk size for the schedule(dynamic,chunk) clause, I don't expect the very few entries in a given cache line to be accessed at the same time by different threads. In my simulation I have about 80 such arrays, so allocating them on the stack is not comfortable, and making private copies for every thread is out of the question too.
Does anybody have an idea, how to improve this? I want to fully understand why this is so slow, before starting with compiler optimisations.
What also surprised me was: calling iter(parallel), with parallel = false, is slower than calling it with parallel = true and omp_set_num_threads(1).
main.cpp:
int main(){
    mathClass m;
    m.fillArrays();
    double timeCount = 0.0;
    for(int j = 0; j < 1000; j++){
        timeCount += m.iter(true);
    }
    printf("total time = %f s\n", timeCount); // omp_get_wtime() returns seconds, summed over 1000 calls
    return 0;
}
mathClass.h:
class mathClass{
private:
    double* A;
    double* B;
    double* C;
    int length;
public:
    double* D;
    mathClass();
    double iter(bool parallel);
    void fillArrays();
};
mathClass.cpp:
mathClass::mathClass(){
    length = 5000000;
    A = new double[length];
    B = new double[length];
    C = new double[length];
    D = new double[length];
}

void mathClass::fillArrays(){
    int temp;
    for (int i = 0; i < length; i++){
        temp = rand() % 100;
        A[i] = double(temp);
        temp = rand() % 100;
        B[i] = double(temp);
        temp = rand() % 100;
        C[i] = double(temp);
    }
}

double mathClass::iter(bool parallel){
    double startTime;
    double endTime;
    omp_set_num_threads(4);
    startTime = omp_get_wtime();

    #pragma omp parallel if(parallel)
    {
        int alpha; // private in all threads
        #pragma omp for schedule(static)
        for (int i = 0; i < length; i++){
            alpha = 15*A[i];
            D[i] = C[i]*alpha + B[i]*alpha*alpha;
        }
    }

    endTime = omp_get_wtime();
    return endTime - startTime;
}

How to calculate GFLOPs for a function in a C++ program?

I have C++ code that computes the factorial of an int, the sum of two floats, and the execution time of each function, as follows:
long Sample_C::factorial(int n)
{
    long fact = 1;
    for (int counter = 1; counter <= n; counter++)
    {
        fact = fact * counter;
    }
    Sleep(100);
    return fact;
}

float Sample_C::add(float a, float b)
{
    return a+b;
}

int main(){
    Sample_C object;

    clock_t start = clock();
    object.factorial(6);
    clock_t end = clock();
    double time = (double)(end - start); // execution time of factorial()
    cout << time;

    clock_t starts = clock();
    object.add(1.1, 5.5);
    clock_t ends = clock();
    double total_time = (double)(ends - starts); // execution time of add()
    cout << total_time;
    return 0;
}
Now, I want to measure GFLOPs for the add function, so kindly suggest how I would calculate it. As I am completely new to GFLOPs, please also tell me whether GFLOPs can be calculated for functions that use only float data types, and whether the GFLOPs value varies between functions.
If I was interested in estimating the execution time of the addition operation I might start with the following program. However, I would still only trust the number this program produced to within a factor of 10 to 100 at best (i.e. I don't really trust the output of this program).
#include <iostream>
#include <ctime>

int main (int argc, char** argv)
{
    // Declare these as volatile so the compiler (hopefully) doesn't
    // optimise them away.
    volatile float a = 1.0;
    volatile float b = 2.0;
    volatile float c;

    // Perform the calculation multiple times to account for a clock()
    // implementation that doesn't have a sufficient timing resolution to
    // measure the execution time of a single addition.
    const int iter = 1000;

    // Estimate the execution time of adding a and b and storing the
    // result in the variable c.
    // Depending on the compiler we might need to count this as 2 additions
    // if we count the loop variable.
    clock_t start = clock();
    for (unsigned i = 0; i < iter; ++i)
    {
        c = a + b;
    }
    clock_t end = clock();

    // Write the time for the user
    std::cout << (end - start) / ((double) CLOCKS_PER_SEC * iter)
              << " seconds" << std::endl;
    return 0;
}
If you knew how your particular architecture was executing this code you could then try and estimate FLOPS from the execution time but the estimate for FLOPS (on this type of operation) probably wouldn't be very accurate.
An improvement to this program might be to replace the for loop with a macro implementation or ensure your compiler expands for loops inline. Otherwise you may also be including the addition operation for the loop index in your measurement.
I think it is likely that the error wouldn't scale linearly with problem size. For example if the operation you were trying to time took 1e9 to 1e15 times longer you might be able to get a decent estimate for GFLOPS. But, unless you know exactly what your compiler and architecture are doing with your code I wouldn't feel confident trying to estimate GFLOPS in a high level language like C++, perhaps assembly might work better (just a hunch).
I'm not saying it can't be done, but for accurate estimates there are a lot of things you might need to consider.

C++ clock stays zero

I'm trying to get the elapsed time of my program. I thought I should use clock() from time.h, but it stays zero through all phases of the program, even though I'm adding 10^5 numbers (that must consume some CPU time). I've already searched for this problem, and it seems only people running Linux have this issue. I'm running Ubuntu 12.04 LTS.
I'm going to compare AVX and SSE instructions, so using time_t is not really an option. Any hints?
Here is the code:
//Dimension of Arrays
unsigned int N = 100000;

//Fill two arrays with random numbers
unsigned int a[N];
clock_t start_of_programm = clock();
for(int i = 0; i < N; i++){
    a[i] = i;
}
clock_t after_init_of_a = clock();

unsigned int b[N];
for(int i = 0; i < N; i++){
    b[i] = i;
}
clock_t after_init_of_b = clock();

//Add the two arrays with Standard
unsigned int out[N];
for(int i = 0; i < N; i++)
    out[i] = a[i] + b[i];
clock_t after_add = clock();

cout << "start_of_programm " << start_of_programm << endl; // prints
cout << "after_init_of_a " << after_init_of_a << endl; // prints
cout << "after_init_of_b " << after_init_of_b << endl; // prints
cout << "after_add " << after_add << endl; // prints

cout << endl << "CLOCKS_PER_SEC " << CLOCKS_PER_SEC << endl;
And the output of the console. I also used printf() with %d, with no difference.
start_of_programm 0
after_init_of_a 0
after_init_of_b 0
after_add 0
CLOCKS_PER_SEC 1000000
clock does indeed return the CPU time used, but the granularity is on the order of 10 Hz. So if your code doesn't take more than 100 ms, you will get zero. And unless it's significantly longer than 100 ms, you won't get a very accurate value, because your error margin will be around 100 ms.
So, increasing N or using a different method to measure time would be your choices. std::chrono will most likely produce a more accurate timing (but it will measure "wall-time", not CPU-time).
timespec t1, t2;
clock_gettime(CLOCK_REALTIME, &t1);
... do stuff ...
clock_gettime(CLOCK_REALTIME, &t2);
double t = timespec_diff(t2, t1);

double timespec_diff(timespec t2, timespec t1)
{
    double d1 = t1.tv_sec + t1.tv_nsec / 1000000000.0;
    double d2 = t2.tv_sec + t2.tv_nsec / 1000000000.0;
    return d2 - d1;
}
The simplest way to get the time is to just use a stub function from OpenMP. This works on MSVC, GCC, and ICC. With MSVC you don't even need to enable OpenMP. With ICC you can link just the stubs if you like (-openmp-stubs). With GCC you have to use -fopenmp.
#include <omp.h>
double dtime;
dtime = omp_get_wtime();
foo();
dtime = omp_get_wtime() - dtime;
printf("time %f\n", dtime);
First, the compiler is very likely to optimize your code away. Check your compiler's optimization options.
Since the arrays out[], a[] and b[] are not used by the subsequent code, and no value from them is ever output, the compiler may optimize the following code block away entirely, as if it never executed:
for(int i = 0; i < N; i++){
    a[i] = i;
}
for(int i = 0; i < N; i++){
    b[i] = i;
}
for(int i = 0; i < N; i++)
    out[i] = a[i] + b[i];
Since the clock() function returns CPU time, the above code consumes almost no time after optimization.
One more thing: set N to a bigger value. 100000 is too small for a performance test; nowadays computers run O(n) code at the 100000 scale very fast.
unsigned int N = 10000000;
Add this to the end of the code
int sum = 0;
for(int i = 0; i < N; i++)
    sum += out[i];
cout << sum;
Then you will see the times.
Since you don't use a[], b[] and out[], the compiler ignores the corresponding for loops. This is because of compiler optimization.
Also, to see the time taken without optimization, build in debug mode instead of release.