I'm learning OpenMP in C++ using gcc 8.1.0 and MinGW64 (latest version as of this month), and I'm running into a weird debug error when my program encounters a segmentation fault.
I know the cause of the crash: attempting to create too many OpenMP threads (50,000). But it's the error itself that has me puzzled. I didn't compile gcc or MinGW64 from source; I just used the installers, and I'm on Windows.
Why is it looking for cygwin.S, and why use that file structure on Windows? My code and the error message from gdb are below the closing.
I'm learning OpenMP in the process of programming a path tracer, and I think I have a workaround for the thread limit (using while (threads < runs) and letting OpenMP set the thread count automatically), but I am stumped as to the error. Is there a workaround or solution for this?
It works fine with ~10,000 threads. I know it's not actually creating 10,000 threads simultaneously, but it's what I was doing before I thought of the workaround.
Thank you for the heads up about rand() and thread safety. I ended up replacing my RNG code with some that appears to be working fine in OpenMP, and it's literally a night and day difference visually. I will try the other changes and report back. Thanks!
WOW! It runs so much faster and the image is artifact-free! Thank you!
Jadan Bliss
Final code:
#pragma omp parellel   // note: "parellel" is misspelled, so the compiler ignores this directive
for (j = options.height - 1; j >= 0; j--){
    for (i = 0; i < options.width; i++) {
        #pragma omp parallel for reduction(Vector3Add:col)
        for (int s = 0; s < options.samples; s++)
        {
            float u = (float(i) + scene_drand()) / float(options.width);
            float v = (float(j) + scene_drand()) / float(options.height);
            Ray r = cam.get_ray(u, v); // was: origin, lower_left_corner + u*horizontal + v*vertical);
            col += color(r, world, 0);
        }
        col /= real(options.samples);
        render.set(i, j, col);
        col = Vector3(0.0);
    }
}
Error:
Starting program: C:\Users\Jadan\Documents\CBProjects\learnOMP\bin\Debug\learnOMP.exe
[New Thread 22136.0x6620]
[New Thread 22136.0x80a8]
[New Thread 22136.0x8008]
[New Thread 22136.0x5428]

Thread 1 received signal SIGSEGV, Segmentation fault.
___chkstk_ms () at ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S:126
126     ../../../../../src/gcc-8.1.0/libgcc/config/i386/cygwin.S: No such file or directory.
Here are some remarks on your code.
Using a huge number of threads will not bring you any gain and is the probable cause of your problems. Thread creation has a time and a resource cost. The time cost means that thread creation will likely dominate your run time, so the parallel program ends up far slower than its sequential version. Concerning the resource cost, each thread has its own stack segment; its size is system dependent, but typical values are measured in MB. I do not know the characteristics of your system, but with tens of thousands of threads this is probably the reason why your code is crashing. I have no explanation for the message about cygwin.S, but after a stack overflow the behavior can be weird.
Threads are a means to parallelize code, and for data parallelism it is usually pointless to have more threads than the number of logical processors on your system. Let OpenMP set it, but you can experiment later to tune this number.
Besides that, there are other problems.
rand() is not thread safe, as it uses a global state that will be modified concurrently by the threads. rand_r() is thread safe, because the state of the random generator is passed explicitly and can be kept per thread.
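If the C++ <random> facilities are available, another option is to give every thread its own engine so that no state is shared at all; a minimal sketch, in which the seeding scheme and the averaging loop are purely illustrative:

#include <omp.h>
#include <random>
#include <iostream>

int main()
{
    const int runs = 100000;
    double result = 0.0;

    #pragma omp parallel
    {
        // One engine per thread: no shared state, hence no data race.
        // Seeding from the thread number is only for illustration.
        std::mt19937 gen(12345u + omp_get_thread_num());
        std::uniform_real_distribution<double> dist(0.0, 1.0);

        #pragma omp for reduction(+:result)
        for (int i = 0; i < runs; i++)
            result += dist(gen);
    }

    std::cout << "average = " << result / runs << std::endl;
    return 0;
}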
You should not modify a shared variable like result without atomic access, as concurrent accesses by several threads can lead to unexpected results. While safe, making every single update atomic is not a very efficient solution, though. Atomic accesses are expensive, and it is better to use a reduction that accumulates locally in every thread and performs a single atomic combination at the end.
#include <omp.h>
#include <iostream>
#include <random>
#include <time.h>

int main()
{
    int runs = 100000;
    double result = 0.0;

    #pragma omp parallel
    {
        // per-thread initialisation of the rand_r seed
        unsigned int rand_state = omp_get_thread_num() * time(NULL);
        // or whatever thread-dependent seed

        #pragma omp for reduction(+:result)
        for (int i = 0; i < runs; i++)
        {
            double d = double(rand_r(&rand_state)) / double(RAND_MAX);
            result += d;
        }
    }
    result /= double(runs);
    std::cout << "The computed average over " << runs << " runs was "
              << result << std::endl;
    return 0;
}
Without using OpenMP directives - serial execution (timing below)
Using OpenMP directives - parallel execution (timing below)
#include "stdafx.h"
#include <omp.h>
#include <iostream>
#include <time.h>
using namespace std;
static long num_steps = 100000;
double step;
double pi;
int main()
{
clock_t tStart = clock();
int i;
double x, sum = 0.0;
step = 1.0 / (double)num_steps;
#pragma omp parallel for shared(sum)
for (i = 0; i < num_steps; i++)
{
x = (i + 0.5)*step;
#pragma omp critical
{
sum += 4.0 / (1.0 + x * x);
}
}
pi = step * sum;
cout << pi <<"\n";
printf("Time taken: %.5fs\n", (double)(clock() - tStart) / CLOCKS_PER_SEC);
getchar();
return 0;
}
I have tried multiple times, and the serial execution is always faster. Why?
Serial execution time: 0.0200s
Parallel execution time: 0.02500s
Why is serial execution faster here? Am I calculating the execution time in the right way?
OpenMP implements multithreading internally for parallel processing, and the benefit of multithreading only shows up with a large volume of data. With a very small volume of data you cannot measure the performance of a multithreaded application. The reasons:
a) To create a thread, the OS needs to allocate memory for each thread, which takes time (even if only a tiny bit).
b) Multiple threads need context switching, which also takes time.
c) The memory allocated to the threads must be released, which also takes time.
d) It depends on the number of processors and the total memory (RAM) in your machine.
So when you run a small operation with multiple threads, its performance will be about the same as with a single thread (the OS by default assigns one thread to every process, called the main thread). Your outcome is therefore expected in this case. To measure the benefit of a multithreaded architecture, use a large amount of data with a complex operation; only then will you see the difference.
Because of your critical block you cannot update sum in parallel. Every time one thread reaches the critical section, all the other threads have to wait.
The smart approach is to create a temporary copy of sum for each thread that can be accumulated without synchronization, and afterwards to sum the results from the different threads.
OpenMP can do this automatically with the reduction clause. So your loop becomes:
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < num_steps; i++)
{
    double x = (i + 0.5) * step;   // x must be local (or declared private), otherwise its updates race
    sum += 4.0 / (1.0 + x * x);
}
On my machine this performs 10 times faster than the version using the critical block (I also increased num_steps to reduce the influence of one-time actions like thread-creation).
PS: I recommend you use <chrono>, <boost/timer/timer.hpp>, or Google Benchmark for timing your code.
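For instance, a minimal wall-clock measurement with <chrono> might look like the sketch below (the loop being timed is just a stand-in for your own code); steady_clock measures elapsed wall time, unlike clock(), which reports CPU time summed over all threads:

#include <chrono>
#include <iostream>

int main()
{
    auto start = std::chrono::steady_clock::now();

    double sum = 0.0;                                   // stand-in for the work being timed
    for (int i = 0; i < 100000; i++) sum += 4.0 / (1.0 + i * 1e-5);

    auto end = std::chrono::steady_clock::now();
    auto us = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
    std::cout << "sum = " << sum << ", took " << us << " microseconds\n";
    return 0;
}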
I have been studying C++ multithreading and have a question about it.
Here is my understanding of multithreading.
One of the reasons we use multithreading is to reduce the run time, right?
For example, I think if we use two threads we can expect half of the execution time.
So, I tried to write some code to prove it.
Here is the code.
#include <vector>
#include <iostream>
#include <thread>
#include <future>
#include <algorithm>   // std::for_each
#include <functional>  // std::mem_fn
#include <cstdlib>     // rand
#include <ctime>       // clock
using namespace std;

#define iterationNumber 1000000
void myFunction(const int index, const int numberInThread, promise<unsigned long>&& p, const vector<int>& numberList) {
    clock_t begin, end;
    int firstIndex = index * numberInThread;
    int lastIndex = firstIndex + numberInThread;
    vector<int>::const_iterator first = numberList.cbegin() + firstIndex;
    vector<int>::const_iterator last = numberList.cbegin() + lastIndex;
    vector<int> numbers(first, last);

    unsigned long result = 0;
    begin = clock();
    for (int i = 0; i < numbers.size(); i++) {
        result += numbers.at(i);
    }
    end = clock();
    cout << "thread" << index << " took " << ((float)(end - begin)) / CLOCKS_PER_SEC << endl;
    p.set_value(result);
}

int main(void)
{
    vector<int> numberList;
    vector<thread> t;
    vector<future<unsigned long>> futures;
    vector<unsigned long> result;

    const int NumberOfThreads = thread::hardware_concurrency() ?: 2;
    int numberInThread = iterationNumber / NumberOfThreads;
    clock_t begin, end;

    for (int i = 0; i < iterationNumber; i++) {
        int randomN = rand() % 10000 + 1;
        numberList.push_back(randomN);
    }

    for (int j = 0; j < NumberOfThreads; j++) {
        promise<unsigned long> promises;
        futures.push_back(promises.get_future());
        t.push_back(thread(myFunction, j, numberInThread, std::move(promises), numberList));
    }

    for_each(t.begin(), t.end(), std::mem_fn(&std::thread::join));

    for (int i = 0; i < futures.size(); i++) {
        result.push_back(futures.at(i).get());
    }

    unsigned long RRR = 0;
    begin = clock();
    for (int i = 0; i < numberList.size(); i++) {
        RRR += numberList.at(i);
    }
    end = clock();
    cout << "not by thread took " << ((float)(end - begin)) / CLOCKS_PER_SEC << endl;
}
Because the hardware concurrency of my laptop is 4, it will create 4 threads, each taking a quarter of numberList and summing up the numbers.
However, the result was different from what I expected.
thread0 took 0.007232
thread1 took 0.007402
thread2 took 0.010035
thread3 took 0.011759
not by thread took 0.009654
Why? Why did it take more time than the serial version (not by thread)?
For example, I think if we use two threads we can expect half of the execution time.
You'd think so, but sadly, that is often not the case in practice. The ideal "N cores means 1/Nth the execution time" scenario occurs only when the N cores can execute completely in parallel, without any core's actions interfering with the performance of the other cores.
But what your threads are doing is just summing up different sub-sections of an array... surely that can benefit from being executed in parallel? The answer is that in principle it can, but on a modern CPU, simple addition is so blindingly fast that it isn't really a factor in how long it takes a loop to complete. What really does limit the execution speed of a loop is access to RAM. Compared to the speed of the CPU, RAM access is very slow -- and on most desktop computers, each CPU has only one connection to RAM, regardless of how many cores it has. That means that what you are really measuring in your program is the speed at which a big array of integers can be read in from RAM to the CPU, and that speed is roughly the same -- equal to the CPU's memory-bus bandwidth -- regardless of whether it's one core doing the reading-in of the memory, or four.
To demonstrate how much RAM access is a factor, below is a modified/simplified version of your test program. In this version of the program, I've removed the big vectors, and instead the computation is just a series of calls to the (relatively expensive) sin() function. Note that in this version, the loop is only accessing a few memory locations, rather than thousands, and thus a core that is running the computation loop will not have to periodically wait for more data to be copied in from RAM to its local cache:
#include <vector>
#include <iostream>
#include <thread>
#include <chrono>
#include <math.h>
using namespace std;

static int iterationNumber = 1000000;

unsigned long long threadElapsedTimeMicros[10];
unsigned long threadResults[10];

void myFunction(const int index, const int numberInThread)
{
    unsigned long result = 666;

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int i = 0; i < numberInThread; i++) result += 100 * sin(result);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    threadResults[index] = result;
    threadElapsedTimeMicros[index] = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();

    // We'll print out the value of threadElapsedTimeMicros[index] later on,
    // after all the threads have been join()'d.
    // If we printed it out now it might affect the timing of the other threads
    // that may still be executing
}

int main(void)
{
    vector<thread> t;

    const int NumberOfThreads = thread::hardware_concurrency();
    const int numberInThread = iterationNumber / NumberOfThreads;

    // Multithreaded approach
    std::chrono::steady_clock::time_point allBegin = std::chrono::steady_clock::now();

    for (int j = 0; j < NumberOfThreads; j++) t.push_back(thread(myFunction, j, numberInThread));
    for (int j = 0; j < NumberOfThreads; j++) t[j].join();

    std::chrono::steady_clock::time_point allEnd = std::chrono::steady_clock::now();

    for (int j = 0; j < NumberOfThreads; j++) cout << " The computations in thread #" << j << ": result=" << threadResults[j] << ", took " << threadElapsedTimeMicros[j] << " microseconds" << std::endl;
    cout << " Total time spent doing multithreaded computations was " << std::chrono::duration_cast<std::chrono::microseconds>(allEnd - allBegin).count() << " microseconds in total" << std::endl;

    // And now, the single-threaded approach, for comparison
    unsigned long result = 666;

    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for (int i = 0; i < iterationNumber; i++) result += 100 * sin(result);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    cout << "result=" << result << ", single-threaded computation took " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds" << std::endl;
    return 0;
}
When I run the above program on my dual-core Mac mini (i7 with hyperthreading), here are the results I get:
Jeremys-Mac-mini:~ lcsuser1$ g++ -std=c++11 -O3 ./temp.cpp
Jeremys-Mac-mini:~ lcsuser1$ ./a.out
The computations in thread #0: result=1062, took 11718 microseconds
The computations in thread #1: result=1062, took 11481 microseconds
The computations in thread #2: result=1062, took 11525 microseconds
The computations in thread #3: result=1062, took 11230 microseconds
Total time spent doing multithreaded computations was 16492 microseconds in total
result=1181, single-threaded computation took 49846 microseconds
So in this case the results are more like what you'd expect -- because memory access was not a bottleneck, each core was able to run at full speed, and complete its 25% portion of the total calculations in about 25% of the time that it took a single thread to complete 100% of the calculations... and since the four cores were running truly in parallel, the total time spent doing the calculations was about 33% of the time it took for the single-threaded routine to complete (ideally it would be 25% but there's some overhead involved in starting up and shutting down the threads, etc).
This is an explanation for the beginner.
It's not technically accurate, but IMHO not so far off that anyone will be harmed by reading it.
It provides an entry point into understanding parallel-processing terms.
Threads, Tasks, and Processes
It is important to know the difference between threads and processes.
By default, starting a new process allocates dedicated memory for that process, so it shares memory with no other process and could (in theory) be run on a separate computer.
(You can share memory with other processes, via the operating system or "shared memory", but you have to add these features; they are not available to your process by default.)
Having multiple cores means that each running process can be executed on any idle core.
So basically one program runs on one core, another program runs on a second core, and the background service doing something for you runs on a third (and so on and so forth).
Threads are something different.
Every process runs in at least one thread, the main thread.
The operating system implements a scheduler that is supposed to allocate CPU time to programs. In principle it will say:
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
you get the idea..
The scheduler can typically prioritize between threads, so some programs get more CPU time than others.
The scheduler can of course schedule threads on all cores, but if it splits a process's threads over multiple cores there can be a performance penalty, because each core holds its own very fast memory cache.
Since threads from the same process can access the same cache, sharing memory between threads is quite fast.
Accessing another core's cache is not as fast (if even possible without going via RAM), so in general a scheduler will not split a process over multiple cores.
The result is that all the threads belonging to a process tend to run on the same core.
| Core 1 | Core 2 | Core 3 |
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
A process can spawn multiple threads; they all share the parent process's memory area and will normally all run on the core the parent was running on.
It makes sense to spawn threads within a process if you have an application that needs to respond to something whose timing it cannot control.
E.g. the user presses a cancel button, or attempts to move a window, while the application is running calculations that take a long time to complete.
Responsiveness of the UI requires the application to spend time reading and handling what the user is attempting to do. This could be achieved in a main loop, if the program does parts of the calculation in each iteration.
However, that gets complicated real fast. So instead of having the calculation code exit in the middle of a calculation to check the UI, update the UI, and then continue, you run the calculation code in another thread.
The scheduler then makes sure that the UI thread and the calculation thread get CPU time, so the UI responds to user input while the calculation continues.
And your code stays fairly simple.
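A minimal sketch of that pattern with std::thread (the "UI" here is just a console message loop, purely illustrative):

#include <thread>
#include <atomic>
#include <chrono>
#include <iostream>

int main()
{
    std::atomic<bool> done{false};
    double result = 0.0;

    // The long-running calculation lives in its own thread...
    std::thread worker([&] {
        for (int i = 0; i < 50000000; i++) result += 1e-7 * i;
        done = true;
    });

    // ...while this thread stays free to respond to the "user".
    while (!done) {
        std::cout << "still working..." << std::endl;
        std::this_thread::sleep_for(std::chrono::milliseconds(200));
    }

    worker.join();                      // result is safe to read after the join
    std::cout << "result = " << result << std::endl;
    return 0;
}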
But I want to run my calculations on another core to gain speed
To distribute calculations over multiple cores, you could spawn a new process for each calculation job. In this way the scheduler knows that each process gets its own memory, and each process can easily be launched on an idle core.
However, you have a problem: you need to share memory with the other process so it knows what to do.
A simple way of doing this is sharing memory via the filesystem.
You could create a file with the data for the calculation, and then spawn a thread governing the execution of (and communication with) the other program (so your UI stays responsive while we wait for the results).
The governing thread runs the other program via system commands, which starts it as another process.
The other program is written so that it takes the input file as an input argument, so we can run it in multiple instances on different files.
If the program terminates itself when it's done and creates an output file, it can run on any core (or several), and your process can read the output file.
This actually works, and should the calculation take a long time (like many minutes) this is perhaps ok, even though we use files to communicate between our processes.
For calculations that only take seconds, however, the filesystem is slow, and waiting for it will almost cancel out the performance gained by using processes instead of just threads. So other, more efficient memory sharing is used in real life, for instance creating a shared-memory area in RAM.
The "create governing thread, and spawn subprocess, allow communication with process via governing thread, collect data when process is complete, and expose via governing thread" can be implemented in multiple ways.
Tasks
Well "tasks" is ambiguous.
In general it means "Process or thread that solves a task".
However, in certain languages like C#, it is something that implements a thread-like thing that the scheduler can treat as a process. Other languages that provide a similar feature typically dub it either tasks or workers.
So with workers/tasks it appears to the programmer as if it were merely a thread that you can easily share memory with, via references, and control like any other thread by invoking methods on it.
But it appears to the scheduler as if it's a process that can be run on any core.
It implements the shared-memory problem in a fairly efficient way, as part of the language, so the programmer won't have to reinvent this wheel for every task.
This is often referred to as "Hybrid threading" or simply "parallel threads"
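In C++ the closest standard analogue is probably std::async with std::future: you hand off a callable as a "task", share data with it via ordinary references or captures, and collect the result later. A minimal sketch (the split into two halves is only illustrative):

#include <future>
#include <iostream>
#include <vector>

// A plain function used as a "task"; data is shared simply by passing a reference.
long long sumRange(const std::vector<int>& data, std::size_t begin, std::size_t end)
{
    long long s = 0;
    for (std::size_t i = begin; i < end; i++) s += data[i];
    return s;
}

int main()
{
    std::vector<int> data(1000000, 1);

    // The runtime decides where these run; we just keep the futures.
    auto firstHalf  = std::async(std::launch::async, sumRange, std::cref(data), 0, data.size() / 2);
    auto secondHalf = std::async(std::launch::async, sumRange, std::cref(data), data.size() / 2, data.size());

    std::cout << "total = " << firstHalf.get() + secondHalf.get() << std::endl;
    return 0;
}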
It seems that you have some misconceptions about multithreading. Simply using two threads cannot halve the processing time.
Multithreading is a rather complicated concept, but you can easily find related material on the web; you should read some of it first. Still, I will try to give a simple explanation with an example.
No matter how many CPUs (or cores) you have, the total processing capacity of the CPU will always be the same whether you use multithreading or not, right? Then where does the performance difference come from?
When a program runs on a device (computer), it uses not only the CPU but also other system resources such as the network, RAM, hard drives, etc. If the flow of the program is serialized, there will be points in time when the CPU is idle, waiting for other system resources to finish. But when the program runs with multiple threads (multiple flows), if one thread goes idle (waiting for some task to be done by another system resource), the other threads can use the CPU. Therefore, you can minimize the idle time of the CPU and improve the time performance. This is one of the simplest examples of the benefit of multithreading.
Since your sample code is almost entirely CPU-consuming, using multiple threads brings little performance improvement. Sometimes it can be even worse, because multithreading also comes with the time cost of context switching.
FYI, parallel processing is not the same as multi-threading.
The previous answer does well to point out the problems seen on the Mac.
Provided you use an OS that can schedule threads in a useful manner, you have to consider whether a problem is basically the product of one sub-problem repeated many times. An example is matrix multiplication. When you multiply two matrices, certain parts of the work are independent of the others. A 3x3 matrix times another 3x3 requires 9 dot products, which can be computed independently of each other; each dot product itself requires 3 multiplications and 2 additions, but here the multiplications must be done first. So if we wanted to use a multithreaded processor for this task, we could use 9 cores or threads, and given that they get equal compute time or the same priority level (which is adjustable on Windows), we would reduce the time to multiply two 3x3 matrices by a factor of 9. This is because we are essentially doing the same thing 9 times, which can be done at the same time by 9 workers.
Now, for each of the 9 dot products we could have 3 cores perform the multiplications, totalling 3x9 = 27 cores all together, reducing the time to roughly t/27. But we have 18 additions, and there we get no gain from more cores: one addition must be piped into the next. So the problem takes time t with one core, or ideally something like t/27 with 27 cores working together. Now you can see why problems are often sought out that are 'linear', because they can be parallelized pretty well, like graphics for example (though some things like back-face culling are sorting problems and inherently not linear, so parallel processing gives diminished performance boosts there).
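As a concrete illustration of that independence, here is a minimal OpenMP sketch (the matrix values are arbitrary, and collapse(2) needs OpenMP 3.0 or later) that lets the 9 dot products of a 3x3 multiply run on whatever threads are available:

#include <omp.h>
#include <cstdio>

int main()
{
    const int N = 3;
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    double C[N][N] = {};

    // Each (i, j) dot product is independent of the others,
    // so the 9 of them can run concurrently with no synchronization.
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            double dot = 0.0;
            for (int k = 0; k < N; k++)
                dot += A[i][k] * B[k][j];
            C[i][j] = dot;
        }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%6.1f ", C[i][j]);
        printf("\n");
    }
    return 0;
}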
On top of that there is the added overhead of starting threads and of how they are scheduled by the OS and the processor. Hope this helps.
I am studying task implementation in TBB and have run code for parallel and serial calculation of the Fibonacci series.
The code is:
#include <iostream>
#include <list>
#include <tbb/task.h>
#include <tbb/task_group.h>
#include <stdlib.h>
#include "tbb/compat/thread"
#include "tbb/task_scheduler_init.h"
using namespace std;
using namespace tbb;

#define CutOff 2

long serialFib( long n ) {
    if( n<2 )
        return n;
    else
        return serialFib(n-1) + serialFib(n-2);
}

class FibTask: public task
{
public:
    const long n;
    long* const sum;

    FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}

    task* execute()   // overrides virtual function task::execute
    {
        // cout<<"task id of thread is \t"<<this_thread::get_id()<<"FibTask(n)="<<n<<endl;
        // cout<<"Task Stolen is"<<is_stolen_task()<<endl;
        if( n<CutOff )
        {
            *sum = serialFib(n);
        }
        else
        {
            long x, y;
            FibTask& a = *new( allocate_child() ) FibTask(n-1,&x);
            FibTask& b = *new( allocate_child() ) FibTask(n-2,&y);
            set_ref_count(3); // 3 = 2 children + 1 for the wait; ref_count tracks the number of tasks spawned at this level of the task graph
            spawn( b );
            // cout<<"child id of thread is \t"<<this_thread::get_id()<<"calculating n ="<<n<<endl;
            spawn_and_wait_for_all( a ); // set tasks for execution and wait for them
            *sum = x+y;
        }
        return NULL;
    }
};

long parallelFib( long n )
{
    long sum;
    FibTask& a = *new(task::allocate_root()) FibTask(n,&sum);
    task::spawn_root_and_wait(a);
    return sum;
}

int main()
{
    long i,j;
    cout<<fixed;

    cout<<"Fibonacci Series parallelly formed is "<<endl;
    tick_count t0=tick_count::now();
    for(i=0;i<50;i++)
        cout<<parallelFib(i)<<"\t";
    // cout<<"parallel execution of Fibonacci series for n=10 \t"<<parallelFib(i)<<endl;
    tick_count t1=tick_count::now();
    double t=(t1-t0).seconds();
    cout<<"Time Elapsed in Parallel Execution is \t"<<t<<endl;

    cout<<"\n Fibonacci Series Serially formed is "<<endl;
    tick_count t3=tick_count::now();
    for(j=0;j<50;j++)
        cout<<serialFib(j)<<"\t";
    tick_count t4=tick_count::now();
    double t5=(t4-t3).seconds();
    cout<<"Time Elapsed in Serial Execution is \t"<<t5<<endl;
    return(0);
}
Parallel execution is taking more time than serial execution. In this run, parallel execution took 2500 secs whereas serial took around 167 secs.
Can anybody please explain the reason for this?
Overhead.
When your actual task is lightweight, the coordination/communication dominates and you do not (automatically) gain from parallel execution. This is a pretty common issue.
Try instead to compute M Fibonacci numbers (of a high enough cost) serially, then compute them in parallel. You should see a gain.
Change Cutoff to 12, compile with optimization on (-O on Linux; /O2 on Windows), and you should see significant speedup.
There is plenty of parallelism in the example. The problem is that with Cutoff=2, the individual units of useful parallel computation are swamped by scheduling overhead. Raising the Cutoff value should resolve the problem.
Here is the analysis. There are two important times for analyzing parallelism:
work - the total amount of computational work.
span - the length of the critical path.
The available parallelism is work/span.
For fib(n), when n is sufficiently large, the work is roughly proportional to fib(n) [yes, it describes itself!]. The span is the depth of the call tree - it is roughly proportional to n. So the parallelism is proportional to fib(n)/n. So even for n=10, there is plenty of available parallelism to keep a typical 2013 desktop machine humming.
The problem is that TBB tasks take time to create, execute, synchronize, and destroy. Changing Cutoff from 2 to 12 allows the serial code to take over when the work is so small that scheduling overheads would swamp it. This is a common pattern in recursive parallelism: recurse in parallel until you are down to chunks of work that might as well be done serially. Other parallel frameworks (like OpenMP or Cilk Plus) have the same issue: there is overhead for tasks, albeit it may be more or less than in TBB. All that changes is the best threshold value.
Try varying Cutoff. Lower values should give you more parallelism but more scheduling overhead. Higher values give you less parallelism but less scheduling overhead. In between, you will likely find a range of values that give good speedup.
Am I right in thinking that each task does result of fib(n-1) + result of fib(n-2)? So essentially you start a task, which then starts more tasks, and so on until we have a very large number of tasks (I got slightly lost trying to count them all, but it grows roughly like fib(n) itself), and the result of each such task is used to add up the Fibonacci number.
First of all, there is no actual parallel execution here (other than perhaps the two independent recursive calculations). Every task relies on the results of its subtasks and can't really do anything in parallel. On the other hand, you are performing a whole lot of work to set up each task. It's not at all surprising that you don't see any benefit.
Now, if you were to calculate the fibonacci numbers 1 .. 50 by iteration, and you started, say, one task per processor core in your system, and compared that to an iterative solution using just a single loop, I'm sure that would show a much better improvement.
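For reference, a sketch of the iterative computation (each fib(n) is then linear in n, so cheap that a handful of such loops, say one per core, would finish 1..50 almost instantly):

#include <iostream>

// Iterative Fibonacci: O(n) additions, no task tree at all.
long long iterFib(long long n)
{
    long long a = 0, b = 1;
    for (long long i = 0; i < n; i++) {
        long long next = a + b;
        a = b;
        b = next;
    }
    return a;
}

int main()
{
    for (long long n = 0; n < 50; n++)
        std::cout << iterFib(n) << "\t";
    std::cout << std::endl;
    return 0;
}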
Without more information it will be hard to tell. You need to check: how many processors does your computer have? Were there any other programs that might have been using those processors?
If you want to run in (true) parallel and gain performance benefits, the operating system must be able to allocate at least 2 free processors.
Also, for small tasks, the overhead of allocating threads and collecting their results might exceed the benefits of parallel execution.
On my laptop with an Intel Pentium dual-core processor T2370 (Acer Extensa) I ran a simple multithreading speedup test. I am using Linux. The code is pasted below. While I was expecting a speedup of 2-3 times, I was surprised to see a slowdown by a factor of 2. I tried the same with gcc optimization levels -O0 ... -O3, but every time I got the same result. I am using pthreads. I also tried the same with only two threads (instead of the 3 threads in the code), but the performance was similar.
What could be the reason? The faster version took reasonably long - about 20 secs - so it seems it is not an issue of startup overhead.
NOTE: This code is quite buggy (indeed it does not make much sense, as the output of the serial and parallel versions would differ). The intention was just to get a speedup comparison for the same number of instructions.
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <pthread.h>

class Thread{
private:
    pthread_t thread;
    static void *thread_func(void *d){((Thread *)d)->run(); return NULL;}
public:
    Thread(){}
    virtual ~Thread(){}

    virtual void run(){}

    int start(){return pthread_create(&thread, NULL, Thread::thread_func, (void*)this);}
    int wait(){return pthread_join(thread, NULL);}
};

#include <iostream>

const int ARR_SIZE = 100000000;
const int N = 20;
int arr[ARR_SIZE];

int main(void)
{
    class Thread_a:public Thread{
    public:
        Thread_a(int* a): arr_(a) {}
        void run()
        {
            for(int n = 0; n<N; n++)
                for(int i=0; i<ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
        }
    private:
        int* arr_;
    };

    class Thread_b:public Thread{
    public:
        Thread_b(int* a): arr_(a) {}
        void run()
        {
            for(int n = 0; n<N; n++)
                for(int i=ARR_SIZE/3; i<2*ARR_SIZE/3; i++){ arr_[i] += arr_[i-1];}
        }
    private:
        int* arr_;
    };

    class Thread_c:public Thread{
    public:
        Thread_c(int* a): arr_(a) {}
        void run()
        {
            for(int n = 0; n<N; n++)
                for(int i=2*ARR_SIZE/3; i<ARR_SIZE; i++){ arr_[i] += arr_[i-1];}
        }
    private:
        int* arr_;
    };

    {
        Thread *a=new Thread_a(arr);
        Thread *b=new Thread_b(arr);
        Thread *c=new Thread_c(arr);

        clock_t start = clock();

        if (a->start() != 0) {
            return 1;
        }
        if (b->start() != 0) {
            return 1;
        }
        if (c->start() != 0) {
            return 1;
        }

        if (a->wait() != 0) {
            return 1;
        }
        if (b->wait() != 0) {
            return 1;
        }
        if (c->wait() != 0) {
            return 1;
        }

        clock_t end = clock();
        double duration = (double)(end - start) / CLOCKS_PER_SEC;
        std::cout << duration << "seconds\n";

        delete a;
        delete b;
        delete c;
    }

    {
        clock_t start = clock();

        for(int n = 0; n<N; n++)
            for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}

        clock_t end = clock();
        double duration = (double)(end - start) / CLOCKS_PER_SEC;
        std::cout << "serial: " << duration << "seconds\n";
    }

    return 0;
}
See also: What can make a program run slower when using more threads?
The times you are reporting are measured using the clock function:
The clock() function returns an approximation of processor time used by the program.
$ time bin/amit_kumar_threads.cpp
6.62seconds
serial: 2.7seconds
real 0m5.247s
user 0m9.025s
sys 0m0.304s
The real time will be less for multiprocessor tasks, but the processor time will typically be greater.
When you use multiple threads, the work may be done by more than one processor, but the amount of work is the same, and in addition there may be some overhead such as contention for limited resources. clock() measures the total processor time, which will be the work + any contention overhead. So it should never be less than the processor time for doing the work in a single thread.
It's a little hard to tell from the question whether you knew this, and were surprised that the value returned by clock() was twice that for a single thread rather than being only a little more, or you were expecting it to be less.
Using clock_gettime() instead (you'll need the realtime library librt, g++ -lrt etc.) gives:
$ time bin/amit_kumar_threads.cpp
2.524 seconds
serial: 2.761 seconds
real 0m5.326s
user 0m9.057s
sys 0m0.344s
which still is less of a speed-up than one might hope for, but at least the numbers make some sense.
100000000*20/2.5s = 800 MHz, and the bus frequency is 1600 MHz, so I suspect that with a read and a write for each iteration (assuming some caching), you're memory-bandwidth limited as tstenner suggests, and the clock() value shows that most of the time some of your processors are waiting for data. (Does anyone know whether clock() time includes such stalls?)
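For reference, a wall-clock measurement with clock_gettime() as suggested above might look roughly like this (POSIX only; CLOCK_MONOTONIC is the usual choice, and the timed loop is just a stand-in for the real work):

#include <time.h>
#include <stdio.h>

int main()
{
    timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);    // wall-clock time, not CPU time

    volatile double sink = 0.0;                // stand-in for the work being measured
    for (int i = 0; i < 100000000; i++) sink += i * 0.5;

    clock_gettime(CLOCK_MONOTONIC, &end);
    double seconds = (end.tv_sec - start.tv_sec)
                   + (end.tv_nsec - start.tv_nsec) / 1e9;
    printf("elapsed: %.3f seconds (result %f)\n", seconds, (double)sink);
    return 0;
}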
The only thing your threads do is add some elements, so your application should be memory-bound rather than CPU-bound. When you add an extra thread, you have 2 CPUs sharing the memory bus, so it won't go faster; instead, you'll have cache misses etc.
I believe that your algorithm essentially makes your cache memory useless.
Probably what you are seeing is the effect of (non-)locality of reference between the three threads. Essentially, because each thread is operating on a different section of data that is widely separated from the others, you are causing cache misses as the data section for one thread replaces that of another thread in your cache. If your program were constructed so that the threads operated on sections of data that were smaller (so that they could all be kept in cache) or closer together (so that all threads could use the same in-cache pages), you'd see a performance boost. As it is, I suspect that your slowdown is because a lot of memory references have to be satisfied from main memory instead of from your cache.
Not related to your threading issues, but there is a bounds error in your code.
You have:
for(int i=0; i<ARR_SIZE; i++){ arr[i] += arr[i-1];}
When i is zero you will be doing
arr[0] += arr[-1];
Also see Herb Sutter's article on how multiple CPUs and cache lines interfere in multithreaded code, especially the section `All Sharing Is Bad -- Even of "Unshared" Objects...'
As others have pointed out, threads don't necessarily provide improvements to speed. In this particular example, the amount of time spent in each thread is significantly less than the amount of time required to perform context switches and synchronization.
tstenner has got it mostly right.
This is mainly a benchmark of your OS's "allocate and map a new page" algorithm. That array allocation allocates 800MB of virtual memory; the OS won't actually allocate real physical memory until it's needed. "Allocate and map a new page" is usually protected by a mutex, so more cores won't help.
Your benchmark also stresses the memory bus (minimum 800MB transferred; on OSs that zero memory just before they give it to you, the worst case is 800MB * 7 transfers). Adding more cores isn't really going to help if the bottleneck is the memory bus.
You have 3 threads that are trampling all over the same memory. The cache lines are being read and written by different threads, so they will be ping-ponging between the L1 caches of the two CPU cores. (A cache line that is to be written to can only be in one L1 cache, and that must be the L1 cache attached to the CPU core doing the write.) This is not very efficient. The CPU cores are probably spending most of their time waiting for the cache line to be transferred, which is why this is slower with threads than if you single-threaded it.
Incidentally, the code is also buggy because the same array is read & written from different CPUs without locking. Proper locking would have an effect on performance.
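A common remedy for that kind of cache-line sharing, sketched here only as a general illustration (the made-up workload below is not a fix for the prefix-sum code above, which has a real data dependency), is to give every thread its own cache-line-aligned accumulator and combine the results once after the joins:

#include <thread>
#include <vector>
#include <iostream>

// Pad each per-thread slot to a full cache line so that writes from
// different threads never land on the same line (no ping-ponging).
struct alignas(64) PaddedSum {
    long long value = 0;
};

int main()
{
    const int numThreads = 3;
    const long long perThread = 10000000;
    PaddedSum partial[numThreads];            // one padded slot per thread
    std::vector<std::thread> workers;

    for (int t = 0; t < numThreads; t++) {
        workers.emplace_back([&, t] {
            long long local = 0;              // accumulate in a local first
            for (long long i = 0; i < perThread; i++) local += i % 7;
            partial[t].value = local;         // a single write per thread at the end
        });
    }
    for (auto& w : workers) w.join();

    long long total = 0;
    for (const auto& p : partial) total += p.value;
    std::cout << "total = " << total << std::endl;
    return 0;
}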
Threads take you to the promised land of speed boosts(TM) only when you have a proper parallel implementation. Which means that you need to have:
a proper parallelization of your algorithm
a compiler that knows and can spread your algorithm out on the hardware as a parallel procedure
hardware support for parallelization
It is difficult to come up with the first. You need to be able to introduce redundancy and make sure it's not eating into your performance, handle proper merging of data before processing the next batch, and so on ...
But this is then only a theoretical standpoint.
Running multiple threads doesn't give you much when you have only one processor and a bad algorithm. Remember -- there is only one processor, so your threads have to wait for a time slice and essentially you are doing sequential processing.