Parallel Execution taking more time than Serial? - c++

I am studying task implementation in TBB and have run code for the parallel and serial calculation of the Fibonacci series.
The code is:
#include <iostream>
#include <list>
#include <tbb/task.h>
#include <tbb/task_group.h>
#include <stdlib.h>
#include "tbb/compat/thread"
#include "tbb/task_scheduler_init.h"
using namespace std;
using namespace tbb;
#define CutOff 2
long serialFib( long n ) {
    if( n<2 )
        return n;
    return serialFib(n-1) + serialFib(n-2);
}
class FibTask: public task
{
public:
    const long n;
    long* const sum;
    FibTask( long n_, long* sum_ ) : n(n_), sum(sum_) {}
    task* execute() // Overrides virtual function task::execute
    {
        // cout<<"task id of thread is \t"<<this_thread::get_id()<<"FibTask(n)="<<n<<endl;
        // cout<<"Task Stolen is"<<is_stolen_task()<<endl;
        if( n<CutOff )
        {
            *sum = serialFib(n);
        }
        else
        {
            long x, y;
            FibTask& a = *new( allocate_child() ) FibTask(n-1,&x);
            FibTask& b = *new( allocate_child() ) FibTask(n-2,&y);
            set_ref_count(3); // 3 = 2 children + 1 for the wait
            spawn( b );
            // cout<<"child id of thread is \t"<<this_thread::get_id()<<"calculating n ="<<n<<endl;
            spawn_and_wait_for_all( a ); // start a and wait for both children to finish
            *sum = x+y;
        }
        return NULL;
    }
};
long parallelFib( long n )
{
    long sum;
    FibTask& a = *new(task::allocate_root()) FibTask(n,&sum);
    task::spawn_root_and_wait(a);
    return sum;
}
int main()
{
    long i,j;
    cout<<"Fibonacci Series parallelly formed is "<<endl;
    tick_count t0=tick_count::now();
    // cout<<"parallel execution of Fibonacci series for n=10 \t"<<parallelFib(i)<<endl;
    tick_count t1=tick_count::now();
    double t=(t1-t0).seconds();
    cout<<"Time Elapsed in Parallel Execution is \t"<<t<<endl;
    cout<<"\n Fibonacci Series Serially formed is "<<endl;
    tick_count t3=tick_count::now();
    tick_count t4=tick_count::now();
    double t5=(t4-t3).seconds();
    cout<<"Time Elapsed in Serial Execution is \t"<<t5<<endl;
    return 0;
}
Parallel execution is taking more time than serial execution. In this run, parallel execution took 2500 sec whereas serial execution took around 167 sec.
Can anybody please explain the reason for this?

When your actual task is lightweight, the coordination/communication dominates and you do not (automatically) gain from parallel execution. This is a pretty common issue.
Try instead to compute M Fibonacci numbers (of a high enough cost) serially, then compute them in parallel. You should see a gain.

Change CutOff to 12, compile with optimization on (-O on Linux; /O2 on Windows), and you should see significant speedup.
There is plenty of parallelism in the example. The problem is that with CutOff=2, the individual units of useful parallel computation are swamped by scheduling overhead. Raising the CutOff value should resolve the problem.
Here is the analysis. There are two important times for analyzing parallelism:
work - the total amount of computational work.
span - the length of the critical path.
The available parallelism is work/span.
For fib(n), when n is sufficiently large, the work is roughly proportional to fib(n) [yes, it describes itself!]. The span is the depth of the call tree - it is roughly proportional to n. So the parallelism is proportional to fib(n)/n. So even for n=10, there is plenty of available parallelism to keep a typical 2013 desktop machine humming.
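As a rough illustration of the work/span estimate above, the sketch below walks the same call tree as the naive recursion and records the number of calls (a stand-in for work) and the depth of the tree (a stand-in for span). The names `FibCost` and `measureFib` are made up for this example and are not part of TBB.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper type for this example:
// work = number of calls, span = depth of the call tree.
struct FibCost {
    std::uint64_t work;
    std::uint64_t span;
};

// Walks the call tree of the naive recursive fib(n) without computing fib itself.
FibCost measureFib(int n) {
    if (n < 2) return {1, 1};  // a leaf is one call, depth one
    FibCost a = measureFib(n - 1);
    FibCost b = measureFib(n - 2);
    return {1 + a.work + b.work, 1 + std::max(a.span, b.span)};
}
```

For n = 10 this gives 177 calls and a depth of 10, so the work/span ratio is roughly 18.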
The problem is that TBB tasks take time to create, execute, synchronize, and destroy. Changing CutOff from 2 to 12 allows the serial code to take over when the work is so small that scheduling overheads would swamp it. This is a common pattern in recursive parallelism: recurse in parallel until you are down to chunks of work that might as well be done serially. Other parallel frameworks (like OpenMP or Cilk Plus) have the same issue: there is overhead for tasks, although it may be more or less than in TBB. All that changes is what the best threshold value is.
Try varying CutOff. Lower values should give you more parallelism but more scheduling overhead. Higher values give you less parallelism but less scheduling overhead. In between, you will likely find a range of values that give good speedup.
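The cutoff pattern is not specific to TBB. Here is a minimal sketch of it using only the standard library. It uses `std::async` instead of TBB tasks, so the per-task overhead (and therefore the best threshold) will differ; the name `parallelFibAsync` and its `cutoff` parameter are assumptions made for this example.

```cpp
#include <future>

long serialFib(long n) {
    return n < 2 ? n : serialFib(n - 1) + serialFib(n - 2);
}

// Recurse in parallel while n >= cutoff; below that, fall back to the serial code.
long parallelFibAsync(long n, long cutoff) {
    if (n < cutoff || n < 2) return serialFib(n);
    // Run one branch asynchronously and compute the other in the current thread.
    std::future<long> x = std::async(std::launch::async, parallelFibAsync, n - 1, cutoff);
    long y = parallelFibAsync(n - 2, cutoff);
    return x.get() + y;
}
```

Calling `parallelFibAsync(n, cutoff)` with different `cutoff` values is one way to look for the range where the parallel version starts to pay off on a given machine.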

Am I right in thinking that each task does result of fib(n-1) + result of fib(n-2) - so essentially, you start a task, which then starts another task and so on until we have a very large number of tasks (I got slightly lost trying to count them all - I think it's n squared). And the result of each such task is used to add up the fibonacci number.
First of all, there is no actual parallel execution here (other than perhaps two independent recursive calculations). Every task relies on the result of its subtasks, and can't really do anything in parallel. On the other hand, you are performing a whole lot of work to set up each task. It is not at all surprising that you don't see any benefit.
Now, if you were to calculate the fibonacci numbers 1 .. 50 by iteration, and you started, say, one task per processor core in your system, and compared that to an iterative solution using just a single loop, I'm sure that would show a much better improvement.

Without more information it is hard to tell. You need to check: how many processors does your computer have? Were there any other programs running that might have been using the processors?
If you want to run in (true) parallel and gain performance benefits, then the operating system must be able to allocate at least 2 free processors.
Also, for small tasks, the overhead of allocating threads and collecting their results might exceed the benefits of parallel execution.


Reason for collapse of memory bandwidth when 2KB of data is cached in L1-cache

In a self-educational project I measure the bandwidth of the memory with help of the following code (here paraphrased, the whole code follows at the end of the question):
unsigned int doit(const std::vector<unsigned int> &mem){
    const size_t BLOCK_SIZE=16;
    size_t n = mem.size();
    unsigned int result=0;
    for(size_t i=0;i<n;i+=BLOCK_SIZE){
        result+=mem[i];
    }
    return result;
}
//... initialize mem, result and so on
int NITER = 200;
//... measure time of
for(int i=0;i<NITER;i++)
    result+=doit(mem);
BLOCK_SIZE is chosen in such a way that a whole 64-byte cache line is fetched per single integer addition. My machine (an Intel Broadwell) needs about 0.35 nanoseconds per integer addition, so the code above could saturate a bandwidth as high as 182 GB/s (this value is just an upper bound and is probably quite off; what is important is the ratio of bandwidths for different sizes). The code is compiled with g++ and -O3.
Varying the size of the vector, I can observe expected bandwidths for L1(*)-, L2-, L3-caches and the RAM-memory:
However, there is an effect I'm really struggling to explain: the collapse of the measured bandwidth of L1-cache for sizes around 2 kB, here in somewhat higher resolution:
I could reproduce the results on all machines I have access to (which have Intel-Broadwell and Intel-Haswell processors).
My question: What is the reason for the performance-collapse for memory-sizes around 2 KB?
(*) I hope I understand correctly that for the L1 cache not 64 bytes but only 4 bytes per addition are read/transferred (there is no further, faster cache where a cache line must be filled), so the plotted bandwidth for L1 is only an upper limit and not the bandwidth itself.
Edit: When the step size in the inner for-loop is chosen to be
8 (instead of 16) the collapse happens for 1KB
4 (instead of 16) the collapse happens for 0.5KB
i.e. when the inner loop consists of about 31-35 steps/reads. That means the collapse isn't due to the memory-size but due to the number of steps in the inner loop.
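The arithmetic behind that observation can be checked with a small sketch. It assumes 4-byte `unsigned int` elements, as on the machines in question; the function name `innerLoopSteps` is made up for this example.

```cpp
#include <cstddef>

// Number of iterations of `for (i = 0; i < n; i += blockSize)` for a vector
// of unsigned ints that occupies `bytes` bytes.
std::size_t innerLoopSteps(std::size_t bytes, std::size_t blockSize) {
    const std::size_t n = bytes / sizeof(unsigned int);  // elements in the vector
    return (n + blockSize - 1) / blockSize;              // ceil(n / blockSize)
}
```

With these assumptions, 2 KB at step 16, 1 KB at step 8 and 0.5 KB at step 4 all give 32 iterations of the inner loop.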
It can be explained with branch misses, as shown in @user10605163's great answer.
Listing for reproducing the results
#include <vector>
#include <chrono>
#include <iostream>
#include <algorithm>
//returns minimal time needed for one execution in seconds:
template<typename Fun>
double timeit(Fun&& stmt, int repeat, int number)
{
    std::vector<double> times;
    for(int i=0;i<repeat;i++){
        auto begin = std::chrono::high_resolution_clock::now();
        for(int i=0;i<number;i++){
            stmt();
        }
        auto end = std::chrono::high_resolution_clock::now();
        double time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-begin).count()/1e9/number;
        times.push_back(time);
    }
    return *std::min_element(times.begin(), times.end());
}
const int NITER=200;
const int NTRIES=5;
const size_t BLOCK_SIZE=16;
struct Worker{
    std::vector<unsigned int> &mem;
    size_t n;
    unsigned int result;
    void operator()(){
        for(size_t i=0;i<n;i+=BLOCK_SIZE){
            result+=mem[i];
        }
    }
    Worker(std::vector<unsigned int> &mem_):
        mem(mem_), n(mem.size()), result(1)
    {}
};
double get_size_in_kB(int SIZE){
    return SIZE*sizeof(int)/(1024.0);
}
double get_speed_in_GB_per_sec(int SIZE){
    std::vector<unsigned int> vals(SIZE, 42);
    Worker worker(vals);
    double time=timeit(worker, NTRIES, NITER);
    return get_size_in_kB(SIZE)/(1024*1024)/time;
}
int main(){
    int size=BLOCK_SIZE*16;
    //ensure that nothing is optimized away:
    std::cerr<<"Sum: "<<PREVENT_OPTIMIZATION<<"\n";
}
import sys
import pandas as pd
import matplotlib.pyplot as plt
plt.plot(data[labels[0]], data[labels[1]], label="my laptop")
Building/running/creating report:
>>> g++ -O3 -std=c++11 bandwidth.cpp -o bandwidth
>>> ./bandwidth > report.txt
>>> python report.txt
# image is in report.png
I changed the values slightly: NITER = 100000 and NTRIES=1 to get a less noisy result.
I don't have a Broadwell available right now, however I tried your code on my Coffee-Lake and got a performance drop, not at 2KB, but around 4.5KB. In addition I find erratic behavior of the throughput slightly above 2KB.
The blue line in the graph corresponds to your measurement (left axis):
The red line here is the result from perf stat -e branch-instructions,branch-misses, giving the fraction of branches that were not correctly predicted (in percent, right axis). As you can see there is a clear anti-correlation between the two.
Looking into the more detailed perf report, I found that basically all of these branch mispredictions happen in the innermost loop in Worker::operator(). If the taken/not-taken pattern for the loop branch becomes too long, the branch predictor will not be able to keep track of it, and so the exit branch of the inner loop will be mispredicted, leading to the sharp drop in throughput. As the number of iterations increases further, the impact of this single mispredict becomes less significant, leading to the slow recovery of the throughput.
For further information on the erratic behavior before the drop, see the comments made by @PeterCordes below.
In any case the best way to avoid branch mispredictions is to avoid branches and so I manually unrolled the loop in Worker::operator(), like e.g.:
void operator()(){
    for(size_t i=0;i+3*BLOCK_SIZE<n;i+=BLOCK_SIZE*4)
        result+=mem[i]+mem[i+BLOCK_SIZE]+mem[i+2*BLOCK_SIZE]+mem[i+3*BLOCK_SIZE];
}
Unrolling 2, 3, 4, 6 or 8 iterations gives the results below. Note that I did not correct for the blocks at the end of the vector which were ignored due to the unrolling. Therefore the periodic peaks in the blue line should be ignored, the lower bound base line of the periodic pattern is the actual bandwidth.
As you can see the fraction of branch mispredictions didn't really change, but because the total number of branches is reduced by the factor of unrolled iterations, they will not contribute strongly to the performance anymore.
There is also an additional benefit of the processor being more free to do the calculations out-of-order if the loop is unrolled.
If this is supposed to have practical application I would suggest to try to give the hot loop a compile-time fixed number of iteration or some guarantee on divisibility, so that (maybe with some extra hints) the compiler can decide on the optimal number of iterations to unroll.
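One way to give the hot loop a compile-time trip count is to make the number of steps a template parameter. The following is only a sketch under that assumption: `sumStrided`, `Steps` and `Stride` are hypothetical names, and whether the compiler actually unrolls the loop depends on the compiler and its flags.

```cpp
#include <cstddef>
#include <vector>

// Sums every Stride-th element, Steps times. Both values are template
// parameters, so the trip count is known to the compiler.
// The caller must ensure mem.size() > (Steps - 1) * Stride.
template <std::size_t Steps, std::size_t Stride>
unsigned int sumStrided(const std::vector<unsigned int>& mem) {
    unsigned int result = 0;
    for (std::size_t k = 0; k < Steps; ++k) {
        result += mem[k * Stride];
    }
    return result;
}
```

For example, `sumStrided<32, 16>(vals)` reads the same 32 elements as the original inner loop does for a 2 KB vector.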
This might be unrelated, but your Linux machine might be playing with the CPU frequency. I know Ubuntu 18 has a governor that balances power and performance. You may also want to set the process affinity to make sure the process does not get migrated to a different core while running.

Bottleneck at random number generation with multiple threads

I was facing performance issues while generating random numbers in multiple threads. This was caused by using the same random engine for all threads. I then implemented a vector that contains one random engine per thread (I found this solution in another post here on Stack Overflow). I would expect the number of iterations per second to grow linearly with the number of threads I'm executing, but this does not seem to be the case.
Here is a minimal example:
#include <cstdint>
#include <random>
#include <vector>
#include <omp.h>
const int threads = 4;
int main()
{
    std::uniform_int_distribution<uint64_t> uint_dist;
    std::vector<std::mt19937_64> random_engines;
    std::random_device rd;
    for (int i = 0;i < threads;i++)
        random_engines.push_back(std::mt19937_64(rd()));
    int counter = 0;
    #pragma omp parallel for
    for (int i = 0;i < threads;++i)
    {
        int thread = omp_get_thread_num();
        while (counter < 100)
            if (uint_dist((random_engines[thread])) < (1ULL << 42))
                counter++;
    }
}
When I execute this code with one active thread, it takes an average of ~4 seconds on my CPU. Setting threads to 4 gives me an average execution time of ~2 seconds, so multiplying the number of threads by 4 yields a speedup of only 2.
Am I missing something?
First, if you have two cores and hyper threading, it looks like four processors to your code, but it's not four times the speed, only a bit better than twice as fast if you are lucky.
Second, if you use all the CPU power that you have, your computer will heat up and then reduce the clock speed.
Third, you may be using a random number generator with huge state. The state for one may fit into L1 cache, but not the state for four of them. That can give a huge slowdown.
Fourth, you have a variable "counter" that is shared between threads and read at each iteration. That's not going to be fast.
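A sketch of how the shared counter could be avoided: each thread owns its engine, its distribution and a local counter, and the tallies are only combined after the threads have finished. This uses `std::thread` rather than OpenMP and a fixed number of draws per thread rather than the original `while (counter < 100)` loop, so it is a different workload; `countHits` and the fixed seeds are assumptions for this example.

```cpp
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Returns how many draws fell below 2^42, summed over all threads.
std::uint64_t countHits(unsigned threads, std::uint64_t drawsPerThread) {
    std::vector<std::uint64_t> hits(threads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([t, drawsPerThread, &hits] {
            std::mt19937_64 engine(1234 + t);  // fixed seeds, for repeatability only
            std::uniform_int_distribution<std::uint64_t> dist;
            std::uint64_t local = 0;           // private to this thread
            for (std::uint64_t i = 0; i < drawsPerThread; ++i) {
                if (dist(engine) < (1ULL << 42)) ++local;
            }
            hits[t] = local;                   // one shared write per thread
        });
    }
    for (auto& th : pool) th.join();
    std::uint64_t total = 0;
    for (auto h : hits) total += h;
    return total;
}
```

Because no variable is read or written by more than one thread inside the loop, the threads do not have to keep a shared cache line in sync on every iteration.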

C++ multithreads run time issue

I have been studying C++ multithreading and have a question about it.
Here is what I understand about multithreading.
One of the reasons we use multithreads is to reduce the run time, right?
For example, I think if we use two threads we can expect half of the execution time.
So, I tried to code to prove it.
Here is the code.
#include <vector>
#include <iostream>
#include <thread>
#include <future>
using namespace std;
#define iterationNumber 1000000
void myFunction(const int index, const int numberInThread, promise<unsigned long>&& p, const vector<int>& numberList) {
    clock_t begin,end;
    int firstIndex = index * numberInThread;
    int lastIndex = firstIndex + numberInThread;
    vector<int>::const_iterator first = numberList.cbegin() + firstIndex;
    vector<int>::const_iterator last = numberList.cbegin() + lastIndex;
    vector<int> numbers(first,last);
    unsigned long result = 0;
    begin = clock();
    for(int i = 0 ; i < numbers.size(); i++) {
        result += numbers[i];
    }
    end = clock();
    cout << "thread" << index << " took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;
    p.set_value(result);
}
int main(void)
{
    vector<int> numberList;
    vector<thread> t;
    vector<future<unsigned long>> futures;
    vector<unsigned long> result;
    const int NumberOfThreads = thread::hardware_concurrency() ?: 2;
    int numberInThread = iterationNumber / NumberOfThreads;
    clock_t begin,end;
    for(int i = 0 ; i < iterationNumber ; i++) {
        int randomN = rand() % 10000 + 1;
        numberList.push_back(randomN);
    }
    for(int j = 0 ; j < NumberOfThreads; j++){
        promise<unsigned long> promises;
        futures.push_back(promises.get_future());
        t.push_back(thread(myFunction, j, numberInThread, std::move(promises), numberList));
    }
    for_each(t.begin(), t.end(), std::mem_fn(&std::thread::join));
    for (int i = 0; i < futures.size(); i++) {
        result.push_back(futures[i].get());
    }
    unsigned long RRR = 0;
    begin = clock();
    for(int i = 0 ; i < numberList.size(); i++) {
        RRR += numberList[i];
    }
    end = clock();
    cout << "not by thread took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;
}
Because the hardware concurrency of my laptop is 4, it will create 4 threads and each takes a quarter of numberList and sum up the numbers.
However, the result was different than I expected.
thread0 took 0.007232
thread1 took 0.007402
thread2 took 0.010035
thread3 took 0.011759
not by thread took 0.009654
Why? Why did it take more time than the serial version (not by thread)?
For example, I think if we use two threads we can expect half of the
execution time.
You'd think so, but sadly, that is often not the case in practice. The ideal "N cores means 1/Nth the execution time" scenario occurs only when the N cores can execute completely in parallel, without any core's actions interfering with the performance of the other cores.
But what your threads are doing is just summing up different sub-sections of an array... surely that can benefit from being executed in parallel? The answer is that in principle it can, but on a modern CPU, simple addition is so blindingly fast that it isn't really a factor in how long it takes a loop to complete. What really does limit the execute speed of a loop is access to RAM. Compared to the speed of the CPU, RAM access is very slow -- and on most desktop computers, each CPU has only one connection to RAM, regardless of how many cores it has. That means that what you are really measuring in your program is the speed at which a big array of integers can be read in from RAM to the CPU, and that speed is roughly the same -- equal to the CPU's memory-bus bandwidth -- regardless of whether it's one core doing the reading-in of the memory, or four.
To demonstrate how much RAM access is a factor, below is a modified/simplified version of your test program. In this version of the program, I've removed the big vectors, and instead the computation is just a series of calls to the (relatively expensive) sin() function. Note that in this version, the loop is only accessing a few memory locations, rather than thousands, and thus a core that is running the computation loop will not have to periodically wait for more data to be copied in from RAM to its local cache:
#include <vector>
#include <iostream>
#include <thread>
#include <chrono>
#include <math.h>
using namespace std;
static int iterationNumber = 1000000;
unsigned long long threadElapsedTimeMicros[10];
unsigned long threadResults[10];
void myFunction(const int index, const int numberInThread)
{
    unsigned long result = 666;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for(int i=0; i<numberInThread; i++) result += 100*sin(result);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    threadResults[index] = result;
    threadElapsedTimeMicros[index] = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
    // We'll print out the value of threadElapsedTimeMicros[index] later on,
    // after all the threads have been join()'d.
    // If we printed it out now it might affect the timing of the other threads
    // that may still be executing
}
int main(void)
{
    vector<thread> t;
    const int NumberOfThreads = thread::hardware_concurrency();
    const int numberInThread = iterationNumber / NumberOfThreads;
    // Multithreaded approach
    std::chrono::steady_clock::time_point allBegin = std::chrono::steady_clock::now();
    for(int j = 0 ; j < NumberOfThreads; j++) t.push_back(thread(myFunction, j, numberInThread));
    for(int j = 0 ; j < NumberOfThreads; j++) t[j].join();
    std::chrono::steady_clock::time_point allEnd = std::chrono::steady_clock::now();
    for(int j = 0 ; j < NumberOfThreads; j++) cout << " The computations in thread #" << j << ": result=" << threadResults[j] << ", took " << threadElapsedTimeMicros[j] << " microseconds" << std::endl;
    cout << " Total time spent doing multithreaded computations was " << std::chrono::duration_cast<std::chrono::microseconds>(allEnd - allBegin).count() << " microseconds in total" << std::endl;
    // And now, the single-threaded approach, for comparison
    unsigned long result = 666;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for(int i = 0 ; i < iterationNumber; i++) result += 100*sin(result);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
    cout << "result=" << result << ", single-threaded computation took " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds" << std::endl;
    return 0;
}
When I run the above program on my dual-core Mac mini (i7 with hyperthreading), here are the results I get:
Jeremys-Mac-mini:~ lcsuser1$ g++ -std=c++11 -O3 ./temp.cpp
Jeremys-Mac-mini:~ lcsuser1$ ./a.out
The computations in thread #0: result=1062, took 11718 microseconds
The computations in thread #1: result=1062, took 11481 microseconds
The computations in thread #2: result=1062, took 11525 microseconds
The computations in thread #3: result=1062, took 11230 microseconds
Total time spent doing multithreaded computations was 16492 microseconds in total
result=1181, single-threaded computation took 49846 microseconds
So in this case the results are more like what you'd expect -- because memory access was not a bottleneck, each core was able to run at full speed, and complete its 25% portion of the total calculations in about 25% of the time that it took a single thread to complete 100% of the calculations... and since the four cores were running truly in parallel, the total time spent doing the calculations was about 33% of the time it took for the single-threaded routine to complete (ideally it would be 25% but there's some overhead involved in starting up and shutting down the threads, etc).
This is an explanation, for the beginner.
It's not technically accurate, but IMHO not that far from it that anyone takes damage from reading it.
It provides an entry into understanding the parallel processing terms.
Threads, Tasks, and Processes
It is important to know the difference between threads, and processes.
By default starting a new process, allocates a dedicated memory for that process. So they share memory with no other processes, and could (in theory) be run on separate computers.
(You can share memory with other processes, via operating system, or "shared memory", but you have to add these features, they are not by default available for your process)
Having multiple cores means that the each running process can be executed on any idle core.
So basically one program runs on one core, another program runs on a second core, and the background service doing something for you, runs on a third, (and so on and so forth)
Threads is something different.
For instance all processes will run in a main thread.
The operating system implements a scheduler, that is supposed to allocate cpu time for programs. In principle it will say:
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
you get the idea..
The scheduler typically can prioritize between threads, so some programs get more CPU time than others.
The scheduler can of course schedule threads on all cores, but if it does this within a process (splits a process's threads over multiple cores), there can be a performance penalty, as each core holds its own very fast memory cache.
Since threads from the same process can access the same cache, sharing memory between threads is quite fast.
Accessing another core's cache is not as fast (if even possible without going via RAM), so in general schedulers will not split a process over multiple cores.
The result is that all the threads belonging to a process run on the same core.
| Core 1 | Core 2 | Core 3 |
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
A process can spawn multiple threads, they all share the parent threads memory area, and will normally all run on the core that the parent was running on.
It makes sense to spawn threads within a process, if you have an application that needs to respond to something that it cannot control the timing of.
E.g. the user presses a cancel button, or attempts to move a window, while the application is running calculations that take a long time to complete.
Responsiveness of the UI requires the application to spend time reading and handling what the user is attempting to do. This could be achieved in a main loop, if the program does parts of the calculation in each iteration.
However, that gets complicated real fast, so instead of having the calculation code exit in the middle of a calculation to check and update the UI and then continue, you run the calculation code in another thread.
The scheduler then makes sure that the UI thread, and the calculation thread gets CPU time, so the UI responds to user input, while the calculation continues..
And your code stays fairly simple.
But I want to run my calculations on another core to gain speed
To distribute calculations on multiple cores, you could spawn a new process for each calculation job. In this way the scheduler will know that each process gets its own memory, and it can easily be launched on an idle core.
However you have a problem, you need to share memory with the other process, so it knows what to do.
A simple way of doing this, is sharing memory via the filesystem.
You could create a file with the data for the calculation, and then spawn a thread governing the execution (and communication) with another program, (so your UI is responsive, while we wait for the results).
The governing thread runs the other program via system commands, which starts it as another process.
The other program will be written such that it runs with the input file as input argument, so we can run it in multiple instances, on different files.
If the program self terminates when it's done, and creates an output file, it can run on any core, (or multiple) and your process can read the output file.
This actually works, and should the calculation take a long time (like many minutes) this is perhaps ok, even though we use files to communicate between our processes.
For calculations that only takes seconds, however, the file system is slow, and waiting for it will almost remove the gained performance of using processes instead of just using threads. So other more efficient memory sharing is used in real life. For instance creating a shared memory area in RAM.
The "create governing thread, and spawn subprocess, allow communication with process via governing thread, collect data when process is complete, and expose via governing thread" can be implemented in multiple ways.
Well "tasks" is ambiguous.
In general it means "Process or thread that solves a task".
However, in certain languages like C#, it is something that implements a thread like thing, that the scheduler can treat as a process. Other languages that provide a similar feature typically dubs this either tasks or workers.
So with workers/tasks it appears to the programmer as if it was merely a thread, that you can share memory with easily, via references, and control like any other thread, by invoking methods on the thread.
But it appears to the scheduler as if it's a process that can be run on any core.
It implements the shared memory problem in a fairly efficient way, as part of the language, so the programmer won't have to re-invent this wheel for all tasks.
This is often referred to as "Hybrid threading" or simply "parallel threads"
Seems that you have some misconception about multi-threading. Simply using two threads cannot halve the processing time.
Multi-threading is a kind of complicated concept but you can easily find related materials on the web. You should read one of them first. But I will try to give a simple explanation with an example.
No matter how many CPUs(or cores) you have, the total handling capacity of the CPU will be always the same whether you use multi-thread or not, right? Then, where does the performance difference come from?
When a program runs on a device (computer), it uses not only the CPU but also other system resources such as the network, RAM, hard drives, etc. If the flow of the program is serialized, there will be certain points in time when the CPU is idle, waiting for other system resources to finish. But if the program runs with multiple threads (multiple flows), then when one thread becomes idle (waiting for work done by other system resources), the other threads can use the CPU. Therefore, you can minimize the idle time of the CPU and improve performance. This is one of the simplest examples of multi-threading.
Since your sample code is almost 'only CPU-consuming', using multi-thread could bring little improvement of performance. Sometimes it can be worse because multi-threading also comes with time cost of context-switching.
FYI, parallel processing is not the same as multi-threading.
This is very good to point out the problems with macs.
Provided you use an OS that can schedule threads in a useful manner, you have to consider whether a problem is basically one subproblem repeated many times. An example is matrix multiplication. When you multiply two matrices, certain parts of the computation are independent of the others. A 3x3 matrix times another 3x3 matrix requires 9 dot products, which can be computed independently of each other; each dot product requires 3 multiplications and 2 additions, and the multiplications must be done first. So if we wanted to use a multithreaded processor for this task, we could use 9 cores or threads, and given that they get equal compute time or have the same priority level (which is adjustable on Windows), you would reduce the time to multiply two 3x3 matrices by a factor of 9. This is because we are essentially doing something 9 times, which can be done at the same time by 9 people.
Now, for each of the 9 threads we could have 3 cores perform the multiplications, totaling 3x9=27 cores altogether, which ideally reduces the time to t/27. But we also have 18 additions, and here we get no gain from more cores: within each dot product, one addition must be piped into the next. So the problem takes time t with one core, or ideally about t/27 with 27 cores working together. Now you can see why problems are often sought out if they are 'linear': they can be done in parallel pretty well, like graphics for example (some things like backside culling are sorting problems and inherently not linear, so parallel processing gives diminished performance boosts).
Then there is added overhead of starting threads and how they are scheduled by the o.s. and processor. Hope this helps.

C++ Multithreaded prime counter between specified range

#include <math.h>
#include <sstream>
#include <iostream>
#include <mutex>
#include <stdlib.h>
#include <chrono>
#include <thread>
bool isPrime(int number) {
    int i;
    for (i = 2; i < number; i++) {
        if (number % i == 0) {
            return false;
        }
    }
    return true;
}
std::mutex myMutex;
int pCnt = 0;
int icounter = 0;
int limit = 0;
int getNext() {
    std::lock_guard<std::mutex> guard(myMutex);
    icounter++;
    return icounter;
}
void primeCnt() {
    std::lock_guard<std::mutex> guard(myMutex);
    pCnt++;
}
void primes() {
    while (getNext() <= limit)
        if (isPrime(icounter))
            primeCnt();
}
int main(int argc, char *argv[]) {
    std::stringstream ss(argv[2]);
    int tCount;
    ss >> tCount;
    std::stringstream ss1(argv[4]);
    int lim;
    ss1 >> lim;
    limit = lim;
    auto t1 = std::chrono::high_resolution_clock::now();
    std::thread *arr;
    arr = new std::thread[tCount];
    for (int i = 0; i < tCount; i++)
        arr[i] = std::thread(primes);
    for (int i = 0; i < tCount; i++)
        arr[i].join();
    auto t2 = std::chrono::high_resolution_clock::now();
    std::cout << "Primes: " << pCnt << std::endl;
    std::cout << "Program took: " << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count() <<
    " milliseconds" << std::endl;
    return 0;
}
Hello, I'm trying to find the number of prime numbers in a user-specified range, e.g. 1-1000000, with a user-specified number of threads to speed up the process. However, it seems to take the same amount of time for any number of threads as for one thread. I'm not sure if it's supposed to be that way or if there's a mistake in my code. Thank you in advance!
You don't see a performance gain because the time spent in isPrime() is much smaller than the time the threads spend contending for the mutex.
One possible solution is to use atomic operations, as @The Badger suggested. The other way is to partition your task into smaller ones and distribute them over your thread pool.
For example, if you have n threads, then each thread should test the numbers from i*(limit/n) to (i+1)*(limit/n), where i is the thread number. This way you wouldn't need any synchronization at all and your program would (theoretically) scale linearly.
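The partitioning described above could be sketched roughly as follows. This is a minimal illustration, not the asker's code: the names isPrimeNaive and countPrimesPartitioned are made up here, the range is assumed to be [1, limit], and 0 and 1 are treated as non-prime.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Trial division in the spirit of the question's isPrime
// (hypothetical helper; 0 and 1 are rejected explicitly).
bool isPrimeNaive(int number) {
    if (number < 2) return false;
    for (int i = 2; i < number; i++) {
        if (number % i == 0) return false;
    }
    return true;
}

// Counts the primes in [1, limit]; thread t works on its own sub-range
// (t*limit/n, (t+1)*limit/n] and shares nothing while it is running.
int countPrimesPartitioned(int limit, int nThreads) {
    assert(limit >= 0 && nThreads > 0);
    std::vector<int> partial(nThreads, 0);  // one result slot per thread
    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; t++) {
        // 64-bit arithmetic so that t * limit cannot overflow.
        int begin = static_cast<int>(static_cast<long long>(t) * limit / nThreads) + 1;
        int end = static_cast<int>(static_cast<long long>(t + 1) * limit / nThreads);
        workers.emplace_back([&partial, t, begin, end] {
            int local = 0;  // private tally, no locking needed
            for (int n = begin; n <= end; n++) {
                if (isPrimeNaive(n)) local++;
            }
            partial[t] = local;  // single write when the thread is done
        });
    }
    int total = 0;
    for (int t = 0; t < nThreads; t++) {
        workers[t].join();
        total += partial[t];
    }
    return total;
}
```

Calling countPrimesPartitioned(limit, tCount) would take the place of the thread loop in main. Note that equal-sized ranges are not equal amounts of work with trial division, since larger candidates need more divisions, so the thread with the highest range is likely to finish last.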
Multithreaded algorithms work best when threads can do a lot of work on their own.
Imagine doing this in real life: you have a group of 20 humans that will do work for you, and you want them to test whether each number up to 1000 is prime. How will you do this?
Would you hand each person a single number at a time, and ask them to come back to you to tell you if it's prime and to receive another number?
Surely not; you would give each person a bunch of numbers to work on at once, and have them come back and tell you how many were prime and to receive another bunch of numbers.
Maybe you'd even divide up the entire set of numbers into 20 groups and tell each person to work on one group. (But then you run the risk of one person being slow and everyone else sitting idle while you wait for that one person to finish. There are so-called "work stealing" algorithms that address this, but that's complicated.)
The same thing applies here; you want each thread to do a lot of work on its own and keep its own tally, and only have to check back with the centralized information once in a while.
A better solution would be to use the Sieve of Atkin to find the primes (even the Sieve of Eratosthenes, which is easier to understand, is better); your basic algorithm is very poor to start with. For every number n in your interval it does up to n checks to determine whether n is prime, and it does this limit times. This means you're doing up to about limit*limit/2 checks, which is what we call O(n^2) complexity. The Sieve of Atkin, on the other hand, only has to do O(n) operations to find all primes. If n is large, it is hard to beat an algorithm that has fewer steps by performing the steps faster. Trying to fix a poor algorithm by throwing more resources at it is a bad strategy.
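For comparison, here is a minimal single-threaded sketch of the simpler of the two sieves mentioned, the Sieve of Eratosthenes (not the Sieve of Atkin). The function name countPrimesSieve is made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts the primes in [2, limit] with the Sieve of Eratosthenes.
int countPrimesSieve(int limit) {
    assert(limit >= 0);
    if (limit < 2) return 0;
    // composite[k] becomes true once k is known to have a smaller prime factor.
    std::vector<bool> composite(static_cast<std::size_t>(limit) + 1, false);
    int count = 0;
    for (int p = 2; p <= limit; p++) {
        if (composite[p]) continue;  // already crossed out
        count++;                     // p is prime
        // Cross out multiples of p, starting at p*p; smaller multiples
        // were already crossed out by smaller primes.
        for (long long m = 1LL * p * p; m <= limit; m += p) {
            composite[static_cast<std::size_t>(m)] = true;
        }
    }
    return count;
}
```

Even without any threads, this does far less work than testing each candidate by trial division, which is the point the answer is making.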
Another problem with your implementation is that it has race conditions and is therefore broken to start with. There is often little use in optimizing something unless you first make sure it works correctly. The problem is in the primes function:
void primes() {
    while (getNext() <= limit)
        if( isPrime(icounter) )
            primeCnt();
}
Between the getNext() and isPrime() calls another thread may have increased icounter, causing the program to skip candidates. This results in the program giving a different result each time. In addition, isPrime(icounter) reads icounter without holding the mutex while other threads modify it, which is a data race; the value returned by getNext() should be tested instead of the global.
Since the problem is CPU-intensive, that is, almost all of the time is spent executing CPU instructions, multithreading won't help unless you have multiple CPUs (or cores) on which the OS schedules the threads of the same process. This means there is a limit on the number of threads up to which you can expect improved performance (it can be as low as 1; I, for example, see an improvement only for two threads, and none beyond that). If you have more threads than cores, the OS will just let one thread run for a while on a core and then switch threads and let the next thread execute for a while.
An additional problem that may arise when threads are scheduled on different cores is that each core may have a separate cache (which is faster than the shared cache). In effect, if two threads access the same memory, the separate caches have to be brought back in sync as part of the synchronization of the data involved, and this may be time-consuming.
That is, you have to strive to keep the data that the different threads work on separate and minimize the use of common variable data. In your example that means you should avoid the global data as much as possible. The counter, for example, need only be accessed when the counting has finished (to add the thread's contribution to the count). You could also reduce the use of icounter by not reading it for each candidate, but fetching a bunch of candidates in one go. Something like:
void primes() {
    int next;
    int count=0;
    while( (next = getNext(1000)) <= limit ) {
        for( int j = next; j < next+1000 && j <= limit ; j++ ) {
            if( isPrime(j) )
                count++;
        }
    }
    primeCnt(count);
}
where getNext works as before, except that it reserves a number of candidates (by increasing icounter by the supplied count), and primeCnt adds count to pCnt.
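A self-contained variant of this batching idea might look like the sketch below. It is an illustration under assumptions, not the answerer's code: it swaps the mutex-protected getNext for std::atomic's fetch_add, and the names isPrimeTrial and countPrimesChunked as well as the default chunk size are made up here.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Trial division (hypothetical helper; 0 and 1 are not prime).
bool isPrimeTrial(int number) {
    if (number < 2) return false;
    for (int i = 2; i < number; i++) {
        if (number % i == 0) return false;
    }
    return true;
}

// Counts the primes in [1, limit]. Each thread reserves `chunk` candidates
// at a time and keeps a private tally that it publishes once at the end.
int countPrimesChunked(int limit, int nThreads, int chunk = 1000) {
    assert(limit >= 0 && nThreads > 0 && chunk > 0);
    std::atomic<long long> nextCandidate{1};  // first candidate not yet reserved
    std::atomic<int> total{0};

    auto worker = [&] {
        int count = 0;
        for (;;) {
            // Reserve [first, first + chunk) in one atomic step.
            long long first = nextCandidate.fetch_add(chunk);
            if (first > limit) break;
            long long last = std::min<long long>(first + chunk - 1, limit);
            for (long long j = first; j <= last; j++) {
                if (isPrimeTrial(static_cast<int>(j))) count++;
            }
        }
        total += count;  // one shared update per thread
    };

    std::vector<std::thread> workers;
    for (int t = 0; t < nThreads; t++) workers.emplace_back(worker);
    for (auto& w : workers) w.join();
    return total.load();
}
```

Because threads come back for more work whenever they finish a chunk, a slow thread does not leave the others idle, unlike a fixed split of the range. The chunk size trades contention on the shared counter against load balance.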
Consequently, you may end up in a situation where a core runs one thread, then after a while switches to another thread, and so on. As a result you have to run all the code for your problem plus the code for switching between threads. Add to that the fact that you will probably have more cache misses, and this will probably even be slower.
Perhaps, instead of a mutex, try using an atomic integer for the counters. It might speed things up a bit, though I'm not sure by how much.
#include <atomic>
#include <cstdint>
std::atomic<uint64_t> pCnt{0}; // Made uint64 for bigger range as @IgnisErus mentioned
std::atomic<uint64_t> icounter{0};
uint64_t getNext() {
    return ++icounter; // atomic increment, returns the new value
}
void primeCnt() {
    ++pCnt;
}
When benchmarking, the processor usually needs to warm up to reach its best performance, so timing a single run is not always a good representation of the actual performance. Try running the code many times and taking the average. You can also try doing some heavy work before the calculation (a long for loop calculating the power of some counter?).
Getting accurate benchmark results is also a topic of interest for me since I do not yet know how to do it.
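One rough way to follow that advice is a small helper that does a few untimed warm-up runs and then averages over several timed runs. This is only a sketch: the name averageMillis and its parameters are made up here, and it is not a substitute for a proper benchmarking framework.

```cpp
#include <cassert>
#include <chrono>

// Runs `work` warmupRuns times untimed, then timedRuns times timed,
// and returns the average duration of one timed run in milliseconds.
template <typename Work>
double averageMillis(Work work, int warmupRuns, int timedRuns) {
    assert(warmupRuns >= 0 && timedRuns > 0);
    for (int i = 0; i < warmupRuns; i++) work();  // warm up caches and clocks
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < timedRuns; i++) work();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / timedRuns;
}
```

It could be called with a lambda wrapping the code under test, for example averageMillis([&] { result = countSomething(); }, 2, 10), where countSomething stands for whatever is being measured. The work should produce a result that is actually used, otherwise the compiler may optimize it away.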

breaking the round robin for loop

#include <iostream>
#include <fstream>
#include <ctime>
#include <omp.h>
using namespace std;
static long num_steps = 100;
#define NUM 8
double step;
int main()
{
    clock_t time = clock();
    ofstream result("Result.txt");
    int a[100];
    double pi, sum=0.0;
    step = 1.0/(double) num_steps;
    #pragma omp parallel num_threads(NUM)
    {
        int i, ID;
        double x, psum= 0.0;
        int nthreads = omp_get_num_threads();
        ID = omp_get_thread_num();
        // round-robin: thread ID takes iterations ID, ID+nthreads, ...
        for (i=ID;i<= num_steps; i+=nthreads)
        {
            x = (i+0.5)*step;
            psum += 4.0/(1.0+x*x);
        }
        // add this thread's partial sum to the shared total
        #pragma omp critical
        sum += psum;
    }
    pi = step * sum;
    for (int j=0;j<100;j++)
    {
        // (loop body not shown in the post)
    }
    time = clock() - time;
    result << "Time Elapsed: " << (((double)time)/CLOCKS_PER_SEC) << endl;
    result <<"======================================================================================="<<endl;
}
The question is:
for (i=ID;i<= num_steps; i+=nthreads)
This for loop assigns the iterations to the threads in the following order:
01234567 01234567 01234567 etc...
The assignment is to change the for loop so that the iterations are distributed to the threads in equal contiguous blocks and not in a round-robin way: first the zeroes, then the ones, then the twos, ... and finally the sevens.
How should I change the for loop?
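One possible way to get such a block distribution is to compute a contiguous range per thread from the thread ID. The sketch below is plain C++ without OpenMP so that it can be checked on its own; BlockRange, blockRange and ownerOfEachIteration are names made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Half-open iteration range [begin, end) owned by one thread.
struct BlockRange {
    long begin;
    long end;
};

// Splits numSteps iterations into nthreads contiguous blocks whose sizes
// differ by at most one; the first (numSteps % nthreads) threads get one extra.
BlockRange blockRange(int id, int nthreads, long numSteps) {
    assert(nthreads > 0 && id >= 0 && id < nthreads && numSteps >= 0);
    long base = numSteps / nthreads;
    long extra = numSteps % nthreads;
    long begin = id * base + (id < extra ? id : extra);
    long end = begin + base + (id < extra ? 1 : 0);
    return {begin, end};
}

// For illustration: which thread ID would execute each iteration.
std::vector<int> ownerOfEachIteration(int nthreads, long numSteps) {
    std::vector<int> owner(static_cast<std::size_t>(numSteps), -1);
    for (int id = 0; id < nthreads; id++) {
        BlockRange r = blockRange(id, nthreads, numSteps);
        for (long i = r.begin; i < r.end; i++) owner[static_cast<std::size_t>(i)] = id;
    }
    return owner;
}
```

Inside the parallel region, the round-robin loop would then become something like BlockRange r = blockRange(ID, nthreads, num_steps); for (i = r.begin; i < r.end; i++) { ... }. With 100 steps and 8 threads, threads 0 to 3 get 13 iterations each and threads 4 to 7 get 12. OpenMP can also do this split itself: a #pragma omp for loop with schedule(static) and no chunk size hands each thread one contiguous block of roughly equal size.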
You must use some kind of thread synchronization for that ...
You tagged Visual Studio, so I assume the Windows platform ...
Lately this has become my favourite:
// init
// start lock
// stop lock
// exit
But there are many other ways.
You can also try to make your own locks or lock-less threads,
but be aware that newer OSes like Windows 7 have a different process scheduler
that tends to behave erratically.
By that I mean that lock-less code that worked 100% on previous OSes is now choppy or freezes,
so I prefer to use OS locks.
If you use locks wrongly, you risk losing any benefit of the multi-threaded speed-up.
If you are just worried that your solution does not run the threads at the same time
(not in parallel but serially), then in your case it can be caused by this:
processing time granularity
any scheduled task is divided into chunks of time.
If your task is too short, it is done before the other tasks even begin execution.
To test that, try a bigger payload (compute time > a few seconds):
enlarge the number of cycles greatly
add Sleep(time ms) to have a longer computation time
if the output becomes mixed, then that was it
if not, then you are still under the granularity boundary
or your multi-thread code is wrong
wrong multi-thread code
are you sure your threads are created/running at the same time?
or do you synchronize on something wrong? (like until the end of the previous task)
also some compilers make a big deal of volatile variables (adding locks to them, which sometimes does very weird things ... I have stumbled on it many times, but mostly on MCU platforms and Eclipse)
single core
in some cases, if you have just 1 CPU/core/computer for processing
or have just set the affinity mask to a single CPU,
then for some algorithms the Windows schedulers do not schedule the CPU time evenly,
regardless of the process/thread priority/class.
Something similar sometimes appears on Windows 7 even with more CPUs ...
especially with code mixed with kernel-mode code.
To play with granularity you can use this:
// obtain OS time capabilities
// set new granularity
if (timeBeginPeriod(time ms)!=TIMERR_NOERROR) log("time granularity out of range");
// return to previous granularity
timeEndPeriod(time ms); // must be the same value as passed to timeBeginPeriod
PS. Very good stuff about this is here: