Boost Thread_Group in a loop is very slow

Boost Thread_Group in a loop is very slow - c++

I wanted to use threading to run check multiple images in a vector at the same time. Here is the code
boost::thread_group tGroup;
for (int line = 0;line < sourceImageData.size(); line++) {
for (int pixel = 0;pixel < sourceImageData[line].size();pixel++) {
for (int im = 0;im < m_images.size();im++) {
tGroup.create_thread(boost::bind(&ClassX::ClassXFunction, this, line, pixel, im));
}
tGroup.join_all();
}
}
This creates the thread group and loops thru lines of pixel data and each pixel and then multiple images. Its a weird project but anyway I bind the thread to a method in the same instance of the class this code is in so "this" is used. This runs through a population of about 20 images, binding each thread as it goes and then when it is done looping the join_all function takes effect when the threads are done. Then it goes to the next pixel and starts over again.
I'v tested running 50 threads at the same time with this simple program
void run(int index) {
for (int i = 0;i < 100;i++) {
std::cout << "Index : " <<index<<" "<<i << std::endl;
}
}
int main() {
boost::thread_group tGroup;
for (int i = 0;i < 50;i++){
tGroup.create_thread(boost::bind(run, i));
}
tGroup.join_all();
int done;
std::cin >> done;
return 0;
}
This works very quickly. Even though the method the threads are bound to in the previous program is more complicated it shouldn't be as slow as it is. It takes like 4 seconds for one loop of sourceImageData (line) to complete. I'm new to boost threading so I don't know if something is blatantly wrong with the nested loops or otherwise. Any insight is appreciated.

The answer is simple. Don't start that many threads. Consider starting as many threads as you have logical CPU cores. Starting threads is very expensive.
Certainly never start a thread just to do one tiny job. Keep the threads and give them lots of (small) tasks using a task queue.
See here for a good example where the number of threads was similarly the issue: boost thread throwing exception "thread_resource_error: resource temporarily unavailable"
In this case I'd think you can gain a lot of performance by increasing the size of each task (don't create one per pixel, but per scan-line for example)

I believe the difference here is in when you decide to join the threads.
In the first piece of code, you join the threads at every pixel of the supposed source image. In the second piece of code, you only join the threads once at the very end.
Thread synchronization is expensive and often a bottleneck for parallel programs because you are basically pausing execution of any new threads until ALL threads that need to be synchronized, which in this case is all the threads that are active, are done running.
If the iterations of the innermost loop(the one with im) are not dependent on each other, I would suggest you join the threads after the entire outermost loop is done.

Related

Parallelizing for loop with openMP inside of while loop?

I have a program structure similarly to this:
ssize_t remain = nsamp;
while (!nsamp || remain > 0) {
#pragma omp parallel for num_threads(nthread)
for (ssize_t ii=0; ii < nthread; ii++) {
<generate noise>
}
// write noise
out.write(data, nthread*PERITER);
remain -= nthread*PERITER;
}
The problem is, when I benchmark the output of this, if I run with eg: two threads, sometimes it takes ~ the same time as a single thread, and sometimes I get a 2x speedup, it feels like there's some sort of synchronization race condition that I'm running into, sometimes I hit it and things go smoothly and sometimes (often) not.
Does anyone know what might be causing this and what the right way to parallelize a section inside of an outer while loop is?
Edit: Using strace, I see a lot of calls to sched_yield() This is probably making it look like I'm doing a lot on the CPU but I'm fighting the scheduler for a good scheduling pattern.

You are creating a new bunch of threads each time the while loop gets entered. After the parallel loop, the threads are destroyed. Because of the nature of a while loop, this might happen irregularily (depending on the condition).
So if your loops gets executed only a few times, then the thread creation process might overweigh the actual workload and thus you get at most sequential performance, if not less. However, maybe the parallel system (OpenMP) can detect if the loop is entered many times to keep threads alive.
Nothing guaranteed though.

I'd suggest something like this.
For nsamp == 0 you'll need some more reasonable handling. For proper Signal handling with OpenMP, please refer to this answer.
ssize_t remain = nsamp;
#pragma omp parallel num_threads(nthread) shared(out, remain, data)
while (remain > 0) {
#pragma omp for
for (ssize_t ii=0; ii < nthread; ii++) {
/* generate noise */
}
#pragma omp single
{
// write noise
out.write(data, nthread*PERITER);
remain -= nthread*PERITER;
}
}

Incomplete multi-threading RayTracer taking twice as much time as expected

I am making a MT Ray-Tracer multithreading, and as the title says, its taking twice as much to execute as the single thread version. Obviously the purpose is to cut the render time by the half, however what I am doing now is just to send the ray-tracing method to run twice, one for each thread, basically executing the same rendering twice. Nonetheless, as threads can run in parallel, shall not there be a meaningful increase in execution time. But is about doubling.
This has to be related to my multithreading setup. I think its related to the fact I create them as joinable. So I am going to explain what I am doing and also put the related code to see if someone can confirm if that's the issue.
I create two threads and set them as joinable so. Create a RayTracer that allocates enough memory to store the image pixels (this is done in the constructor). Run a two iterations loop for sending relevant info for each thread, like the thread id and the adress of the Raytracer instance.
Then pthread_create calls run_thread, whose purpose is to call the ray_tracer:draw method where the work is done. On the draw method, I have a
pthread_exit (NULL);
as the last thing on it (the only MT thing on it). Then do another loop to join the threads. Finally I star to write the file in a small loop. Finally close the file and delete the pointers related to the array used to store the image in the draw method.
I may not need to use to join now that I am not doing a "real" multithreading ray-tracer, just rendering it twice, but as soon as I start alternate between the image pixels (ie, thread0 -> renders pixel0 - thread0 -> stores pixel0, thread1 -> renders pixel1 - thread1 -> stores pixel1, thread0 -> renders pixel2 - thread0 -> stores pixel2, , thread1 -> renders pixel3 - thread1 -> stores pixel3,etc...) I think I will need it so to be able to write the pixels in correct order on a file.
Is that correct? Do I really need to use join here with my method (or with any other?). If I do, how can I send the threads to run concurrently, not waiting for the other to complete? Is the problem totally unrelated to join?
pthread_t threads [2];
thread_data td_array [2];
pthread_attr_t attr;
void *status;
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
TGAManager tgaManager ("z.tga",true);
if (tgaManager.isFileOpen()) {
tgaManager.writeHeadersData (image);
RayTracer rt (image.getHeight() * image.getWidth());
int rc;
for (int i=0; i<2; i++) {
//cout << "main() : creating thread, " << i << endl;
td_array[i].thread_id=i;
td_array[i].rt_ptr = &rt;
td_array[i].img_ptr = ℑ
td_array[i].scene_ptr = &scene;
//cout << "td_array.thread_index: " << td_array[i].thread_id << endl;
rc = pthread_create (&threads[i], NULL, RayTracer::run_thread, &td_array[i]);
}
if (rc) {
cout << "Error:unable to create thread," << rc << endl;
exit(-1);
}
pthread_attr_destroy(&attr);
for (int i=0; i<2; i++ ) {
rc = pthread_join(threads[i], &status);
if (rc) {
cout << "Error:unable to join," << rc << endl;
exit(-1);
}
}
//tgaManager.writeImage (rt,image.getSize());
for (int i=0; i<image.getWidth() * image.getHeight(); i++) {
cout << i << endl;
tgaManager.file_writer.put (rt.b[i]);
tgaManager.file_writer.put (rt.c[i]);
tgaManager.file_writer.put (rt.d[i]);
}
tgaManager.closeFile(1);
rt.deleteImgPtr ();
}

You do want to join() the threads, because if you don't, you have several problems:
How do you know when the threads have finished executing? You don't want to start writing out the resulting image only to find that it wasn't fully calculated at the moment you wrote it out.
How do you know when it is safe to tear down any data structures that the threads might be accessing? For example, your RayTracer object is on the stack, and (AFAICT) your threads are writing into its pixel-array. If your main function returns before the the threads have exited, there is a very good chance that the threads will sometimes end up writing into a RayTracer object that no longer exists, which will corrupt the stack by overwriting whatever other objects might exist (by happenstance) at those same locations after your function returned.
So you definitely need to join() your threads; you don't need to explicitly declare them as PTHREAD_CREATE_JOINABLE, though, since that attribute is already set by default anyway.
Joining the threads should not cause the threads to slow down, as long as both threads are created and running before you call join() on any of them (which appears to be the case in your posted code).
As for why you are seeing a slowdown with two threads, that's hard to say since a slowdown could be coming from a number of places. Some possibilities:
Something in your ray-tracing code is locking a mutex, such that for much of the ray-tracing run, only one of the two threads is allowed to execute at a time anyway.
Both threads are writing to the same memory locations at around the same time, and that is causing cache-contention which slows down the execution of both threads.
My suggestion would be to set your threads so that thread #1 renders only the top half of the image, and thread #2 renders only the bottom half of the image; that way when they write their output they will be writing to different sections of memory.
If that doesn't help, you might temporarily replace the rendering code with something simpler (e.g. a "renderer" that just sets pixels to random values) to see if you can see a speedup with that. If so, then there might be something in your RayTracer's implementation that isn't multithreading-friendly.

what is the optimal Multithreading scenario for processing a long file lines?

I have a big file and i want to read and also [process] all lines (even lines) of the file with multi threads.
One suggests to read the whole file and break it to multiple files (same count as threads), then let every thread process a specific file. as this idea will read the whole file, write it again and read multiple files it seems to be slow (3x I/O) and i think there must be better scenarios,
I myself though this could be a better scenario:
One thread will read the file and put the data on a global variable and other threads will read the data from that variable and process. more detailed:
One thread will read the main file with running func1 function and put each even line on a Buffer: line1Buffer of a max size MAX_BUFFER_SIZE and other threads will pop their data from the Buffer and process it with running func2 function. in code:
Global variables:
#define MAX_BUFFER_SIZE 100
vector<string> line1Buffer;
bool continue = true;// to end thread 2 to last thread by setting to false
string file = "reads.fq";
Function func1 : (thread 1)
void func1(){
ifstream ifstr(file.c_str());
for (long long i = 0; i < numberOfReads; i++) { // 2 lines per read
getline(ifstr,ReadSeq);
getline(ifstr,ReadSeq);// reading even lines
while( line1Buffer.size() == MAX_BUFFER_SIZE )
; // to delay when the buffer is full
line1Buffer.push_back(ReadSeq);
}
continue = false;
return;
}
And function func2 : (other threads)
void func2(){
string ReadSeq;
while(continue){
if(line2Buffer.size() > 0 ){
ReadSeq = line1Buffer.pop_back();
// do the proccessing....
}
}
}
About the speed:
If the reading part is slower so the total time will be equal to reading the file for just one time(and the buffer may just contain 1 file at each time and hence just 1 another thread will be able to work with thread 1). and if the processing part is slower then the total time will be equal to the time for the whole processing with numberOfThreads - 1 threads. both cases is faster than reading the file and writing in multiple files with 1 thread and then read the files with multi threads and process...
and so there is 2 question:
1- how to call the functions by threads the way thread 1 runs func1 and others run func2 ?
2- is there any faster scenario?
3-[Deleted] anyone can extend this idea to M threads for reading and N threads for processing? obviously we know :M+N==umberOfThreads is true
Edit: the 3rd question is not right as multiple threads can't help in reading a single file
Thanks All

An other approach could be interleaved thread.
Reading is done by every thread, but only 1 at once.
Because of the waiting in the very first iteration, the
threads will be interleaved.
But this is only an scaleable option, if work() is the bottleneck
(then every non-parallel execution would be better)
Thread:
while (!end) {
// should be fair!
lock();
read();
unlock();
work();
}
basic example: (you should probably add some error-handling)
void thread_exec(ifstream* file,std::mutex* mutex,int* global_line_counter) {
std::string line;
std::vector<std::string> data;
int i;
do {
i = 0;
// only 1 concurrent reader
mutex->lock();
// try to read the maximum number of lines
while(i < MAX_NUMBER_OF_LINES_PER_ITERATION && getline(*file,line)) {
// only the even lines we want to process
if (*global_line_counter % 2 == 0) {
data.push_back(line);
i++;
}
(*global_line_counter)++;
}
mutex->unlock();
// execute work for every line
for (int j=0; j < data.size(); j++) {
work(data[j]);
}
// free old data
data.clear();
//until EOF was not reached
} while(i == MAX_NUMBER_OF_LINES_PER_ITERATION);
}
void process_data(std::string file) {
// counter for checking if line is even
int global_line_counter = 0;
// open file
ifstream ifstr(file.c_str());
// mutex for synchronization
// maybe a fair-lock would be a better solution
std::mutex mutex;
// create threads and start them with thread_exec(&ifstr, &mutex, &global_line_counter);
std::vector<std::thread> threads(NUM_THREADS);
for (int i=0; i < NUM_THREADS; i++) {
threads[i] = std::thread(thread_exec, &ifstr, &mutex, &global_line_counter);
}
// wait until all threads have finished
for (int i=0; i < NUM_THREADS; i++) {
threads[i].join();
}
}

What is your bottleneck? Hard disk or processing time?
If it's the hard disk, then you're probably not going to get any more performance out as you've hit the limits of the hardware. Concurrent reads are by far faster than trying to jump around the file. Having multiple threads trying to read your file will almost certainly reduce the overall speed as it will increase disk thrashing.
A single thread reading the file and a thread pool (or just 1 other thread) to deal with the contents is probably as good as you can get.
Global variables:
This is a bad habit to get into.

Assume having #p treads, two scenarios mentioned in the post and answers:
1) Reading with 'a' thread and processing with other threads, in this case #p-1 thread will process in comparison with only one thread reading. assume the time for full operation is jobTime and time for processing with n threads is pTime(n) so:
worst case occurs when reading time is very slower than processing and jobTime = pTime(1)+readTime and the best case is when the processing is slower than reading in which jobTime is equal to pTime(#p-1)+readTime
2) read and process with all #p threads. in this scenario every thread needs to do two steps. first step is to read a part of the file with size MAX_BUFFER_SIZE which is sequential; means no two threads can read at one time. but the second part is processing the read data which can be parallel. this way in the worst case jobTime is pTime(1)+readTime as before (but*), but the best optimized case is pTime(#p)+readTime which is better than previous.
*: in 2nd approach's worst case, however reading is slower but you can find a optimized MAX_BUFFER_SIZE in which (in the worst case) some reading with one thread will overlaps with some processing with another thread. with this optimized MAX_BUFFER_SIZE the jobTime will be less than pTime(1)+readTime and could diverge to readTime

First off, reading a file is a slow operation so unless you are doing some superheavy processing, the file reading will be limiting.
If you do decide to go the multithreaded route a queue is the right approach. Just make sure you push in front an pop out back. An stl::deque should work well. Also you will need to lock the queue with a mutex and sychronize it with a conditional variable.
One last thing is you will need to limit the size if the queue for the scenario where we are pushing faster than we are popping.

C++ multithreads run time issue

I have been studying C++ multithreads and get a question about it.
Here is what I am understanding about multithreads.
One of the reasons we use multithreads is to reduce the run time, right?
For example, I think if we use two threads we can expect half of the execution time.
So, I tried to code to prove it.
Here is the code.
#include <vector>
#include <iostream>
#include <thread>
#include <future>
using namespace std;
#define iterationNumber 1000000
void myFunction(const int index, const int numberInThread, promise<unsigned long>&& p, const vector<int>& numberList) {
clock_t begin,end;
int firstIndex = index * numberInThread;
int lastIndex = firstIndex + numberInThread;
vector<int>::const_iterator first = numberList.cbegin() + firstIndex;
vector<int>::const_iterator last = numberList.cbegin() + lastIndex;
vector<int> numbers(first,last);
unsigned long result = 0;
begin = clock();
for(int i = 0 ; i < numbers.size(); i++) {
result += numbers.at(i);
}
end = clock();
cout << "thread" << index << " took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;
p.set_value(result);
}
int main(void)
{
vector<int> numberList;
vector<thread> t;
vector<future<unsigned long>> futures;
vector<unsigned long> result;
const int NumberOfThreads = thread::hardware_concurrency() ?: 2;
int numberInThread = iterationNumber / NumberOfThreads;
clock_t begin,end;
for(int i = 0 ; i < iterationNumber ; i++) {
int randomN = rand() % 10000 + 1;
numberList.push_back(randomN);
}
for(int j = 0 ; j < NumberOfThreads; j++){
promise<unsigned long> promises;
futures.push_back(promises.get_future());
t.push_back(thread(myFunction, j, numberInThread, std::move(promises), numberList));
}
for_each(t.begin(), t.end(), std::mem_fn(&std::thread::join));
for (int i = 0; i < futures.size(); i++) {
result.push_back(futures.at(i).get());
}
unsigned long RRR = 0;
begin = clock();
for(int i = 0 ; i < numberList.size(); i++) {
RRR += numberList.at(i);
}
end = clock();
cout << "not by thread took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;
}
Because the hardware concurrency of my laptop is 4, it will create 4 threads and each takes a quarter of numberList and sum up the numbers.
However, the result was different than I expected.
thread0 took 0.007232
thread1 took 0.007402
thread2 took 0.010035
thread3 took 0.011759
not by thread took 0.009654
Why? Why it took more time than serial version(not by thread).

For example, I think if we use two threads we can expect half of the
execution time.
You'd think so, but sadly, that is often not the case in practice. The ideal "N cores means 1/Nth the execution time" scenario occurs only when the N cores can execute completely in parallel, without any core's actions interfering with the performance of the other cores.
But what your threads are doing is just summing up different sub-sections of an array... surely that can benefit from being executed in parallel? The answer is that in principle it can, but on a modern CPU, simple addition is so blindingly fast that it isn't really a factor in how long it takes a loop to complete. What really does limit the execute speed of a loop is access to RAM. Compared to the speed of the CPU, RAM access is very slow -- and on most desktop computers, each CPU has only one connection to RAM, regardless of how many cores it has. That means that what you are really measuring in your program is the speed at which a big array of integers can be read in from RAM to the CPU, and that speed is roughly the same -- equal to the CPU's memory-bus bandwidth -- regardless of whether it's one core doing the reading-in of the memory, or four.
To demonstrate how much RAM access is a factor, below is a modified/simplified version of your test program. In this version of the program, I've removed the big vectors, and instead the computation is just a series of calls to the (relatively expensive) sin() function. Note that in this version, the loop is only accessing a few memory locations, rather than thousands, and thus a core that is running the computation loop will not have to periodically wait for more data to be copied in from RAM to its local cache:
#include <vector>
#include <iostream>
#include <thread>
#include <chrono>
#include <math.h>
using namespace std;
static int iterationNumber = 1000000;
unsigned long long threadElapsedTimeMicros[10];
unsigned long threadResults[10];
void myFunction(const int index, const int numberInThread)
{
unsigned long result = 666;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for(int i=0; i<numberInThread; i++) result += 100*sin(result);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
threadResults[index] = result;
threadElapsedTimeMicros[index] = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
// We'll print out the value of threadElapsedTimeMicros[index] later on,
// after all the threads have been join()'d.
// If we printed it out now it might affect the timing of the other threads
// that may still be executing
}
int main(void)
{
vector<thread> t;
const int NumberOfThreads = thread::hardware_concurrency();
const int numberInThread = iterationNumber / NumberOfThreads;
// Multithreaded approach
std::chrono::steady_clock::time_point allBegin = std::chrono::steady_clock::now();
for(int j = 0 ; j < NumberOfThreads; j++) t.push_back(thread(myFunction, j, numberInThread));
for(int j = 0 ; j < NumberOfThreads; j++) t[j].join();
std::chrono::steady_clock::time_point allEnd = std::chrono::steady_clock::now();
for(int j = 0 ; j < NumberOfThreads; j++) cout << " The computations in thread #" << j << ": result=" << threadResults[j] << ", took " << threadElapsedTimeMicros[j] << " microseconds" << std::endl;
cout << " Total time spent doing multithreaded computations was " << std::chrono::duration_cast<std::chrono::microseconds>(allEnd - allBegin).count() << " microseconds in total" << std::endl;
// And now, the single-threaded approach, for comparison
unsigned long result = 666;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for(int i = 0 ; i < iterationNumber; i++) result += 100*sin(result);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
cout << "result=" << result << ", single-threaded computation took " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds" << std::endl;
return 0;
}
When I run the above program on my dual-core Mac mini (i7 with hyperthreading), here are the results I get:
Jeremys-Mac-mini:~ lcsuser1$ g++ -std=c++11 -O3 ./temp.cpp
Jeremys-Mac-mini:~ lcsuser1$ ./a.out
The computations in thread #0: result=1062, took 11718 microseconds
The computations in thread #1: result=1062, took 11481 microseconds
The computations in thread #2: result=1062, took 11525 microseconds
The computations in thread #3: result=1062, took 11230 microseconds
Total time spent doing multithreaded computations was 16492 microseconds in total
result=1181, single-threaded computation took 49846 microseconds
So in this case the results are more like what you'd expect -- because memory access was not a bottleneck, each core was able to run at full speed, and complete its 25% portion of the total calculations in about 25% of the time that it took a single thread to complete 100% of the calculations... and since the four cores were running truly in parallel, the total time spent doing the calculations was about 33% of the time it took for the single-threaded routine to complete (ideally it would be 25% but there's some overhead involved in starting up and shutting down the threads, etc).

This is an explanation, for the beginner.
It's not technically accurate, but IMHO not that far from it that anyone takes damage from reading it.
It provides an entry into understanding the parallel processing terms.
Threads, Tasks, and Processes
It is important to know the difference between threads, and processes.
By default starting a new process, allocates a dedicated memory for that process. So they share memory with no other processes, and could (in theory) be run on separate computers.
(You can share memory with other processes, via operating system, or "shared memory", but you have to add these features, they are not by default available for your process)
Having multiple cores means that the each running process can be executed on any idle core.
So basically one program runs on one core, another program runs on a second core, and the background service doing something for you, runs on a third, (and so on and so forth)
Threads is something different.
For instance all processes will run in a main thread.
The operating system implements a scheduler, that is supposed to allocate cpu time for programs. In principle it will say:
Program A, get 0.01 seconds, than pause!
Program B, get 0.01 seconds, then pause!
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
you get the idea..
The scheduler typically can prioritize between threads, so some programs get more CPU time than others.
The scheduler can of course schedule threads on all cores, but if it does this within a process, (splits a process's threads over multiple cores) there can be a performance penalty as each core holds it's own very fast memory cache.
Since threads from the same process can access the same cache, sharing memory between threads is quite fast.
Accessing another cores cache is not as fast, (if even possible without going via RAM), so in general schedulers will not split a process over multiple cores.
The result is that all the threads belonging to a process runs on the same core.
| Core 1 | Core 2 | Core 3 |
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
A process can spawn multiple threads, they all share the parent threads memory area, and will normally all run on the core that the parent was running on.
It makes sense to spawn threads within a process, if you have an application that needs to respond to something that it cannot control the timing of.
I.E. the users presses on a cancel button, or attempts to move a window, while the application is running calculations that takes a long time to complete.
Responsiveness of the UI, requires the application to spend time reading, and handling what the user is attempting to do. This could be achieved in a main loop, if the program does parts of the calculation in each iteration.
However that get's complicated real fast, so instead of having the calculation code, exit in the middle of a calculation to check the UI, and update the UI, and then continue. You run the calculation code in another thread.
The scheduler then makes sure that the UI thread, and the calculation thread gets CPU time, so the UI responds to user input, while the calculation continues..
And your code stays fairly simple.
But I want to run my calculations another core to gain speed
To distribute calculations on multiple cores, you could spawn a new process for each calculation job. In this way the scheduler will know that each process get's it's own memory, and it can easily be launched on an idle core.
However you have a problem, you need to share memory with the other process, so it knows what to do.
A simple way of doing this, is sharing memory via the filesystem.
You could create a file with the data for the calculation, and then spawn a thread governing the execution (and communication) with another program, (so your UI is responsive, while we wait for the results).
The governing thread runs the other program via system commands, which starts it as another process.
The other program will be written such that it runs with the input file as input argument, so we can run it in multiple instances, on different files.
If the program self terminates when it's done, and creates an output file, it can run on any core, (or multiple) and your process can read the output file.
This actually works, and should the calculation take a long time (like many minutes) this is perhaps ok, even though we use files to communicate between our processes.
For calculations that only takes seconds, however, the file system is slow, and waiting for it will almost remove the gained performance of using processes instead of just using threads. So other more efficient memory sharing is used in real life. For instance creating a shared memory area in RAM.
The "create governing thread, and spawn subprocess, allow communication with process via governing thread, collect data when process is complete, and expose via governing thread" can be implemented in multiple ways.
Tasks
Well "tasks" is ambiguous.
In general it means "Process or thread that solves a task".
However, in certain languages like C#, it is something that implements a thread like thing, that the scheduler can treat as a process. Other languages that provide a similar feature typically dubs this either tasks or workers.
So with workers/tasks it appears to the programmer as if it was merely a thread, that you can share memory with easily, via references, and control like any other thread, by invoking methods on the thread.
But it appears to the scheduler as if it's a process that can be run on any core.
It implements the shared memory problem in a fairly efficient way, as part of the language, so the programmer won't have to re-invent this wheel for all tasks.
This is often referred to as "Hybrid threading" or simply "parallel threads"

Seems that you have some misconception about multi-threading. Simply using two threads cannot halve the processing time.
Multi-threading is a kind of complicated concept but you can easily find related materials on the web. You should read one of them first. But I will try to give a simple explanation with an example.
No matter how many CPUs(or cores) you have, the total handling capacity of the CPU will be always the same whether you use multi-thread or not, right? Then, where does the performance difference come from?
When a program runs on a device(computer) it uses not only CPU but also other system resources such as Networks, RAM, Hard drives, etc. If the flow of the program is serialized there will be a certain point of time when the CPU is idle waiting for other system resources to get done. But, in the case that the program runs with multiple threads(multiple flow), if a thread turns to idle(waiting some tasks done by other system resources) the other threads can use the CPU. Therefore, you can minimize the idle time of the CPU and improve the time performance. This is one of the most simple example about multi-threading.
Since your sample code is almost 'only CPU-consuming', using multi-thread could bring little improvement of performance. Sometimes it can be worse because multi-threading also comes with time cost of context-switching.
FYI, parallel processing is not the same as multi-threading.

This is very good to point out the problems with macs.
Provided you use a o.s. that can schedule threads in a useful manner, you have to consider if a problem is basically the product of 1 problem many times. An example is matrix multiplication. When you multiply 2 matrices there is a certain parts of it which are independent of the others. A 3x3 matrix times another 3x3 requires 9 dot products which can be computed independently of the others, which themselves require 3 multiplications and 2 additions but here the multiplications must be done first. So we see if we wanted to utilize multithreaded processor for this task we could use 9 cores or threads and given they get equal compute time or have same priority level (which is adjustable on windows) you would reduce the time to multiply a 3x3 matrices by 9. This is because we are essentially doing something 9 times which can be done at the same time by 9 people.
now for each of 9 threads we could have 3 cores perform multiplications totaling 3x9=24 cores all together now. Reducing time by t/24. But we have 18 additions and here we can get no gain from more cores. One addition must be piped into another. And the problem takes time t with one core or time t/24 ideally with 24 cores working together. Now you can see why problems are often seeked out if they are 'linear' because they can be done in parallel pretty good like graphics for example (some things like backside culling are sorting problems and inherently not linear so parallel processing has diminished performance boosts).
Then there is added overhead of starting threads and how they are scheduled by the o.s. and processor. Hope this helps.

c++ pthread limit number of thread

I tried to use pthread to do some task faster. I have thousands files (in args) to process and i want to create just a small number of thread many times.
Here's my code :
void callThread(){
int nbt = 0;
pthread_t *vp = (pthread_t*)malloc(sizeof(pthread_t)*NBTHREAD);
for(int i=0;i<args.size();i+=NBTHREAD){
for(int j=0;j<NBTHREAD;j++){
if(i+j<args.size()){
pthread_create(&vp[j],NULL,calcul,&args[i+j]);
nbt++;
}
}
for(int k=0;k<nbt;k++){
if(pthread_join(vp[k], NULL)){
cout<<"ERROR pthread_join()"<<endl;
}
}
}
}
It returns error, i don't know if it's a good way to solve my problem. All the resources are in args (vector of struct) and are independants.
Thanks for help.

You're better off making a thread pool with as many threads as the number of cores the cpu has. Then feed the tasks to this pool and let it do its job. You should take a look at this blog post right here for a great example of how to go about creating such thread pool.
A couple of tips that are not mentioned in that post:
Use std::thread::hardware_concurrency() to get the number of cores.
Figure out a way how to store the tasks, hint: std::packaged_task or something along
those lines wrapped in a class so you can track things such as when a task is done, or implement task.join().
Also, github with the code of his implementation plus some extra stuff such as std::future support can be found here.

You can use a semaphore to limit the number of parallel threads, here is a pseudo code:
Semaphore S = MAX_THREADS_AT_A_TIME // Initial semaphore value
declare handle_array[NUM_ITERS];
for(i=0 to NUM_ITERS)
{
wait-while(S<=0);
Acquire-Semaphore; // S--
handle_array[i] = Run-Thread(MyThread);
}
for(i=0 to NUM_ITERS)
{
Join_thread(handle_array[i])
Close_handle(handle_array[i])
}
MyThread()
{
mutex.lock
critical-section
mutex.unlock
release-semaphore // S++
}

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js