Mixing user threads and OpenCV built-in multithreading

I develop a C++ application that needs to process different images at the same time. The processing algorithm is built on top of OpenCV and uses parallelism functionalities.
The application works in the following way: for each image it has, it spawns a thread to execute the processing algorithm. Unfortunately it seems that this scheme does not work well with OpenCV internal multithreading.
Minimal example:
#include <iostream>
#include <thread>
#include <chrono>
#include <functional> // std::ref
#include <opencv2/core.hpp>
void run(int thread_id, cv::Mat& mat)
{
auto start = std::chrono::steady_clock::now();
// multithreaded operation on mat
mat.forEach<float>([](float& pixel, int const* position) {
std::this_thread::sleep_for(std::chrono::milliseconds(1));
});
auto end = std::chrono::steady_clock::now();
std::cout << "thread " << thread_id << " took "
<< std::chrono::duration<double>(end - start).count() << " sec" // seconds, whatever the clock's tick period is
<< std::endl;
}
int main()
{
cv::Mat mat1(100, 100, CV_32F), mat2(100, 100, CV_32F);
std::thread t1(run, 1, std::ref(mat1));
std::thread t2(run, 2, std::ref(mat2));
t1.join();
t2.join();
return 0;
}
Output on my machine:
thread 1 took 1.42477 sec
thread 2 took 12.1963 sec
It seems that the second operation is not taking advantage of multithreading. Looking at my CPU usage, I have the feeling that OpenCV assigns all its internal threads to the first operation and, when the second one arrives, there is no internal thread left. Thus, the second operation is executed sequentially in the body of the application thread.
Firstly, I would appreciate it if someone who has already faced similar issues with OpenCV could confirm that my hypothesis is correct.
Secondly, is there a way to dispatch internal OpenCV resources more intelligently? For example, by assigning half of the threads to the first operation and half to the second one?
Multithreading objective
After writing my question, I realize that the purpose of doing multithreading at the application level might be unclear. Some people may argue that it suffices to run the two operations sequentially at the application level to take full advantage of internal OpenCV multithreading. This is true for the minimal example I posted here, but typically not all parts of processing algorithms can be run in parallel.
The idea behind multithreading at application level is to try to run a maximum of 'unparallelisable' operations at the same time:
Operations 1 and 2 sequentially:
[-----seq 1----][-par 1 (full power)-][-----seq 2----][-par 2 (full power)-]
Operations 1 and 2 in parallel:
[-----seq 1----][------------par 1 (half power)------------]
[-----seq 2----][------------par 2 (half power)------------]
seq X = sequential task of operation X
par X = parallelisable task of operation X
We can see that application-level multithreading reduces the total computation time, because the sequential parts of different operations are run concurrently.

I think your approach to multithreading is correct. I ran the code you provided and here's my output:
thread 1 took 2.30654 sec
thread 2 took 2.63872 sec
Maybe you should check the number of available threads for your program?
std::cout << std::thread::hardware_concurrency() << std::endl;

Related

Why is my thread execution jumping between CPU cores?

I recently started experimenting with std::thread and tried running a small program that displays the webcam feed in a separate thread, using OpenCV. I am just doing this for "educational" purposes. What I noticed was that the thread seemed to keep jumping between cores, which struck me as odd since I thought the overhead of such a change would not be worth it from an efficiency/performance point of view. Does anybody know the reason for this behavior?
Short disclaimer --> I am new to StackOverflow so if I missed something, please let me know.
[Image: a snapshot of my system monitor - Ubuntu]
#include <stdio.h>
#include <iostream> // std::cout
#include <opencv2/opencv.hpp> //openCV functionality
#include <time.h> //timing functionality
#include <thread>
using namespace cv;
using namespace std;
void webcam_func(){
Mat image;
namedWindow("Display window");
VideoCapture cap(0);
if (!cap.set(CAP_PROP_AUTO_EXPOSURE , 10)){
std::cout <<"Exposure could not be set!" <<std::endl;
//return -1 ;
}
if (!cap.isOpened()) {
cout << "cannot open camera";
}
int i = 0;
while (i < 1000000) {
cap >> image;
Size s = image.size();
int rows = s.height;
int cols = s.width;
imshow("Display window", image);
double fps = cap.get(CAP_PROP_FPS);
//cout << "Frames per second using video.get(CAP_PROP_FPS) : " << fps << endl;
//cout <<"The height of the video is " <<rows <<endl;
//cout <<"The width of the video is " <<cols <<endl;
std::thread::id this_id = std::this_thread::get_id();
std::cout << "thread id --> " << this_id <<std::endl;
waitKey(25);
i++ ;
std::cout <<"Counter value " <<i <<std::endl;
}
}
int main() {
std::thread t1(webcam_func);
while(true){
}
return 0;
}

The default Linux scheduler schedules tasks (e.g. threads) for a given quantum (time slice) on the available processing units (e.g. cores or hardware threads). This quantum can be cut short if a task goes to sleep or waits for something (input, locks, etc.). waitKey(25) does exactly that: it makes your thread wait for a short period of time. The thread's execution is interrupted and a context switch takes place, so the OS can run other tasks in the meantime. When the computing thread is ready again (because more than 25 ms have elapsed), the scheduler can schedule it again. It tries to run the task on the same processing unit to reduce overheads (e.g. cache misses), but that processing unit may still be in use by another thread when the computing task is scheduled back. This is unlikely when there are not many ready tasks, or only greedy ones, though.
Additionally, some processors support SMT (a.k.a. hyper-threading). For example, many x86-64 Intel processors support 2 hardware threads per core that share the same caches. Context switches between 2 hardware threads on the same core are significantly cheaper (e.g. far fewer cache misses).
Also note that the Linux scheduler, like most other schedulers, is not perfect. In fact, it was buggy a few years ago and was not even able to fill all available cores when that was possible (see: The Linux Scheduler: a Decade of Wasted Cores). Finally, note that the (direct) overhead of a context switch is no more than a few dozen microseconds on a mainstream Linux PC, so having one every few dozen milliseconds is fine (<1% overhead).

Using threads to split up task is slowing down my work

So I had this question: simple-division-of-labour-over-threads-is-not-reducing-the-time-taken. I thought I had it sorted, but going back to revisit this work, I am no longer getting the crazy slowdown I was getting before (due to the mutex within rand()), but nor am I getting any improvement in total time taken.
In the code I am splitting up a task of x iterations of work over y threads.
So if I want to do 100'000'000 calculations, which in one thread might take ~350ms, then my hope is that in 2 threads (doing 50'000'000 calcs each) it would take ~175ms, in three threads ~115ms, and so on...
I know using threads won't perfectly split the work due to thread overheads and such. But I was hoping for some performance gain at least.
My slightly updated code is here:
Results
1 thread:
starting thread: 1 workload: 100000000
thread: 1 finished after: 303ms val: 7.02066
==========================
thread overall_total_time time: 304ms
3 threads:
starting thread: 1 workload: 33333333
starting thread: 3 workload: 33333333
starting thread: 2 workload: 33333333
thread: 3 finished after: 363ms val: 6.61467
thread: 1 finished after: 368ms val: 6.61467
thread: 2 finished after: 365ms val: 6.61467
==========================
thread overall_total_time time: 368ms
You can see that the 3 threads actually take slightly longer than 1 thread, even though each thread is only doing 1/3 of the work iterations. I see a similar lack of performance gain on my PC at home, which has 8 CPU cores.
It's not like threading overhead should take more than a few milliseconds (IMO), so I can't see what is going on here. I don't believe there are any resource sharing conflicts, because this code is quite simple and uses no external outputs/inputs (other than RAM).
Code For Reference
in godbolt: https://godbolt.org/z/bGWdxE
In main() you can tweak the number of threads and amount of work (loop iterations).
#include <iostream>
#include <vector>
#include <thread>
#include <math.h>
#include <chrono>  // std::chrono::high_resolution_clock
#include <cstdint> // uint32_t
void thread_func(uint32_t interations, uint32_t thread_id)
{
// Print the thread id / workload
std::cout << "starting thread: " << thread_id << " workload: " << interations << std::endl;
// Get the start time
auto start = std::chrono::high_resolution_clock::now();
// do some work for the required number of interations
double val{0};
for (auto i = 1u; i <= interations; i++)
{
val += i / (2.2 * i) / (1.23 * i); // some work
}
// Get the time taken
auto total_time = std::chrono::high_resolution_clock::now() - start;
// Print it out
std::cout << "thread: " << thread_id << " finished after: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
<< "ms" << " val: " << val << std::endl;
}
int main()
{
uint32_t num_threads = 3; // Max 3 in godbolt
uint32_t total_work = 100'000'000;
// Store the start time
auto overall_start = std::chrono::high_resolution_clock::now();
// Start all the threads doing work
std::vector<std::thread> task_list;
for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
{
task_list.emplace_back(std::thread([=](){ thread_func(total_work / num_threads, thread_id); }));
}
// Wait for the threads to finish
for (auto &task : task_list)
{
task.join();
}
// Get the end time and print it
auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
std::cout << "\n==========================\n"
<< "thread overall_total_time time: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
<< "ms" << std::endl;
return 0;
}
Update
I think I have narrowed down my issue:
On my 64Bit VM I see:
Compiling for 32-bit no optimisation: more threads = runs slower!
Compiling for 32-bit with optimisation: more threads = runs a bit faster
Compiling for 64-bit no optimisation: more threads = runs faster (as expected)
Compiling for 64-bit with optimisation: more threads = same as without optimisation, except everything takes less time in general.
So my issue might just be from running 32-bit code on a 64-bit VM. But I don't really understand why adding threads does not work very well if my executable is 32-bit running on a 64-bit architecture...

There are many possible reasons that could explain the observed results, so I do not think that anyone could give you a definitive answer. Also, the majority of reasons have to do with peculiarities of the hardware architecture, so different answers might be right or wrong on different machines.
As already mentioned in the comments, it could very well be that there is something wrong with thread allocation, so you are not really enjoying any benefit from using multiple threads. Godbolt.org is a cloud service, so it is most probably very heavily virtualized, meaning that your threads are competing against who knows how many hundreds of other threads, so I would assign a zero amount of trust to any results from running on godbolt.
A possible reason for the bad performance of unoptimized 32-bit code on a 64-bit VM could be that the unoptimized 32-bit code is not making efficient use of registers, so it becomes memory-bound. The code looks like it would all nicely fit within the CPU cache, but even the cache is considerably slower than direct register access, and the difference is more pronounced in a multi-threaded scenario where multiple threads are competing for access to the cache.
A possible reason for the still not stellar performance of optimized 32-bit code on a 64-bit VM could be that the CPU is optimized for 64-bit use, so instructions are not efficiently pipelined when running in 32-bit mode, or that the arithmetic unit of the CPU is not being used efficiently. It could be that these divisions in your code make all threads contend for the divider circuitry, of which the CPU may have only one, or of which the CPU may have only one when running in 32-bit mode. That would mean that most threads do nothing but wait for the divider to become available.
Note that these situations where a thread is being slowed down due to contention for CPU circuitry are very different from situations where a thread is being slowed down due to waiting for some device to respond. When a device is busy, the thread that waits for it is placed by the scheduler in I/O wait mode, so it is not consuming any CPU. When you have contention for CPU circuitry, the stalling happens inside the CPU; there is no thread context switch, so the thread slows down while it appears as if it is running full speed (consuming a full CPU core.) This may be an explanation for your CPU utilization observations.

Simple division of labour over threads is not reducing the time taken

I have been trying to improve computation times on a project by splitting the work into tasks/threads and it has not been working out very well. So I decided to make a simple test project to see if I can get it working in a very simple case and this also is not working out as I expected it to.
What I have attempted to do is:
do a task X times in one thread - check the time taken.
do a task X / Y times in Y threads - check the time taken.
So if 1 thread takes T seconds to do 100'000'000 iterations of "work" then I would expect:
2 threads doing 50'000'000 iterations each would take ~ T / 2 seconds
3 threads doing 33'333'333 iterations each would take ~ T / 3 seconds
and so on until I reach some threading limit (number of cores or whatever).
So I wrote the code and tested it on my 8 core system (AMD Ryzen) plenty of RAM >16GB doing nothing else at the time.
1 Threads took: ~6.5s
2 Threads took: ~6.7s
3 Threads took: ~13.3s
8 Threads took: ~16.2s
So clearly something is not right here!
I ported the code into Godbolt and I see similar results. Godbolt only allows 3 threads, and for 1, 2 or 3 threads it takes ~8s (this varies by about 1s) to run. Here is the godbolt live code: https://godbolt.org/z/6eWKWr
Finally here is the code for reference:
#include <iostream>
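#include <cmath>   // std::atan
#include <chrono>  // std::chrono clocks
#include <cstdint> // uint32_t
#include <cstdlib> // rand, srand, RAND_MAX
#include <vector>
#include <thread>
// Fully parenthesised so the macro expands safely inside larger expressions
#define randf() (((double) rand()) / ((double) (RAND_MAX)))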
void thread_func(uint32_t interations, uint32_t thread_id)
{
// Print the thread id / workload
std::cout << "starting thread: " << thread_id << " workload: " << interations << std::endl;
// Get the start time
auto start = std::chrono::high_resolution_clock::now();
// do some work for the required number of interations
for (auto i = 0u; i < interations; i++)
{
double value = randf();
double calc = std::atan(value);
(void) calc;
}
// Get the time taken
auto total_time = std::chrono::high_resolution_clock::now() - start;
// Print it out
std::cout << "thread: " << thread_id << " finished after: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(total_time).count()
<< "ms" << std::endl;
}
int main()
{
// Note these numbers vary by about 1s, probably due to godbolt server load (?)
// 1 Threads takes: ~8s
// 2 Threads takes: ~8s
// 3 Threads takes: ~8s
uint32_t num_threads = 3; // Max 3 in godbolt
uint32_t total_work = 100'000'000;
// Seed rand
std::srand(static_cast<unsigned long>(std::chrono::steady_clock::now().time_since_epoch().count()));
// Store the start time
auto overall_start = std::chrono::high_resolution_clock::now();
// Start all the threads doing work
std::vector<std::thread> task_list;
for (uint32_t thread_id = 1; thread_id <= num_threads; thread_id++)
{
task_list.emplace_back(std::thread([=](){ thread_func(total_work / num_threads, thread_id); }));
}
// Wait for the threads to finish
for (auto &task : task_list)
{
task.join();
}
// Get the end time and print it
auto overall_total_time = std::chrono::high_resolution_clock::now() - overall_start;
std::cout << "\n==========================\n"
<< "thread overall_total_time time: "
<< std::chrono::duration_cast<std::chrono::milliseconds>(overall_total_time).count()
<< "ms" << std::endl;
return 0;
}
Note: I have tried using std::async also with no difference (not that I was expecting any). I also tried compiling for release - no difference.
I have read such questions as: why-using-more-threads-makes-it-slower-than-using-less-threads and I can't see an obvious (to me) bottleneck:
CPU bound (needs lots of CPU resources): I have 8 cores
Memory bound (needs lots of RAM resources): I have assigned my VM 10GB ram, running nothing else
I/O bound (Network and/or hard drive resources): No network traffic involved
There is no sleeping/mutexing going on here (like there is in my real project)
Questions are:
Why might this be happening?
What am I doing wrong?
How can I improve this?

The rand function is not guaranteed to be thread safe. It appears that, in your implementation, it is made thread safe by using a lock or mutex, so if multiple threads are trying to generate a random number, they take turns. As your loop is mostly just the call to rand, the performance suffers with multiple threads.
You can use the facilities of the <random> header and have each thread use its own engine to generate the random numbers.

Never mind that rand() is or isn't thread safe. That might be the explanation if a statistician told you that the "random" numbers you were getting were defective in some way, but it doesn't explain the timing.
What explains the timing is that there is only one random state object, it's out in memory somewhere, and all of your threads are competing with each other to access it.
No matter how many CPUs your system has, only one thread at a time can access the same location in main memory.
It would be different if each of the threads had its own independent random state object. Then, most of the accesses from any given CPU to its own private random state would only have to go as far as the CPU's local cache, and they would not conflict with what the other threads, running on other CPUs, each with their own local cache were doing.

How do you set the proper number of threads when executing concurrency task on iOS device?

I developed a cross-platform C++ library which spawns threads at runtime.
I used a concurrent queue to dispatch computing tasks, so every thread is busy most of the time.
Now the question is how to get a proper number of threads at runtime. As my tasks involve no I/O or networking, only calculations and heap-memory allocations, the best strategy would be to spawn one thread per CPU core.
My code looks like this:
#include "concurrentqueue.h"
#include <algorithm>
#include <thread>
#include <vector>
#include <iostream>
#include <mutex>
std::mutex io_m;
struct Task {
int n;
};
void some_time_consuming_operations(Task &t) {
std::vector<int> vec;
for (int i = 0; i < t.n; ++i)
vec.push_back(1);
{
std::lock_guard<std::mutex> g(io_m);
std::cout << "thread " << std::this_thread::get_id() << " done, vec size:" << vec.size() << std::endl;
}
}
int main() {
// moodycamel's lockfree queue: https://github.com/cameron314/concurrentqueue
moodycamel::ConcurrentQueue<Task> tasks;
for (int i = 0; i < 100; ++i)
tasks.enqueue(Task{(i % 5) * 1000000 + 1000000});
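// I left 2 threads for ui and other usages. The subtraction is guarded because
// hardware_concurrency() may return 0 or 1, and an unsigned "hw - 2" would wrap around.
const size_t hw = std::thread::hardware_concurrency();
std::vector<std::thread> jobs(hw > 4 ? hw - 2 : 2);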
std::cout << "thread num:" << jobs.size() << std::endl;
for (auto &job : jobs) {
job = std::thread([&tasks]() {
Task task;
while (tasks.try_dequeue(task))
some_time_consuming_operations(task);
});
}
for (auto &job : jobs)
job.join();
return 0;
}
However, when enabling multi-threading on my iOS device (iPhone XR, A12), the test program is 2 times slower than in single-thread mode. I have tested it on my Windows machine with a 4-core, 8-thread Intel CPU, and there it is 6 times faster than in single-thread mode.
On my iPhone, the hardware_concurrency function returns 6, which is exactly the core count of the Apple A12. On my Windows machine, the number is 8.
I understand that 4 of the cores in Apple's A12 are energy-efficient cores called Tempest, but Apple claimed that the A11/A12 can use all six cores simultaneously (I kept the charger connected during the test), so I have no idea why it is slower than single-thread mode.
The test program is a game app built with UE4.

The four slower cores are a lot slower than the fast cores. So if you took a task that takes 6 seconds on a fast core, and ran one second worth of work on each core, then the two fast cores would finish after a second, while the four slow cores would take maybe ten seconds.
If you use GCD, iOS will shuffle these six threads between the cores, so you can gain up to a factor 2.4 in speed. If your thread implementation doesn't do this, then you are slowing down things.
The solutions: Either use GCD (and get a speedup of 2.4) or use only two threads (and get a speedup of 2.0). That's on an iPhone XR; you'd need to find out the number of fast cores somehow.

C++ multithreads run time issue

I have been studying C++ multithreading and have a question about it.
Here is my understanding of multithreading.
One of the reasons we use multithreads is to reduce the run time, right?
For example, I think if we use two threads we can expect half of the execution time.
So, I tried to code to prove it.
Here is the code.
#include <vector>
#include <iostream>
#include <thread>
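#include <future>
#include <algorithm>  // for_each
#include <functional> // std::mem_fn
#include <cstdlib>    // rand
#include <ctime>      // clock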
using namespace std;
#define iterationNumber 1000000
void myFunction(const int index, const int numberInThread, promise<unsigned long>&& p, const vector<int>& numberList) {
clock_t begin,end;
int firstIndex = index * numberInThread;
int lastIndex = firstIndex + numberInThread;
vector<int>::const_iterator first = numberList.cbegin() + firstIndex;
vector<int>::const_iterator last = numberList.cbegin() + lastIndex;
vector<int> numbers(first,last);
unsigned long result = 0;
begin = clock();
for(int i = 0 ; i < numbers.size(); i++) {
result += numbers.at(i);
}
end = clock();
cout << "thread" << index << " took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;
p.set_value(result);
}
int main(void)
{
vector<int> numberList;
vector<thread> t;
vector<future<unsigned long>> futures;
vector<unsigned long> result;
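// Fall back to 2 threads if hardware_concurrency() returns 0 (standard C++ in place of the GNU "?:" extension)
const int hwThreads = thread::hardware_concurrency();
const int NumberOfThreads = hwThreads > 0 ? hwThreads : 2;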
int numberInThread = iterationNumber / NumberOfThreads;
clock_t begin,end;
for(int i = 0 ; i < iterationNumber ; i++) {
int randomN = rand() % 10000 + 1;
numberList.push_back(randomN);
}
for(int j = 0 ; j < NumberOfThreads; j++){
promise<unsigned long> promises;
futures.push_back(promises.get_future());
t.push_back(thread(myFunction, j, numberInThread, std::move(promises), numberList));
}
for_each(t.begin(), t.end(), std::mem_fn(&std::thread::join));
for (int i = 0; i < futures.size(); i++) {
result.push_back(futures.at(i).get());
}
unsigned long RRR = 0;
begin = clock();
for(int i = 0 ; i < numberList.size(); i++) {
RRR += numberList.at(i);
}
end = clock();
cout << "not by thread took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;
}
Because the hardware concurrency of my laptop is 4, it will create 4 threads, and each takes a quarter of numberList and sums up the numbers.
However, the result was different than I expected.
thread0 took 0.007232
thread1 took 0.007402
thread2 took 0.010035
thread3 took 0.011759
not by thread took 0.009654
Why? Why did it take more time than the serial version (not by thread)?

"For example, I think if we use two threads we can expect half of the execution time."
You'd think so, but sadly, that is often not the case in practice. The ideal "N cores means 1/Nth the execution time" scenario occurs only when the N cores can execute completely in parallel, without any core's actions interfering with the performance of the other cores.
But what your threads are doing is just summing up different sub-sections of an array... surely that can benefit from being executed in parallel? The answer is that in principle it can, but on a modern CPU, simple addition is so blindingly fast that it isn't really a factor in how long it takes a loop to complete. What really does limit the execution speed of a loop is access to RAM. Compared to the speed of the CPU, RAM access is very slow -- and on most desktop computers, each CPU has only one connection to RAM, regardless of how many cores it has. That means that what you are really measuring in your program is the speed at which a big array of integers can be read in from RAM to the CPU, and that speed is roughly the same -- equal to the CPU's memory-bus bandwidth -- regardless of whether it's one core doing the reading-in of the memory, or four.
To demonstrate how much RAM access is a factor, below is a modified/simplified version of your test program. In this version of the program, I've removed the big vectors, and instead the computation is just a series of calls to the (relatively expensive) sin() function. Note that in this version, the loop is only accessing a few memory locations, rather than thousands, and thus a core that is running the computation loop will not have to periodically wait for more data to be copied in from RAM to its local cache:
#include <vector>
#include <iostream>
#include <thread>
#include <chrono>
#include <math.h>
using namespace std;
static int iterationNumber = 1000000;
unsigned long long threadElapsedTimeMicros[10];
unsigned long threadResults[10];
void myFunction(const int index, const int numberInThread)
{
unsigned long result = 666;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for(int i=0; i<numberInThread; i++) result += 100*sin(result);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
threadResults[index] = result;
threadElapsedTimeMicros[index] = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
// We'll print out the value of threadElapsedTimeMicros[index] later on,
// after all the threads have been join()'d.
// If we printed it out now it might affect the timing of the other threads
// that may still be executing
}
int main(void)
{
vector<thread> t;
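// Clamp to [1, 10]: hardware_concurrency() may return 0, and the result arrays above hold only 10 entries
const int hwThreads = (int) thread::hardware_concurrency();
const int NumberOfThreads = hwThreads < 1 ? 1 : (hwThreads > 10 ? 10 : hwThreads);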
const int numberInThread = iterationNumber / NumberOfThreads;
// Multithreaded approach
std::chrono::steady_clock::time_point allBegin = std::chrono::steady_clock::now();
for(int j = 0 ; j < NumberOfThreads; j++) t.push_back(thread(myFunction, j, numberInThread));
for(int j = 0 ; j < NumberOfThreads; j++) t[j].join();
std::chrono::steady_clock::time_point allEnd = std::chrono::steady_clock::now();
for(int j = 0 ; j < NumberOfThreads; j++) cout << " The computations in thread #" << j << ": result=" << threadResults[j] << ", took " << threadElapsedTimeMicros[j] << " microseconds" << std::endl;
cout << " Total time spent doing multithreaded computations was " << std::chrono::duration_cast<std::chrono::microseconds>(allEnd - allBegin).count() << " microseconds in total" << std::endl;
// And now, the single-threaded approach, for comparison
unsigned long result = 666;
std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
for(int i = 0 ; i < iterationNumber; i++) result += 100*sin(result);
std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();
cout << "result=" << result << ", single-threaded computation took " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds" << std::endl;
return 0;
}
When I run the above program on my dual-core Mac mini (i7 with hyperthreading), here are the results I get:
Jeremys-Mac-mini:~ lcsuser1$ g++ -std=c++11 -O3 ./temp.cpp
Jeremys-Mac-mini:~ lcsuser1$ ./a.out
The computations in thread #0: result=1062, took 11718 microseconds
The computations in thread #1: result=1062, took 11481 microseconds
The computations in thread #2: result=1062, took 11525 microseconds
The computations in thread #3: result=1062, took 11230 microseconds
Total time spent doing multithreaded computations was 16492 microseconds in total
result=1181, single-threaded computation took 49846 microseconds
So in this case the results are more like what you'd expect -- because memory access was not a bottleneck, each core was able to run at full speed, and complete its 25% portion of the total calculations in about 25% of the time that it took a single thread to complete 100% of the calculations... and since the four cores were running truly in parallel, the total time spent doing the calculations was about 33% of the time it took for the single-threaded routine to complete (ideally it would be 25% but there's some overhead involved in starting up and shutting down the threads, etc).

This is an explanation, for the beginner.
It's not technically accurate, but IMHO not that far from it that anyone takes damage from reading it.
It provides an entry into understanding the parallel processing terms.
Threads, Tasks, and Processes
It is important to know the difference between threads and processes.
By default, starting a new process allocates dedicated memory for that process. So processes share memory with no other processes, and could (in theory) be run on separate computers.
(You can share memory with other processes via the operating system, or "shared memory", but you have to add these features; they are not available to your process by default.)
Having multiple cores means that each running process can be executed on any idle core.
So basically one program runs on one core, another program runs on a second core, and the background service doing something for you runs on a third (and so on and so forth).
Threads are something different.
For instance, every process runs in a main thread.
The operating system implements a scheduler, that is supposed to allocate cpu time for programs. In principle it will say:
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
Program A, get 0.01 seconds, then pause!
Program B, get 0.01 seconds, then pause!
you get the idea..
The scheduler typically can prioritize between threads, so some programs get more CPU time than others.
The scheduler can of course schedule threads on all cores, but if it does this within a process (splits a process's threads over multiple cores), there can be a performance penalty, as each core holds its own very fast memory cache.
Since threads from the same process can access the same cache, sharing memory between threads is quite fast.
Accessing another core's cache is not as fast (if even possible without going via RAM), so in general schedulers will not split a process over multiple cores.
The result is that all the threads belonging to a process run on the same core.
| Core 1 | Core 2 | Core 3 |
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
| Process A, Thread 1 | Process C, Thread 1 | Process F, Thread 1|
| Process A, Thread 2 | Process D, Thread 1 | Process F, Thread 2|
| Process B, Thread 1 | Process E, Thread 1 | Process F, Thread 3|
A process can spawn multiple threads; they all share the parent thread's memory area, and will normally all run on the core that the parent was running on.
It makes sense to spawn threads within a process if you have an application that needs to respond to something whose timing it cannot control.
E.g. the user presses a cancel button, or attempts to move a window, while the application is running calculations that take a long time to complete.
Responsiveness of the UI requires the application to spend time reading and handling what the user is attempting to do. This could be achieved in a main loop, if the program does part of the calculation in each iteration.
However, that gets complicated real fast, so instead of having the calculation code exit in the middle of a calculation to check and update the UI and then continue, you run the calculation code in another thread.
The scheduler then makes sure that the UI thread and the calculation thread both get CPU time, so the UI responds to user input while the calculation continues.
And your code stays fairly simple.
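The UI-thread-plus-calculation-thread arrangement described above can be sketched in C++ roughly as follows. This is only an illustration: `long_calculation` and `BackgroundJob` are hypothetical names, and the summing loop is a stand-in for whatever long-running work the application really does.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>

// Hypothetical long-running job: sums 0..n-1 and polls a cancel flag on every
// iteration so that it can stop early when asked to.
std::uint64_t long_calculation(std::uint64_t n, const std::atomic<bool>& cancel)
{
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        if (cancel.load(std::memory_order_relaxed)) break;
        sum += i;
    }
    return sum;
}

// Runs the calculation on its own thread, so the calling ("UI") thread stays
// free to handle input in the meantime.
class BackgroundJob {
public:
    explicit BackgroundJob(std::uint64_t n)
        : worker_([this, n] {
              result_.store(long_calculation(n, cancel_));
              done_.store(true);
          }) {}

    // Never leave a running thread behind: ask it to stop, then join it.
    ~BackgroundJob() { request_cancel(); wait(); }

    BackgroundJob(const BackgroundJob&) = delete;
    BackgroundJob& operator=(const BackgroundJob&) = delete;

    void request_cancel() { cancel_.store(true); }        // e.g. cancel button
    bool done() const { return done_.load(); }             // polled by the UI loop
    std::uint64_t result() const { return result_.load(); }

    // Blocks until the worker has finished (normally or after a cancel).
    void wait()
    {
        if (worker_.joinable()) worker_.join();
        assert(done_.load());
    }

private:
    std::atomic<bool> cancel_{false};
    std::atomic<bool> done_{false};
    std::atomic<std::uint64_t> result_{0};
    std::thread worker_;  // declared last so the flags exist before it starts
};
```

A UI loop would construct a `BackgroundJob`, keep handling events, check `done()` on each pass, and call `request_cancel()` when the user presses cancel.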
But I want to run my calculations on another core to gain speed
To distribute calculations over multiple cores, you could spawn a new process for each calculation job. In this way the scheduler knows that each process gets its own memory, and it can easily be launched on an idle core.
However, you have a problem: you need to share memory with the other process, so it knows what to do.
A simple way of doing this is sharing memory via the filesystem.
You could create a file with the data for the calculation, and then spawn a thread governing the execution of (and communication with) another program, so your UI stays responsive while you wait for the results.
The governing thread runs the other program via system commands, which starts it as another process.
The other program is written such that it takes the input file as an input argument, so we can run multiple instances of it on different files.
If the program terminates itself when it is done and creates an output file, it can run on any core (or on multiple), and your process can read the output file.
This actually works, and should the calculation take a long time (like many minutes) this is perhaps OK, even though we use files to communicate between our processes.
For calculations that only take seconds, however, the file system is slow, and waiting for it will remove most of the performance gained by using processes instead of just threads. So other, more efficient ways of sharing memory are used in real life, for instance creating a shared memory area in RAM.
The "create governing thread, and spawn subprocess, allow communication with process via governing thread, collect data when process is complete, and expose via governing thread" can be implemented in multiple ways.
Tasks
Well, "tasks" is ambiguous.
In general it means "process or thread that solves a task".
However, in certain languages like C#, it is something that implements a thread-like thing that the scheduler can treat as a process. Other languages that provide a similar feature typically dub this either tasks or workers.
So with workers/tasks, it appears to the programmer as if it were merely a thread that you can share memory with easily via references, and control like any other thread by invoking methods on it.
But it appears to the scheduler as if it is a process that can be run on any core.
It solves the shared memory problem in a fairly efficient way, as part of the language, so the programmer won't have to reinvent this wheel for every task.
This is often referred to as "hybrid threading" or simply "parallel threads".
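Since the question is about C++, here is a rough sketch of the task style using `std::async` and `std::future` from the standard library. It is not claimed to be equivalent to C# tasks; it only shows the same programming model, where work is handed off and the result is collected later. The helper names `sum_slice` and `parallel_sum` are made up for the illustration, and which core each task ends up on is left entirely to the runtime and the OS.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Body of one task: sums the slice [begin, end) of the shared vector.
long long sum_slice(const std::vector<int>& data, std::size_t begin, std::size_t end)
{
    return std::accumulate(data.begin() + static_cast<std::ptrdiff_t>(begin),
                           data.begin() + static_cast<std::ptrdiff_t>(end), 0LL);
}

// Splits the data into at most `tasks` slices, starts one asynchronous task per
// slice, and adds up the partial results as they become available.
long long parallel_sum(const std::vector<int>& data, std::size_t tasks)
{
    assert(tasks > 0);
    const std::size_t chunk = (data.size() + tasks - 1) / tasks;  // ceiling division

    std::vector<std::future<long long>> futures;
    for (std::size_t begin = 0; begin < data.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, data.size());
        // The data is shared by reference; every task only reads it.
        futures.push_back(std::async(std::launch::async, [&data, begin, end] {
            return sum_slice(data, begin, end);
        }));
    }

    long long total = 0;
    for (auto& f : futures) total += f.get();  // get() blocks until that task is done
    return total;
}
```

Calling `parallel_sum(values, 4)` looks like an ordinary function call to the caller; the memory sharing and the hand-over of results are hidden behind the future objects.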

It seems that you have some misconceptions about multithreading. Simply using two threads does not automatically halve the processing time.
Multithreading is a rather complicated concept, but you can easily find related material on the web, and you should read some of it first. I will try to give a simple explanation with an example.
No matter how many CPUs (or cores) you have, the total processing capacity of the CPU is always the same whether you use multiple threads or not, right? Then where does the performance difference come from?
When a program runs on a device (computer), it uses not only the CPU but also other system resources such as the network, RAM, hard drives, etc. If the flow of the program is serialized, there will be points in time when the CPU is idle, waiting for other system resources to finish. But if the program runs with multiple threads (multiple flows), then when one thread goes idle (waiting for work done by other system resources), the other threads can use the CPU. Therefore, you can minimize the idle time of the CPU and improve the run time. This is one of the simplest examples of multithreading.
Since your sample code is almost purely CPU-bound, using multiple threads may bring little performance improvement. Sometimes it can even be worse, because multithreading also comes with the time cost of context switching.
FYI, parallel processing is not the same as multithreading.
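The idle-time argument above can be tried out with a small C++ sketch. It rests on one assumption: `fake_io` uses a sleep as a stand-in for a blocking disk or network call. The function names are invented for this example.

```cpp
#include <cassert>
#include <chrono>
#include <thread>
#include <vector>

using Clock = std::chrono::steady_clock;
using Millis = std::chrono::milliseconds;

// Stand-in for a blocking I/O call: the thread waits and uses no CPU meanwhile.
void fake_io(Millis wait) { std::this_thread::sleep_for(wait); }

// Performs the waits one after another; returns the elapsed time in ms.
long long run_serial(int jobs, Millis wait)
{
    assert(jobs >= 0);
    const auto start = Clock::now();
    for (int i = 0; i < jobs; ++i) fake_io(wait);
    return std::chrono::duration_cast<Millis>(Clock::now() - start).count();
}

// Performs each wait on its own thread, so the waits overlap; returns the
// elapsed time in ms.
long long run_threaded(int jobs, Millis wait)
{
    assert(jobs >= 0);
    const auto start = Clock::now();
    std::vector<std::thread> threads;
    for (int i = 0; i < jobs; ++i) threads.emplace_back(fake_io, wait);
    for (auto& t : threads) t.join();
    return std::chrono::duration_cast<Millis>(Clock::now() - start).count();
}
```

With four waits of 50 ms each, `run_serial` needs at least 200 ms, while `run_threaded` should finish in little more than 50 ms, because the threads spend their time waiting rather than computing. If `fake_io` were replaced by pure number crunching, the gain would depend on how many cores are actually available.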

It is good that the problems with Macs were pointed out.
Provided you use an OS that can schedule threads in a useful manner, you have to consider whether a problem is basically one problem repeated many times. An example is matrix multiplication. When you multiply 2 matrices, certain parts of the computation are independent of the others. A 3x3 matrix times another 3x3 matrix requires 9 dot products, which can be computed independently of each other. Each dot product itself requires 3 multiplications and 2 additions, but here the multiplications must be done first. So if we wanted to use a multithreaded processor for this task, we could use 9 cores or threads, and given that they get equal compute time or have the same priority level (which is adjustable on Windows), we would reduce the time to multiply two 3x3 matrices by a factor of 9. This is because we are essentially doing something 9 times that can be done at the same time by 9 people.
Now, for each of the 9 threads we could have 3 cores perform the multiplications, totaling 3x9=27 cores altogether and reducing the time towards t/27. But we also have 18 additions, and here we get no gain from more cores: within each dot product, one addition must be piped into the next. So the problem takes time t with one core, or ideally time t/27 with 27 cores working together. Now you can see why problems are often sought out if they are 'linear': they can be done in parallel quite well, like graphics for example (some things like back-face culling are sorting problems and inherently not linear, so parallel processing gives diminished performance boosts).
Then there is the added overhead of starting threads and of how they are scheduled by the OS and processor. Hope this helps.
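As a sketch of the 9-way split described above, the following C++ code gives each of the 9 dot products of a 3x3 product its own thread. The names `Matrix3`, `dot`, and `multiply_threaded` are made up for the example. For matrices this small, the cost of starting the threads will almost certainly outweigh the gain, so this only illustrates how the work divides.

```cpp
#include <array>
#include <cassert>
#include <thread>
#include <vector>

using Matrix3 = std::array<std::array<double, 3>, 3>;

// Dot product of row r of a with column c of b: 3 multiplications, and 2
// additions that depend on each other and therefore run in sequence.
double dot(const Matrix3& a, const Matrix3& b, int r, int c)
{
    assert(r >= 0 && r < 3 && c >= 0 && c < 3);
    double sum = 0.0;
    for (int k = 0; k < 3; ++k) sum += a[r][k] * b[k][c];
    return sum;
}

// Computes a * b with one thread per output element (9 threads in total).
// Every thread writes a different element of `out`, so no locking is needed.
Matrix3 multiply_threaded(const Matrix3& a, const Matrix3& b)
{
    Matrix3 out{};
    std::vector<std::thread> workers;
    workers.reserve(9);
    for (int r = 0; r < 3; ++r) {
        for (int c = 0; c < 3; ++c) {
            workers.emplace_back([&a, &b, &out, r, c] { out[r][c] = dot(a, b, r, c); });
        }
    }
    for (auto& w : workers) w.join();  // wait for all 9 dot products
    return out;
}
```

The same division of work pays off once each independent piece is large enough, for example one thread per block of rows of a big matrix.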
