In the following example the C++11 threads take about 50 seconds to execute, but the OpenMP threads take only 5. Any ideas why? (I can assure you it still holds true if you do real work instead of doNothing, or if you do it in a different order, etc.) I'm on a 16-core machine, too.
#include <iostream>
#include <omp.h>
#include <chrono>
#include <vector>
#include <thread>

using namespace std;

void doNothing() {}

int run(int algorithmToRun)
{
    auto startTime = std::chrono::system_clock::now();
    for (int j = 1; j < 100000; ++j)
    {
        if (algorithmToRun == 1)
        {
            vector<thread> threads;
            for (int i = 0; i < 16; i++)
            {
                threads.push_back(thread(doNothing));
            }
            for (auto& thread : threads) thread.join();
        }
        else if (algorithmToRun == 2)
        {
            #pragma omp parallel for num_threads(16)
            for (unsigned i = 0; i < 16; i++)
            {
                doNothing();
            }
        }
    }
    auto endTime = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = endTime - startTime;
    return elapsed_seconds.count();
}

int main()
{
    int cppt = run(1);
    int ompt = run(2);
    cout << cppt << endl;
    cout << ompt << endl;
    return 0;
}
OpenMP uses thread pools for its pragmas. Spinning up and tearing down threads is expensive; OpenMP avoids that overhead, so all it's doing is the actual work plus the minimal shared-memory shuttling of execution state. In your std::thread code you are spinning up and tearing down a new set of 16 threads on every iteration.
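To get the std::thread version closer to OpenMP's behaviour, create the threads once and let each one run all of its iterations itself. A minimal sketch of that idea, reusing doNothing from the question (the loop bounds match the original, everything else is illustrative):

#include <thread>
#include <vector>

void doNothing() {}

int main()
{
    std::vector<std::thread> threads;
    threads.reserve(16);
    for (int i = 0; i < 16; i++)
    {
        // Each thread is created and joined exactly once; the per-iteration
        // spawn/join of the original is gone, which is what OpenMP's pool
        // buys you.
        threads.emplace_back([] {
            for (int j = 1; j < 100000; ++j)
                doNothing();
        });
    }
    for (auto& t : threads)
        t.join();
}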
I tried the code with 100 loop iterations from Choosing the right threading framework, and it took 0.0727 milliseconds with OpenMP, 0.6759 with Intel TBB, and 0.5962 with the C++ thread library.
I also applied what AruisDante suggested:

void nested_loop(int max_i, int band)
{
    for (int i = 0; i < max_i; i++)
    {
        doNothing(band);
    }
}
...
else if (algorithmToRun == 5)
{
    thread bristle(nested_loop, max_i, band);
    bristle.join();
}

This code seems to take less time than your original C++11 thread section.
Related
I have a program where many threads do some computations and write a boolean true value into a shared array to tag the corresponding item as "dirty". This is a data race, and ThreadSanitizer reports it. Nevertheless, the flag is never read from these threads, and since the same value is written by all of them, I wonder whether it is actually a problematic data race.
Here is a minimal working example:
#include <array>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    constexpr int N = 64;
    std::array<bool, N> dirty{};
    std::vector<std::thread> threads;
    threads.reserve(3 * N);
    for (int j = 0; j != 3; ++j)
        for (int i = 0; i != N; ++i)
            threads.emplace_back([&dirty, i]() -> void {
                if (i % 2 == 0)
                    dirty[i] = true; // data race here.
            });
    for (std::thread& t : threads)
        if (t.joinable())
            t.join();
    for (int i = 0; i != N; ++i)
        if (dirty[i])
            printf("%d\n", i);
    return 0;
}
Compiled with g++ -fsanitize=thread, a data race is reported on the marked line. Under which conditions can this be an actual problem, i.e. the dirty flag for an item i would not be the expected value?
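For comparison, here is a race-free variant of the same program, as a minimal sketch: the plain bool array becomes std::atomic<bool> with relaxed stores (relaxed should suffice here, since the values are only read after the joins, which already provide the ordering):

#include <array>
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

int main()
{
    constexpr int N = 64;
    std::array<std::atomic<bool>, N> dirty{}; // value-initialized: all false
    std::vector<std::thread> threads;
    threads.reserve(3 * N);
    for (int j = 0; j != 3; ++j)
        for (int i = 0; i != N; ++i)
            threads.emplace_back([&dirty, i] {
                if (i % 2 == 0)
                    // No data race: concurrent atomic stores of the same
                    // value are well-defined.
                    dirty[i].store(true, std::memory_order_relaxed);
            });
    for (std::thread& t : threads)
        t.join();
    for (int i = 0; i != N; ++i)
        if (dirty[i].load(std::memory_order_relaxed))
            printf("%d\n", i);
    return 0;
}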
I am working on a project that uses multiple threads to parallelize tasks. During development I noticed that operations on a std::list container, e.g. pop_front() or push_back(), are significantly slower when executed in a thread, compared to single-threaded execution. See the code snippet below:
#include <iostream>
#include <chrono>
#include <vector>
#include <string>
#include <thread>
#include <list>

using namespace std;

void SingleThreadedList()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < 2; i++)
    {
        list<char> l;
        for (int j = 0; j < 10000; j++)
        {
            l.push_back('c');
            l.pop_front();
        }
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    std::cout << "duration single thread: " << duration << endl;
}

void MultiThreadedList()
{
    auto t1 = std::chrono::high_resolution_clock::now();
    auto lambda_fkt = []() {
        list<char> l;
        for (int i = 0; i < 10000; i++)
        {
            l.push_back('c');
            l.pop_front();
        }
    };
    vector<thread*> thread_array;
    for (int i = 0; i < 2; ++i)
    {
        thread* th = new thread(lambda_fkt);
        thread_array.push_back(th);
    }
    for (auto t : thread_array)
    {
        t->join();
    }
    auto t2 = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    std::cout << "duration multi thread: " << duration << endl;
    for (auto t : thread_array)
    {
        delete t;
    }
}

int main() {
    SingleThreadedList();
    MultiThreadedList();
}
the code produces the following output:
duration single thread: 4589
duration multi thread: 245483
The single-thread variant takes about 4.6 ms, but as soon as the threads are created, execution takes more than 200 ms! I cannot imagine that standard library containers show such a performance difference depending on the execution context. Can someone explain what happens in this code and how I can avoid the performance decrease within the threads? Thank you!
PS: When I remove the list operations from this sample and add e.g. some simple math instead, the code quickly shows the expected behavior: the multi-threaded variant gets faster when the computation is split among multiple threads, using multiple cores to get the result.
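To see how much of the gap is thread startup rather than std::list itself, one can time the list work inside each thread, separately from thread creation and join. A minimal sketch of that measurement (same 10000-iteration workload as above, everything else illustrative):

#include <chrono>
#include <cstdio>
#include <list>
#include <thread>
#include <vector>

int main()
{
    // Each worker times only its own list operations and reports the
    // result through a pointer; thread creation/join stays outside.
    auto worker = [](long long* out_us) {
        auto t1 = std::chrono::high_resolution_clock::now();
        std::list<char> l;
        for (int i = 0; i < 10000; i++)
        {
            l.push_back('c');
            l.pop_front();
        }
        auto t2 = std::chrono::high_resolution_clock::now();
        *out_us = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count();
    };
    long long us[2] = {0, 0};
    std::vector<std::thread> threads;
    for (int i = 0; i < 2; ++i)
        threads.emplace_back(worker, &us[i]);
    for (auto& t : threads)
        t.join();
    // If these numbers are close to the single-threaded timing, the gap in
    // the original benchmark is dominated by thread startup (and allocator
    // contention in instrumented/debug builds), not by std::list itself.
    printf("in-thread durations: %lld us, %lld us\n", us[0], us[1]);
}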
I am looking for the fastest way to have multiple threads read from the same small vector of pointers, one which is not static but will only ever be changed by the main thread, and only while the child threads are not reading from it.
I've tried using a shared std::vector of pointers, which is somewhat faster than a shared array of pointers but still slower per thread... I thought the reason was that the threads read so close together in memory, causing false sharing, but I am unsure.
I'm hoping there is either a way around that, since the data is read-only while the threads are accessing it, or an entirely different approach that is faster. Below is a minimal example:
#include <thread>
#include <iostream>
#include <iomanip>
#include <vector>
#include <atomic>
#include <chrono>

namespace chrono = std::chrono;

class A {
public:
    A(int n = 1) {
        a = n;
    }
    int a;
};

void tfunc();

int nelements = 10;
int nthreads = 1;
std::vector<A*> elements;
std::atomic<int> complete;
std::atomic<int> remaining;
std::atomic<int> next;
std::atomic<int> tnow;
int tend = 1000000;

int main() {
    complete = false;
    remaining = 0;
    next = 0;
    tnow = 0;
    for (int i = 0; i < nelements; i++) {
        A* a = new A();
        elements.push_back(a);
    }
    // a vector instead of the original variable-length array,
    // which is a non-standard GCC extension
    std::vector<std::thread> threads(nthreads);
    for (int i = 0; i < nthreads; i++) {
        threads[i] = std::thread(tfunc);
    }
    auto begin = chrono::high_resolution_clock::now();
    while (tnow < tend) {
        remaining = nthreads;
        next = 0;
        tnow += 1;
        while (remaining > 0) {}
        // if {elements} is changed it is changed here
    }
    complete = true;
    for (int i = 0; i < nthreads; i++) {
        threads[i].join();
    }
    auto end = chrono::high_resolution_clock::now();
    auto elapsed = chrono::duration_cast<chrono::microseconds>(end - begin).count();
    std::cout << std::setw(2) << nthreads << " Time - " << elapsed << std::endl;
}

void tfunc() {
    int sum = 0;
    int tpre = 0;
    int curr = 0;
    while (tnow == 0) {}
    while (!complete) {
        if (tnow - tpre > 0) {
            tpre = tnow;
            while (remaining > 0) {
                curr = next++;
                if (curr > nelements) break;
                for (int i = 0; i < nelements; i++) {
                    if (i != curr) {
                        sum += elements[i]->a;
                    }
                }
                remaining--;
            }
        }
    }
}
Which for nthreads between 1 and 10 on my system outputs (the times are in microseconds)
1 Time - 281548
2 Time - 404926
3 Time - 546826
4 Time - 641898
5 Time - 714259
6 Time - 812776
7 Time - 922391
8 Time - 994909
9 Time - 1147579
10 Time - 1199838
I am wondering if there is a faster way to do this or if such a parallel operation will always be slower than serial due to the smallness of the vector.
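One thing worth ruling out is false sharing between the control atomics themselves: complete, remaining, next, and tnow are adjacent globals and very likely share a cache line, so every next++ or remaining-- invalidates the line that every other spinning thread is polling. A sketch of the usual mitigation, padding each atomic to its own cache line (the 64-byte fallback is an assumption; verify for your CPU):

#include <atomic>
#include <iostream>
#include <new> // std::hardware_destructive_interference_size (C++17)

#ifdef __cpp_lib_hardware_interference_size
constexpr std::size_t cache_line = std::hardware_destructive_interference_size;
#else
constexpr std::size_t cache_line = 64; // common cache-line size; an assumption
#endif

// Each counter gets its own cache line, so a write to one no longer
// invalidates the line that threads spinning on the others are reading.
struct alignas(cache_line) PaddedAtomic {
    std::atomic<int> value{0};
};

PaddedAtomic complete, remaining, next, tnow; // same roles as in the question

int main() {
    next.value++;      // usage changes from next++ to next.value++
    remaining.value--; // ...and remaining-- to remaining.value--
    std::cout << "sizeof(PaddedAtomic) = " << sizeof(PaddedAtomic) << "\n";
}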
What is the performance cost of calling omp_get_thread_num(), compared to looking up the value of a variable?
How to avoid calling omp_get_thread_num() for many times in a simd openmp loop?
I can use #pragma omp parallel, but will that make a simd loop?
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    auto a_size = a.size();
    #pragma omp for simd
    for (int i = 0; i < a_size; ++i) {
        a[i] = omp_get_thread_num();
    }
}
I wouldn't be too worried about the cost of the call, but for code clarity you can do:
#include <vector>
#include <omp.h>

int main() {
    std::vector<int> a(100);
    auto a_size = a.size();
    #pragma omp parallel
    {
        const auto threadId = omp_get_thread_num();
        #pragma omp for
        for (int i = 0; i < a_size; ++i) {
            a[i] = threadId;
        }
    }
}
As long as you use #pragma omp for (and don't put an extra `parallel` in there! Otherwise each of your n threads will spawn n more threads, and that's bad), it will ensure that the for loop inside your parallel region is split up among the n threads. Make sure the OpenMP compiler flag (e.g. -fopenmp for GCC) is turned on.
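To make the pitfall concrete, here is a small sketch contrasting the two shapes (illustrative values; with nested parallelism enabled, the wrong version would give you n teams of n threads):

#include <cstdio>
#include <omp.h>

int main() {
    // WRONG: a "parallel for" inside a parallel region starts a nested
    // parallel region for every outer thread.
    #pragma omp parallel num_threads(4)
    {
        #pragma omp parallel for // nested region: avoid this
        for (int i = 0; i < 4; ++i) {
            printf("nested: thread %d, i = %d\n", omp_get_thread_num(), i);
        }
    }

    // RIGHT: a plain "omp for" binds to the enclosing parallel region and
    // splits the iterations among its existing threads.
    #pragma omp parallel num_threads(4)
    {
        #pragma omp for
        for (int i = 0; i < 4; ++i) {
            printf("split: thread %d, i = %d\n", omp_get_thread_num(), i);
        }
    }
}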
I wrote a simple program that should create 1000 threads, do some work, join them, and repeat everything 1000 times.
I have a memory leak with this piece of code and I don't understand why. I've been looking for a solution pretty much everywhere and can't find one.
#include <iostream>
#include <thread>
#include <string>
#include <windows.h>

#define NUM_THREADS 1000

std::thread t[NUM_THREADS];

using namespace std;

// This function will be called from the threads
void checkString(string str)
{
    // some stuff to do
}

void START_THREADS(string text)
{
    // Launch a group of threads
    for (int i = 0; i < NUM_THREADS; i++)
    {
        t[i] = std::thread(checkString, text);
    }
    // Join the threads with the main thread
    for (int i = 0; i < NUM_THREADS; i++) {
        if (t[i].joinable())
        {
            t[i].join();
        }
    }
    system("cls");
}

int main()
{
    for (int i = 0; i < 1000; i++)
    {
        system("cls");
        cout << i << "/1000" << endl;
        START_THREADS("anything");
    }
    cout << "Launched from the main\n";
    return 0;
}
I'm not sure about memory leaks, but you certainly have a memory error. You shouldn't be doing this:
delete &t[i];
t[i] was not allocated with new, so it must not be deleted. You can safely remove that line.
As for memory consumption, you need to ask yourself whether you really need to spawn a million threads over the run (1000 threads times 1000 iterations). Spawning threads isn't cheap, and it is unlikely that your platform will be able to run more than a handful of them concurrently.
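A sketch of the usual alternative: create one long-lived thread per hardware core, once, and give each a slice of the work (checkString here is a stand-in for the real per-item job; the interleaved slicing is illustrative):

#include <algorithm>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Stand-in for the real per-item work.
void checkString(const std::string& str)
{
    // some stuff to do
}

int main()
{
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    const int total_jobs = 1000 * 1000; // what the original spawned one thread each for

    std::vector<std::thread> pool;
    pool.reserve(nthreads);
    for (unsigned t = 0; t < nthreads; ++t)
    {
        // Each long-lived thread processes every nthreads-th job, instead of
        // one short-lived thread being spawned per job.
        pool.emplace_back([t, nthreads, total_jobs] {
            for (int job = static_cast<int>(t); job < total_jobs; job += nthreads)
                checkString("anything");
        });
    }
    for (auto& th : pool)
        th.join();

    std::cout << "done\n";
}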