OpenMP - Starting a new thread in each loop iteration - c++

I'm having trouble adjusting my thinking to suit OpenMP's way of doing things.
Roughly, what I want is:
for(int i=0; i<50; i++)
{
doStuff();
thread t;
t.start(callback(i)); //each time around the loop create a thread to execute callback
}
I think I know how this would be done in C++11, but I need to accomplish something similar with OpenMP.

The closest thing to what you want is OpenMP tasks, available in compilers that support OpenMP 3.0 or later. It goes like this:
#pragma omp parallel
{
#pragma omp single
for (int i = 0; i < 50; i++)
{
doStuff();
#pragma omp task
callback(i);
}
}
This code makes the loop execute in one thread only, and it creates 50 OpenMP tasks that call callback() with different arguments. It then waits for all tasks to finish before exiting the parallel region. Idle threads pick up the tasks for execution, in no guaranteed order. OpenMP imposes an implicit barrier at the end of each parallel region, since its fork-join execution model mandates that only the main thread runs outside of parallel regions.
Here is a sample program (ompt.cpp):
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
void callback (int i)
{
printf("[%02d] Task stated with thread %d\n", i, omp_get_thread_num());
sleep(1);
printf("[%02d] Task finished\n", i);
}
int main (void)
{
#pragma omp parallel
{
#pragma omp single
for (int i = 0; i < 10; i++)
{
#pragma omp task
callback(i);
printf("Task %d created\n", i);
}
}
printf("Parallel region ended\n");
return 0;
}
Compilation and execution:
$ g++ -fopenmp -o ompt.x ompt.cpp
$ OMP_NUM_THREADS=4 ./ompt.x
Task 0 created
Task 1 created
Task 2 created
[01] Task stated with thread 3
[02] Task stated with thread 2
Task 3 created
Task 4 created
Task 5 created
Task 6 created
Task 7 created
[00] Task stated with thread 1
Task 8 created
Task 9 created
[03] Task stated with thread 0
[01] Task finished
[02] Task finished
[05] Task stated with thread 2
[04] Task stated with thread 3
[00] Task finished
[06] Task stated with thread 1
[03] Task finished
[07] Task stated with thread 0
[05] Task finished
[08] Task stated with thread 2
[04] Task finished
[09] Task stated with thread 3
[06] Task finished
[07] Task finished
[08] Task finished
[09] Task finished
Parallel region ended
Note that tasks are not executed in the same order they were created in.
GCC versions older than 4.4 do not support OpenMP 3.0. Unrecognised OpenMP directives are silently ignored, and the resulting executable runs that code section serially:
$ g++-4.3 -fopenmp -o ompt.x ompt.cpp
$ OMP_NUM_THREADS=4 ./ompt.x
[00] Task stated with thread 3
[00] Task finished
Task 0 created
[01] Task stated with thread 3
[01] Task finished
Task 1 created
[02] Task stated with thread 3
[02] Task finished
Task 2 created
[03] Task stated with thread 3
[03] Task finished
Task 3 created
[04] Task stated with thread 3
[04] Task finished
Task 4 created
[05] Task stated with thread 3
[05] Task finished
Task 5 created
[06] Task stated with thread 3
[06] Task finished
Task 6 created
[07] Task stated with thread 3
[07] Task finished
Task 7 created
[08] Task stated with thread 3
[08] Task finished
Task 8 created
[09] Task stated with thread 3
[09] Task finished
Task 9 created
Parallel region ended
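As a complement to the sample above, here is a minimal sketch of the same task pattern in which each task stores a result instead of printing. The names callback_result and run_tasks are hypothetical, and the squaring is only a stand-in for real work. The sketch deliberately avoids omp.h, so it still compiles and gives the same answer when OpenMP is disabled and the pragmas are ignored.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for callback(i): returns a value instead of printing.
static int callback_result(int i)
{
    return i * i;
}

// Creates one OpenMP task per iteration from a single thread and collects
// the results. Each task writes to its own element, so tasks do not race.
std::vector<int> run_tasks(int n)
{
    assert(n >= 0);
    std::vector<int> results(n, 0);
    int *out = results.data();   // plain pointer shared by all tasks

    #pragma omp parallel
    {
        #pragma omp single
        for (int i = 0; i < n; i++)
        {
            // Each task gets its own copy of i; out is shared.
            #pragma omp task firstprivate(i) shared(out)
            out[i] = callback_result(i);
        }
        // The implicit barriers at the end of the single and parallel
        // constructs guarantee that all tasks have completed.
    }
    return results;
}
```

Calling run_tasks(5) returns {0, 1, 4, 9, 16} regardless of which thread ran which task, because the position of each result is fixed by i and not by execution order.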

For example, have a look at http://en.wikipedia.org/wiki/OpenMP.
#pragma omp for
is your friend. OpenMP does not require you to think about threading. You just declare(!) what you want to run in parallel, and an OpenMP-compatible compiler performs the necessary transformations on your code at compile time.
The OpenMP specifications are also very enlightening. They explain quite well what can be done and how: http://openmp.org/wp/openmp-specifications/
Your sample could look like:
#pragma omp parallel for
for(int i=0; i<50; i++)
{
doStuff();
thread t;
t.start(callback(i)); //each time around the loop create a thread to execute callback
}
Everything in the for loop is run in parallel, so you have to pay attention to data dependencies. The doStuff() function runs sequentially in your pseudocode, but would run in parallel in my sample. You also need to specify which variables are thread-private and the like, which also goes into the #pragma statement.
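To make the remark about data dependencies and private variables concrete, here is a minimal sketch. The function sum_of_squares is a hypothetical example, not part of the question. The accumulator sum is shared by all iterations, so it is declared as a reduction; sq is declared inside the loop body and is therefore private to each thread without any extra clause.

```cpp
#include <cassert>

// Sums i*i for i in [0, n). Without the reduction clause, concurrent
// updates of `sum` would be a data race.
long sum_of_squares(int n)
{
    assert(n >= 0);
    long sum = 0;

    #pragma omp parallel for reduction(+ : sum)
    for (int i = 0; i < n; i++)
    {
        long sq = static_cast<long>(i) * i;   // private: local to the body
        sum += sq;                            // combined across threads
    }
    return sum;
}
```

For instance, sum_of_squares(4) returns 14 (0 + 1 + 4 + 9), whether or not the loop is actually run in parallel.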

Related

Zombie/trash thread on MacOS with Rosetta

I have a very weird issue with a zombie/garbage thread crashing the process although this thread should have been joined a while ago.
The application is written in C++, built for x86_64, and runs on macOS under Rosetta. Googletest runs the app multiple times: for every test it initializes the engine, performs the test, and uninitializes the engine. No new processes are started; everything happens within the googletest process. In roughly one googletest run out of ten, a thread that my application did not create during the current run crashes with an invalid resource access, which triggers SIGABRT and brings down the whole process.
Threads in the pool are created like this during thread pool construction, done once during application initialization:
// threadsNum is a constant, e.g. 8
for (size_t i = 0; i < threadsNum; ++i){
// this->workers has type std::vector< std::thread > workers;
workers.emplace_back(
[this, i]
{
workerThreadFunction(i);
}
);
}
workerThreadFunction() sets the thread name and enters an infinite loop to do its job, checking on every iteration whether it should stop:
for (;;) {
// Mutex lock to access this->stop
std::unique_lock<std::mutex> lock(this->queue_mutex);
if (this->stop) {
break;
}
...
}
During thread pool destruction all threads are joined:
for(std::thread &worker: workers) {
worker.join();
}
To sum up:
The application starts 8 threads for a specific task using a thread pool
Threads have names for debugging purposes, e.g. "MyThread 1", "MyThread 2", etc.
I have verified that all threads are joined upon thread pool destruction (threadX.join() returns)
Application shutdown after a test iteration is clean, before the next test (the one that crashes) starts
In problematic runs there are more threads present than the application has created: one of the threads is present twice (its name is duplicated). The crash dump also shows two threads with the same name, one of which has crashed.
That duplicated thread has a corrupt stack and crashes due to an invalid resource access (locking this->queue_mutex, to be precise)
An additional sleep of e.g. 100 ms on the main thread after engine uninitialization does not help, so it does not look like a timing issue
To me it looks like that thread somehow survived join() and reappeared in the process, but I cannot imagine how that could be possible.
The question is, am I missing something here? Are there any tools to debug this issue besides what I have already done?

Why does the loop in openmp run sequentially?

I am trying to run an example of scheduling in OpenMP, but it runs sequentially.
omp_set_num_threads(4);
#pragma omp parallel for schedule(static, 3)
for (int i = 0; i < 20; i++)
{
printf("Thread %d is running number %d\n", omp_get_thread_num(), i);
}
Result:
Thread 0 is running number 0
Thread 0 is running number 1
Thread 0 is running number 2
Thread 0 is running number 3
Thread 0 is running number 4
Thread 0 is running number 5
Thread 0 is running number 6
Thread 0 is running number 7
Thread 0 is running number 8
Thread 0 is running number 9
Thread 0 is running number 10
Thread 0 is running number 11
Thread 0 is running number 12
Thread 0 is running number 13
Thread 0 is running number 14
Thread 0 is running number 15
Thread 0 is running number 16
Thread 0 is running number 17
Thread 0 is running number 18
Thread 0 is running number 19
How can I get the code to work in parallel?
I am using Microsoft Visual Studio 2017.
In Microsoft Visual Studio, OpenMP support is disabled by default. You can enable it with the /openmp compiler option.
This option can be enabled in the project properties, under C/C++->Language->Open MP Support.
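A quick way to check from inside the program whether OpenMP was actually enabled is the standard _OPENMP macro, which an OpenMP-compliant compiler defines only when OpenMP support is switched on. A minimal sketch; the function names are hypothetical:

```cpp
// Returns the value of the _OPENMP macro (a yyyymm date identifying the
// supported OpenMP specification), or 0 when the code was compiled
// without OpenMP support, e.g. without /openmp or -fopenmp.
constexpr int openmp_version()
{
#ifdef _OPENMP
    return _OPENMP;
#else
    return 0;
#endif
}

// True only if OpenMP pragmas in this translation unit are honoured.
constexpr bool openmp_enabled()
{
    return openmp_version() != 0;
}
```

If openmp_enabled() is false, the compiler ignores the parallel for pragma and the loop runs sequentially.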

PPL - How to configure the number of native threads?

I am trying to manage the count of native threads in PPL by using its Scheduler class, here is my code:
for (int i = 0; i < 2000; i ++)
{
// configure concurrency count 16 to 32.
concurrency::SchedulerPolicy policy = concurrency::SchedulerPolicy(2, concurrency::MinConcurrency, 16,
concurrency::MaxConcurrency, 32);
concurrency::Scheduler *pScheduler = concurrency::Scheduler::Create(policy);
HANDLE hShutdownEvent = CreateEvent(NULL, FALSE, FALSE, NULL);
pScheduler->RegisterShutdownEvent(hShutdownEvent);
pScheduler->Attach();
//////////////////////////////////////////////////////////////////////////
//for (int i = 0; i < 2000; i ++)
{
concurrency::create_task([]{
concurrency::wait(1000);
OutputDebugString(L"Task Completed\n");
});
}
//////////////////////////////////////////////////////////////////////////
concurrency::CurrentScheduler::Detach();
pScheduler->Release();
WaitForSingleObject(hShutdownEvent, INFINITE);
CloseHandle(hShutdownEvent);
}
The usage of SchedulerPolicy is taken from MSDN, but it didn't work at all. I expected PPL to launch 16 to 32 threads to execute the 2000 tasks, but what actually happens is this:
Judging by the speed of the console output, only one task is processed per second. I also tried commenting out the outer for loop and uncommenting the inner for loop; however, this causes 300 threads to be created, which is still incorrect. If I wait longer, even more threads are created.
Any ideas on the correct way to configure concurrency in PPL?
It turned out that I should not call concurrency::wait within the task body. PPL works in work-stealing mode: when the current task is suspended by wait, it starts scheduling the remaining tasks in the queue to maximize the use of computing resources.
When I use concurrency::create_task in a real project, where the task body performs actual calculations, PPL no longer creates hundreds of threads.
Also, SchedulerPolicy can be used to configure the number of virtual processors that PPL may use to process the tasks, which is not always the same as the number of native threads PPL will create.
Say my CPU has 8 virtual processors: by default PPL will create just 8 threads in the pool, but when some of those threads are suspended by wait or a lock and more tasks are pending in the queue, PPL will immediately create more threads to execute them (if the virtual processors are not fully loaded).

overhead of thread-synchronization via Events

I am experimenting with multithreaded synchronization at the moment. As background: I have a set of about 100000 objects, possibly more, that I want to process in different ways multiple times per second.
The thing that concerns me most is the performance of the synchronization.
This is what I think should work just fine (I omitted all error handling, as this is just a test program and in case of an error the program will just crash). I wrote two functions: the first is executed by the main thread of the program, the second is run by all additional threads.
void SharedWorker::Start()
{
while (bRunning)
{
// Send the command to start task1
SetEvent(hTask1Event);
// Do task1 (on a subset of all objects) here
// Wait for all workers to finish task1
WaitForMultipleObjects(<NumberOfWorkers>, <ListOfTask1WorkerEvents>, TRUE, INFINITE);
// Reset the command for task1
ResetEvent(hTask1Event);
// Send the command to start task2
SetEvent(hTask2Event);
// Do task2 (on a subset of all objects) here
// Wait for all workers to finish task2
WaitForMultipleObjects(<NumberOfWorkers>, <ListOfTask2WorkerEvents>, TRUE, INFINITE);
// Reset the command for task2
ResetEvent(hTask2Event);
// Send the command to do cleanup
SetEvent(hCleanupEvent);
// Do some (on a subset of all objects) cleanup
// Wait for all workers to finish cleanup
WaitForMultipleObjects(<NumberOfWorkers>, <ListOfCleanupWorkerEvents>, TRUE, INFINITE);
// Reset the command for cleanup
ResetEvent(hCleanupEvent);
}
}
DWORD WINAPI WorkerThreads(LPVOID lpParameter)
{
while (bRunning)
{
WaitForSingleObject(hTask1Event, INFINITE);
// Unset finished cleanup
ResetEvent(hCleanedUp);
// Do task1 (on a subset of all objects) here
// Signal finished task1
SetEvent(hTask1);
WaitForSingleObject(hTask2Event, INFINITE);
// Reset task1 event
ResetEvent(hTask1);
// Do task2 (on a subset of all objects) here
// Signal finished task2
SetEvent(hTask2);
WaitForSingleObject(hCleanupEvent, INFINITE);
// Reset update event
ResetEvent(hTask2);
// Do cleanup (on a subset of all objects) here
// Signal finished cleanup
SetEvent(hCleanedUp);
}
return 0;
}
To point out my requirements, I'll just give you a little example:
Say we have the 100000 objects from above, split into 8 subsets of 12500 objects each, and a modern multicore processor with 8 logical cores. The relevant part is the time: all tasks must be performed within about 8 ms.
My questions are: can I get a significant speedup from splitting the processing, or is synchronization via events too expensive? Or is there perhaps another way of synchronizing threads with less effort or processing time, if all the tasks need to be done this way?
If processing a single object is fast, do not split the work between threads. Thread synchronization on Windows can cost well over 50 ms on every context switch. That time is not consumed by the system itself; it is simply time during which something else is running on the system.
However, if processing each object takes around 8 ms, it does make sense to schedule the work across a pool of threads. Processing time may vary a bit between objects, though, so with large object counts the worker threads will finish their work at different moments.
A better approach is to set up a synchronized object queue, to which you add the objects to process and from which the workers take them. Furthermore, since processing a single object takes considerably less time than a thread's scheduling interval, it is a good idea to hand objects to the worker threads in batches (of 10-20, say). You can determine the best number of worker threads in the pool and the best batch size by testing.
So the pseudocode could look like this:
main_thread:
    init queue
    start workers
    set counter to 100000
    add 100000 objects to queue
    while (counter) wait();
worker_thread:
    while (!done)
        get up to 10 objects from queue
        process objects
        counter -= processed count
        if (counter == 0) notify done
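The pseudocode above could be sketched in portable C++11 roughly as follows. This is a sketch under assumptions rather than a tuned implementation: process_object and process_in_batches are hypothetical names, and the per-object work is a placeholder. Since all objects are known up front, the "queue" is reduced to an index into the vector, and the main thread waits by joining the workers instead of waiting on the counter.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical per-object work; a real program would update the object here.
static void process_object(int &obj)
{
    obj += 1;
}

// Processes all objects with n_workers threads that each claim up to
// batch_size objects at a time. Returns the number of objects processed.
std::size_t process_in_batches(std::vector<int> &objects,
                               unsigned n_workers, std::size_t batch_size)
{
    assert(n_workers > 0 && batch_size > 0);

    std::mutex queue_mutex;                 // protects `next`
    std::size_t next = 0;                   // first unclaimed object
    std::atomic<std::size_t> processed{0};

    auto worker = [&]() {
        for (;;)
        {
            std::size_t begin, end;
            {
                // Claim a batch under the lock; process it outside the lock.
                std::lock_guard<std::mutex> lock(queue_mutex);
                begin = next;
                end = std::min(objects.size(), begin + batch_size);
                next = end;
            }
            if (begin == end)
                return;                     // nothing left to claim
            for (std::size_t i = begin; i < end; ++i)
                process_object(objects[i]);
            processed += end - begin;
        }
    };

    std::vector<std::thread> workers;
    for (unsigned w = 0; w < n_workers; ++w)
        workers.emplace_back(worker);
    for (std::thread &t : workers)
        t.join();                           // main thread waits for completion

    return processed.load();
}
```

The lock is held only while a batch is claimed, so a larger batch_size means fewer lock acquisitions per object; the best value still has to be found by measurement, as suggested above.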

c++ Handling multiple threads in a main thread

I am a bit new to multi threading, so forgive me if these questions are too trivial.
My application needs to create multiple threads from within a thread and perform work on each of them.
For example, I have a set of files to read, say 50, and I create a thread to read these files using the CreateThread() function.
This main thread then creates 4 threads to access the files. The 1st thread is given file 1, the second file 2, and so on.
After the 1st thread has finished reading file 1 and has given the main thread the required data, the main thread needs to invoke it with file 5 and obtain the data from it. The same goes for all other threads until all 50 files are read.
After that, each thread is destroyed and finally my main thread is destroyed.
The issues I am facing are:
1) How do I keep a thread from exiting after it has read a file?
2) How do I invoke the thread again with another file name?
3) How would my child thread pass information to the main thread?
4) After a thread finishes reading a file and returns data to the main thread, how would the main thread know which thread provided the data?
Thanks
This is a very common problem in multi-threaded programming. You can view it as a producer-consumer problem: the main thread "produces" tasks which are "consumed" by the worker threads (see e.g. http://www.mario-konrad.ch/blog/programming/multithread/tutorial-06.html). You might also want to read about "thread pools".
I would highly recommend reading up on Boost's synchronization facilities (http://www.boost.org/doc/libs/1_50_0/doc/html/thread.html) and using Boost's threading functionality, as it is platform independent and pleasant to use.
To be more specific to your question: you should create a queue of operations to be done (usually it is the same queue for all worker threads; if you really want to ensure that thread 1 performs tasks 1, 5, 9 ... you might want one queue per worker thread). Access to this queue must be synchronized by a mutex, and waiting threads can be notified via condition_variables when new data is added to the queue.
1.) Don't exit the thread function; instead wait until a condition is signalled, and restart using a while ([exit condition not true]) loop
2.) See 1.
3.) Through any variable to which both have access and which is protected by a mutex (e.g. a result queue)
4.) By including this information in the result written to the result queue.
One more piece of advice: it is always hard to get multi-threading right. So be as careful as possible and write tests to detect deadlocks and race conditions.
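The four points above can be sketched with the C++11 standard library (the same ideas apply to Boost's equivalents). This is a minimal illustration under assumptions: Result, read_file_stub and read_all are hypothetical names, and the "file read" is a stub that just returns the length of the file name.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <vector>

struct Result
{
    int worker_id;          // point 4: which thread produced the data
    std::string file_name;
    std::size_t size;       // hypothetical payload
};

// Hypothetical stand-in for actually reading a file.
static std::size_t read_file_stub(const std::string &name)
{
    return name.size();
}

std::vector<Result> read_all(const std::vector<std::string> &files, int n_workers)
{
    assert(n_workers > 0);
    std::mutex m;                       // protects tasks, results and done
    std::condition_variable cv;
    std::queue<std::string> tasks;
    std::vector<Result> results;        // point 3: shared result container
    bool done = false;

    auto worker = [&](int id) {
        for (;;)                        // points 1 and 2: loop, don't exit
        {
            std::string name;
            {
                std::unique_lock<std::mutex> lock(m);
                // Sleep until there is work or the exit condition is set.
                cv.wait(lock, [&] { return done || !tasks.empty(); });
                if (tasks.empty())
                    return;             // done, and nothing left to do
                name = tasks.front();
                tasks.pop();
            }
            std::size_t size = read_file_stub(name);   // work outside the lock
            std::lock_guard<std::mutex> lock(m);
            results.push_back({id, name, size});
        }
    };

    std::vector<std::thread> workers;
    for (int id = 0; id < n_workers; ++id)
        workers.emplace_back(worker, id);

    for (const std::string &f : files)
    {
        {
            std::lock_guard<std::mutex> lock(m);
            tasks.push(f);
        }
        cv.notify_one();                // wake one waiting worker
    }
    {
        std::lock_guard<std::mutex> lock(m);
        done = true;                    // no more work will arrive
    }
    cv.notify_all();

    for (std::thread &t : workers)
        t.join();
    return results;
}
```

The results arrive in completion order, not submission order, which is why each Result carries both the file name and the worker id.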
The typical solution for this kind of problem is a thread pool and a queue. The main thread pushes all files/filenames to a queue, then starts a thread pool, i.e. a number of threads, each of which takes an item from the queue and processes it. When an item has been processed, the thread moves on to the next one (if the queue is not yet empty by then). The main thread knows everything has been processed when the queue is empty and all threads have exited.
So, 1) and 2) are somewhat conflicting: you don't stop the thread and invoke it again, it just keeps running as long as it finds items on the queue.
For 3) you can again use a queue into which the thread puts information and from which the main thread reads. For 4) you could give each thread an id and store it together with the data. Normally, however, the main thread should not need to know exactly which thread processed the data.
Some very basic pseudocode to give you an idea, with locking for thread safety omitted:
//main
for( all filenames )
    queue.push_back( filename );
//start some threads
threadPool.StartThreads( 4, CreateThread( queue ) );
//wait for threads to end
threadPool.Join();
//thread
class Thread
{
public:
    Thread( queue& q ) : q( q ) {}
    void Start();
    bool Join();
    void ThreadFun()
    {
        //keep taking items for as long as the queue has any
        for( ;; )
        {
            auto nextQueueItem = q.pop_back();
            if( !nextQueueItem )
                return; //q empty
            ProcessItem( nextQueueItem );
        }
    }
private:
    queue& q; //shared with the main thread
};
Whether or not you use a thread pool to execute your synchronous file reads, it boils down to a chain of functions, or groups of functions, that have to run serialized. So let's assume you have found a way to execute functions in parallel (be it by starting one thread per function or by using a thread pool). To wait for the first 4 files to be read, you can use a queue into which the reading threads push their results; the fifth function then pulls 4 results out of the queue (the queue blocks when empty) and processes them. If there are more dependencies between functions, you can add more queues between them. Sketch:
void read_file( const std::string& name, queue& q )
{
    file_content f= .... // read file
    q.push( f );
}
void process4files( queue& q )
{
    std::vector< file_content > result;
    for ( int i = 0; i != 4; ++i )
        result.push_back( q.pop() );
    // now 4 files are read ...
    assert( result.size() == 4u );
}
queue q;
// std::ref is needed because thread copies its arguments by default
thread t1( &read_file, "file1", std::ref( q ) );
thread t2( &read_file, "file2", std::ref( q ) );
thread t3( &read_file, "file3", std::ref( q ) );
thread t4( &read_file, "file4", std::ref( q ) );
thread t5( &process4files, std::ref( q ) );
// every thread must be joined before it is destroyed
t1.join(); t2.join(); t3.join(); t4.join();
t5.join();
I hope you get the idea.
Torsten
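The sketch above relies on a queue type whose pop() blocks while the queue is empty. One possible minimal version, using only the C++11 standard library, might look like the following. BlockingQueue and read_and_collect are hypothetical names, and the "file content" is a placeholder string rather than real file data.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Minimal blocking queue: pop() waits while the queue is empty.
template <typename T>
class BlockingQueue
{
public:
    void push(T value)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            items_.push(std::move(value));
        }
        not_empty_.notify_one();
    }

    T pop()
    {
        std::unique_lock<std::mutex> lock(mutex_);
        not_empty_.wait(lock, [this] { return !items_.empty(); });
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::queue<T> items_;
};

// Starts one reader thread per name and gathers exactly that many results.
std::vector<std::string> read_and_collect(const std::vector<std::string> &names)
{
    BlockingQueue<std::string> q;
    std::vector<std::thread> readers;
    for (const std::string &name : names)
        readers.emplace_back([&q, name] { q.push("content of " + name); });

    std::vector<std::string> result;
    for (std::size_t i = 0; i != names.size(); ++i)
        result.push_back(q.pop());      // blocks until a reader has pushed

    for (std::thread &t : readers)
        t.join();
    assert(result.size() == names.size());
    return result;
}
```

Because pop() blocks, the collecting loop needs no separate signal that the readers have finished; it simply waits until the expected number of results has arrived.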