I am optimizing a for loop with OpenMP. In each thread, a large array is used temporarily (it is not needed once the thread finishes). Since I don't want to repeatedly allocate and delete these arrays, I plan to allocate one large block of memory and assign a part of it to each thread. To avoid conflicts, I need a unique ID for each running thread, one that does not change and cannot equal that of another thread. So my question is: can I use the thread ID returned by omp_get_thread_num() for this purpose? Or is there a more efficient solution for such a memory allocation and assignment task? Thanks very much!
You can start the parallel section and then allocate variables/memory inside it. Everything declared within the parallel section is private to each thread, on its own stack. Example:
#pragma omp parallel
{
    // every variable declared here is thread private
    int *temp_array_pointer = calloc(num_elements, sizeof(int));
    int temp_array_on_stack[num_elements]; // VLA: only safe for modest sizes
    #pragma omp for
    for (...) {
        // whatever my loop does
    }
    // if you used dynamic allocation
    free(temp_array_pointer);
}
Once your program encounters a parallel region, that is once it hits
#pragma omp parallel
the threads (which may have been started at program initialisation, or not until the first parallel construct) become active. Inside the parallel region, any thread that allocates memory, for an array for example, allocates that memory inside its own private address space. Unless the thread deallocates it, that memory remains allocated for the entirety of the parallel region.
If your program first, in serial, allocates memory for an array and then, on entering the parallel region, copies that array to all threads, use the firstprivate clause and let the run time take care of copying the array into the private address space of each thread.
Given all that, I don't see the point of allocating, presumably before encountering the parallel region, a large amount of memory and then sharing it between threads using some roll-your-own scheme that divides it up based on calculations on the thread id.
Related
Near the start of my C++ application, my main thread uses OMP to parallelize several for loops. After the first parallelized for loop, I see that the threads used remain in existence for the duration of the application and are reused for subsequent OMP for loops executed from the main thread. I observe this with the following command (working in CentOS 7):
for i in $(pgrep myApplication); do ps -mo pid,tid,fname,user,psr -p $i;done
Later in my program, I launch a boost thread from the main thread, in which I parallelize a for loop using OMP. At this point, I see an entirely new set of threads are created, which has a decent amount of overhead.
Is it possible to make the OMP parallel for loop within the boost thread reuse the original OMP thread pool created by the main thread?
Edit: Some pseudo code:
myFun(data)
{
    // Want to reuse OMP thread pool from main here.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
    {
        // Work on data
    }
}

main
{
    // Thread pool created here.
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
    {
        // do stuff
    }
    boost::thread myThread(myFun); // Constructor starts thread.
    // Do some serial stuff, no OMP.
    myThread.join();
}
The interaction of OpenMP with other threading mechanisms is deliberately left out of the specification and therefore depends heavily on the implementation. The GNU OpenMP runtime keeps a pointer to the thread pool in TLS and propagates it down the (nested) teams. Threads started via pthread_create (or boost::thread or std::thread) do not inherit the pointer and therefore spawn a fresh pool. This is probably the case with other OpenMP runtimes too.
There is a requirement in the standard that basically forces such behaviour in most implementations. It is about the semantics of the threadprivate variables and how their values are retained across the different parallel regions forked from the same thread (OpenMP standard, 2.15.2 threadprivate Directive):
The values of data in the threadprivate variables of non-initial threads are guaranteed to persist between two consecutive active parallel regions only if all of the following conditions hold:
Neither parallel region is nested inside another explicit parallel region.
The number of threads used to execute both parallel regions is the same.
The thread affinity policies used to execute both parallel regions are the same.
The value of the dyn-var internal control variable in the enclosing task region is false at entry to both parallel regions.
If these conditions all hold, and if a threadprivate variable is referenced in both regions, then threads with the same thread number in their respective regions will reference the same copy of that variable.
This, besides performance, is probably the main reason for using thread pools in OpenMP runtimes.
Now, imagine that two parallel regions forked by two separate threads share the same worker thread pool. A parallel region was forked by the first thread and some threadprivate variables were set. Later a second parallel region is forked by the same thread, where those threadprivate variables are used. But somewhere between the two parallel regions, a parallel region is forked by the second thread and worker threads from the same pool are utilised. Since most implementations keep threadprivate variables in TLS, the above semantics can no longer be asserted. A possible solution would be to add new worker threads to the pool for each separate thread, which is not much different than creating new thread pools.
I'm not aware of any workaround to make the worker thread pool shared. And even if one were possible, it would not be portable, so the main benefit of OpenMP would be lost.
Let's say there are 4 consumer threads that each run in a continuous loop:
function consumerLoop(threadIndex)
{
    int myArray[100];
    main loop {
        ..process data..
        myArray[myIndex] += newValue
    }
}
I have another monitor thread which does other background tasks.
I need to access the myArray for each of these threads from the monitor thread.
Assume that the loops will run forever (so the local variables will continue to exist) and that the only operation required from the monitor thread is to read the array contents of all the threads.
One alternative is to change myArray to a global array of arrays, but I am guessing that would slow down the consumer loops.
What are the ill effects of declaring a global pointer array int *p[4];, assigning each element to the address of a thread's local array by adding a line in consumerLoop such as p[threadIndex] = myArray, and accessing p from the monitor thread?
Note: I am running this on a Linux system and the language is C++. I am not concerned about synchronization/validity of the array contents when accessing them from the monitor thread. Let's stay away from a discussion of locking.
If you are really interested in the performance difference, you have to measure. I would guess that there is nearly no difference.
Both approaches are correct, as long as the monitor thread doesn't access stack local variables that are invalid because the function returned.
You cannot access myArray from a different thread because it is a local variable.
You can either 1) use a global variable, or 2) malloc the memory and pass its address to all the threads.
Please protect the critical section when multiple threads access the common memory.
Does OMP ensure that the contents of an dynamic array is up-to-date and is visible to all threads after an OMP barrier?
Yes. A barrier causes all threads' view of all accessible memory to be made consistent; that is, it implicitly flushes the entire state of the program.
If your array is declared outside the #pragma omp parallel construct, it is automatically accessible by, and shared among, all the threads.
But whether it is updated correctly by the threads depends on your algorithm and on the synchronization mechanism you use to ensure correctness.
I have one function that I'm attempting to parallelize with OpenMP. It has a big for loop where every iteration is independent of the others, and I'd like to use something like
#pragma omp for private(j)
to parallelize the loop.
One problem is that each iteration of the loop requires a substantial amount of temporary workspace, enough that I think it will likely kill performance if I allocate and deallocate this workspace once per iteration. My environment has "workspace" objects in it, and there is no problem associated with reusing an old workspace object as-is.
How can I allocate workspace for each thread before the threads are made (and I don't know how many of them there are)? How can I tell each thread to pick a unique workspace object from the pool?
You can use omp_get_max_threads() and allocate enough workspaces for all threads (e.g., an array of workspaces with omp_get_max_threads() elements), and then in each thread use omp_get_thread_num() to find out which thread it is running on, so it can pick its own workspace.
Maybe I am missing the point, but wouldn't the following strategy work for you?
void foo() {
#pragma omp parallel
{
// allocate the work-space here, so as to make it private to the thread
thread_workspace t;
#pragma omp for
for(int j = 0; j < N; j++) {
// Each thread has its local work-space allocated outside the for loop
}
} // End of the parallel region
}
I recommend using the Object Pool design pattern. Here's a description. You would obviously need to make the acquire and release methods for the workspaces thread safe (3 methods in the ReusablePool need synchronization). The number of workspaces would grow to the total number needed at any one time. Reclaimed workspaces would be reused by the ReusablePool.
Although the object pool is handling the object instantiation, its main purpose is to provide a way for the clients to reuse the objects like they are new objects.
I have a question on memory allocation for containers in C++.
Look at the pseudocode for a multi-threaded application (assume it is in C++). I declare the vector object in the main method, then I run a thread and pass this object to it. The thread runs on another processor. Now, I insert 100000 elements into the vector.
typedef struct myType
{
    int a;
    int b;
} myType;

ThreadRoutine()
{
    Run Thread in processor P;
    insert 100000 elements into myTypeObject
}

int main()
{
    std::vector<myType> myTypeObject;
    CALLTHREAD and pass myTypeObject
}
I want to know where the memory for the 100000 elements will be allocated:
- from main itself
- from the thread
The reason I ask this is because, I want to run the thread in a different processor. And my machine is a NUMA machine. So if the memory is allocated from the thread, it will be in the local memory bank of the thread. But if the memory is allocated from main, it will be allocated from the local memory bank of the main thread.
My intuition says that the memory is allocated only by the thread. Please let me know your thoughts.
The reallocation(s) will be called from ThreadRoutine(), so they happen on whichever thread calls that (the secondary thread in your example).
Of course, you could reserve() on the main thread before passing the vector around, if you want to avoid resizing on the secondary thread.