I need to create 4 teams of 4 threads each, with each team bound to contiguous processors.
The result I'm expecting is, for example:
Team 0 Thread 0 Processor: 0
Team 0 Thread 1 Processor: 1
Team 0 Thread 2 Processor: 2
Team 0 Thread 3 Processor: 3
Team 1 Thread 0 Processor: 4
Team 1 Thread 1 Processor: 5
Team 1 Thread 2 Processor: 6
Team 1 Thread 3 Processor: 7
Team 2 Thread 0 Processor: 8
Team 2 Thread 1 Processor: 9
Team 2 Thread 2 Processor: 10
Team 2 Thread 3 Processor: 11
Team 3 Thread 0 Processor: 12
Team 3 Thread 1 Processor: 13
Team 3 Thread 2 Processor: 14
Team 3 Thread 3 Processor: 15
I can control processor affinity in GCC using the GOMP_CPU_AFFINITY environment variable.
I'm using:
#pragma omp parallel num_threads(4)
twice in order to get 2 fork levels.
At the moment I have this order in GOMP_CPU_AFFINITY:
0 4 8 12 1 2 3 5 6 7 9 10 11 13 14 15
So the first fork, the fathers' fork, gets:
Team 0 Thread 0 Processor: 0
Team 1 Thread 0 Processor: 4
Team 2 Thread 0 Processor: 8
Team 3 Thread 0 Processor: 12
The problem is that the second group of forks happens in no particular order, so, for example, I could end up with this situation (I'm using #pragma omp atomic so that only one 'father' asks for more processors at a time):
Team 0 Thread 0 Processor: 0
Team 0 Thread 1 Processor: 5
Team 0 Thread 2 Processor: 6
Team 0 Thread 3 Processor: 7
Team 1 Thread 0 Processor: 4
Team 1 Thread 1 Processor: 13
Team 1 Thread 2 Processor: 14
Team 1 Thread 3 Processor: 15
Team 2 Thread 0 Processor: 8
Team 2 Thread 1 Processor: 1
Team 2 Thread 2 Processor: 2
Team 2 Thread 3 Processor: 3
Team 3 Thread 0 Processor: 12
Team 3 Thread 1 Processor: 9
Team 3 Thread 2 Processor: 10
Team 3 Thread 3 Processor: 11
The question is: is there any way to make this second request happen in order?
I think I would have to add some synchronization with locks or something similar...
Thanks in advance!
Javier
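One possible alternative (a sketch, not from the original post; it assumes Linux with at least 16 logical CPUs numbered 0-15): instead of relying on the order in which the GOMP_CPU_AFFINITY list is consumed, each inner thread can compute its target CPU from its team and thread numbers and pin itself with sched_setaffinity:

#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed by glibc for CPU_SET/sched_setaffinity
#endif
#include <sched.h>
#include <omp.h>
#include <cstdio>

int main() {
    omp_set_nested(1);                       // allow the second fork level
    #pragma omp parallel num_threads(4)
    {
        int team = omp_get_thread_num();     // "father"
        #pragma omp parallel num_threads(4)
        {
            int thread = omp_get_thread_num();        // "son"
            cpu_set_t set;
            CPU_ZERO(&set);
            CPU_SET(team * 4 + thread, &set);         // contiguous CPUs per team
            sched_setaffinity(0, sizeof(set), &set);  // 0 = calling thread on Linux
            printf("Team %d Thread %d Processor: %d\n", team, thread, sched_getcpu());
        }
    }
}

With this mapping the ordering of the affinity list no longer matters, since each thread binds itself deterministically.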
I finally managed to make this work; this is my code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>

int main(int argc, char *argv[]) {
    int padre, hijo;
    int contador = 0;

    omp_set_nested(1);

    #pragma omp parallel private(padre) shared(contador) num_threads(4)
    {
        padre = omp_get_thread_num();

        // busy-wait until it is this father's turn
        while (contador != padre) {
            // Don't know what to put here
        };

        #pragma omp parallel private(hijo) shared(padre, contador) num_threads(4)
        {
            hijo = omp_get_thread_num();
            printf("\nFather: %d Son: %d Processor: %d\n", padre, hijo, sched_getcpu());

            // the inner team's master lets the next father proceed
            #pragma omp master
            {
                contador++;
            }
        }
    }
}
Note: Padre is Father, Hijo is Son and Contador is Counter in Spanish :P
The problem I'm facing now is that if I compile my code with -O3 optimizations, the while loop 'disappears' unless I put, for example, a printf line inside it. I think I should ask about that in another question!
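For reference, one common way to handle that (a sketch, not part of the code above) is to tell the compiler that contador can be modified by another thread, for example with an OpenMP flush inside the busy-wait, so that -O3 cannot hoist the read out of the loop:

while (contador != padre) {
    #pragma omp flush(contador)   // force contador to be re-read on every iteration
}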
Thanks to you all!
Javier
I used the following statement from OpenMP:
omp_set_num_threads(6);
#pragma omp parallel for
for (int i = 0; i < NUMS; ++i) {
    printf("id is %3d thread is %d\n", i, omp_get_thread_num());
}
I read in a blog post that the iterations are divided evenly among the threads and that, when the number of iterations is not evenly divisible by the number of threads, the per-thread count is rounded up.
Well, I first set NUMS=17; the result is as follows:
id is 12 thread is 4
id is 13 thread is 4
id is 14 thread is 4
id is 9 thread is 3
id is 10 thread is 3
id is 11 thread is 3
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 6 thread is 2
id is 7 thread is 2
id is 8 thread is 2
id is 15 thread is 5
id is 16 thread is 5
id is 3 thread is 1
id is 4 thread is 1
id is 5 thread is 1
As can be seen, ⌈17/6⌉ = 3 (17/6 rounded up), so the result is as expected.
However, if I set NUMS=19, then by the same rounding-up rule (⌈19/6⌉ = 4) each thread should be allocated 4 iterations, yet:
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 3 thread is 0
id is 10 thread is 3
id is 11 thread is 3
id is 12 thread is 3
id is 7 thread is 2
id is 8 thread is 2
id is 9 thread is 2
id is 13 thread is 4
id is 14 thread is 4
id is 15 thread is 4
id is 4 thread is 1
id is 5 thread is 1
id is 6 thread is 1
id is 16 thread is 5
id is 17 thread is 5
id is 18 thread is 5
As you can see, only the first thread is assigned 4 iterations.
I can't figure out why. What exactly is the reason for this? What exactly is OpenMP's default loop scheduling?
Summarising all the comments to create an answer (so thanks to all who commented).
First, what the standard says/requires :-
It says nothing about which schedule should be used when it is unspecified.
schedule(static) with no chunk_size is only specified as allocating "approximately equal" numbers of iterations to each available thread.
Second, what happens in reality at present :-
Compilers default to using schedule(static) when no schedule is specified. (Though schedule(nonmonotonic:dynamic) might, now, be a better choice.)
At least the LLVM OpenMP runtime allocates iterations to threads as this Python code shows (the critical part is myCount, the rest is just to test it and show your test cases).
#
# Show number of iterations allocated to each thread
# by schedule(static) in the LLVM runtime
#
def myCount(me, numThreads, total):
    base = total // numThreads
    remainder = total % numThreads
    return base + 1 if me < remainder else base

def test(numThreads, total):
    print("Threads: ", numThreads, " Iterations: ", total)
    allocated = 0
    for thread in range(0, numThreads):
        mine = myCount(thread, numThreads, total)
        allocated += mine
        print(thread, ": ", mine)
    if allocated != total:
        print("***ERROR*** ", allocated, " allocated, ", total, " requested")
test(6,17)
test(6,19)
If you run that you can see the result for your two test cases:-
% python3 static.py
Threads: 6 Iterations: 17
0 : 3
1 : 3
2 : 3
3 : 3
4 : 3
5 : 2
Threads: 6 Iterations: 19
0 : 4
1 : 3
2 : 3
3 : 3
4 : 3
5 : 3
If you want to get into the full horror of loop scheduling, there is a whole chapter on this in "High Performance Parallel Runtimes -- Design and Implementation"
p.s. It's worth noting that the schedule shown above cannot be requested explicitly by setting a chunk_size on a static schedule, since the standard does not then allow the remainder iterations to be split up as they are here. For example, when allocating 10 iterations to 4 threads, schedule(static,2) gives (4,2,2,2) and schedule(static,3) gives (3,3,3,1), whereas the scheme above gives (3,3,2,2); that has an imbalance of 1, while each of the explicit schemes has an imbalance of 2.
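For illustration, a small sketch (not part of the original answer) that counts how many iterations each thread receives for 10 iterations on 4 threads with the two explicit chunk sizes mentioned above (it assumes 4 threads are actually available):

#include <cstdio>
#include <omp.h>

int main() {
    const int N = 10, T = 4;
    int counts2[T] = {0}, counts3[T] = {0};   // iterations per thread

    #pragma omp parallel for num_threads(T) schedule(static, 2)
    for (int i = 0; i < N; ++i) {
        counts2[omp_get_thread_num()]++;      // each thread touches only its own slot
    }

    #pragma omp parallel for num_threads(T) schedule(static, 3)
    for (int i = 0; i < N; ++i) {
        counts3[omp_get_thread_num()]++;
    }

    for (int t = 0; t < T; ++t)
        printf("thread %d: chunk 2 -> %d, chunk 3 -> %d\n", t, counts2[t], counts3[t]);
    // Expected: chunk 2 gives (4,2,2,2), chunk 3 gives (3,3,3,1).
}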
I'm trying out the following code to see how OpenMP threads are managed in a nested loop, where the inner and outer loops are implemented separately in a function and in its caller.
Each loop is parallelised with the directive
#pragma omp parallel for and I'm assuming the pragma on the inner loop is ignored.
To see this, I printed the thread number in each loop.
What I see is the following: the thread id in the inner loop is always zero, rather than matching the thread number of the caller. Why does this happen?
Calling 0 from 0
Calling 2 from 1
Calling 6 from 4
Calling 8 from 6
Calling 4 from 2
Calling 7 from 5
Calling 5 from 3
Calling 0 from 0 // Expecting 3
Calling 1 from 0
Calling 2 from 0
Calling 3 from 0
Calling 0 from 0
Calling 1 from 0
Calling 2 from 0
Calling 3 from 0
Calling 0 from 0
Calling 0 from 0
Calling 0 from 0
Calling 1 from 0
Calling 2 from 0
Calling 3 from 0
Calling 9 from 7
Calling 1 from 0 // Expecting 7
Calling 2 from 0
Calling 3 from 0
Calling 0 from 0
Calling 3 from 1
Calling 0 from 0 // Expecting 1
Calling 1 from 0
Calling 2 from 0
Calling 1 from 0
Calling 2 from 0
Calling 3 from 0
Calling 3 from 0
Calling 1 from 0
Calling 2 from 0
Calling 3 from 0
Calling 0 from 0
Calling 1 from 0
Calling 2 from 0
Calling 3 from 0
Calling 0 from 0
Calling 1 from 0
Calling 2 from 0
Calling 3 from 0
Calling 1 from 0
Calling 0 from 0
Calling 1 from 0
Calling 2 from 0
Calling 3 from 0
#include <vector>
#include <omp.h>
#include <iostream>
#include <cstdio>
#include <limits>
#include <cstdint>
#include <cinttypes>
using namespace std;
const size_t kM = 4;
struct Mat
{
    int elem[kM];

    Mat(const Mat& copy)
    {
        for (size_t i = 0; i < kM; ++i)
            this->elem[i] = copy.elem[i];
    }

    Mat()
    {
        for (size_t i = 0; i < kM; ++i)
            elem[i] = 0;
    }

    void do_mat(Mat& m)
    {
        #pragma omp parallel for
        for (int i = 0; i < kM; ++i)
        {
            printf(" \tCalling %d from %d\n", i, omp_get_thread_num());
            elem[i] += m.elem[i];
        }
    }
};

int main()
{
    const int kN = 10;
    vector<Mat> matrices(kN);
    Mat m;

    #pragma omp parallel for
    for (int i = 0; i < kN; i++)
    {
        int tid = omp_get_thread_num();
        printf("Calling %d from %d\n", i, tid);
        matrices[i].do_mat(m);
    }

    return 0;
}
I'm not sure I understand what it is that you expected, but the result you get is perfectly expected.
By default, OpenMP nested parallelism is disabled, meaning that any nested parallel region will create as many one-thread teams as there are threads from the outer level encountering it.
In your case, the outermost parallel region creates a team of 8 threads. Each of these reaches the innermost parallel region and creates a second-level, one-thread team. Each of these second-level threads is ranked 0 in its own team, hence the 0s you see printed.
With the very same code, compiled with g++ 9.3.0, by setting the 2 environment variables OMP_NUM_THREADS and OMP_NESTED, I get the following:
OMP_NUM_THREADS="2,3" OMP_NESTED=true ./a.out
Calling 0 from 0
Calling 5 from 1
Calling 0 from 0
Calling 1 from 0
Calling 2 from 1
Calling 0 from 0
Calling 1 from 0
Calling 3 from 2
Calling 3 from 2
Calling 2 from 1
Calling 6 from 1
Calling 1 from 0
Calling 0 from 0
Calling 1 from 0
Calling 3 from 2
Calling 2 from 1
Calling 2 from 0
Calling 0 from 0
Calling 1 from 0
Calling 2 from 1
Calling 3 from 2
Calling 0 from 0
Calling 1 from 0
Calling 3 from 2
Calling 2 from 1
Calling 3 from 0
Calling 7 from 1
Calling 0 from 0
Calling 3 from 2
Calling 2 from 1
Calling 3 from 2
Calling 0 from 0
Calling 1 from 0
Calling 1 from 0
Calling 2 from 1
Calling 4 from 0
Calling 8 from 1
Calling 0 from 0
Calling 3 from 2
Calling 2 from 1
Calling 2 from 1
Calling 0 from 0
Calling 1 from 0
Calling 3 from 2
Calling 1 from 0
Calling 9 from 1
Calling 2 from 1
Calling 0 from 0
Calling 1 from 0
Calling 3 from 2
Maybe that corresponds better to what you expected to see?
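Equivalently, nesting can be enabled from the code itself rather than through environment variables; a minimal sketch (not the original program) that also uses omp_get_ancestor_thread_num() to recover the outer thread's id from inside the inner region:

#include <cstdio>
#include <omp.h>

int main() {
    omp_set_nested(1);              // pre-OpenMP-5.0 way of allowing nested regions
    omp_set_max_active_levels(2);   // OpenMP 5.0+ way: allow two active levels
    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(3)
        {
            printf("outer %d inner %d\n",
                   omp_get_ancestor_thread_num(1), omp_get_thread_num());
        }
    }
}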
Unless you provide special options to OpenMP, it tries to split the work at compile time, and it's hard to do with nested parallelism, so it doesn't even try.
You can refer to this StackOverflow question for suggestions (e.g. using collapse in OpenMP 3.0+)
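For reference, a minimal sketch of the collapse approach (not from the linked question): collapse requires the two loops to be perfectly nested inside a single construct, so the inner loop from do_mat would have to be inlined into the caller, roughly like this:

#include <cstdio>
#include <omp.h>

int main() {
    const int kN = 10, kM = 4;
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < kN; ++i) {
        for (int j = 0; j < kM; ++j) {
            // the per-element work from do_mat would go here
            printf("i=%d j=%d from thread %d\n", i, j, omp_get_thread_num());
        }
    }
}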
I expected the output to be 0 0 1 1 2 2 ... but the output was 0 1 2 3 ... 0 1 2 3
class Runner: Thread() {
    override fun run() {
        var i = 0
        while (i < 10) {
            println("${Thread.currentThread().name} $i")
            i++
        }
        try {
            Thread.sleep(100)
        } catch (e: Exception) {
            e.printStackTrace()
        }
    }
}

fun main(args: Array<String>) {
    val nah = Runner()
    val blah = Runner()
    nah.start()
    blah.start()
}
What is incorrect?
Well, the output of this code can basically be any interleaved version of the two threads' numbers. They'll both individually print the numbers from 0 to 9 in sequence, but there is no guarantee of when this happens in relation to each other, since there's no synchronization between them, and without a delay inside the loop, they are both just trying to print the numbers as fast as possible.
These are all perfectly valid outcomes for this code:
T1 1 2 3
T2 1 2 3
out 1 1 2 2 3 3
T1 1 2 3
T2 1 2 3
out 1 2 3 1 2 3
T1 1 2 3
T2 1 2 3
out 1 2 3 1 2 3
T1 1 2 3
T2 1 2 3
out 1 1 2 3 2 3
If you put the delay inside the loop, and it's long enough, you'll somewhat guarantee that you get a sequence like 1 1 2 2 3 3, but the order in which the two copies of a number are printed between the threads is still up to how the threads happen to get scheduled in a particular run.
< ~100 ms >
T1 1 2 3
T2 1 2 3
out 11 22 33
T1 1 2 3
T2 1 2 3
out 11 22 33
T1 1 2 3
T2 1 2 3
out 11 22 33
Note that Thread.sleep itself isn't perfectly accurate either, and using it for this purpose will just happen to work out most of the time if you put in a long enough delay here.
If you want to write multithreaded code that runs predictably, look into different synchronization methods (synchronized, locks, semaphores, etc).
Change to
class Runner: Thread() {
    override fun run() {
        var i = 0
        while (i < 10) {
            println("${Thread.currentThread().name} $i")
            i++
            try {
                Thread.sleep(100)
            } catch (e: Exception) {
                e.printStackTrace()
            }
        }
    }
}
This may be close to what you want, but the result will not always come out in exactly this order.
T1 1
T2 1
T1 2
T2 2
T1 3
T2 3
.
.
.
I have a function that creates a vector of size N, and shuffles it:
#include <algorithm>
#include <cstdlib>
#include <ctime>
#include <vector>
using namespace std;

void rand_vector_generator(int N) {
    srand(time(NULL));
    vector<int> perm(N);
    for (unsigned k = 0; k < N; k++) {
        perm[k] = k;
    }
    random_shuffle(perm.begin(), perm.end());
}
I'm calling this from my main function with the loop:
for(int i=0; i<20; i++)
rand_vector_generator(10);
I expected this not to give me sufficient randomness in my shuffling, because I'm calling srand(time(NULL)); on every function call and the seed barely changes from one call to the next. My understanding is that I should call srand(time(NULL)); once, not multiple times, so that the seed doesn't "reset".
This thread somewhat affirms what I was expecting the result to be.
Instead, I get:
6 0 3 5 7 8 4 1 2 9
0 8 6 4 2 3 7 9 1 5
8 2 4 9 5 0 6 7 1 3
0 6 1 8 7 4 5 2 3 9
2 5 1 0 3 7 6 4 8 9
4 5 3 0 1 7 2 9 6 8
8 5 2 9 7 0 6 3 4 1
8 4 9 3 1 5 7 0 6 2
3 7 6 0 9 8 2 4 1 5
8 5 2 3 7 4 6 9 1 0
5 4 0 1 2 6 8 7 3 9
2 5 7 9 6 0 4 3 1 8
5 8 3 7 0 2 1 6 9 4
7 4 9 5 1 8 2 3 0 6
1 9 2 3 8 6 0 7 5 4
0 6 4 3 1 2 9 7 8 5
9 3 8 4 7 5 1 6 0 2
1 9 6 5 3 0 2 4 8 7
7 5 1 8 9 3 4 0 2 6
2 9 6 5 4 0 3 7 8 1
These vectors seem pretty randomly shuffled to me. What am I missing? Does the srand call somehow live in a different scope than the function call, so it doesn't get reset on every call? Or am I misunderstanding something more fundamental here?
According to the standard, the source of randomness used by std::random_shuffle (when you don't pass one explicitly) is implementation-defined; it is often std::rand, but this is not guaranteed. Try it on another compiler? Another platform?
If you want to make sure std::rand is used, you should let your code use it explicitly (for example, with a lambda expression):
random_shuffle(perm.begin(), perm.end(), [](int n) { return std::rand() % n; });
On a somewhat unrelated note, the precision of time() is one whole second; your code runs much faster than that (I would hope), so those multiple calls to srand() keep resetting to the same (or nearly the same) seed.
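For completeness, a minimal sketch (not from the original question) of the usual modern alternative: seed a generator once and pass it to std::shuffle, so the result depends neither on the resolution of time() nor on whatever source std::random_shuffle happens to use:

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

void rand_vector_generator(int N, std::mt19937 &gen) {
    std::vector<int> perm(N);
    std::iota(perm.begin(), perm.end(), 0);   // 0, 1, ..., N-1
    std::shuffle(perm.begin(), perm.end(), gen);
}

int main() {
    std::mt19937 gen(std::random_device{}());   // seeded once
    for (int i = 0; i < 20; ++i)
        rand_vector_generator(10, gen);
}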
I'm not sure where the issue is occurring; I feel like this is a basic locking setup. Each thread tries to grab the lock, and the one that gets it goes through the critical section while the other has to wait until the lock is released.
But the odd thing is that numLanesLeft is sometimes equal for both threads.
For example, it should be:
Thread 1 - 16 - Thread 2 - 16
Thread 1 - 15 - Thread 2 - 14
Thread 1 - 12 - Thread 2 - 13
or something similar all the way down to 0 but I tend to get:
Thread 1 - 16 - Thread 2 - 16
Thread 1 - 14 - Thread 2 - 14
Thread 1 - 12 - Thread 2 - 13
where both threads have the same numLanesLeft
void ShooterAction(int rate, Color PlayerColor, int numRound) {
    numLanesLeft = Gallery->Count();
    // used to get the random lane
    int randLane;
    int getColor;
    // while there are still lanes that haven't been shot at
    while (numLanesLeft > 0) {
        // only one operation, share lock
        pthread_mutex_lock(&mutexLock);
        printf("color: %i numLanesLeft: %i \n", PlayerColor, numLanesLeft);
        // randomly pick a lane between 0 - 15
        randLane = rand() % 16;
        // if Rouge picked a lane that has already been fired at, look for a free lane
        while (Gallery->Get(randLane) != white) {
            randLane = rand() % 16;
        }
        // set the lane's colour
        getColor = Gallery->Set(randLane, PlayerColor);
        // set the thread to wait X amount of time to simulate a shot
        usleep(1000000 / rate);
        // decrement the number of lanes
        numLanesLeft--;
        // unlock for other threads
        pthread_mutex_unlock(&mutexLock);
    }
}
Results for printf("color: %i numLanesLeft: %i \n", PlayerColor, numLanesLeft)
color: 1 numLanesLeft: 16
color: 2 numLanesLeft: 15
color: 1 numLanesLeft: 14
color: 2 numLanesLeft: 12
color: 1 numLanesLeft: 12
color: 2 numLanesLeft: 11
color: 1 numLanesLeft: 10
color: 2 numLanesLeft: 9
color: 1 numLanesLeft: 9
color: 2 numLanesLeft: 8
color: 1 numLanesLeft: 7
color: 2 numLanesLeft: 6
color: 1 numLanesLeft: 5
color: 2 numLanesLeft: 4
color: 1 numLanesLeft: 3
color: 2 numLanesLeft: 2
color: 1 numLanesLeft: 1
The problem is in the first line of the method:
numLanesLeft = Gallery->Count();
numLanesLeft is a global variable that is being accessed and updated by more than one thread without any lock.
If the first thread starts and counts down 16, 15, 14, and only then does the second thread start running, the second thread will reset the value of numLanesLeft at the beginning of the method.
Second issue: the condition (numLanesLeft > 0) can change between the moment a thread checks it and the moment it actually acquires the lock, so it is safer to re-check inside the critical section that (numLanesLeft > 0) still holds before doing the work (see the sketch below).
Other than that, I can't find any reason why you would be getting such results!
Note: it's recommended to use std::cout instead of printf in C++.
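A minimal self-contained sketch of the pattern described above (stand-alone code, not the questioner's full program): the shared counter is only read and modified while the mutex is held, and the loop condition is re-checked after the lock is acquired:

#include <pthread.h>
#include <cstdio>

static int numLanesLeft = 16;
static pthread_mutex_t mutexLock = PTHREAD_MUTEX_INITIALIZER;

void *shooter(void *arg) {
    int id = *(int *)arg;
    while (true) {
        pthread_mutex_lock(&mutexLock);
        if (numLanesLeft <= 0) {          // re-check the condition under the lock
            pthread_mutex_unlock(&mutexLock);
            break;
        }
        printf("thread %d numLanesLeft: %d\n", id, numLanesLeft);
        numLanesLeft--;                   // modified only while holding the lock
        pthread_mutex_unlock(&mutexLock);
    }
    return nullptr;
}

int main() {
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;
    pthread_create(&t1, nullptr, shooter, &id1);
    pthread_create(&t2, nullptr, shooter, &id2);
    pthread_join(t1, nullptr);
    pthread_join(t2, nullptr);
}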