Deadlock in threads - c++

So I'm not sure where the issue is occurring; I feel like this is a basic locking setup. Each thread tries to grab the lock, and the one that succeeds goes through the critical section while the other waits until the lock is released.
But the odd thing is that my numLanesLeft is sometimes equal for both threads.
For example, it should be:
Thread 1 - 16 - Thread 2 - 16
Thread 1 - 15 - Thread 2 - 14
Thread 1 - 12 - Thread 2 - 13
or something similar all the way down to 0 but I tend to get:
Thread 1 - 16 - Thread 2 - 16
Thread 1 - 14 - Thread 2 - 14
Thread 1 - 12 - Thread 2 - 13
where both threads have the same numLanesLeft
-
void ShooterAction(int rate, Color PlayerColor, int numRound) {
    numLanesLeft = Gallery->Count();
    // used to get the random lane
    int randLane;
    int getColor;
    // while there are still lanes that haven't been shot at
    while (numLanesLeft > 0) {
        // only one operation, share lock
        pthread_mutex_lock(&mutexLock);
        printf("color: %i numLanesLeft: %i \n", PlayerColor, numLanesLeft);
        // randomly pick a lane between 0 - 15
        randLane = rand() % 16;
        // if Rouge picked a lane that has already been fired at, look for a free lane
        while (Gallery->Get(randLane) != white) {
            randLane = rand() % 16;
        }
        // set the lane's colour
        getColor = Gallery->Set(randLane, PlayerColor);
        // set the thread to wait X amount of time to simulate a shot
        usleep(1000000 / rate);
        // decrement the number of lanes
        numLanesLeft--;
        // unlock for the other threads
        pthread_mutex_unlock(&mutexLock);
    }
}
Results for printf("color: %i numLanesLeft: %i \n", PlayerColor, numLanesLeft)
color: 1 numLanesLeft: 16
color: 2 numLanesLeft: 15
color: 1 numLanesLeft: 14
color: 2 numLanesLeft: 12
color: 1 numLanesLeft: 12
color: 2 numLanesLeft: 11
color: 1 numLanesLeft: 10
color: 2 numLanesLeft: 9
color: 1 numLanesLeft: 9
color: 2 numLanesLeft: 8
color: 1 numLanesLeft: 7
color: 2 numLanesLeft: 6
color: 1 numLanesLeft: 5
color: 2 numLanesLeft: 4
color: 1 numLanesLeft: 3
color: 2 numLanesLeft: 2
color: 1 numLanesLeft: 1

The problem is in the first line of the method:
numLanesLeft = Gallery->Count();
numLanesLeft is a global variable that is read and updated by more than one thread without any lock.
If the first thread starts and counts down 16, 15, 14, and only then does the second thread start running, the second thread will reset numLanesLeft back to 16 at the beginning of the method.
Second issue: the condition (numLanesLeft > 0) can change between the time a thread checks it and the time it acquires the lock, so inside the critical section it is safer to check that (numLanesLeft > 0) still holds before running the body.
Other than that, I can't find any reason why you would get such results!
Note: it's generally recommended to use std::cout instead of printf in C++.

Related

Default Loop Iteration Scheduling in OpenMP

I used the following statement from OpenMP:
omp_set_num_threads(6);
#pragma omp parallel for
for (int i = 0; i < NUMS; ++i) {
    printf("id is %3d thread is %d\n", i, omp_get_thread_num());
}
I found out (in a blog post) that each thread is allocated an approximately equal number of iterations and that, when the number of iterations is not evenly divisible by the number of threads, the per-thread count is rounded up.
Well, I first set NUMS=17, the result is as follows:
id is 12 thread is 4
id is 13 thread is 4
id is 14 thread is 4
id is 9 thread is 3
id is 10 thread is 3
id is 11 thread is 3
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 6 thread is 2
id is 7 thread is 2
id is 8 thread is 2
id is 15 thread is 5
id is 16 thread is 5
id is 3 thread is 1
id is 4 thread is 1
id is 5 thread is 1
As can be seen, ⌈17/6⌉ = 3 (rounding up; forgive me, I don't know how to insert LaTeX formulas), so the result is as expected.
However, if I set NUMS=19, then rounding up gives ⌈19/6⌉ = 4, so each thread should be allocated 4 iterations. However:
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 3 thread is 0
id is 10 thread is 3
id is 11 thread is 3
id is 12 thread is 3
id is 7 thread is 2
id is 8 thread is 2
id is 9 thread is 2
id is 13 thread is 4
id is 14 thread is 4
id is 15 thread is 4
id is 4 thread is 1
id is 5 thread is 1
id is 6 thread is 1
id is 16 thread is 5
id is 17 thread is 5
id is 18 thread is 5
As you can see, only the first thread is assigned 4 iterations.
So I can't figure it out now, what exactly is the reason for this? What exactly is OpenMP's default thread scheduling?
Summarising all the comments to create an answer (so thanks to all who commented).
First, what the standard says/requires :-
It says nothing about which schedule should be used when it is unspecified.
schedule(static) with no chunk_size is only specified as allocating "approximately equal" numbers of iterations to each available thread.
Second, what happens in reality at present :-
Compilers default to using schedule(static) when no schedule is specified. (Though schedule(nonmonotonic:dynamic) might, now, be a better choice.)
At least the LLVM OpenMP runtime allocates iterations to threads as this Python code shows (the critical part is myCount, the rest is just to test it and show your test cases).
#
# Show number of iterations allocated to each thread
# by schedule(static) in the LLVM runtime
#
def myCount(me, numThreads, total):
    base = total // numThreads
    remainder = total % numThreads
    return base + 1 if me < remainder else base

def test(numThreads, total):
    print("Threads: ", numThreads, " Iterations: ", total)
    allocated = 0
    for thread in range(0, numThreads):
        mine = myCount(thread, numThreads, total)
        allocated += mine
        print(thread, ": ", mine)
    if allocated != total:
        print("***ERROR*** ", allocated, " allocated, ", total, " requested")

test(6, 17)
test(6, 19)
If you run that you can see the result for your two test cases:-
% python3 static.py
Threads: 6 Iterations: 17
0 : 3
1 : 3
2 : 3
3 : 3
4 : 3
5 : 2
Threads: 6 Iterations: 19
0 : 4
1 : 3
2 : 3
3 : 3
4 : 3
5 : 3
If you want to get into the full horror of loop scheduling, there is a whole chapter on this in "High Performance Parallel Runtimes -- Design and Implementation".
p.s. It's worth noting that the schedule shown above cannot be explicitly requested by setting a chunk_size on a static schedule, since the standard does not then allow the remainder iterations to be split up as they are here. (E.g., when allocating 10 iterations to 4 threads: with chunk_size 2 we'd get (4,2,2,2), and with chunk_size 3 we'd get (3,3,3,1), whereas the scheme above gives (3,3,2,2), which has an imbalance of 1 where the explicit schemes each have an imbalance of 2.)

parallel processing for C++11's vector

When I try to average the images in a vector asynchronously (for example, a concurrency::concurrent_vector<cv::Mat>), how can I parallelize the sum over points or batches (1 row, 1 column, or a 3*3 area) at the same coordinates?
I would appreciate it if you could tell me how to calculate the values in the vector by column or by batch rather than one element at a time (e.g. nested for loops).
(Edit)
For example
If I have 3 threads doing image processing, and each one's result is
thread 1
1 1 1
1 1 1
1 1 1
and thread 2
2 2 2
2 2 2
2 2 2
thread 3
6 6 6
6 6 6
6 6 6
then what I want is
3 3 3
3 3 3
3 3 3
I have thought of two ways to calculate the average of all the threads' images.
1. Sum each thread's result as it is delivered to the main thread, and count how many results have been delivered.
If thread 1's and thread 2's results have been delivered to the main thread:
(1) sum
3 3 3
3 3 3
3 3 3
(2) save the count of sums and the coordinates
In this example, the saved values will be
1 - count of sums
Rect(0, 0, 2, 2) - coordinates of the summed area
(3) once every thread's result has arrived, do the average over the summed area
9 9 9
9 9 9
9 9 9
if the count-of-sums value is 2, find the summed area and do the average:
2 (count of sums) / Rect(0, 0, 2, 2)
result will be
3 3 3
3 3 3
3 3 3
2. Just wait until all the threads' results have been delivered, then do the average in batches, like
1|1 1
1|1 1
1|1 1
2|2 2
2|2 2
2|2 2
6|6 6
6|6 6
6|6 6
|
9
9
9
|
3
3
3
But I don't know how to access and combine each thread's image through memory references. If thread 1's image address (in this case, the address of the (0,0) pixel data) is 100, and thread 2's image starts at address 200, then the (0,0) pixel of the result image would be calculated as *100 + *200.
Of course, before doing this operation, I would have to check that the memory matching the coordinates holds the correct value.
Also, someone told me that std::reduce would make this easy to implement, but I have no idea how to apply that function here.

Keep first record when event occurs

I have the following data in Stata:
clear
* Input data
input grade id exit time
1 1 . 10
2 1 . 20
3 1 2 30
4 1 0 40
5 1 . 50
1 2 0 10
2 2 0 20
3 2 0 30
4 2 0 40
5 2 0 50
1 3 1 10
2 3 1 20
3 3 0 30
4 3 . 40
5 3 . 50
1 4 . 10
2 4 . 20
3 4 . 30
4 4 . 40
5 4 . 50
1 5 1 10
2 5 2 20
3 5 1 30
4 5 1 40
5 5 1 50
end
The objective is to take the first row for each id when an event occurs and, if no event occurs, to take the last record for each id. Here is an example of the data I hope to obtain:
* Input data
input grade id exit time
3 1 2 30
5 2 0 50
1 3 1 10
5 4 . 50
1 5 1 10
end
The definition of an event appears to be that exit is not zero or missing. If so, then all you need to do is tweak the code in my previous answer:
bysort id (time): egen when_first_e = min(cond(exit > 0 & exit < ., time, .))
by id: gen tokeep = cond(when_first_e == ., time == time[_N], time == when_first_e)
Previous thread was here.

Threads unexpectedly running in sequence

I expected to output be 0 0 1 1 2 2 ... but the output was 0 1 2 3 ... 0 1 2 3
class Runner: Thread() {
    override fun run() {
        var i = 0
        while (i < 10) {
            println("${Thread.currentThread().name} $i")
            i++
        }
        try {
            Thread.sleep(100)
        } catch (e: Exception) {
            e.printStackTrace()
        }
    }
}

fun main(args: Array<String>) {
    val nah = Runner()
    val blah = Runner()
    nah.start()
    blah.start()
}
What is incorrect?
Well, the output of this code can basically be any interleaved version of the two threads' numbers. They'll both individually print their numbers in sequence, but there is no guarantee of when this happens relative to the other thread, since there's no synchronization between them, and without a delay inside the loop they both just print the numbers as fast as possible.
These are all perfectly valid outcomes for this code:
T1 1 2 3
T2 1 2 3
out 1 1 2 2 3 3
T1 1 2 3
T2 1 2 3
out 1 2 3 1 2 3
T1 1 2 3
T2 1 2 3
out 1 2 3 1 2 3
T1 1 2 3
T2 1 2 3
out 1 1 2 3 2 3
If you put the delay inside the loop, and it's long enough, you'll somewhat guarantee that you get a sequence like 1 1 2 2 3 3, but the order in which the two copies of a number are printed between the threads is still up to how the threads happen to get scheduled in a particular run.
< ~100 ms >
T1 1 2 3
T2 1 2 3
out 11 22 33
T1 1 2 3
T2 1 2 3
out 11 22 33
T1 1 2 3
T2 1 2 3
out 11 22 33
Note that Thread.sleep itself isn't perfectly accurate either, and using it for this purpose will just happen to work out most of the time if you put in a long enough delay here.
If you want to write multithreaded code that runs predictably, look into different synchronization methods (synchronized, locks, semaphores, etc).
Change to
class Runner: Thread() {
    override fun run() {
        var i = 0
        while (i < 10) {
            println("${Thread.currentThread().name} $i")
            i++
            try {
                Thread.sleep(100)
            } catch (e: Exception) {
                e.printStackTrace()
            }
        }
    }
}
This may be close to what you want, but even this result is not guaranteed every time.
T1 1
T2 1
T1 2
T2 2
T1 3
T2 3
.
.
.

Specific thread order in C using GCC and OMP

I need to make 4 teams with 4 threads each, with each team on contiguous processors.
The result I'm expecting is, for example:
Team 0 Thread 0 Processor: 0
Team 0 Thread 1 Processor: 1
Team 0 Thread 2 Processor: 2
Team 0 Thread 3 Processor: 3
Team 1 Thread 0 Processor: 4
Team 1 Thread 1 Processor: 5
Team 1 Thread 2 Processor: 6
Team 1 Thread 3 Processor: 7
Team 2 Thread 0 Processor: 8
Team 2 Thread 1 Processor: 9
Team 2 Thread 2 Processor: 10
Team 2 Thread 3 Processor: 11
Team 3 Thread 0 Processor: 12
Team 3 Thread 1 Processor: 13
Team 3 Thread 2 Processor: 14
Team 3 Thread 3 Processor: 15
I can handle Processor Affinity in GCC using the GOMP_CPU_AFFINITY variable.
I'm using:
#pragma omp parallel num_threads(4)
twice in order to get 2 fork levels.
At the moment I'm having this order in GOMP_CPU_AFFINITY:
0 4 8 12 1 2 3 5 6 7 9 10 11 13 14 15
So the first fork, the "fathers' fork", gets:
Team 0 Thread 0 Processor: 0
Team 1 Thread 0 Processor: 4
Team 2 Thread 0 Processor: 8
Team 3 Thread 0 Processor: 12
The problem I'm having is that the second set of forks happens in no particular order, so, for example, I could get this situation (I'm using #pragma omp atomic so only one 'father' can ask for more processors at a time):
Team 0 Thread 0 Processor: 0
Team 0 Thread 1 Processor: 5
Team 0 Thread 2 Processor: 6
Team 0 Thread 3 Processor: 7
Team 1 Thread 0 Processor: 4
Team 1 Thread 1 Processor: 13
Team 1 Thread 2 Processor: 14
Team 1 Thread 3 Processor: 15
Team 2 Thread 0 Processor: 8
Team 2 Thread 1 Processor: 1
Team 2 Thread 2 Processor: 2
Team 2 Thread 3 Processor: 3
Team 3 Thread 0 Processor: 12
Team 3 Thread 1 Processor: 9
Team 3 Thread 2 Processor: 10
Team 3 Thread 3 Processor: 11
The question is: is there any way to make this second request happen in order?
I think I would have to use some synchronization method with locks or something...
Thanks in advance!
Javier
Finally I got this working; this is my code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>

int main(int argc, char *argv[]) {
    int padre, hijo;
    int contador = 0;
    omp_set_nested(1);
    int suma;
    #pragma omp parallel private(padre) shared(contador) num_threads(4)
    {
        padre = omp_get_thread_num();
        {
            while (contador != padre) {
                // Don't know what to put here
            };
            #pragma omp parallel private(hijo) shared(padre, contador) num_threads(4)
            {
                hijo = omp_get_thread_num();
                printf("\nFather: %d Son: %d Processor: %d\n", padre, hijo, sched_getcpu());
                #pragma omp master
                {
                    contador++;
                }
            }
        }
    }
}
Note: Padre is Father, Hijo is Son and Contador is Counter in Spanish :P
The problem I'm facing now is that if I compile my code with -O3 optimizations, the while loop 'disappears' unless I put, for example, a printf line inside the loop. I think I should ask that in another question!
Thanks to you all!
Javier