Default Loop Iteration Scheduling in OpenMP - c++

I used the following OpenMP code:
omp_set_num_threads(6);
#pragma omp parallel for
for (int i = 0; i < NUMS; ++i) {
    printf("id is %3d thread is %d\n", i, omp_get_thread_num());
}
I read (in a blog post) that each thread is allocated an equal share of the iterations and, when the number of iterations is not divisible by the number of threads, the per-thread count is rounded up.
Well, I first set NUMS=17, the result is as follows:
id is 12 thread is 4
id is 13 thread is 4
id is 14 thread is 4
id is 9 thread is 3
id is 10 thread is 3
id is 11 thread is 3
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 6 thread is 2
id is 7 thread is 2
id is 8 thread is 2
id is 15 thread is 5
id is 16 thread is 5
id is 3 thread is 1
id is 4 thread is 1
id is 5 thread is 1
As can be seen, ⌈17/6⌉ = 3 (rounding up), so the result is as expected.
Next I set NUMS=19; rounding up gives ⌈19/6⌉ = 4, so each thread should be allocated 4 iterations. However:
id is 0 thread is 0
id is 1 thread is 0
id is 2 thread is 0
id is 3 thread is 0
id is 10 thread is 3
id is 11 thread is 3
id is 12 thread is 3
id is 7 thread is 2
id is 8 thread is 2
id is 9 thread is 2
id is 13 thread is 4
id is 14 thread is 4
id is 15 thread is 4
id is 4 thread is 1
id is 5 thread is 1
id is 6 thread is 1
id is 16 thread is 5
id is 17 thread is 5
id is 18 thread is 5
As you can see, only the first thread is assigned 4 iterations.
So now I can't figure it out: what exactly is the reason for this? What exactly is OpenMP's default loop schedule?

Summarising all the comments to create an answer (so thanks to all who commented).
First, what the standard says/requires :-
It says nothing about which schedule should be used when it is unspecified.
schedule(static) with no chunk_size is only specified as allocating "approximately equal" numbers of iterations to each available thread.
Second, what happens in reality at present :-
Compilers default to using schedule(static) when no schedule is specified. (Though schedule(nonmonotonic:dynamic) might, now, be a better choice.)
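Since the default is implementation-defined, it can be worth spelling the schedule out in the directive itself. A minimal sketch based on the question's loop (the schedule clause is the only addition):
#include <omp.h>
#include <stdio.h>

int main() {
    const int NUMS = 19;
    omp_set_num_threads(6);
    // Request the schedule explicitly instead of relying on the
    // implementation default; swap in schedule(nonmonotonic:dynamic)
    // to let idle threads grab the remaining iterations instead.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < NUMS; ++i)
        printf("id is %3d thread is %d\n", i, omp_get_thread_num());
    return 0;
}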
At least the LLVM OpenMP runtime allocates iterations to threads as this Python code shows: the first total % numThreads threads each get one iteration more than the base total // numThreads. (The critical part is myCount; the rest is just to test it and show your test cases.)
#
# Show number of iterations allocated to each thread
# by schedule(static) in the LLVM runtime
#
def myCount(me, numThreads, total):
    base = total // numThreads
    remainder = total % numThreads
    return base + 1 if me < remainder else base

def test(numThreads, total):
    print("Threads: ", numThreads, " Iterations: ", total)
    allocated = 0
    for thread in range(0, numThreads):
        mine = myCount(thread, numThreads, total)
        allocated += mine
        print(thread, ": ", mine)
    if allocated != total:
        print("***ERROR*** ", allocated, " allocated, ", total, " requested")

test(6, 17)
test(6, 19)
If you run that you can see the result for your two test cases:-
% python3 static.py
Threads: 6 Iterations: 17
0 : 3
1 : 3
2 : 3
3 : 3
4 : 3
5 : 2
Threads: 6 Iterations: 19
0 : 4
1 : 3
2 : 3
3 : 3
4 : 3
5 : 3
If you want to get into the full horror of loop scheduling, there is a whole chapter on this in "High Performance Parallel Runtimes -- Design and Implementation".
p.s. It's worth noting that the schedule shown above cannot be requested explicitly by setting a chunk_size on a static schedule, since the standard does not then allow the remainder iterations to be split up as they are here. E.g. when allocating 10 iterations to 4 threads, schedule(static, 2) gives (4,2,2,2) and schedule(static, 3) gives (3,3,3,1), whereas the scheme above gives (3,3,2,2), which has an imbalance of 1 where the explicit schemes each have an imbalance of 2.
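For illustration, a small sketch (using the 10-iterations/4-threads numbers from the p.s. above) showing the round-robin chunk assignment that an explicit chunk_size produces:
#include <omp.h>
#include <stdio.h>

int main() {
    // chunk_size 2 deals the chunks {0,1}{2,3}{4,5}{6,7}{8,9} to
    // threads 0,1,2,3,0 in round-robin order, so thread 0 runs 4
    // iterations and threads 1-3 run 2 each: the (4,2,2,2) split.
    #pragma omp parallel for schedule(static, 2) num_threads(4)
    for (int i = 0; i < 10; ++i)
        printf("iteration %2d -> thread %d\n", i, omp_get_thread_num());
    return 0;
}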

Related

Circular traverse of parameter values algorithm [closed]

Could you please help with an algorithm (I use C++) which seemed so simple at first glance?
I need a total enumeration of all combinations of N parameters, to run some target function on every combination and choose the optimal one.
Let's say there are three parameters (it can be any number set by the user, but for this example let's assume 3).
Possible values (any parameter can have any number of values, NOT a fixed number; the user sets the counts before the program starts):
name     value  value  value  value
param1   1      2      ..
param2   10     20     ..
param3   100    200    300    ..
So the number of combinations = 12 (2 * 2 * 3 = 12)
All combinations:
#    param1  param2  param3
1    1       10      100
2    2       10      100
3    1       20      100
4    2       20      100
5    1       10      200
6    2       10      200
7    1       20      200
8    2       20      200
9    1       10      300
10   2       10      300
11   1       20      300
12   2       20      300
OK, let's say the order may be different:
#    param1  param2  param3
1    1       10      100
2    1       20      100
3    1       10      200
4    1       20      200
5    1       10      300
6    1       20      300
7    2       10      100
8    2       20      100
9    2       10      200
10   2       20      200
11   2       10      300
12   2       20      300
However, it's obvious that one counter should change while the others stay fixed, and when one counter finishes, the next one in the chain should be incremented.
This approach seems quite simple, but I still can't find an implementation. I thought of using a list for the parameters: as one counter finishes iterating over its values, it increments the next parameter's counter and resets the first one. But how do I put that into a couple of loops...? I intuitively feel there should be quite a simple solution (see the C++ sketch after the pseudo-code below).
Another approach I thought of: use all the combinations to build a graph, then traverse the whole graph and in the end pick the optimal combination. But if I can fill such a graph, I have already solved the problem, and building the graph is just a waste of time and memory.
For now there is a sketch (in pseudo-code) like this:
std::list<param> params
bool isDone = false

func(node* n)
{
    if(n->prev)
    {
        n->GetCurrentValue() // return current value and increment by one
        n->prev->reset();
        func(n->prev)
        Run(); // target function
        if(n->IsDone()) // reached the end of the values
        {
            if(n->next)
                func(n->next);
            else
                isDone = true;
        }
    }
    else // first node in the list
    {
        while(n->IsDone()) // reached the end of the values
        {
            n->GetCurrentValue() // return current value and increment by one
            Run() // target function
        }
        n.reset() // set internal counter for the node to 0
        func(n->next())
    }
}

while(!isDone)
{
    for(p : params)
    {
        func(p)
    }
}
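For reference, a minimal sketch of the counter ("odometer") idea in plain C++, written as an illustration rather than taken from anywhere; values holds the example's parameter values and the printing stands in for the target function:
#include <cstddef>
#include <iostream>
#include <vector>

int main() {
    // The example's three parameters: 2 * 2 * 3 = 12 combinations.
    std::vector<std::vector<int>> values = {
        {1, 2}, {10, 20}, {100, 200, 300}};
    std::vector<std::size_t> indices(values.size(), 0); // the "odometer"

    while (true) {
        // Visit the current combination (stand-in for the target function).
        for (std::size_t p = 0; p < values.size(); ++p)
            std::cout << values[p][indices[p]] << ' ';
        std::cout << '\n';

        // Increment like an odometer: bump the first counter; on
        // overflow reset it and carry into the next counter.
        std::size_t p = 0;
        while (p < indices.size() && ++indices[p] == values[p].size()) {
            indices[p] = 0;
            ++p;
        }
        if (p == indices.size()) break; // every counter overflowed: done
    }
    return 0;
}
This visits the 12 combinations in exactly the order of the first table above, with no recursion and no graph.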

parallel processing for C++11's vector

When I try to average images put into a vector asynchronously (for example, concurrency::concurrent_vector<cv::Mat>), how can I parallelize the sum over points or batches (1 row, 1 column, or a 3*3 block) at the same coordinates or area?
I would appreciate it if you could tell me how to calculate the values in the vector by column or by batch rather than in single units (e.g. a nested for).
(Edit)
For example
If I have 3 threads for image processing, and each thread's result is
thread 1
1 1 1
1 1 1
1 1 1
and thread 2
2 2 2
2 2 2
2 2 2
thread 3
6 6 6
6 6 6
6 6 6
then what I want is
3 3 3
3 3 3
3 3 3
I have thought of two ways to calculate the average of all the threads' images.
1. Just sum each thread's result as it is delivered to the main thread, and count how many results have been delivered.
If thread 1's and thread 2's results have been delivered to the main thread:
(1) sum
3 3 3
3 3 3
3 3 3
(2) Save the count of summed results and the coordinates.
In this example, the saved values would be:
1 - count of summed results
Rect(0, 0, 2, 2) - coordinates of the summed area
(3) Once all the threads' results have arrived, do the average over that area:
9 9 9
9 9 9
9 9 9
If the count-of-sum value is 2, find the covered area and do the average:
2(count of sum) / Rect(0, 0, 2, 2)
result will be
3 3 3
3 3 3
3 3 3
2. Just wait until all the threads' results have been delivered and do the average in batches, like:
1|1 1
1|1 1
1|1 1
2|2 2
2|2 2
2|2 2
6|6 6
6|6 6
6|6 6
|
9
9
9
|
3
3
3
But I don't know how to access and combine each thread's image through memory references. If thread 1's image address (in this case, the address of the (0,0) pixel data) is 100, and thread 2's image starts at address 200, then the (0,0) pixel in the result image would be calculated as *100 + *200.
Of course, before doing this operation, I would have to check that the memory matching the coordinates holds the correct value.
Also, someone told me that if I used std::reduce it would be easy to implement, but I have no idea how to apply that function in this way.
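For what it's worth, here is a minimal sketch of the element-wise averaging in plain C++. It is not OpenCV code: each "image" is a flat std::vector<float> standing in for cv::Mat data, and averageImages is an illustrative name, not a library function.
#include <cstddef>
#include <iostream>
#include <vector>

// Average N equally-sized images element by element: sum pixel i
// across all images, then divide once by the image count.
std::vector<float> averageImages(const std::vector<std::vector<float>>& images) {
    std::vector<float> result(images.front().size(), 0.0f);
    for (const auto& img : images)
        for (std::size_t i = 0; i < result.size(); ++i)
            result[i] += img[i];
    for (float& px : result)
        px /= static_cast<float>(images.size());
    return result;
}

int main() {
    // Three 3x3 "images" matching the example: all 1s, all 2s, all 6s.
    std::vector<std::vector<float>> imgs{
        std::vector<float>(9, 1.0f),
        std::vector<float>(9, 2.0f),
        std::vector<float>(9, 6.0f)};
    for (float px : averageImages(imgs))
        std::cout << px << ' ';  // prints nine 3s
    std::cout << '\n';
    return 0;
}
Note that std::reduce and the parallel execution policies are C++17, not C++11; with plain C++11 you would split the outer accumulation across your own threads (e.g. one batch of rows per thread) and combine the partial sums at the end.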

shifting with re-sampling in time series data

Assume that I have this time-series data:
            A  B
timestamp
1           1  2
2           1  2
3           1  1
4           0  1
5           1  0
6           0  1
7           1  0
8           1  1
I am looking for a resample window that would give me a specific count of occurrences, at least for some frequency.
If I resample the data from timestamps 1 to 8 with a 2S window, I get a different maximum than if I start from 2 to 8 with the same window size (2S).
import pandas as pd

for shift in range(1, 100):
    tries = 1
    series = pd.read_csv("file.csv", index_col='timestamp')[shift:]
    ds = series.resample(str(tries) + 'S').sum()
    while (ds.A.max() + ds.B.max() < 4) and (tries < len(ds)):
        ds = series.resample(str(tries) + 'S').sum()
        tries = tries + 1
    # other lines
I am looking for a performance improvement, as this takes prohibitively long to finish for large data.

Threads unexpectedly running in sequence

I expected the output to be 0 0 1 1 2 2 ... but the output was 0 1 2 3 ... 0 1 2 3.
class Runner : Thread() {
    override fun run() {
        var i = 0
        while (i < 10) {
            println("${Thread.currentThread().name} $i")
            i++
        }
        try {
            Thread.sleep(100)
        } catch (e: Exception) {
            e.printStackTrace()
        }
    }
}

fun main(args: Array<String>) {
    val nah = Runner()
    val blah = Runner()
    nah.start()
    blah.start()
}
What is incorrect?
Well, the output of this code can be basically any interleaving of the two threads' numbers. Each thread will individually print its numbers in sequence, but there is no guarantee about when that happens relative to the other thread, since there is no synchronization between them, and without the delay inside the loop they are both just trying to print their numbers as fast as possible.
These are all perfectly valid outcomes for this code:
T1 1 2 3
T2 1 2 3
out 1 1 2 2 3 3
T1 1 2 3
T2 1 2 3
out 1 2 3 1 2 3
T1 1 2 3
T2 1 2 3
out 1 1 2 3 2 3
If you put the delay inside the loop, and it's long enough, you'll somewhat guarantee that you get a sequence like 1 1 2 2 3 3, but the order in which the two copies of a number are printed between the threads is still up to how the threads happen to get scheduled in a particular run.
< ~100 ms >
T1 1 2 3
T2 1 2 3
out 11 22 33
Note that Thread.sleep itself isn't perfectly accurate either, and using it for this purpose will just happen to work out most of the time if you put in a long enough delay here.
If you want to write multithreaded code that runs predictably, look into different synchronization methods (synchronized, locks, semaphores, etc).
Change to
class Runner : Thread() {
    override fun run() {
        var i = 0
        while (i < 10) {
            println("${Thread.currentThread().name} $i")
            i++
            try {
                Thread.sleep(100)
            } catch (e: Exception) {
                e.printStackTrace()
            }
        }
    }
}
This may be close to what you want, but even this result is not guaranteed every time:
T1 1
T2 1
T1 2
T2 2
T1 3
T2 3
.
.
.

Specific thread order in C using GCC and OMP

I need to make 4 teams with 4 threads each, with each team on contiguous processors.
The result I'm expecting is, for example:
Team 0 Thread 0 Processor: 0
Team 0 Thread 1 Processor: 1
Team 0 Thread 2 Processor: 2
Team 0 Thread 3 Processor: 3
Team 1 Thread 0 Processor: 4
Team 1 Thread 1 Processor: 5
Team 1 Thread 2 Processor: 6
Team 1 Thread 3 Processor: 7
Team 2 Thread 0 Processor: 8
Team 2 Thread 1 Processor: 9
Team 2 Thread 2 Processor: 10
Team 2 Thread 3 Processor: 11
Team 3 Thread 0 Processor: 12
Team 3 Thread 1 Processor: 13
Team 3 Thread 2 Processor: 14
Team 3 Thread 3 Processor: 15
I can handle Processor Affinity in GCC using the GOMP_CPU_AFFINITY variable.
I'm using:
#pragma omp parallel num_threads(4)
twice in order to get 2 fork levels.
At the moment I'm having this order in GOMP_CPU_AFFINITY:
0 4 8 12 1 2 3 5 6 7 9 10 11 13 14 15
So the first fork, the "fathers' fork", gets:
Team 0 Thread 0 Processor: 0
Team 1 Thread 0 Processor: 4
Team 2 Thread 0 Processor: 8
Team 3 Thread 0 Processor: 12
The problem I'm having is that the second group of forks is made without any order, so, for example, I could have this situation (I'm using #pragma omp atomic so only one 'father' can ask for more processors at any time):
Team 0 Thread 0 Processor: 0
Team 0 Thread 1 Processor: 5
Team 0 Thread 2 Processor: 6
Team 0 Thread 3 Processor: 7
Team 1 Thread 0 Processor: 4
Team 1 Thread 1 Processor: 13
Team 1 Thread 2 Processor: 14
Team 1 Thread 3 Processor: 15
Team 2 Thread 0 Processor: 8
Team 2 Thread 1 Processor: 1
Team 2 Thread 2 Processor: 2
Team 2 Thread 3 Processor: 3
Team 3 Thread 0 Processor: 12
Team 3 Thread 1 Processor: 9
Team 3 Thread 2 Processor: 10
Team 3 Thread 3 Processor: 11
The question is: is there any way to make this second request happen in order?
I think I would have to use some synchronization method with locks or something...
Thanks in advance!
Javier
Finally I could make this work; this is my code:
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sched.h>

int main(int argc, char *argv[]){
    int padre, hijo;
    int contador = 0;
    omp_set_nested(1);
    int suma;
    #pragma omp parallel private(padre) shared(contador) num_threads(4)
    {
        padre = omp_get_thread_num();
        {
            while(contador != padre){
                // Don't know what to put here
            };
            #pragma omp parallel private(hijo) shared(padre, contador) num_threads(4)
            {
                hijo = omp_get_thread_num();
                printf("\nFather: %d Son: %d Processor: %d\n", padre, hijo, sched_getcpu());
                #pragma omp master
                {
                    contador++;
                }
            }
        }
    }
}
Note: Padre is Father, Hijo is Son and Contador is Counter in Spanish :P
The problem I'm facing now is that if I compile my code with -O3 optimizations, the while loop 'disappears' unless I put, for example, a printf line inside the loop. I think I should ask that in another question!
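(A possible fix, sketched here as an assumption rather than tested code: the compiler may keep contador in a register, since nothing in the empty loop body tells it the value can change, so reading it through an OpenMP atomic forces a fresh load on every pass. This fragment would replace the empty busy-wait above, using the same contador/padre variables:)
// Sketch: re-read contador via an atomic read so -O3
// cannot hoist the load out of the busy-wait loop.
int copia;
do {
    #pragma omp atomic read
    copia = contador;
} while(copia != padre);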
Thanks to you all!
Javier