I have a periodic Cartesian grid of MPI procs; for 6 procs the layout looks like this
 __________________________
|        |        |        |
|    0   |    1   |    2   |
|________|________|________|
|        |        |        |
|    3   |    4   |    5   |
|________|________|________|
where the numbers are the ranks of the proc's in the communicator.
During the calculation all procs have to send a number to their left neighbour, and this number should be summed with the one that the left neighbour already has:
int a[2];
a[0] = calculateSomething1();
a[1] = calculateSomething2();
int tempA;
MPI_Request recv_request, send_request;
//Receive from the right neighbour
MPI_Irecv(&tempA, 1, MPI_INT, myMPI.getRightNeigh(), 0, Cart_comm, &recv_request);
//Send to the left neighbour
MPI_Isend(&a[0], 1, MPI_INT, myMPI.getLeftNeigh(), 0, Cart_comm, &send_request);
MPI_Status status;
MPI_Wait(&recv_request, &status);
MPI_Wait(&send_request, &status);
//now I have to do something like this
a[1] += tempA;
I am wondering if there is a sort of "local" reduction operation for just a sender-receiver pair, or if the only solution is to have "local" communicators and use collective operations there?
You can use MPI_Sendrecv in this case. It is basically made for this case.
I don't think you would get any benefit from using collectives.
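For illustration, here is a minimal sketch of how the exchange above could look with MPI_Sendrecv, reusing Cart_comm and the neighbour helpers from your snippet (not tested, just the idea):

int tempA;
MPI_Status status;
// Send a[0] to the left neighbour and, in the same call, receive the right
// neighbour's value into tempA; MPI_Sendrecv avoids the manual
// Isend/Irecv/Wait bookkeeping and cannot deadlock here.
MPI_Sendrecv(&a[0], 1, MPI_INT, myMPI.getLeftNeigh(), 0,
             &tempA, 1, MPI_INT, myMPI.getRightNeigh(), 0,
             Cart_comm, &status);
a[1] += tempA;   // the "local reduction" with the received value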
BTW: your code is probably not correct. You are sending from a local stack variable &a[0]; you must complete the send (send_request) before leaving the scope and reusing a's memory. This is done by some form of MPI_Wait(all) or a successful MPI_Test.
For a university project we are implementing an algorithm capable of brute-forcing an AES key that we assume is partially known.
We have implemented several versions including one that exploits the multithreading mechanism in C++.
The implementation works by allocating a variable number of threads (passed as input at launch) and dividing the key space equally among them; each thread then cycles through its range attempting each key. In practice the implementation works, as it succeeds in finding the key for any combination of #bitsToHack/#threads, but it returns strange timing results.
//Structs for threads and respective data
pthread_t threads[num_of_threads];
struct bf_data td[num_of_threads];
int rc;

//Space division
uintmax_t index = pow(BASE_NUMBER, num_bits_to_hack);
uintmax_t step = index / num_of_threads;

if (sem_init(&s, 1, 0) != 0) {
    printf("Error during semaphore initialization\n");
    return -1;
}

for (int i = 0; i < num_of_threads; i++) {
    //Structure initialization
    td[i].ciphertext = ciphertext;
    td[i].hacked_key = hacked_key;
    td[i].iv_aes = iv_aes;
    td[i].key = key_aes;
    td[i].num_bits_to_hack = num_bits_to_hack;
    td[i].plaintext = plaintext;
    td[i].starting_point = step * i;
    td[i].step = step;
    td[i].num_of_threads = num_of_threads;

    if (DEBUG)
        printf("Starting point for thread %d is: %lu, using step: %lu\n", i, td[i].starting_point, td[i].step);

    rc = pthread_create(&threads[i], NULL, decryption_brute_force, (void*)&td[i]);
    if (rc) {
        cout << "Error:unable to create thread," << rc << endl;
        exit(-1);
    }
}

sem_wait(&s);

for (int i = 0; i < num_of_threads; i++) {
    pthread_join(threads[i], NULL);
}
The decryption_brute_force function (the body executed by each thread):
void* decryption_brute_force(void* data) {
    // Copy data into local thread memory
    // Build the key to begin the search from the starting point
    // For each key from starting_point to starting_point + step:
    //     Try decryption
    //     If the obtained plaintext corresponds to the expected one:
    //         Print results, wake up the main thread and terminate
    //     Else:
    //         Increment the key and continue
}
To conclude the project we intended to study the optimal number of threads, expecting an increase in performance as the number of threads increased up to a threshold, after which the system would no longer benefit from additional threads.
At the end of the analysis (a simulation lasting about 9 hours), we obtained the timing results shown in the plot (linked in the original post).
We cannot understand why 8 threads perform better than 16. Could it be due to the CPU architecture? Could it schedule 32 and 8 threads better than 16?
From the comments, I think the linear-search pattern in each thread yields different results for different numbers of threads. When you double the threads, the position of the key within its owning thread's range may shift to a later point; but once you double again, it cannot shift much further because there are so many threads. You also said you always use the same encrypted data. Did you try different inputs?
(The step variable is an integer, so the distribution of work may not be exact.)

8 threads & step = 7 (56 work items total), key at index 16 (0-based):

    01234567 89abcdef 01234567 89abcdef
    |        |        |.       |        ...
                       ^ found on the owning thread's first loop iteration
                         -> ~500 seconds

16 threads & step = 3 (56 work items total), same index 16:

    012 345 678 9ab cde f01 234 567 8
    |   |   |   |   |   | .  |   |   |  ...
                          ^ found only on the owning thread's second iteration
                            -> ~1000 seconds

Another example with 2 threads and 3 threads, where x is found at the 51st element of a 100-element workload:

    2 threads
    |                  |x (1st iteration)  |

    3 threads
    |            |........x     |          |
    about 5x slower than 2 threads
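To see this effect without running the full brute force, here is a small, purely illustrative sketch (the helper name and numbers are mine, not from the question) that computes on which iteration the owning thread reaches a given key index for different thread counts:

#include <cstdio>
#include <initializer_list>

// Simplified model of the question's partitioning: step = total / threads,
// and the key at target_index is reached on iteration (target_index % step) + 1
// of the thread that owns it. (Leftovers from the integer division are ignored
// here; the point is only to illustrate the non-monotonic timing.)
static long iterations_to_find(long total_work, int num_threads, long target_index) {
    long step = total_work / num_threads;
    return target_index % step + 1;
}

int main() {
    const long total = 100, target = 50;   // "x at the 51st element" from the example above
    for (int threads : {2, 3, 4, 8, 16})
        std::printf("%2d threads -> key reached on iteration %ld\n",
                    threads, iterations_to_find(total, threads, target));
}

Doubling the thread count does not always reduce the number of iterations the "lucky" thread needs, which matches the non-monotonic timings you measured.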
I want to know the details about the execution of recursive functions.
#include <iostream>

int a = 0;

int fac(int n) {
    if (n <= 1)
        return n;
    int temp = fac(n-2) + fac(n - 1);
    a++;
    return temp;
}

int main() {
    fac(4);
    std::cout << a;
}
The output is 4.
I want to know what happens when int temp = fac(n-2) + fac(n - 1); is executed. For example fac(4-2) + fac(4-1) ---> fac(2) + fac(3): at this point, will fac(2) be finished first, or are they finished together?
I'm not good at English, I hope that there is no obstacle to your reading.
Analysing this code purely in an algorithmic sense with no respect to C++ implementation intricacies,
fac(4)
= fac(2)            +  fac(3)
= (fac(0) + fac(1)) + (fac(1) + fac(2))
= (  0    +   1   ) + (  1    + (fac(0) + fac(1)))
= (  0    +   1   ) + (  1    + (  0    +   1   ))
How can I create a trace which shows the call order where there is recursion?
First, I want to note that the output produced by the compiler will not match one-to-one with the code you write. The compiler applies different levels of optimisation based on the flags provided to it, with the highest level being -O3 and the default being -O0, but those seem out of scope for this question. Creating a trace influences this process itself, as the compiler now needs to meet your expectations of what the trace looks like. The only true way to trace the actual execution flow is to read the assembly produced by the compiler.
Knowing that, we can apply a trace to see the call order by printing to the screen when execution enters the called function. Note that I have removed a, as it does not trace anything and only adds to the complexity of the explanation.
#include <iostream>

int fac(int n) {
    std::cout << "fac(" << n << ")" << std::endl;
    if (n <= 1)
        return n;
    int temp = fac(n-2) + fac(n - 1);
    return temp;
}

int main() {
    fac(4);
}
// Output
fac(4)
fac(2)
fac(0)
fac(1)
fac(3)
fac(1)
fac(2)
fac(0)
fac(1)
As seen from this output on my PC, execution proceeded from left to right, depth first. We can number our call tree with this order to obtain a better picture:
// Call order
                 1. fac(4)
       2. fac(2)      +      5. fac(3)
       |                     |
3. fac(0) + 4. fac(1)   6. fac(1) + 7. fac(2)
                                    |
                              8. fac(0) + 9. fac(1)
Note: this does not mean that the results will be the same on every implementation, nor that the order of execution is preserved when you remove the trace and allow compiler optimisations, but it demonstrates how recursion works in computer programming.
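If you actually need the two recursive calls to run in a fixed order, a small sketch (my own addition, not from the question) is to sequence them with separate statements instead of relying on the unspecified evaluation order of the operands of +:

int fac(int n) {
    if (n <= 1)
        return n;
    int left  = fac(n - 2);   // guaranteed to finish before the next line starts
    int right = fac(n - 1);   // runs strictly after 'left' has been computed
    return left + right;
}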
In this case fac(2) will be finished first and then fac(3) (the evaluation order of the two operands of + is not specified by the standard, so this is what this particular compiler happens to do). The output is 4.
The call tree would be like this:
fac(4)
 |---- fac(2)
 |       |---- fac(0)
 |       |---- fac(1)
 |
 |---- fac(3)
         |---- fac(1)
         |---- fac(2)
                 |---- fac(0)
                 |---- fac(1)
The fac() function returns immediately when n <= 1.
First of all, I know the problem with the code and how to get it to work. I'm mainly looking for an explanation of why my output is what it is.
The following piece of code replicates the behaviour:
void setup() {
  Serial.begin(9600);
}

void loop() {
  Serial.println("start");
  for (int i = 0; i < 70000; i++) {
    if ((i % 2000) == 0)
      Serial.println(i);
  }
}
Obviously the for loop will run forever because i will overflow at 32,767. I would expect it to overflow and print -32000.
Expected| Actually printed
0 | 0
2000 | 2000
4000 | 4000
... | ...
30000 | 30000
32000 | 32000
-32000 | 33536
-30000 | 35536
It looks like it prints the actual iteration count, since if you overflow and count up to -32000 you would have done 33536 iterations, but I can't figure out how it is able to print 33536.
The same thing happens every 65536 iterations:
95536 | 161072 | 226608 | 292144 | 357680
97536 | 163072 | 228608 | 294144 | 359680
99072 | 164608 | 230144 | 295680 | 361216
101072 | 166608 | 232144 | 297680 | 363216
EDIT
When I change the loop to add 10,000 every iteration and only print every 1,000,000 to speed it up, the Arduino crashes (or at least, the prints stop) at 2,147,000,000, which seems to point to the 32-bit idea of svtag.
EDIT2
The edits I made for the 2,147,000,000 check:
void loop() {
  Serial.println("start");
  for (int i = 0; i < 70000; i += 10000) {
    if ((i % 1000000) == 0)
      Serial.println(i);
  }
}
They behave the same way as the previous example, printing 0, 1000000, 2000000, ...
However, when I update the AVR package from 1.6.17 to the latest 1.6.23 I get only 0's. The original example (% 2000 & i++) still gives the same output.
The compiler may have automatically cast i to unsigned int when you do the println, or the value is going to the wrong overload. Try using Serial.println((int)i);
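For completeness, a hedged sketch of the usual fixes (untested on real hardware): either make the counter wide enough to actually reach 70000 on an 8-bit AVR, where int is 16 bits, or force the signed overload when printing, as suggested above.

void loop() {
  Serial.println("start");
  // Option 1: use a 32-bit counter so the loop condition and the printed
  // value behave as expected (int is only 16 bits on AVR).
  for (long i = 0; i < 70000L; i++) {
    if ((i % 2000) == 0)
      Serial.println(i);
  }
  // Option 2: keep the 16-bit int and cast when printing, so the value goes
  // to the signed overload: Serial.println((int)i);
}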
I started working with OpenMP using C++.
I have two questions:
What is #pragma omp for schedule?
What is the difference between dynamic and static?
Please, explain with examples.
Others have since answered most of the question but I would like to point to some specific cases where a particular scheduling type is more suited than the others. Schedule controls how loop iterations are divided among threads. Choosing the right schedule can have great impact on the speed of the application.
static schedule means that iteration blocks are mapped statically to the execution threads in a round-robin fashion. The nice thing about static scheduling is that the OpenMP run-time guarantees that if you have two separate loops with the same number of iterations and execute them with the same number of threads using static scheduling, then each thread will receive exactly the same iteration range(s) in both parallel regions. This is quite important on NUMA systems: if you touch some memory in the first loop, it will reside on the NUMA node where the executing thread was. Then in the second loop the same thread can access the same memory location faster since it will reside on the same NUMA node.
Imagine there are two NUMA nodes: node 0 and node 1, e.g. a two-socket Intel Nehalem board with 4-core CPUs in both sockets. Then threads 0, 1, 2, and 3 will reside on node 0 and threads 4, 5, 6, and 7 will reside on node 1:
| | core 0 | thread 0 |
| socket 0 | core 1 | thread 1 |
| NUMA node 0 | core 2 | thread 2 |
| | core 3 | thread 3 |
| | core 4 | thread 4 |
| socket 1 | core 5 | thread 5 |
| NUMA node 1 | core 6 | thread 6 |
| | core 7 | thread 7 |
Each core can access memory from each NUMA node, but remote access is slower (1.5x - 1.9x slower on Intel) than local node access. You run something like this:
char *a = (char *)malloc(8*4096);
#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
memset(&a[i*4096], 0, 4096);
4096 bytes in this case is the standard size of one memory page on Linux on x86 if huge pages are not used. This code will zero the whole 32 KiB array a. The malloc() call just reserves virtual address space but does not actually "touch" the physical memory (this is the default behaviour unless some other version of malloc is used, e.g. one that zeroes the memory like calloc() does). Now this array is contiguous but only in virtual memory. In physical memory half of it would lie in the memory attached to socket 0 and half in the memory attached to socket 1. This is so because different parts are zeroed by different threads and those threads reside on different cores and there is something called first touch NUMA policy which means that memory pages are allocated on the NUMA node on which the thread that first "touched" the memory page resides.
| | core 0 | thread 0 | a[0] ... a[4095]
| socket 0 | core 1 | thread 1 | a[4096] ... a[8191]
| NUMA node 0 | core 2 | thread 2 | a[8192] ... a[12287]
| | core 3 | thread 3 | a[12288] ... a[16383]
| | core 4 | thread 4 | a[16384] ... a[20479]
| socket 1 | core 5 | thread 5 | a[20480] ... a[24575]
| NUMA node 1 | core 6 | thread 6 | a[24576] ... a[28671]
| | core 7 | thread 7 | a[28672] ... a[32767]
Now let's run another loop like this:
#pragma omp parallel for schedule(static,1) num_threads(8)
for (int i = 0; i < 8; i++)
    memset(&a[i*4096], 1, 4096);
Each thread will access the already mapped physical memory and it will have the same mapping of thread to memory region as the one during the first loop. It means that threads will only access memory located in their local memory blocks which will be fast.
Now imagine that another scheduling scheme is used for the second loop: schedule(static,2). This will "chop" iteration space into blocks of two iterations and there will be 4 such blocks in total. What will happen is that we will have the following thread to memory location mapping (through the iteration number):
| | core 0 | thread 0 | a[0] ... a[8191] <- OK, same memory node
| socket 0 | core 1 | thread 1 | a[8192] ... a[16383] <- OK, same memory node
| NUMA node 0 | core 2 | thread 2 | a[16384] ... a[24575] <- Not OK, remote memory
| | core 3 | thread 3 | a[24576] ... a[32767] <- Not OK, remote memory
| | core 4 | thread 4 | <idle>
| socket 1 | core 5 | thread 5 | <idle>
| NUMA node 1 | core 6 | thread 6 | <idle>
| | core 7 | thread 7 | <idle>
Two bad things happen here:
threads 4 to 7 remain idle and half of the compute capability is lost;
threads 2 and 3 access non-local memory and it will take them about twice as much time to finish during which time threads 0 and 1 will remain idle.
So one of the advantages of using static scheduling is that it improves locality of memory access. The disadvantage is that a bad choice of scheduling parameters can ruin the performance.
dynamic scheduling works on a "first come, first served" basis. Two runs with the same number of threads might (and most likely would) produce completely different "iteration space" -> "threads" mappings as one can easily verify:
$ cat dyn.c
#include <stdio.h>
#include <omp.h>
int main (void)
{
int i;
#pragma omp parallel num_threads(8)
{
#pragma omp for schedule(dynamic,1)
for (i = 0; i < 8; i++)
printf("[1] iter %0d, tid %0d\n", i, omp_get_thread_num());
#pragma omp for schedule(dynamic,1)
for (i = 0; i < 8; i++)
printf("[2] iter %0d, tid %0d\n", i, omp_get_thread_num());
}
return 0;
}
$ icc -openmp -o dyn.x dyn.c
$ OMP_NUM_THREADS=8 ./dyn.x | sort
[1] iter 0, tid 2
[1] iter 1, tid 0
[1] iter 2, tid 7
[1] iter 3, tid 3
[1] iter 4, tid 4
[1] iter 5, tid 1
[1] iter 6, tid 6
[1] iter 7, tid 5
[2] iter 0, tid 0
[2] iter 1, tid 2
[2] iter 2, tid 7
[2] iter 3, tid 3
[2] iter 4, tid 6
[2] iter 5, tid 1
[2] iter 6, tid 5
[2] iter 7, tid 4
(same behaviour is observed when gcc is used instead)
If the sample code from the static section were run with dynamic scheduling instead, there would be only a 1/70 (1.4%) chance that the original locality is preserved and a 69/70 (98.6%) chance that remote access occurs. This fact is often overlooked and hence suboptimal performance is achieved.
There is another reason to choose between static and dynamic scheduling: workload balancing. If the time to complete each iteration differs greatly from the mean, high work imbalance might occur in the static case. Take as an example the case where the time to complete an iteration grows linearly with the iteration number. If the iteration space is divided statically between two threads, the second one will have three times more work than the first, and hence for 2/3 of the compute time the first thread will be idle. Dynamic scheduling introduces some additional overhead but in that particular case leads to a much better workload distribution. A special kind of dynamic scheduling is guided, where smaller and smaller iteration blocks are given to each task as the work progresses.
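As a rough illustration of that imbalance case, here is a small sketch (the work() function and all numbers are mine, purely illustrative) where the cost of iteration i grows with i; swapping the schedule clause between static and dynamic,1 with two threads shows the balancing effect:

#include <cstdio>
#include <omp.h>

// Dummy work whose cost grows roughly linearly with i (purely illustrative).
static double work(int i) {
    double s = 0.0;
    for (long k = 0; k < 1000000L * i; ++k)
        s += 1e-9 * k;
    return s;
}

int main() {
    double sink = 0.0;
    double t0 = omp_get_wtime();
    // Replace "dynamic, 1" with "static" to see the imbalance described above.
    #pragma omp parallel for schedule(dynamic, 1) num_threads(2) reduction(+:sink)
    for (int i = 0; i < 16; i++)
        sink += work(i);
    std::printf("elapsed: %f s (checksum %f)\n", omp_get_wtime() - t0, sink);
    return 0;
}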
Since precompiled code may be run on various platforms, it would be nice if the end user could control the scheduling. That's why OpenMP provides the special schedule(runtime) clause. With runtime scheduling the type is taken from the content of the environment variable OMP_SCHEDULE. This allows testing different scheduling types without recompiling the application and also lets the end user fine-tune for his or her platform.
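A minimal, self-contained sketch of schedule(runtime) in use (nothing here beyond the standard OpenMP API):

#include <cstdio>
#include <omp.h>

int main() {
    // The schedule type and chunk size are read from OMP_SCHEDULE at run time,
    // e.g.:  OMP_SCHEDULE="guided,4" ./a.out   or   OMP_SCHEDULE="dynamic" ./a.out
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < 8; i++)
        std::printf("iter %d, tid %d\n", i, omp_get_thread_num());
    return 0;
}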
I think the misunderstanding comes from missing the point of OpenMP.
In a sentence, OpenMP allows you to execute your program faster by enabling parallelism.
In a program, parallelism can be enabled in many ways, and one of them is by using threads.
Suppose you have an array:
[1,2,3,4,5,6,7,8,9,10]
and you want to increment all elements by 1 in this array.
If you are going to use
#pragma omp for schedule(static, 5)
it means that each thread will be assigned 5 contiguous iterations. In this case the first thread will take 5 numbers, the second one will take another 5, and so on until there is no more data to process or the maximum number of threads is reached (typically equal to the number of cores). The division of the workload is fixed in advance rather than decided while the loop runs.
In case of
#pragma omp for schedule(dynamic, 5)
The work will be shared among threads, but this happens at run time, which involves more overhead. The second parameter specifies the size of the chunk of data.
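A small sketch of the array example above (the printout is just to make the mapping visible; under dynamic scheduling which thread handles which chunk may change from run to run):

#include <cstdio>
#include <omp.h>

int main() {
    int a[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    // With schedule(static, 5) and 2 threads, thread 0 always gets i = 0..4
    // and thread 1 gets i = 5..9; with schedule(dynamic, 5) the two chunks go
    // to whichever threads grab them first.
    #pragma omp parallel for schedule(static, 5) num_threads(2)
    for (int i = 0; i < 10; i++) {
        a[i] += 1;
        std::printf("i = %d incremented by thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}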
Not being very familiar with OpenMP, I would venture that the dynamic type is more appropriate when the compiled code is going to run on a system with a different configuration than the one on which the code was compiled.
I would recommend the page below, which discusses techniques used for parallelising code, along with their preconditions and limitations:
https://computing.llnl.gov/tutorials/parallel_comp/
Additional links:
http://en.wikipedia.org/wiki/OpenMP
Difference between static and dynamic schedule in openMP in C
http://openmp.blogspot.se/
The loop partitioning scheme is different. The static scheduler divides a loop over N elements into M subsets, each containing roughly N/M elements.
The dynamic approach assigns the subsets to threads on the fly, which can be useful when the subsets' computation times vary.
The static approach should be used if the computation times do not vary much.
While practicing recursive functions, I wrote this code for the Fibonacci series and the factorial. The program does not run and shows the error "Control reaches end of non-void function". I suspect this is about the last iteration reaching zero and not knowing what to do with negative integers. I have tried return 0 and return 1, but no good. Any suggestions?
#include <cstdlib>
#include <iomanip>
#include <iostream>
#include <ctime>
using namespace std;
int fib(int n) {
    int x;
    if (n <= 1) {
        cout << "zero reached \n";
        x = 1;
    } else {
        x = fib(n-1) + fib(n-2);
        return x;
    }
}

int factorial(int n) {
    int x;
    if (n == 0) {
        x = 1;
    }
    else {
        x = (n * factorial(n-1));
        return x;
    }
}
Change
else if (n==1)
x=1;
to
else if (n==1)
return 1;
Then fib() should work for all non-negative numbers. If you want to simplify it and have it work for all numbers, go with something like:
int fib(int n) {
if(n<=1) {
cout << "zero reached \n";
return 1;
} else {
return fib(n-1)+fib(n-2);
}
}
"Control reaches end of non-void function"
This is a compile-time warning (which can be treated as an error with appropriate compiler flags). It means that you have declared your function as non-void (in this case, int) and yet there is a path through your function for which there is no return (in this case the if (n <= 1) branch of fib()).
One of the reasons that some programmers prefer to have exactly one return statement per function, at the very last line of the function...
return x;
}
...is that it is easy to see that their functions return appropriately. This can also be achieved by keeping functions very short.
You should also check the logic in your factorial() implementation; you have infinite recursion therein.
Presumably the factorial function should be returning n * factorial(n-1) for n > 0.
x=(factorial(n)*factorial(n-1)) should read x = n * factorial(n-1)
In your second base case (n == 1), you never return x; or 'return 1;'
The else section of your factorial() function starts:
x=(factorial(n)*factorial(n-1));
This leads to infinite recursion. It should be:
x=(n*factorial(n-1));
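Putting both fixes together, a corrected factorial() could look like this (a sketch; it also treats negative input as 0! = 1 here, which may or may not be what you want):

int factorial(int n) {
    if (n <= 0)                    // base case now actually returns a value
        return 1;
    return n * factorial(n - 1);   // recurse on n-1, not on factorial(n)
}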
Sometimes, your compiler is not able to deduce that your function actually has no missing return. In such cases, several solutions exist:
Assume
if (foo == 0) {
return bar;
} else {
return frob;
}
Restructure your code
if (foo == 0) {
return bar;
}
return frob;
This works well if you can interpret the if-statement as a kind of firewall or precondition.
abort()
(see David Rodríguez's comment.)
if (foo == 0) {
return bar;
} else {
return frob;
}
abort(); return -1; // unreachable
Return something else accordingly. The comment tells fellow programmers and yourself why this is there.
throw
#include <stdexcept>
if (foo == 0) {
return bar;
} else {
return frob;
}
throw std::runtime_error ("impossible");
Not a counter measure: Single Function Exit Point
Do not fall back to one-return-per-function a.k.a. single-function-exit-point as a workaround. This is obsolete in C++ because you almost never know where the function will exit:
void foo(int&);
int bar () {
int ret = -1;
foo (ret);
return ret;
}
Looks nice and looks like SFEP, but reverse engineering the 3rd party proprietary libfoo reveals:
void foo (int &) {
if (rand()%2) throw ":P";
}
Also, this can imply an unnecessary performance loss:
Frob bar ()
{
Frob ret;
if (...) ret = ...;
...
if (...) ret = ...;
else if (...) ret = ...;
return ret;
}
because:
class Frob { char c[1024]; }; // copy lots of data upon copy
And every mutable variable increases the cyclomatic complexity of your code. It means more code and more state to test and verify, which in turn means more state the maintainers have to keep in their heads, which in turn means lower-quality maintenance.
Last but not least: some classes have no default constructor, and you would have to write really bogus code, if it is possible at all:
File mogrify() {
File f ("/dev/random"); // need bogus init
...
}
Do not do this.
There is nothing wrong with if-else statements; the C++ code using them looks similar to other languages. To emphasise the expressiveness of C++, one could write factorial, for example, as:
int factorial(int n){return (n > 1) ? n * factorial(n - 1) : 1;}
This illustration, using the "truly" C/C++ conditional operator ?:, and the other suggestions above lack production strength. You would need to take measures against overflowing the type that holds the result (int or unsigned int) and, with recursive solutions, against overflowing the call stack. Clearly, the maximum n for which the factorial fits can be computed in advance and used as protection against "bad inputs"; however, this check could also live at another level of indirection that controls the n passed to factorial. The version above returns 1 for any negative n. Using unsigned int would prevent processing negative inputs, but it would not prevent a conversion situation created by a caller, so measures against negative inputs might be desirable too.
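As an example of such a measure, here is a hedged sketch of a guarded version (the limit 20 is the largest n whose factorial fits in an unsigned 64-bit integer; the function name is mine):

#include <cstdint>
#include <stdexcept>

uint64_t factorial_checked(int n) {
    // 20! = 2 432 902 008 176 640 000 still fits in uint64_t; 21! does not.
    if (n < 0 || n > 20)
        throw std::out_of_range("factorial_checked: n must be in [0, 20]");
    return (n > 1) ? static_cast<uint64_t>(n) * factorial_checked(n - 1) : 1;
}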
While the author is asking what is technically wrong with the recursive function computing the Fibonacci numbers, I would like to note that the recursive function outlined above will solve the task in time that grows exponentially with n. I do not want to discourage creating perfect C++ from it; however, it is known that the task can be computed faster. This is less important for small n. You will need to refresh your matrix multiplication knowledge in order to understand the explanation. Consider the evaluation of powers of the matrix:
power n = 1    | 1 1 |
               | 1 0 |   = M^1

power n = 2    | 1 1 |   | 1 1 |   | 2 1 |
               | 1 0 | * | 1 0 | = | 1 1 |   = M^2

power n = 3    | 2 1 |   | 1 1 |   | 3 2 |
               | 1 1 | * | 1 0 | = | 2 1 |   = M^3

power n = 4    | 3 2 |   | 1 1 |   | 5 3 |
               | 2 1 | * | 1 0 | = | 3 2 |   = M^4
Do you see that the matrix elements of the result resemble the Fibonacci numbers? Continue
power n = 5    | 5 3 |   | 1 1 |   | 8 5 |
               | 3 2 | * | 1 0 | = | 5 3 |   = M^5
Your guess is right (this can be proved by mathematical induction; try it, or just use the result):
power n        | 1 1 |^n           | F(n+1)  F(n)   |
               | 1 0 |    =  M^n = | F(n)    F(n-1) |
When multiplying the matrices, apply at least the so-called "exponentiation by squaring". Just as a reminder:

        M * (M^2)^((n - 1) / 2)   if n is odd  (n % 2 != 0)
M^n =
        (M^2)^(n / 2)             if n is even (n % 2 == 0)
Without this, your implementation will have the time properties of an iterative procedure (which is still better than exponential growth). Add your lovely C++ and it will give a decent result. However, since there is no limit to perfection, this can also be improved. In particular, there is the so-called "fast doubling" method for the Fibonacci numbers. It has the same asymptotic properties but a better constant factor in the running time. When one says O(N) or O(N^2), the actual constant coefficients determine further differences; one O(N) can still be better than another O(N).
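For concreteness, here is a hedged sketch of the matrix approach with exponentiation by squaring (no overflow protection, so it is only meaningful while F(n) fits in 64 bits, i.e. up to n = 93):

#include <cstdint>
#include <iostream>

struct Mat { uint64_t a, b, c, d; };   // 2x2 matrix  | a b |
                                       //             | c d |

static Mat mul(const Mat& x, const Mat& y) {
    return { x.a * y.a + x.b * y.c,  x.a * y.b + x.b * y.d,
             x.c * y.a + x.d * y.c,  x.c * y.b + x.d * y.d };
}

// Computes M^n by squaring, where M = |1 1; 1 0|; F(n) is the off-diagonal entry.
uint64_t fib(unsigned n) {
    Mat result = {1, 0, 0, 1};                       // identity matrix
    Mat base   = {1, 1, 1, 0};                       // M
    while (n > 0) {
        if (n % 2 != 0) result = mul(result, base);  // odd case from above
        base = mul(base, base);                      // square for the even case
        n /= 2;
    }
    return result.b;                                 // M^n = | F(n+1) F(n); F(n) F(n-1) |
}

int main() {
    for (unsigned n = 0; n <= 10; ++n)
        std::cout << "F(" << n << ") = " << fib(n) << '\n';
}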