C++ and MPI how to write part of code as parallel? - c++

I've been writing some code using the PETSc library, and now I want to change part of it to run in parallel. Most of what I want to parallelize is matrix initialization and the parts where I generate and calculate a large number of values. My problem is the following: if I run the code with more than one core, for some reason every part of the code is run as many times as the number of cores I use.
This is just simple sample code where I tested PETSc and MPI:
int main(int argc, char** argv)
{
time_t rawtime;
time ( &rawtime );
string sta = ctime (&rawtime);
cout << "Solving began..." << endl;
PetscInitialize(&argc, &argv, 0, 0);
Mat A; /* linear system matrix */
PetscInt i,j,Ii,J,Istart,Iend,m = 120000,n = 3,its;
PetscErrorCode ierr;
PetscBool flg = PETSC_FALSE;
PetscScalar v;
#if defined(PETSC_USE_LOG)
PetscLogStage stage;
#endif
/* - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Compute the matrix and right-hand-side vector that define
the linear system, Ax = b.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - */
/*
Create parallel matrix, specifying only its global dimensions.
When using MatCreate(), the matrix format can be specified at
runtime. Also, the parallel partitioning of the matrix is
determined by PETSc at runtime.
Performance tuning note: For problems of substantial size,
preallocation of matrix memory is crucial for attaining good
performance. See the matrix chapter of the users manual for details.
*/
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,m,n);CHKERRQ(ierr);
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
ierr = MatMPIAIJSetPreallocation(A,5,PETSC_NULL,5,PETSC_NULL);CHKERRQ(ierr);
ierr = MatSeqAIJSetPreallocation(A,5,PETSC_NULL);CHKERRQ(ierr);
ierr = MatSetUp(A);CHKERRQ(ierr);
/*
Currently, all PETSc parallel matrix formats are partitioned by
contiguous chunks of rows across the processors. Determine which
rows of the matrix are locally owned.
*/
ierr = MatGetOwnershipRange(A,&Istart,&Iend);CHKERRQ(ierr);
/*
Set matrix elements for the 2-D, five-point stencil in parallel.
- Each processor needs to insert only elements that it owns
locally (but any non-local elements will be sent to the
appropriate processor during matrix assembly).
- Always specify global rows and columns of matrix entries.
Note: this uses the less common natural ordering that orders first
all the unknowns for x = h then for x = 2h etc; Hence you see J = Ii +- n
instead of J = I +- m as you might expect. The more standard ordering
would first do all variables for y = h, then y = 2h etc.
*/
PetscMPIInt rank; // processor rank
PetscMPIInt size; // size of communicator
MPI_Comm_rank(PETSC_COMM_WORLD,&rank);
MPI_Comm_size(PETSC_COMM_WORLD,&size);
cout << "Rank = " << rank << endl;
cout << "Size = " << size << endl;
cout << "Generating 2D-Array" << endl;
double temp2D[120000][3];
for (Ii=Istart; Ii<Iend; Ii++) {
for(J=0; J<n;J++){
temp2D[Ii][J] = 1;
}
}
cout << "Processor " << rank << " set values : " << Istart << " - " << Iend << " into 2D-Array" << endl;
v = -1.0;
for (Ii=Istart; Ii<Iend; Ii++) {
for(J=0; J<n;J++){
ierr = MatSetValues(A,1,&Ii,1,&J,&v,INSERT_VALUES);CHKERRQ(ierr);
}
}
cout << "Ii = " << Ii << " processor " << rank << " and it owns: " << Istart << " - " << Iend << endl;
/*
Assemble matrix, using the 2-step process:
MatAssemblyBegin(), MatAssemblyEnd()
Computations can be done while messages are in transition
by placing code between these two statements.
*/
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
PetscFinalize(); /* also finalizes MPI, because PetscInitialize() started it */
cout << "No more MPI" << endl;
return 0;
}
My real program has a couple of different .cpp files. I initialize MPI in the main program, which calls a function in another .cpp file where I implemented the same kind of matrix filling, but all the cout output that the program produces before filling the matrices is printed as many times as the number of cores.
I can run my test program as mpiexec -n 4 test and it runs successfully, but for some reason I have to run my real program as mpiexec -n 4 ./myprog
The output of my test program is as follows:
Solving began...
Solving began...
Solving began...
Solving began...
Rank = 0
Size = 4
Generating 2D-Array
Processor 0 set values : 0 - 30000 into 2D-Array
Rank = 2
Size = 4
Generating 2D-Array
Processor 2 set values : 60000 - 90000 into 2D-Array
Rank = 3
Size = 4
Generating 2D-Array
Processor 3 set values : 90000 - 120000 into 2D-Array
Rank = 1
Size = 4
Generating 2D-Array
Processor 1 set values : 30000 - 60000 into 2D-Array
Ii = 30000 processor 0 and it owns: 0 - 30000
Ii = 90000 processor 2 and it owns: 60000 - 90000
Ii = 120000 processor 3 and it owns: 90000 - 120000
Ii = 60000 processor 1 and it owns: 30000 - 60000
no more MPI
no more MPI
no more MPI
no more MPI
Edit after two comments:
So my goal is to run this on a small cluster which has 20 nodes, each with 2 cores. Later on this should run on a supercomputer, so MPI is definitely the way I need to go. I'm currently testing this on two different machines: one has 1 processor / 4 cores and the other has 4 processors / 16 cores.

MPI is an implementation of the SPMD/MPMD model (single program multiple data / multiple programs multiple data). An MPI job consists of concurrently running processes that exchange messages with each other in order to cooperate on solving a problem. You cannot run only part of the code in parallel. You can only have parts of the code that do not communicate with each other but still execute concurrently. And you ought to use mpirun or mpiexec to start your application in parallel mode.
If you'd like to make only parts of your code parallel and can live with the limitation that the code runs on a single machine only, then what you need is OpenMP, not MPI. Alternatively, you can use low-level POSIX threads programming since, according to the PETSc web site, it supports pthreads. OpenMP is commonly implemented on top of pthreads, so using PETSc with OpenMP might be possible.

To add to Hristo's answer, MPI is built to run in a distributed fashion, i.e. completely separate processes. They have to be separate, because they are supposed to be on different physical machines. You can run multiple MPI processes on one machine, for example one per core. That's perfectly OK, but MPI does not have any tools to take advantage of that shared memory context. In other words, you cannot have some MPI ranks (processes) do work on a matrix that is owned by another MPI process because you have no way to share the matrix.
When you start x MPI processes you get x copies of the same exact program running. You need code like
if (rank == 0)
do something
else
do something else
to have the different processes do different things. The processes can communicate with each other by sending messages, but they all run the same exact binary.
If you don't have the code diverge, then you'll just get x copies of the same program giving the same result x times.
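As a rough illustration of that divergence, the sketch below shows how each process can derive its own share of the rows from nothing but its rank. It is plain C++ with no MPI calls; `owned_range` and `is_root` are hypothetical helper names, and in a real program `rank` and `size` would come from MPI_Comm_rank and MPI_Comm_size. Handing leftover rows to the lowest ranks is an assumption made here for illustration, not a statement about how PETSc partitions a matrix.

```cpp
#include <cassert>
#include <utility>

// Hypothetical helper: half-open range [first, last) of rows owned by `rank`
// when `n` rows are split into contiguous blocks over `size` processes.
// Assumption: leftover rows (n % size) go to the lowest ranks, one each.
std::pair<long, long> owned_range(int rank, int size, long n) {
    assert(size > 0 && rank >= 0 && rank < size);
    long base = n / size;
    long extra = n % size;
    long first = rank * base + (rank < extra ? rank : extra);
    long count = base + (rank < extra ? 1 : 0);
    return {first, first + count};
}

// Hypothetical helper: lets exactly one process print progress messages.
bool is_root(int rank) { return rank == 0; }
```

Every process runs the same loop over its own `[first, last)` range, and a guard such as `if (is_root(rank))` keeps a message from appearing once per process. In the question's program, "Solving began..." is printed before PetscInitialize, where the rank is not yet known, so that line would have to move below the rank query before it could be guarded.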

Related

Strange behaviour in multithread analysis

For a university project we are implementing an algorithm capable of brute-forcing an AES key that we assume is partially known.
We have implemented several versions including one that exploits the multithreading mechanism in C++.
The implementation is done by allocating a variable number of threads, to be passed as input at launch, and dividing the key space equally for each thread that will cycle through the respective range attempting each key. De facto the implementation works, as it succeeds in finding the key for any combination #bitsToHack/#threads but returns strange timing results.
//Structs for threads and respective data
pthread_t threads[num_of_threads];
struct bf_data td [num_of_threads];
int rc;
//Space division
uintmax_t index = pow (BASE_NUMBER, num_bits_to_hack);
uintmax_t step = index/num_of_threads;
if(sem_init(&s, 1, 0)!=0){
printf("Error during semaphore initialization\n");
return -1;
}
for(int i = 0; i < num_of_threads; i++){
//Structure initialization
td[i].ciphertext = ciphertext;
td[i].hacked_key = hacked_key;
td[i].iv_aes = iv_aes;
td[i].key = key_aes;
td[i].num_bits_to_hack = num_bits_to_hack;
td[i].plaintext = plaintext;
td[i].starting_point = step*i;
td[i].step = step;
td[i].num_of_threads = num_of_threads;
if(DEBUG)
printf("Starting point for thread %d is: %lu, using step: %lu\n", i , td[i].starting_point, td[i].step);
rc = pthread_create(&threads[i], NULL, decryption_brute_force, (void*)&td[i]);
if (rc){
cout << "Error:unable to create thread," << rc << endl;
exit(-1);
}
}
sem_wait(&s);
for(int i = 0; i < num_of_threads; i++){
pthread_join(threads[i], NULL);
}
For the decryption_brute_force function (The body of each thread):
void* decryption_brute_force(void* data){
** Copy data on local thread memory
** Build the key to begin the search from starting point
** for each key from starting_point to starting_point + step
** Try decryption
** if obtained plaintext corresponds to the expected one
** Print results, wake up main thread and terminate
** else
** increment the key and continue
}
To conclude the project we intended to conduct a study of the optimal number of threads expecting an increase in performance as the number of threads increased up to a threshold, after which the system would no longer benefit from the increase in threads assigned to it.
At the end of the analysis (a simulation lasting about 9 hours), we obtained the results shown in the figure.
[Figure: plot of the timing results]
We cannot understand why 8 threads perform better than 16. Could it be due to the CPU architecture? Could it schedule 32 and 8 threads better than 16?
From the comments, I think the linear search pattern in each thread could yield different results for different numbers of threads. When you double the number of threads, the position of the key within its thread's range may shift further from the start of that range. But once you double again, it cannot move much further because each thread's range is so small. You said you always use the same encrypted data. Did you try different inputs?
Note that step is an integer (index/num_of_threads), so the distribution may not be exact.
8 threads & step=7 (56 work total), key at index 16 (0-based):
01234567 89abcdef 01234567 89abcdef ...
                  ^
500 seconds, as it is found in the first loop iteration
16 threads & step=3 (56 work total), index 16 again, but now at the second iteration:
012 345 678 9ab cde f01 234 567 8...
                     ^
1000 seconds, as it is found only in the second iteration of the thread
Another example with 2 threads and 3 threads,
with x to be found at the 51st element of a 100-element workload:
2 threads
| |x(1st iteration) |
3 threads
| |........x | |
5x slower than 2 threads
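To experiment with this effect without running the full brute force, a small sketch like the following may help. It only counts how many keys the thread that owns the target has to try before reaching it; `iterations_until_found` is a hypothetical helper, and it assumes every thread advances one key per iteration at the same speed, which real hardware will not guarantee.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of keys tried by the owning thread before it
// hits `target`, when `total` keys are split into `threads` chunks of size
// total / threads (integer division, as in the question's code).
// Assumption: the target lies inside the covered part of the key space.
std::uint64_t iterations_until_found(std::uint64_t target,
                                     std::uint64_t total,
                                     std::uint64_t threads) {
    assert(threads > 0 && threads <= total);
    std::uint64_t step = total / threads;
    assert(target < step * threads);
    return target % step + 1;  // offset inside the chunk, counted from 1
}
```

With 100 keys and the target at 0-based index 50, two threads need 1 iteration while three threads need 18. Note, however, that when the key space and the thread counts are all powers of two, `target % step` can only stay the same or shrink each time the thread count doubles, so in that setting this effect alone would not make 16 threads slower than 8; other factors, such as having more threads than cores, may be worth checking too.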

Need your input Project Euler Q 8

Is there a better way of doing this?
http://projecteuler.net/problem=8
I added a condition to check if the number is >6 (Eliminates small products and 0's)
#include <iostream>
#include <math.h>
#include "bada.h"
using namespace std;
int main()
{
int badanum[] { DATA };
int pro=0,highest=0;
for(int i=0;i<=996;++i)
{
if (badanum[i]>6 and badanum[i+1] > 6 and badanum[i+2] >6 and badanum[i+3]>6 and badanum[i+4]>6)
{
pro=badanum[i]*badanum[i+1]*badanum[i+2]*badanum[i+3]*badanum[i+4];
if(pro>highest)
{
cout << pro << " " << badanum[i] << badanum[i+1] << badanum[i+2] << badanum[i+3] << badanum[i+4] << endl;
highest = pro;
}
pro = 0;
}
}
}
bada.h is just a file containing the 1000 digit number.
#define DATA <1000 digit number>
That if actually slows things down:
it causes branching in the CPU's execution pipeline,
and, as mentioned before, it invalidates the result.
It does not matter that your solution happens to match the expected answer (for other digits it might not).
On the algorithmic side you can do the following:
if you have fast enough division, you can lower the number of computations
char a[]="7316717653133062491922511967442657474235534919493496983520312774506326239578318016984801869478851843858615607891129494954595017379583319528532088055111254069874715852386305071569329096329522744304355766896648950445244523161731856403098711121722383113622298934233803081353362766142828064444866452387493035890729629049156044077239071381051585930796086670172427121883998797908792274921901699720888093776657273330010533678812202354218097512545405947522435258490771167055601360483958644670632441572215539753697817977846174064955149290862569321978468622482839722413756570560574902614079729686524145351004748216637048440319989000889524345065854122758866688116427171479924442928230863465674813919123162824586178664583591245665294765456828489128831426076900422421902267105562632111110937054421750694165896040807198403850962455444362981230987879927244284909188845801561660979191338754992005240636899125607176060588611646710940507754100225698315520005593572972571636269561882670428252483600823257530420752963450\0";
int i=0,s=0,m=1,q;
for (i=0;i<4;i++)
{
q=a[i ]-'0'; if (q) m*=q;
}
for (i=0;i<996;i++)
{
q=a[i+4]-'0'; if (q) m*=q;
if (s<m) s=m;
q=a[i ]-'0'; if (q) m/=q;
}
You can also use lookup tables for the mul/div operations for speed (but that is not faster in all cases):
int mul_5digits[9*9*9*9*9+1][10]={ 0*0,0*1,0*2, ... ,9*9*9*9*9/9 };
int div_5digits[9*9*9*9*9+1][10]={ 0/0,0/1,0/2, ... ,9*9*9*9*9/9 };
// so a=b*c; is rewritten by a=mul_5digits[b][c];
// so a=b/c; is rewritten by a=div_5digits[b][c];
Of course, in place of the 0*0 entries you have to store the neutral value 1!
Likewise, in place of the i/0 entries you have to store the neutral value i!
int i=0,s=0,t=1;
for (i=0;i<4;i++)
{
t=mul_5digits[t][a[i ]-'0'];
}
for (i=0;i<996;i++)
{
t=mul_5digits[t][a[i+4]-'0'];
if (s<t) s=t;
t=div_5digits[t][a[i ]-'0'];
}
Run-time measurements on an AMD 3.2GHz, 64-bit Win7, 32-bit app, BDS2006 C++:
0.022ms classic approach
0.013ms single mul,div per step (produces false output if no product > 0 is present)
0.054ms tabled single mul,div per step (slower on my setup)
PS.
All code improvements should be measured so you can see whether you actually sped things up or not,
because what is faster on one compiler/platform/computer can be slower on another.
Use at least 0.1 ms resolution.
I prefer to use RDTSC or PerformanceCounter for that.
Apart from the errors pointed out in the comments, that many multiplications aren't necessary. If you start with the product [0] * [1] * [2] * [3] * [4] for index 0, what is the product starting at [1]? The old result divided by [0] and multiplied by [5]. One division and one multiplication could be faster than 4 multiplications.
You don't need to store all the digits at once, just the current five of them (use an array with cyclic overwriting), plus one variable for the best result so far and one for the latest multiplication result (see below). If the number of digits in the input grows, you won't run into memory trouble.
You could also check whether the oldest read digit equals zero. If it does, then you really have to multiply all five current digits again; if not, it is better to divide the previous multiplication result by the oldest digit and multiply it by the newest digit.
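One possible way to combine the divide-and-multiply idea with correct handling of zeros is sketched below. Instead of recomputing the window when a zero leaves it, this variant keeps the product of the non-zero digits and a separate count of zeros; that is a swapped-in technique, not what either answer literally proposes, and `max_adjacent_product` is a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Largest product of `k` adjacent digits in `digits` (characters '0'..'9').
// Keeps the product of the non-zero digits in the window plus a count of
// zeros, so each step costs at most one multiplication and one division.
// Returns 0 if the string is shorter than k or every window contains a zero.
long long max_adjacent_product(const std::string& digits, std::size_t k) {
    assert(k > 0);
    if (digits.size() < k) return 0;
    long long product = 1;  // product of the non-zero digits in the window
    std::size_t zeros = 0;  // number of zero digits in the window
    long long best = 0;
    for (std::size_t i = 0; i < digits.size(); ++i) {
        int in = digits[i] - '0';
        if (in == 0) ++zeros; else product *= in;
        if (i >= k) {  // drop the digit that just left the window
            int out = digits[i - k] - '0';
            if (out == 0) --zeros; else product /= out;
        }
        // A window only counts once it is full and contains no zero.
        if (i + 1 >= k && zeros == 0 && product > best) best = product;
    }
    return best;
}
```

For the problem in the question the call would be `max_adjacent_product(number, 5)` with the 1000-digit string. As the measurements in the other answer suggest, whether this is actually faster than the straightforward four multiplications should be timed on the target machine.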

How would I implement a FCFS processor scheduling simulator?

I have a vector of structs, with the structs looking like this:
struct myData{
int ID;
int arrivalTime;
int burstTime;
};
After populating my vector with this data:
1 5 16
4 7 12
3 12 4
2 7 8
where each row is an individual struct's ID (arbitrary, doesn't denote order of arrival), arrivalTime and burstTime, how would I use "for" or "while" loops to step through my vector's indices and calculate the data in a way that I could print something like this out?
Time 0 Processor is Idle
Time 5 Process 1 starts running
Time 21 Process 2 is running
Time 29 Process 4 is running
Time 41 Process 3 is running
The way I thought I could do it was to have an integer keep track of the current time (the current time being the sum of the burst times of processes that have already run), but I can't seem to figure out an algorithm that accounts for idle time (when the processor is not doing anything and a new task hasn't arrived yet) as well as keeping track of the other numbers. For simplicity's sake I decided that when two processes arrive at the same time I would process the one with the lower ID number. I know I didn't put much code here to demonstrate what I'm trying to do, but I hope I've explained it fairly clearly. I'm looking for a pseudo-code algorithm solution to this problem, but I wouldn't say no to something that has been coded (in C++?).
As an additional note, in case I wasn't able to convey how I access my data clearly, this:
cout << structVector[0].ID << "\n";
cout << structVector[0].arrivalTime << "\n";
cout << structVector[0].burstTime << "\n";
would print out
1
5
16
Any help in pseudo-code or actual code would be GREATLY appreciated!!! After reading this post over a few times I realize I've been pretty generic with the question, but I would love some help just understanding how to calculate this data.
First, sort the vector by arrival time (breaking ties by ID).
Then the following code will accomplish what you are looking for.
size_t i = 0;
int time = 0;
while (i < vec.size())
{
if (vec[i].arrivalTime > time)
{
// nothing has arrived yet: report the idle gap and jump to the next arrival
cout << "Time " << time << " Processor is Idle" << endl;
time = vec[i].arrivalTime;
}
cout << "Time " << time << " Process " << vec[i].ID << " is running" << endl;
time += vec[i].burstTime;
i++;
}
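For completeness, here is a self-contained sketch of the whole procedure, including the sort step with the lower-ID tie-break that the question asks for. The function name `fcfs` and the choice to return the log lines as strings (rather than printing them) are assumptions made so the result is easy to check; the message wording follows the answer, so the first process is reported as "is running" rather than "starts running".

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

struct myData {
    int ID;
    int arrivalTime;
    int burstTime;
};

// Hypothetical helper: simulates first-come-first-served scheduling and
// returns one log line per event. Takes the vector by value so the caller's
// order is left untouched.
std::vector<std::string> fcfs(std::vector<myData> procs) {
    // Order by arrival time; equal arrivals are served lowest ID first.
    std::sort(procs.begin(), procs.end(),
              [](const myData& a, const myData& b) {
                  if (a.arrivalTime != b.arrivalTime)
                      return a.arrivalTime < b.arrivalTime;
                  return a.ID < b.ID;
              });
    std::vector<std::string> log;
    int time = 0;
    for (const myData& p : procs) {
        if (p.arrivalTime > time) {  // idle gap before the next arrival
            log.push_back("Time " + std::to_string(time) + " Processor is Idle");
            time = p.arrivalTime;
        }
        log.push_back("Time " + std::to_string(time) + " Process " +
                      std::to_string(p.ID) + " is running");
        time += p.burstTime;  // the process runs to completion
    }
    return log;
}
```

Printing the result is then a simple loop over the returned lines, for example `for (const auto& line : fcfs(structVector)) cout << line << "\n";`.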

Filter strange C++ multimap values

I have this multimap in my code:
multimap<long, Note> noteList;
// notes are added with this method. measureNumber is minimum `1` and doesn't go very high
void Track::addNote(Note &note) {
long key = note.measureNumber * 1000000 + note.startTime;
this->noteList.insert(make_pair(key, note));
}
I'm encountering problems when I try to read the notes from the last measure. In this case the song has only 8 measures and it's measure number 8 that causes problems. If I go up to 16 measures it's measure 16 that causes the problem and so on.
// (when adding notes I use as key the measureNumber * 1000000. This searches for notes within the same measure)
for(noteIT = trackIT->noteList.lower_bound(this->curMsr * 1000000); noteIT->first < (this->curMsr + 1) * 1000000; noteIT++){
if(this->curMsr == 8){
cout << "_______________________________________________________" << endl;
cout << "ID:" << noteIT->first << endl;
noteIT->second.toString();
int blah = 0;
}
// code left out here that processes the notes
}
I have only added one note to the 8th measure and yet this is the result I'm getting in console:
_______________________________________________________
ID:8000001
note toString()
Duration: 8
Start Time: 1
Frequency: 880
_______________________________________________________
ID:1
note toString()
Duration: 112103488
Start Time: 44
Frequency: 0
_______________________________________________________
ID:8000001
note toString()
Duration: 8
Start Time: 1
Frequency: 880
_______________________________________________________
ID:1
note toString()
Duration: 112103488
Start Time: 44
Frequency: 0
This keeps repeating. The first result is a correct note which I added myself, but I have no idea where the note with ID: 1 is coming from.
Any ideas how to avoid this? The loop gets stuck repeating the same two results and I can't get out of it. Even if there are several notes within measure 8 (that is, several values in the multimap whose keys start with 8xxxxxx), it only repeats the first note and the non-existent one.
You aren't checking for the end of your loop correctly. Specifically there is no guarantee that noteIT does not equal trackIT->noteList.end(). Try this instead
for (noteIT = trackIT->noteList.lower_bound(this->curMsr * 1000000);
noteIT != trackIT->noteList.end() &&
noteIT->first < (this->curMsr + 1) * 1000000;
++noteIT)
{
By the look of it, it might be better to use a call to upper_bound as the limit of your loop. That would handle the end case automatically.
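A minimal sketch of that bounded-range idea follows. It uses `lower_bound` on the first key of the next measure as the end iterator, which yields the same position as `upper_bound` on the last possible key of the current measure. The value type `int` stands in for the question's Note class, and `notes_in_measure` is a hypothetical helper name.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Hypothetical helper: collects the values whose keys lie in
// [measure * 1000000, (measure + 1) * 1000000).
// Assumption: int stands in for the question's Note type.
std::vector<int> notes_in_measure(const std::multimap<long, int>& noteList,
                                  long measure) {
    auto first = noteList.lower_bound(measure * 1000000L);
    // First key of the next measure; equals end() if no such key exists,
    // so the loop can never run past the end of the container.
    auto last = noteList.lower_bound((measure + 1) * 1000000L);
    std::vector<int> out;
    for (auto it = first; it != last; ++it) out.push_back(it->second);
    return out;
}
```

Because both iterators come from the same container and `last` is never before `first`, the loop needs only the single `it != last` test and never dereferences `end()`.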

Timing a function in microseconds

Hey guys, I'm trying to time some search functions I wrote, in microseconds, and the run needs to take long enough to show 2 significant digits. I wrote this code to time my search function but it seems to go too fast. I always end up getting 0 microseconds unless I run the search 5 times, and then I get 1,000,000 microseconds. I'm wondering if I did my math wrong to get the time in microseconds, or if there's some kind of formatting function I can use to force it to display two sig figs?
clock_t start = clock();
index = sequentialSearch.Sequential(TO_SEARCH);
index = sequentialSearch.Sequential(TO_SEARCH);
clock_t stop = clock();
cout << "number found at index " << index << endl;
int time = (stop - start)/CLOCKS_PER_SEC;
time = time * SEC_TO_MICRO;
cout << "time to search = " << time<< endl;
You are using integer division on this line:
int time = (stop - start)/CLOCKS_PER_SEC;
I suggest using a double or float type, and you'll likely need to cast the components of the division.
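As a sketch of that fix, the first function below does the tick-to-microsecond conversion in floating point so that sub-second intervals are not truncated to zero. The second function is an alternative that was not suggested in the answers: it uses std::chrono::steady_clock and averages over many repetitions, which helps when a single call is too fast to measure. Both `ticks_to_micros` and `average_micros` are hypothetical helper names.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>

// Converts a clock() tick difference to microseconds without integer
// truncation: the division happens in double precision.
double ticks_to_micros(std::clock_t ticks) {
    return static_cast<double>(ticks) * 1e6 / CLOCKS_PER_SEC;
}

// Hypothetical helper: average wall-clock time per call of `f`, in
// microseconds, measured over `repeats` back-to-back calls.
template <typename F>
double average_micros(F f, int repeats) {
    assert(repeats > 0);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < repeats; ++i) f();
    auto stop = std::chrono::steady_clock::now();
    std::chrono::duration<double, std::micro> elapsed = stop - start;
    return elapsed.count() / repeats;
}
```

With the question's code this would be `ticks_to_micros(stop - start)`, or `average_micros([&] { index = sequentialSearch.Sequential(TO_SEARCH); }, 1000)`, where the repetition count of 1000 is an arbitrary choice. Printing the result as a double also removes the need for a separate formatting trick to see fractional values.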
Use QueryPerformanceCounter and QueryPerformanceFrequency, assuming you're on a Windows platform.
Here is a link to the MS KB article "How To Use QueryPerformanceCounter to Time Code".