This piece of code was working fine before with MPI:
#include <mpi.h>
#include <iostream>

using namespace std;

int id, p;

int main(int argc, char* argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    cout << "Processor " << id << " of " << p << endl;
    cout.flush();
    MPI_Barrier(MPI_COMM_WORLD);
    if (id == 0) cout << "Every process has got to this point now!" << endl;
    MPI_Finalize();
}
Giving the output:
Processor 0 of 4
Processor 1 of 4
Processor 2 of 4
Processor 3 of 4
Every process has got to this point now!
when run on 4 cores with the command mpiexec -n 4 <executable filename>.
I restarted my laptop (I'm not sure if this is the cause) and ran the same code, and now it outputs on one core:
Processor 0 of 1
Every process has got to this point now!
Processor 0 of 1
Every process has got to this point now!
Processor 0 of 1
Every process has got to this point now!
Processor 0 of 1
Every process has got to this point now!
I'm using Microsoft MPI, and the project configuration hasn't changed.
I'm not really sure what to do about this.
I also installed Intel Parallel Studio and integrated it with Visual Studio before restarting,
but I'm still compiling with Visual C++ (same configuration as when it was working fine).
The easy fix was to uninstall Intel Parallel Studio.
I am new to Eigen and am writing some simple code to test its performance. I am using a MacBook Pro with an M1 Pro chip (I do not know whether the ARM architecture causes the problem). The code is a simple Laplace equation solver:
#include <iostream>
#include "mpi.h"
#include "Eigen/Dense"
#include <chrono>

using namespace Eigen;
using namespace std;

const size_t num = 1000UL;

MatrixXd initialize() {
    MatrixXd u = MatrixXd::Zero(num, num);
    u(seq(1, fix<num-2>), seq(1, fix<num-2>)).setConstant(10);
    return u;
}

void laplace(MatrixXd& u) {
    setNbThreads(8);
    MatrixXd u_old = u;
    u(seq(1, last-1), seq(1, last-1)) =
        (( u_old(seq(0, last-2, fix<1>), seq(1, last-1, fix<1>)) + u_old(seq(2, last, fix<1>), seq(1, last-1, fix<1>)) +
           u_old(seq(1, last-1, fix<1>), seq(0, last-2, fix<1>)) + u_old(seq(1, last-1, fix<1>), seq(2, last, fix<1>)) ) * 4.0 +
           u_old(seq(0, last-2, fix<1>), seq(0, last-2, fix<1>)) + u_old(seq(0, last-2, fix<1>), seq(2, last, fix<1>)) +
           u_old(seq(2, last, fix<1>), seq(0, last-2, fix<1>)) + u_old(seq(2, last, fix<1>), seq(2, last, fix<1>)) ) / 20.0;
}

int main(int argc, const char* argv[]) {
    initParallel();
    setNbThreads(0);  // 0 lets Eigen pick the number of threads automatically
    cout << nbThreads() << endl;
    MatrixXd u = initialize();
    auto start = std::chrono::high_resolution_clock::now();
    for (auto i = 0UL; i < 100; i++) {
        laplace(u);
    }
    auto stop = std::chrono::high_resolution_clock::now();
    auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
    // cout << u(seq(0, fix<10>), seq(0, fix<10>)) << endl;
    cout << "Execution time (ms): " << duration.count() << endl;
    return 0;
}
Compile with gcc and enable OpenMPI
james@MBP14 tests % g++-11 -fopenmp -O3 -I/usr/local/include -I/opt/homebrew/Cellar/open-mpi/4.1.3/include -o test4 test.cpp
Direct run the binary file
james@MBP14 tests % ./test4
8
Execution time (ms): 273
Run with mpirun and specify 8 threads
james@MBP14 tests % mpirun -np 8 test4
8
8
8
8
8
8
8
8
Execution time (ms): 348
Execution time (ms): 347
Execution time (ms): 353
Execution time (ms): 356
Execution time (ms): 350
Execution time (ms): 353
Execution time (ms): 357
Execution time (ms): 355
So obviously the matrix operation is not running in parallel; instead, every thread is running the same copy of the code.
What should be done to solve this problem? Do I have some misunderstanding about using OpenMPI?
You are confusing OpenMPI with OpenMP.
The gcc flag -fopenmp enables OpenMP. It is one way to parallelize an application, using special #pragma omp statements in the code. The parallelization happens on a single CPU (or, to be precise, a single compute node, in case the compute node has multiple CPUs). This allows the application to employ all cores of that CPU, but OpenMP cannot be used to parallelize an application over multiple compute nodes.
On the other hand, MPI (of which OpenMPI is one particular implementation) can be used to parallelize code over multiple compute nodes (i.e., roughly speaking, over multiple computers that are connected). It can also be used to parallelize code over multiple cores on a single computer. So MPI is more general, but also much more difficult to use.
To use MPI, you need to call "special" functions and do the hard work of distributing data yourself. If you do not do this, calling an application with mpirun simply creates several identical processes (not threads!) that perform exactly the same computation. You have not parallelized your application, you just executed it 8 times.
There are no compiler flags that enable MPI. MPI is not built into any compiler. Rather, MPI is a standard and OpenMPI is one specific library that implements that standard. You should read a tutorial or book about MPI and OpenMPI (google turned up this one, for example).
Note: MPI libraries such as OpenMPI usually ship with executables/scripts (e.g. mpicc) that behave like compilers, but they are just thin wrappers around compilers such as gcc. These wrappers automatically tell the actual compiler the include directories and the libraries to link with. But again, the compilers themselves do not know anything about MPI.
I want to run an OpenCL application under Windows 10 using my GTX 970 graphics card, but the following code doesn't work =(
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <CL/cl.h>
#include <vector>
#include <fstream>
#include <iostream>
#include <iomanip>

int main() {
    std::vector<cl::Platform> platforms;
    std::vector<cl::Device> devices;
    try
    {
        cl::Platform::get(&platforms);
        std::cout << platforms.size() << std::endl;
        for (cl_uint i = 0; i < platforms.size(); ++i)
        {
            platforms[i].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        }
        std::cout << devices.size() << std::endl;
    }
    catch (const cl::Error& e) {
        std::cout << std::endl << e.what() << " : " << e.err() << std::endl;
    }
    return 0;
}
It gives me error code -1. I am using Visual Studio 2015 Community Edition with the NVIDIA CUDA SDK v8.0 installed and paths configured, so the compiler and linker know about the SDK.
Can someone please explain what's wrong with this snippet?
Thanks in advance!
EDIT: Can someone also explain why, when I try to debug this code, it fails when getting the platform id, but when I do not debug it, it prints that I have 2 platforms (my GPU card and the integrated GPU)?
Probably your iGPU is Intel (I assume you paired the GTX 970 with an Intel CPU for gaming), which also has some experimental OpenCL 2.1 platform support that can give an error to an OpenCL 1.2 app during device or platform picking (I had a similar problem).
You should check the error codes returned by OpenCL API calls; they give better info about what happened.
For example, my system has two platforms for Intel: one is experimental 2.1 for CPU only, and one is normal 1.2 for both GPU and CPU.
To check for that, query the platform version string and examine the characters at indices 7 and 9 (where indexing starts at 0, of course): '1' and '2' mean 1.2, '2' and '0' mean 2.0. This eliminates the experimental 2.1 platform, which has '2' at index 7 and '1' at index 9.
https://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetPlatformInfo.html
CL_PLATFORM_VERSION
(Compare them as character values, e.g. '1' and '2', not as the integers 1 and 2.)
Nvidia should already have 1.2 support.
If I'm right, you may query for CPU devices and get 2 from Intel and 1 from Nvidia (if it has any) in return.
We are writing code to solve a nonlinear problem using an iterative method (Newton). The problem is that we don't know a priori how many MPI processes will be needed from one iteration to the next, due to e.g. remeshing, adaptivity, etc. And there are quite a lot of iterations...
We would hence like to use MPI_Comm_spawn at each iteration to create as many MPI processes as we need, gather the results, and "destroy" the subprocesses. We know this limits the scalability of the code due to the gathering of information; however, we have been asked to do it :)
I did a couple of tests of MPI_Comm_spawn on my laptop (Windows 7/64-bit) using Intel MPI and Visual Studio Express 2013. I tried these simple codes:
//StackMain
#include <iostream>
#include <vector>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int ierr = MPI_Init(&argc, &argv);
    for (int i = 0; i < 10000; i++)
    {
        std::cout << "Loop number " << i << std::endl;
        MPI_Comm children;
        std::vector<int> err(4);  // one error code per spawned process
        ierr = MPI_Comm_spawn("StackWorkers.exe", MPI_ARGV_NULL, 4, MPI_INFO_NULL,
                              0, MPI_COMM_WORLD, &children, &err[0]);
        MPI_Barrier(children);
        MPI_Comm_disconnect(&children);
    }
    ierr = MPI_Finalize();
    return 0;
}
And the program launched by the spawned processes:
//StackWorkers
#include <mpi.h>

int main(int argc, char *argv[])
{
    int ierr = MPI_Init(&argc, &argv);
    MPI_Comm parent;
    ierr = MPI_Comm_get_parent(&parent);
    MPI_Barrier(parent);
    ierr = MPI_Finalize();
    return 0;
}
The program is launched using one MPI process:
mpiexec -np 1 StackMain.exe
It seems to work; I do, however, have some questions...
1- The program freezes during iteration 4096, and this number does not change if I relaunch the program. If during each iteration I spawn 4 processes twice, then it stops at iteration 2048...
Is it a limitation of the operating system?
2- When I look at the memory occupied by "mpiexec" during the program, it grows continuously (it never goes down). Do you know why? I thought that when the subprocesses finished their job, they would release the memory they used...
3- Should I disconnect/free the children communicator or not? If yes, must MPI_Comm_disconnect(...) be called on both the parent and the spawned processes, or only on the spawned ones?
Thanks a lot!
I have a C++ Date/Time library that I have literally used for decades. It's been rock solid without any issues. But today, as I was making some small enhancements, my test code started complaining violently. The following program demonstrates the problem:
#include <iostream>
#include <time.h>

int main(void) {
    _tzset();
    std::cout << "_tzname[ 0 ]=" << _tzname[ 0 ] << std::endl;
    std::cout << "_tzname[ 1 ]=" << _tzname[ 1 ] << std::endl;
    std::cout << "_timezone=" << _timezone << std::endl;
    size_t ret;
    char buf[ 64 ];
    _get_tzname(&ret, buf, 64, 0);
    std::cout << "_get_tzname[ 0 ]=" << buf << std::endl;
    _get_tzname(&ret, buf, 64, 1);
    std::cout << "_get_tzname[ 1 ]=" << buf << std::endl;
    return 0;
}
If I run this in the Visual Studio Debugger I get the following output:
_tzname[ 0 ]=SE Asia Standard Time
_tzname[ 1 ]=SE Asia Daylight Time
_timezone=-25200
_get_tzname[ 0 ]=SE Asia Standard Time
_get_tzname[ 1 ]=SE Asia Daylight Time
This is correct.
But if I run the program from the command line I get the following output:
_tzname[ 0 ]=Asi
_tzname[ 1 ]=a/B
_timezone=0
_get_tzname[ 0 ]=Asi
_get_tzname[ 1 ]=a/B
Note that the TZ environment variable is set to Asia/Bangkok, which is a synonym for SE Asia Standard Time, or UTC+7. You will notice in the command-line output that the _tzname[ 0 ] value is the first 3 characters of Asia/Bangkok and _tzname[ 1 ] is the next 3 characters. I have some thoughts on this, but I cannot make sense of it, so I'll just stick to the facts.
Note that I included the calls to _get_tzname(...) to demonstrate that I am not getting caught in some kind of deprecation trap, given that _tzname and _timezone are deprecated.
I'm on Windows 7 Professional and I am linking statically to the runtime library (Multi-threaded Debug (/MTd)). I recently installed Visual Studio 2015 and while I am not using it yet, I compiled this program there and the results are the same. I thought there was a chance that I was somehow linking with the VS2015 libraries but I cannot verify this. The Platform Toolset setting in both projects reflects what I would expect.
Thank you for taking the time to look at this...
I have 3 functions and 4 cores. I want to execute each function in a new thread using MPI and C++.
I wrote this:
int rank, size;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
size--;
if (rank == 0)
{
    Thread1();
}
else
{
    if (rank == 1)
    {
        Thread2();
    }
    else
    {
        Thread3();
    }
}
MPI_Finalize();
But it executes just Thread1(). How must I change the code?
Thanks!
Print the current value of the variable size (possibly without decrementing it) and you will find it is 1. That is: there is 1 process running.
You are likely running your compiled code the wrong way. Consider using mpirun (or mpiexec, depending on your MPI implementation) to execute it, i.e.
mpirun -np 4 ./MyCompiledCode
the -np parameter specifies the number of processes you will start (doing so, your MPI_Comm_size will be 4 as you expect).
Currently, though, you are not using anything specific to C++. You could consider a C++ binding of MPI such as Boost.MPI.
I worked a little on the code you provided and changed it a bit, producing working MPI code (I marked the needed corrections in capital letters).
FYI:
compilation (under gcc, mpich):
$ mpicxx -c mpi1.cpp
$ mpicxx -o mpi1 mpi1.o
execution
$ mpirun -np 4 ./mpi1
output
size is 4
size is 4
size is 4
2 function started.
thread2
3 function started.
thread3
3 function ended.
2 function ended.
size is 4
1 function started.
thread1
1 function ended.
Be aware that stdout is likely to be interleaved.
Are you sure you are compiling your code the right way?
Your problem is that MPI provides no way to feed console input into all processes; it reaches only the process with rank 0. This matters because of the first three lines in main:
int main(int argc, char *argv[]) {
    int oper;
    std::cout << "Enter Size:";
    std::cin >> oper; // <------- The problem is right here
    Operations* operations = new Operations(oper);
    int rank, size;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    switch(tid)
    {
All processes except rank 0 block waiting for console input that they cannot get. You should rewrite the beginning of your main function as follows:
int main(int argc, char *argv[]) {
    int oper;
    MPI_Init(&argc, &argv);
    int tid;
    MPI_Comm_rank(MPI_COMM_WORLD, &tid);
    if (tid == 0) {
        std::cout << "Enter Size:";
        std::cin >> oper;
    }
    MPI_Bcast(&oper, 1, MPI_INT, 0, MPI_COMM_WORLD);
    Operations* operations = new Operations(oper);
    switch(tid)
    {
It works as follows: only rank 0 displays the prompt and then reads the console input into oper. Then a broadcast of the value of oper from rank 0 is performed so all other processes obtain the correct value, create the Operations object and then branch to the appropriate function.