MPI_Alltoallw call fails in WSL2 - C++

I have an MPI program written in C++ that I run on two or more machines: one is a Fedora machine with an Intel i5-4590T CPU; the other is a Windows machine running Ubuntu 22.04 on WSL 2 (CPU: i7-7820X).
I run the same MPI program on both machines; it calls MPI_Alltoallw. On the Fedora machine the Alltoallw call works fine and the output buffer is correctly filled, while on WSL this does not happen and the output buffer is left unmodified by the MPI call (no errors are reported at compile time or at runtime).
In both cases I use mpic++ (g++ 11.3.0 on WSL, 11.3.1 on Fedora) and OpenMPI 4.1.4.
I also tried the same code on an HPC system (RHEL 8.4, g++ 11.3.0, OpenMPI 4.1.4) and it works fine there as well.
Simpler MPI calls (e.g. initializing communicators, an MPI_Bcast) also work fine on WSL.
Is this a known problem with WSL2? Or could something like this arise from a fault in my code?
Edit: here is a minimal example of the code (which also uses Eigen).
#include <iostream>
#include <Eigen/Dense>
#include <unsupported/Eigen/CXX11/Tensor>
#include <cmath>
#include <complex>
#include <mpi.h>
int main() {
    MPI_Init(NULL, NULL);
    const int nc = 8;
    const int nzc = 8;
    const int nz2 = nzc*nc/2;
    const int nzd = 3*nz2;
    const int nxs2 = 32;
    const int ny = 128;
    const int nyc = ny/nc;
    const int nbytes_cmplxd = 16;
    Eigen::TensorFixedSize<std::complex<double>, Eigen::Sizes<ny, nxs2 + 1, nzc>> xc;
    Eigen::TensorFixedSize<std::complex<double>, Eigen::Sizes<nyc, nxs2 + 1, nzd>> buf2;
    xc.setRandom(); // xc is set to some values
    buf2.setZero();
    int count[nc];
    // initialize counts to 1
    for (int i = 0; i < nc; i++) {
        count[i] = 1;
    }
    // first initialize MPI types and displacements for the transposition from xc to buf2
    // init simple datatypes
    MPI_Datatype sendloc1;
    MPI_Datatype recvloc1;
    MPI_Type_vector(nzc*(nxs2+1), nyc, ny, MPI_DOUBLE_COMPLEX, &sendloc1);
    MPI_Type_commit(&sendloc1);
    MPI_Type_vector(nzc*(nxs2+1)*nyc, 1, 1, MPI_DOUBLE_COMPLEX, &recvloc1);
    MPI_Type_commit(&recvloc1);
    int senddisp1[nc];
    int recvdisp1[nc];
    MPI_Datatype sendtypev1[nc];
    MPI_Datatype recvtypev1[nc];
    for (int i = 0; i < nc; i++) {
        senddisp1[i] = nbytes_cmplxd * i * (nyc); // displacement due to column-major ordering
        if (i < (nc/2)) {
            recvdisp1[i] = nbytes_cmplxd * i * (nzc*(nxs2+1)*nyc); // displacement equal to entire size of recvtype
        } else { // add displacement to introduce padding (equal to 1/3 of the array size)
            recvdisp1[i] = nbytes_cmplxd * (i * (nzc*(nxs2+1)*nyc) + nz2*(nxs2+1)*nyc);
        }
        sendtypev1[i] = sendloc1;
        recvtypev1[i] = recvloc1;
    }
    // committing the MPI_Datatype vectors is not needed
    MPI_Alltoallw(xc.data(), count, senddisp1, sendtypev1,
                  buf2.data(), count, recvdisp1, recvtypev1, MPI_COMM_WORLD);
    std::cout << buf2 << std::endl; // buf2 is zero after the call
    MPI_Finalize();
    return 0;
}
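One way to rule out a silently swallowed error (a debugging sketch, not part of the original post) is to switch the communicator's error handler to MPI_ERRORS_RETURN, so that failures are returned as error codes instead of aborting, and then inspect the return value of the collective:

// Debugging sketch (not from the original post): make MPI return error
// codes instead of aborting, then check the Alltoallw result explicitly.
MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
int rc = MPI_Alltoallw(xc.data(), count, senddisp1, sendtypev1,
                       buf2.data(), count, recvdisp1, recvtypev1, MPI_COMM_WORLD);
if (rc != MPI_SUCCESS) {
    char msg[MPI_MAX_ERROR_STRING];
    int len = 0;
    MPI_Error_string(rc, msg, &len);
    std::cerr << "MPI_Alltoallw failed: " << msg << std::endl;
}

If rc is MPI_SUCCESS on WSL while the buffer is still untouched, that points at the data movement itself (e.g. the transport used by the Open MPI build) rather than at a reported error.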

Related

OpenMP GPU offload; Map scalar

I am trying to understand/test OpenMP with GPU offload. However, I am confused because some examples/info (1, 2, 3) on the internet are analogous or similar to mine, yet my example does not work as I think it should. I am using g++ 9.4 on Ubuntu 20.04 LTS and have also installed gcc-9-offload-nvptx.
My example, which does not work but is similar to this one:
#include <iostream>
#include <vector>
#include <cstdio>   // for printf
#include <cstdlib>  // for atoi, exit

int main(int argc, char *argv[]) {
    typedef double myfloat;
    if (argc != 2) exit(1);
    size_t size = atoi(argv[1]);
    printf("Size: %zu\n", size);

    std::vector<myfloat> data_1(size, 2);
    myfloat *data1_ptr = data_1.data();

    myfloat sum = -1;
    #pragma omp target map(tofrom:sum) map(from: data1_ptr[0:size])
    #pragma omp teams distribute parallel for simd reduction(+:sum) collapse(2)
    for (size_t i = 0; i < size; ++i) {
        for (size_t j = 0; j < size; ++j) {
            myfloat term1 = data1_ptr[i] * i;
            sum += term1 / (1 + term1 * term1 * term1);
        }
    }
    printf("sum: %.2f\n", sum);
    return 0;
}
When I compile it with: g++ main.cpp -o test -fopenmp -fcf-protection=none -fno-stack-protector I get the following
stack_example.cpp: In function ‘main._omp_fn.0.hsa.0’:
cc1plus: warning: could not emit HSAIL for the function [-Whsa]
cc1plus: note: support for HSA does not implement non-gridified OpenMP parallel constructs.
It does compile but when using it with
./test 10000
the printed sum is still -1. I think the sum value passed to the GPU was not returned properly but I explicitly map it, so shouldn't it be returned? Or what am I doing wrong?
EDIT 1
I was asked to modify my code because there was a historically grown redundant for loop, and also sum was initialized with -1. I fixed that and also compiled it with gcc-11, which did not throw a warning or note as gcc-9 did. However, the behavior is similar:
Size: 100
Number of devices: 2
sum: 0.00
I checked with nvtop: the GPU is used. Because there are two GPUs, I can even switch the device, and the switch is visible in nvtop.
Solution:
The fix is very easy and stupid. Changing
map(from: data1_ptr[0:size])
to
map(tofrom: data1_ptr[0:size])
did the trick.
Even though I am only reading the array on the device (not writing to it), this was the problem: map(from:) means the host data is never copied to the device before the kernel runs, so the kernel was reading uninitialized device memory.
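For completeness, a minimal sketch of the corrected target region, with the redundant inner loop and the -1 initialization removed as described in EDIT 1:

// Sketch of the fixed offload region: tofrom ensures the host data is
// copied to the device before the kernel reads it.
myfloat sum = 0;
#pragma omp target map(tofrom: sum) map(tofrom: data1_ptr[0:size])
#pragma omp teams distribute parallel for reduction(+: sum)
for (size_t i = 0; i < size; ++i) {
    myfloat term1 = data1_ptr[i] * i;
    sum += term1 / (1 + term1 * term1 * term1);
}

Since the kernel only reads the array, map(to: data1_ptr[0:size]) would arguably express the intent even better; the essential point is that from: alone never copies the data onto the device.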

Does C++ AMP require GPU hardware before it will build/execute?

After learning from a previous question that my VS 2017 C++ AMP project was basically sound, that the error messages, while correct, were masking the real problem, and that the issue was certain lines of code, I rewrote the code as below. By commenting out various lines at a time, I learned that
extent<2> e(M,N);
index<2> idx(0,0);
will build and execute, that code like
array_view<int, 2> c(e, vC);
for (idx[0] = 0; idx[0] < e[0]; idx[0]++)
will build but will throw an exception if run, and that code like
c[idx] = a[idx] + b[idx];
will not even build. Note that I have not as yet invoked any parallel functions.
This leads me to ask: does Concurrency Runtime or C++ AMP require that GPU hardware be installed to build and/or execute properly?
My machine has two multi-core CPU processors, but the GPU hardware hasn't been installed yet. Still, I thought I would be able to use the parallelism constructs to take advantage of the processors I do have.
#include "pch.h"
#include <iostream>
#include "amp.h"
#include <vector>
using namespace Concurrency;
int main() {
const int M = 1024; const int N = 1024; //row, col for vector
std::vector<int> vA(M*N); std::vector<int> vB(M*N); //vectors to add
std::vector<int> vC(M*N); //vector for result
for (int i = 0; i < M; i++) { vA[i] = i; } //populate vectors
for (int j = N - 1; j >= 0; j--) { vB[j] = j; }
extent<2> e(M, N); //uses AMP constructs but
index<2> idx(0, 0); //no parallel functions invoked
array_view<int, 2> a(e, vA), b(e, vB);
array_view<int, 2> c(e, vC);
for (idx[0] = 0; idx[0] < e[0]; idx[0]++) {
for (idx[1] = 0; idx[1] < e[1]; idx[1]++) {
c[idx] = a[idx] + b[idx];
c(idx[0], idx[1]) = a(idx[0], idx[1]) + b(idx[0], idx[1]);
}
}
}
No, GPU hardware is not required. After starting a successfully compiled program without GPU hardware, the system created a "software" GPU, as shown in the output while debugging.
'Amp2.exe' (Win32): Loaded 'C:\Windows\SysWOW64\d3d11ref.dll'. [...]
GPU Device Created.
I used the available GPU diagnostics tool to look at performance.
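As an aside (not from the original answer), the accelerators the C++ AMP runtime can see, including software fallbacks such as the reference rasterizer (REF) or WARP, can be enumerated at startup; a minimal sketch:

#include <amp.h>
#include <iostream>
using namespace Concurrency;

int main() {
    // List every accelerator the C++ AMP runtime can see, including
    // emulated software devices, to learn what it will run on.
    for (const accelerator& acc : accelerator::get_all()) {
        std::wcout << acc.description
                   << (acc.is_emulated ? L" (emulated)" : L"") << std::endl;
    }
    return 0;
}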

Weird CUDA kernel result on old display driver

I'm writing a CUDA program that is to be run on thousands of different GPUs. Those machines have different versions of the display driver installed, and I cannot force them to update to the latest driver. Most code actually runs fine on those 'old' machines, but some particular code fails:
Here's the problem:
#include <stdio.h>
#include <cuda.h>
#include <cuda_profiler_api.h>

__global__ void test()
{
    unsigned i = 64;
    unsigned j = 192;
    int k = 7;
    for (j = 1 << (k - 1); i & j; j >>= 1)
        i ^= j;
    i ^= j;
    printf("i,j,k: %d,%d,%d\n", i, j, k);
    // i,j,k: 32,32,7 (correct)
    // i,j,k: 0,64,7  (wrong)
}

int main() {
    cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    test<<<1,1>>>();
    cudaDeviceSynchronize(); // ensure the kernel finishes and its printf output is flushed
}
The code prints 32,32,7 on a GPU with the latest driver, which is the correct result. But on an old driver (older than the CUDA 6.5 driver) it prints 0,64,7.
I'm looking for any workaround for this.
Environment:
Developing: Win7 32-bit, VS2013, CUDA 6.5
Correct result on: WinXP 32-bit (and Win7 32-bit), GTX 650 (latest driver)
Wrong result on: WinXP 32-bit + GTX 750 Ti (old driver), WinXP 32-bit + GTX 750 (old driver)
There is no workaround. The runtime API is versioned and the minimum driver version requirement is non-negotiable.
Your only two choices are to develop using the lowest common denominator toolkit version that supports the driver being used, or switch to the driver API.
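As a diagnostic sketch (beyond the original answer), the runtime API can report both the CUDA version the installed driver supports and the version the binary was built against, so a mismatch can be detected at startup:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    // CUDA version supported by the installed display driver
    // (0 means no CUDA-capable driver is present).
    cudaDriverGetVersion(&driverVersion);
    // CUDA version of the runtime this binary was built against.
    cudaRuntimeGetVersion(&runtimeVersion);
    printf("driver: %d, runtime: %d\n", driverVersion, runtimeVersion); // e.g. 6050 = CUDA 6.5
    if (driverVersion < runtimeVersion)
        printf("installed driver is too old for this runtime\n");
    return 0;
}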
Got a very slow workaround: use local memory rather than register variables.
Just add the volatile keyword before i and j:
volatile unsigned i = 64;
volatile unsigned j = 192;
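For reference, the same bit-twiddling loop can be run on the host to confirm what the kernel should print (a verification sketch; a correct compiler/driver combination produces 32,32,7):

#include <stdio.h>

// Host-side reference for the kernel's bit manipulation.
int main() {
    unsigned i = 64;
    unsigned j = 192;
    int k = 7;
    for (j = 1u << (k - 1); i & j; j >>= 1)
        i ^= j;
    i ^= j;
    printf("i,j,k: %u,%u,%d\n", i, j, k); // expected: 32,32,7
    return 0;
}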

CUDA kernel to compute squares of integers in an array

I am learning some basic CUDA programming. I am trying to initialize an array on the host with host_a[i] = i. This array consists of N = 128 integers. I am launching a kernel with 1 block and 128 threads per block, in which I want to square the integer at index i.
My questions are:
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
The expected output for my program is a space-separated list of squares of integers -
1 4 9 16 ... .
What's wrong with my code, since it outputs 1 2 3 4 5 ...
Code:
#include <iostream>
#include <numeric>
#include <stdio.h>  // for printf
#include <stdlib.h>
#include <cuda.h>

const int N = 128;

__global__ void f(int *dev_a) {
    unsigned int tid = threadIdx.x;
    if (tid < N) {
        dev_a[tid] = tid * tid;
    }
}

int main(void) {
    int host_a[N];
    int *dev_a;
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    for (int i = 0; i < N; i++) {
        host_a[i] = i;
    }
    cudaMemcpy(dev_a, host_a, N * sizeof(int), cudaMemcpyHostToDevice);
    f<<<1, N>>>(dev_a);
    cudaMemcpy(host_a, dev_a, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) {
        printf("%d ", host_a[i]);
    }
}
How do I come to know whether the kernel gets launched or not? Can I use printf within the kernel?
You can use printf in device code (as long as you #include <stdio.h>) on any compute capability 2.0 or higher GPU. Since CUDA 7 and CUDA 7.5 only support those types of GPUs, if you are using CUDA 7 or CUDA 7.5 (successfully) then you can use printf in device code.
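A minimal sketch of device-side printf (note the synchronization call: device printf output is buffered, so without it the process may exit before the output is flushed):

#include <stdio.h>

__global__ void hello() {
    // Device-side printf is supported on compute capability 2.0+.
    printf("hello from thread %d\n", threadIdx.x);
}

int main() {
    hello<<<1, 4>>>();
    // Flush the device printf buffer before the process exits.
    cudaDeviceSynchronize();
    return 0;
}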
What's wrong with my code?
As identified in the comments, there is nothing "wrong" with your code, if run on a properly set up machine. To address your previous question "How do I come to know whether the kernel gets launched or not?", the best approach in my opinion is to use proper cuda error checking, which has numerous benefits besides just telling you whether your kernel launched or not. In this case it would also give a clue as to the failure being an improper CUDA setup on your machine. You can also run CUDA codes with cuda-memcheck as a quick test as to whether any runtime errors are occurring.
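A common error-checking pattern looks like the following sketch (the CUDA_CHECK macro name is illustrative, not a library API):

#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper macro: wrap every CUDA runtime call and abort
// with a readable message if it returns an error.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Usage in the code above:
//   CUDA_CHECK(cudaMalloc((void**)&dev_a, N * sizeof(int)));
//   f<<<1, N>>>(dev_a);
//   CUDA_CHECK(cudaGetLastError());        // catches launch errors
//   CUDA_CHECK(cudaDeviceSynchronize());   // catches asynchronous execution errors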

Same Code giving different results in Visual Studio 2010 and Code Blocks 10.05

I am executing a filter code. This code gives different results in Visual Studio and Code Blocks; the expected results are the ones Code Blocks gives. But since I have to implement Intel Threading Building Blocks into it, I have to use Visual Studio. Please help me find out why there is a difference.
The code is:
#include <stdio.h>
#include <stdint.h>
#include <cstring>
#include <iostream>
#include <fstream>
#include <math.h>
#include <time.h>
using namespace std;
#define pi 3.141593
#define FILTER_LEN 265
double coeffs[ FILTER_LEN ] =
{
0.0033473431384214393,0.000032074683390218124,0.0033131082058404943,0.0024777666109278788,
-0.0008968429179843104,-0.0031973449396977684,-0.003430943381749411,-0.0029796565504781646,
-0.002770673157048994,-0.0022783059845596586,-0.0008531818129514857,0.001115432556294998,
0.0026079871108133294,0.003012423848769931,0.002461420635709332,0.0014154004589753215,
0.00025190669718400967,-0.0007608257014963959,-0.0013703600874774068,-0.0014133823230551277,
-0.0009759556503342884,-0.00039687498737139273,-0.00007527524701314324,-0.00024181463305012626,
-0.0008521761947454302,-0.00162618205097997,-0.002170446498273018,-0.002129903305507943,
-0.001333859049002249,0.00010700092934983156,0.0018039564602637683,0.0032107930896349583,
0.0038325849735515363,0.003416201274366522,0.002060848732332109,0.00017954815260431595,
-0.0016358832300944531,-0.0028402136847527387,-0.0031256650498727384,-0.0025374271571154713,
-0.001438370315670195,-0.00035115295209013755,0.0002606730012030533,0.0001969569787142967,
-0.00039635535951198597,-0.0010886127490608972,-0.0013530057243606405,-0.0008123200399262436,
0.0005730271959526784,0.0024419465938120906,0.004133717273258681,0.0049402122577746265,
0.0043879285604252714,0.002449549610687005,-0.00040283102645093463,-0.003337730734820209,
-0.0054508346511294775,-0.006093057767824609,-0.005117609782189977,-0.0029293645861970417,
-0.0003251033117661085,0.0018074390555649442,0.0028351284091668164,0.002623563404428517,
0.0015692864792199496,0.0004127664681096788,-0.00009249878881824428,0.0004690173244168184,
0.001964334172374759,0.0037256715492873485,0.004809640399145206,0.004395274594482053,
0.0021650921193604,-0.0014888595443799124,-0.005534807968511709,-0.008642334104607624,
-0.009668950651149259,-0.008104732391434574,-0.004299972815463919,0.0006184612821881392,
0.005136551428636121,0.007907786753766152,0.008241212326068366,0.00634786595941524,
0.003235610213062744,0.00028882736660937287,-0.001320994685952108,-0.0011237433853145615,
0.00044213409507615003,0.0022057106517524255,0.00277593527678719,0.0011909915058737617,
-0.0025807757230413447,-0.007497632882437637,-0.011739520895818884,-0.013377018279057393,
-0.011166543231844196,-0.005133056165990026,0.0032948631959114935,0.011673660427968408,
0.017376415708412904,0.018548938130314566,0.014811760899506572,0.007450782505155853,
-0.001019540069785369,-0.007805775815783898,-0.010898333714715424,-0.00985364043415772,
-0.005988406030111452,-0.001818560524968024,0.000028552677472614846,-0.0019938756495376363,
-0.007477684025727061,-0.013989430449615033,-0.017870518868849213,-0.015639422062597726,
-0.005624959109456065,0.010993528170353541,0.03001263681283932,0.04527492462846608,
0.050581340787164114,0.041949186532860346,0.019360612460662185,-0.012644336735920483,
-0.0458782599058412,-0.07073838953156347,-0.0791205623455818,-0.06709535677423759,
-0.03644544574795176,0.005505370370858695,0.04780486657828151,0.07898800597378192,
0.0904453420042807,0.07898800597378192,0.04780486657828151,0.005505370370858695,
-0.03644544574795176,-0.06709535677423759,-0.0791205623455818,-0.07073838953156347,
-0.0458782599058412,-0.012644336735920483,0.019360612460662185,0.041949186532860346,
0.050581340787164114,0.04527492462846608,0.03001263681283932,0.010993528170353541,
-0.005624959109456065,-0.015639422062597726,-0.017870518868849213,-0.013989430449615033,
-0.007477684025727061,-0.0019938756495376363,0.000028552677472614846,-0.001818560524968024,
-0.005988406030111452,-0.00985364043415772,-0.010898333714715424,-0.007805775815783898,
-0.001019540069785369,0.007450782505155853,0.014811760899506572,0.018548938130314566,
0.017376415708412904,0.011673660427968408,0.0032948631959114935,-0.005133056165990026,
-0.011166543231844196,-0.013377018279057393,-0.011739520895818884,-0.007497632882437637,
-0.0025807757230413447,0.0011909915058737617,0.00277593527678719,0.0022057106517524255,
0.00044213409507615003,-0.0011237433853145615,-0.001320994685952108,0.00028882736660937287,
0.003235610213062744,0.00634786595941524,0.008241212326068366,0.007907786753766152,
0.005136551428636121,0.0006184612821881392,-0.004299972815463919,-0.008104732391434574,
-0.009668950651149259,-0.008642334104607624,-0.005534807968511709,-0.0014888595443799124,
0.0021650921193604,0.004395274594482053,0.004809640399145206,0.0037256715492873485,
0.001964334172374759,0.0004690173244168184,-0.00009249878881824428,0.0004127664681096788,
0.0015692864792199496,0.002623563404428517,0.0028351284091668164,0.0018074390555649442,
-0.0003251033117661085,-0.0029293645861970417,-0.005117609782189977,-0.006093057767824609,
-0.0054508346511294775,-0.003337730734820209,-0.00040283102645093463,0.002449549610687005,
0.0043879285604252714,0.0049402122577746265,0.004133717273258681,0.0024419465938120906,
0.0005730271959526784,-0.0008123200399262436,-0.0013530057243606405,-0.0010886127490608972,
-0.00039635535951198597,0.0001969569787142967,0.0002606730012030533,-0.00035115295209013755,
-0.001438370315670195,-0.0025374271571154713,-0.0031256650498727384,-0.0028402136847527387,
-0.0016358832300944531,0.00017954815260431595,0.002060848732332109,0.003416201274366522,
0.0038325849735515363,0.0032107930896349583,0.0018039564602637683,0.00010700092934983156,
-0.001333859049002249,-0.002129903305507943,-0.002170446498273018,-0.00162618205097997,
-0.0008521761947454302,-0.00024181463305012626,-0.00007527524701314324,-0.00039687498737139273,
-0.0009759556503342884,-0.0014133823230551277,-0.0013703600874774068,-0.0007608257014963959,
0.00025190669718400967,0.0014154004589753215,0.002461420635709332,0.003012423848769931,
0.0026079871108133294,0.001115432556294998,-0.0008531818129514857,-0.0022783059845596586,
-0.002770673157048994,-0.0029796565504781646,-0.003430943381749411,-0.0031973449396977684,
-0.0008968429179843104,0.0024777666109278788,0.0033131082058404943,0.000032074683390218124,
0.0033473431384214393
};
void ComputeFIR(double *coeffs, double *input, int filterLength, ofstream &o)
{
    double acc;
    double *coeffp;
    int n, k, ip, nip;
    o << fixed;
    for (n = 0; n < 150000; n++)
    {
        coeffp = coeffs;
        ip = (filterLength - 1 + n);
        nip = 0;
        acc = 0;
        for (k = 0; k < filterLength; k++)
        {
            nip = ip - k;
            acc += ((*coeffp++) * (input[nip])); // can't use *inputp-- because dynamic memory is not necessarily contiguous
        }
        o << n << "," << acc << endl;
        if (n < 10)
        {
            cout << n << "\t" << acc << endl;
        }
    }
}
int main()
{
    int i;
    ofstream o("400hz.csv");
    cout << fixed;
    o << fixed;
    // length required to process M input samples is N-1+M, where N is the filter length and M is the number of samples
    double *buffer = new double[150264];
    memset(buffer, 0, sizeof( buffer));
    for (i = (FILTER_LEN - 1); i < 150264; i++)
    {
        buffer[i] = sin(400 * (2 * pi) * (i / 5000.0));
        o << i << "," << buffer[i] << endl;
    }
    ComputeFIR(coeffs, buffer, FILTER_LEN, o);
    delete [] buffer;
    return 0;
}
In this code part
if (n < 10)
{
    cout << n << "\t" << acc << endl;
}
gives the following results in Code Blocks:
0 0.002291
1 0.003205
2 0.005587
3 0.007458
4 0.006254
5 0.001537
6 -0.005113
7 -0.011685
8 -0.0016522
9 -0.018142
which are as expected, while in Visual Studio the results are:
0 7.86052E+64
1 7.88065E+64
2 9.96043E+64
3 1.15158E+65
4 1.09528E+65
5 8.94574E+64
6 6.79198E+64
7 4.92152E+64
8 3.18225E+64
9 1.75206E+64
Please help in resolving this issue.
The following line doesn't initialise your buffer to zero:
memset(buffer, 0, sizeof( buffer));
You need
memset(buffer, 0, sizeof(double) * 150264);
Maybe Code Blocks and Visual Studio differ in the way that they initialise that memory.
Having the same code produce different results under different compilers is a sure sign of undefined behavior.
The reason behind this undefined behavior is this line:
memset(buffer, 0, sizeof( buffer));
The first problem is that the variable buffer is a pointer, so sizeof(buffer) gives the size of the pointer itself and not of what it points to. This means that only four or eight bytes (depending on whether you are on a 32- or 64-bit platform) are initialized to zero; the rest of the memory is left with seemingly random contents.
The second problem, which happens to work out in this case (but only in this case), is that you are setting all bytes to zero. Floating-point numbers are not stored in memory like ordinary integers; an all-zero bit pattern does represent 0.0, but you can't use memset with any other value and expect the numbers to be what you want.
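A sketch of a modern alternative to both fixes: let std::vector do the zero-initialization, which removes the memset pitfall entirely (a fragment assuming the rest of main stays as in the question):

#include <vector>

// In main(), instead of new[] + memset:
std::vector<double> buffer(150264, 0.0); // value-initialized: every sample starts at 0.0
for (int i = (FILTER_LEN - 1); i < 150264; i++)
    buffer[i] = sin(400 * (2 * pi) * (i / 5000.0));
ComputeFIR(coeffs, buffer.data(), FILTER_LEN, o); // no delete[] needed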