I have two functions:
The add_cpu function works fine, but the add_gpu function does not.
I tried checking some options in my GPU driver software and read my code over and over again. I tried the exact same code on another machine and it worked fine.
The checkError result on the current machine is 1, which it shouldn't be.
The checkError result on my laptop is 0, which is correct.
Does anyone have a suggestion as to what the problem with the graphics card or the system might be?
I have no clue what the problem is here.
Did I miss some sort of option?
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <iostream>
#include <math.h>
#define out std::cout <<
#define end << std::endl
__global__
void add_gpu( int n, float* x, float* y ) {
for ( int i = 0; i < n; i++ ) y[i] = x[i] + y[i];
}
void add_cpu( int n, float* x, float* y ) {
for ( int i = 0; i < n; i++ ) y[i] = x[i] + y[i];
}
void init( int n, float* x, float* y ) {
for ( int i = 0; i < n; i++ ) {
x[i] = 1.0f;
y[i] = 2.0f;
}
}
float checkError( int n, float f, float* y ) {
float c = 0.0f;
for ( int i = 0; i < n; i++ ) c = fmax( c, fabs( y[i] - f ) );
return c;
}
void print( int n, float* obj, const char* str = "obj: " ) {
out str << obj[0];
for ( int i = 1; i < n; i++ ) out ", " << obj[i];
out "" end;
}
int main( ) {
int n = 1 << 5;
float* x, * y;
float error = 0.0f;
cudaMallocManaged( &x, n * sizeof( float ) );
cudaMallocManaged( &y, n * sizeof( float ) );
init( n, x, y );
print( n, x, "x" );
print( n, y, "y" );
add_gpu<<<1, 1>>>( n, x, y );
//add_cpu(n, x, y);
cudaDeviceSynchronize( );
print( n, y, "y" );
error = checkError( n, 3.0f, y );
out "error: " << error end;
cudaFree( x );
cudaFree( y );
return 0;
}
I don't see exactly where the problem is, but in order to debug it you should check the CUDA errors.
Most CUDA functions return a CUDA status. You could use a little wrapper function like this to check for errors:
void checkCudaError(const cudaError_t error) {
if (error != cudaSuccess) {
std::cout << "Cuda error: " << cudaGetErrorString(error) << std::endl;
// maybe do something else
}
}
and call functions like cudaMallocManaged() this way:
checkCudaError(cudaMallocManaged(&x, n * sizeof(float)));
For all operations that are performed on the device (like custom kernels), you should run the kernel and afterwards call
cudaGetLastError()
and maybe also use checkCudaError():
checkCudaError(cudaGetLastError())
Note that cudaGetLastError() will return an error if an error occurred at some earlier point, so you have to find the place where the first error occurs. That is why you should check for CUDA errors every time the GPU is used in some way.
https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__MEMORY.html#group__CUDART__MEMORY_1gc263dbe6574220cc776b45438fc351e8
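Applied to the kernel launch in the question, the checks might look like this (a minimal sketch using the checkCudaError() wrapper above):
add_gpu<<<1, 1>>>( n, x, y );
checkCudaError( cudaGetLastError() );      // reports launch errors (e.g. invalid configuration, no usable device)
checkCudaError( cudaDeviceSynchronize() ); // reports errors that occur while the kernel runs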
Without copying the data to the device your GPU doesn't know the data, and without copying it back your host doesn't know the results.
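With explicit device buffers (instead of the cudaMallocManaged() buffers used in the question), those copies might look like this (a minimal sketch, not part of the original code):
float* d_x;
float* d_y;
cudaMalloc( &d_x, n * sizeof( float ) );
cudaMalloc( &d_y, n * sizeof( float ) );
cudaMemcpy( d_x, x, n * sizeof( float ), cudaMemcpyHostToDevice ); // host -> device
cudaMemcpy( d_y, y, n * sizeof( float ), cudaMemcpyHostToDevice );
add_gpu<<<1, 1>>>( n, d_x, d_y );
cudaMemcpy( y, d_y, n * sizeof( float ), cudaMemcpyDeviceToHost ); // device -> host
cudaFree( d_x );
cudaFree( d_y );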
I'm running a program that performs an Euler approximation of an ordinary differential equation. The smaller the step size chosen, the more accurate the approximation. I can get it to work for a fixed step size using this code:
#include <iostream>
using std::cout;
double f (double x, double t)
{
return t*x*x-t;
}
int main()
{
double x=0.0,t=0.0,t1=2.0;
int n=20;
double h = (t1-t) / double(n);
// ----- EULERS METHOD
for (int i=0; i<n; i++)
{
x += h*f(x,t);
t += h;
}
cout << h << " " << x << "\n";
}
So this code runs an Euler approximation for n=20, which corresponds to a step size of 0.1, and outputs the step size along with the approximation for x(2). I want to know how to loop this code (for different values of n) so that it outputs this, followed by increasingly smaller step sizes and their corresponding approximations.
i.e an output something like this:
0.1 -0.972125
0.01 -0.964762
0.001 -0.9641
etc.
So I tried a for-loop inside a for-loop, but it's giving me a weird output of extreme values.
#include <iostream>
using std::cout;
double f (double x, double t)
{
return t*x*x-t;
}
int main()
{
double x=0.0,t=0.0,t1=2.0;
for (int n=20;n<40;n++)
{
double h = (t1-t)/n;
for (int i=0;i<n;i++)
{
x += h*f(x,t);
t += h;
}
cout << h << " " << x << "\n";
}
}
If I understand correctly, you want to execute that first piece of code inside your main function for different values of n. Then your problem is with the variables x, t and t1, which are set once before the loop and never reset. You want them inside your outer loop:
#include <iostream>
using std::cout;
double f( double x, double t )
{
return t * x * x - t;
}
int main()
{
for ( int n = 20; n < 40; n++ )
{
double x = 0.0, t = 0.0, t1 = 2.0;
double h = ( t1 - t ) / n;
for ( int i = 0; i < n; i++ )
{
x += h * f( x, t );
t += h;
}
cout << h << " " << x << "\n";
}
}
Using a function for this makes it clearer:
#include <iostream>
using std::cout;
double f( double x, double t )
{
return t * x * x - t;
}
void eulers( const int n )
{
double x = 0.0, t = 0.0, t1 = 2.0;
double h = ( t1 - t ) / n;
for ( int i = 0; i < n; i++ )
{
x += h * f( x, t );
t += h;
}
cout << h << " " << x << "\n";
}
int main()
{
for ( int n = 20; n < 40; n++ )
{
eulers( n );
}
}
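If you specifically want the step sizes 0.1, 0.01, 0.001 from your sample output, you could multiply n by 10 each time instead of incrementing it (a small sketch reusing the eulers() function above):
int main()
{
    // n = 20, 200, 2000 gives h = 0.1, 0.01, 0.001 on the interval [0, 2]
    for ( int n = 20; n <= 2000; n *= 10 )
    {
        eulers( n );
    }
}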
Hope this helps.
I have implemented a Discrete Fourier Transform function as follows (where CVector is a simple wrapper around an array):
template <typename T, std::size_t Width>
CVector<std::complex<T>, Width> DiscreteFourierTransform( const CVector<T, Width>& vec )
{
CVector<std::complex<T>, Width> vecResult;
const std::complex<T> cmplxPrefactor( std::complex<T>( 0, -M_PI ) / (T)(Width/2) );
for( int s = 0; s < Width; ++s )
{
vecResult[s] = std::complex<T>( T( 0.0 ), T( 0.0 ) );
for( int x = 0; x < Width; ++x )
{
vecResult[s] += vec[x] * std::exp( cmplxPrefactor * (T)(x - (int)(Width/2)) * (T)(s - (int)(Width/2)) );
}
vecResult[s] /= (T)(Width);
}
return vecResult;
}
This works fine for a single top-hat function centered in the middle of the array. However, if I displace the top-hat function by -10 units, using the following bit of code:
int main()
{
CVector<double, 500> vecSlit;
for( unsigned int i = 235; i <= 245; ++i )
{
vecSlit[i] = 1.0;
}
CVector<std::complex<double>, 500> vecFourierTransform = DiscreteFourierTransform( vecSlit );
std::cout << "Saving..." << std::endl;
if( SaveList( "offset-fourier-transform.txt", vecFourierTransform ) )
{
std::cout << "Save Successful!" << std::endl;
}
else
{
std::cout << "Save Unsuccessful!" << std::endl;
}
return 0;
}
I get the following output (plots omitted), where the first plot is the amplitude and the second is the real part. The amplitude looks fine, but the real part looks incorrect. Does anyone have any idea why this might be?
I just got started learning how to code graphics using C++. When compiling a linear interpolation program, the code does not run and VC++ drops me into the xmemory file. No errors or warnings are given, leaving me with nothing to work on. What did I do wrong? I suspect the problem is connected to the way I assign the vectors, yet none of my changes have worked.
Here is the code:
#include "SDL.h"
#include <iostream>
#include <glm/glm.hpp>
#include <vector>
#include "SDLauxiliary.h"
using namespace std;
using glm::vec3;
using std::vector;
const int SCREEN_WIDTH = 640;
const int SCREEN_HEIGHT = 480;
SDL_Surface* screen;
void Draw();
void Interpolate( float a, float b, vector<float>& result ) {
int i = 0;
for ( float x=a;x < b+1; ++x )
{
result[i] = x;
i = i + 1;
}
}
void InterpolateVec( vec3 a, vec3 b, vector<vec3>& resultvec ) {
int i = 0;
for (int add=0; add < 4; ++add) {
float count1 = (b[add]-a[add])/resultvec.size() + a[add];
float count2 = (b[add]-a[add])/resultvec.size() + a[add];
float count3 = (b[add]-a[add])/resultvec.size() + a[add];
resultvec[i].x = (count1, count2, count3);
resultvec[i].y = (count1, count2, count3);
resultvec[i].z = (count1, count2, count3);
i = i + 1;
}
}
int main( int argc, char* argv[] )
{
vector<float> result(10); // Create a vector with 10 floats
Interpolate(5, 14, result); // Fill it with interpolated values
for( int i=0; i < result.size(); ++i )
cout << result[i] << " "; // Print the result to the terminal
vector<vec3> resultvec( 4 );
vec3 a(1,4,9.2);
vec3 b(4,1,9.8);
InterpolateVec( a, b, resultvec );
for( int i=0; i<resultvec.size(); ++i )
{
cout << "( "
<< resultvec[i].x << ", "
<< resultvec[i].y << ", "
<< resultvec[i].z << " ) ";
}
screen = InitializeSDL( SCREEN_WIDTH, SCREEN_HEIGHT );
while( NoQuitMessageSDL() )
{
Draw();
}
SDL_SaveBMP( screen, "screenshot.bmp" );
return 0;
}
void Draw()
{
for( int y=0; y<SCREEN_HEIGHT; ++y )
{
for( int x=0; x<SCREEN_WIDTH; ++x )
{
vec3 color(1,0,1);
PutPixelSDL( screen, x, y, color );
}
}
if( SDL_MUSTLOCK(screen) )
SDL_UnlockSurface(screen);
SDL_UpdateRect( screen, 0, 0, 0, 0 );
}
I cannot post a comment on the question, so I'll write my thoughts as an answer.
resultvec[i].x = (count1, count2, count3);
resultvec[i].y = (count1, count2, count3);
resultvec[i].z = (count1, count2, count3);
It looks like you (or one of your libraries) overloads operator, for float to make a vec2 and then a vec3. Nice solution, but if I'm right, then there is no reason to assign that value to each component, and your code would be similar to:
resultvec[i] = (count1, count2, count3);
Again, this is just a hypothesis! I cannot compile your code and see the error.
Also, I do not understand why you are using i, which is equal to add.
It's strange that some of you could not compile the code; it may be that you have not installed the libraries (the n00b speculating, yay ...).
So here is what I did to make it work (less code is better in this case, as the first comment stated):
void InterpolateVec( vec3 a, vec3 b, vector<vec3>& resultvec ) {
resultvec[0].x = a[0];
resultvec[0].y = a[1];
resultvec[0].z = a[2];
float count1 = (b[0]-a[0])/(resultvec.size() - 1);
float count2 = (b[1]-a[1])/(resultvec.size() - 1);
float count3 = (b[2]-a[2])/(resultvec.size() - 1);
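// Note: this loop fills resultvec[1] through resultvec[4], so resultvec must hold at least 5 elements.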
for (int add=1; add < 5; ++add) {
a[0] = a[0] + count1;
a[1] = a[1] + count2;
a[2] = a[2] + count3;
resultvec[add].x = a[0];
resultvec[add].y = a[1];
resultvec[add].z = a[2];
}
}
What I discovered (after many an hour ...) was that I did not need to add count1, count2 and count3 separately; vec3 is such a type that adding count1 does what I wanted it to, assigning a color (i.e. something like (0,0,1)). Am I making sense? My vocabulary is not that technical, I know.
Or, you could save some time and let glm::vec3 do what glm::vec3 is supposed to do.
In the meantime: here, have a cookie.
void Interpolate(vec3 a, vec3 b, vector<vec3>& result) {
vec3 diffStep = (b-a) * (1.0f / (result.size() - 1)); // Operator overloading
result[0] = vec3(a);
for(int i = 1; i < result.size(); i++) {
result[i] = result[i-1] + diffStep;
}
}
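Called with the values from the question's main, this might be used like so (a minimal sketch):
vector<vec3> resultvec( 4 );
vec3 a( 1, 4, 9.2 );
vec3 b( 4, 1, 9.8 );
Interpolate( a, b, resultvec ); // resultvec now holds 4 evenly spaced points from a to b, inclusive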
I have a problem with my MPI code: it hangs when run on multiple nodes but completes successfully when run on a single node. I am not sure how to debug this. Can someone help me debug this issue?
Program Usage:
mpicc -o string strin.cpp
mpirun -np 4 -npernode 2 -hostfile hosts ./string 12 0.1 0.9 10 2
My Code:
#include <iostream>
#include <vector>
#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"
int main ( int argc, char **argv )
{
float *y, *yold;
float *v, *vold;
int nprocs, myid;
FILE *f = NULL;
MPI_Status status;
int namelen;
char processor_name[MPI_MAX_PROCESSOR_NAME];
// const int NUM_MASSES = 1000;
// const float Ktension = 0.1;
// const float Kdamping = 0.9;
// const float duration = 10.0;
#if 0
if ( argc != 5 ) {
std::cout << "usage: " << argv[0] << " NUM_MASSES durationInSecs Ktension Kdamping\n";
return 2;
}
#endif
int NUM_MASSES = atoi ( argv[1] );
float duration = atof ( argv[2] );
float Ktension = atof ( argv[3] );
float Kdamping = atof ( argv[4] );
const int PICKUP_POS = NUM_MASSES / 7; // change this for diff harmonics
const int OVERSAMPLING = 16; // run sim at this multiple of audio sampling rate
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&nprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name, &namelen);
// open output file
if (myid == 0) {
f = fopen ( "rstring.raw", "wb" );
if (!f) {
std::cout << "can't open output file\n";
return 1;
}
}
// allocate displacement and velocity arrays
y = new float[NUM_MASSES];
yold = new float[NUM_MASSES];
v = new float[NUM_MASSES];
// initialize displacements (pluck it!) and velocities
for (int i = 0; i < NUM_MASSES; i++ ) {
v[i] = 0.0f;
yold[i] = y[i] = 0.0f;
if (i == NUM_MASSES/2 )
yold[i] = 1.0; // impulse at string center
}
// Broadcast data
//MPI_Bcast(y, NUM_MASSES, MPI_FLOAT, 0, MPI_COMM_WORLD);
//MPI_Bcast(yold, NUM_MASSES, MPI_FLOAT, 0, MPI_COMM_WORLD);
//MPI_Bcast(v, NUM_MASSES, MPI_FLOAT, 0, MPI_COMM_WORLD);
//int numIters = duration * 44100 * OVERSAMPLING;
int numIters = atoi( argv[5] );
for ( int t = 0; t < numIters; t++ ) {
// for each mass element
float sum = 0;
float gsum = 0;
int i_start;
int i_end ;
i_start = myid * (NUM_MASSES/nprocs);
i_end = i_start + (NUM_MASSES/nprocs);
for ( int i = i_start; i < i_end; i++ ) {
if ( i == 0 || i == NUM_MASSES-1 ) {
} else {
float accel = Ktension * (yold[i+1] + yold[i-1] - 2*yold[i]);
v[i] += accel;
v[i] *= Kdamping;
y[i] = yold[i] + v[i];
sum += y[i];
}
}
MPI_Reduce(&sum, &gsum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
float *tmp = y;
y = yold;
yold = tmp;
if (myid == 0) {
//printf("%f\n", gsum);
if ( t % OVERSAMPLING == 0 ) {
fwrite ( &gsum, sizeof(float), 1, f );
}
}
}
if (myid == 0) {
fclose ( f );
}
MPI_Finalize();
}
If you have the possibility, you may try to run your application inside a parallel debugger (like TotalView).
Otherwise, when the program hangs, you can attach a freely available serial debugger (like GDB) to one process at a time to see where the potential problem may be located.
I guess you're receiving a message which isn't sent by any node. If every node first tries to receive a message, which node will send it?
You can modify the program, for example: if id == 0, send(msg), else receive(&msg), and try using timeouts.
Write down on a piece of paper how it works and how the nodes interact, and you will see where the problem is.
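A minimal sketch of that send/receive pattern, with a hypothetical single-float message and tag 0, might look like this:
float msg = 0.0f;
if (myid == 0)
    MPI_Send(&msg, 1, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);          // rank 0 sends to rank 1
else if (myid == 1)
    MPI_Recv(&msg, 1, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, &status); // rank 1 receives from rank 0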
I finally found the answer on the OpenMPI mailing list. I think the problem is caused by the way my hosts are set up.
I guess the TCP BTL gets confused by virtual interfaces (vmnet?) when running on multiple nodes. I limited the interfaces used with the "--mca btl_tcp_if_include eth0" argument, and this solved my issue.
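For reference, combined with the launch command from the question, that looks something like:
mpirun -np 4 -npernode 2 -hostfile hosts --mca btl_tcp_if_include eth0 ./string 12 0.1 0.9 10 2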
I have a small problem, which I think will be easy for you to figure out, but I'm still not a good programmer. Anyway, the problem is that I need to access the elements of a 20x2 matrix, which represents the x,y locations of 20 features in an image. I need a parameter that gives me all of the x values and another one for the y values; for example, P = (all x values) and q = (all y values), in order to use them to draw on the image.
The function for creating the matrix is an OpenCV function:
CvMat* mat = cvCreateMat(20,2,CV_32FC1);
This matrix holds the x,y values of the frame features. I have used this code to print it out:
float t[20][2];
for (int k1=0; k1<20; k1++) {
for (int k2=0; k2<2; k2++) {
t[k1][k2] = cvmGet(mat,k1,k2);
std::cout<< t[k1][k2]<<"\t";
}
}
std::cout <<" "<< std::endl;
std::cout <<" "<< std::endl;
std::cout <<" "<< std::endl;
This code works out well, but as I mentioned above, I want to assign the values to parameters in order to use them.
Do you want something like this:
void GetMatrixElem( float t [][2] ,int x ,int y ,float** val )
{
if (val) // && (x >= 0) && (x < 20) && (y>=0) && (y<2)
*val = &t[x][y];
}
// ...
float t [20][2];
float* pElem = NULL;
GetMatrixElem( t ,10 ,1 ,&pElem );
For columns and rows you can use something like this:
void GetClmn( float t[][2] ,int y ,float* pClmn[] )
{
for( int x = 0; x < 20; x++ )
{
pClmn[x] = &t[x][y];
}
}
void GetRow( float t[][2] ,int x ,float* pRow[] )
{
for( int y = 0; y < 2; y++ )
{
pRow[y] = &t[x][y];
}
}
Usage:
float* pClm[20];
GetClmn( t ,1 ,pClm);
float* pRow[2];
GetRow( t ,19 ,pRow );
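If you prefer to work on the CvMat from the question directly, the same idea applies. A minimal sketch that collects all x values (column 0) into p and all y values (column 1) into q, using cvmGet() as in the question's print loop:
float p[20]; // all x values
float q[20]; // all y values
for ( int k = 0; k < 20; k++ )
{
    p[k] = cvmGet( mat, k, 0 );
    q[k] = cvmGet( mat, k, 1 );
}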