CUDA cufft 2D example - c++

I am currently working on a program that has to implement a 2D FFT (for cross-correlation). I did a 1D FFT with CUDA which gave me the correct results; I am now trying to implement a 2D version. With few examples and little documentation online, I find it hard to find out what the error is.
So far I have been using the cuFFT manual only.
Anyway, I have created two 5x5 arrays and filled them with 1's. I have copied them onto the GPU memory, done the forward FFT, multiplied them, and then done an IFFT on the result. This gives me a 5x5 array with the value 650 everywhere. I would expect to get a DC signal with the value 25 in only one slot of the 5x5 array. Instead I get 650 in the entire array.
Furthermore, I am not allowed to print out the value of the signal after it has been copied onto the GPU memory. Writing
cout << d_signal[1].x << endl;
gives me an access violation. I have done the same thing in other CUDA programs, where this was not an issue. Does it have something to do with how the complex variable works, or is it human error?
If anyone has any pointers to what is going wrong I would greatly appreciate it. Here is the code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <helper_functions.h>
#include <helper_cuda.h>
#include <ctime>
#include <time.h>
#include <stdio.h>
#include <iostream>
#include <math.h>
#include <cufft.h>
#include <fstream>

using namespace std;

typedef float2 Complex;

__global__ void ComplexMUL(Complex *a, Complex *b)
{
    int i = threadIdx.x;
    a[i].x = a[i].x * b[i].x - a[i].y * b[i].y;
    a[i].y = a[i].x * b[i].y + a[i].y * b[i].x;
}

int main()
{
    int N = 5;
    int SIZE = N*N;

    Complex *fg = new Complex[SIZE];
    for (int i = 0; i < SIZE; i++){
        fg[i].x = 1;
        fg[i].y = 0;
    }
    Complex *fig = new Complex[SIZE];
    for (int i = 0; i < SIZE; i++){
        fig[i].x = 1;
        fig[i].y = 0;
    }
    for (int i = 0; i < 24; i = i + 5)
    {
        cout << fg[i].x << " " << fg[i + 1].x << " " << fg[i + 2].x << " " << fg[i + 3].x << " " << fg[i + 4].x << endl;
    }
    cout << "----------------" << endl;
    for (int i = 0; i < 24; i = i + 5)
    {
        cout << fig[i].x << " " << fig[i + 1].x << " " << fig[i + 2].x << " " << fig[i + 3].x << " " << fig[i + 4].x << endl;
    }
    cout << "----------------" << endl;

    int mem_size = sizeof(Complex) * SIZE;

    cufftComplex *d_signal;
    checkCudaErrors(cudaMalloc((void **) &d_signal, mem_size));
    checkCudaErrors(cudaMemcpy(d_signal, fg, mem_size, cudaMemcpyHostToDevice));

    cufftComplex *d_filter_kernel;
    checkCudaErrors(cudaMalloc((void **) &d_filter_kernel, mem_size));
    checkCudaErrors(cudaMemcpy(d_filter_kernel, fig, mem_size, cudaMemcpyHostToDevice));

    // cout << d_signal[1].x << endl;

    // CUFFT plan
    cufftHandle plan;
    cufftPlan2d(&plan, N, N, CUFFT_C2C);

    // Transform signal and filter
    printf("Transforming signal cufftExecR2C\n");
    cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_FORWARD);
    cufftExecC2C(plan, (cufftComplex *)d_filter_kernel, (cufftComplex *)d_filter_kernel, CUFFT_FORWARD);

    printf("Launching Complex multiplication<<< >>>\n");
    ComplexMUL<<<32, 256>>>(d_signal, d_filter_kernel);

    // Transform signal back
    printf("Transforming signal back cufftExecC2C\n");
    cufftExecC2C(plan, (cufftComplex *)d_signal, (cufftComplex *)d_signal, CUFFT_INVERSE);

    Complex *result = new Complex[SIZE];
    cudaMemcpy(result, d_signal, sizeof(Complex) * SIZE, cudaMemcpyDeviceToHost);

    for (int i = 0; i < SIZE; i = i + 5)
    {
        cout << result[i].x << " " << result[i + 1].x << " " << result[i + 2].x << " " << result[i + 3].x << " " << result[i + 4].x << endl;
    }

    delete[] result;
    delete[] fg;
    delete[] fig;
    cufftDestroy(plan);
    cudaFree(d_signal);
    cudaFree(d_filter_kernel);
}
The above code gives the following terminal output:
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
----------------
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
1 1 1 1 1
----------------
Transforming signal cufftExecR2C
Launching Complex multiplication<<< >>>
Transforming signal back cufftExecC2C
625 625 625 625 625
625 625 625 625 625
625 625 625 625 625
625 625 625 625 625
625 625 625 625 625

This gives me a 5x5 array with values 650: It actually reads 625, which is 5*5*5*5. The convolution algorithm you are using requires a supplemental divide by N*N. Indeed, in cuFFT there is no normalization coefficient in the forward transform. Hence, your convolution cannot be the simple multiplication of the two fields in the frequency domain. (Some would call it the mathematician's DFT and not the physicist's DFT.)
Furthermore, I am not allowed to print out the value of the signal after it has been copied onto the GPU memory: This is standard CUDA behavior. When memory is allocated on the device, the data exists in the device memory address space and cannot be accessed by the CPU without additional effort. Search for managed memory or zero-copy to have data accessible from both sides of the PCI Express bus (this is discussed in many other posts).

There are several problems here:
You are launching far too many threads for the size of the input arrays in the multiplication kernel, so that should be failing with out-of-bounds memory errors. I am surprised you are not receiving any sort of runtime error.
Your expected solution from the forward FFT → dot product → inverse FFT sequence is, I believe, incorrect. The correct solution would be a 5x5 matrix with 25 in each entry.
As clearly described in the cuFFT documentation, the library performs unnormalised FFTs:
cuFFT performs un-normalized FFTs; that is, performing a forward FFT on an input data set followed by an inverse FFT on the resulting set yields data that is equal to the input, scaled by the number of elements. Scaling either transform by the reciprocal of the size of the data set is left for the user to perform as seen fit.
So by my reckoning, the correct output for your code should be a 5x5 matrix with 625 in each entry, which would be normalised to a 5x5 matrix with 25 in each entry, i.e. the expected result. I don't understand how the problem at (1) isn't producing different results, as the multiplication kernel should be failing.
TLDR; nothing to see here, move along...

Just as an addition to the other things mentioned already: I think your complex multiplication kernel is not doing the right thing. You are overwriting a[i].x in the first line and then using the new value of a[i].x in the second line to calculate a[i].y. You need to make a backup of a[i].x before you overwrite it, something like:
float aReal_bk = a[i].x;
a[i].x = a[i].x * b[i].x - a[i].y * b[i].y;
a[i].y = aReal_bk * b[i].y + a[i].y * b[i].x;

Related

Pointer Exception while getting RGB values from (video) frame Intel Realsense

I'm trying to get the different RGB values from a frame with the Realsense SDK. This is for a 3D depth camera with RGB. According to https://github.com/IntelRealSense/librealsense/issues/3364 I need to use
int i = 100, j = 100; // fetch pixel 100,100
rs2::frame rgb = ...
auto ptr = (uint8_t*)rgb.get_data();
auto stride = rgb.as<rs2::video_frame>().stride();
cout << "R=" << ptr[3*(i * stride + j)];
cout << ", G=" << ptr[3*(i * stride + j) + 1];
cout << ", B=" << ptr[3*(i * stride + j) + 2];
In my code I'm getting a pointer exception when I try to get the values for pixel (x,y) = 1000,1000. With (x,y) = 100,100 it works every time. Error: Exception thrown: read access violation. ptr was 0x11103131EB9192A.
I set the enable_stream to cfg.enable_stream(RS2_STREAM_COLOR, WIDTH_COLOR_FRAME, HEIGTH_COLOR_FRAME, RS2_FORMAT_RGB8, 15); where in the .h file are:
#define WIDTH_COLOR_FRAME 1920
#define HEIGTH_COLOR_FRAME 1080
This is my code. Maybe it has something to do with the RS2_FORMAT_RGB8?
frameset frames = pl.wait_for_frames();
frame color = frames.get_color_frame();
uint8_t* ptr = (uint8_t*)color.get_data();
int stride = color.as<video_frame>().get_stride_in_bytes();
int i = 1000, j = 1000; // fetch pixel 1000,1000
cout << "R=" << int(ptr[3 * (i * stride + j)]);
cout << ", G=" << int(ptr[3 * (i * stride + j) + 1]);
cout << ", B=" << int(ptr[3 * (i * stride + j) + 2]);
cout << endl;
Thanks in advance!
stride is already in bytes (the length of one row in bytes), so multiplying the row offset by 3 is not required; only the column offset needs the factor of 3 for the three RGB channels.
cout << " R= " << int(ptr[i * stride + (3*j) ]);
cout << ", G= " << int(ptr[i * stride + (3*j) + 1]);
cout << ", B= " << int(ptr[i * stride + (3*j) + 2]);
I had the same problem, and even with the previous answers I still got segfaults.
I found out that when you do
uint8_t *ptr = (uint8_t*)color.get_data();
the RealSense SDK won't increase/track some internal reference count, and the pointer becomes invalid after some time, causing the segfaults.
My fix is to copy the content to a local buffer:
malloc a new buffer of the RGB frame size;
right after get_data(), copy the data into the new buffer.
That fixed all my issues.
All the best.

Efficient zero padding using cudaMemcpy3D

I would like to transfer a 3d array stored in linear memory on the host, into a larger (3D) array on the device. As an example (see below), I tried to transfer a (3x3x3) array into a (5x5x3) array.
I expect that on the host I get 2D slices with the following pattern:
x x x 0 0
x x x 0 0
x x x 0 0
0 0 0 0 0
0 0 0 0 0
where x are the values of my array. However, I get something like that, where y are the values of the next 2D slice:
x x x 0 0
x x x 0 0
x x x 0 0
y y y 0 0
y y y 0 0
According to the cudaMemcpy3D documentation, I would have expected the extent parameter to take the padding in the vertical axis into account, but apparently it does not.
Am I mistaken in the understanding of the documentation? If yes, is there any other way to perform this operation? The final size of the array to transfer will be 60x60x900 into an array of size 1100x1500x900. I use the zero padding to prepare a Fourier transform.
Here is the simplified code that I used:
cudaError_t cuda_status;
cudaPitchedPtr d_ptr;
cudaExtent d_extent = make_cudaExtent(sizeof(int)*5,sizeof(int)*5,sizeof(int)*3);
cudaExtent h_extent = make_cudaExtent(sizeof(int)*3,sizeof(int)*3,sizeof(int)*3);
int* h_array = (int*) malloc(27*sizeof(int));
int* h_result = (int*) malloc(512*sizeof(int)*5*3);
for (int i = 0; i<27; i++)
{
h_array[i] = i;
}
cuda_status = cudaMalloc3D(&d_ptr, d_extent);
cout << cudaGetErrorString(cuda_status) << endl;
cudaMemcpy3DParms myParms = {0};
myParms.extent = h_extent;
myParms.srcPtr.ptr = h_array;
myParms.srcPtr.pitch = 3*sizeof(int);
myParms.srcPtr.xsize = 3*sizeof(int);
myParms.srcPtr.ysize = 3*sizeof(int);
myParms.dstPtr = d_ptr;
myParms.kind = cudaMemcpyHostToDevice;
cuda_status = cudaMemcpy3D(&myParms);
cout << cudaGetErrorString(cuda_status) << endl;
cout << "Pitch: " << d_ptr.pitch << " / xsize:" << d_ptr.xsize << " / ysize:" << d_ptr.ysize << endl; // returns Pitch: 512 / xsize:20 / ysize:20 which is as expected
// Copy array to host to be able to print the values - may not be necessary
cout << cudaMemcpy(h_result, (int*) d_ptr.ptr, 512*5*3, cudaMemcpyDeviceToHost) << endl;
cout << h_result[128] << " " << h_result[3*128] << " " << h_result[5*128] << " " << endl; // output : 3 9 15 / expected 3 0 9
The problems here have to do with your extents and sizes.
When an extent is used with cudaMemcpy3D for the non-cudaArray case, it is intended to give the size of the region in bytes. A way to think about this is that the product of the three dimensions of the extent should yield the size of the region in bytes.
What you're doing however is scaling each of the 3 dimensions by the element size, which is not correct:
cudaExtent h_extent = make_cudaExtent(sizeof(int)*3,sizeof(int)*3,sizeof(int)*3);
^^^^^^^^^^^
this is the only element scaling expected
You've made a similar error here:
myParms.srcPtr.xsize = 3*sizeof(int); // correct
myParms.srcPtr.ysize = 3*sizeof(int); // incorrect
We only scale the x (width) dimension by the element size, we don't scale the y (height) or z (depth) dimensions.
I haven't fully verified your code, but with those 2 changes, your code produces the output you indicate is expected:
$ cat t1593.cu
#include <iostream>
using namespace std;
int main(){
cudaError_t cuda_status;
cudaPitchedPtr d_ptr;
cudaExtent d_extent = make_cudaExtent(sizeof(int)*5,5,3);
cudaExtent h_extent = make_cudaExtent(sizeof(int)*3,3,3);
int* h_array = (int*) malloc(27*sizeof(int));
int* h_result = (int*) malloc(512*sizeof(int)*5*3);
for (int i = 0; i<27; i++)
{
h_array[i] = i;
}
cuda_status = cudaMalloc3D(&d_ptr, d_extent);
cout << cudaGetErrorString(cuda_status) << endl;
cudaMemcpy3DParms myParms = {0};
myParms.extent = h_extent;
myParms.srcPtr.ptr = h_array;
myParms.srcPtr.pitch = 3*sizeof(int);
myParms.srcPtr.xsize = 3*sizeof(int);
myParms.srcPtr.ysize = 3;
myParms.dstPtr = d_ptr;
myParms.kind = cudaMemcpyHostToDevice;
cuda_status = cudaMemcpy3D(&myParms);
cout << cudaGetErrorString(cuda_status) << endl;
cout << "Pitch: " << d_ptr.pitch << " / xsize:" << d_ptr.xsize << " / ysize:" << d_ptr.ysize << endl; // returns Pitch: 512 / xsize:20 / ysize:20 which is as expected
// Copy array to host to be able to print the values - may not be necessary
cout << cudaMemcpy(h_result, (int*) d_ptr.ptr, d_ptr.pitch*5*3, cudaMemcpyDeviceToHost) << endl;
cout << h_result[128] << " " << h_result[3*128] << " " << h_result[5*128] << " " << endl; // output : 3 9 15 / expected 3 0 9
}
$ nvcc -o t1593 t1593.cu
$ cuda-memcheck ./t1593
========= CUDA-MEMCHECK
no error
no error
Pitch: 512 / xsize:20 / ysize:5
0
3 0 9
========= ERROR SUMMARY: 0 errors
$
I should also point out that the strided memcpy operations in CUDA (e.g. cudaMemcpy2D, cudaMemcpy3D) are not necessarily the fastest way to conduct such a transfer. You can find writeups of this characteristic in various questions about cudaMemcpy2D here on SO cuda tag.
The net of it is that it may be faster to transfer the data to the device in an unstrided, unpadded linear transfer, then write a CUDA kernel to take the data that is now on the device, and place it in the array of interest, with appropriate striding/padding.

Subtracting two integers causes integer-underflow in device code

In my CUDA device code I am doing a check where I subtract the blockDim from the thread's id to see whether or not the data I might want to use is in range. But when this number goes below 0 it seems to wrap around to the maximum value instead.
#include <iostream>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
float input[] =
{
1.5f, 2.5f, 3.5f,
4.5f, 5.5f, 6.5f,
7.5f, 8.5f, 9.5f,
};
__global__ void underflowCausingFunction(float* in, float* out)
{
int id = (blockDim.x * blockIdx.x) + threadIdx.x;
out[id] = id - blockDim.x;
}
int main()
{
float* in;
float* out;
cudaMalloc(&in, sizeof(float) * 9);
cudaMemcpy(in, input, sizeof(float) * 9, cudaMemcpyHostToDevice);
cudaMalloc(&out, sizeof(float) * 9);
underflowCausingFunction<<<3, 3>>>(in, out);
float recivedOut[9];
cudaMemcpy(recivedOut, out, sizeof(float) * 9, cudaMemcpyDeviceToHost);
cudaDeviceSynchronize();
std::cout << recivedOut[0] << " " << recivedOut[1] << " " << recivedOut[2] << "\n"
<< recivedOut[3] << " " << recivedOut[4] << " " << recivedOut[5] << "\n"
<< recivedOut[6] << " " << recivedOut[7] << " " << recivedOut[8] << "\n";
cudaFree(in);
cudaFree(out);
std::cin.get();
}
The output of this is:
4.29497e+09 4.29497e+09 4.29497e+09
0 1 2
3 4 5
I'm not sure why it's acting like an unsigned int.
If it is relevant, I am using a GTX 970 and the NVCC compiler that comes with the Visual Studio plugin. If somebody could explain what's happening or what I'm doing wrong that would be great.
The built-in variables like threadIdx and blockIdx are composed of unsigned quantities.
In C++, when you subtract an unsigned quantity from a signed integer quantity:
out[id] = id - blockDim.x;
the arithmetic that gets performed is unsigned arithmetic.
Since you want signed arithmetic (apparently) the correct thing to do is to make sure both quantities being subtracted are of signed type (let's use int in this case):
out[id] = id - (int)blockDim.x;

Opencv - RTrees algorithm : adding weight to class

I am using OpenCV's implementation of Random Forest algorithm (i.e. RTrees) and am facing a little problem when setting parameters.
I have 5 classes and 3 variables and I want to add weight to classes because the samples sizes for each classes vary a lot.
I took a look at the documentation here and here and it seems that the priors array is the solution, but when I try to give it 5 weights (for my 5 classes) it gives me the following error:
OpenCV Error: One of arguments' values is out of range (Every class weight should be positive) in CvDTreeTrainData::set_data, file /home/sguinard/dev/opencv-2.4.13/modules/ml/src/tree.cpp, line 644
terminate called after throwing an instance of 'cv::Exception'
what(): /home/sguinard/dev/opencv-2.4.13/modules/ml/src/tree.cpp:644: error: (-211) Every class weight should be positive in function CvDTreeTrainData::set_data
If I understand correctly, it's due to the fact that the priors array has 5 elements. When I try to give it only 3 elements (my number of variables), everything works.
According to the documentation, this array should be used to add weights to classes, but it actually seems to be used to add weights to variables...
So, does anyone know how to add weights to classes in OpenCV's RTrees algorithm? (I'm working with OpenCV 2.4.13 in C++.)
Thanks in advance!
Here is my code:
cv::Mat RandomForest(cv::Mat train_data, cv::Mat response_data, cv::Mat sample_data, int size, int size_predict, float weights[5])
{
#undef CV_TERMCRIT_ITER
#define CV_TERMCRIT_ITER 10
#define ATTRIBUTES_PER_SAMPLE 3
cv::RandomTrees RFTree;
float priors[] = {1,1,1};
CvRTParams RFParams = CvRTParams(25, // max depth
500, // min sample count
0, // regression accuracy: N/A here
false, // compute surrogate split, no missing data
5, // max number of categories (use sub-optimal algorithm for larger numbers)
//priors
weights, // the array of priors (use weights or priors)
true,//false, // calculate variable importance
2, // number of variables randomly selected at node and used to find the best split(s).
100, // max number of trees in the forest
0.01f, // forest accuracy
CV_TERMCRIT_ITER | CV_TERMCRIT_EPS // termination criteria
);
cv::Mat varIdx = cv::Mat();
cv::Mat vartype( train_data.cols + 1, 1, CV_8U );
vartype.setTo(cv::Scalar::all(CV_VAR_NUMERICAL));
vartype.at<uchar>(ATTRIBUTES_PER_SAMPLE, 0) = CV_VAR_CATEGORICAL;
cv::Mat sampleIdx = cv::Mat();
cv::Mat missingdatamask = cv::Mat();
for (int i=0; i!=train_data.rows; ++i)
{
for (int j=0; j!=train_data.cols; ++j)
{
if(train_data.at<float>(i,j)<0
|| train_data.at<float>(i,j)>10000
|| !float(train_data.at<float>(i,j)))
{train_data.at<float>(i,j)=0;}
}
}
// Training
std::cout << "Training ....." << std::flush;
bool train = RFTree.train(train_data,
CV_ROW_SAMPLE,//tflag,
response_data,//responses,
varIdx,
sampleIdx,
vartype,
missingdatamask,
RFParams);
if (train){std::cout << " Done" << std::endl;}
else{std::cout << " Failed" << std::endl;return cv::Mat();}
std::cout << "Variable Importance : " << std::endl;
cv::Mat VI = RFTree.getVarImportance();
for (int i=0; i!=VI.cols; ++i){std::cout << VI.at<float>(i) << " - " << std::flush;}
std::cout << std::endl;
std::cout << "Predicting ....." << std::flush;
cv::Mat predict(1,sample_data.rows,CV_32F);
float max = 0;
for (int i=0; i!=sample_data.rows; ++i)
{
predict.at<float>(i) = RFTree.predict(sample_data.row(i));
if (predict.at<float>(i)>max){max=predict.at<float>(i);/*std::cout << predict.at<float>(i) << "-"<< std::flush;*/}
}
// Personal test due to an error I got (every sample predicted as 0)
if (max==0){std::cout << " Failed ... Max value = 0" << std::endl;return cv::Mat();}
std::cout << " Done ... Max value = " << max << std::endl;
return predict;
}

Inverting matrices mod-26 with Eigen C++ library

I'm trying to write a program to crack a Hill cipher of arbitrary dimensions (MxM) in C++. Part of the process requires me to calculate the mod-26 inverse of a matrix.
For example, the modular inverse of 2x2 array
14 3
11 0
is
0 19
9 24
I have a function that can accomplish this for 2x2 arrays only, which is not sufficient. I know that calculating inverses on larger-dimension arrays is difficult, so I'm using the Eigen C++ library. However, the Eigen inverse() function gives me this as the inverse of the above matrix:
0.000 0.091
0.333 -0.424
How can I calculate the modular 26 inverse that I need for a matrix of any dimensions with Eigen?
Try this:
#include <iostream>
#include <functional>
#include <Eigen/Dense>
using namespace Eigen;
using namespace std;
int inverse_mod_26(int d)
{
// We're not going to use Euclidean Alg. or
// even Fermat's Little Theorem, but brute force
int base = 26, inv = 1;
while ( (inv < base) &&
(((d * ++inv) % 26) != 1)) {}
return inv;
}
int main(int argc, char **argv)
{
Matrix2d m, minv;
int inv_factor;
m << 14, 3, 15, 0;
double mdet = m.determinant();
minv = mdet * m.inverse();
transform(&minv.data()[0], &minv.data()[4], &minv.data()[0],
[](double d){ return static_cast<int>(d) % 26;});
if ((static_cast<int>(mdet) % 26) == 1) { /* no further modification */ }
else
{
inv_factor = inverse_mod_26(std::abs((m * minv)(0,0)));
if (inv_factor == 26)
{
cerr << "No inverse exists!" << endl;
return EXIT_FAILURE;
}
transform(&minv.data()[0], &minv.data()[4], &minv.data()[0],
[=](double d){ return static_cast<int>(d) * inv_factor;});
}
cout << "m = " << endl << m << endl;
cout << "minv = " << endl << minv << endl;
cout << "(m * minv) = " << endl << m * minv << endl;
return 0;
}
This is a 2x2 case, for base 26, but can easily be modified. The algorithm relies on modifying the normal matrix inverse, and can easily be explained, if you wish. If your original matrix has determinant (in the normal sense) that is not relatively prime to 26; i.e., if GCD(det(m), 26) != 1, then it will not have an inverse.
Tip: to avoid this problem, and the else clause above, pad your dictionary with three arbitrary characters, bringing the size to 29, which is prime, and will trivially satisfy the GCD property above.