Matrix calculation error appears when dimensions become large [duplicate] - c++

This question already has answers here:
C programming, why does this large array declaration produce a segmentation fault?
I am running code where I am simply creating 2 matrices: one matrix has dimensions arows x nsame and the other has dimensions nsame x bcols. The result is an array of dimensions arows x bcols. This is fairly simple to implement using BLAS, and the following code appears to work as intended when using the below master-slave model with OpenMPI:
#include <iostream>
#include <stdio.h>
#include <iostream>
#include <cmath>
#include <mpi.h>
#include <gsl/gsl_blas.h>
using namespace std;
int main(int argc, char** argv){
int noprocs, nid;
MPI_Status status;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &nid);
MPI_Comm_size(MPI_COMM_WORLD, &noprocs);
int master = 0;
const int nsame = 500; //must be same if matrices multiplied together = acols = brows
const int arows = 500;
const int bcols = 527; //works for 500 x 500 x 527 and 6000 x 100 x 36
int rowsent;
double buff[nsame];
double b[nsame*bcols];
double c[arows][bcols];
double CC[1*bcols]; //here ncols corresponds to numbers of rows for matrix b
for (int i = 0; i < bcols; i++){
CC[i] = 0.;
};
// Master part
if (nid == master ) {
double a [arows][nsame]; //creating identity matrix of dimensions arows x nsame (it is I if arows = nsame)
for (int i = 0; i < arows; i++){
for (int j = 0; j < nsame; j++){
if (i == j)
a[i][j] = 1.;
else
a[i][j] = 0.;
}
}
double b[nsame*bcols];//here ncols corresponds to numbers of rows for matrix b
for (int i = 0; i < (nsame*bcols); i++){
b[i] = (10.*i + 3.)/(3.*i - 2.) ;
};
MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD);
rowsent=0;
for (int i=1; i < (noprocs); i++) {
// Note A is a 2D array so A[rowsent]=&A[rowsent][0]
MPI_Send(a[rowsent], nsame, MPI_DOUBLE_PRECISION,i,rowsent+1,MPI_COMM_WORLD);
rowsent++;
}
for (int i=0; i<arows; i++) {
MPI_Recv(CC, bcols, MPI_DOUBLE_PRECISION, MPI_ANY_SOURCE, MPI_ANY_TAG,
MPI_COMM_WORLD, &status);
int sender = status.MPI_SOURCE;
int anstype = status.MPI_TAG; //row number+1
int IND_I = 0;
while (IND_I < bcols){
c[anstype - 1][IND_I] = CC[IND_I];
IND_I++;
}
if (rowsent < arows) {
MPI_Send(a[rowsent], nsame,MPI_DOUBLE_PRECISION,sender,rowsent+1,MPI_COMM_WORLD);
rowsent++;
}
else { // tell sender no more work to do via a 0 TAG
MPI_Send(MPI_BOTTOM,0,MPI_DOUBLE_PRECISION,sender,0,MPI_COMM_WORLD);
}
}
}
// Slave part
else {
MPI_Bcast(b,nsame*bcols, MPI_DOUBLE_PRECISION, master, MPI_COMM_WORLD);
MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
while(status.MPI_TAG != 0) {
int crow = status.MPI_TAG;
gsl_matrix_view AAAA = gsl_matrix_view_array(buff, 1, nsame);
gsl_matrix_view BBBB = gsl_matrix_view_array(b, nsame, bcols);
gsl_matrix_view CCCC = gsl_matrix_view_array(CC, 1, bcols);
/* Compute C = A B */
gsl_blas_dgemm (CblasNoTrans, CblasNoTrans, 1.0, &AAAA.matrix, &BBBB.matrix,
0.0, &CCCC.matrix);
MPI_Send(CC,bcols,MPI_DOUBLE_PRECISION, master, crow, MPI_COMM_WORLD);
MPI_Recv(buff,nsame,MPI_DOUBLE_PRECISION,master,MPI_ANY_TAG,MPI_COMM_WORLD,&status);
}
}
// output c here on master node //uncomment the below lines if I wish to see the output
// if (nid == master){
// if (rowsent == arows){
// // cout << rowsent;
// int IND_F = 0;
// while (IND_F < arows){
// int IND_K = 0;
// while (IND_K < bcols){
// cout << "[" << IND_F << "]" << "[" << IND_K << "] = " << c[IND_F][IND_K] << " ";
// IND_K++;
// }
// cout << "\n";
// IND_F++;
// }
// }
// }
MPI_Finalize();
//free any allocated space here
return 0;
};
Now what appears odd is that when I increase the size of the matrices (e.g. from nsame = 500 to nsame = 501), the code no longer works. I receive the following error:
mpirun noticed that process rank 0 with PID 0 on node Users-MacBook-Air exited on signal 11 (Segmentation fault: 11).
I have tried this with other combinations of sizes and there always appears to be an upper limit on the size of the matrices (which seems to vary depending on which dimension I change). I have also tried modifying the values of the matrices themselves, although this does not appear to change anything. I realize there are alternative ways to initialize the matrices in my example (e.g. using vector), but am simply wondering why my current scheme of multiplying matrices of arbitrary size only seems to work up to a certain point.

You're declaring too many large local variables, which is exhausting the stack. a, in particular, is 500 x 500 doubles (250,000 eight-byte elements, or about 2 MB), and b and c are even larger.
You'll need to dynamically allocate space for some or all of those arrays; a minimal sketch of moving them onto the heap follows below.
There might be a compiler or linker option to increase the initial stack size, but that isn't a good long-term solution.
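For illustration, here is a minimal self-contained sketch (not the full MPI program; the dimensions are just the ones from the question) of moving those arrays onto the heap with std::vector. The storage stays row-major and contiguous, so a row of a can still be sent with a pointer such as &a[rowsent * nsame]:
#include <cstdio>
#include <vector>
int main() {
    const int nsame = 501, arows = 500, bcols = 527;
    // heap-backed replacements for double a[arows][nsame], b[nsame*bcols], c[arows][bcols]
    std::vector<double> a(static_cast<size_t>(arows) * nsame, 0.0);
    std::vector<double> b(static_cast<size_t>(nsame) * bcols, 0.0);
    std::vector<double> c(static_cast<size_t>(arows) * bcols, 0.0);
    // row-major indexing: element (i, j) of a lives at a[i * nsame + j],
    // and row i starts at &a[i * nsame] (contiguous, like a[i] was before)
    for (int i = 0; i < arows; ++i)
        a[static_cast<size_t>(i) * nsame + i] = 1.0; // same identity-style fill as in the question
    std::printf("allocated %zu + %zu + %zu doubles on the heap\n", a.size(), b.size(), c.size());
    return 0;
}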

Related

finding global maxima of a function from comparing each processor's local maxima using MPI ring topology

I wish to use the MPI ring topology, passing each processor's maximum around the ring, comparing the local maxima, and then outputting the global maximum for all processors.
I am using a 10-dimensional Monte Carlo integration function. My first idea was to build an array holding each processor's local maximum, then pass those values around, compare them, and output the highest value. But I couldn't find an elegant way to build an array that takes only each processor's max value and stores it at the index corresponding to that processor's rank; that way I could also keep track of which processor produced the global maximum.
I haven't finished my code yet; right now I am interested in seeing whether such an array of local maxima can be created at all. The way I coded it is very time consuming, and if there are many processors I have to handle each one explicitly, and I still couldn't produce the array I am looking for.
I am sharing the code here:
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <mpi.h>
using namespace std;
//define multivariate function F(x1, x2, ...xk)
double f(double x[], int n)
{
double y;
int j;
y = 0.0;
for (j = 0; j < n-1; j = j+1)
{
y = y + exp(-pow((1-x[j]),2)-100*(pow((x[j+1] - pow(x[j],2)),2)));
}
y = y;
return y;
}
//define function for Monte Carlo Multidimensional integration
double int_mcnd(double(*fn)(double[],int),double a[], double b[], int n, int m)
{
double r, x[n], v;
int i, j;
r = 0.0;
v = 1.0;
// initial seed value (use system time)
//srand(time(NULL));
// step 1: calculate the common factor V
for (j = 0; j < n; j = j+1)
{
v = v*(b[j]-a[j]);
}
// step 2: integration
for (i = 1; i <= m; i=i+1)
{
// calculate random x[] points
for (j = 0; j < n; j = j+1)
{
x[j] = a[j] + (rand()) /( (RAND_MAX/(b[j]-a[j])));
}
r = r + fn(x,n);
}
r = r*v/m;
return r;
}
double f(double[], int);
double int_mcnd(double(*)(double[],int), double[], double[], int, int);
int main(int argc, char **argv)
{
int rank, size;
MPI_Init (&argc, &argv); // initializes MPI
MPI_Comm_rank (MPI_COMM_WORLD, &rank); // get current MPI-process ID. O, 1, ...
MPI_Comm_size (MPI_COMM_WORLD, &size); // get the total number of processes
/* define how many integrals */
const int n = 10;
double b[n] = {5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0,5.0};
double a[n] = {-5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0,-5.0};
double result, mean;
int m;
const unsigned int N = 5;
double max = -1;
double max_store[4];
cout.precision(6);
cout.setf(ios::fixed | ios::showpoint);
srand(time(NULL) * rank); // each MPI process gets a unique seed
m = 4; // initial number of intervals
// convert command-line input to N = number of points
//N = atoi( argv[1] );
for (unsigned int i=0; i <=N; i++)
{
result = int_mcnd(f, a, b, n, m);
mean = result/(pow(10,10));
if( mean > max)
{
max = mean;
}
//cout << setw(10) << m << setw(10) << max << setw(10) << mean << setw(10) << rank << setw(10) << size <<endl;
m = m*4;
}
//cout << setw(30) << m << setw(30) << result << setw(30) << mean <<endl;
printf("Process %d of %d mean = %1.5e\n and local max = %1.5e\n", rank, size, mean, max );
if (rank==0)
{
max_store[0] = max;
}
else if (rank==1)
{
max_store[1] = max;
}
else if (rank ==2)
{
max_store[2] = max;
}
else if (rank ==3)
{
max_store[3] = max;
}
for( int k = 0; k < 4; k++ )
{
printf( "%1.5e\n", max_store[k]);
}
//double max_store[4] = {4.43095e-02, 5.76586e-02, 3.15962e-02, 4.23079e-02};
double send_junk = max_store[0];
double rec_junk;
MPI_Status status;
// This next if-statement implements the ring topology
// the last process ID is size-1, so the ring topology is: 0->1, 1->2, ... size-1->0
// rank 0 starts the chain of events by passing to rank 1
if(rank==0) {
// only the process with rank ID = 0 will be in this block of code.
MPI_Send(&send_junk, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD); // send data to process 1
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, size-1, 0, MPI_COMM_WORLD, &status); // receive data from process size-1
}
else if( rank == size-1) {
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status); // receive data from process rank-1 (its "left" neighbor)
MPI_Send(&send_junk, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD); // send data to its "right neighbor", rank 0
}
else {
MPI_Recv(&rec_junk, 1, MPI_DOUBLE, rank-1, 0, MPI_COMM_WORLD, &status); // receive data from process rank-1 (its "left" neighbor)
MPI_Send(&send_junk, 1, MPI_DOUBLE, rank+1, 0, MPI_COMM_WORLD); // send data to its "right neighbor" (rank+1)
}
printf("Process %d send %1.5e\n and recieved %1.5e\n", rank, send_junk, rec_junk );
MPI_Finalize(); // programs should always perform a "graceful" shutdown
return 0;
}
Compile and run with:
mpiCC -o gd test_code.cpp
mpirun -np 4 ./gd
I would appreciate suggestions on the following:
Is there a more elegant way to build the array of local maxima?
How do I compare the local maxima and decide the global maximum while passing the values around the ring?
Also feel free to modify the code to give me a better example to work with. I would appreciate any suggestion. Thanks.
For this sort of thing, you are better off using either MPI_Reduce() or MPI_Allreduce() with MPI_MAX as the operator. The former computes the max over the values contributed by all processes and gives the result to the "root" process only, while the latter does the same but gives the result to all processes.
// Only process of rank 0 get the global max
MPI_Reduce( &local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD );
// All processes get the global max
MPI_Allreduce( &local_max, &global_max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );
// All processes get the global max, stored in place of the local max
// after the call ends - this might be the most interesting one for you
MPI_Allreduce( MPI_IN_PLACE, &max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD );
As you can see, you could just insert the 3rd example into your code to solve your problem.
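For context, here is a minimal self-contained sketch of the third (in-place) variant; the local max here is just a placeholder value derived from the rank rather than your Monte Carlo result:
#include <mpi.h>
#include <cstdio>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    double max = 0.01 * (rank + 1); // stand-in for the locally computed maximum
    // after this call, every rank's `max` holds the global maximum
    MPI_Allreduce(MPI_IN_PLACE, &max, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
    std::printf("Process %d of %d sees global max = %1.5e\n", rank, size, max);
    MPI_Finalize();
    return 0;
}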
BTW, unrelated remark, but this hurts my eyes:
if (rank==0)
{
max_store[0] = max;
}
else if (rank==1)
{
max_store[1] = max;
}
else if (rank ==2)
{
max_store[2] = max;
}
else if (rank ==3)
{
max_store[3] = max;
}
What about something like this:
if ( rank < 4 && rank >= 0 ) {
max_store[rank] = max;
}
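And if you really do want the array of every rank's local maximum collected on one process (the first thing you asked about), MPI_Gather builds it in a single call instead of the per-rank bookkeeping; a minimal sketch, again with a placeholder local_max:
#include <mpi.h>
#include <cstdio>
#include <vector>
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    double local_max = 1.0 + rank; // placeholder for the real local maximum
    std::vector<double> all_max(rank == 0 ? size : 0); // only the root needs the receive buffer
    // on the root, all_max[i] ends up holding rank i's value
    MPI_Gather(&local_max, 1, MPI_DOUBLE, all_max.data(), 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    if (rank == 0)
        for (int i = 0; i < size; ++i)
            std::printf("rank %d local max = %1.5e\n", i, all_max[i]);
    MPI_Finalize();
    return 0;
}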

C++ Advice on manipulating output Matrix data

I have the following code.
Essentially it is creating N random normal variables, and running through an equation M times for a simulation.
The output should be an N x M matrix of data; however, the only way I could do the calculation produces the output as M x N, i.e. each of the M runs should be a column, not a row.
I have attempted in vain to follow some of the other suggestions that have been posted on previous similar topics.
Code:
#include <iostream>
#include <time.h>
#include <random>
int main()
{
double T = 1; // End time period for simulation
int N = 4; // Number of time steps
int M = 2; // Number of simulations
double x0 = 1.00; // Starting x value
double mu = 0.00; // mu(x,t) value
double sig = 1.00; // sigma(x,t) value
double dt = T/N;
double sqrt_dt = sqrt(dt);
double** SDE_X = new double*[M]; // SDE Matrix setup
// Random Number generation setup
double RAND_N;
srand ((unsigned int) time(NULL)); // Generator loop reset
std::default_random_engine generator (rand());
std::normal_distribution<double> distribution (0.0,1.0); // Mean = 0.0, Variance = 1.0 ie Normal
for (int i = 0; i < M; i++)
{
SDE_X[i] = new double[N];
for (int j=0; j < N; j++)
{
RAND_N = distribution(generator);
SDE_X[i][0] = x0;
SDE_X[i][j+1] = SDE_X[i][j] + mu * dt + sig * RAND_N * sqrt_dt; // The SDE we wish to plot the path for
std::cout << SDE_X[i][j] << " ";
}
std::cout << std::endl;
}
std::cout << std::endl;
std::cout << " The simulation is complete!!" << std::endl;
std::cout << std::endl;
system("pause");
return 0;
}
Well, why can't you just create the transpose of your SDE_X matrix then? Isn't that what you want to get?
Keep in mind that presentation has nothing to do with implementation. Whether you access columns or rows is your decision. So you want an implementation of it transposed: then, quick and dirty, create your matrix first and then generate your number series, swapping the roles of i and j, and of N and M.
I say quick and dirty because the program as a whole has problems:
Why not keep it simple and use a better data structure for your matrix? If you know the size at compile time, use a plain array; otherwise use dynamic vectors at runtime. There may also be nicer implementations of a 2D array.
There is a bug, I think: you allocate N doubles per row but access indices 0 to N inclusive.
In every iteration you set index 0 to x0, which is also needless.
I would change your code a bit to make it clearer (a rough sketch follows below):
create your matrix first
initialize the first value of the matrix
provide an algorithm function that calculates a target cell, taking the matrix and the parameters
go through each cell and invoke your function for that cell
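A rough illustration of those steps (a sketch only, reusing N, M, x0, mu, sig from the question; how you factor out the per-cell update function is up to you), storing time steps as rows and simulations as columns so the printed matrix comes out the way you want:
#include <cmath>
#include <iostream>
#include <random>
#include <vector>
int main() {
    const int N = 4, M = 2; // time steps, simulations
    const double T = 1.0, x0 = 1.0, mu = 0.0, sig = 1.0;
    const double dt = T / N, sqrt_dt = std::sqrt(dt);
    std::random_device rd;
    std::default_random_engine generator(rd());
    std::normal_distribution<double> normal(0.0, 1.0);
    // (N+1) rows (time) by M columns (paths): SDE_X[j][i] is path i at step j
    std::vector<std::vector<double>> SDE_X(N + 1, std::vector<double>(M, x0));
    for (int i = 0; i < M; ++i) // each simulation path
        for (int j = 0; j < N; ++j) // each time step
            SDE_X[j + 1][i] = SDE_X[j][i] + mu * dt + sig * normal(generator) * sqrt_dt;
    for (int j = 0; j <= N; ++j) { // rows = time steps
        for (int i = 0; i < M; ++i) // columns = simulations
            std::cout << SDE_X[j][i] << " ";
        std::cout << "\n";
    }
    return 0;
}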
Thank you all for your input. I was able to implement my code and have it displayed as needed.
I added a second for loop to rearrange the matrix rows and columns.
Please feel free to let me know if you think there is any way I can improve it.
#include <iostream>
#include <time.h>
#include <random>
#include <vector>
int main()
{
double T = 1; // End time period for simulation
int N = 3; // Number of time steps
int M = 2; // Number of simulations
int X = 100; // Max number of matrix columns
int Y = 100; // Max number of matrix rows
double x0 = 1.00; // Starting x value
double mu = 0.00; // mu(x,t) value
double sig = 1.00; // sigma(x,t) value
double dt = T/N;
double sqrt_dt = sqrt(dt);
std::vector<std::vector<double>> SDE_X((M*N), std::vector<double>((M*N))); // SDE Matrix setup
// Random Number generation setup
double RAND_N;
srand ((unsigned int) time(NULL)); // Generator loop reset
std::default_random_engine generator (rand());
std::normal_distribution<double> distribution (0.0,1.0); // Mean = 0.0, Variance = 1.0 ie Normal
for (int i = 0; i <= M; i++)
{
SDE_X[i][0] = x0;
for (int j=0; j <= N; j++)
{
RAND_N = distribution(generator);
SDE_X[i][j+1] = SDE_X[i][j] + mu * dt + sig * RAND_N * sqrt_dt; // The SDE we wish to plot the path for
}
}
for (int j = 0; j <= N; j++)
{
for (int i = 0; i <=M; i++)
{
std::cout << SDE_X[i][j] << ", ";
}
std::cout << std::endl;
}
std::cout << std::endl;
std::cout << " The simulation is complete!!" << std::endl;
std::cout << std::endl;
system("pause");
return 0;
}

CUFFT : How to calculate the fft when the input is a pitched array

I'm trying to find the fft of a dynamically allocated array. The input array is copied from host to device using cudaMemcpy2D. Then the fft is taken (cufftExecR2C) and the results are copied back from device to host.
So my initial problem was how to use the pitch information in the fft. Then I found an answer here - CUFFT: How to calculate fft of pitched pointer?
But unfortunately it doesn't work. The results I get are garbage values. Given below is my code.
#define NRANK 2
#define BATCH 10
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cufft.h>
#include <stdio.h>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;
const size_t NX = 4;
const size_t NY = 6;
int main()
{
// Input array (static) - host side
float h_in_data_static[NX][NY] ={
{0.7943 , 0.6020 , 0.7482 , 0.9133 , 0.9961 , 0.9261},
{0.3112 , 0.2630 , 0.4505 , 0.1524 , 0.0782 , 0.1782},
{0.5285 , 0.6541 , 0.0838 , 0.8258 , 0.4427, 0.3842},
{0.1656 , 0.6892 , 0.2290 , 0.5383 , 0.1067, 0.1712}
};
// --------------------------------
// Input array (dynamic) - host side
float *h_in_data_dynamic = new float[NX*NY];
// Set the values
size_t h_ipitch;
for (int r = 0; r < NX; ++r) // this can be also done on GPU
{
for (int c = 0; c < NY; ++c)
{ h_in_data_dynamic[NY*r + c] = h_in_data_static[r][c]; }
}
// --------------------------------
// Output array - host side
float2 *h_out_data_temp = new float2[NX*(NY/2+1)] ;
// Input and Output array - device side
cufftHandle plan;
cufftReal *d_in_data;
cufftComplex * d_out_data;
int n[NRANK] = {NX, NY};
// Copy input array from Host to Device
size_t ipitch;
cudaError cudaStat1 = cudaMallocPitch((void**)&d_in_data,&ipitch,NY*sizeof(cufftReal),NX);
cout << cudaGetErrorString(cudaStat1) << endl;
cudaError cudaStat2 = cudaMemcpy2D(d_in_data,ipitch,h_in_data_dynamic,NY*sizeof(float),NY*sizeof(float),NX,cudaMemcpyHostToDevice);
cout << cudaGetErrorString(cudaStat2) << endl;
// Allocate memory for output array - device side
size_t opitch;
cudaError cudaStat3 = cudaMallocPitch((void**)&d_out_data,&opitch,(NY/2+1)*sizeof(cufftComplex),NX);
cout << cudaGetErrorString(cudaStat3) << endl;
// Perform the fft
int rank = 2; // 2D fft
int istride = 1, ostride = 1; // Stride lengths
int idist = 1, odist = 1; // Distance between batches
int inembed[] = {ipitch, NX}; // Input size with pitch
int onembed[] = {opitch, NX}; // Output size with pitch
int batch = 1;
cufftPlanMany(&plan, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, batch);
//cufftPlan2d(&plan, NX, NY , CUFFT_R2C);
cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE);
cufftExecR2C(plan, d_in_data, d_out_data);
cudaThreadSynchronize();
// Copy d_in_data back from device to host
cudaError cudaStat4 = cudaMemcpy2D(h_out_data_temp,(NY/2+1)*sizeof(float2), d_out_data, opitch, (NY/2+1)*sizeof(cufftComplex), NX, cudaMemcpyDeviceToHost);
cout << cudaGetErrorString(cudaStat4) << endl;
// Print the results
for (int i = 0; i < NX; i++)
{
for (int j =0 ; j< NY/2 + 1; j++)
printf(" %f + %fi",h_out_data_temp[i*(NY/2+1) + j].x ,h_out_data_temp[i*(NY/2+1) + j].y);
printf("\n");
}
cudaFree(d_in_data);
return 0;
}
I think the problem is in cufftPlanMany. How can I solve this issue?
You may want to study the advanced data layout section of the documentation carefully.
I think the previous question that was linked is somewhat confusing, because that question passes the width and height parameters in the reverse order from what I would expect for a CUFFT 2D plan. However, the answer then mimics that order, so it is at least consistent.
Secondly, you missed in the previous question that the "pitch" parameters being passed in inembed and onembed are not the same as the pitch values you receive from a cudaMallocPitch operation: they have to be converted from bytes to elements, i.e. divided by the size of a data element in the input and output data sets respectively. I'm actually not entirely sure this is the intended use of the inembed and onembed parameters, but it seems to work.
When I adjust your code to account for the above two changes, I seem to get valid results; at least they appear to be in a reasonable range. You've posted several questions now about 2D FFTs where you've said the results are not correct. I can't do these 2D FFTs in my head, so I suggest that in the future you indicate what output you are expecting.
This has the changes I made:
#define NRANK 2
#define BATCH 10
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <cufft.h>
#include <stdio.h>
#include <iomanip>
#include <iostream>
#include <vector>
using namespace std;
const size_t NX = 4;
const size_t NY = 6;
int main()
{
// Input array (static) - host side
float h_in_data_static[NX][NY] ={
{0.7943 , 0.6020 , 0.7482 , 0.9133 , 0.9961 , 0.9261},
{0.3112 , 0.2630 , 0.4505 , 0.1524 , 0.0782 , 0.1782},
{0.5285 , 0.6541 , 0.0838 , 0.8258 , 0.4427, 0.3842},
{0.1656 , 0.6892 , 0.2290 , 0.5383 , 0.1067, 0.1712}
};
// --------------------------------
// Input array (dynamic) - host side
float *h_in_data_dynamic = new float[NX*NY];
// Set the values
size_t h_ipitch;
for (int r = 0; r < NX; ++r) // this can be also done on GPU
{
for (int c = 0; c < NY; ++c)
{ h_in_data_dynamic[NY*r + c] = h_in_data_static[r][c]; }
}
// --------------------------------
int owidth = (NY/2)+1;
// Output array - host side
float2 *h_out_data_temp = new float2[NX*owidth] ;
// Input and Output array - device side
cufftHandle plan;
cufftReal *d_in_data;
cufftComplex * d_out_data;
int n[NRANK] = {NX, NY};
// Copy input array from Host to Device
size_t ipitch;
cudaError cudaStat1 = cudaMallocPitch((void**)&d_in_data,&ipitch,NY*sizeof(cufftReal),NX);
cout << cudaGetErrorString(cudaStat1) << endl;
cudaError cudaStat2 = cudaMemcpy2D(d_in_data,ipitch,h_in_data_dynamic,NY*sizeof(float),NY*sizeof(float),NX,cudaMemcpyHostToDevice);
cout << cudaGetErrorString(cudaStat2) << endl;
// Allocate memory for output array - device side
size_t opitch;
cudaError cudaStat3 = cudaMallocPitch((void**)&d_out_data,&opitch,owidth*sizeof(cufftComplex),NX);
cout << cudaGetErrorString(cudaStat3) << endl;
// Perform the fft
int rank = 2; // 2D fft
int istride = 1, ostride = 1; // Stride lengths
int idist = 1, odist = 1; // Distance between batches
int inembed[] = {NX, ipitch/sizeof(cufftReal)}; // Input size with pitch
int onembed[] = {NX, opitch/sizeof(cufftComplex)}; // Output size with pitch
int batch = 1;
if ((cufftPlanMany(&plan, rank, n, inembed, istride, idist, onembed, ostride, odist, CUFFT_R2C, batch)) != CUFFT_SUCCESS) cout<< "cufft error 1" << endl;
//cufftPlan2d(&plan, NX, NY , CUFFT_R2C);
if ((cufftSetCompatibilityMode(plan, CUFFT_COMPATIBILITY_NATIVE)) != CUFFT_SUCCESS) cout << "cufft error 2" << endl;
if ((cufftExecR2C(plan, d_in_data, d_out_data)) != CUFFT_SUCCESS) cout << "cufft error 3" << endl;
cudaDeviceSynchronize();
// Copy d_in_data back from device to host
cudaError cudaStat4 = cudaMemcpy2D(h_out_data_temp,owidth*sizeof(float2), d_out_data, opitch, owidth*sizeof(cufftComplex), NX, cudaMemcpyDeviceToHost);
cout << cudaGetErrorString(cudaStat4) << endl;
// Print the results
for (int i = 0; i < NX; i++)
{
for (int j =0 ; j< owidth; j++)
printf(" %f + %fi",h_out_data_temp[i*owidth + j].x ,h_out_data_temp[i*owidth + j].y);
printf("\n");
}
cudaFree(d_in_data);
return 0;
}

Unhandled Exception at Memory Location

I am new to OpenCV and I am trying to write simple code to get the mean of each block in an image. I wrote the following code; the build is OK, however debugging gives me an unhandled exception at a memory location. The exception is at the following line:
mean_img.at<double>(i/block_size, j/block_size) = mean_img.at<double>(i/block_size,j/block_size) + new_img.at<double>(i + x, j + y) / (mean);
So I would be grateful if anyone could give me some hints. Thanks in advance, and here is the whole code:
#include "opencv2/highgui/highgui.hpp" // Include Libs for OpenCV and Image Processing
#include <opencv2/opencv.hpp> // check that
#include "opencv2/core/core.hpp" // check that
#include <iostream> // Include Libs for C++
#include "opencv2/imgproc/imgproc.hpp" // Include Libs for OpenCV and Image Processing
#include <math.h>
using namespace cv; // namespace parameters not important in OpenCV2.4.6
using namespace std; // namespace parameters not important in OpenCV2.4.6
int main( int argc, const char** argv )
{
/*This part is to compute the parameters(block size, resize parameter) of the new_img*/
int resize_parameter; // resize parameter must be a multiple of 2
resize_parameter = 500;
int block_size; // resize parameter must be divisible by the block size
block_size = 50;
if ((resize_parameter % 2) != 0) resize_parameter = resize_parameter - (resize_parameter % 2);
while ((resize_parameter % block_size) != 0) block_size = block_size - 1;
int mean_size = resize_parameter/block_size; // this is the size of the mean matrix
int mean = block_size * block_size; // this number is used to get the mean of every element in the block
//int mean_img [mean_size][mean_size] = {}; // the mean image matrix initialized by zero
/*This part is to allocate the array with dynamic size*/
//int** mean_img = new int*[mean_size];
//for(int x = 0; x < mean_size; x++)
//mean_img[x] = new int[mean_size];
/*Then we can use the array*/
/*This part is to fill all the elements of the mean matrix with zeros*/
//memset(mean_img, 0, sizeof(mean_img[0][0]) * mean_size * mean_size);
/*This part is the definition of the matrices that are used for the images*/
Mat mean_img = Mat(mean_size,mean_size,CV_64FC4, cv::Scalar(0)); // define a new matrix with meansize*meansize elements to compute the mean
Mat mean_img_full = Mat(resize_parameter,resize_parameter,CV_64FC4, cv::Scalar(0)); // define a new matrix with resizeparameter*resizeparameter elements to compute the mean
Mat new_img = Mat(resize_parameter,resize_parameter,CV_64FC4); // define a new matrix with resize_parameter*resize_parameter elements
Mat original_img = imread("Desert.JPG", CV_LOAD_IMAGE_GRAYSCALE); //define a new matrix and read the image data in the file "Desert.JPG" and store it in 'original_img'
// notes: the location of the image must be in the same directory of the C++ file
if (original_img.empty()) //check whether the image is loaded or not
{
cout << "Error : Image cannot be loaded..!!" << endl;
//system("pause"); //wait for a key press
return -1;
}
// explicitly specify dsize=dst.size(); fx and fy will be computed from that.
// resize( src matrix, dst matrix, dst.size to get the size of the dst matrix, 0, 0 "to deal with the dst matrix size, may be 0.5 or any fraction from the src size, "AREA,CUBIC,LINEAR")
resize(original_img, new_img, new_img.size(), 0, 0, CV_INTER_AREA);
/*This part is to compute the mean of each block*/
for ( int i = 0; i < resize_parameter; i = i + block_size) // i represents the index of the row
{
for ( int j = 0; j < resize_parameter; j = j + block_size) // for the blocks in the same row with different columns
{
for ( int x = 0; x < block_size; x++) // x represents the index of the row
{
for ( int y = 0; y < block_size; y++) // y represents the index of the column
{
//cout << i ; //cout << "\n"; //cout << j ; //cout << "\n"; //cout << x ; //cout << "\n"; //cout << y ; //cout << "\n";
mean_img.at<double>(i/block_size, j/block_size) = mean_img.at<double>(i/block_size,j/block_size) + new_img.at<double>(i + x, j + y) / (mean);
}
}
}
}
/*This is the end of the part to compute the mean of each block*/
/*This part is to fill all the resize matrix with the mean value*/
for ( int x = 0; x < resize_parameter/block_size; x++) // x represents the index of the row in the mean matrix
{
for ( int y = 0; y < resize_parameter/block_size; y++) // y represents the index of the column in the mean matrix
{
for ( int i = 0; i < block_size; i++) // i represents the index of the row in the mean_full matrix
{
for ( int j = 0; j < block_size; j++) // j represents the index of the column in the mean_full matrix
{
mean_img_full.at<double>((x*block_size)+i,(y*block_size)+j) = mean_img.at<double>(x,y);
}
}
}
}
//cout << cv::getBuildInformation() << endl;
/*This is the end of the part to fill all the resize matrix with the mean value*/
namedWindow("OriginalImage", CV_WINDOW_AUTOSIZE); //create a window with the name "OriginalImage"
imshow("OriginalImage", original_img); //display the image which is stored in the 'original_img' in the "OriginalImage" window
namedWindow("NewImage", CV_WINDOW_AUTOSIZE); //create a window with the name "NewImage"
imshow("NewImage", new_img); //display the image which is stored in the 'new_img' in the "NewImage" window
namedWindow("MeanImage", CV_WINDOW_AUTOSIZE); //create a window with the name "MeanImage"
imshow("MeanImage", mean_img); //display the image which is stored in the 'mean_img' in the "MeanImage" window
namedWindow("MeanFullImage", CV_WINDOW_AUTOSIZE); //create a window with the name "MeanFullImage"
imshow("MeanFullImage", mean_img_full); //display the image which is stored in the 'mean_img_full' in the "MeanFullImage" window
waitKey(0); //wait infinite time for a keypress
destroyWindow("OriginalImage"); //destroy the window with the name, "OriginalImage"
destroyWindow("NewImage"); //destroy the window with the name, "NewImage"
destroyWindow("MeanImage"); //destroy the window with the name, "MeanImage"
destroyWindow("MeanFullImage"); //destroy the window with the name, "MeanImage"
return 0;
}
The problem was the type used in the definition of each matrix: it has to be 8-bit unsigned char, matching the grayscale image that imread loads, which also means the at<>() accesses need to use uchar rather than double. It is working now. Thanks a lot.
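For reference, a minimal sketch of how the block-mean loop might look with matching types (assumptions: an 8-bit grayscale input, the same 500/50 sizes as above, and a double accumulator matrix so at<double> stays valid there):
#include <opencv2/opencv.hpp>
#include <iostream>
int main()
{
    cv::Mat img = cv::imread("Desert.JPG", CV_LOAD_IMAGE_GRAYSCALE); // loads as CV_8UC1
    if (img.empty()) { std::cout << "Error : Image cannot be loaded..!!" << std::endl; return -1; }
    const int resize_parameter = 500, block_size = 50;
    cv::Mat new_img;
    cv::resize(img, new_img, cv::Size(resize_parameter, resize_parameter), 0, 0, CV_INTER_AREA); // stays CV_8UC1
    const double mean = block_size * block_size; // number of pixels per block
    cv::Mat mean_img(resize_parameter / block_size, resize_parameter / block_size, CV_64FC1, cv::Scalar(0));
    for (int i = 0; i < resize_parameter; i++)
        for (int j = 0; j < resize_parameter; j++)
            mean_img.at<double>(i / block_size, j / block_size) += new_img.at<uchar>(i, j) / mean; // uchar matches CV_8UC1
    std::cout << mean_img << std::endl; // block means in the range 0..255
    return 0;
}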

Gram-Schmidt Orthogonalization incorrect implementation

I'm in the process of building a free open source OpenGL3-based 3D game engine (it's not a school assignment, rather it's for personal skill development and to give something back to the open source community). I've reached the stage where I need to learn lots of related math, so I'm reading a great textbook called "Mathematics for 3D Game Programming and Computer Graphics, 3rd Edition".
I've hit a snag early on trying to do the book's exercises though, as my attempt at implementing the "Gram-Schmidt Orthogonalization algorithm" in C++ is outputting a wrong answer. I'm no math expert (although I'm trying to get better), and I have very limited experience looking at a math algorithm and translating it into code (limited to some stuff I learned from Udacity.com). Anyway, it would really help if someone could look at my incorrect code and give me a hint or a solution.
Here it is:
/*
The Gram-Schmidt Orthogonalization algorithm is as follows:
Given a set of n linearly independent vectors Beta = {e_1, e_2, ..., e_n},
the algorithm produces a set Beta' = {e_1', e_2', ..., e_n'} such that
dot(e_i', e_j') = 0 whenever i != j.
A. Set e_1' = e_1
B. Begin with the index i = 2 and k = 1
C. Subtract the projection of e_i onto the vectors e_1', e_2', ..., e_(i-1)'
from e_i, and store the result in e_i'. That is,
e_i' = e_i - sum_over_k( dot(e_i, e_k') / |e_k'|^2 * e_k' )
D. If i < n, increment i and loop back to step C.
*/
#include <iostream>
#include <glm/glm.hpp>
glm::vec3 sum_over_e(glm::vec3* e, glm::vec3* e_prime, int& i)
{
int k = 0;
glm::vec3 result;
while (k < i-2)
{
glm::vec3 e_prime_k_squared(pow(e_prime[k].x, 2), pow(e_prime[k].y, 2), pow(e_prime[k].z, 2));
result += (glm::dot(e[i], e_prime[k]) / e_prime_k_squared) * e_prime[k];
k++;
}
return result;
}
int main(int argc, char** argv)
{
int n = 2; // number of vectors we're working with
glm::vec3 e[] = {
glm::vec3(sqrt(2)/2, sqrt(2)/2, 0),
glm::vec3(-1, 1, -1),
glm::vec3(0, -2, -2)
};
glm::vec3 e_prime[n];
e_prime[0] = e[0]; // step A
int i = 0; // step B
do // step C
{
e_prime[i] = e[i] - sum_over_e(e, e_prime, i);
i++; // step D
} while (i-1 < n);
for (int loop_count = 0; loop_count <= n; loop_count++)
{
std::cout << "Vector e_prime_" << loop_count+1 << ": < "
<< e_prime[loop_count].x << ", "
<< e_prime[loop_count].y << ", "
<< e_prime[loop_count].z << " >" << std::endl;
}
return 0;
}
This code outputs:
Vector e_prime_1: < 0.707107, 0.707107, 0 >
Vector e_prime_2: < -1, 1, -1 >
Vector e_prime_3: < 0, -2, -2 >
but the correct answer is supposed to be:
Vector e_prime_1: < 0.707107, 0.707107, 0 >
Vector e_prime_2: < -1, 1, -1 >
Vector e_prime_3: < 1, -1, -2 >
Edit: Here's the code that produces the correct answer:
#include <iostream>
#include <glm/glm.hpp>
glm::vec3 sum_over_e(glm::vec3* e, glm::vec3* e_prime, int& i)
{
int k = 0;
glm::vec3 result;
while (k < i-1)
{
float e_prime_k_squared = glm::dot(e_prime[k], e_prime[k]);
result += ((glm::dot(e[i], e_prime[k]) / e_prime_k_squared) * e_prime[k]);
k++;
}
return result;
}
int main(int argc, char** argv)
{
int n = 3; // number of vectors we're working with
glm::vec3 e[] = {
glm::vec3(sqrt(2)/2, sqrt(2)/2, 0),
glm::vec3(-1, 1, -1),
glm::vec3(0, -2, -2)
};
glm::vec3 e_prime[n];
e_prime[0] = e[0]; // step A
int i = 0; // step B
do // step C
{
e_prime[i] = e[i] - sum_over_e(e, e_prime, i);
i++; // step D
} while (i < n);
for (int loop_count = 0; loop_count < n; loop_count++)
{
std::cout << "Vector e_prime_" << loop_count+1 << ": < "
<< e_prime[loop_count].x << ", "
<< e_prime[loop_count].y << ", "
<< e_prime[loop_count].z << " >" << std::endl;
}
return 0;
}
The problem is probably in the way you define e_k'^2. As far as vector math goes, the square of a vector is usually taken to be the square of its norm. Therefore,
double e_prime_k_squared = glm::dot(e_prime_k, e_prime_k);
Moreover, dividing by a vector is undefined (I wonder why GLM allows it?), so if e_k'^2 is a vector, the whole thing is undefined.
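As a quick sanity check (a self-contained sketch that hard-codes the expected e_prime vectors listed above), the pairwise dot products should all come out as zero:
#include <cmath>
#include <iostream>
#include <glm/glm.hpp>
int main()
{
    glm::vec3 e_prime[3] = {
        glm::vec3(std::sqrt(2.0f) / 2.0f, std::sqrt(2.0f) / 2.0f, 0.0f),
        glm::vec3(-1.0f, 1.0f, -1.0f),
        glm::vec3(1.0f, -1.0f, -2.0f)
    };
    for (int a = 0; a < 3; ++a)
        for (int b = a + 1; b < 3; ++b)
            std::cout << "dot(e_prime_" << a + 1 << ", e_prime_" << b + 1
                      << ") = " << glm::dot(e_prime[a], e_prime[b]) << std::endl;
    return 0;
}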