I am quite new to CUDA programming with C++, so sorry for this simple question. I simply cannot figure out where I am going wrong. I am trying to do a matrix multiplication; I have drawn inspiration from several sources, so I may have mixed up different methods. I am trying to multiply two matrices h_a and h_b. I successfully generate the two matrices, but when I allocate the memory for them I somehow lose the values in those matrices, and even after the multiplication all values are zero. Below is the code:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <ctime>
#include <stdio.h>
#include <iostream>
#include <math.h>
using namespace std;
__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
float tempsum;
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
if (row < P && col < P){
for (int i = 0; i < P; i++){
tempsum += a[row*P + i] * b[i*P + col];
}
}
c[row*P + col] = tempsum;
}
int main()
{
srand(time(NULL));
int *pointer;
int N = 16;
int SIZE = N*N;
int *h_a = new int[SIZE];
int *h_b = new int[SIZE];
int *h_c = new int[SIZE];
for (int i = 0; i < SIZE; i++) {
h_a[i] = rand() % 1000;
h_b[i] = rand() % 1000;
}
cout << "First values " << h_a[0] << " " << h_b[0] << endl;
cudaMalloc(&h_a, sizeof(int)*SIZE);
cudaMalloc(&h_b, sizeof(int)*SIZE);
cudaMalloc(&h_c, sizeof(int)*SIZE);
cudaMalloc(&pointer, sizeof(int));
cout << "Second values " << h_a[0] << " " << h_b[0] << endl;
cudaMemcpy(h_a, &h_a, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(h_b, &h_b, sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(pointer, &N, sizeof(int), cudaMemcpyHostToDevice);
cout << "Third values " << h_a[0] <<" "<< h_b[0] << endl;
MulKernel <<<1, 256 >>>(h_c, h_a, h_b, N);
cudaMemcpy(h_c, &h_c, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(h_a, &h_a, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(h_b, &h_b, sizeof(int), cudaMemcpyDeviceToHost);
for (int i = 0; i < 5; i++){
cout << h_c[i] << "=" << h_a[i] << h_b[i] << endl;
}
cout << h_c[1] << endl;
cudaFree(h_a);
cudaFree(h_b);
cudaFree(h_c);
return 0;
}
The output in the terminal reads:
First values 454 964
Second values 0 0
Third values 0 0
0=00
0=00
0=00
0=00
0=00
0
Press any key to continue . . .
I hope someone can spot the error(s). Best regards
There are quite a few issues with your code.
Any time you're having trouble with a CUDA code, I recommend proper CUDA error checking as well as running your code with cuda-memcheck. In this case, you've made programming errors that actually result in a seg fault, so these methods aren't that useful.
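By "proper CUDA error checking" I mean wrapping every runtime call and kernel launch in a check along these lines (a minimal sketch; the macro name cudaCheckErrors is just a convention, not part of the CUDA API):
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            exit(1); \
        } \
    } while (0)
// usage after an API call or kernel launch:
//   cudaMalloc(&d_a, sizeof(int)*SIZE);
//   cudaCheckErrors("cudaMalloc d_a failed");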
Your kernel is mostly workable. There are three issues. First, you are performing int multiplication but have declared your tempsum variable as float. That probably isn't a huge issue, but it is not consistent with your kernel. Second, you are not initializing tempsum (it should be set to zero). Third, your thread check (i.e. the if-statement) is slightly misplaced: the kernel should be conditioned so that out-of-bounds threads do not write to c at all.
You're probably confused about host and device variables. We don't allocate a host variable with new and then do a cudaMalloc operation on the same pointer. That's not how things work. We need to create an equivalent set of variables to store data on the device. Let's call those *d_a etc. We'll call cudaMalloc on those to allocate device space, then we'll use those in the cudaMemcpy operations as the device-side variables.
Your kernel is expecting a 2D thread array (so that the .x and .y built-in variables in the kernel have meaning), but you are defining the thread array using 1D variables. That needs to be fixed in your kernel launch (i.e. define a 2D array using dim3 variables). Likewise, the kernel launch should be given the d_a etc. variables, which are device pointers.
You may be confused about how to handle a variable like N when passing it to the kernel. We can pass that directly (by value) without any of the pointer gymnastics you have created.
You have the transfer sizes wrong in your cudaMemcpy operations. Like memcpy, you need to specify a transfer size in bytes, so most of your transfer sizes need to be multiplied by SIZE.
Here's a modified version of your code with the above issues addressed:
$ cat t1073.cu
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <ctime>
#include <stdio.h>
#include <iostream>
#include <math.h>
using namespace std;
__global__ void MulKernel(int *c, const int *a, const int *b, const int P)
{
int tempsum=0;
int row = blockIdx.y*blockDim.y + threadIdx.y;
int col = blockIdx.x*blockDim.x + threadIdx.x;
if (row < P && col < P){
for (int i = 0; i < P; i++){
tempsum += a[row*P + i] * b[i*P + col];
}
c[row*P + col] = tempsum;
}
}
int main()
{
srand(time(NULL));
int N = 16;
int SIZE = N*N;
int *h_a = new int[SIZE];
int *h_b = new int[SIZE];
int *h_c = new int[SIZE];
for (int i = 0; i < SIZE; i++) {
h_a[i] = rand() % 1000;
h_b[i] = rand() % 1000;
}
cout << "First values " << h_a[0] << " " << h_b[0] << endl;
int *d_a, *d_b, *d_c;
cudaMalloc(&d_a, sizeof(int)*SIZE);
cudaMalloc(&d_b, sizeof(int)*SIZE);
cudaMalloc(&d_c, sizeof(int)*SIZE);
cout << "Second values " << h_a[0] << " " << h_b[0] << endl;
cudaMemcpy(d_a, h_a, sizeof(int)*SIZE, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, sizeof(int)*SIZE, cudaMemcpyHostToDevice);
cout << "Third values " << h_a[0] <<" "<< h_b[0] << endl;
MulKernel <<<1, dim3(N,N) >>>(d_c, d_a, d_b, N);
cudaMemcpy(h_c, d_c, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
cudaMemcpy(h_a, d_a, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
cudaMemcpy(h_b, d_b, sizeof(int)*SIZE, cudaMemcpyDeviceToHost);
for (int i = 0; i < 5; i++){
cout << h_c[i] << "=" << h_a[i] << h_b[i] << endl;
}
cout << h_c[1] << endl;
cudaFree(d_a);
cudaFree(d_b);
cudaFree(d_c);
return 0;
}
$ nvcc -o t1073 t1073.cu
$ cuda-memcheck ./t1073
========= CUDA-MEMCHECK
First values 698 173
Second values 698 173
Third values 698 173
5502745=698173
5866060=120710
3945532=646669
4432346=582703
4971909=746272
5866060
========= ERROR SUMMARY: 0 errors
$
Personally, I can't interpret the output easily, and I'm not sure why you've chosen the = sign. For matrix multiplication, c[i] is not equal to a[i]*b[i], if that's what you were thinking. If you want a simple test that is easily understood visually, try setting both the a and b matrices to all 1. Then you can easily spot a correct output: it should be all N.
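A minimal sketch of that all-ones test, reusing the variables from the fixed code above:
for (int i = 0; i < SIZE; i++) {
    h_a[i] = 1;   // every dot product of a row of ones with a column of ones is N
    h_b[i] = 1;
}
// ... allocate, copy, launch, and copy back exactly as before, then:
for (int i = 0; i < SIZE; i++)
    if (h_c[i] != N) { cout << "mismatch at " << i << ": " << h_c[i] << endl; break; }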
Also note that, for brevity, I've not tried to teach you every aspect of CUDA programming in this question, just to fix some mistakes. As just one example, this code will break if you set N to a value larger than 32. You may need to learn more about CUDA programming to understand why that is.
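(For reference, the limit involved is the maximum number of threads per block, 1024 on current GPUs, which dim3(N,N) exceeds once N > 32. A sketch of a launch that scales past that, assuming 16x16 blocks and relying on the kernel's existing bounds check:)
dim3 block(16, 16);
// ceiling division so the grid of blocks covers the whole N x N matrix
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
MulKernel<<<grid, block>>>(d_c, d_a, d_b, N);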
For this homework problem, we need to create a new jagged array with the code provided by our professor, print the array, and calculate the max, min, and sum of the array's contents. We are only allowed to edit the createAndReturnJaggedArray() and printAndThenFindMaxMinSum(int**,int*,int*,int*) functions, as the rest of the code was provided for us so we could check that we get the correct output.
I'm able to get the program to run; however, after printing an initial string it terminates with the error terminate called after throwing an instance of 'std::bad_array_new_length' what(): std::bad_array_new_length. I believe the problem is in my creation of the jagged array and my allocation of memory for the columns part of the array; however, I used the notes we were given as a reference and have no idea where the problem is coming from. The entire program is provided below. Thanks for any help!
EDIT/NOTE: We haven't learned vectors yet so we're not allowed to use them.
#include <iostream>
#include <climits>
using namespace std;
class JaggedArray {
public:
int numRows;
int *numColumnsInEachRow;
JaggedArray() {
numRows = 11;
numColumnsInEachRow = new int[numRows];
for (int i = 0; i < numRows; i++) {
if (i <= numRows / 2) {
numColumnsInEachRow[i] = i + 1;
} else {
numColumnsInEachRow[i] = numRows - i;
}
}
readComputeWrite();
}
int **createAndReturnJaggedArray() { // COMPLETE THIS FUNCTION
int **A = new int*[numRows];
for(int i=0;i<numRows;i++){ //allocate columns in each row
A[i] = new int[numColumnsInEachRow[i]];
for(int j=0;j<numColumnsInEachRow[i];j++){
if(i <= numRows/2)
A[i][j] = (i + j);
else
A[i][j] = -1 * (i+j);
}
}
return A;
}
void printAndThenFindMinMaxSum(int **A, int *maxPtr, int *minPtr, int *sumPtr) { // COMPLETE THIS FUNCTION
maxPtr = new int[INT_MIN];
minPtr = new int[INT_MAX];
sumPtr = 0;
for(int i=0;i<numRows;i++){
for(int j=0;j<numColumnsInEachRow[i];j++){
//1. print array
if (j == (numColumnsInEachRow[i]-1))
cout << A[i][j] << endl;
else
cout << A[i][j] << " ";
//2. compute max, min, and sum
sumPtr += A[i][j];
if (A[i][j] > *maxPtr)
maxPtr = new int[A[i][j]];
if (A[i][j] < *minPtr)
minPtr = new int[A[i][j]];
}
}
}
void print(int max, int min, int sum) {
cout << endl;
cout << "Max is " << max << "\n";
cout << "Min is " << min << "\n";
cout << "Sum is " << sum << "\n";
}
void readComputeWrite() {
int max, min, sum;
int **A = createAndReturnJaggedArray();
cout << "*** Jagged Array ***" << endl;
printAndThenFindMinMaxSum(A, &max, &min, &sum);
print(max, min, sum);
}
};
int main() {
JaggedArray jaf;
return 0;
}
As #user4581301 hints at, your problem is in printAndThenFindMinMaxSum. Statements like maxPtr = new int[INT_MIN] try to allocate an array with a negative element count, and that is exactly what throws std::bad_array_new_length. Simply changing the function to the below solves your problem:
void printAndThenFindMinMaxSum(int **A, int &maxPtr, int &minPtr, int &sumPtr) { // COMPLETE THIS FUNCTION
maxPtr = INT_MIN;
minPtr = INT_MAX;
sumPtr = 0;
.
.
.
sumPtr += A[i][j];
if (A[i][j] > maxPtr)
maxPtr = A[i][j];
if (A[i][j] < minPtr)
minPtr = A[i][j];
}
}
}
We also need to change readComputeWrite to:
void readComputeWrite() {
int max, min, sum;
int **A = createAndReturnJaggedArray();
cout << "*** Jagged Array ***" << endl;
printAndThenFindMinMaxSum(A, max, min, sum);
print(max, min, sum);
}
I would also recommend renaming minPtr, maxPtr, and sumPtr to something more appropriate, as they are no longer pointers and now represent primitive values.
You will note that I changed the pointers to references, as this is a more natural fit for this type of operation. Passing by reference lets you operate on the passed value in a straightforward manner, without the tedious task of making sure you dereference at the appropriate times. It is also less error-prone.
Again, as #user4581301 shrewdly points out, the intent of this assignment was probably to deal with pointers. As such, there are a few things that need to be changed if the OP cannot use references. Observe:
void printAndThenFindMinMaxSum(int **A, int *maxPtr, int *minPtr, int *sumPtr) { // COMPLETE THIS FUNCTION
*maxPtr = INT_MIN; // Make sure to dereference before assigning
*minPtr = INT_MAX; // Make sure to dereference before assigning
*sumPtr = 0; // Make sure to dereference before assigning
for(int i=0;i<numRows;i++){
for(int j=0;j<numColumnsInEachRow[i];j++){
//1. print array
if (j == (numColumnsInEachRow[i]-1))
cout << A[i][j] << endl;
else
cout << A[i][j] << " ";
//2. compute max, min, and sum
*sumPtr += A[i][j]; // Make sure to dereference before assigning
if (A[i][j] > *maxPtr) // Make sure to dereference before comparing
*maxPtr = A[i][j]; // Make sure to dereference before assigning
if (A[i][j] < *minPtr) // Make sure to dereference before comparing
*minPtr = A[i][j]; // Make sure to dereference before assigning
}
}
}
And the readComputeWrite can stay unaltered from the OP's original attempt.
In the OP's code, they were mainly forgetting to dereference before assigning or comparing.
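A toy example of the distinction (mine, not from the OP's code):
int value = 0;
int *ptr = &value;
*ptr += 5;  // dereference first: adds 5 to the pointed-to int, so value == 5
// By contrast, "ptr += 5" is pointer arithmetic: it moves ptr five ints
// forward, which is what the OP's "sumPtr += A[i][j]" accidentally did.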
My Question:
I am looking for someone either to point out a mistake in the way I am attempting to implement zero-copy in CUDA, or to reveal a more 'behind the scenes' perspective on why the zero-copy method would not be faster than the memcpy method. By the way, I am performing my tests on NVIDIA's TK1 processor, using Ubuntu.
My problem has to do with efficiently using NVIDIA TK1's (physically) unified memory architecture with CUDA. There are 2 methods NVIDIA provides for GPU/CPU memory transfer abstraction.
Unified Memory abstraction (using cudaHostAlloc & cudaHostGetDevicePointer)
Explicit copies to the device and back from the device (using cudaMalloc() & cudaMemcpy())
Short description of my test code: I test the same CUDA kernel using both methods 1 and 2. I expected method 1 to be faster, given that there is no copy of the source data to the device and no copy of the result data from the device. However, the results were the opposite of my assumption (method 1 is 50% slower). Below is my code for this test:
#include <libfreenect/libfreenect.hpp>
#include <iostream>
#include <vector>
#include <cmath>
#include <pthread.h>
#include <cxcore.h>
#include <time.h>
#include <sys/time.h>
#include <memory.h>
///CUDA///
#include <cuda.h>
#include <cuda_runtime.h>
///OpenCV 2.4
#include <highgui.h>
#include <cv.h>
#include <opencv2/gpu/gpu.hpp>
using namespace cv;
using namespace std;
///The Test Kernel///
__global__ void cudaCalcXYZ( float *dst, float *src, float *M, int height, int width, float scaleFactor, int minDistance)
{
float nx,ny,nz, nzpminD, jFactor;
int heightCenter = height / 2;
int widthCenter = width / 2;
//int j = blockIdx.x; //Represents which row we are in
int index = blockIdx.x*width;
jFactor = (blockIdx.x - heightCenter)*scaleFactor;
for(int i= 0; i < width; i++)
{
nz = src[index];
nzpminD = nz + minDistance;
nx = (i - widthCenter )*(nzpminD)*scaleFactor;
ny = (jFactor)*(nzpminD);
//Solve for only Y matrix (height values)
dst[index++] = nx*M[4] + ny*M[5] + nz*M[6];
//dst[index++] = 1 + 2 + 3;
}
}
//Function fwd declarations
double getMillis();
double getMicros();
void runCudaTestZeroCopy(int iter, int cols, int rows);
void runCudaTestDeviceCopy(int iter, int cols, int rows);
int main(int argc, char **argv) {
//ZERO COPY FLAG (allows runCudaTestZeroCopy to run without fail)
cudaSetDeviceFlags(cudaDeviceMapHost);
//Runs kernel using explicit data copy to 'device' and back from 'device'
runCudaTestDeviceCopy(20, 640,480);
//Uses 'unified memory' cuda abstraction so device can directly work from host data
runCudaTestZeroCopy(20,640, 480);
std::cout << "Stopping test" << std::endl;
return 0;
}
void runCudaTestZeroCopy(int iter, int cols, int rows)
{
cout << "CUDA Test::ZEROCOPY" << endl;
int src_rows = rows;
int src_cols = cols;
int m_rows = 4;
int m_cols = 4;
int dst_rows = src_rows;
int dst_cols = src_cols;
//Create and allocate memory for host mats pointers
float *psrcMat;
float *pmMat;
float *pdstMat;
cudaHostAlloc((void **)&psrcMat, src_rows*src_cols*sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void **)&pmMat, m_rows*m_cols*sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void **)&pdstMat, dst_rows*dst_cols*sizeof(float), cudaHostAllocMapped);
//Create mats using host pointers
Mat src_mat = Mat(cvSize(src_cols, src_rows), CV_32FC1, psrcMat);
Mat m_mat = Mat(cvSize(m_cols, m_rows), CV_32FC1, pmMat);
Mat dst_mat = Mat(cvSize(dst_cols, dst_rows), CV_32FC1, pdstMat);
//configure src and m mats
for(int i = 0; i < src_rows*src_cols; i++)
{
psrcMat[i] = (float)i;
}
for(int i = 0; i < m_rows*m_cols; i++)
{
pmMat[i] = 0.1234;
}
//Create pointers to dev mats
float *d_psrcMat;
float *d_pmMat;
float *d_pdstMat;
//Map device to host pointers
cudaHostGetDevicePointer((void **)&d_psrcMat, (void *)psrcMat, 0);
//cudaHostGetDevicePointer((void **)&d_pmMat, (void *)pmMat, 0);
cudaHostGetDevicePointer((void **)&d_pdstMat, (void *)pdstMat, 0);
//Copy matrix M to device
cudaMalloc( (void **)&d_pmMat, sizeof(float)*4*4 ); //4x4 matrix
cudaMemcpy( d_pmMat, pmMat, sizeof(float)*m_rows*m_cols, cudaMemcpyHostToDevice);
//Additional Variables for kernels
float scaleFactor = 0.0021;
int minDistance = -10;
//Run kernel! //cudaSimpleMult( float *dst, float *src, float *M, int width, int height)
int blocks = src_rows;
const int numTests = iter;
double perfStart = getMillis();
for(int i = 0; i < numTests; i++)
{
//cudaSimpleMult<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_cols, src_rows);
cudaCalcXYZ<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_rows, src_cols, scaleFactor, minDistance);
cudaDeviceSynchronize();
}
double perfStop = getMillis();
double perfDelta = perfStop - perfStart;
cout << "Ran " << numTests << " iterations totaling " << perfDelta << "ms" << endl;
cout << " Average time per iteration: " << (perfDelta/(float)numTests) << "ms" << endl;
//Copy result back to host
//cudaMemcpy(pdstMat, d_pdstMat, sizeof(float)*src_rows*src_cols, cudaMemcpyDeviceToHost);
//cout << "Printing results" << endl;
//for(int i = 0; i < 16*16; i++)
//{
// cout << "src[" << i << "]= " << psrcMat[i] << " dst[" << i << "]= " << pdstMat[i] << endl;
//}
cudaFree(d_psrcMat);
cudaFree(d_pmMat);
cudaFree(d_pdstMat);
cudaFreeHost(psrcMat);
cudaFreeHost(pmMat);
cudaFreeHost(pdstMat);
}
void runCudaTestDeviceCopy(int iter, int cols, int rows)
{
cout << "CUDA Test::DEVICE COPY" << endl;
int src_rows = rows;
int src_cols = cols;
int m_rows = 4;
int m_cols = 4;
int dst_rows = src_rows;
int dst_cols = src_cols;
//Create and allocate memory for host mats pointers
float *psrcMat;
float *pmMat;
float *pdstMat;
cudaHostAlloc((void **)&psrcMat, src_rows*src_cols*sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void **)&pmMat, m_rows*m_cols*sizeof(float), cudaHostAllocMapped);
cudaHostAlloc((void **)&pdstMat, dst_rows*dst_cols*sizeof(float), cudaHostAllocMapped);
//Create pointers to dev mats
float *d_psrcMat;
float *d_pmMat;
float *d_pdstMat;
cudaMalloc( (void **)&d_psrcMat, sizeof(float)*src_rows*src_cols );
cudaMalloc( (void **)&d_pdstMat, sizeof(float)*src_rows*src_cols );
cudaMalloc( (void **)&d_pmMat, sizeof(float)*4*4 ); //4x4 matrix
//Create mats using host pointers
Mat src_mat = Mat(cvSize(src_cols, src_rows), CV_32FC1, psrcMat);
Mat m_mat = Mat(cvSize(m_cols, m_rows), CV_32FC1, pmMat);
Mat dst_mat = Mat(cvSize(dst_cols, dst_rows), CV_32FC1, pdstMat);
//configure src and m mats
for(int i = 0; i < src_rows*src_cols; i++)
{
psrcMat[i] = (float)i;
}
for(int i = 0; i < m_rows*m_cols; i++)
{
pmMat[i] = 0.1234;
}
//Additional Variables for kernels
float scaleFactor = 0.0021;
int minDistance = -10;
//Run kernel! //cudaSimpleMult( float *dst, float *src, float *M, int width, int height)
int blocks = src_rows;
double perfStart = getMillis();
for(int i = 0; i < iter; i++)
{
//Copy from host to device
cudaMemcpy( d_psrcMat, psrcMat, sizeof(float)*src_rows*src_cols, cudaMemcpyHostToDevice);
cudaMemcpy( d_pmMat, pmMat, sizeof(float)*m_rows*m_cols, cudaMemcpyHostToDevice);
//Run Kernel
//cudaSimpleMult<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_cols, src_rows);
cudaCalcXYZ<<<blocks,1>>>(d_pdstMat, d_psrcMat, d_pmMat, src_rows, src_cols, scaleFactor, minDistance);
//Copy from device to host
cudaMemcpy( pdstMat, d_pdstMat, sizeof(float)*src_rows*src_cols, cudaMemcpyDeviceToHost);
}
double perfStop = getMillis();
double perfDelta = perfStop - perfStart;
cout << "Ran " << iter << " iterations totaling " << perfDelta << "ms" << endl;
cout << " Average time per iteration: " << (perfDelta/(float)iter) << "ms" << endl;
cudaFree(d_psrcMat);
cudaFree(d_pmMat);
cudaFree(d_pdstMat);
cudaFreeHost(psrcMat);
cudaFreeHost(pmMat);
cudaFreeHost(pdstMat);
}
//Timing functions for performance measurements
double getMicros()
{
timespec ts;
//double t_ns, t_s;
long t_ns;
double t_s;
clock_gettime(CLOCK_MONOTONIC, &ts);
t_s = (double)ts.tv_sec;
t_ns = ts.tv_nsec;
//return( (t_s *1000.0 * 1000.0) + (double)(t_ns / 1000.0) );
return ((double)t_ns / 1000.0);
}
double getMillis()
{
timespec ts;
double t_ns, t_s;
clock_gettime(CLOCK_MONOTONIC, &ts);
t_s = (double)ts.tv_sec;
t_ns = (double)ts.tv_nsec;
return( (t_s * 1000.0) + (t_ns / 1000000.0) );
}
I have already seen the post Cuda zero-copy performance, but I feel it is not relevant here, for the following reason: on the TK1 the GPU and CPU have a physically unified memory architecture.
Thanks
When you use zero-copy, a read from memory goes through a path that queries the memory unit to fetch data from system memory. That operation has some latency.
When using direct access to memory, the memory unit gathers data from global memory, and has a different access pattern and latency.
Actually seeing this difference would require some form of profiling.
Nonetheless, your kernel launch uses only a single thread per block:
cudaCalcXYZ<<< blocks,1 >>> (...
In this case, the GPU has little ability to hide latency when memory is gathered from system memory (or global memory). I would recommend using more threads (some multiple of 64, at least 128 in total) and running the profiler on it to get the cost of memory access. Your algorithm seems separable, so modifying the code from
for(int i= 0; i < width; i++)
to
for (int i = threadIdx.x ; i < width ; i += blockDim.x)
will probably increase performance overall.
The image width is 640, which turns into 5 iterations of 128 threads each.
cudaCalcXYZ<<< blocks,128 >>> (...
I believe this would result in some performance increase.
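Putting those pieces together, the modified kernel might look like this (a sketch under the suggestion above, not a measured result; the arithmetic is unchanged, but the 128 threads of each block now share the columns of that block's row):
__global__ void cudaCalcXYZ( float *dst, float *src, float *M, int height, int width, float scaleFactor, int minDistance)
{
    int heightCenter = height / 2;
    int widthCenter = width / 2;
    float jFactor = (blockIdx.x - heightCenter)*scaleFactor;
    for (int i = threadIdx.x; i < width; i += blockDim.x)
    {
        int index = blockIdx.x*width + i;   // element (row = blockIdx.x, col = i)
        float nz = src[index];
        float nzpminD = nz + minDistance;
        float nx = (i - widthCenter)*(nzpminD)*scaleFactor;
        float ny = jFactor*(nzpminD);
        dst[index] = nx*M[4] + ny*M[5] + nz*M[6];
    }
}
// launched as: cudaCalcXYZ<<<blocks,128>>>(d_pdstMat, d_psrcMat, d_pmMat, src_rows, src_cols, scaleFactor, minDistance);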
The zero-copy feature allows a kernel to work on data without our manually copying it to device memory with something like cudaMemcpy. Zero-copy only passes a host address to the device, and the kernel reads and writes through that address directly. So the more thread blocks you launch for the kernel, the more data is read and written concurrently through those host addresses. As a result, you get a better performance gain than if you launch only a few thread blocks for the kernel.
I am trying to get an understanding of how to work with matrices in C++. The code at the bottom is supposed to take an input matrix and return the places where there are 0s. However, I am getting the following errors:
matrix.cpp:47:3: error: no matching function for call to 'make_zero' make_zero(i,j,l);
^~~~~~~~~
matrix.cpp:8:6: note: candidate function not viable: no known conversion from 'double [i][j]' to
'double (*)[col]' for 3rd argument
void make_zero(int row, int col, double matrix[row][col])
^
1 error generated.
when I try to run the following code:
// Matrix
#include <iostream>
#include <stdio.h>
using namespace std;
void make_zero(int row, int col, double matrix[row][col])
{
int k,l;
for(k=0;k<row;k++)
for(l=0;l<col;l++)
{
if(matrix[k][l]==0)
printf("%d %d\n",k,l);
}
}
int main ()
{
int i = 0,j = 0;
cout << "Enter no of rows of the matrix";
cin >> i;
cout << "Enter no of columns of the matrix";
cin >> j;
double l[i][j];
int p = 0, q = 0;
while (p < i) {
while (q < j) {
cout << "Enter the" << p + 1 << "*" << q + 1 << "entry";
cin >> l[p][q];
q = q + 1;
}
p = p + 1;
q = 0;
}
cout << l << "\n";
make_zero(i,j,l);
}
Any help would be appreciated. Thanks.
There are a bunch of ways to do this with pointers. The most common is
void make_zero(int row, int col, double ** matrix)
defines a pointer (usually rows) to a pointer (usually columns). Unfortunately
double l[i][j];
does not define a pointer to a pointer. If this syntax is supported by the compiler (the C++ standard does not require variable-length arrays to be allowed), it most likely defines one contiguous allocation (as if you had written double l[i*j];) and hides the indexing arithmetic used to address it in two dimensions. Either way, it can't be passed to a double ** because it isn't a double **.
Trying to pass it as an array is also troublesome:
void make_zero(int row, int col, double matrix[][NUMBER_OF_COLUMNS])
The number of columns in the array must be known at compile time to perform the indexing arithmetic, and it has to be baked into any function the array is passed to. This means the number of columns cannot be changed at run time, because that would render the indexing used by the function invalid.
Getting around this would require compiler extensions that drive it further and further from the C++ standard. That's a bad idea, since there are a number of simple ways to call functions with multi-dimensional data. Most depend on arrays of arrays or std::vectors of std::vectors.
And when it comes to these solutions, as far as I'm concerned, the best is don't. I'm not going to cover them.
None of the arrays representing a dimension are guaranteed to be anywhere close to the others in memory, and this limits the CPU's ability to read and cache. Without caching and being able to look ahead, a modern CPU is at a serious performance disadvantage. (Read for more information: Why is it faster to process a sorted array than an unsorted array?)
So what you want is a 1D array, and those are easy to pass around. The indexing math is also easy: row number * number of columns + column number. But you need to pass at least the number of columns around as well. Rather than scattering that book-keeping around like this:
void make_zero(int row, int col, std::vector<double> matrix)
make a wrapper class like this:
class Matrix
{
private:
std::vector<double> myArray;
size_t nrRows;
size_t nrColumns;
public:
Matrix(size_t rows, size_t columns) :
myArray(rows * columns), // allocate vector to store matrix.
nrRows(rows),
nrColumns(columns)
{
}
size_t getNrRows() const
{
return nrRows;
}
size_t getNrColumns() const
{
return nrColumns;
}
// gets value at row, column and returns a reference so caller can
// modify the value
double& operator()(size_t row, size_t column)
{
// note: No sanity check for row >= nrRows or column > nrColumns
return myArray[row * nrColumns + column];
}
// gets value at row, column and returns a copy so caller cannot
// change the contents of the Matrix
double operator()(size_t row, size_t column) const
{
return myArray[row * nrColumns + column];
}
};
Using the vector gets around a number of common pointer-to-array problems by managing its own memory. No destructor is required, and Matrix can be copied and moved without special handling, because vector performs all that heavy lifting for us.
And as a usage example, let's make a function that prints the matrix out:
std::ostream & operator<<(std::ostream & out, const Matrix & in)
{
for (size_t i = 0; i < in.getNrRows(); i++)
{
for (size_t j = 0; j < in.getNrColumns(); j++)
{
out << in(i,j) << ' ';
}
out << "\n";
}
return out;
}
And modifying the OP's main function to use Matrix, we get:
int main()
{
int i = 0, j = 0;
cout << "Enter no of rows of the matrix";
cin >> i;
cout << "Enter no of columns of the matrix";
cin >> j;
Matrix matrix(i,j);
int p = 0, q = 0;
while (p < i)
{
while (q < j)
{
cout << "Enter the" << p + 1 << "*" << q + 1 << "entry";
cin >> matrix(p,q);
q = q + 1;
}
p = p + 1;
q = 0;
}
cout << matrix << "\n";
make_zero(matrix);
}
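One loose end: the modified main still calls make_zero(matrix), so the OP's function has to be adapted to the Matrix class as well. A possible sketch (my adaptation, declared before main, keeping the original behavior of printing the positions of zeroes):
void make_zero(const Matrix & matrix)
{
    for (size_t k = 0; k < matrix.getNrRows(); k++)
        for (size_t l = 0; l < matrix.getNrColumns(); l++)
        {
            if (matrix(k, l) == 0)
                cout << k << " " << l << "\n";
        }
}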
void make_zero(int row, int col, double ** matrix)
Note that you also need to pass the size of the matrix separately.
Also you can use
std::vector<std::vector<double> >
instead, and pass this object by reference, by pointer, or just make a copy.
Actually, that works, but there is also a problem in this line:
double l[i][j];
i and j are unknown at compile time.
You have two ways:
1) dynamically allocate the memory (a sketch of this is at the end of this answer), or
2) use std::vector<std::vector<double> >. The sized constructor already value-initializes the elements to zero, but you can also do it manually, like this:
#include <iostream>
#include <vector>
void make_zero(std::vector<std::vector<double> > & to_zero) {
for (int i = 0; i < to_zero.size(); ++i) {
for (int j = 0; j < to_zero[i].size(); ++j) {
to_zero[i][j] = 0;
}
}
}
void print_double_vector(const std::vector<std::vector<double> > & to_print) {
for (int i = 0; i < to_print.size(); ++i) {
for (int j = 0; j < to_print[i].size(); ++j) {
std::cout << to_print[i][j] << " ";
}
std::cout << std::endl;
}
std::cout << std::endl;
}
int main() {
// your code goes here
int n, m;
std::cin >> n >> m;
std::vector<std::vector<double> > d(n, std::vector<double>(m));
print_double_vector(d);
make_zero(d);
print_double_vector(d);
return 0;
}
http://ideone.com/0X53Yj
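For completeness, a sketch of option 1 (manual dynamic allocation with n rows and m columns, matching the example above; unlike std::vector, you must free it yourself):
double **l = new double*[n];          // array of row pointers
for (int i = 0; i < n; ++i)
    l[i] = new double[m]();           // () value-initializes each row to zeros
// ... use l[p][q] as usual ...
for (int i = 0; i < n; ++i)           // release in reverse order
    delete[] l[i];
delete[] l;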
I've passed a 2D array from a C++ class to a CUDA function; however, once in the CUDA function the data in the matrix is gone. I'm still on the host, not the device, so I don't understand what I've done wrong, as this should be very straightforward.
Here is the C++
int main()
{
const int row=8;
const int column=8;
int rnum;
srand(time(0));
rnum = (rand() % 100) + 1;
float table[row][column];
for(int r=0; r<row; r++){
for(int c=0; c<column;c++){
table[row][column] = (rand()%100) + 1.f;
cout << table[row][column] << " ";
}
cout << "\n";
}
//CUDA
handleMatrix(&table[0][0], 8);
}
Here is the CUDA code that is just printing out the matrix.
void handleMatrix(float * A, int size)
{
printf("&A[0]=%i\n",&A);
printf("A[0] is %f \n",A[0]);
for(int j=0; j<size; j++){
for(int k=0; k<size;k++){
printf("%f ",A[j +size*k]); // << " ";
}
printf("\n");
}
}
In the C++ file, the printout of the matrix shows real numbers, but the CUDA function just prints out 0's for both the matrix and the address of A[0]. I don't know if this means I'm not passing the matrix correctly between the two, or if there is something I should do with the matrix once I get it to the CUDA function.
Ha, needed a while to find it. Check the indexing in your matrix randomization code. :) You're using the wrong variables and never initialize the float values.
float * A is a pointer in host memory, not in device space. Use cudaMalloc + cudaMemcpy.
float * A doesn't pass the contents, only the address.
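For what it's worth, handleMatrix as posted runs on the host, where the passed pointer is already valid, so the indexing bug pointed out above is the actual culprit. But if the data is eventually meant to feed a kernel, the copy these answers describe might be sketched as:
float *d_A;
cudaMalloc(&d_A, size * size * sizeof(float));
cudaMemcpy(d_A, A, size * size * sizeof(float), cudaMemcpyHostToDevice);
// ... launch a kernel that reads/writes d_A ...
cudaMemcpy(A, d_A, size * size * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_A);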
I have the following code for allocating a two-dimensional array:
#include <iostream>
using namespace std;
int **malloc2d(int r,int c){
int **t=new int*[r];
for (int i=0;i<r;i++)
t[i]=new int[c];
for (int i=0;i<r;i++){
for (int j=0;j<c;j++){
t[i][j]=i+j;
}
}
return t;
}
int main(){
int m=10;
int n=10;
int **a=malloc2d(m,n);
for (int i=0;i<m;i++){
for (int j=0;j<n;j++){
cout<<a[i][j]<< " ";
cout<< " \n";
}
cout<< " \n";
}
return 0;
}
It works, but my question is: how good is this code in terms of performance efficiency or code speed? Thanks.
With an int ** you have many small, separately allocated memory blocks (one per row, plus the array of row pointers), which is inefficient due to allocation overhead: every malloc/new implementation has a per-allocation overhead, normally at least sizeof(void*) AFAIK, and the blocks are not guaranteed to be contiguous.
As an alternative, you could use a one-dimensional array and calculate the indexes yourself, like this: index = (row * num_columns) + column. You would lose the nice a[row][column] notation, though. Still, it should also be faster to access, because your (clean) solution needs two pointer dereferences (memory operations) per element while the suggested way needs only one. It would look something like this:
#include <iostream>
using namespace std;
inline int a_index(int row, int column, int column_size) {
return((row * column_size) + column);
}
int *malloc2d(int r,int c) {
int *t=new int[r * c];
for (int i=0;i<r;i++){
for (int j=0;j<c;j++){
t[a_index(i,j,c)]=i+j;
}
}
return t;
}
int main(){
int m=10;
int n=10;
int *a=malloc2d(m, n);
for (int i=0;i<m;i++){
for (int j=0;j<n;j++){
cout<<a[a_index(i,j,n)]<< " ";
cout<< " \n";
}
cout<< " \n";
}
return 0;
}
I assume you plan to add delete[], or the program will terminate before leakage matters.
Anyway, it won't be very efficient.
First, the array will be composed of non-contiguous blocks of memory. That makes it harder for the machine's memory subsystem to handle.
Second, some extra space is being wasted to hold the array of pointers.
Just do it the old fashioned way:
int *a = new int[ r * c ];
or with vector
std::vector<int> a( r * c );
and compute indexes as ever:
cout << a[ i * c + j ] << ' ';
However, since you are looping over the entire array, you could ignore the two-dimensionality except for formatting:
for ( int i = 0; i < r * c; ++ i ) {
cout << a[ i ] << ' ';
if ( i % c == c-1 ) cout << '\n';
}
If you don't delete the memory that you have allocated by using new, then you will leak memory.
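A sketch of that cleanup for the OP's malloc2d layout (free the rows first, then the array of row pointers):
for (int i = 0; i < m; i++)
    delete[] a[i];   // each row
delete[] a;          // then the row-pointer array
// The flat versions above need only a single delete[] (or nothing at all
// for the std::vector variant, which cleans up automatically).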