I am trying to write a program for matrix calculations using C/CUDA.
I have the following program:
In main.cu
#include <cuda.h>
#include <iostream>
#include "teste.cuh"
using std::cout;
int main(void)
{
const int Ndofs = 2;
const int Nel = 4;
double *Gh = new double[Ndofs*Nel*Ndofs*Nel];
double *Gg;
cudaMalloc((void**)& Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel);
for (int ii = 0; ii < Ndofs*Nel*Ndofs*Nel; ii++)
Gh[ii] = 0.;
cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice);
integraG<<<256, 256>>>(Nel, Gg);
cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost);
for (int ii = 0; ii < Ndofs*Nel*Ndofs*Nel; ii++)
cout << ii + 1 << " " << Gh[ii] << "\n";
return 0;
}
In teste.cuh
#ifndef TESTE_CUH_
#define TESTE_CUH_
__global__ void integraG(const int N, double* G)
{
const int szmodel = 2*N;
int idx = threadIdx.x + blockIdx.x*blockDim.x;
int idy = threadIdx.y + blockIdx.y*blockDim.y;
int offset = idx + idy*blockDim.x*gridDim.x;
int posInit = szmodel*offset;
G[posInit + 0] = 1;
G[posInit + 1] = 1;
G[posInit + 2] = 1;
G[posInit + 3] = 1;
}
#endif
The result (which is supposed to be a matrix filled with 1's) is copied back to the host array. The problem is: nothing happens! Apparently, my program is not calling the GPU kernel, and I am still getting an array full of zeros.
I am very new to CUDA programming and I am using CUDA by example (Jason Sanders) as a reference book.
My questions are:
What is wrong with my code?
Is this the best way to deal with matrices using GPU, using matrices vectorized form?
Is there another reference that can provide more examples on matrices using GPU's?
These are your questions:
What is wrong with my code?
Is this the best way to deal with matrices using GPU, using matrices vectorized form?
Is there another reference that can provide more examples on matrices using GPU's?
For your first question: first of all, your problem should be defined explicitly. What do you want to do with this code? What sort of calculations do you want to perform on the matrix?
Try to check for errors properly; THIS is a very good way to do so. There are also some obvious bugs in your code. Some of them:
You're passing the pointers to cudaMemcpy in the wrong order. The destination comes first and the source second, so the two pointers have to be swapped in both calls. Check here.
The corrected calls are included in the changes listed further down.
"Ndofs*Nel*Ndofs*Nel" shows that you're interested in only the first 64 elements of the array, so why launch 256 blocks of 256 threads each?
This part of your code:
int idx = threadIdx.x + blockIdx.x*blockDim.x;
int idy = threadIdx.y + blockIdx.y*blockDim.y;
shows that you want to use 2-D threads and blocks; to do so, you need to use the dim3 type.
By making the following changes:
cudaMemcpy(Gg, Gh, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyHostToDevice); //HERE
dim3 block(2,2); //HERE
dim3 thread(4,4); //HERE
integraG<<<block, thread>>>(Nel, Gg); //HERE
cudaMemcpy(Gh, Gg, sizeof(double)*Ndofs*Nel*Ndofs*Nel, cudaMemcpyDeviceToHost); //HERE
You'll get a result like the following:
1 1
2 1
3 1
4 1
5 0
6 0
7 0
8 0
9 1
10 1
11 1
12 1
.
.
.
57 1
58 1
59 1
60 1
61 0
62 0
63 0
64 0
Anyway, if you state your problem and goal more clearly, better suggestions can be provided for you.
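To see where the alternating 1/0 pattern in that output comes from, the kernel's index arithmetic can be replayed on the host. The following is only a sketch in plain C++, not CUDA: emulateIntegraG and LaunchReport are hypothetical names, and the loops stand in for the 2x2 grid of 4x4 blocks. It assumes the kernel body is unchanged, so each thread writes four values starting at 2*N*offset.

```cpp
#include <vector>

// Hypothetical host-side stand-in for the integraG kernel: it replays the
// kernel's index arithmetic for every (block, thread) pair of a 2-D launch,
// but only performs writes that fall inside the n-element array.
struct LaunchReport {
    std::vector<double> G;   // array contents after the emulated launch
    int inBounds = 0;        // threads whose four writes all fit in the array
    int outOfBounds = 0;     // threads that would write past the end
};

LaunchReport emulateIntegraG(int N, int n,
                             int gridX, int gridY, int blockX, int blockY) {
    LaunchReport r;
    r.G.assign(n, 0.0);
    const int szmodel = 2 * N;
    for (int by = 0; by < gridY; ++by)
    for (int bx = 0; bx < gridX; ++bx)
    for (int ty = 0; ty < blockY; ++ty)
    for (int tx = 0; tx < blockX; ++tx) {
        int idx = tx + bx * blockX;               // threadIdx.x + blockIdx.x*blockDim.x
        int idy = ty + by * blockY;               // threadIdx.y + blockIdx.y*blockDim.y
        int offset = idx + idy * blockX * gridX;  // idx + idy*blockDim.x*gridDim.x
        int posInit = szmodel * offset;
        if (posInit + 3 < n) {                    // guard that the kernel itself lacks
            for (int k = 0; k < 4; ++k) r.G[posInit + k] = 1.0;
            ++r.inBounds;
        } else {
            ++r.outOfBounds;
        }
    }
    return r;
}
```

Calling emulateIntegraG(4, 64, 2, 2, 4, 4) reports that only 8 of the 64 threads write inside the 64-element array, and the stride of 8 between their start positions is what leaves elements 5 to 8, 13 to 16, and so on at zero. The remaining 56 threads would index past the end of the allocation on the device, so a bounds check like the one above, or an index formula that matches the intended layout, seems worth adding to the real kernel.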
Regarding to your last two questions:
In my opinion CUDA C PROGRAMMING GUIDE and CUDA C BEST PRACTICES GUIDE are the two must documents to read when starting with CUDA, and they include examples on Matrix calculations as well.
I have the following c++ program:
#include <iostream>
using namespace std;
//looping through arrays backwards
int main() {
int a[3] {1, 2, 3};
int x = sizeof(a), y = sizeof(int), z = x / y;
for(int i = z - 1; i >= 0; i--) {
cout << a[i] << " ";
}
return 0;
}
And it outputs 3 2 1. But if I change the first part of the for loop to int i = z--;, it outputs 2 3 2 1 and I don't understand why. Aren't z - 1 and z-- supposed to be the same thing? Could someone please explain why? Also, I'm a beginner in C++ and I'm learning via the W3Schools tutorial. Thanks!
The expression z-- evaluates to the current value of z; the decrement happens afterwards as a side effect (sequenced according to the language's sequencing rules). This means you're essentially saying int i = z in your loop (z is then decremented, but it's not used anymore). With z equal to 3, the first iteration reads a[3], which is one past the end of the array, so your code has undefined behavior (UB). The 2 printed is purely coincidental; anything might be printed, or anything else could happen. If you'd like to use --, use it as a prefix, i.e., int i = --z.
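The difference between the three initializers can be checked in isolation. This is a small sketch; StartIndices and startIndices are hypothetical names introduced only for the illustration, and each case works on its own copy of z.

```cpp
// Values that each initializer would give the loop variable i, starting from z.
struct StartIndices {
    int minusOne;   // i = z - 1  (z itself is unchanged)
    int post;       // i = z--    (i receives the old value of z)
    int pre;        // i = --z    (i receives the decremented value)
};

StartIndices startIndices(int z) {
    StartIndices s{};
    s.minusOne = z - 1;

    int zPost = z;
    s.post = zPost--;   // yields z, then zPost becomes z - 1

    int zPre = z;
    s.pre = --zPre;     // zPre becomes z - 1 first, then yields it
    return s;
}
```

For z = 3 this gives 2, 3 and 2 respectively, so only the post-decrement form starts the loop at index 3, which is outside a three-element array.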
This is an extension of my earlier question, but I am asking it separately because I am getting really frustrated, so please do not down-vote it!
Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?
I know gemv is designed for matrix-vector multiplication, but why can't I use gemm for vector-matrix multiplication if it takes less time, especially for sparse matrices?
A short representative code is given below. It asks to enter a value, and then populates a vector with that value. It then replaces every 32nd value with its index. So, if we enter '0' then we get a sparse vector but for other values we get a dense vector.
#include <iostream>
#include <stdio.h>
#include <time.h>
#include <cblas.h>
using namespace std;
int main()
{
const int m = 5000;
timespec blas_start, blas_end;
long totalnsec; //total nano sec
double totalsec, totaltime;
int i, j;
float *A = new float[m]; // 1 x m
float *B = new float[m*m]; // m x m
float *C = new float[m]; // 1 x m
float input;
cout << "Enter a value to populate the vector (0 for sparse) ";
cin >> input; // enter 0 for sparse
// input matrix A: every 32nd element is non-zero, rest of the values = input
for(i = 0; i < m; i++)
{
A[i] = input;
if( i % 32 == 0) //adjust for sparsity
A[i] = i;
}
// input matrix B: identity matrix
for(i = 0; i < m; i++)
for(j = 0; j < m; j++)
B[i*m + j] = (i==j);
clock_gettime(CLOCK_REALTIME, &blas_start);
cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
clock_gettime(CLOCK_REALTIME, &blas_end);
/* for(i = 0; i < m; i++)
printf("%f ", C[i]);
printf("\n\n"); */
// Print time
totalsec = (double)blas_end.tv_sec - (double)blas_start.tv_sec;
totalnsec = blas_end.tv_nsec - blas_start.tv_nsec;
if(totalnsec < 0)
{
totalnsec += 1e9;
totalsec -= 1;
}
totaltime = totalsec + (double)totalnsec*1e-9;
cout<<"Duration = "<< totaltime << "\n";
return 0;
}
I run it as follows in Ubuntu 14.04 with blas 3.0
erisp#ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp#ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0291558
erisp#ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.000959521
EDIT
If I replace the gemm call with the following gemv call then matrix sparsity does not matter
//cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, 1, m, m, 1.0f, A, m, B, m, 0.0f, C, m);
cblas_sgemv(CblasRowMajor, CblasNoTrans, m, m, 1.0f, B, m, A, 1, 0.0f, C, 1);
Results
erisp#ubuntu:~/uas/stackoverflow$ g++ gemmcomp.cpp -o gemmcomp.o -lblas
erisp#ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 5
Duration = 0.0301581
erisp#ubuntu:~/uas/stackoverflow$ ./gemmcomp.o
Enter a value to populate the vector (0 for sparse) 0
Duration = 0.0299282
But the issue is that I am trying to optimize someone else's code using cublas, and he is successfully and efficiently using gemm to perform this vector-matrix multiplication. So I have to either test against it or categorically prove this call to be incorrect.
EDIT
I have even updated my blas library today by using
sudo apt-get install libblas-dev liblapack-dev
EDIT: Executed the following commands as suggested by different contributors
erisp#ubuntu:~/uas/stackoverflow$ ll -d /usr/lib/libblas* /etc/alternatives/libblas.*
lrwxrwxrwx 1 root root 26 مارچ 13 2015 /etc/alternatives/libblas.a -> /usr/lib/libblas/libblas.a
lrwxrwxrwx 1 root root 27 مارچ 13 2015 /etc/alternatives/libblas.so -> /usr/lib/libblas/libblas.so
lrwxrwxrwx 1 root root 29 مارچ 13 2015 /etc/alternatives/libblas.so.3 -> /usr/lib/libblas/libblas.so.3
lrwxrwxrwx 1 root root 29 مارچ 13 2015 /etc/alternatives/libblas.so.3gf -> /usr/lib/libblas/libblas.so.3
drwxr-xr-x 2 root root 4096 مارچ 13 2015 /usr/lib/libblas/
lrwxrwxrwx 1 root root 27 مارچ 13 2015 /usr/lib/libblas.a -> /etc/alternatives/libblas.a
lrwxrwxrwx 1 root root 28 مارچ 13 2015 /usr/lib/libblas.so -> /etc/alternatives/libblas.so
lrwxrwxrwx 1 root root 30 مارچ 13 2015 /usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3
lrwxrwxrwx 1 root root 32 مارچ 13 2015 /usr/lib/libblas.so.3gf -> /etc/alternatives/libblas.so.3gf
erisp#ubuntu:~/uas/stackoverflow$ ldd ./gemmcomp.o
linux-gate.so.1 => (0xb76f6000)
libblas.so.3 => /usr/lib/libblas.so.3 (0xb765e000)
libstdc++.so.6 => /usr/lib/i386-linux-gnu/libstdc++.so.6 (0xb7576000)
libc.so.6 => /lib/i386-linux-gnu/libc.so.6 (0xb73c7000)
libm.so.6 => /lib/i386-linux-gnu/libm.so.6 (0xb7381000)
/lib/ld-linux.so.2 (0xb76f7000)
libgcc_s.so.1 => /lib/i386-linux-gnu/libgcc_s.so.1 (0xb7364000)
Question: What could be the reason behind a cblas_sgemm call taking much less time for matrices with a large number of zeros as compared to the same cblas_sgemm call for dense matrices?
It seems that the BLAS implementation provided by the default libblas-dev package for Ubuntu 14.04 (and probably other Ubuntu distributions) includes an optimization for cases where certain matrix elements are zero.
For Ubuntu 14.04, the source code for the BLAS (and cblas) implementation/package can be downloaded from here.
After unpacking that archive, we have a cblas/src directory that contains the cblas API, and we have another src directory that contains F77 implementations of various blas routines.
In the case of cblas_sgemm, when the parameter CblasRowMajor is specified, the cblas/src/cblas_sgemm.c code will call the underlying fortran routine as follows:
void cblas_sgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA,
const enum CBLAS_TRANSPOSE TransB, const int M, const int N,
const int K, const float alpha, const float *A,
const int lda, const float *B, const int ldb,
const float beta, float *C, const int ldc)
{
...
} else if (Order == CblasRowMajor)
...
F77_sgemm(F77_TA, F77_TB, &F77_N, &F77_M, &F77_K, &alpha, B, &F77_ldb, A, &F77_lda, &beta, C, &F77_ldc);
Note that for this row-major call, the order of the A and B matrices is reversed when they are passed to the F77_sgemm routine (and M and N are swapped along with them). This is sensible, but I won't delve into why here. It's sufficient to note that A has become B in the Fortran call/code, and B has become A.
When we inspect the corresponding fortran routine in src/sgemm.f, we see the following sequence of code:
*
* Start the operations.
*
IF (NOTB) THEN
IF (NOTA) THEN
*
* Form C := alpha*A*B + beta*C.
*
DO 90 J = 1,N
IF (BETA.EQ.ZERO) THEN
DO 50 I = 1,M
C(I,J) = ZERO
50 CONTINUE
ELSE IF (BETA.NE.ONE) THEN
DO 60 I = 1,M
C(I,J) = BETA*C(I,J)
60 CONTINUE
END IF
DO 80 L = 1,K
IF (B(L,J).NE.ZERO) THEN ***OPTIMIZATION
TEMP = ALPHA*B(L,J)
DO 70 I = 1,M
C(I,J) = C(I,J) + TEMP*A(I,L)
70 CONTINUE
END IF
80 CONTINUE
90 CONTINUE
The above is the section of code that handles the case where no transpose of A and no transpose of B are indicated (which is true for this cblas row-major test case). The matrix row/column multiply operation is handled by the loops beginning where I have added the note ***OPTIMIZATION. In particular, if the matrix element B(L,J) is zero, then the DO loop ending at statement label 70 is skipped. But remember that B here corresponds to the A matrix passed to the cblas_sgemm routine.
Skipping this DO loop allows an sgemm function implemented this way to be substantially faster when there are a large number of zeroes in the A matrix passed to cblas_sgemm with row-major ordering specified.
Experimentally, not all BLAS implementations have this optimization. Testing on the exact same platform but using libopenblas-dev instead of libblas-dev provides no such speed-up, i.e. essentially no difference in execution time between the case where the A matrix is mostly zeroes and the case where it is not.
Note that the fortran (77) code here appears to be similar to or identical to older published versions of the sgemm.f routine such as here. Newer published versions of this fortran routine that I could find do not contain this optimization, such as here.
Ok some background
I have been working on this project, which I started back in college (I am no longer in school, but I want to expand on it to improve my understanding of C++). I digress... The problem is to find the best path through a matrix. I generate a matrix filled with a set integer value, let's say 9. I then create a path along the outer edge (row 0, column length-1) so that all values along it are 1.
The goal is for my program to run through all the possible paths and determine the best one. To simplify the problem, I decided to just calculate the path SUM and then compare that to the SUM computed by the application.
(The title is misleading: S = single-thread, P = multi-thread.)
OK so to my question.
In one section the algorithm does some simple bit-wise shifts to come up with the bounds for iteration. My question is how exactly do these shifts work so that the entire matrix (or MxN array) is completely traversed?
void AltitudeMapPath::bestPath(unsigned int threadCount, unsigned int threadIndex) {
unsigned int tempPathCode;
unsigned int toPathSum, toRow, toCol;
unsigned int fromPathSum, fromRow, fromCol;
Coordinates startCoord, endCoord, toCoord, fromCoord;
// To and From split matrix in half along the diagonal
unsigned int currentPathCode = threadIndex;
unsigned int maxPathCode = ((unsigned int)1 << (numRows - 1));
while (currentPathCode < maxPathCode) {
tempPathCode = currentPathCode;
// Setup to path iteration
startCoord = pathedMap(0, 0);
toPathSum = startCoord.z;
toRow = 0;
toCol = 0;
// Setup from path iteration
endCoord = pathedMap(numRows - 1, numCols - 1);
fromPathSum = endCoord.z;
fromRow = numRows - 1;
fromCol = numCols - 1;
for (unsigned int index = 0; index < numRows - 1; index++) {
if (tempPathCode % 2 == 0) {
toCol++;
fromCol--;
}
else {
toRow++;
fromRow--;
}
toCoord = pathedMap(toRow, toCol);
toPathSum += toCoord.z;
fromCoord = pathedMap(fromRow, fromCol);
fromPathSum += fromCoord.z;
tempPathCode = tempPathCode >> 1;
}
if (toPathSum < bestToPathSum[threadIndex][toRow]) {
bestToPathSum[threadIndex][toRow] = toPathSum;
bestToPathCode[threadIndex][toRow] = currentPathCode;
}
if (fromPathSum < bestFromPathSum[threadIndex][fromRow]) {
bestFromPathSum[threadIndex][fromRow] = fromPathSum;
bestFromPathCode[threadIndex][fromRow] = currentPathCode;
}
currentPathCode += threadCount;
}
}
I simplified the code since all the extra stuff just detracts from the question. Also if people are wondering I wrote most of the application but this idea of using the bit-wise operators was given to me by my past instructor.
Edit:
I added the entire algorithm that each thread executes. The entire project is still a work in progress, but here is the source code for the whole thing if anyone is interested [GITHUB]
A right bit shift is equivalent to integer division by 2 to the power of the number of bits shifted (for unsigned or non-negative values), with the remainder discarded. For example, 8 >> 2 = 8 / (2 ^ 2) = 2, while 1 >> 2 = 0 because 1 / 4 truncates to zero.
A left bit shift is equivalent to multiplying by 2 to the power of the number of bits shifted, as long as the result fits in the type. For example, 1 << 2 = 1 * 2 ^ 2 = 4.
I'm not entirely sure what that algorithm does, or why it needs to compute 2 ^ (numRows - 1) and then progressively divide by 2.
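One way to read the shifts in the question's code: 1 << (numRows - 1) counts every bit pattern of numRows - 1 bits, and the inner loop consumes one bit per step, lowest bit first, to choose between a column step and a row step. The sketch below isolates that decoding. decodePath and pathCodeCount are hypothetical helper names, and the letters 'C' and 'R' are just labels for the two branches of the if (tempPathCode % 2 == 0) test.

```cpp
#include <string>

// Decode a path code the way the inner loop of bestPath does: the lowest bit
// selects the move ('C' = step one column, 'R' = step one row), then the
// code is shifted right by one so the next bit becomes the lowest.
std::string decodePath(unsigned int pathCode, unsigned int steps) {
    std::string moves;
    for (unsigned int i = 0; i < steps; ++i) {
        moves += (pathCode % 2 == 0) ? 'C' : 'R';
        pathCode >>= 1;
    }
    return moves;
}

// Number of distinct codes for the given number of steps: 1 << steps == 2^steps.
unsigned int pathCodeCount(unsigned int steps) {
    return 1u << steps;
}
```

With three steps there are pathCodeCount(3) == 8 codes. Code 0 decodes to "CCC", code 5 (binary 101) to "RCR" and code 6 (binary 110) to "CRR", so iterating currentPathCode from 0 up to maxPathCode visits every sequence of row and column steps of that length exactly once. Each thread starts at its own threadIndex and advances by threadCount, which splits those codes among the threads without overlap.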
I am a new C++ learner. I am working on Codeforces problem 11A:
A sequence a_0, a_1, ..., a_{t-1} is called increasing if a_{i-1} < a_i for each i: 0 < i < t.
You are given a sequence b_0, b_1, ..., b_{n-1} and a positive integer d. In each move you may choose one element of the given sequence and add d to it. What is the least number of moves required to make the given sequence increasing?
Input
The first line of the input contains two integer numbers n and d (2 ≤ n ≤ 2000, 1 ≤ d ≤ 10^6). The second line contains space separated sequence b_0, b_1, ..., b_{n-1} (1 ≤ b_i ≤ 10^6).
Output
Output the minimal number of moves needed to make the sequence increasing.
I wrote this code for this problem:
#include <iostream>
using namespace std;
int main()
{
long long int n,d,ci,i,s;
s=0;
cin>>n>>d;
int a[n];
for(ci=0;ci<n;ci++)
{
cin>>a[ci];
}
for(i=0;i<(n-1);i++)
{
while(a[i]>=a[i+1])
{
a[i+1]+=d;
s+=1;
}
}
cout<<s;
return 0;
}
It works correctly. But one test on the Codeforces server has 2000 numbers and a time limit of 1 second, and my program takes longer than 1 second.
How can I make this code run faster?
One improvement that can be made is to use
std::ios_base::sync_with_stdio(false);
By default, cin/cout waste time synchronizing themselves with the C library’s stdio buffers, so that you can freely intermix calls to scanf/printf with operations on cin/cout. By turning this off using the above call the input and output operations in the above program should take less time since it no longer initialises the sync for input and output.
This is known to have helped in previous code challenges that require code to complete within a certain time limit and in which C++ input/output was a bottleneck.
You can get rid of the while loop. Your program should run faster without it:
#include <iostream>
using namespace std;
int main()
{
long int n,d,ci,i,s;
s=0;
cin>>n>>d;
int a[n];
for(ci=0;ci<n;ci++)
{
cin>>a[ci];
}
for(i=0;i<(n-1);i++)
{
if(a[i]>=a[i+1])
{
int x = ((a[i] - a[i+1])/d) + 1;
s+=x;
a[i+1]+=x*d;
}
}
cout<<s;
return 0;
}
This is not a complete answer, but a hint.
Suppose our sequence is {1000000, 1} and d is 2.
To make an increasing sequence, we need to make the second element 1,000,001 or greater.
We could do it your way, by repeatedly adding 2 until we get past 1,000,000
1 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + 2 + ...
which would take a while, or we could say
Our goal is 1,000,001
We have 1
The difference is 1,000,000
So we need to do 1,000,000 / 2 = 500,000 additions
So the answer is 500,000.
Which is quite a bit faster, because we only did 1 addition (1,000,000 + 1), one subtraction (1,000,001 - 1) and one division (1,000,000 / 2) instead of doing half a million additions.
Just as @molbdnilo said, use math to get rid of the loop, and it's simple.
Here is my code, accepted on Codeforces.
#include <iostream>
using namespace std;
int main()
{
int n = 0 , b = 0;
int a[2001];
cin >> n >> b;
for(int i = 0 ; i < n ; i++){
cin >> a[i];
}
int sum = 0;
for(int i = 0 ; i < n - 1 ; i++){
if(a[i] >= a[i + 1]){
int minus = a[i] - a[i+1];
int diff = minus / b + 1;
a[i+1] += diff * b;
sum += diff;
}
}
cout << sum << endl;
return 0;
}
I suggest you profile your code to see where the bottlenecks are.
One of the popular areas of time wasting is with input. The fewer input requests, the faster your program will be.
So, you could speed up your program by reading from cin using read() into a buffer and then parse the buffer using istringstream.
Other techniques include loop unrolling and optimizing for the data cache. Reducing the number of branches or if statements will also speed up your programs. Processors prefer crunching data and moving data around to jumping to different areas in the code.
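The buffered-input idea mentioned above could look roughly like the following. This is a sketch under assumptions: readAll and parseInts are hypothetical helper names, the 4096-byte chunk size is arbitrary, and whether this is actually faster than plain cin >> for a given judge is something profiling would have to confirm.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Pull the whole stream into one string using a few large read() calls.
std::string readAll(std::istream& in) {
    std::string buffer;
    char chunk[4096];
    // read() fails on the final, partial chunk; gcount() still reports
    // how many characters that last call extracted.
    while (in.read(chunk, sizeof chunk) || in.gcount() > 0) {
        buffer.append(chunk, static_cast<std::size_t>(in.gcount()));
    }
    return buffer;
}

// Parse every whitespace-separated integer out of the buffer.
std::vector<long long> parseInts(const std::string& buffer) {
    std::istringstream parser(buffer);
    std::vector<long long> values;
    long long v;
    while (parser >> v) values.push_back(v);
    return values;
}
```

In a contest program the call would be parseInts(readAll(std::cin)), after which the first two values are n and d and the rest are the sequence.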
I would like to have a multidimensional array that allows for different sizes.
Example:
int x[][][] = {{{1,2},{2,3}},{{1,2}},{{4,5},{2,7},{1,1}}};
The values will be known at compile time and will not change.
I would like to be able to access the values like val = x[2][0][1];
What is the best way to go about this? I'm used to Java/PHP, where doing something like this is trivial.
Thanks
I suppose you could do this "the old fashioned (uphill both ways) way":
#include <stdio.h>
int main(void){
int *x[3][3];
int y[12] = {1,2,3,4,5,6,7,8,9,10,11,12};
x[0][0] = &y[0];
x[0][1] = &y[2];
x[1][0] = &y[4];
x[2][0] = &y[6];
x[2][1] = &y[8];
x[2][2] = &y[10];
// testing:
printf("x[0][0][0] = %d\n", x[0][0][0]);
printf("x[0][0][1] = %d\n", x[0][0][1]);
printf("x[0][1][0] = %d\n", x[0][1][0]);
printf("x[0][1][1] = %d\n", x[0][1][1]);
printf("x[1][0][0] = %d\n", x[1][0][0]);
printf("x[1][0][1] = %d\n", x[1][0][1]);
printf("x[2][0][0] = %d\n", x[2][0][0]);
printf("x[2][0][1] = %d\n", x[2][0][1]);
printf("x[2][1][0] = %d\n", x[2][1][0]);
printf("x[2][1][1] = %d\n", x[2][1][1]);
printf("x[2][2][0] = %d\n", x[2][2][0]);
printf("x[2][2][1] = %d\n", x[2][2][1]);
return 0;
}
Basically, the array x is a little bit too big (3x3), and each used entry points to the "right place" in the array y that contains your data (I am using the numbers 1…12 because it's easier to see it is doing the right thing). For a small example like this, you end up with an array of 9 pointers in x (72 bytes with 8-byte pointers), plus the 12 integers in y (48 bytes with 4-byte ints).
If you instead filled a plain 3x3x2 int array with zeros where you didn't need values (or -1 if you wanted to indicate "invalid"), you would end up with 18x4 = 72 bytes. So the above method is less efficient, because this array is not "very sparse". As the data gets more ragged, the pointer approach gets better. (On an 8-bit AVR Arduino, where pointers and ints are typically 2 bytes each, the figures become 18 + 24 bytes versus 36 bytes, and the conclusion is the same.) If you really wanted to be efficient you would have an array of pointers-of-pointers, followed by n arrays of pointers - but this gets very messy very quickly.
Very often the right approach is a tradeoff between speed and memory size (which is always at a premium on the Arduino).
By the way - the above code does indeed produce the output
x[0][0][0] = 1
x[0][0][1] = 2
x[0][1][0] = 3
x[0][1][1] = 4
x[1][0][0] = 5
x[1][0][1] = 6
x[2][0][0] = 7
x[2][0][1] = 8
x[2][1][0] = 9
x[2][1][1] = 10
x[2][2][0] = 11
x[2][2][1] = 12
Of course it doesn't stop you from accessing an invalid array element - and doing so will generate a seg fault (since the unused elements in x are probably invalid pointers).
Thanks Floris.
I've decided to just load all values into a single array, like
{1,2,2,3,1,2,4,5,2,7,1,1}
and have a second array which stores the length of each first dimension, like
{2,1,3}
The third dimension always has a length of 2, so I will just multiply the number by 2. I'm going to make a helper class so I can just do something like getX(2,0) which would return 4, and have another function like getLength(0) which would return 2.
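A minimal sketch of that flattened layout might look like this. It uses a namespace with free functions rather than a full class to keep it short; ragged, kData, kLengths and the three-argument get are hypothetical names, while getX and getLength follow the names mentioned above. No bounds checking is done.

```cpp
// Flattened storage for {{{1,2},{2,3}},{{1,2}},{{4,5},{2,7},{1,1}}}.
namespace ragged {

const int kData[]    = {1,2, 2,3, 1,2, 4,5, 2,7, 1,1};  // all pairs, back to back
const int kLengths[] = {2, 1, 3};                        // pairs per first-dimension group

// Number of pairs in the given group.
int getLength(int group) { return kLengths[group]; }

// Element k (0 or 1) of pair `pair` in group `group`, i.e. x[group][pair][k].
int get(int group, int pair, int k) {
    int start = 0;
    for (int g = 0; g < group; ++g) start += kLengths[g] * 2;  // skip earlier groups
    return kData[start + pair * 2 + k];
}

// First element of a pair, so that getX(2, 0) yields 4 as described.
int getX(int group, int pair) { return get(group, pair, 0); }

}  // namespace ragged
```

Here ragged::getX(2, 0) returns 4, ragged::getLength(0) returns 2, and ragged::get(2, 0, 1) returns 5, matching x[2][0][1] from the original question. If lookups are frequent, the per-call loop over kLengths could be replaced by a small precomputed table of start offsets ({0, 4, 6} for this data) at the cost of a few extra bytes.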