CUDA: least-squares solving, poor in speed - C++

Recently, I used CUDA to write an algorithm called 'orthogonal matching pursuit'. In my ugly CUDA code the entire iteration takes 60 seconds, while the Eigen library takes just 3 seconds...
In my code, matrix A is [640,1024] and y is [640,1]. In each step I select some columns from A to compose a new matrix called A_temp of size [640,itera], with itera = 1:500. I allocate an array MaxDex_Host[] on the CPU to record which columns to select.
I want to get x_temp[itera,1] from A_temp*x_temp=y using least squares; for this I use the CULA API 'culaDeviceSgels' and the cuBLAS matrix-vector multiplication API.
So culaDeviceSgels is called 500 times, and I thought this would be faster than Eigen's QR solver.
I checked the Nsight performance analysis and found that cuStreamDestroy takes a long time. I initialize cuBLAS before the iteration and destroy it after I get the result. So I want to know what cuStreamDestroy is, and how it differs from cublasDestroy?
The main cost comes from memcpy and the function 'gemm_kernel1x1val', which I think is called from 'culaDeviceSgels'.
while (itera < 500): I use cublasSgemv and cublasIsamax to get MaxDex_Host[itera], then
MaxDex_Host[itera] = pos;
itera++;
float* A_temp_cpu = new float[M * itera];   // matrix all in col-major
for (int j = 0; j < itera; j++)             // build A_temp [M, itera]; MaxDex_Host[] holds the indices of the columns of A to choose
{
    for (int i = 0; i < M; i++)             // M = 640, A is 640*1024, itera grows by 1 each step
    {
        A_temp_cpu[j*M + i] = A[MaxDex_Host[j]*M + i];
    }
}
// I must allocate one more array because culaDeviceSgels decomposes its input array,
// and I want to use A_temp after the least-squares solve.
float* A_temp_gpu;
float* A_temp2_gpu;
cudaMalloc((void**)&A_temp_gpu,  Size_float*M*itera);
cudaMalloc((void**)&A_temp2_gpu, Size_float*M*itera);
cudaMemcpy(A_temp_gpu,  A_temp_cpu, Size_float*M*itera, cudaMemcpyHostToDevice);
cudaMemcpy(A_temp2_gpu, A_temp_gpu, Size_float*M*itera, cudaMemcpyDeviceToDevice);
culaDeviceSgels('N', M, itera, 1, A_temp_gpu, M, y_Gpu_temp, M); // the x_temp I want is returned in y_Gpu_temp, stored in y_Gpu_temp[0] .. y_Gpu_temp[itera-1]
float* x_temp;
cudaMalloc((void**)&x_temp, Size_float*itera);
cudaMemcpy(x_temp, y_Gpu_temp, Size_float*itera, cudaMemcpyDeviceToDevice);
CUDA's memory management seems too complex. Is there any more convenient way to solve least squares?

I think that cuStreamDestroy and gemm_kernel1x1val are internally called by the APIs you are using, so there is not much you can do about them.
To improve your code, I would suggest the following.
You can get rid of A_temp_cpu by keeping a device copy of the matrix A. Then you can copy the selected columns of A into the columns of A_temp_gpu and A_temp2_gpu with a small gather kernel (a sketch is given after these suggestions). This would avoid performing the first two cudaMemcpys.
You can preallocate A_temp_gpu and A_temp2_gpu outside the while loop, using the maximum possible value of itera instead of itera. This avoids the first two cudaMallocs inside the loop. The same applies to x_temp.
As far as I know, culaDeviceSgels solves the linear system in the least-squares sense. I think you can do the same using cuBLAS APIs only: for example, form the (square) normal-equations system first, perform an LU factorization of it with cublasSgetrfBatched(), and then use cublasStrsv() twice to solve the two resulting triangular systems. You may wish to see whether this solution leads to a faster algorithm.
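As a rough illustration of the first suggestion, here is a minimal sketch (untested) of such a gather kernel; the name gather_columns and the device index array MaxDex_Dev are placeholders of mine, and A_gpu is assumed to be a device copy of A in column-major order, as in the original code:

__global__ void gather_columns(const float* A, float* A_temp,
                               const int* MaxDex_Dev, int M, int itera)
{
    // One thread per element of A_temp (M x itera, column-major).
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < M * itera)
    {
        int j = idx / M;                               // destination column in A_temp
        int i = idx % M;                               // row
        A_temp[j * M + i] = A[MaxDex_Dev[j] * M + i];  // pull column MaxDex_Dev[j] of A
    }
}

// Per iteration: copy the (small) index list to the device, then launch the kernel
// instead of filling A_temp_cpu on the host and doing a host-to-device cudaMemcpy.
// cudaMemcpy(MaxDex_Dev, MaxDex_Host, itera * sizeof(int), cudaMemcpyHostToDevice);
// int threads = 256;
// int blocks  = (M * itera + threads - 1) / threads;
// gather_columns<<<blocks, threads>>>(A_gpu, A_temp_gpu, MaxDex_Dev, M, itera);

A_temp2_gpu can then be filled with the existing device-to-device cudaMemcpy, or by a second launch of the same kernel.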

Related

Problem with value assignment in ArrayFire

I'm using ArrayFire and Flashlight to evaluate a network.
auto tmp = output(af::seq(2, 10), af::span, af::span, af::span);
auto softmax_tmp = fl::softmax(tmp, 0);
output(af::seq(2, 10), af::span, af::span, af::span) = softmax_tmp;
output is a tensor with shape (12,100,1,1). I want to pull out indices 2 to 10 along the first dimension, apply the softmax operation to each of the resulting 100 9-dim vectors, and then put them back; that is what the code above does.
The problem is that the 3rd line doesn't work. softmax_tmp has the right values, but the assignment in the 3rd line fails: it compiles successfully, but output keeps the old values from the 1st line.
Could anyone help me? Thanks a lot.

K-Means Algorithm not working properly

I was trying to write my own K-Means clustering algorithm, but it is not working. Can someone take a look and help me find the mistake I am making? I am fairly new to this.
I expect the data to be clustered into 2 groups since K=2, but I am not getting the expected result. I think the mean assignment is not working properly. Can someone take a look?
https://github.com/DivJ/Robo_Lab/blob/master/K_Means.py
dist=[]
lab=[]
x_sum,y_sum=0,0
x_sum1,y_sum1=0,0
k=2
mean=pt[:k]

def assignment():
    global dist
    global lab
    for i in range(0,100):
        for j in range(0,k):
            dist.append(math.hypot(pt[i,0]-mean[j,0],pt[i,1]-mean[j,1]))
        lab.append(dist.index(min(dist)))
        dist=[]

def mean_shift():
    global x_sum,x_sum1,y_sum,y_sum1,lab
    for i in range(0,100):
        if(lab[i]==0):
            plt.scatter(pt[i,0],pt[i,1],c='r')
            x_sum=pt[i,0]+x_sum
            y_sum=pt[i,1]+y_sum
        elif(lab[i]==1):
            plt.scatter(pt[i,0],pt[i,1],c='b')
            x_sum1=pt[i,0]+x_sum1
            y_sum1=pt[i,1]+y_sum1
    mean[0,0]=x_sum/lab.count(0)
    mean[0,1]=y_sum/lab.count(0)
    mean[1,0]=x_sum1/lab.count(1)
    mean[1,1]=y_sum1/lab.count(1)
    lab=[]

def k_means(itr):
    for z in range(0,itr):
        assignment()
        mean_shift()

k_means(100)
Here's what's wrong with your code:
1) You initialize mean as pt[:k], but later you reassign its entries, which modifies the first two points unintentionally, since mean is merely a view of those points. You need to create a copy of the first two points to avoid changing them:
import copy
mean=copy.copy(pt[:k])
2) You initialize x_sum, y_sum, x_sum1 and y_sum1 outside of mean_shift(), which causes the sums to grow bigger and bigger with each iteration. Set them to 0 every time mean_shift() is called.

mlpack: Lasso regression that takes in a pointer to a function

from
http://www.mlpack.org/doxygen.php?doc=classmlpack_1_1regression_1_1LARS.html
I'm trying to use
void mlpack::regression::LARS::Regress
but the function only takes a matrix (&gramMatrix) as input. If I want to pass in a function to compute some sum R_i * X_i, I'm stuck because it only takes a pointer to a matrix. Any idea how to get around this? (beta is constantly updated within the optimization function mlpack::regression::LARS::Regress, and beta is needed to compute the sum R_i * X_i.)
Any suggestion for another C++ ML library would also be very helpful.
Thanks!

How to properly compare results between matlab/octave and C++

I'm porting some code written in Matlab/Octave to C++. I only have Octave, so I will just say Octave from now on.
I want to properly compare the results between the Octave code and the C++ code. The algorithms I'm writing take a 2D matrix as input and output another 2D matrix.
To compare the results, I write the input matrix A from Octave using the command save A.mat A, with default options. This creates an ASCII file A.mat which starts like:
# Created by Octave 3.8.1, Tue May 27 12:12:53 2014 CEST <remi#desktop>
# name: values
# type: matrix
# rows: 25
# columns: 5
43.0656 6.752420000000001 68.39323 35.75617 98.85446
...
I run the algorithm in Octave and save the output matrix B in the same way.
In my C++ code, I load the matrices A and B using the following piece of code:
// I opened the file A.mat with std::ifstream infile(filename);
// and read the first lines starting by # and loaded the matrix dimensions
std::string buffer;
double* matBuffer = new double[rows*cols];
while (std::getline(infile, buffer)) {
    std::istringstream iss(buffer);
    while (iss >> *matBuffer) {
        matBuffer++;
    }
}
Then I run the C++ code on the values read from A.mat and compare the results with the values read from B.mat by computing the mean squared error (MSE) between the coefficients of the B that was read and the B that was computed.
However, with such a design, can I expect the MSE to be 0 between the C++ and Octave code? Of course I do the computation in Octave and C++ on the same machine. But what about the loss of precision due to writing/reading the matrices to/from files? Also, I assume that the coefficients of Octave matrices are stored as double by default; is this correct?
can I expect that the MSE be 0 between the C++ and octave code?
I don't think so; with that many levels of conversion, a loss of precision is hard to avoid.
Also, I assume that coefficients of octave matrices are stored in double by default, is this correct?
Octave uses double precision for the internal representation of the values, but again there can be a loss of precision when storing the values in ASCII.
I'd suggest you use a binary format for storing the values, which avoids the precision loss from the text conversion. You can go with the HDF5 format by using
save -hdf5 A.mat A
You can then use the HDF5 API to read the values in your C++ application.
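For illustration, here is a minimal sketch (untested) of reading such a file with the HDF5 C API. The dataset path "/A/value" reflects how Octave usually stores a saved variable (a group containing a "value" dataset); check the actual layout, e.g. with h5dump, before relying on it. Note also that the in-memory ordering may appear transposed relative to Octave's column-major layout.

#include <hdf5.h>
#include <vector>
#include <cstdio>

int main()
{
    // Open the file written by: save -hdf5 A.mat A
    hid_t file = H5Fopen("A.mat", H5F_ACC_RDONLY, H5P_DEFAULT);
    hid_t dset = H5Dopen2(file, "/A/value", H5P_DEFAULT);

    // Query the dataset dimensions.
    hid_t space = H5Dget_space(dset);
    hsize_t dims[2] = {0, 0};
    H5Sget_simple_extent_dims(space, dims, NULL);

    // Read the whole matrix in double precision, exactly as Octave stores it.
    std::vector<double> data(dims[0] * dims[1]);
    H5Dread(dset, H5T_NATIVE_DOUBLE, H5S_ALL, H5S_ALL, H5P_DEFAULT, data.data());

    std::printf("read a %llu x %llu matrix\n",
                (unsigned long long)dims[0], (unsigned long long)dims[1]);

    H5Sclose(space);
    H5Dclose(dset);
    H5Fclose(file);
    return 0;
}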

c++ armadillo - calculating null space

this is my first post here...
Is there any way to calculate a vector in the null space of a matrix? I don't need the basis; just one vector in the null space will do.
I already tried using the solve() method -
colvec x(3);
x = solve(A,B);
where A is a 3x3 matrix of type mat -
2 2 2
3 3 3
4 4 4
and B is the zero vector of type colvec -
0
0
0
But the program terminates throwing the following error -
error: solve(): solution not found
terminate called after throwing an instance of 'std::runtime_error'
what():
I have used the solve() method before and got perfect results, but it doesn't seem to work in this simple case. Is this because the equation has multiple solutions? If so, is there any workaround, any other way to get a vector in the null space?
Any help would be appreciated.
Edit :
I tried the svd(mat U, vec s, mat V, mat X, method = "standard") method and I got the null space of X from the columns of V. I was just wondering if there is any way to improve the precision of the answer.
Thanks!
In recent versions of the Armadillo library you can find an orthonormal basis of the null space of a matrix using the null() function. See the documentation at http://arma.sourceforge.net/docs.html#null. The functionality was added in version 5.400 (August 2015).
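As a minimal sketch of that approach (assuming a C++11-capable compiler and Armadillo >= 5.400), any single column of the returned basis is a vector in the null space:

#include <armadillo>

int main()
{
    arma::mat A = { {2, 2, 2},
                    {3, 3, 3},
                    {4, 4, 4} };

    arma::mat N = arma::null(A);   // columns form an orthonormal basis of the null space of A

    arma::colvec x = N.col(0);     // one vector in the null space, so A*x is (numerically) zero
    x.print("x:");

    return 0;
}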