Logic error problem with the Gaussian Elimination code...This code was from my Numerical Methods text in 1990's. The code is typed in from the book- not producing correct output...
Sample Run:
SOLUTION OF SIMULTANEOUS LINEAR EQUATIONS
USING GAUSSIAN ELIMINATION
This program uses Gaussian Elimination to solve the
system Ax = B, where A is the matrix of known
coefficients, B is the vector of known constants
and x is the column matrix of the unknowns.
Number of equations: 3
Enter elements of matrix [A]
A(1,1) = 0
A(1,2) = -6
A(1,3) = 9
A(2,1) = 7
A(2,2) = 0
A(2,3) = -5
A(3,1) = 5
A(3,2) = -8
A(3,3) = 6
Enter elements of [b] vector
B(1) = -3
B(2) = 3
B(3) = -4
SOLUTION OF SIMULTANEOUS LINEAR EQUATIONS
The solution is
x(1) = 0.000000
x(2) = -1.#IND00
x(3) = -1.#IND00
Determinant = -1.#IND00
Press any key to continue . . .
The code as copied from the text...
//Modified Code from C Numerical Methods Text- June 2009
#include <stdio.h>
#include <math.h>
#define MAXSIZE 20
//function prototype
int gauss (double a[][MAXSIZE], double b[], int n, double *det);
int main(void)
{
double a[MAXSIZE][MAXSIZE], b[MAXSIZE], det;
int i, j, n, retval;
printf("\n \t SOLUTION OF SIMULTANEOUS LINEAR EQUATIONS");
printf("\n \t USING GAUSSIAN ELIMINATION \n");
printf("\n This program uses Gaussian Elimination to solve the");
printf("\n system Ax = B, where A is the matrix of known");
printf("\n coefficients, B is the vector of known constants");
printf("\n and x is the column matrix of the unknowns.");
//get number of equations
n = 0;
while(n <= 0 || n > MAXSIZE)
{
printf("\n Number of equations: ");
scanf ("%d", &n);
}
//read matrix A
printf("\n Enter elements of matrix [A]\n");
for (i = 0; i < n; i++)
for (j = 0; j < n; j++)
{
printf(" A(%d,%d) = ", i + 1, j + 1);
scanf("%lf", &a[i][j]);
}
//read {B} vector
printf("\n Enter elements of [b] vector\n");
for (i = 0; i < n; i++)
{
printf(" B(%d) = ", i + 1);
scanf("%lf", &b[i]);
}
//call Gauss elimination function
retval = gauss(a, b, n, &det);
//print results
if (retval == 0)
{
printf("\n\t SOLUTION OF SIMULTANEOUS LINEAR EQUATIONS\n");
printf("\n\t The solution is");
for (i = 0; i < n; i++)
printf("\n \t x(%d) = %lf", i + 1, b[i]);
printf("\n \t Determinant = %lf \n", det);
}
else
printf("\n \t SINGULAR MATRIX \n");
return 0;
}
/* Solves the system of equations [A]{x} = {B} using */
/* the Gaussian elimination method with partial pivoting. */
/* Parameters: */
/* n - number of equations */
/* a[n][n] - coefficient matrix */
/* b[n] - right-hand side vector */
/* *det - determinant of [A] */
int gauss (double a[][MAXSIZE], double b[], int n, double *det)
{
double tol, temp, mult;
int npivot, i, j, l, k, flag;
//initialization
*det = 1.0;
tol = 1e-30; //initial tolerance value
npivot = 0;
//mult = 0;
//forward elimination
for (k = 0; k < n; k++)
{
//search for max coefficient in pivot row- a[k][k] pivot element
for (i = k + 1; i < n; i++)
{
if (fabs(a[i][k]) > fabs(a[k][k]))
{
//interchange row with maxium element with pivot row
npivot++;
for (l = 0; l < n; l++)
{
temp = a[i][l];
a[i][l] = a[k][l];
a[k][l] = temp;
}
temp = b[i];
b[i] = b[k];
b[k] = temp;
}
}
//test for singularity
if (fabs(a[k][k]) < tol)
{
//matrix is singular- terminate
flag = 1;
return flag;
}
//compute determinant- the product of the pivot elements
*det = *det * a[k][k];
//eliminate the coefficients of X(I)
for (i = k; i < n; i++)
{
mult = a[i][k] / a[k][k];
b[i] = b[i] - b[k] * mult; //compute constants
for (j = k; j < n; j++) //compute coefficients
a[i][j] = a[i][j] - a[k][j] * mult;
}
}
//adjust the sign of the determinant
if(npivot % 2 == 1)
*det = *det * (-1.0);
//backsubstitution
b[n] = b[n] / a[n][n];
for(i = n - 1; i > 1; i--)
{
for(j = n; j > i + 1; j--)
b[i] = b[i] - a[i][j] * b[j];
b[i] = b[i] / a[i - 1][i];
}
flag = 0;
return flag;
}
The solution should be: 1.058824, 1.823529, 0.882353 with det as -102.000000
Any insight is appreciated...
//eliminate the coefficients of X(I)
for (i = k; i < n; i++)
Should this maybe be
for (i = k + 1; i < n; i++)
The way it is now, I believe this will cause you to divide the pivot row by itself, zeroing it out.
This probably doesn't answer your question in the way you intended, but programming your own numerically-stable matrix algorithms is about as well-advised as do-it-yourself surgery.
There's a very nice library called TNT/JAMA from a reputable source (NIST) which does elementary matrix math in C++. To solve Ax=B, first factor A (the QR decomposition is a good general method, you can use LU but it's less numerically stable), then call solve(B). This works both for square matrices, where it's exact (subject to numerical computation issues), and overdetermined systems, where you get a least-squares answer.
Related
So, here is the problem.
I am given the lengths of 3 sides of a triangle.
The program calculates the area of the given triangle using determinates.
I assume that one vertex of the triangle is in the (0,0) point and the 2nd one is in the (c,0), where c is the length of the longest side. So what would be the easiest way to get the 3rd vertices coordinates.
I tried cosine theorem to get the line equation the side is going through, but it is a bit off
I have the determination solver program if you need it down here:
float det(int n, float mat[3][3])
{
int d=0;
int c, subi, i, j, subj;
float submat[3][3];
if(n == 2) {
return( (mat[0][0] * mat[1][1]) - (mat[1][0] * mat[0][1]));
}
else{
for(c = 0; c < n; c++){
subi = 0;
for(i = 1; i < n; i++){
subj = 0;
for(j = 0; j < n; j++){
if (j == c){
continue;
}
submat[subi][subj] = mat[i][j];
subj++;
}
subi++;
}
d = d + (pow(-1 ,c) * mat[0][c] * det(n - 1 ,submat));
}
}
return d;
}
.
.
.
ans=det.det(3,coords)*0.5;
Example picture of the triangle constructed in GeoGebra:
This question might be long and I really appreciate your patience. The core problem is I used matlab and c++ to implement an optimization algorithm but they provided me different results(matlab's better).
I am recently studying some evolutionary algorithms and interested in one variant of PSO(Particle Swarm Optimization), which is called Competitive Swarm Optimizer(born in 2015). This is the paper link http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6819057.
The basic idea of this algorithm is to first generate some random particles in searching space and assign them random velocities. At each iteration, we randomly pair them and let every pair of particles compare their objective function values. Winners(with better objective values) keep status quo while losers update themselves by learning from winners(moving toward winners).
Suppose at iteration t, particle i and j are compared and i is better. Then we update particle j for iteration t+1 by following these formulas. If particle j is out of searching space, we simply pull it back to the boundary. R_1, R_2, R_3 are all random vectors uniformly drawn from [0, 1]; operation 'otimes' means elementwise product; phi is a parameter; x_bar is the center of swarm.
For example, suppose now I want to minimize a 500-d Schwefel function(minimize the maximal absolute element) and I use 250 particles, set phi=0.1, searching space is 500-d [-100, 100]. Matlab could return me something around 35 while C++ got stuck at 85 to 90. I cannot figure out what's the problem.
Let me attach my matlab and c++ code here.
Sch = #(x)max(abs(x))
lb = -100 * ones(1, 500);
ub = 100 * ones(1, 500);
swarmsize = 250;
phi = 0.1;
maxiter = 10000;
tic
cso(Sch, lb, ub, swarmsize, phi, maxiter);
toc
function [minf, minx] = cso(obj_fun, lb, ub, swarmsize, phi, maxiter)
assert(length(lb) == length(ub), 'Not equal length of bounds');
if all(ub - lb <= 0) > 0
error('Error. \n Upper bound must be greater than lower bound.')
end
vhigh = abs(ub - lb);
vlow = -vhigh;
S = swarmsize;
D = length(ub);
x = rand(S, D);
x = bsxfun(#plus, lb, bsxfun(#times, ub-lb, x)); % randomly initalize all particles
v = zeros([S D]); % set initial velocities to 0
iter = 0;
pairnum_1 = floor(S / 2);
losers = 1:S;
fx = arrayfun(#(K) obj_fun(x(K, :)), 1:S);
randperm_index = randperm(S);
while iter <= maxiter
fx(losers) = arrayfun(#(K) obj_fun(x(K, :)), losers);
swarm_center = mean(x); % calculate center all particles
randperm_index = randperm(S); % randomly permuate all particle indexes
rpairs = [randperm_index(1:pairnum_1); randperm_index(S-pairnum_1+1:S)]'; % random pair
cmask= (fx(rpairs(:, 1)) > fx(rpairs(:, 2)))';
losers = bsxfun(#times, cmask, rpairs(:, 1)) + bsxfun(#times, ~cmask, rpairs(:, 2)); % losers who with larger values
winners = bsxfun(#times, ~cmask, rpairs(:, 1)) + bsxfun(#times, cmask, rpairs(:, 2)); % winners who with smaller values
R1 = rand(pairnum_1, D);
R2 = rand(pairnum_1, D);
R3 = rand(pairnum_1, D);
v(losers, :) = bsxfun(#times, R1, v(losers, :)) + bsxfun(#times, R2, x(winners, :) - x(losers, :)) + phi * bsxfun(#times, R3, bsxfun(#minus, swarm_center, x(losers, :)));
x(losers, :) = x(losers, :) + v(losers, :);
maskl = bsxfun(#lt, x(losers, :), lb);
masku = bsxfun(#gt, x(losers, :), ub);
mask = bsxfun(#lt, x(losers, :), lb) | bsxfun(#gt, x(losers, :), ub);
x(losers, :) = bsxfun(#times, ~mask, x(losers, :)) + bsxfun(#times, lb, maskl) + bsxfun(#times, ub, masku);
iter = iter + 1;
fprintf('Iter: %d\n', iter);
fprintf('Best fitness: %e\n', min(fx));
end
fprintf('Best fitness: %e\n', min(fx));
[minf, min_index] = min(fx);
minx = x(min_index, :);
end
(I didn't write C++ function.)
#include <cstring>
#include <iostream>
#include <cmath>
#include <algorithm>
#include <ctime>
#include <iomanip>
#include <time.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#define rand_01 ((double) rand() / RAND_MAX) // generate 0~1 random numbers
#define PI 3.14159265359
const int numofdims = 500; // problem dimension
const int numofparticles = 250; // number of particles
const int halfswarm = numofparticles / 2;
const double phi = 0.1;
const int maxiter = 10000; // iteration number
double Sch(double X[], int d); // max(abs(x_i))
using namespace std;
int main(){
clock_t t1,t2;
t1=clock();
srand(time(0)); // random seed
double** X = new double*[numofparticles]; // X for storing all particles
for(int i=0; i<numofparticles; i++)
X[i] = new double[numofdims];
double** V = new double*[numofparticles]; // V for storing velocities
for(int i=0; i<numofparticles; i++)
V[i] = new double[numofdims];
double Xmin[numofdims] = {0}; // lower bounds
double Xmax[numofdims] = {0}; // upper bounds
double* fitnesses = new double[numofparticles]; // objective function values
for(int j=0; j<numofdims; j++)
{
Xmin[j] = -100;
Xmax[j] = 100;
}
for(int i=0; i<numofparticles; i++)
{
for(int j=0; j<numofdims; j++)
{
X[i][j] = Xmin[j] + rand_01 * (Xmax[j] - Xmin[j]); // initialize X
V[i][j] = 0; // initialize V
}
}
for(int i=0; i<numofparticles; i++)
{
fitnesses[i] = Sch(X[i], numofdims); //
}
double minfit = fitnesses[0]; // temporary minimal value
int minidx = 0; // temporary index of minimal value
int* idxofparticles = new int[numofparticles];
for(int i=0; i<numofparticles; i++)
idxofparticles[i] = i;
double* Xmean = new double[numofdims];
int* losers = new int[halfswarm]; // for saving losers indexes
for(int iter=0; iter<maxiter; iter++)
{
random_shuffle(idxofparticles, idxofparticles+numofparticles);
for(int j=0; j<numofdims; j++)
{
for(int i=0; i<numofparticles; i++)
{
Xmean[j] += X[i][j];
}
Xmean[j] = (double) Xmean[j] / numofparticles; // calculate swarm center
}
for(int i = 0; i < halfswarm; i++)
{
// indexes are now random
// compare 1st to (halfswarm+1)th, 2nd to (halfswarm+2)th, ...
if(fitnesses[idxofparticles[i]] < fitnesses[idxofparticles[i+halfswarm]])
{
losers[i] = idxofparticles[i+halfswarm];
for(int j = 0; j < numofdims; j++)
{
V[idxofparticles[i+halfswarm]][j] = rand_01 * V[idxofparticles[i+halfswarm]][j] + rand_01 * (X[idxofparticles[i]][j] - X[idxofparticles[i+halfswarm]][j]) + rand_01 * phi * (Xmean[j] - X[idxofparticles[i+halfswarm]][j]);
X[idxofparticles[i+halfswarm]][j] = min(max((X[idxofparticles[i+halfswarm]][j] + V[idxofparticles[i+halfswarm]][j]), Xmin[j]), Xmax[j]);
}
}
else
{
losers[i] = idxofparticles[i];
for(int j = 0; j < numofdims; j++)
{
V[idxofparticles[i]][j] = rand_01 * V[idxofparticles[i]][j] + rand_01 * (X[idxofparticles[i+halfswarm]][j] - X[idxofparticles[i]][j]) + rand_01 * phi * (Xmean[j] - X[idxofparticles[i]][j]);
X[idxofparticles[i]][j] = min(max((X[idxofparticles[i]][j] + V[idxofparticles[i]][j]), Xmin[j]), Xmax[j]);
}
}
}
// recalculate particles' values
for(int i=0; i<numofparticles; i++)
{
fitnesses[i] = Sch(X[i], numofdims);
if(fitnesses[i] < minfit)
{
minfit = fitnesses[i]; // update minimum
minidx = i; // update index
}
}
if(iter % 1000 == 0)
{
cout << scientific << endl;
cout << minfit << endl;
}
}
cout << scientific << endl;
cout << minfit << endl;
t2=clock();
delete [] X;
delete [] V;
delete [] fitnesses;
delete [] idxofparticles;
delete [] Xmean;
delete [] losers;
float diff ((float)t2-(float)t1);
float seconds = diff / CLOCKS_PER_SEC;
cout << "runtime: " << seconds << "s" <<endl;
return 0;
}
double Sch(double X[], int d)
{
double result=abs(X[0]);
for(int j=0; j<d; j++)
{
if(abs(X[j]) > result)
result = abs(X[j]);
}
return result;
}
So, finally, why can't my c++ code reproduce matlab's outcome? Thank you very much.
I have a 3007 x 1644 dimensional matrix of terms and documents. I am trying to assign weights to frequency of terms in each document so I'm using this log entropy formula http://en.wikipedia.org/wiki/Latent_semantic_indexing#Term_Document_Matrix (See entropy formula in the last row).
I'm successfully doing this but my code is running for >7 minutes.
Here's the code:
int N = mat.cols();
for(int i=1;i<=mat.rows();i++){
double gfi = sum(mat(i,colon()))(1,1); //sum of occurrence of terms
double g =0;
if(gfi != 0){// to avoid divide by zero error
for(int j = 1;j<=N;j++){
double tfij = mat(i,j);
double pij = gfi==0?0.0:tfij/gfi;
pij = pij + 1; //avoid log0
double G = (pij * log(pij))/log(N);
g = g + G;
}
}
double gi = 1 - g;
for(int j=1;j<=N;j++){
double tfij = mat(i,j) + 1;//avoid log0
double aij = gi * log(tfij);
mat(i,j) = aij;
}
}
Anyone have ideas how I can optimize this to make it faster? Oh and mat is a RealSparseMatrix from amlpp matrix library.
UPDATE
Code runs on Linux mint with 4gb RAM and AMD Athlon II dual core
Running time before change: > 7mins
After #Kereks answer: 4.1sec
Here's a very naive rewrite that removes some redundancies:
int const N = mat.cols();
double const logN = log(N);
for (int i = 1; i <= mat.rows(); ++i)
{
double const gfi = sum(mat(i, colon()))(1, 1); // sum of occurrence of terms
double g = 0;
if (gfi != 0)
{
for (int j = 1; j <= N; ++j)
{
double const pij = mat(i, j) / gfi + 1;
g += pij * log(pij);
}
g /= logN;
}
for (int j = 1; j <= N; ++j)
{
mat(i,j) = (1 - g) * log(mat(i, j) + 1);
}
}
Also make sure that the matrix data structure is sane (e.g. a flat array accessed in strides; not a bunch of dynamically allocated rows).
Also, I think the first + 1 is a bit silly. You know that x -> x * log(x) is continuous at zero with limit zero, so you should write:
double const pij = mat(i, j) / gfi;
if (pij != 0) { g += pij + log(pij); }
In fact, you might even write the first inner for loop like this, avoiding a division when it isn't needed:
for (int j = 1; j <= N; ++j)
{
if (double pij = mat(i, j))
{
pij /= gfi;
g += pij * log(pij);
}
}
I'm working on problem 9 in Project Euler:
There exists exactly one Pythagorean triplet for which a + b + c = 1000.
Find the product abc.
The following code I wrote uses Euclid's formula for generating primes. For some reason my code returns "0" as an answer; even though the variable values are correct for the first few loops. Since the problem is pretty easy, some parts of the code aren't perfectly optimized; I don't think that should matter. The code is as follows:
#include <iostream>
using namespace std;
int main()
{
int placeholder; //for cin at the end so console stays open
int a, b, c, m, n, k;
a = 0; b = 0; c = 0;
m = 0; n = 0; k = 0; //to prevent initialization warnings
int sum = 0;
int product = 0;
/*We will use Euclid's (or Euler's?) formula for generating primitive
*Pythagorean triples (a^2 + b^2 = c^2): For any "m" and "n",
*a = m^2 - n^2 ; b = 2mn ; c = m^2 + n^2 . We will then cycle through
*values of a scalar/constant "k", to make sure we didn't miss anything.
*/
//these following loops will increment m, n, and k,
//and see if a+b+c is 1000. If so, all loops will break.
for (int iii = 1; m < 1000; iii++)
{
m = iii;
for (int ii = 1; n < 1000; ii++)
{
n = ii;
for (int i = 1; k <=1000; i++)
{
sum = 0;
k = i;
a = (m*m - n*n)*k;
b = (2*m*n)*k;
c = (m*m + n*n)*k;
if (sum == 1000) break;
}
if (sum == 1000) break;
}
if (sum == 1000) break;
}
product = a * b * c;
cout << "The product abc of the Pythagorean triplet for which a+b+c = 1000 is:\n";
cout << product << endl;
cin >> placeholder;
return 0;
}
And also, is there a better way to break out of multiple loops without using "break", or is "break" optimal?
Here's the updated code, with only the changes:
for (m = 2; m < 1000; m++)
{
for (int n = 2; n < 1000; n++)
{
for (k = 2; (k < 1000) && (m > n); k++)
{
sum = 0;
a = (m*m - n*n)*k;
b = (2*m*n)*k;
c = (m*m + n*n)*k;
sum = a + b + c;
if ((sum == 1000) && (!(k==0))) break;
}
It still doesn't work though (now gives "1621787660" as an answer). I know, a lot of parentheses.
The new problem is that the solution occurs for k = 1, so starting your k at 2 misses the answer outright.
Instead of looping through different k values, you can just check for when the current sum divides 1000 evenly. Here's what I mean (using the discussed goto statement):
for (n = 2; n < 1000; n++)
{
for (m = n + 1; m < 1000; m++)
{
sum = 0;
a = (m*m - n*n);
b = (2*m*n);
c = (m*m + n*n);
sum = a + b + c;
if(1000 % sum == 0)
{
int k = 1000 / sum;
a *= k;
b *= k;
c *= k;
goto done;
}
}
}
done:
product = a * b * c;
I also switched around the two for loops so that you can just initialize m as being larger than n instead of checking every iteration.
Note that with this new method, the solution doesn't occur for k = 1 (just a difference in how the loops are run, this isn't a problem)
Presumably sum is supposed to be a + b + c. However, nowhere in your code do you actually do this, which is presumably your problem.
To answer the final question: Yes, you can use a goto. Breaking out of multiple nested loops is one of the rare occasions when it isn't considered harmful.
I'm trying to implement a matrix-vector Multiplication on GPU (using CUDA).
In my C++ code (CPU), I load the matrix as a dense matrix, and then I perform the matrix-vector multiplication using CUDA. I'm also using shared memory to improve the performance.
How can I load the matrix in an efficient way, knowing that my matrix is a sparse matrix?
Below is my C++ function to load the matrix:
int readMatrix( char* filename, float* &matrix, unsigned int *dim = NULL, int majority = ROW_MAJOR )
{
unsigned int w, h, x, y, num_entries;
float val;
std::ifstream file( filename );
if ( file )
{
file >> h >> w >> num_entries;
cout << w << " " << h << " " << num_entries << "\n";
assert( w == h || w == 1 || h == 1 );
if( dim != NULL ) *dim = std::max( w, h );
matrix = new float[ w * h ];
unsigned int i;
for( i = 0; i < num_entries; i++ ){
if( file.eof() ) break;
file >> y >> x >> val;
if( majority == ROW_MAJOR ){
matrix[ w * y + x ] = val;
} else if( majority == COLUMN_MAJOR ){
matrix[ h * x + y ] = val;
}
}
file.close();
if( i == num_entries )
std::cout << "\nFile read successfully\n";
else
std::cout << "\nFile read successfully but seems defective:\n num entries read = " << i << ", entries epected = " << num_entries << "\n";
// print first few elements
if( w == h ){
for( unsigned int i = 0; i < w; i++ ){
printf("\n");
for( unsigned int j = 0; j < h; j++ ){
printf("%.2f ", matrix[ j + w * i ] );
}
}
}
else{
printf("\n");
for( unsigned int j = 0; j < h; j++ ){
printf("%.2f ", matrix[ j ] );
}
}
} else {
std::cout << "Unable to open file\n";
return false;
}
return true;
}
Below is my CUDA Kernel function that handles the matrix-vector multiplication:
__global__ void
_cl_matrix_vector_( float *A, float *b, float *x, int dim )
{
extern __shared__ float vec[];
unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
float temp = 0.0;
int vOffs = 0;
//load vector into shared memory
for (int i = 0; i < (dim/blockDim.x) + 1 ; ++i, vOffs+= blockDim.x) {
vec[vOffs + threadIdx.x] = b[vOffs + threadIdx.x];
}
//make sure all threads are synchronized
__syncthreads();
if (idx < dim) {
temp = 0.0;
//dot product (multiplication)
for (int i = 0; i < dim; i++){
temp += A[idx * dim + i] * vec[i];
}
x[idx] = temp;
}
}
What are the necessary changes that I have to make on my CUDA code to take into account that my matrix is a sparse matrix?
I found out from a forum that we can also use padding to be able to optimize the performance, but this requires me to change the way I read the matrix / sort the matrix. Any ideas how to implement this padding in the way I read the matrix and perform the calculation?
This is a very old post and I want to highlight that cuSPARSE (since some time now) makes routines for the multiplication between sparse matrices or between a sparse matrix and a dense vector available.
For the csr format, the relevant routine for the multiplication between a sparse matrix and a dense vector is cusparse<t>csrmv. Below, a fully worked example showing its use.
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <assert.h>
#include "Utilities.cuh"
#include <cuda_runtime.h>
#include <cusparse_v2.h>
/********/
/* MAIN */
/********/
int main()
{
// --- Initialize cuSPARSE
cusparseHandle_t handle; cusparseSafeCall(cusparseCreate(&handle));
/**************************/
/* SETTING UP THE PROBLEM */
/**************************/
const int N = 4; // --- Number of rows and columns
// --- Host side dense matrices
double *h_A_dense = (double*)malloc(N * N * sizeof(double));
double *h_x_dense = (double*)malloc(N * sizeof(double));
double *h_y_dense = (double*)malloc(N * sizeof(double));
// --- Column-major ordering
h_A_dense[0] = 0.4612; h_A_dense[4] = -0.0006; h_A_dense[8] = 0.3566; h_A_dense[12] = 0.0;
h_A_dense[1] = -0.0006; h_A_dense[5] = 0.4640; h_A_dense[9] = 0.0723; h_A_dense[13] = 0.0;
h_A_dense[2] = 0.3566; h_A_dense[6] = 0.0723; h_A_dense[10] = 0.7543; h_A_dense[14] = 0.0;
h_A_dense[3] = 0.; h_A_dense[7] = 0.0; h_A_dense[11] = 0.0; h_A_dense[15] = 0.1;
// --- Initializing the data and result vectors
for (int k = 0; k < N; k++) {
h_x_dense[k] = 1.;
h_y_dense[k] = 0.;
}
// --- Create device arrays and copy host arrays to them
double *d_A_dense; gpuErrchk(cudaMalloc(&d_A_dense, N * N * sizeof(double)));
double *d_x_dense; gpuErrchk(cudaMalloc(&d_x_dense, N * sizeof(double)));
double *d_y_dense; gpuErrchk(cudaMalloc(&d_y_dense, N * sizeof(double)));
gpuErrchk(cudaMemcpy(d_A_dense, h_A_dense, N * N * sizeof(double), cudaMemcpyHostToDevice));
gpuErrchk(cudaMemcpy(d_x_dense, h_x_dense, N * sizeof(double), cudaMemcpyHostToDevice));
gpuErrchk(cudaMemcpy(d_y_dense, h_y_dense, N * sizeof(double), cudaMemcpyHostToDevice));
// --- Descriptor for sparse matrix A
cusparseMatDescr_t descrA; cusparseSafeCall(cusparseCreateMatDescr(&descrA));
cusparseSafeCall(cusparseSetMatType (descrA, CUSPARSE_MATRIX_TYPE_GENERAL));
cusparseSafeCall(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE));
int nnzA = 0; // --- Number of nonzero elements in dense matrix A
const int lda = N; // --- Leading dimension of dense matrix
// --- Device side number of nonzero elements per row of matrix A
int *d_nnzPerVectorA; gpuErrchk(cudaMalloc(&d_nnzPerVectorA, N * sizeof(*d_nnzPerVectorA)));
cusparseSafeCall(cusparseDnnz(handle, CUSPARSE_DIRECTION_ROW, N, N, descrA, d_A_dense, lda, d_nnzPerVectorA, &nnzA));
// --- Host side number of nonzero elements per row of matrix A
int *h_nnzPerVectorA = (int *)malloc(N * sizeof(*h_nnzPerVectorA));
gpuErrchk(cudaMemcpy(h_nnzPerVectorA, d_nnzPerVectorA, N * sizeof(*h_nnzPerVectorA), cudaMemcpyDeviceToHost));
printf("Number of nonzero elements in dense matrix A = %i\n\n", nnzA);
for (int i = 0; i < N; ++i) printf("Number of nonzero elements in row %i for matrix = %i \n", i, h_nnzPerVectorA[i]);
printf("\n");
// --- Device side sparse matrix
double *d_A; gpuErrchk(cudaMalloc(&d_A, nnzA * sizeof(*d_A)));
int *d_A_RowIndices; gpuErrchk(cudaMalloc(&d_A_RowIndices, (N + 1) * sizeof(*d_A_RowIndices)));
int *d_A_ColIndices; gpuErrchk(cudaMalloc(&d_A_ColIndices, nnzA * sizeof(*d_A_ColIndices)));
cusparseSafeCall(cusparseDdense2csr(handle, N, N, descrA, d_A_dense, lda, d_nnzPerVectorA, d_A, d_A_RowIndices, d_A_ColIndices));
// --- Host side sparse matrices
double *h_A = (double *)malloc(nnzA * sizeof(*h_A));
int *h_A_RowIndices = (int *)malloc((N + 1) * sizeof(*h_A_RowIndices));
int *h_A_ColIndices = (int *)malloc(nnzA * sizeof(*h_A_ColIndices));
gpuErrchk(cudaMemcpy(h_A, d_A, nnzA * sizeof(*h_A), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_A_RowIndices, d_A_RowIndices, (N + 1) * sizeof(*h_A_RowIndices), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_A_ColIndices, d_A_ColIndices, nnzA * sizeof(*h_A_ColIndices), cudaMemcpyDeviceToHost));
printf("\nOriginal matrix A in CSR format\n\n");
for (int i = 0; i < nnzA; ++i) printf("A[%i] = %f ", i, h_A[i]); printf("\n");
printf("\n");
for (int i = 0; i < (N + 1); ++i) printf("h_A_RowIndices[%i] = %i \n", i, h_A_RowIndices[i]); printf("\n");
printf("\n");
for (int i = 0; i < nnzA; ++i) printf("h_A_ColIndices[%i] = %i \n", i, h_A_ColIndices[i]);
printf("\n");
for (int i = 0; i < N; ++i) printf("h_x[%i] = %f \n", i, h_x_dense[i]); printf("\n");
const double alpha = 1.;
const double beta = 0.;
cusparseSafeCall(cusparseDcsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, N, nnzA, &alpha, descrA, d_A, d_A_RowIndices, d_A_ColIndices, d_x_dense,
&beta, d_y_dense));
gpuErrchk(cudaMemcpy(h_y_dense, d_y_dense, N * sizeof(double), cudaMemcpyDeviceToHost));
printf("\nResult vector\n\n");
for (int i = 0; i < N; ++i) printf("h_y[%i] = %f ", i, h_y_dense[i]); printf("\n");
}
You might want to have a look at the very good CUSP library. They implement sparse matrices in a variety of formats (coo, csr, ellpack, diagonal and a hybrid between ellpack and coo). Each with their own advantages as described in the documentation. Most of them are "standard" sparse matrix formats about which you can find more information online. Not a complete answer to your question perhaps, but it should provide a starting point.