MPI Gatherv with submatrices - c++

I'm having trouble getting MPI_Gatherv to work the way I intend, and was wondering whether those of you who are more experienced can see what I'm doing wrong.
I have a large matrix (TEST) of [N, M]. Each process does some work on a subset [nrows, M] (WORK_MATRIX) and then every process gathers these submatrices (along the row dimension) into the full matrix.
It seems like it doesn't gather any of the data, and I'm struggling to figure out why!
Here I'm using Eigen to wrap these (contiguous) matrices.
Output:
mpirun -np 5 ./pseudo.x
1 1 1 1 1
0 1 2 3 4
TEST: 5 10
0 0 2 0 0 0 0 0 0 0
1 1 2 0 0 0 0 0 0 0
2 2 2 0 0 0 0 0 0 0
3 2 2 0 0 0 0 0 0 0
4 2 0 0 0 0 0 0 0 0
I've created a simple version of the code below:
mpiicc -I/path/to/Eigen -o pseudo.x pseudo.cpp
#include <mpi.h>
#include <Eigen/Dense>
#include <iostream>
using namespace Eigen;
using namespace std;
int main(int argc, char ** argv) {
int RSIZE = 5;
int CSIZE = 10;
int rank;
int num_tasks;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &num_tasks);
MatrixXd TEST_MATRIX = MatrixXd::Zero(RSIZE, CSIZE);
VectorXi recv = VectorXi::Zero(num_tasks);
VectorXi displs = VectorXi::Zero(num_tasks);
int nrows = (RSIZE + rank) / num_tasks;
MPI_Allgather(&nrows, 1, MPI_INT, recv.data(), 1, MPI_INT, MPI_COMM_WORLD);
int start = 0;
for (int i = 0; i < rank; i++)
start += recv[i];
MPI_Allgather(&start, 1, MPI_INT, displs.data(), 1, MPI_INT, MPI_COMM_WORLD);
if (rank == 0) {
cout << recv.transpose() << endl;
cout << displs.transpose() << endl;
}
MatrixXd WORK_MATRIX = MatrixXd::Zero(nrows, CSIZE);
for (int row = 0; row < nrows; row++)
for (int col = 0; col < CSIZE; col++)
WORK_MATRIX(row, col) += rank;
MPI_Datatype rowsized, row;
int sizes[2] = { RSIZE, CSIZE };
int subsizes[2] = { nrows, CSIZE };
int starts[2] = { 0, 0 };
MPI_Type_create_subarray(2, sizes, subsizes, starts, MPI_ORDER_C, MPI_DOUBLE, &rowsized);
MPI_Type_create_resized(rowsized, 0, sizeof(double), &row);
MPI_Type_commit(&row);
MPI_Allgatherv(WORK_MATRIX.data(), recv[rank], row, TEST_MATRIX.data(), recv.data(), displs.data(), row, MPI_COMM_WORLD);
if (rank == 0) {
cout << "TEST: " << TEST_MATRIX.rows() << " " << TEST_MATRIX.cols() << endl;
for (int i = 0; i < TEST_MATRIX.rows(); i++) {
for (int j = 0; j < TEST_MATRIX.cols(); j++) {
cout << TEST_MATRIX(i, j) << " ";
}
cout << endl;
}
}
}

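In C, 2D matrices are stored by rows. Eigen, however, stores dense matrices in column-major order by default, so the advice below applies as written only if the matrices are declared row-major (for example with the Eigen::RowMajor option).
With row-major storage, each block of rows is contiguous, so you do not need to resize your datatypes. Displacements are counted in units of the receive datatype's extent, so with an extent of one double the displacement should be adjusted:
start += recv[i] * CSIZE;
As a matter of taste, you do not need the two MPI_Allgather() calls at all, since nrows and start can be computed locally on every rank.
I would rather suggest that you simply create a derived datatype for one row with MPI_Type_contiguous() (this type should not be resized), since MPI_Type_create_subarray() is overkill here.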
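To illustrate the point about computing nrows and start locally, here is a minimal sketch that derives the per-rank row counts and starting rows without any communication. It assumes the same split as the question, nrows = (RSIZE + rank) / num_tasks; RowLayout and make_row_layout are hypothetical names, not part of MPI or Eigen.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper type: how many rows each rank owns and where they start.
struct RowLayout {
    std::vector<int> counts;  // rows owned by each rank
    std::vector<int> displs;  // index of the first row owned by each rank
};

// Every rank can call this with the same arguments and obtain the same
// layout, so no MPI_Allgather is needed to exchange counts or offsets.
RowLayout make_row_layout(int total_rows, int num_tasks) {
    assert(total_rows >= 0 && num_tasks > 0);
    RowLayout layout;
    layout.counts.resize(num_tasks);
    layout.displs.resize(num_tasks);
    int start = 0;
    for (int r = 0; r < num_tasks; ++r) {
        // Same formula as the question: nrows = (RSIZE + rank) / num_tasks.
        layout.counts[r] = (total_rows + r) / num_tasks;
        layout.displs[r] = start;
        start += layout.counts[r];
    }
    // The counts produced by this formula add up to the total number of rows.
    assert(start == total_rows);
    return layout;
}
```

With a one-row datatype, counts.data() and displs.data() could be passed directly as the receive counts and displacements, since both are then expressed in rows.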

Related

Sorting a table C++

I have a table of numbers that look like this:
2 8 4 0
3 1 0 9
1 2 3 4
5 4 14 2
I put all the numbers in an array { 2,8,4,0,3,1... }. Is there a way to sort it by the first column only using a 1D array so that it ends up like this:
1 2 3 4
2 8 4 0
3 1 0 9
5 4 14 2
I know there's a way of doing it with a 2D array, but, assuming I know the number of columns, is it possible with only a 1D array?
I'd create an array of indexes into your data and then sort those indexes; this will save a decent number of copies.
Your sort would then examine the value of the number at the given index.
I.e. for your example, the indexes would be 1,2,3,4
and, once sorted, they would read 3,1,2,4
edit: this was 1-based; the code is 0-based. It makes no difference.
Essentially, this treats your 1D array as a 2D one. Since the bulk of your data is still contiguous (especially for large numbers of columns), reading should still be fast.
Example code:
#include <algorithm>
#include <numeric>
#include <vector>
// data is the flattened table; size is the number of columns per row
std::vector<int> getSortedIndexes(const std::vector<int>& data, int size) {
    int count = data.size() / size;
    std::vector<int> indexes(count);
    // fill in indexes from 0 to count - 1, one index per row
    std::iota(indexes.begin(), indexes.end(), 0);
    // don't write your own sorting implementation .... really; don't.
    // data is captured by reference so the lambda does not copy the vector
    std::sort(indexes.begin(), indexes.end(), [&data, size](int indexA, int indexB) {
        return data[indexA*size] < data[indexB*size];
    });
    return indexes;
}
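If the reordered table itself is needed rather than just the row order, the sorted indexes can be used to copy each row once into a new 1D array. The following is a sketch under the same assumptions (row-major data, `size` values per row); sortRowsByFirstColumn is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Hypothetical helper: returns a copy of `data` (row-major, `size` values per
// row) whose rows are ordered by their first column.
std::vector<int> sortRowsByFirstColumn(const std::vector<int>& data, int size) {
    assert(size > 0 && data.size() % size == 0);
    const int count = static_cast<int>(data.size()) / size;

    // Sort row indexes by the first value of each row.
    std::vector<int> indexes(count);
    std::iota(indexes.begin(), indexes.end(), 0);
    std::sort(indexes.begin(), indexes.end(), [&data, size](int a, int b) {
        return data[a * size] < data[b * size];
    });

    // Copy each row exactly once, in sorted order.
    std::vector<int> sorted;
    sorted.reserve(data.size());
    for (int row : indexes) {
        sorted.insert(sorted.end(),
                      data.begin() + row * size,
                      data.begin() + (row + 1) * size);
    }
    return sorted;
}
```

Each row is moved only once at the end, instead of being swapped repeatedly during the sort.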
For arrays of non-user defined types it is easy to do the task using the standard C function qsort.
Here is a demonstrative program.
#include <iostream>
#include <cstdlib>
int cmp( const void *a, const void *b )
{
const int *left = static_cast<const int *>( a );
const int *right = static_cast<const int *>( b );
return ( *right < *left ) - ( *left < *right );
}
int main()
{
const size_t N = 4;
int a[N * N] =
{
2, 8, 4, 0, 3, 1, 0, 9, 1, 2, 3, 4, 5, 4, 14, 2
};
for ( size_t i = 0; i < N; i++ )
{
for ( size_t j = 0; j < N; j++ )
{
std::cout << a[N * i + j] << ' ';
}
std::cout << '\n';
}
std::cout << '\n';
std::qsort( a, N, sizeof( int[N] ), cmp );
for ( size_t i = 0; i < N; i++ )
{
for ( size_t j = 0; j < N; j++ )
{
std::cout << a[N * i + j] << ' ';
}
std::cout << '\n';
}
std::cout << '\n';
}
The program output is
2 8 4 0
3 1 0 9
1 2 3 4
5 4 14 2
1 2 3 4
2 8 4 0
3 1 0 9
5 4 14 2
So all you need is to write the function
int cmp( const void *a, const void *b )
{
const int *left = static_cast<const int *>( a );
const int *right = static_cast<const int *>( b );
return ( *right < *left ) - ( *left < *right );
}
and add just one line in your program
std::qsort( a, N, sizeof( int[N] ), cmp );
You can use bubblesort:
void sort_by_name(int* ValueArray, int NrOfValues, int RowWidth)
{
int CycleCount = NrOfValues / RowWidth;
int temp;
for (int j = 0; j < CycleCount ; j++)
{
for (int i = 1; i < (CycleCount - j); i++)
{
if(ValueArray[((i-1)*RowWidth)] > ValueArray[(i*RowWidth)])
{
for(int k = 0; k<RowWidth; k++)
{
temp = ValueArray[(i*RowWidth)+k];
ValueArray[(i*RowWidth)+k] = ValueArray[((i-1)*RowWidth)+k];
ValueArray[((i-1)*RowWidth)+k] = temp;
}
}
}
}
}
Keep in mind that simply making your array 2D would be a MUCH BETTER solution.
edit: variable naming
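As a rough illustration of the remark that a 2D layout is the better solution, here is a sketch that stores each row as a std::array and lets the standard library do the sorting. The 4-column Row type matches the example table and is an assumption; sortByFirstColumn is a hypothetical name.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// One table row; 4 columns, matching the example in the question.
using Row = std::array<int, 4>;

// Sorts whole rows by their first column. stable_sort keeps rows with equal
// first values in their original relative order.
void sortByFirstColumn(std::vector<Row>& table) {
    auto byFirst = [](const Row& a, const Row& b) { return a[0] < b[0]; };
    std::stable_sort(table.begin(), table.end(), byFirst);
    assert(std::is_sorted(table.begin(), table.end(), byFirst));
}
```

Because a std::vector of std::array keeps its elements contiguous, the data still sits in one block of memory, just as in the 1D version.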

MPI_Scatter issue, can't scatter and gather a picture matrix

Hello, I have a problem with my C++ code. I'm trying to build a parallel implementation of my sequential Sobel operator code using OpenCV.
My idea is to scatter the picture using a 2D buffer, apply the Sobel operation to averaged_rows*cols elements on each rank, and then gather the results. Once I have sent averaged_rows and every rank has received it, I try to use MPI_Scatter and this execution error appears:
sent to 1
sent to 2
sent to 3
recieved by 1
recieved by 2
recieved by 3
[roronoasins-GL552VW:3245] *** An error occurred in MPI_Scatter
[roronoasins-GL552VW:3245] *** reported by process [1759117313,1]
[roronoasins-GL552VW:3245] *** on communicator MPI_COMM_WORLD
[roronoasins-GL552VW:3245] *** MPI_ERR_TRUNCATE: message truncated
[roronoasins-GL552VW:3245] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[roronoasins-GL552VW:3245] *** and potentially your MPI job)
[roronoasins-GL552VW:03239] 2 more processes have sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[roronoasins-GL552VW:03239] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
What I actually do is scatter the pic buffer, broadcast the whole picture to the other ranks, and then gather:
MPI_Scatter(pic, cols*rows_av, MPI_INT, picAux, cols*rows_av, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast (pic3, cols*rows, MPI_INT, 0, MPI_COMM_WORLD);
int ip_gx, ip_gy, sum;
for(int y = ip*pic_struct[2]; y < (ip+1)*pic_struct[2] -1; y++){
for(int x = 1; x < pic_struct[1]- 1; x++){
int gx = x_gradient(pic3, x, y);
int gy = y_gradient(pic3, x, y);
int sum = abs(gx) + abs(gy);
sum = sum > 255 ? 255:sum;
sum = sum < 0 ? 0 : sum;
picAux[y][x] = sum;
}
}
MPI_Gather(picAux, cols*rows_av, MPI_INT, pic, cols*rows_av, MPI_INT, 0, MPI_COMM_WORLD);
I'd like to know what is happening with the Scatter function. I thought that I could scatter individual pieces of the picture to the other ranks to calculate Sobel, but maybe I'm wrong.
My code is here if you want to check it. Thanks for your time.
// > compile with mpic++ mpi_sobel.cpp -o mpi_sobel `pkg-config --libs opencv` -fopenmp -lstdc++
#include <iostream>
#include <cmath>
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <omp.h>
#include <mpi.h>
using namespace std;
using namespace cv;
Mat src, dst;
/*
Computes the x component of the gradient vector
at a given point in a image.
returns gradient in the x direction
| 1 0 -1 |
Gx = | 2 0 -2 |
| 1 0 -1 |
*/
int x_gradient(int** image, int x, int y)
{
return image[y-1][x-1] +
2*image[y][x-1] +
image[y+1][x-1] -
image[y-1][x+1] -
2*image[y][x+1] -
image[y+1][x+1];
}
/*
Computes the y component of the gradient vector
at a given point in a image
returns gradient in the y direction
| 1 2 1 |
Gy = | 0 0 0 |
|-1 -2 -1 |
*/
int y_gradient(int** image, int x, int y)
{
return image[y+1][x-1] +
2*image[y+1][x] +
image[y+1][x+1] -
image[y-1][x-1] -
2*image[y-1][x] -
image[y-1][x+1];
}
int main(int argc, char** argv)
{
string picture;
if (argc == 2) {
picture = argv[1];
src = imread(argv[1], CV_LOAD_IMAGE_GRAYSCALE);
}
else {
picture = "input/logan.jpg";
src = imread(picture.c_str(), CV_LOAD_IMAGE_GRAYSCALE);
}
if( !src.data )
{ return -1; }
dst.create(src.rows, src.cols, src.type());
int rows_av, rows_extra;
Size s = src.size();
int rows = s.height;
int cols = s.width;
int pic[rows][cols];int picAux[rows][cols];
int ** pic3;
pic3 = new int*[rows];
for(int y = 0; y < rows; y++)
pic3[y] = new int[cols];
int pic_struct[3], pic_struct_recv[3];
int np, ip;
double start_time = omp_get_wtime();
if (MPI_Init(&argc, &argv) != MPI_SUCCESS){
exit(1);
}
MPI_Comm_size(MPI_COMM_WORLD, &np);
MPI_Comm_rank(MPI_COMM_WORLD, &ip);
MPI_Status status;
if(ip==0)
{
for(int y = 0; y < rows ; y++)
for(int x = 0; x < cols; x++)
{
pic3[y][x] = src.at<uchar>(y,x);
pic[y][x] = 0;
picAux[y][x] = 0;
}
src.release();
rows_av = rows/np;
//cols_av = cols/np;
pic_struct[0] = rows;
pic_struct[1] = cols;
pic_struct[2] = rows_av;
//pic_struct[3] = cols_av:
for(int i=1; i < np; i++)
{
//rows = (i <= rows_extra) ? rows_av+1 : rows_av;
pic_struct[0] = rows;
MPI_Send(&pic_struct, sizeof(pic_struct), MPI_BYTE, i, 0, MPI_COMM_WORLD);
cout << "sent to " << i << endl;
}
}else{//ip
MPI_Recv(&pic_struct, sizeof(pic_struct), MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
cout << "recieved by " << ip << endl;
}
MPI_Scatter(pic, cols*rows_av, MPI_INT, picAux, cols*rows_av, MPI_INT, 0, MPI_COMM_WORLD);
MPI_Bcast (pic3, cols*rows, MPI_INT, 0, MPI_COMM_WORLD);
cout << "bcast" << endl;
//MPI_Barrier(MPI_COMM_WORLD);
int ip_gx, ip_gy, sum;
for(int y = ip*pic_struct[2]; y < (ip+1)*pic_struct[2] -1; y++){
for(int x = 1; x < pic_struct[1]- 1; x++){
ip_gx = x_gradient(src, x, y);
ip_gy = y_gradient(src, x, y);
sum = abs(ip_gx) + abs(ip_gy);
sum = sum > 255 ? 255:sum;
sum = sum < 0 ? 0 : sum;
picAux[y][x] = sum;
}
}
MPI_Gather(picAux, cols*rows_av, MPI_INT, pic, cols*rows_av, MPI_INT, 0, MPI_COMM_WORLD);
cout << "gather" << endl;
MPI_Finalize();
if(!ip)
{
double time = omp_get_wtime() - start_time;
for( int i = 0 ; i < rows ; i++ )
{
delete [] pic3[i] ;
delete [] pic3 ;
}
cout << "Number of processes: " << np << endl;
cout << "Rows, Cols: " << rows << " " << cols << endl;
cout << "Rows, Cols(Division): " << rows_av << ", " << cols << endl << endl;
cout << "Processing time: " << time << endl;
for(int i=0; i < 6 ; i++) picture.erase(picture.begin());
for(int i=0; i < 4 ; i++) picture.pop_back();
picture.insert(0,"output/");
picture += "-sobel.jpg";
for(int y = 0; y < rows; y++)
for(int x = 0; x < cols; x++)
dst.at<uchar>(y,x) = pic[y][x];
if(imwrite(picture.c_str(), dst)) cout << "Picture correctly saved as " << picture << endl;
else cout << "\nError has occurred being saved." << endl;
}
return 0;
}
Update: I had forgotten to set rows_av on ranks != 0, and the sending of pic3 is fixed. I've packed src into a contiguous buffer and it is correct on each rank.
updated code here: https://pastebin.com/jPV9mGFW
I have noticed that in the dark three quarters of the image there is noise. With this new issue I don't know whether the gathering is the problem now, or whether I am doing the operations with number_process*rows/total_processes wrong.
MPI_Scatter(pic, cols*rows_av, MPI_UNSIGNED_CHAR, picAux, cols*rows_av, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
int ip_gx, ip_gy, sum;
for(int y = ip*rows_av+1; y < (ip+1)*rows_av-1; y++){
for(int x = 1; x < cols ; x++){
ip_gx = x_gradient(src, x, y);
ip_gy = y_gradient(src, x, y);
sum = abs(ip_gx) + abs(ip_gy);
sum = sum > 255 ? 255:sum;
sum = sum < 0 ? 0 : sum;
picAux[y][x] = sum;
//picAux[y*rows_av+x] = sum;
}
}
MPI_Gather(picAux, cols*rows_av, MPI_UNSIGNED_CHAR, pic, cols*rows_av, MPI_UNSIGNED_CHAR, 0, MPI_COMM_WORLD);
The loop is updated and the image is now fully calculated, but I can't use images bigger than 2048x1536.
for(int y = 1; y < rows_av-1; y++){
for(int x = 1; x < cols ; x++){
ip_gx = x_gradient(src, x, ip*rows_av+y);
ip_gy = y_gradient(src, x, ip*rows_av+y);
sum = abs(ip_gx) + abs(ip_gy);
sum = sum > 255 ? 255:sum;
sum = sum < 0 ? 0 : sum;
picAux[y*cols+x] = sum;
}
}
How could I send images larger than 2048x1536?
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node roronoasins-GL552VW exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
The image size issue was caused by the limited stack size. With ulimit -s unlimited it works fine, but I'm now working on improving memory efficiency. The latest code will be updated in the pastebin above.
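Since the limit came from the stack, one way to avoid depending on ulimit is to keep the pixels in a single heap allocation. The sketch below is an assumption about how that could look, not the code from the pastebin; Image is a hypothetical type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper type: a grayscale image whose pixels live in one
// contiguous heap allocation instead of a stack array such as pic[rows][cols].
struct Image {
    int rows;
    int cols;
    std::vector<unsigned char> pixels;  // rows * cols values, row-major

    Image(int r, int c)
        : rows(r), cols(c),
          pixels(static_cast<std::size_t>(r) * static_cast<std::size_t>(c), 0) {}

    // Bounds-checked access to pixel (y, x).
    unsigned char& at(int y, int x) {
        assert(y >= 0 && y < rows && x >= 0 && x < cols);
        return pixels[static_cast<std::size_t>(y) * cols + x];
    }

    // Pointer to the first pixel, usable as a contiguous send/receive buffer.
    unsigned char* data() { return pixels.data(); }
};
```

Because the storage is contiguous, data() could serve as the buffer argument of MPI_Scatter and MPI_Gather with MPI_UNSIGNED_CHAR, and the size is then bounded by available memory rather than by the stack limit.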

Fill 2-dimensional array with zeros by flipping groups of cells

There is a problem where I need to fill an array with zeros, with the following assumptions:
in the array there can only be 0 and 1
we can only change 0 to 1 and 1 to 0
when we meet 1 in array, we have to change it to 0, such that its neighbours are also changed, for instance, for the array like the one below:
1 0 1
1 1 1
0 1 0
When we change element at (1,1), we then got the array like this:
1 1 1
0 0 0
0 0 0
We can't change the first row
We can only change the elements that are in the array
The final result is the number of times we have to change 1 to 0 to zero out the array
1) First example, array is like this one below:
0 1 0
1 1 1
0 1 0
the answer is 1.
2) Second example, array is like this one below:
0 1 0 0 0 0 0 0
1 1 1 0 1 0 1 0
0 0 1 1 0 1 1 1
1 1 0 1 1 1 0 0
1 0 1 1 1 0 1 0
0 1 0 1 0 1 0 0
The answer is 10.
There can also be situations in which it's impossible to zero out the array; then the answer should be "impossible".
Somehow I can't get this working: for the first example, I got the right answer (1) but for the second example, program says impossible instead of 10.
Any ideas what's wrong in my code?
#include <iostream>
using namespace std;
int main(int argc, char **argv)
{
int n,m;
cin >> n >> m;
bool tab[n][m];
for(int i=0; i<n; i++)
for(int j=0; j<m; j++)
cin >> tab[i][j];
int counter = 0;
for(int i=0; i<n-1; i++)
{
for(int j=0; j<m-1; j++)
{
if(tab[i][j] == 1 && i > 0 && j > 0)
{
tab[i-1][j] = !tab[i-1][j];
tab[i+1][j] = !tab[i+1][j];
tab[i][j+1] = !tab[i][j+1];
tab[i][j-1] = !tab[i][j-1];
tab[i][j] = !tab[i][j];
counter ++;
}
}
}
bool impossible = 0;
for(int i=0; i<n; i++)
{
for(int j=0; j<m; j++)
{
if(tab[i][j] == 1)
{
cout << "impossible\n";
impossible = 1;
break;
}
}
if(impossible)
break;
}
if(!impossible)
cout << counter << "\n";
return 0;
}
I believe that the reason your program was returning impossible for the 6x8 matrix is that you traverse it left to right and top to bottom, replacing every 1 you encounter with 0. Although this might have seemed like the right solution, all it did was scatter the 1s and 0s around the matrix by modifying its neighboring values. I think the way to approach this problem is to go from bottom to top and right to left, pushing the 1s towards the first row, in a way cornering (trapping) them until they can be eliminated.
Anyway, here's my solution to this problem. I'm not entirely sure if this is what you were going after, but I think it does the job for the three matrices you provided. The code is not very sophisticated and it would be nice to test it with some harder problems to see if it truly works.
#include <iostream>
static unsigned counter = 0;
template<std::size_t M, std::size_t N>
void print( const bool (&mat) [M][N] )
{
for (std::size_t i = 0; i < M; ++i)
{
for (std::size_t j = 0; j < N; ++j)
std::cout<< mat[i][j] << " ";
std::cout<<std::endl;
}
std::cout<<std::endl;
}
template<std::size_t M, std::size_t N>
void flipNeighbours( bool (&mat) [M][N], unsigned i, unsigned j )
{
mat[i][j-1] = !(mat[i][j-1]);
mat[i][j+1] = !(mat[i][j+1]);
mat[i-1][j] = !(mat[i-1][j]);
mat[i+1][j] = !(mat[i+1][j]);
mat[i][j] = !(mat[i][j]);
++counter;
}
template<std::size_t M, std::size_t N>
bool checkCornersForOnes( const bool (&mat) [M][N] )
{
return (mat[0][0] || mat[0][N-1] || mat[M-1][0] || mat[M-1][N-1]);
}
template<std::size_t M, std::size_t N>
bool isBottomTrue( bool (&mat) [M][N], unsigned i, unsigned j )
{
return (mat[i+1][j]);
}
template<std::size_t M, std::size_t N>
bool traverse( bool (&mat) [M][N] )
{
if (checkCornersForOnes(mat))
{
std::cout<< "-Found 1s in the matrix corners." <<std::endl;
return false;
}
for (std::size_t i = M-2; i > 0; --i)
for (std::size_t j = N-2; j > 0; --j)
if (isBottomTrue(mat,i,j))
flipNeighbours(mat,i,j);
std::size_t count_after_traversing = 0;
for (std::size_t i = 0; i < M; ++i)
for (std::size_t j = 0; j < N; ++j)
count_after_traversing += mat[i][j];
if (count_after_traversing > 0)
{
std::cout<< "-Found <"<<count_after_traversing<< "> 1s in the matrix." <<std::endl;
return false;
}
return true;
}
#define MATRIX matrix4
int main()
{
bool matrix1[3][3] = {{1,0,1},
{1,1,1},
{0,1,0}};
bool matrix2[3][3] = {{0,1,0},
{1,1,1},
{0,1,0}};
bool matrix3[5][4] = {{0,1,0,0},
{1,0,1,0},
{1,1,0,1},
{1,1,1,0},
{0,1,1,0}};
bool matrix4[6][8] = {{0,1,0,0,0,0,0,0},
{1,1,1,0,1,0,1,0},
{0,0,1,1,0,1,1,1},
{1,1,0,1,1,1,0,0},
{1,0,1,1,1,0,1,0},
{0,1,0,1,0,1,0,0}};
std::cout<< "-Problem-" <<std::endl;
print(MATRIX);
if (traverse( MATRIX ) )
{
std::cout<< "-Answer-"<<std::endl;
print(MATRIX);
std::cout<< "Num of flips = "<<counter <<std::endl;
}
else
{
std::cout<< "-The Solution is impossible-"<<std::endl;
print(MATRIX);
}
}
Output for matrix1:
-Problem-
1 0 1
1 1 1
0 1 0
-Found 1s in the matrix corners.
-The Solution is impossible-
1 0 1
1 1 1
0 1 0
Output for matrix2:
-Problem-
0 1 0
1 1 1
0 1 0
-Answer-
0 0 0
0 0 0
0 0 0
Num of flips = 1
Output for matrix3:
-Problem-
0 1 0 0
1 0 1 0
1 1 0 1
1 1 1 0
0 1 1 0
-Found <6> 1s in the matrix.
-The Solution is impossible-
0 1 1 0
1 0 1 1
0 0 0 0
0 0 0 1
0 0 0 0
Output for matrix4 (which addresses your original question):
-Problem-
0 1 0 0 0 0 0 0
1 1 1 0 1 0 1 0
0 0 1 1 0 1 1 1
1 1 0 1 1 1 0 0
1 0 1 1 1 0 1 0
0 1 0 1 0 1 0 0
-Answer-
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0
Num of flips = 10
Ok, here comes my somewhat different attempt.
Idea
Note: I assume here that "We can't change the first row" means "We can't change the outmost row".
Some terminology:
By toggling a bit I mean changing its value from 0 to 1 or from 1 to 0.
By flipping a bit I mean toggling said bit and the 4 bits around it.
The act of toggling a bit is commutative. That is, it does not matter in what order we toggle it; the end result will always be the same (this is a trivial statement). This means that flipping is also a commutative action, and we are free to flip bits in any order we like.
The only way to toggle a value on the edge of the matrix is by flipping the bit right next to it an odd number of times. As we're looking for the lowest possible number of flips, we want to flip it at most once. So, in a scenario like the one below, x will need to be flipped exactly once, and y will need to be flipped exactly 0 times.
. .
1 x
0 y
. .
From this we can draw two conclusions:
A corner of the matrix can never be toggled: if a 1 is found in a corner, no number of flips can make the matrix zero. Your first example can thus be discarded without even flipping a single bit.
A bit next to a corner must have the same value as the bit on the other side of that corner. The matrix that you posted in a comment can thus also be discarded without flipping a single bit (bottom right corner).
Two examples of the conditions above:
0 1 .
0 x .
. . .
Not possible, as x needs to be flipped exactly once and exactly zero times.
0 1 .
1 x .
. . .
Possible, x needs to be flipped exactly once.
Algorithm
We can now make a recursive argument, and I propose the following:
1) We are given an m by n matrix.
2) Check the corner conditions stated above (i.e. corner != 1, and the bits next to a corner have to have the same value). If either criterion is violated, return impossible.
3) Go around the edge of the matrix. If a 1 is encountered, flip the closest bit inside, and add 1 to the counter.
4) Restart from #1 with an m - 2 by n - 2 matrix (top and bottom rows removed, left and right columns removed) if either dimension is > 2; otherwise print the counter and quit.
Implementation
Initially I had thought this would turn out nice and pretty, but truth be told it is a little more cumbersome than I originally thought it would be as we have to keep track of a lot of indices. Please ask questions if you're wondering about the implementation, but it is in essence a pure translation of the steps above.
#include <iostream>
#include <vector>
using Matrix = std::vector<std::vector<int>>;
void flip_bit(Matrix& mat, int i, int j, int& counter)
{
mat[i][j] = !mat[i][j];
mat[i - 1][j] = !mat[i - 1][j];
mat[i + 1][j] = !mat[i + 1][j];
mat[i][j - 1] = !mat[i][j - 1];
mat[i][j + 1] = !mat[i][j + 1];
++counter;
}
int flip(Matrix& mat, int n, int m, int p = 0, int counter = 0)
{
// I use p for 'padding', i.e. 0 means the full array, 1 means the outmost edge taken away, 2 the 2 most outmost edges, etc.
// max indices of the sub-array
int np = n - p - 1;
int mp = m - p - 1;
// Checking corners
if (mat[p][p] || mat[np][p] || mat[p][mp] || mat[np][mp] || // condition #1
(mat[p + 1][p] != mat[p][p + 1]) || (mat[np - 1][p] != mat[np][p + 1]) || // condition #2
(mat[p + 1][mp] != mat[p][mp - 1]) || (mat[np - 1][mp] != mat[np][mp - 1]))
return -1;
// We walk over all edge values that are *not* corners and
// flipping the bit that are *inside* the current bit if it's 1
for (int j = p + 1; j < mp; ++j) {
if (mat[p][j]) flip_bit(mat, p + 1, j, counter);
if (mat[np][j]) flip_bit(mat, np - 1, j, counter);
}
for (int i = p + 1; i < np; ++i) {
if (mat[i][p]) flip_bit(mat, i, p + 1, counter);
if (mat[i][mp]) flip_bit(mat, i, mp - 1, counter);
}
// Finished or flip the next sub-array?
if (np == 1 || mp == 1)
return counter;
else
return flip(mat, n, m, p + 1, counter);
}
int main()
{
int n, m;
std::cin >> n >> m;
Matrix mat(n, std::vector<int>(m, 0));
for (int i = 0; i < n; ++i) {
for (int j = 0; j < m; ++j) {
std::cin >> mat[i][j];
}
}
int counter = flip(mat, n, m);
if (counter < 0)
std::cout << "impossible" << std::endl;
else
std::cout << counter << std::endl;
}
Output
3 3
1 0 1
1 1 1
0 1 0
impossible
3 3
0 1 0
1 1 1
0 1 0
1
6 8
0 1 0 0 0 0 0 0
1 1 1 0 1 0 1 0
0 0 1 1 0 1 1 1
1 1 0 1 1 1 0 0
1 0 1 1 1 0 1 0
0 1 0 1 0 1 0 0
10
5 4
0 1 0 0
1 0 1 0
1 1 0 1
1 1 1 0
1 1 1 0
impossible
If tab[0][j] is 1, you have to toggle tab[1][j] to clear it. You then cannot toggle row 1 without unclearing row 0. So it seems like a reduction step. You repeat the step until there is one row left. If that last row is not clear by luck, my intuition is that it's the "impossible" case.
#include <memory>
template <typename Elem>
class Arr_2d
{
public:
Arr_2d(unsigned r, unsigned c)
: rows_(r), columns_(c), data(new Elem[rows_ * columns_]) { }
Elem * operator [] (unsigned row_idx)
{ return(data.get() + (row_idx * columns_)); }
unsigned rows() const { return(rows_); }
unsigned columns() const { return(columns_); }
private:
const unsigned rows_, columns_;
std::unique_ptr<Elem []> data;
};
inline void toggle_one(bool &b) { b = !b; }
void toggle(Arr_2d<bool> &tab, unsigned row, unsigned column)
{
toggle_one(tab[row][column]);
if (column > 0)
toggle_one(tab[row][column - 1]);
if (row > 0)
toggle_one(tab[row - 1][column]);
if (column < (tab.columns() - 1))
toggle_one(tab[row][column + 1]);
if (row < (tab.rows() - 1))
toggle_one(tab[row + 1][column]);
}
int solve(Arr_2d<bool> &tab)
{
int count = 0;
unsigned i = 0;
for ( ; i < (tab.rows() - 1); ++i)
for (unsigned j = 0; j < tab.columns(); ++j)
if (tab[i][j])
{
toggle(tab, i + 1, j);
++count;
}
for (unsigned j = 0; j < tab.columns(); ++j)
if (tab[i][j])
// Impossible.
return(-count);
return(count);
}
unsigned ex1[] = {
0, 1, 0,
1, 1, 1,
0, 1, 0
};
unsigned ex2[] = {
0, 1, 0, 0, 0, 0, 0, 0,
1, 1, 1, 0, 1, 0, 1, 0,
0, 0, 1, 1, 0, 1, 1, 1,
1, 1, 0, 1, 1, 1, 0, 0,
1, 0, 1, 1, 1, 0, 1, 0,
0, 1, 0, 1, 0, 1, 0, 0
};
Arr_2d<bool> load(unsigned rows, unsigned columns, const unsigned *data)
{
Arr_2d<bool> res(rows, columns);
for (unsigned i = 0; i < rows; ++i)
for (unsigned j = 0; j < columns; ++j)
res[i][j] = !!*(data++);
return(res);
}
#include <iostream>
int main()
{
{
Arr_2d<bool> tab = load(3, 3, ex1);
std::cout << solve(tab) << '\n';
}
{
Arr_2d<bool> tab = load(6, 8, ex2);
std::cout << solve(tab) << '\n';
}
return(0);
}
The problem is stated like this:
 y
yxy If you flip x, then you have to flip all the ys
 y
But it's easy if you think about it like this:
 x
yyy If you flip x, then you have to flip all the ys
 y
It's the same thing, but now the solution is obvious -- You must flip all the 1s in row 0, which will flip some bits in rows 1 and 2, then you must flip all the 1s in row 1, etc, until you get to the end.
If this is indeed the Lights Out game, then there are plenty of resources that detail how to solve the game. It is also quite likely that this is a duplicate of Lights out game algorithm, as has already been mentioned by other posters.
Let's see if we can't solve the first sample puzzle provided, however, and at least present a concrete description of an algorithm.
The initial puzzle appears to be solvable:
1 0 1
1 1 1
0 1 0
The trick is that you can clear 1's in the top row by changing the values in the row underneath them. I'll provide coordinates by row and column, using a 1-based offset, meaning that the top left value is (1, 1) and the bottom right value is (3, 3).
Change (2, 1), then (2, 3), then (3, 2). I'll show the intermediate states of the board with the * for the cell being changed in the next step.
1 0 1  (2,1)  0 0 1  (2,3)  0 0 0  (3,2)  0 0 0
* 1 1 ------> 0 0 * ------> 0 1 0 ------> 0 0 0
0 1 0         1 1 0         1 * 1         0 0 0
This board can be solved, and the number of moves appears to be 3.
The pseudo-algorithm is as follows:
flipCount = 0
for each row _below_ the top row:
    for each element in the current row:
        if the element in the row above is 1, toggle the element in this row:
            increment flipCount
if the board is clear, output flipCount
if the board isn't clear, output "Impossible"
I hope this helps; I can elaborate further if required but this is the core of the standard lights out solution. BTW, it is related to Gaussian Elimination; linear algebra crops up in some odd situations :)
Finally, in terms of what is wrong with your code, it appears to be the following loop:
for(int i=0; i<n-1; i++)
{
for(int j=0; j<m-1; j++)
{
if(tab[i][j] == 1 && i > 0 && j > 0)
{
tab[i-1][j] = !tab[i-1][j];
tab[i+1][j] = !tab[i+1][j];
tab[i][j+1] = !tab[i][j+1];
tab[i][j-1] = !tab[i][j-1];
tab[i][j] = !tab[i][j];
counter ++;
}
}
}
Several issues occur to me, but first assumptions again:
i refers to the ith row and there are n rows
j refers to the jth column and there are m columns
I'm now referring to indices that start from 0 instead of 1
If this is the case, then the following is observed:
1) You could run your for i loop from 1 instead of 0, which means you no longer have to check whether i > 0 in the if statement
2) You should drop the j > 0 check in the if statement; that check means that you can't flip anything in the first column
3) You need to change the n-1 in the for i loop, as you need to run this for the final row
4) You need to change the m-1 in the for j loop, as you need to run this for the final column (see point 2 also)
5) You need to check the cell in the row above the current row, so you should be checking tab[i-1][j] == 1
6) Now you need to add bounds tests for j-1, j+1 and i+1 to avoid reading outside valid ranges of the matrix
Put these together and you have:
for(int i=1; i<n; i++)
{
for(int j=0; j<m; j++)
{
if(tab[i-1][j] == 1)
{
tab[i-1][j] = !tab[i-1][j];
if (i+1 < n)
tab[i+1][j] = !tab[i+1][j];
if (j+1 < m)
tab[i][j+1] = !tab[i][j+1];
if (j > 0)
tab[i][j-1] = !tab[i][j-1];
tab[i][j] = !tab[i][j];
counter ++;
}
}
}
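For completeness, the corrected loop can be combined with the final "is the board clear" check from the pseudo-algorithm. The sketch below assumes this answer's reading of the rules (any cell below the top row may be flipped, including cells on the edges) and a rectangular grid; Grid, toggleCell and chaseRows are hypothetical names.

```cpp
#include <cassert>
#include <vector>

using Grid = std::vector<std::vector<int>>;

// Toggle one cell, ignoring coordinates that fall outside the grid.
void toggleCell(Grid& tab, int i, int j) {
    const int n = static_cast<int>(tab.size());
    if (i < 0 || i >= n) return;
    const int m = static_cast<int>(tab[i].size());
    if (j < 0 || j >= m) return;
    tab[i][j] = !tab[i][j];
}

// Returns the number of flips, or -1 if the last row cannot be cleared.
// The grid is taken by value, so the caller's copy is left untouched.
int chaseRows(Grid tab) {
    const int n = static_cast<int>(tab.size());
    int counter = 0;
    for (int i = 1; i < n; ++i) {
        assert(tab[i].size() == tab[i - 1].size());  // rectangular grid assumed
        const int m = static_cast<int>(tab[i].size());
        for (int j = 0; j < m; ++j) {
            if (tab[i - 1][j] == 1) {
                // Flip cell (i, j) together with its four neighbours.
                toggleCell(tab, i - 1, j);
                toggleCell(tab, i + 1, j);
                toggleCell(tab, i, j - 1);
                toggleCell(tab, i, j + 1);
                toggleCell(tab, i, j);
                ++counter;
            }
        }
    }
    // No row below the last one can clear it, so any remaining 1 is fatal.
    if (n > 0) {
        for (int v : tab[n - 1]) {
            if (v != 0) return -1;
        }
    }
    return counter;
}
```

Under these assumptions the function gives 1 and 10 for the two examples in the question, and 3 for the board walked through above.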
A little class that can either take an input file or test all possible combinations (starting from a first row with only zeros) on a 6x5 matrix:
#include <iostream>
#include <fstream>
#include <vector>
#include <string>
#include <cstdlib>
#include <ctime>
typedef std::vector< std::vector<int> > Matrix;
class MatrixCleaner
{
public:
void swapElement(int row, int col)
{
if (row >= 0 && row < (int)matrix.size() && col >= 0 && col < (int)matrix[row].size())
matrix[row][col] = !matrix[row][col];
}
void swapElements(int row, int col)
{
swapElement(row - 1, col);
swapElement(row, col - 1);
swapElement(row, col);
swapElement(row, col + 1);
swapElement(row + 1, col);
}
void printMatrix()
{
for (auto &v : matrix)
{
for (auto &val : v)
{
std::cout << val << " ";
}
std::cout << "\n";
}
}
void loadMatrix(std::string path)
{
std::ifstream fileStream;
fileStream.open(path);
matrix.resize(1);
bool enconteredNumber = false;
bool skipLine = false;
bool skipBlock = false;
for (char c; fileStream.get(c);)
{
if (skipLine)
{
if (c != '*')
skipBlock = true;
if (c != '\n')
continue;
else
skipLine = false;
}
if (skipBlock)
{
if (c == '*')
skipBlock = false;
continue;
}
switch (c)
{
case '0':
matrix.back().push_back(0);
enconteredNumber = true;
break;
case '1':
matrix.back().push_back(1);
enconteredNumber = true;
break;
case '\n':
if (enconteredNumber)
{
matrix.resize(matrix.size() + 1);
enconteredNumber = false;
}
break;
case '#':
if(!skipBlock)
skipLine = true;
break;
case '*':
skipBlock = true;
break;
default:
break;
}
}
while (matrix.size() > 0 && matrix.back().empty())
matrix.pop_back();
fileStream.close();
}
void loadRandomValidMatrix(int seed = -1)
{
//Default matrix
matrix = {
{ 0,0,0,0,0 },
{ 0,0,0,0,0 },
{ 0,0,0,0,0 },
{ 0,0,0,0,0 },
{ 0,0,0,0,0 },
{ 0,0,0,0,0 },
};
int setNum = seed;
if(seed < 0)
if(seed < -1)
setNum = std::rand() % -seed;
else
setNum = std::rand() % 33554432;
for (size_t r = 1; r < matrix.size(); r++)
for (size_t c = 0; c < matrix[r].size(); c++)
{
if (setNum & 1)
swapElements(r, c);
setNum >>= 1;
}
}
bool test()
{
bool retVal = true;
for (int i = 0; i < 33554432; i++)
{
loadRandomValidMatrix(i);
if( (i % 1000000) == 0 )
std::cout << "i= " << i << "\n";
if (clean() < 0)
{
// std::cout << "x";
std::cout << "\n" << i << "\n";
retVal = false;
break;
}
else
{
// std::cout << ".";
}
}
return retVal;
}
int clean()
{
int numOfSwaps = 0;
try
{
for (size_t r = 1; r < matrix.size(); r++)
{
for (size_t c = 0; c < matrix[r].size(); c++)
{
if (matrix.at(r - 1).at(c))
{
swapElements(r, c);
numOfSwaps++;
}
}
}
}
catch (...)
{
return -2;
}
if (!matrix.empty())
for (auto &val : matrix.back())
{
if (val == 1)
{
numOfSwaps = -1;
break;
}
}
return numOfSwaps;
}
Matrix matrix;
};
int main(int argc, char **argv)
{
std::srand(std::time(NULL));
MatrixCleaner matrixSwaper;
if (argc > 1)
{
matrixSwaper.loadMatrix(argv[argc - 1]);
std::cout << "input:\n";
matrixSwaper.printMatrix();
int numOfSwaps = matrixSwaper.clean();
std::cout << "\noutput:\n";
matrixSwaper.printMatrix();
if (numOfSwaps > 0)
std::cout << "\nresult = " << numOfSwaps << " matrix is clean now " << std::endl;
else if (numOfSwaps == 0)
std::cout << "\nresult = " << numOfSwaps << " nothing to clean " << std::endl;
else
std::cout << "\nresult = " << numOfSwaps << " matrix cannot be clean " << std::endl;
}
else
{
std::cout << "Testing ";
if (matrixSwaper.test())
std::cout << " PASS\n";
else
std::cout << " FAIL\n";
}
std::cin.ignore();
return 0;
}

Solving sparse definite positive linear systems in CUDA

We are experiencing problems while using cuSOLVER's cusolverSpScsrlsvchol function, probably due to a misunderstanding of the cuSOLVER library.
Motivation: we are solving the Poisson equation -divgrad x = b on a rectangular grid. In 2 dimensions with a 5-point stencil (1, 1, -4, 1, 1), the Laplacian on the grid provides a (quite sparse) matrix A. Moreover, the charge distribution on the grid gives a (dense) vector b. A is positive definite and symmetric.
Now we solve A * x = b for x using NVIDIA's new cuSOLVER library that comes with CUDA 7.0. It provides a function cusolverSpScsrlsvchol, which should perform the sparse Cholesky factorisation for floats.
Note: we are able to correctly solve the system with the alternative sparse QR factorisation function cusolverSpScsrlsvqr. For a 4 x 4 grid with all b entries on the edge being 1 and the rest 0, we get for x:
1 1 0.999999 1 1 1 0.999999 1 1 1 1 1 1 1 1 1
Our problems:
1. cusolverSpScsrlsvchol returns wrong results for x:
1 3.33333 2.33333 1 3.33333 2.33333 1.33333 1 2.33333 1.33333 0.666667 1 1 1 1 1
2. (solved, see answer below) Converting the CSR matrix A to a dense matrix and showing the output gives weird numbers (10^-44 and the like). The respective data from the CSR format are correct and validated with python numpy.
3. (solved, see answer below) The alternative sparse LU and partial pivoting with cusolverSpScsrlsvlu cannot even be found:
$ nvcc -std=c++11 cusparse_test3.cu -o cusparse_test3 -lcusparse -lcusolver
cusparse_test3.cu(208): error: identifier "cusolverSpScsrlsvlu" is undefined
What are we doing wrong? Thanks for your help!
Our C++ CUDA code:
#include <iostream>
#include <cuda_runtime.h>
#include <cuda.h>
#include <cusolverSp.h>
#include <cusparse.h>
#include <vector>
#include <cassert>
// create poisson matrix with Dirichlet bc. of a rectangular grid with
// dimension NxN
void assemble_poisson_matrix_coo(std::vector<float>& vals, std::vector<int>& row, std::vector<int>& col,
std::vector<float>& rhs, int Nrows, int Ncols) {
//nnz: 5 entries per row (node) for nodes in the interior
// 1 entry per row (node) for nodes on the boundary, since we set them explicitly to 1.
int nnz = 5*Nrows*Ncols - (2*(Ncols-1) + 2*(Nrows-1))*4;
vals.resize(nnz);
row.resize(nnz);
col.resize(nnz);
rhs.resize(Nrows*Ncols);
int counter = 0;
for(int i = 0; i < Nrows; ++i) {
for (int j = 0; j < Ncols; ++j) {
int idx = j + Ncols*i;
if (i == 0 || j == 0 || j == Ncols-1 || i == Nrows-1) {
vals[counter] = 1.;
row[counter] = idx;
col[counter] = idx;
counter++;
rhs[idx] = 1.;
// if (i == 0) {
// rhs[idx] = 3.;
// }
} else { // -laplace stencil
// above
vals[counter] = -1.;
row[counter] = idx;
col[counter] = idx-Ncols;
counter++;
// left
vals[counter] = -1.;
row[counter] = idx;
col[counter] = idx-1;
counter++;
// center
vals[counter] = 4.;
row[counter] = idx;
col[counter] = idx;
counter++;
// right
vals[counter] = -1.;
row[counter] = idx;
col[counter] = idx+1;
counter++;
// below
vals[counter] = -1.;
row[counter] = idx;
col[counter] = idx+Ncols;
counter++;
rhs[idx] = 0;
}
}
}
assert(counter == nnz);
}
int main() {
// --- create library handles:
cusolverSpHandle_t cusolver_handle;
cusolverStatus_t cusolver_status;
cusolver_status = cusolverSpCreate(&cusolver_handle);
std::cout << "status create cusolver handle: " << cusolver_status << std::endl;
cusparseHandle_t cusparse_handle;
cusparseStatus_t cusparse_status;
cusparse_status = cusparseCreate(&cusparse_handle);
std::cout << "status create cusparse handle: " << cusparse_status << std::endl;
// --- prepare matrix:
int Nrows = 4;
int Ncols = 4;
std::vector<float> csrVal;
std::vector<int> cooRow;
std::vector<int> csrColInd;
std::vector<float> b;
assemble_poisson_matrix_coo(csrVal, cooRow, csrColInd, b, Nrows, Ncols);
int nnz = csrVal.size();
int m = Nrows * Ncols;
std::vector<int> csrRowPtr(m+1);
// --- prepare solving and copy to GPU:
std::vector<float> x(m);
float tol = 1e-5;
int reorder = 0;
int singularity = 0;
float *db, *dcsrVal, *dx;
int *dcsrColInd, *dcsrRowPtr, *dcooRow;
cudaMalloc((void**)&db, m*sizeof(float));
cudaMalloc((void**)&dx, m*sizeof(float));
cudaMalloc((void**)&dcsrVal, nnz*sizeof(float));
cudaMalloc((void**)&dcsrColInd, nnz*sizeof(int));
cudaMalloc((void**)&dcsrRowPtr, (m+1)*sizeof(int));
cudaMalloc((void**)&dcooRow, nnz*sizeof(int));
cudaMemcpy(db, b.data(), b.size()*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dcsrVal, csrVal.data(), csrVal.size()*sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(dcsrColInd, csrColInd.data(), csrColInd.size()*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dcooRow, cooRow.data(), cooRow.size()*sizeof(int), cudaMemcpyHostToDevice);
cusparse_status = cusparseXcoo2csr(cusparse_handle, dcooRow, nnz, m,
dcsrRowPtr, CUSPARSE_INDEX_BASE_ZERO);
std::cout << "status cusparse coo2csr conversion: " << cusparse_status << std::endl;
cudaDeviceSynchronize(); // matrix format conversion has to be finished!
// --- everything ready for computation:
cusparseMatDescr_t descrA;
cusparse_status = cusparseCreateMatDescr(&descrA);
std::cout << "status cusparse createMatDescr: " << cusparse_status << std::endl;
// optional: print dense matrix that has been allocated on GPU
std::vector<float> A(m*m, 0);
float *dA;
cudaMalloc((void**)&dA, A.size()*sizeof(float));
cusparseScsr2dense(cusparse_handle, m, m, descrA, dcsrVal,
dcsrRowPtr, dcsrColInd, dA, m);
cudaMemcpy(A.data(), dA, A.size()*sizeof(float), cudaMemcpyDeviceToHost);
std::cout << "A: \n";
for (int i = 0; i < m; ++i) {
for (int j = 0; j < m; ++j) {
std::cout << A[i*m + j] << " ";
}
std::cout << std::endl;
}
cudaFree(dA);
std::cout << "b: \n";
cudaMemcpy(b.data(), db, m*sizeof(float), cudaMemcpyDeviceToHost); // b holds floats, not ints
for (auto a : b) {
std::cout << a << ",";
}
std::cout << std::endl;
// --- solving!!!!
// cusolver_status = cusolverSpScsrlsvchol(cusolver_handle, m, nnz, descrA, dcsrVal,
// dcsrRowPtr, dcsrColInd, db, tol, reorder, dx,
// &singularity);
cusolver_status = cusolverSpScsrlsvqr(cusolver_handle, m, nnz, descrA, dcsrVal,
dcsrRowPtr, dcsrColInd, db, tol, reorder, dx,
&singularity);
cudaDeviceSynchronize();
std::cout << "singularity (should be -1): " << singularity << std::endl;
std::cout << "status cusolver solving (!): " << cusolver_status << std::endl;
cudaMemcpy(x.data(), dx, m*sizeof(float), cudaMemcpyDeviceToHost);
// relocated these 2 lines from above to solve (2):
cusparse_status = cusparseDestroy(cusparse_handle);
std::cout << "status destroy cusparse handle: " << cusparse_status << std::endl;
cusolver_status = cusolverSpDestroy(cusolver_handle);
std::cout << "status destroy cusolver handle: " << cusolver_status << std::endl;
for (auto a : x) {
std::cout << a << " ";
}
std::cout << std::endl;
cudaFree(db);
cudaFree(dx);
cudaFree(dcsrVal);
cudaFree(dcsrColInd);
cudaFree(dcsrRowPtr);
cudaFree(dcooRow);
return 0;
}
1. cusolverSpScsrlsvchol returns wrong results for x:
1 3.33333 2.33333 1 3.33333 2.33333 1.33333 1 2.33333 1.33333 0.666667 1 1 1 1 1
You said:
A is positive definite and symmetric.
No, it is not. It is not symmetric.
cusolverSpcsrlsvqr() has no requirement that the A matrix be symmetric.
cusolverSpcsrlsvchol() does have that requirement:
A is an m×m symmetric positive definite sparse matrix
This is the printout your code provides for the A matrix:
A:
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 1 0 0 0 -1 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 -1 0 0 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 4 -1 0 0 -1 0 0 0 0 0 0
0 0 0 0 0 -1 4 0 0 0 -1 0 0 0 0 0
0 0 0 0 0 0 -1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 0
0 0 0 0 0 -1 0 0 0 4 -1 0 0 0 0 0
0 0 0 0 0 0 -1 0 0 -1 4 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 -1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 -1 0 0 0 1 0 0
0 0 0 0 0 0 0 0 0 0 -1 0 0 0 1 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
If that were symmetric, I would expect the second row:
0 1 0 0 0 -1 0 0 0 0 0 0 0 0 0 0
to match the 2nd column:
0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
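The row-versus-column comparison above can also be automated. The following is a minimal host-side sketch, assuming a square matrix stored in zero-based CSR format; the helper names csrEntry and isSymmetricCsr are hypothetical and not part of cuSPARSE or cuSOLVER. It compares every stored entry A(i,j) with A(j,i), treating entries that are not stored as zero.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: returns the value stored at (row, col) of a
// zero-based CSR matrix, or 0 if that entry is not stored.
float csrEntry(const std::vector<float>& val, const std::vector<int>& rowPtr,
               const std::vector<int>& colInd, int row, int col)
{
    for (int k = rowPtr[row]; k < rowPtr[row + 1]; ++k)
        if (colInd[k] == col)
            return val[k];
    return 0.0f;
}

// Hypothetical helper: checks |A(i,j) - A(j,i)| <= tol for every stored entry.
// Assumes the matrix is square and all column indices are in range.
bool isSymmetricCsr(const std::vector<float>& val, const std::vector<int>& rowPtr,
                    const std::vector<int>& colInd, float tol = 1e-6f)
{
    assert(!rowPtr.empty());
    const int m = static_cast<int>(rowPtr.size()) - 1;
    for (int i = 0; i < m; ++i)
        for (int k = rowPtr[i]; k < rowPtr[i + 1]; ++k)
            if (std::fabs(val[k] - csrEntry(val, rowPtr, colInd, colInd[k], i)) > tol)
                return false;
    return true;
}
```

Calling isSymmetricCsr on the host copies of csrVal, csrRowPtr and csrColInd before choosing the Cholesky solver would flag the Dirichlet rows of the matrix printed above, because their off-diagonal partners are missing.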
By the way, a suggestion about Stack Overflow. If you answer your own question, my suggestion is that you intend it to be a complete answer. Some people might see an answered question and skip it. Probably better to edit such content into your question, thus focusing your question (I think) down to a single question. SO also doesn't work as well in my opinion when you ask multiple questions per question. That sort of behavior makes the question unnecessarily more difficult to answer, and I don't think it is serving you well here.
Although the matrix arising from Cartesian discretization of the Poisson equation is not positive definite, this question regards the inversion of sparse positive definite linear systems.
Until cusolverSpScsrlsvchol becomes available for the device channel, I think it will be useful for potentially interested users to see how to solve sparse positive definite linear systems using the cuSPARSE library. Here is a fully worked example:
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <assert.h>
#include <cuda_runtime.h>
#include <cusparse_v2.h>
/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) { exit(code); }
}
}
extern "C" void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }
/***************************/
/* CUSPARSE ERROR CHECKING */
/***************************/
static const char *_cusparseGetErrorEnum(cusparseStatus_t error)
{
switch (error)
{
case CUSPARSE_STATUS_SUCCESS:
return "CUSPARSE_STATUS_SUCCESS";
case CUSPARSE_STATUS_NOT_INITIALIZED:
return "CUSPARSE_STATUS_NOT_INITIALIZED";
case CUSPARSE_STATUS_ALLOC_FAILED:
return "CUSPARSE_STATUS_ALLOC_FAILED";
case CUSPARSE_STATUS_INVALID_VALUE:
return "CUSPARSE_STATUS_INVALID_VALUE";
case CUSPARSE_STATUS_ARCH_MISMATCH:
return "CUSPARSE_STATUS_ARCH_MISMATCH";
case CUSPARSE_STATUS_MAPPING_ERROR:
return "CUSPARSE_STATUS_MAPPING_ERROR";
case CUSPARSE_STATUS_EXECUTION_FAILED:
return "CUSPARSE_STATUS_EXECUTION_FAILED";
case CUSPARSE_STATUS_INTERNAL_ERROR:
return "CUSPARSE_STATUS_INTERNAL_ERROR";
case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
return "CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
case CUSPARSE_STATUS_ZERO_PIVOT:
return "CUSPARSE_STATUS_ZERO_PIVOT";
}
return "<unknown>";
}
inline void __cusparseSafeCall(cusparseStatus_t err, const char *file, const int line)
{
if(CUSPARSE_STATUS_SUCCESS != err) {
fprintf(stderr, "CUSPARSE error in file '%s', line %d\nerror %d: %s\nterminating!\n", file, line, (int)err,
_cusparseGetErrorEnum(err));
cudaDeviceReset(); assert(0);
}
}
extern "C" void cusparseSafeCall(cusparseStatus_t err) { __cusparseSafeCall(err, __FILE__, __LINE__); }
/********/
/* MAIN */
/********/
int main()
{
// --- Initialize cuSPARSE
cusparseHandle_t handle; cusparseSafeCall(cusparseCreate(&handle));
const int Nrows = 4; // --- Number of rows
const int Ncols = 4; // --- Number of columns
const int N = Nrows;
// --- Host side dense matrix
double *h_A_dense = (double*)malloc(Nrows*Ncols*sizeof(*h_A_dense));
// --- Column-major ordering
h_A_dense[0] = 0.4612f; h_A_dense[4] = -0.0006f; h_A_dense[8] = 0.3566f; h_A_dense[12] = 0.0f;
h_A_dense[1] = -0.0006f; h_A_dense[5] = 0.4640f; h_A_dense[9] = 0.0723f; h_A_dense[13] = 0.0f;
h_A_dense[2] = 0.3566f; h_A_dense[6] = 0.0723f; h_A_dense[10] = 0.7543f; h_A_dense[14] = 0.0f;
h_A_dense[3] = 0.f; h_A_dense[7] = 0.0f; h_A_dense[11] = 0.0f; h_A_dense[15] = 0.1f;
// --- Create device array and copy host array to it
double *d_A_dense; gpuErrchk(cudaMalloc(&d_A_dense, Nrows * Ncols * sizeof(*d_A_dense)));
gpuErrchk(cudaMemcpy(d_A_dense, h_A_dense, Nrows * Ncols * sizeof(*d_A_dense), cudaMemcpyHostToDevice));
// --- Descriptor for sparse matrix A
cusparseMatDescr_t descrA; cusparseSafeCall(cusparseCreateMatDescr(&descrA));
cusparseSafeCall(cusparseSetMatType (descrA, CUSPARSE_MATRIX_TYPE_GENERAL));
cusparseSafeCall(cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE));
int nnz = 0; // --- Number of nonzero elements in dense matrix
const int lda = Nrows; // --- Leading dimension of dense matrix
// --- Device side number of nonzero elements per row
int *d_nnzPerVector; gpuErrchk(cudaMalloc(&d_nnzPerVector, Nrows * sizeof(*d_nnzPerVector)));
cusparseSafeCall(cusparseDnnz(handle, CUSPARSE_DIRECTION_ROW, Nrows, Ncols, descrA, d_A_dense, lda, d_nnzPerVector, &nnz));
// --- Host side number of nonzero elements per row
int *h_nnzPerVector = (int *)malloc(Nrows * sizeof(*h_nnzPerVector));
gpuErrchk(cudaMemcpy(h_nnzPerVector, d_nnzPerVector, Nrows * sizeof(*h_nnzPerVector), cudaMemcpyDeviceToHost));
printf("Number of nonzero elements in dense matrix = %i\n\n", nnz);
for (int i = 0; i < Nrows; ++i) printf("Number of nonzero elements in row %i = %i \n", i, h_nnzPerVector[i]);
printf("\n");
// --- Device side sparse matrix (CSR)
double *d_A; gpuErrchk(cudaMalloc(&d_A, nnz * sizeof(*d_A)));
int *d_A_RowIndices; gpuErrchk(cudaMalloc(&d_A_RowIndices, (Nrows + 1) * sizeof(*d_A_RowIndices)));
int *d_A_ColIndices; gpuErrchk(cudaMalloc(&d_A_ColIndices, nnz * sizeof(*d_A_ColIndices)));
cusparseSafeCall(cusparseDdense2csr(handle, Nrows, Ncols, descrA, d_A_dense, lda, d_nnzPerVector, d_A, d_A_RowIndices, d_A_ColIndices));
// --- Host side sparse matrix (CSR)
double *h_A = (double *)malloc(nnz * sizeof(*h_A));
int *h_A_RowIndices = (int *)malloc((Nrows + 1) * sizeof(*h_A_RowIndices));
int *h_A_ColIndices = (int *)malloc(nnz * sizeof(*h_A_ColIndices));
gpuErrchk(cudaMemcpy(h_A, d_A, nnz*sizeof(*h_A), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_A_RowIndices, d_A_RowIndices, (Nrows + 1) * sizeof(*h_A_RowIndices), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_A_ColIndices, d_A_ColIndices, nnz * sizeof(*h_A_ColIndices), cudaMemcpyDeviceToHost));
printf("\nOriginal matrix in CSR format\n\n");
for (int i = 0; i < nnz; ++i) printf("A[%i] = %f ", i, h_A[i]); printf("\n");
printf("\n");
for (int i = 0; i < (Nrows + 1); ++i) printf("h_A_RowIndices[%i] = %i \n", i, h_A_RowIndices[i]); printf("\n");
for (int i = 0; i < nnz; ++i) printf("h_A_ColIndices[%i] = %i \n", i, h_A_ColIndices[i]);
// --- Allocating and defining dense host and device data vectors
double *h_x = (double *)malloc(Nrows * sizeof(double));
h_x[0] = 100.0; h_x[1] = 200.0; h_x[2] = 400.0; h_x[3] = 500.0;
double *d_x; gpuErrchk(cudaMalloc(&d_x, Nrows * sizeof(double)));
gpuErrchk(cudaMemcpy(d_x, h_x, Nrows * sizeof(double), cudaMemcpyHostToDevice));
/******************************************/
/* STEP 1: CREATE DESCRIPTOR FOR L        */
/******************************************/
cusparseMatDescr_t descr_L = 0;
cusparseSafeCall(cusparseCreateMatDescr (&descr_L));
cusparseSafeCall(cusparseSetMatIndexBase (descr_L, CUSPARSE_INDEX_BASE_ONE));
cusparseSafeCall(cusparseSetMatType (descr_L, CUSPARSE_MATRIX_TYPE_GENERAL));
cusparseSafeCall(cusparseSetMatFillMode (descr_L, CUSPARSE_FILL_MODE_LOWER));
cusparseSafeCall(cusparseSetMatDiagType (descr_L, CUSPARSE_DIAG_TYPE_NON_UNIT));
/********************************************************************************************************/
/* STEP 2: QUERY HOW MUCH MEMORY USED IN CHOLESKY FACTORIZATION AND THE TWO FOLLOWING SYSTEM INVERSIONS */
/********************************************************************************************************/
csric02Info_t info_A = 0; cusparseSafeCall(cusparseCreateCsric02Info(&info_A));
csrsv2Info_t info_L = 0; cusparseSafeCall(cusparseCreateCsrsv2Info (&info_L));
csrsv2Info_t info_Lt = 0; cusparseSafeCall(cusparseCreateCsrsv2Info (&info_Lt));
int pBufferSize_M, pBufferSize_L, pBufferSize_Lt;
cusparseSafeCall(cusparseDcsric02_bufferSize(handle, N, nnz, descrA, d_A, d_A_RowIndices, d_A_ColIndices, info_A, &pBufferSize_M));
cusparseSafeCall(cusparseDcsrsv2_bufferSize (handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnz, descr_L, d_A, d_A_RowIndices, d_A_ColIndices, info_L, &pBufferSize_L));
cusparseSafeCall(cusparseDcsrsv2_bufferSize (handle, CUSPARSE_OPERATION_TRANSPOSE, N, nnz, descr_L, d_A, d_A_RowIndices, d_A_ColIndices, info_Lt, &pBufferSize_Lt));
int pBufferSize = max(pBufferSize_M, max(pBufferSize_L, pBufferSize_Lt));
void *pBuffer = 0; gpuErrchk(cudaMalloc((void**)&pBuffer, pBufferSize));
/******************************************************************************************************/
/* STEP 3: ANALYZE THE THREE PROBLEMS: CHOLESKY FACTORIZATION AND THE TWO FOLLOWING SYSTEM INVERSIONS */
/******************************************************************************************************/
int structural_zero;
cusparseSafeCall(cusparseDcsric02_analysis(handle, N, nnz, descrA, d_A, d_A_RowIndices, d_A_ColIndices, info_A, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer));
cusparseStatus_t status = cusparseXcsric02_zeroPivot(handle, info_A, &structural_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){ printf("A(%d,%d) is missing\n", structural_zero, structural_zero); }
cusparseSafeCall(cusparseDcsrsv2_analysis(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnz, descr_L, d_A, d_A_RowIndices, d_A_ColIndices, info_L, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer));
cusparseSafeCall(cusparseDcsrsv2_analysis(handle, CUSPARSE_OPERATION_TRANSPOSE, N, nnz, descr_L, d_A, d_A_RowIndices, d_A_ColIndices, info_Lt, CUSPARSE_SOLVE_POLICY_USE_LEVEL, pBuffer));
/*************************************/
/* STEP 4: FACTORIZATION: A = L * L' */
/*************************************/
int numerical_zero;
cusparseSafeCall(cusparseDcsric02(handle, N, nnz, descrA, d_A, d_A_RowIndices, d_A_ColIndices, info_A, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer));
status = cusparseXcsric02_zeroPivot(handle, info_A, &numerical_zero);
if (CUSPARSE_STATUS_ZERO_PIVOT == status){ printf("L(%d,%d) is zero\n", numerical_zero, numerical_zero); }
printf("\nNon-zero elements in Cholesky matrix\n\n");
gpuErrchk(cudaMemcpy(h_A, d_A, nnz * sizeof(double), cudaMemcpyDeviceToHost));
for (int k=0; k<nnz; k++) printf("%f\n", h_A[k]);
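cusparseSafeCall(cusparseDcsr2dense(handle, Nrows, Ncols, descrA, d_A, d_A_RowIndices, d_A_ColIndices, d_A_dense, Nrows));
// --- Copy the dense result back to the host before printing it
gpuErrchk(cudaMemcpy(h_A_dense, d_A_dense, Nrows * Ncols * sizeof(*h_A_dense), cudaMemcpyDeviceToHost));
printf("\nCholesky matrix\n\n");
for(int i = 0; i < Nrows; i++) {
std::cout << "[ ";
for(int j = 0; j < Ncols; j++)
std::cout << h_A_dense[j * Nrows + i] << " "; // column-major storage
std::cout << "]\n";
}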
/*********************/
/* STEP 5: L * z = x */
/*********************/
// --- Allocating the intermediate result vector
double *d_z; gpuErrchk(cudaMalloc(&d_z, N * sizeof(double)));
const double alpha = 1.;
cusparseSafeCall(cusparseDcsrsv2_solve(handle, CUSPARSE_OPERATION_NON_TRANSPOSE, N, nnz, &alpha, descr_L, d_A, d_A_RowIndices, d_A_ColIndices, info_L, d_x, d_z, CUSPARSE_SOLVE_POLICY_NO_LEVEL, pBuffer));
/**********************/
/* STEP 6: L' * y = z */
/**********************/
// --- Allocating the host and device side result vector
double *h_y = (double *)malloc(Ncols * sizeof(double));
double *d_y; gpuErrchk(cudaMalloc(&d_y, Ncols * sizeof(double)));
cusparseSafeCall(cusparseDcsrsv2_solve(handle, CUSPARSE_OPERATION_TRANSPOSE, N, nnz, &alpha, descr_L, d_A, d_A_RowIndices, d_A_ColIndices, info_Lt, d_z, d_y, CUSPARSE_SOLVE_POLICY_USE_LEVEL, pBuffer));
cudaMemcpy(h_x, d_y, N * sizeof(double), cudaMemcpyDeviceToHost);
printf("\n\nFinal result\n");
for (int k=0; k<N; k++) printf("x[%i] = %f\n", k, h_x[k]);
}
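The factor-then-solve sequence of steps 4 to 6 above can be illustrated on the host with small dense matrices. This is only a sketch under stated assumptions: it uses a full dense Cholesky factorization, whereas csric02 computes an incomplete factorization restricted to the sparsity pattern of A, so the two agree only when the exact factor has no fill-in. The names Matrix, choleskyLower and solveSpd are hypothetical and belong to neither cuSPARSE nor cuSOLVER.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Dense Cholesky factorization A = L * L^T; returns the lower-triangular L.
// Assumes A is symmetric positive definite (only its lower triangle is read).
Matrix choleskyLower(const Matrix& A)
{
    const std::size_t n = A.size();
    Matrix L(n, std::vector<double>(n, 0.0));
    for (std::size_t j = 0; j < n; ++j) {
        double d = A[j][j];
        for (std::size_t k = 0; k < j; ++k) d -= L[j][k] * L[j][k];
        assert(d > 0.0 && "matrix is not positive definite");
        L[j][j] = std::sqrt(d);
        for (std::size_t i = j + 1; i < n; ++i) {
            double s = A[i][j];
            for (std::size_t k = 0; k < j; ++k) s -= L[i][k] * L[j][k];
            L[i][j] = s / L[j][j];
        }
    }
    return L;
}

// Solves A * y = x in two triangular sweeps:
// forward substitution L * z = x, then backward substitution L^T * y = z.
std::vector<double> solveSpd(const Matrix& A, const std::vector<double>& x)
{
    const std::size_t n = A.size();
    const Matrix L = choleskyLower(A);
    std::vector<double> z(n), y(n);
    for (std::size_t i = 0; i < n; ++i) {
        double s = x[i];
        for (std::size_t k = 0; k < i; ++k) s -= L[i][k] * z[k];
        z[i] = s / L[i][i];
    }
    for (std::size_t i = n; i-- > 0;) {
        double s = z[i];
        for (std::size_t k = i + 1; k < n; ++k) s -= L[k][i] * y[k];
        y[i] = s / L[i][i];
    }
    return y;
}
```

A reference like this can be handy for checking the GPU result on small systems: the output of solveSpd should match the vector printed under "Final result" whenever the incomplete factor coincides with the exact one.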
Concerning 2: we have destroyed the cusparse handle too early (probably too much micro-tweaking to find the error sources....). Besides, the dense format is column-major which is why we need to transpose A to make it print properly!
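The column-major point can be made concrete with a small host-side sketch. It assumes the dense buffer has already been copied back from the device; formatColumnMajor is a hypothetical helper name, and lda is the leading dimension (the m passed as the last argument of cusparseScsr2dense in the question).

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: formats a column-major (rows x cols) matrix row by row.
// Element (i, j) lives at index i + j * lda, with lda >= rows.
std::string formatColumnMajor(const std::vector<float>& a, std::size_t rows,
                              std::size_t cols, std::size_t lda)
{
    assert(lda >= rows);
    assert(a.size() >= lda * cols);
    std::ostringstream out;
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j)
            out << a[i + j * lda] << ' ';
        out << '\n';
    }
    return out.str();
}
```

Indexing with A[i*m + j], as in the question's print loop, walks the same buffer as if it were row-major and therefore shows the transpose of A.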
Concerning 3: cusolverSpScsrlsvlu only exists on the host for the moment -- it's written in the documentation in a wonderfully obvious way under 6.2.1 remark 5.... http://docs.nvidia.com/cuda/cusolver/index.html#cusolver-lt-t-gt-csrlsvlu
Another way to solve a sparse, positive definite linear system is to use the cuSOLVER library and, in particular, the cusolverSpDcsrlsvchol routine. It works very similarly to the cuSOLVER routines used in "Solving general sparse linear systems in CUDA", but uses a Cholesky factorization A = G * G^H, where G is the Cholesky factor, a lower triangular matrix.
As with the routines in "Solving general sparse linear systems in CUDA", and as of CUDA 10.0, only the host channel is available. Note that the reorder parameter has no effect and that singularity is -1 if the matrix A is positive definite.
Below, a fully worked example:
#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cusparse.h>
#include <cusolverSp.h>
//https://www.physicsforums.com/threads/all-the-ways-to-build-positive-definite-matrices.561438/
//https://it.mathworks.com/matlabcentral/answers/101132-how-do-i-determine-if-a-matrix-is-positive-definite-using-matlab
/*******************/
/* iDivUp FUNCTION */
/*******************/
//extern "C" int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
__host__ __device__ int iDivUp(int a, int b){ return ((a % b) != 0) ? (a / b + 1) : (a / b); }
/********************/
/* CUDA ERROR CHECK */
/********************/
// --- Credit to http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api
void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
if (code != cudaSuccess)
{
fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) { exit(code); }
}
}
extern "C" void gpuErrchk(cudaError_t ans) { gpuAssert((ans), __FILE__, __LINE__); }
/**************************/
/* CUSOLVE ERROR CHECKING */
/**************************/
static const char *_cusolverGetErrorEnum(cusolverStatus_t error)
{
switch (error)
{
case CUSOLVER_STATUS_SUCCESS:
return "CUSOLVER_STATUS_SUCCESS";
case CUSOLVER_STATUS_NOT_INITIALIZED:
return "CUSOLVER_STATUS_NOT_INITIALIZED";
case CUSOLVER_STATUS_ALLOC_FAILED:
return "CUSOLVER_STATUS_ALLOC_FAILED";
case CUSOLVER_STATUS_INVALID_VALUE:
return "CUSOLVER_STATUS_INVALID_VALUE";
case CUSOLVER_STATUS_ARCH_MISMATCH:
return "CUSOLVER_STATUS_ARCH_MISMATCH";
case CUSOLVER_STATUS_EXECUTION_FAILED:
return "CUSOLVER_STATUS_EXECUTION_FAILED";
case CUSOLVER_STATUS_INTERNAL_ERROR:
return "CUSOLVER_STATUS_INTERNAL_ERROR";
case CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
return "CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
}
return "<unknown>";
}
inline void __cusolveSafeCall(cusolverStatus_t err, const char *file, const int line)
{
if (CUSOLVER_STATUS_SUCCESS != err) {
fprintf(stderr, "CUSOLVE error in file '%s', line %d, error: %s \nterminating!\n", file, line,
_cusolverGetErrorEnum(err));
assert(0);
}
}
extern "C" void cusolveSafeCall(cusolverStatus_t err) { __cusolveSafeCall(err, __FILE__, __LINE__); }
/***************************/
/* CUSPARSE ERROR CHECKING */
/***************************/
static const char *_cusparseGetErrorEnum(cusparseStatus_t error)
{
switch (error)
{
case CUSPARSE_STATUS_SUCCESS:
return "CUSPARSE_STATUS_SUCCESS";
case CUSPARSE_STATUS_NOT_INITIALIZED:
return "CUSPARSE_STATUS_NOT_INITIALIZED";
case CUSPARSE_STATUS_ALLOC_FAILED:
return "CUSPARSE_STATUS_ALLOC_FAILED";
case CUSPARSE_STATUS_INVALID_VALUE:
return "CUSPARSE_STATUS_INVALID_VALUE";
case CUSPARSE_STATUS_ARCH_MISMATCH:
return "CUSPARSE_STATUS_ARCH_MISMATCH";
case CUSPARSE_STATUS_MAPPING_ERROR:
return "CUSPARSE_STATUS_MAPPING_ERROR";
case CUSPARSE_STATUS_EXECUTION_FAILED:
return "CUSPARSE_STATUS_EXECUTION_FAILED";
case CUSPARSE_STATUS_INTERNAL_ERROR:
return "CUSPARSE_STATUS_INTERNAL_ERROR";
case CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED:
return "CUSPARSE_STATUS_MATRIX_TYPE_NOT_SUPPORTED";
case CUSPARSE_STATUS_ZERO_PIVOT:
return "CUSPARSE_STATUS_ZERO_PIVOT";
}
return "<unknown>";
}
inline void __cusparseSafeCall(cusparseStatus_t err, const char *file, const int line)
{
if (CUSPARSE_STATUS_SUCCESS != err) {
fprintf(stderr, "CUSPARSE error in file '%s', line %d\nerror %d: %s\nterminating!\n", file, line, (int)err,
_cusparseGetErrorEnum(err));
cudaDeviceReset(); assert(0);
}
}
extern "C" void cusparseSafeCall(cusparseStatus_t err) { __cusparseSafeCall(err, __FILE__, __LINE__); }
/********/
/* MAIN */
/********/
int main()
{
// --- Initialize cuSPARSE
cusparseHandle_t handle; cusparseSafeCall(cusparseCreate(&handle));
const int Nrows = 4; // --- Number of rows
const int Ncols = 4; // --- Number of columns
const int N = Nrows;
// --- Host side dense matrix
double *h_A_dense = (double*)malloc(Nrows*Ncols*sizeof(*h_A_dense));
// --- Column-major ordering
h_A_dense[0] = 1.78; h_A_dense[4] = 0.0; h_A_dense[8] = 0.1736; h_A_dense[12] = 0.0;
h_A_dense[1] = 0.00; h_A_dense[5] = 3.1; h_A_dense[9] = 0.0; h_A_dense[13] = 0.0;
h_A_dense[2] = 0.1736; h_A_dense[6] = 0.0; h_A_dense[10] = 5.0; h_A_dense[14] = 0.0;
h_A_dense[3] = 0.00; h_A_dense[7] = 0.0; h_A_dense[11] = 0.0; h_A_dense[15] = 2.349;
//create device array and copy host to it
double *d_A_dense; gpuErrchk(cudaMalloc(&d_A_dense, Nrows * Ncols * sizeof(*d_A_dense)));
gpuErrchk(cudaMemcpy(d_A_dense, h_A_dense, Nrows * Ncols * sizeof(*d_A_dense), cudaMemcpyHostToDevice));
// --- Descriptor for sparse matrix A
cusparseMatDescr_t descrA; cusparseSafeCall(cusparseCreateMatDescr(&descrA));
cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL);
cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ZERO);
int nnz = 0; // --- Number of nonzero elements in dense matrix
const int lda = Nrows; // --- Leading dimension of dense matrix
// --- Device side number of nonzero elements per row
int *d_nnzPerVector; gpuErrchk(cudaMalloc(&d_nnzPerVector, Nrows * sizeof(*d_nnzPerVector)));
cusparseSafeCall(cusparseDnnz(handle, CUSPARSE_DIRECTION_ROW, Nrows, Ncols, descrA, d_A_dense, lda, d_nnzPerVector, &nnz));
// --- Host side number of nonzero elements per row
int *h_nnzPerVector = (int *)malloc(Nrows * sizeof(*h_nnzPerVector));
gpuErrchk(cudaMemcpy(h_nnzPerVector, d_nnzPerVector, Nrows * sizeof(*h_nnzPerVector), cudaMemcpyDeviceToHost));
printf("Number of nonzero elements in dense matrix = %i\n\n", nnz);
for (int i = 0; i < Nrows; ++i) printf("Number of nonzero elements in row %i = %i \n", i, h_nnzPerVector[i]);
printf("\n");
// --- Device side sparse matrix (CSR)
double *d_A; gpuErrchk(cudaMalloc(&d_A, nnz * sizeof(*d_A)));
int *d_A_RowIndices; gpuErrchk(cudaMalloc(&d_A_RowIndices, (Nrows + 1) * sizeof(*d_A_RowIndices)));
int *d_A_ColIndices; gpuErrchk(cudaMalloc(&d_A_ColIndices, nnz * sizeof(*d_A_ColIndices)));
cusparseSafeCall(cusparseDdense2csr(handle, Nrows, Ncols, descrA, d_A_dense, lda, d_nnzPerVector, d_A, d_A_RowIndices, d_A_ColIndices));
// --- Host side sparse matrix (CSR)
double *h_A = (double *)malloc(nnz * sizeof(*h_A));
int *h_A_RowIndices = (int *)malloc((Nrows + 1) * sizeof(*h_A_RowIndices));
int *h_A_ColIndices = (int *)malloc(nnz * sizeof(*h_A_ColIndices));
gpuErrchk(cudaMemcpy(h_A, d_A, nnz*sizeof(*h_A), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_A_RowIndices, d_A_RowIndices, (Nrows + 1) * sizeof(*h_A_RowIndices), cudaMemcpyDeviceToHost));
gpuErrchk(cudaMemcpy(h_A_ColIndices, d_A_ColIndices, nnz * sizeof(*h_A_ColIndices), cudaMemcpyDeviceToHost));
for (int i = 0; i < nnz; ++i) printf("A[%i] = %f ", i, h_A[i]); printf("\n");
for (int i = 0; i < (Nrows + 1); ++i) printf("h_A_RowIndices[%i] = %i \n", i, h_A_RowIndices[i]); printf("\n");
for (int i = 0; i < nnz; ++i) printf("h_A_ColIndices[%i] = %i \n", i, h_A_ColIndices[i]);
// --- Allocating and defining dense host and device data vectors
double *h_y = (double *)malloc(Nrows * sizeof(double));
h_y[0] = 1.0; h_y[1] = 1.0; h_y[2] = 1.0; h_y[3] = 1.0;
double *d_y; gpuErrchk(cudaMalloc(&d_y, Nrows * sizeof(double)));
gpuErrchk(cudaMemcpy(d_y, h_y, Nrows * sizeof(double), cudaMemcpyHostToDevice));
// --- Allocating the host and device side result vector
double *h_x = (double *)malloc(Ncols * sizeof(double));
double *d_x; gpuErrchk(cudaMalloc(&d_x, Ncols * sizeof(double)));
// --- CUDA solver initialization
cusolverSpHandle_t solver_handle;
cusolverSpCreate(&solver_handle);
// --- Using Cholesky factorization
int singularity;
cusolveSafeCall(cusolverSpDcsrlsvcholHost(solver_handle, N, nnz, descrA, h_A, h_A_RowIndices, h_A_ColIndices, h_y, 0.000001, 0, h_x, &singularity));
printf("Showing the results...\n");
for (int i = 0; i < N; i++) printf("%f\n", h_x[i]);
}

Sudoku solver keeps getting stuck for some reason

So I had to write a program for a computer project for high school and I thought of doing a sudoku solver. The 'solve' algorithm is implemented like this:
For any points where only one element 'fits' looking at rows, columns, 3x3 set, put that number in. Do this repeatedly till it can't be done anymore. This is seen in the 'singleLeft' function.
If a number 'fits' in some point but nowhere else in the associated row, column or 3x3 set, put that number in. This can be seen in the 'checkOnlyAllowed' function.
If we're not done yet, do a 'guess' - take some number that 'fits' in the point, put it in there and then solve again using this algorithm (recurse) - if it works, we're done.
So far, I have this code:
#include <iostream>
#include <fstream>
#include <cstdlib>
using namespace std;
//Prints a message and exits the application.
void error(const char msg[])
{
cout << "An error occurred!" << endl;
cout << "Description: " << msg << endl;
exit(EXIT_FAILURE);
}
//A representation of a sudoku board. Can be read from a file or from memory.
class Sudoku
{
protected:
//For a point x, y and a number n in the board, mAllowed[x][y][n]
//is 1 if n is allowed in that point, 0 if not.
int mAllowed[9][9][10];
int filledIn;
public:
/*
* For mBoard[i][j], the location is (i,j) in the below map:
*
* (0,0) (0,1) (0,2) (0,3) (0,4) (0,5) (0,6) (0,7) (0,8)
* (1,0) (1,1) (1,2) (1,3) (1,4) (1,5) (1,6) (1,7) (1,8)
* (2,0) (2,1) (2,2) (2,3) (2,4) (2,5) (2,6) (2,7) (2,8)
*
* (3,0) (3,1) (3,2) (3,3) (3,4) (3,5) (3,6) (3,7) (3,8)
* (4,0) (4,1) (4,2) (4,3) (4,4) (4,5) (4,6) (4,7) (4,8)
* (5,0) (5,1) (5,2) (5,3) (5,4) (5,5) (5,6) (5,7) (5,8)
*
* (6,0) (6,1) (6,2) (6,3) (6,4) (6,5) (6,6) (6,7) (6,8)
* (7,0) (7,1) (7,2) (7,3) (7,4) (7,5) (7,6) (7,7) (7,8)
* (8,0) (8,1) (8,2) (8,3) (8,4) (8,5) (8,6) (8,7) (8,8)
*
*/
int mBoard[9][9];
//Read in from file with given name.
Sudoku(char filename[])
{
filledIn = 0;
int i, j, k;
//Fill the board with 0s.
for (i = 0; i < 9; ++i)
for (j = 0; j < 9; ++j)
mBoard[i][j] = 0;
//Set every number to 'allowed' initially.
for (i = 0; i < 9; ++i)
for (j = 0; j < 9; ++j)
for (k = 1; k <= 9; ++k)
mAllowed[i][j][k] = 1;
//Read in from the file.
ifstream file(filename);
if (!file)
error("File doesn't exist!");
for (i = 0; i < 9; ++i)
for (j = 0; j < 9; ++j)
if (file)
{
int m;
file >> m;
if (m)
set(i, j, m);
}
else
error("Not enough entries in file!");
}
//Solve the board!
int solve()
{
int prevFilledIn;
do
{
prevFilledIn = filledIn;
singleLeft();
checkOnlyAllowed();
} while (filledIn - prevFilledIn > 3);
if (filledIn < 81)
guess();
return filledIn == 81;
}
//Given a point i, j, this looks for places where this point
//disallows a number and sets the 'mAllowed' table accordingly.
void fixAllowed(int i, int j)
{
int n = mBoard[i][j], k;
for (k = 0; k < 9; ++k)
mAllowed[i][k][n] = 0;
for (k = 0; k < 9; ++k)
mAllowed[k][j][n] = 0;
//Look in 3x3 sets too. First, set each coordinate to the
//highest multiple of 3 below itself. This takes us to the
//top-left corner of the 3x3 set this point was in. Then,
//add vectorially all points (x,y) where x and y each are
//one of 0, 1 or 2 to visit each point in this set.
int x = (i / 3) * 3;
int y = (j / 3) * 3;
for (k = 0; k < 3; ++k)
for (int l = 0; l < 3; ++l)
mAllowed[x + k][y + l][n] = 0;
mAllowed[i][j][n] = 1;
}
//Sets a point i, j to n.
void set(int i, int j, int n)
{
mBoard[i][j] = n;
fixAllowed(i, j);
++filledIn;
}
//Try using 'single' on a point, ie, only one number can fit in this
//point, so put it in and return 1. If more than one number can fit,
//return 0.
int trySinglePoint(int i, int j)
{
int c = 0, m;
for (m = 1; m <= 9; ++m)
c += mAllowed[i][j][m];
if (c == 1)
{
for (m = 1; m <= 9; ++m)
if (mAllowed[i][j][m])
set(i, j, m);
//printBoard();
return 1;
}
return 0;
}
//Try to solve by checking for spots that have only one number remaining.
void singleLeft()
{
for (;;)
{
for (int i = 0; i < 9; ++i)
for (int j = 0; j < 9; ++j)
if (!mBoard[i][j])
if (trySinglePoint(i, j))
goto logic_worked;
//If we reached here, board is either full or unsolvable by this logic, so
//our job is done.
return;
logic_worked:
continue;
}
}
//Within rows, columns or sets, whether this number is 'allowed' in spots
//other than i, j.
int onlyInRow(int n, int i, int j)
{
for (int k = 0; k < 9; ++k)
if (k != j && mAllowed[i][k][n])
return 0;
return 1;
}
int onlyInColumn(int n, int i, int j)
{
for (int k = 0; k < 9; ++k)
if (k != i && mAllowed[k][j][n])
return 0;
return 1;
}
int onlyInSet(int n, int i, int j)
{
int x = (i / 3) * 3;
int y = (j / 3) * 3;
for (int k = 0; k < 3; ++k)
for (int l = 0; l < 3; ++l)
if (!(x + k == i && y + l == j) && mAllowed[x + k][y + l][n])
return 0;
return 1;
}
//If a number is 'allowed' in only one spot within a row, column or set, it's
//guaranteed to have to be there.
void checkOnlyAllowed()
{
for (int i = 0; i < 9; ++i)
for (int j = 0; j < 9; ++j)
if (!mBoard[i][j])
for (int m = 1; m <= 9; ++m)
if (mAllowed[i][j][m])
if (onlyInRow(m, i, j) || onlyInColumn(m, i, j) || onlyInSet(m, i, j))
set(i, j, m);
}
//Copy from a given board.
void copyBoard(int board[9][9])
{
filledIn = 0;
for (int i = 0; i < 9; ++i)
for (int j = 0; j < 9; ++j)
{
if (board[i][j] > 0)
++filledIn;
mBoard[i][j] = board[i][j];
}
}
//Try to solve by 'guessing'.
void guess()
{
for (int i = 0; i < 9; ++i)
for (int j = 0; j < 9; ++j)
for (int n = 1; n <= 9; ++n)
if (!mBoard[i][j])
if (mAllowed[i][j][n] == 1)
{
//Do a direct copy so that it gets the 'mAllowed'
//table too.
Sudoku s = *this;
//Try solving with this number at this spot.
s.set(i, j, n);
if (s.solve())
{
//It was able to do it! Copy and report success!
copyBoard(s.mBoard);
return;
}
}
}
//Print the board (for debug purposes)
void printBoard()
{
for (int i = 0; i < 9; ++i)
{
for (int j = 0; j < 9; ++j)
cout << mBoard[i][j] << " ";
cout << endl;
}
cout << endl;
char s[5];
cin >> s;
}
};
int main(int argc, char **argv)
{
//char filename[42];
//cout << "Enter filename: ";
//cin >> filename;
char *filename = argv[1];
Sudoku s(filename);
if (!s.solve())
error("Couldn't solve!");
cout << "Solved! Here's the solution:" << endl << endl;
for (int i = 0; i < 9; ++i)
{
for (int j = 0; j < 9; ++j)
cout << s.mBoard[i][j] << " ";
cout << endl;
}
return 0;
}
(code including line numbers: http://sprunge.us/AiUc?cpp)
Now, I understand that it isn't very good style, but it came out of a late-night coding session. We also use an older compiler in the school lab, so I had to do some things differently (in that compiler, the standard headers have the '.h' extension, variables declared in for loops stay in scope after the loop, ...).
The file should contain whitespace-delimited digits for each spot in the board starting from the top-left going left to right and top to bottom, with empty spots signified by '0's.
For the following file, it works rather well:
5 3 0 0 7 0 0 0 0
6 0 0 1 9 5 0 0 0
0 9 8 0 0 0 0 6 0
8 0 0 0 6 0 0 0 3
4 0 0 8 0 3 0 0 1
7 0 0 0 2 0 0 0 6
0 6 0 0 0 0 2 8 0
0 0 0 4 1 9 0 0 5
0 0 0 0 8 0 0 7 9
However, this one gives it trouble:
0 9 4 0 0 0 1 3 0
0 0 0 0 0 0 0 0 0
0 0 0 0 7 6 0 0 2
0 8 0 0 1 0 0 0 0
0 3 2 0 0 0 0 0 0
0 0 0 2 0 0 0 6 0
0 0 0 0 5 0 4 0 0
0 0 0 0 0 8 0 0 7
0 0 6 3 0 4 0 0 8
If I uncomment the print statement and track the progress, I can see that it starts by heading in the wrong direction at points. Eventually it gets stuck toward the end, and the backtracking never goes back far enough. I think something is wrong with the 'checkOnlyAllowed' part...
What do you think could be the problem?
Also - I know I could've used a bitfield for the 'mAllowed' table but we don't officially know about bitwise operations yet in school. :P
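For readers curious about the bitfield idea mentioned above, here is a minimal sketch of how one cell's candidates could be packed into a single integer. The names `CandidateSet`, `ALL_DIGITS`, `isAllowed`, `disallow` and `countAllowed` are made up for this illustration and are not part of the code in the question.

```cpp
// Hypothetical alternative to mAllowed[i][j][1..9]: one small integer per
// cell, where bit n (for n = 1..9) is set while digit n is still allowed.
typedef unsigned short CandidateSet;

// Bits 1 through 9 set, bit 0 unused (binary 11 1111 1110).
const CandidateSet ALL_DIGITS = 0x3FE;

// True if digit n is still allowed in this set.
inline bool isAllowed(CandidateSet s, int n)
{
    return ((s >> n) & 1) != 0;
}

// Returns a copy of the set with digit n removed.
inline CandidateSet disallow(CandidateSet s, int n)
{
    return static_cast<CandidateSet>(s & ~(1u << n));
}

// Number of digits still allowed in this set.
inline int countAllowed(CandidateSet s)
{
    int c = 0;
    for (int n = 1; n <= 9; ++n)
        c += isAllowed(s, n) ? 1 : 0;
    return c;
}
```

With this layout the table would shrink to something like `CandidateSet allowed[9][9]`, and copying a board for a guess would copy far less data. Whether that matters for a 9x9 board is debatable.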
At line 170 you have a goto that is jumping out of a for loop, then continuing. This could give you some weird behavior with continuing the wrong loop, behavior that might depend on the specific compiler.
Try replacing lines 164-177 with:
164 for (;;)
165 {
166 bool successfullyContributedToTheBoard = false;
167 for (int i = 0; i < 9; ++i)
168 for (int j = 0; j < 9; ++j)
169 if (!mBoard[i][j])
170 if (trySinglePoint(i, j))
171 successfullyContributedToTheBoard = true;
172 if (!successfullyContributedToTheBoard)
173 return;
174 }
I didn't look at your code, but your strategy is exactly the same as the one I used to code a Sudoku solver, and I can't remember it being very slow. I got solutions in an instant. The maximum number of "guesses" the program had to make was 3 during my tests, and that was for Sudoku problems which were supposed to be very hard. Three is not a big number with respect to backtracking, and you can pick a cell which has only a few possibilities left (two or three), which limits the search space to about 20-30 states only (for hard Sudoku problems).
What I'm saying is, it's possible to use this strategy and solve Sudoku problems really fast. You only have to figure out how to optimize your code. Try to avoid redundant work. Try to remember things so you don't need to recalculate them again and again.
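As a rough illustration of "pick a cell which has only a few possibilities left", the sketch below scans the board for the empty cell with the fewest allowed digits. The function name `fewestCandidatesCell` is hypothetical; its two parameters are assumed to have the same layout as `mBoard` and `mAllowed` in the question.

```cpp
#include <utility>

// Hypothetical helper, not part of the original class. Returns the (row, col)
// of the empty cell with the fewest allowed digits, or (-1, -1) if no cell is
// empty. `board` and `allowed` are assumed to mirror mBoard and mAllowed.
std::pair<int, int> fewestCandidatesCell(const int board[9][9],
                                         const int allowed[9][9][10])
{
    int bestCount = 10; // more than any cell can have
    std::pair<int, int> best(-1, -1);
    for (int i = 0; i < 9; ++i)
        for (int j = 0; j < 9; ++j)
        {
            if (board[i][j])
                continue; // already filled in
            int count = 0;
            for (int n = 1; n <= 9; ++n)
                count += allowed[i][j][n] ? 1 : 0;
            if (count < bestCount)
            {
                bestCount = count;
                best = std::make_pair(i, j);
            }
        }
    return best;
}
```

A guessing routine could call this once and try only the digits allowed at the returned cell, which keeps the branching factor small.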
Alright, I got it working! It seems that the i, j loop within 'guess' was unnecessary, i.e., it should only do a guess on one empty spot because its 'child processes' will handle the rest. Fixing this actually made the code simpler. Now it works really well, and it's actually very quick!
Thanks for your help, everyone. ;-)
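For anyone landing here later, the "guess on one empty spot only" idea can be sketched roughly as follows. This is not the poster's final code, which was not shown. `Grid`, `canPlace` and `guessOneSpot` are hypothetical names, and the `mAllowed` table is replaced by a direct row/column/set check to keep the example self-contained.

```cpp
// Hypothetical stand-in for the class in the question: a bare 9x9 grid,
// with 0 meaning "empty".
struct Grid
{
    int cell[9][9];
};

// True if digit n does not yet appear in the row, column or 3x3 set of (i, j).
bool canPlace(const Grid &g, int i, int j, int n)
{
    for (int k = 0; k < 9; ++k)
        if (g.cell[i][k] == n || g.cell[k][j] == n)
            return false;
    int x = (i / 3) * 3;
    int y = (j / 3) * 3;
    for (int k = 0; k < 3; ++k)
        for (int l = 0; l < 3; ++l)
            if (g.cell[x + k][y + l] == n)
                return false;
    return true;
}

// Guess on ONE empty spot only; the recursive call handles the rest.
// Returns true and leaves the solution in g if one was found.
bool guessOneSpot(Grid &g)
{
    for (int i = 0; i < 9; ++i)
        for (int j = 0; j < 9; ++j)
        {
            if (g.cell[i][j])
                continue;
            for (int n = 1; n <= 9; ++n)
                if (canPlace(g, i, j, n))
                {
                    Grid copy = g; // work on a copy, like 'Sudoku s = *this'
                    copy.cell[i][j] = n;
                    if (guessOneSpot(copy))
                    {
                        g = copy;
                        return true;
                    }
                }
            // No digit works at the first empty spot, so this branch is
            // dead. Do not move on to other spots; report failure instead.
            return false;
        }
    return true; // no empty spot left
}
```

The key line is the `return false` after the digit loop: once every digit has failed at the first empty spot, trying other spots cannot help, because that spot still has to be filled eventually.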