Execution failed status when invoking cusolverDnDgeqrf from the CUDA library - C++

I am trying to perform a QR factorization on the GPU using the cuSOLVER library from CUDA.
I reduced my problem to the example below.
Basically, the steps are:
I allocate memory on the host and initialize a [5x3] matrix with 1s,
I allocate memory on the device and copy the matrix to it,
I initialize the solver handle with cusolverDnCreate,
I determine the size of the needed workspace with cusolverDnDgeqrf_bufferSize,
and, finally, I try to do the QR factorization with cusolverDnDgeqrf.
Unfortunately, the last command systematically fails by returning CUSOLVER_STATUS_EXECUTION_FAILED (int value = 6), and I can't figure out what went wrong!
Here is the faulty code:
#include <cusolverDn.h>
#include <cuda_runtime_api.h>
#include <iostream>
int main(void)
{
int N = 5, P = 3;
double *hostData;
cudaMallocHost((void **) &hostData, N * sizeof(double));
for (int i = 0; i < N * P; ++i)
hostData[i] = 1.;
double *devData;
cudaMalloc((void**)&devData, N * sizeof(double));
cudaMemcpy((void*)devData, (void*)hostData, N * sizeof(double), cudaMemcpyHostToDevice);
cusolverStatus_t retVal;
cusolverDnHandle_t solverHandle;
retVal = cusolverDnCreate(&solverHandle);
std::cout << "Handler creation : " << retVal << std::endl;
double *devTau, *work;
int szWork;
cudaMalloc((void**)&devTau, P * sizeof(double));
retVal = cusolverDnDgeqrf_bufferSize(solverHandle, N, P, devData, N, &szWork);
std::cout << "Work space sizing : " << retVal << std::endl;
cudaMalloc((void**)&work, szWork * sizeof(double));
int *devInfo;
cudaMalloc((void **)&devInfo, 1);
retVal = cusolverDnDgeqrf(solverHandle, N, P, devData, N, devTau, work, szWork, devInfo); //CUSOLVER_STATUS_EXECUTION_FAILED
std::cout << "QR factorization : " << retVal << std::endl;
int hDevInfo = 0;
cudaMemcpy((void*)devInfo, (void*)&hDevInfo, 1 * sizeof(int), cudaMemcpyDeviceToHost);
std::cout << "Info device : " << hDevInfo << std::endl;
cudaFree(devInfo);
cudaFree(work);
cudaFree(devTau);
cudaFree(devData);
cudaFreeHost(hostData);
cudaDeviceReset();
}
If you see any obvious error in my code, please let me know!
Many thanks.

Any time you are having trouble with a CUDA code, you should always use proper CUDA error checking and run your code with cuda-memcheck before asking for help.
You may also want to be aware that a fully worked QR factorization example is given in the relevant CUDA/cuSOLVER sample code, and there is also sample code in the documentation.
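For reference, a minimal error-checking helper might look like the sketch below (the CHECK_CUDA macro name is illustrative, and <iostream> / <cstdlib> are assumed to be included):
#define CHECK_CUDA(call) \
do { \
    cudaError_t err_ = (call); \
    if (err_ != cudaSuccess) { \
        std::cerr << "CUDA error: " << cudaGetErrorString(err_) \
                  << " at " << __FILE__ << ":" << __LINE__ << std::endl; \
        std::exit(EXIT_FAILURE); \
    } \
} while (0)
// usage: CHECK_CUDA(cudaMemcpy(devData, hostData, N * P * sizeof(double), cudaMemcpyHostToDevice));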
With proper error checking, you may have discovered:
this is not correct:
cudaMalloc((void **)&devInfo, 1);
the second parameter is the size in bytes, so it should be sizeof(int), not 1. This error results in an error in a cudaMemcpyAsync operation internal to the cusolverDnDgeqrf call, which would show up in cuda-memcheck output.
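That is, the corrected allocation is:
cudaMalloc((void **)&devInfo, sizeof(int));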
This is not correct:
cudaMemcpy((void*)devInfo, (void*)&hDevInfo, 1 * sizeof(int), cudaMemcpyDeviceToHost);
the order of the pointer parameters is destination first, followed by source. So you have those parameters reversed, and this call would throw a runtime API error that you could observe if you were doing proper error checking (or visible in cuda-memcheck output).
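That is, the corrected copy is:
cudaMemcpy((void*)&hDevInfo, (void*)devInfo, 1 * sizeof(int), cudaMemcpyDeviceToHost);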
Once you fix those errors, then the qrf call will actually return a zero status (no error). But we're not quite done yet (again, proper error checking would let us know we are not quite done yet.)
In addition to the above errors, you have made some additional sizing errors. Your matrix is of size N*P, so it has N*P elements, and you are initializing that many elements here:
for (int i = 0; i < N * P; ++i)
hostData[i] = 1.;
but you are not allocating for that many elements on the host here:
cudaMallocHost((void **) &hostData, N * sizeof(double));
or on the device here:
cudaMalloc((void**)&devData, N * sizeof(double));
and you are not transferring that many elements here:
cudaMemcpy((void*)devData, (void*)hostData, N * sizeof(double), cudaMemcpyHostToDevice);
So in the 3 cases above, if you change N*sizeof(double) to N*P*sizeof(double) you will be able to fix those errors, and the code then runs with no errors reported by cuda-memcheck, and also no errors returned from any of the API calls.
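For clarity, the three corrected lines are:
cudaMallocHost((void **) &hostData, N * P * sizeof(double));
cudaMalloc((void**)&devData, N * P * sizeof(double));
cudaMemcpy((void*)devData, (void*)hostData, N * P * sizeof(double), cudaMemcpyHostToDevice);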

Related

Windows threading synchronization performance issue

I have a threading issue under Windows.
I am developing a program that runs complex physical simulations for different conditions. Say one condition per hour of the year: that would be 8760 simulations. I group those simulations per thread such that each thread runs a for loop of 273 simulations (on average).
I bought an AMD Ryzen 9 5950X with 16 cores (32 threads) for this task. On Linux, all the threads seem to be between 98% and 100% usage, while under Windows I get this:
(The first bar is the I/O thread reading data, the smaller bars are the process threads. Red: synchronization, green: process, purple: I/O)
This is from Visual Studio's concurrency visualizer, which tells me that 63% of the time was spent on thread synchronization. As far as I can tell, my code is the same for both the Linux and Windows executions.
I did my best to make the objects immutable to avoid issues, and that provided a big gain with my old 8-thread Intel i7. However, with many more threads, this issue arises.
For threading, I have tried a custom parallel for and the Taskflow library. Both perform identically for what I want to do.
Is there something fundamental about Windows threads that produces this behaviour?
The custom parallel for code:
/**
 * Parallel for.
 * @tparam Index integer type
 * @tparam Callable function type
 * @param start start index of the loop
 * @param end final +1 index of the loop
 * @param func function to evaluate
 * @param nb_threads number of threads; if zero, it is determined automatically
 */
template<typename Index, typename Callable>
static void ParallelFor(Index start, Index end, Callable func, unsigned nb_threads=0) {
// Estimate number of threads in the pool
if (nb_threads == 0) nb_threads = getThreadNumber();
// Size of a slice for the range functions
Index n = end - start + 1;
Index slice = (Index) std::round(n / static_cast<double> (nb_threads));
slice = std::max(slice, Index(1));
// [Helper] Inner loop
auto launchRange = [&func] (int k1, int k2) {
for (Index k = k1; k < k2; k++) {
func(k);
}
};
// Create pool and launch jobs
std::vector<std::thread> pool;
pool.reserve(nb_threads);
Index i1 = start;
Index i2 = std::min(start + slice, end);
for (unsigned i = 0; i + 1 < nb_threads && i1 < end; ++i) {
pool.emplace_back(launchRange, i1, i2);
i1 = i2;
i2 = std::min(i2 + slice, end);
}
if (i1 < end) {
pool.emplace_back(launchRange, i1, end);
}
// Wait for jobs to finish
for (std::thread &t : pool) {
if (t.joinable()) {
t.join();
}
}
}
A complete C++ project illustrating the issue is uploaded here
Main.cpp:
//
// Created by santi on 26/08/2022.
//
#include "input_data.h"
#include "output_data.h"
#include "random.h"
#include "par_for.h"
void fillA(Matrix& A){
Random rnd;
rnd.setTimeBasedSeed();
for(int i=0; i < A.getRows(); ++i)
for(int j=0; j < A.getRows(); ++j)
A(i, j) = (int) rnd.randInt(0, 1000);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
for(const int& t: time_indices){
Matrix b = input_data.getAt(t);
Matrix A(input_data.getDim(), input_data.getDim());
fillA(A);
Matrix x = A * b;
output_data.setAt(t, x);
}
}
void process(int time_steps, int dim, int n_threads){
InputData input_data(time_steps, dim);
OutputData output_data(time_steps, dim);
// correct the number of threads
if ( n_threads < 1 ) { n_threads = ( int )getThreadNumber( ); }
// generate indices
std::vector<int> time_indices = arrange<int>(time_steps);
// compute the split of indices per core
std::vector<ParallelChunkData<int>> chunks = prepareParallelChunks(time_indices, n_threads );
// run in parallel
ParallelFor( 0, ( int )chunks.size( ), [ & ]( int k ) {
// run chunk
worker(input_data, output_data, chunks[k].indices, k );
} );
}
int main(){
process(8760, 5000, 0);
return 0;
}
The performance problem you see is definitely caused by the many memory allocations, as already suspected by Matt in his answer. To expand on this: Here is a screenshot from Intel VTune running on an AMD Ryzen Threadripper 3990X with 64 cores (128 threads):
As you can see, almost all of the time is spent in malloc or free, which get called from the various Matrix operations. The bottom part of the image shows the timeline of the activity of a small selection of the threads: green means that the thread is inactive, i.e. waiting. Usually only one or two threads are actually active. Allocating and freeing memory accesses a shared resource, causing the threads to wait for each other.
I think you have only two real options:
Option 1: No dynamic allocations anymore
The most efficient thing to do would be to rewrite the code to preallocate everything and get rid of all the temporaries. To adapt it to your example code, you could replace the b = input_data.getAt(t); and x = A * b; like this:
void MatrixVectorProduct(Matrix const & A, Matrix const & b, Matrix & x)
{
for (int i = 0; i < x.getRows(); ++i) {
for (int j = 0; j < x.getCols(); ++j) {
x(i, j) = 0.0;
for (int k = 0; k < A.getCols(); ++k) {
x(i,j) += (A(i,k) * b(k,j));
}
}
}
}
void getAt(int t, Matrix const & input_data, Matrix & b) {
for (int i = 0; i < input_data.getRows(); ++i)
b(i, 0) = input_data(i, t);
}
void worker(const InputData& input_data,
OutputData& output_data,
const std::vector<int>& time_indices,
int thread_index){
std::cout << "Thread " << thread_index << " [" << time_indices[0]<< ", " << time_indices[time_indices.size() - 1] << "]\n";
Matrix A(input_data.getDim(), input_data.getDim());
Matrix b(input_data.getDim(), 1);
Matrix x(input_data.getDim(), 1);
for (const int & t: time_indices) {
getAt(t, input_data.getMat(), b);
fillA(A);
MatrixVectorProduct(A, b, x);
output_data.setAt(t, x);
}
std::cout << "Thread " << thread_index << ": Finished" << std::endl;
}
This fixes the performance problems.
Here is a screenshot from VTune, where you can see a much better utilization:
Option 2: Using a special allocator
The alternative is to use a different allocator that handles allocating and freeing memory more efficiently in multithreaded scenarios. One that I had very good experience with is mimalloc (there are others such as hoard or the one from TBB). You do not need to modify your source code, you just need to link with a specific library as described in the documentation.
I tried mimalloc with your source code, and it gave near 100% CPU utilization without any code changes.
I also found a post on the Intel forums with a similar problem, and the solution there was the same (using a special allocator).
Additional notes
Matrix::allocSpace() allocates the memory by using pointers to arrays. It is better to use one contiguous array for the whole matrix instead of multiple independent arrays. That way, all elements are located behind each other in memory, allowing more efficient access.
But in general I suggest using a dedicated linear algebra library such as Eigen instead of the hand-rolled matrix implementation, to exploit vectorization (SSE2, AVX, ...) and to get the benefits of a highly optimized library (a sketch follows after these notes).
Ensure that you compile your code with optimizations enabled.
Disable various cross-checks if you do not need them: assert() (i.e. define NDEBUG in the preprocessor), and for MSVC possibly /GS-.
Ensure that you actually have enough memory installed.
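To illustrate the Eigen suggestion above, the per-iteration matrix-vector product could be written roughly as follows (a sketch, assuming Eigen 3 is available; the function name is illustrative and not part of the asker's code):
#include <Eigen/Dense>
// Computes x = A * b on preallocated, contiguously stored Eigen objects.
// noalias() tells Eigen the result does not overlap the operands,
// so no temporary is allocated for the product.
void MatrixVectorProductEigen(const Eigen::MatrixXd &A, const Eigen::VectorXd &b, Eigen::VectorXd &x)
{
    x.noalias() = A * b;
}
Because Eigen stores each matrix in one contiguous block and vectorizes the product, this also addresses the allocation and cache issues discussed above.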
You said that all your memory was pre-allocated, but in the worker function I see this...
Matrix b = input_data.getAt(t);
which allocates and fills a new matrix b, and this...
Matrix A(input_data.getDim(), input_data.getDim());
which allocates and fills a new matrix A, and this...
Matrix x = A * b;
which allocates and fills a new matrix x.
The heap is a global data structure, so the thread synchronization time you're seeing is probably contention in the memory allocate/free functions.
These are in a tight loop. You should fix this loop to access b by reference, and reuse the other 2 matrices for every iteration.

the dimension and read problem when using DPC++ image arrays

This problem really confuses me. I want to use DPC++ to read a group of images, so I used sycl::image. Below is my code.
#define N 4 // dimension
#define M 128 // dimension
#define C 4 // 4 channels
#define L 2 // 2 images
int * host_array3_2 = malloc_host<int>(N*M*C*L, Q);
image im3(host_array3_2, image_channel_order::rgba, image_channel_type::unsigned_int32, range{ M,N,L }); // the image format
The kernel code is as follows; I use an accessor with the image_array target to read the data:
Q.submit([&](handler &h) {
auto out = sycl::stream(1024, 1024 * 2, h);
accessor<int4, 2, access::mode::read, access::target::image_array> acs3(im3, h);//the accessor format
h.parallel_for(nd_range{ range{ M ,N,L }, range{ N,N,L } }, [=](nd_item<3> it) {
int idx = it.get_global_linear_id();
if (idx == 0){
out << acs3.get_count() << " " << acs3.get_range() << " \n"; // confusing here (see the range question below)
//const auto &ss = acs3[0]; // no compile error
//ss.read(int2(0, 1)); // confusing here; compiler error: "array subscript out of range", "SYCL kernel cannot call a variadic function"
}
});
});
In addition to the read problem, I found that the range is {128,4,4}.
Why is the third dimension 4? Shouldn't it be the value of L (2)?
It also seems that the third dimension depends only on the second dimension, no matter what L is. Can anybody explain this?
An interesting phenomenon: under the release compile mode it works, and the output is also different. That's strange.

Why are my matrix inversions so slow with LAPACKE in C++? MAGMA alternative and setup

I am using LAPACK to invert a matrix. I pass the matrices by reference, i.e. I work on their addresses; the function below takes an input matrix and an output matrix referenced by their addresses.
The issue is that I am obliged to convert F_matrix into a 1D array, and I think this wastes performance at runtime. How could I get rid of this extra task, which I think is time consuming if I call the function matrix_inverse_lapack many times?
Here is the function concerned:
// Passing Matrixes by Reference
void matrix_inverse_lapack(vector<vector<double>> const &F_matrix, vector<vector<double>> &F_output) {
// Index for loop and arrays
int i, j, ip, idx;
// Size of F_matrix
int N = F_matrix.size();
int *IPIV = new int[N];
// Statement of main array to inverse
double *arr = new double[N*N];
// Output Diagonal block
double *diag = new double[N];
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
arr[idx] = F_matrix[i][j];
}
}
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, arr, N, IPIV);
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, arr, N, IPIV);
for (i = 0; i<N; i++){
for (j = 0; j<N; j++){
idx = i*N + j;
F_output[i][j] = arr[idx];
}
}
delete[] IPIV;
delete[] arr;
}
For example, I call it this way :
vector<vector<double>> CO_CL(lsize*(2*Dim_x+Dim_y), vector<double>(lsize*(2*Dim_x+Dim_y), 0));
... some code
matrix_inverse_lapack(CO_CL, CO_CL);
The inversion performance is not what I expected; I think this is due to the 2D -> 1D conversion that I described in the function matrix_inverse_lapack.
Update
I was advised to install MAGMA on my macOS Big Sur 11.3, but I am having a lot of difficulty setting it up.
I have an AMD Radeon Pro 5600M graphics card, and the OpenCL framework that ships with the default Big Sur installation is already in place (maybe I am wrong in saying that). Could anyone describe the procedure to follow to install MAGMA? I saw that a MAGMA software exists at http://magma.maths.usyd.edu.au/magma/ but it is really expensive and doesn't correspond to what I want: I just need the SDK (headers and libraries), if possible built for my GPU card. I have already installed the Intel oneAPI SDK on my macOS. Maybe I could link it to a MAGMA installation.
I saw another link, https://icl.utk.edu/magma/software/index.html, where MAGMA seems to be public: it has no link to the non-free version above, does it?
First of all, let me complain that OP did not provide all the necessary data. The program is almost complete, but it is not a minimal, reproducible example. This is important because (a) it wastes time and (b) it hides potentially relevant information, e.g. about the matrix initialization. Second, OP did not provide any details on the compilation, which, again, may be relevant.
Last, but not least, OP didn't check the status codes returned by the Lapack functions for possible errors, and this could also be important for a correct interpretation of the results.
Let's start from a minimal reproducible example:
#include <lapacke.h>
#include <vector>
#include <chrono>
#include <iostream>
using Matrix = std::vector<std::vector<double>>;
std::ostream &operator<<(std::ostream &out, Matrix const &v)
{
const auto size = std::min<int>(10, v.size());
for (int i = 0; i < size; i++)
{
for (int j = 0; j < size; j++)
{
out << v[i][j] << "\t";
}
if (size < std::ssize(v)) out << "...";
out << "\n";
}
return out;
}
void matrix_inverse_lapack(Matrix const &F_matrix, Matrix &F_output, std::vector<int> &IPIV_buffer,
std::vector<double> &matrix_buffer)
{
// std::cout << F_matrix << "\n";
auto t0 = std::chrono::steady_clock::now();
const int N = F_matrix.size();
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
auto idx = i * N + j;
matrix_buffer[idx] = F_matrix[i][j];
}
}
auto t1 = std::chrono::steady_clock::now();
// LAPACKE routines
int info1 = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, N, N, matrix_buffer.data(), N, IPIV_buffer.data());
int info2 = LAPACKE_dgetri(LAPACK_ROW_MAJOR, N, matrix_buffer.data(), N, IPIV_buffer.data());
auto t2 = std::chrono::steady_clock::now();
for (int i = 0; i < N; i++)
{
for (int j = 0; j < N; j++)
{
auto idx = i * N + j;
F_output[i][j] = matrix_buffer[idx];
}
}
auto t3 = std::chrono::steady_clock::now();
auto whole_fun_time = std::chrono::duration<double>(t3 - t0).count();
auto lapack_time = std::chrono::duration<double>(t2 - t1).count();
// std::cout << F_output << "\n";
std::cout << "status: " << info1 << "\t" << info2 << "\t" << (info1 == 0 && info2 == 0 ? "Success" : "Failure")
<< "\n";
std::cout << "whole function: " << whole_fun_time << "\n";
std::cout << "LAPACKE matrix operations: " << lapack_time << "\n";
std::cout << "conversion: " << (whole_fun_time - lapack_time) / whole_fun_time * 100.0 << "%\n";
}
int main(int argc, const char *argv[])
{
const int M = 5; // numer of test repetitions
const int N = (argc > 1) ? std::stoi(argv[1]) : 10;
std::cout << "Matrix size = " << N << "\n";
std::vector<int> IPIV_buffer(N);
std::vector<double> matrix_buffer(N * N);
// Test matrix_inverse_lapack M times
for (int i = 0; i < M; i++)
{
Matrix CO_CL(N);
for (auto &v : CO_CL) v.resize(N);
int idx = 1;
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx + 1.0 / idx;
idx++;
}
}
matrix_inverse_lapack(CO_CL, CO_CL, IPIV_buffer, matrix_buffer);
}
}
Here, operator<< is overkill, but it may be useful for anyone wanting to verify half-manually that the code works (by uncommenting the two commented-out std::cout lines), and ensuring that the code is correct is more important than measuring its performance.
The code can be compiled with
g++ -std=c++20 -O3 main.cpp -llapacke
The program relies on an external library, lapacke, which needs to be installed, headers + binaries, for the code to compile and run.
My code differs a bit from OP's: it is closer to "modern C++" in that it refrains from using naked pointers; I also added external buffers to matrix_inverse_lapack to avoid continually invoking the memory allocator and deallocator, a small improvement that reduces the 2D-1D-2D conversion overhead in a measurable way. I also had to initialize the matrix and guess what value of N OP might use. I also added some timer readings for benchmarking. Apart from this, the logic of the code is unchanged.
Now a benchmark carried out on a decent workstation. It lists the percentage of time the conversion takes relative to the total time taken by matrix_inverse_lapack. In other words, I measure the conversion overhead:
N = 10, 3.5%
N = 30, 1.5%
N = 100, 1%
N = 300, 0.5%
N = 1000, 0.35%
N = 3000, 0.1%
The time taken by Lapack nicely scales as N³, as expected (data not shown). The time to invert a matrix is about 16 seconds for N = 3000, and about 5-6 μs (microseconds) for N = 10.
I assume an overhead of even 3% is completely acceptable. I believe OP uses matrices of size larger than 100, in which case the overhead at or below 1% is certainly acceptable.
So what could OP (or anyone with a similar problem) have done wrong to obtain "unacceptable conversion overhead values"? Here's my short list:
Improper compilation
Improper matrix initialization (for tests)
Improper benchmarking
1. Improper compilation
If one forgets to compile in Release mode, one ends up with optimized Lapacke competing with unoptimized conversion. On my machine this peaks at a 33% overhead for N = 20.
2. Improper matrix initialization (for tests)
If one initializes the matrix like this:
for (auto &v : CO_CL)
{
for (auto &x : v)
{
x = idx; // rather than, eg., idx + 1.0/idx
idx++;
}
}
then the matrix is singular, and Lapack returns quite quickly with a nonzero status. This increases the relative importance of the conversion part. But singular matrices are not what one wants to invert (it's impossible to do).
3. Improper benchmarking
Here's an example of the program output for N = 10:
./a.out 10
Matrix size = 10
status: 0 0 Success
whole function: 0.000127658
LAPACKE matrix operations: 0.000126783
conversion: 0.685425%
status: 0 0 Success
whole function: 1.2497e-05
LAPACKE matrix operations: 1.2095e-05
conversion: 3.21677%
status: 0 0 Success
whole function: 1.0535e-05
LAPACKE matrix operations: 1.0197e-05
conversion: 3.20835%
status: 0 0 Success
whole function: 9.741e-06
LAPACKE matrix operations: 9.422e-06
conversion: 3.27482%
status: 0 0 Success
whole function: 9.939e-06
LAPACKE matrix operations: 9.618e-06
conversion: 3.2297%
One can see that the first call to lapack functions can take 10 times more time than the subsequent calls. This is quite a stable pattern, as if Lapack needed some time for self-initialization. It can affect the measurements for small N badly.
4. What else can be done?
OP appears to believe that their approach to 2D arrays is good and that Lapack is strange and old-fashioned in packing a 2D array into a 1D array. No. It is Lapack that is right.
If one defines a 2D array as vector<vector<double>>, one obtains one advantage: code simplicity. This comes at a price. Each row of such a matrix is allocated separately from the others. Thus, a 100-by-100 matrix may be stored in 100 completely different memory blocks. This has a bad impact on cache (and prefetcher) utilization. Lapack (and other linear algebra packages) enforces compactification of the data into a single, contiguous array, precisely to minimize cache and prefetcher misses. If OP had used such an approach from the very beginning, they would probably have gained more than the 1-3% they now pay for the conversion.
This compactification can be achieved in at least three ways.
Write a custom class for a 2D matrix, with the internal data stored in a 1D array and convenient access member functions (e.g. operator()), or find a library that does just that (a minimal sketch is given after this list).
Write a custom allocator for std::vector (or find a library). This allocator should allocate the memory from a preallocated 1D vector exactly matching the data storage pattern used by Lapack
Use std::vector<double*> and initialize the pointers with addresses pointing at the appropriate elements of a preallocated 1D array.
Each of the above solutions forces some changes to the surrounding code, which OP might not want to do. All depends on the code complexity and expected performance gains.
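For the first option, a minimal sketch of such a class could look like this (the class name and interface are illustrative, not taken from OP's code):
#include <vector>
// Row-major 2D matrix stored in a single contiguous 1D buffer.
class Matrix2D
{
public:
    Matrix2D(int rows, int cols) : rows_(rows), cols_(cols), data_(rows * cols) {}
    double &operator()(int i, int j) { return data_[i * cols_ + j]; }
    double operator()(int i, int j) const { return data_[i * cols_ + j]; }
    double *data() { return data_.data(); } // can be passed directly to LAPACKE_dgetrf / LAPACKE_dgetri
    int rows() const { return rows_; }
    int cols() const { return cols_; }
private:
    int rows_, cols_;
    std::vector<double> data_;
};
With such a class, matrix_inverse_lapack can call the LAPACKE routines directly on data(), and the 2D-1D-2D conversion disappears entirely.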
EDIT: Alternative libraries
An alternative approach is to use a library that is known to be highly optimized. Lapack by itself can be regarded as a standard interface with many implementations, and it may happen that OP uses an unoptimized one. Which library to choose may depend on the hardware/software platform OP is interested in, and it may vary over time.
As of now (mid-2021), decent suggestions are:
Lapack https://www.netlib.org/lapack/
Atlas https://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software http://math-atlas.sourceforge.net/
OpenBlas https://www.openblas.net/
Magma https://developer.nvidia.com/magma
Plasma https://bitbucket.org/icl/plasma/src/main/
If OP uses matrices of size at least 100, then the GPU-oriented MAGMA might be worth trying.
An easier way (to install and run) might be a parallel CPU library, e.g. Plasma. Plasma is Lapack-compliant; it has been developed by a large team of people, including Jack Dongarra, and it should also be rather easy to compile locally, as it is provided with a CMake script.
An example how much a parallel CPU-based, multicore implementation can outperform a single-threaded implementation of the LU-decomposition can be found for example here: https://cse.buffalo.edu/faculty/miller/Courses/CSE633/Tummala-Spring-2014-CSE633.pdf (short answer: 5 to 15 times for matrices of size 1000).

glMapBufferRange freezing OpenGL driver

After generating a set of data using a compute shader and storing it in a Shader Storage buffer, I am attempting to read from that buffer to print out the data using the code:
#define INDEX_AT(x,y,z,i) (xyzToId(Vec3i((x), (y), (z)),\
Vec3i(NUM_RAYS_X,\
NUM_RAYS_Y,\
POINTS_ON_RAY))\
* 3 + (i))
PRINT_GL_ERRORS();
glBindBuffer(GL_SHADER_STORAGE_BUFFER, dPositionBuffer);
float* data_ptr = NULL;
for (int ray_i = 0; ray_i < POINTS_ON_RAY; ray_i++)
{
for (int y = 0; y < NUM_RAYS_Y; y++)
{
int x = 0;
data_ptr = NULL;
data_ptr = (float*)glMapBufferRange(
GL_SHADER_STORAGE_BUFFER,
INDEX_AT(x, y, ray_i, 0) * sizeof(float),
3 * (NUM_RAYS_X) * sizeof(float),
GL_MAP_READ_BIT);
if (data_ptr == NULL)
{
PRINT_GL_ERRORS();
return false;
}
else
{
for (int x = 0; x < NUM_RAYS_X; x++)
{
std::cout << "("
<< data_ptr[x * 3 + 0] << ","
<< data_ptr[x * 3 + 1] << ","
<< data_ptr[x * 3 + 2] << ") , ";
}
}
glUnmapBuffer(GL_SHADER_STORAGE_BUFFER);
PRINT_GL_ERRORS();
std::cout << std::endl;
}
std::cout << "\n" << std::endl;
}
where the function xyzToId converts three dimensional coordinates into a one-dimensional index.
When I attempt to run this, however, the program crashes at the call to glMapBufferRange, giving the error message:
The NVIDIA OpenGL driver lost connection with the display driver due to exceeding the Windows Time-Out limit and is unable to continue.
The application must close.
Error code: 7
Would you like to visit
http://nvidia.custhelp.com/cgi-bin/nvidia.cfg/php/enduser/std_adp.php?p_faqid=3007
for help?
The buffer that I am mapping is not very large at all, only 768 floats, and previous calls to glMapBuffer on a different shader storage buffer (of only two floats) completed with no problems. I can't seem to find any information relevant to this error online, and everything that I have read about the speed of glMapBufferRange indicates that a buffer of this size should only take on the order of tens of milliseconds to map, not the two second timeout that the program is crashing on.
Am I missing something about how glMapBufferRange should be used?
It was an unrelated error. Today I learned that OpenGL sometimes buffers commands, and several actions (like mapping a buffer) force it to finish all the commands in its queue. In this case, it was the action of actually dispatching the compute shader itself.
Today I also learned that indexing a shader storage buffer out of bounds will cause the OpenGL driver to freeze up just like it would if it were taking too long to complete.
All in all, this was largely a case of errors masquerading as different errors and popping up in the wrong spot.
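For anyone hitting the same symptom, the point at which the queued compute work is forced to finish can be made explicit. A minimal sketch (the dispatch arguments are placeholders, not taken from the asker's code):
glDispatchCompute(num_groups_x, num_groups_y, num_groups_z);
// Make the shader's storage-buffer writes visible to later buffer reads/maps.
glMemoryBarrier(GL_BUFFER_UPDATE_BARRIER_BIT);
// Force the queued work to complete here, so a long-running (or out-of-bounds)
// compute shader shows up at this call instead of inside glMapBufferRange.
glFinish();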

CUDA kernel function output variable isn't modified

I am trying to pass an object to a kernel. This object basically has two variables: one acts as the input and the other as the output of the kernel. But when I launch the kernel, the output variable does not change. However, when I add another variable to the kernel and assign the output value to that variable as well, it suddenly works for both of them.
I've read in another thread (While loop fails in CUDA kernel) that the compiler can treat a kernel as empty for optimization purposes if it doesn't produce any output.
So is it possible that this input/output object I'm passing as the only kernel argument isn't somehow recognized by the compiler as an output? And if that's true, is there an elegant way (I would like to avoid adding another kernel argument), such as a compile option, to prevent this?
This is the class for this object.
class Replica
{
public :
signed char gA[1024];
int MA;
__device__ __host__ Replica(){
}
};
And this is the kernel that is basically a sum reduction.
__global__ void sumKerA(Replica* Rd)
{
int t = threadIdx.x;
int b = blockIdx.x;
__shared__ signed short gAs[1024];
gAs[t] = Rd[b].gA[t];
for (unsigned int stride = 1024 >> 1; stride > 0; stride >>= 1){
__syncthreads();
if (t < stride){
gAs[t] += gAs[t + stride];
}
}
__syncthreads();
if (t == 0){
Rd[b].MA = gAs[0];
}
}
And finally my host code.
int main ()
{
// replicas - array of objects
Replica R[128];
for (int i = 0; i < 128; ++i){
for (int j = 0; j < 1024; ++j){
R[i].gA[j] = 2*(rand() % 2) - 1;
}
R[i].MA = 0;
}
Replica* Rd;
cudaSetDevice(0);
cudaMalloc((void **)&Rd,128*sizeof(Replica));
cudaMemcpy(Rd,R,128*sizeof(Replica),cudaMemcpyHostToDevice);
dim3 DimBlock(1024,1,1);
dim3 DimGridA(128,1,1);
sumKerA <<< DimBlock, DimGridA >>> (Rd);
cudaThreadSynchronize();
cudaMemcpy(&R,Rd,128*sizeof(Replica),cudaMemcpyDeviceToHost);
// cudaMemcpy(&M,Md,128*sizeof(int),cudaMemcpyDeviceToHost);
for (int i = 0; i < 128; ++i){
cout << R[i].MA << " ";
}
cudaFree(Rd);
return 0;
}
Based on your reduction code, it appears that you intend to launch 1024 threads per block.
In that case, this is incorrect:
dim3 DimBlock(1024,1,1);
dim3 DimGridA(128,1,1);
sumKerA <<< DimBlock, DimGridA >>> (Rd);
The first kernel configuration parameter is the dimensions of the grid. The second parameter is the dimension of the threadblock. If you want 1024 threads per block, while launching 128 blocks, your kernel launch should look like this:
sumKerA <<< DimGridA, DimBlock >>> (Rd);
If you add proper cuda error checking to your code, I expect you would see a kernel launch failure, because using the block variable (blockIdx.x) to index into the Rd array of 128 elements would index beyond the end of the array, in your original case.
If you modify the Replica objects pointed to by Rd in your kernel, that is externally visible state, so any code that modifies those objects cannot be "optimized away" by the compiler.
Also note that cudaThreadSynchronize() is deprecated in favor of cudaDeviceSynchronize() (they have the same behavior).
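Putting the fix together with basic error checking, the launch could look like this (a sketch; the error-handling style is illustrative):
dim3 DimBlock(1024, 1, 1); // threads per block
dim3 DimGridA(128, 1, 1); // blocks in the grid
sumKerA <<< DimGridA, DimBlock >>> (Rd); // grid configuration first, then block configuration
cudaError_t err = cudaGetLastError(); // reports launch/configuration errors
if (err == cudaSuccess) err = cudaDeviceSynchronize(); // reports errors during kernel execution
if (err != cudaSuccess) std::cout << cudaGetErrorString(err) << std::endl;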