How to fix CUBLAS_STATUS_ARCH_MISMATCH? - c++

I am trying to use CUBLAS to perform a simple matrix multiplication. I am using the following function:
#ifdef CUBLAS_API_H_
// cuBLAS API errors
static const char *_cudaGetErrorEnum(cublasStatus_t error)
{
switch (error)
{
case CUBLAS_STATUS_SUCCESS:
return "CUBLAS_STATUS_SUCCESS";
case CUBLAS_STATUS_NOT_INITIALIZED:
return "CUBLAS_STATUS_NOT_INITIALIZED";
case CUBLAS_STATUS_ALLOC_FAILED:
return "CUBLAS_STATUS_ALLOC_FAILED";
case CUBLAS_STATUS_INVALID_VALUE:
return "CUBLAS_STATUS_INVALID_VALUE";
case CUBLAS_STATUS_ARCH_MISMATCH:
return "CUBLAS_STATUS_ARCH_MISMATCH";
case CUBLAS_STATUS_MAPPING_ERROR:
return "CUBLAS_STATUS_MAPPING_ERROR";
case CUBLAS_STATUS_EXECUTION_FAILED:
return "CUBLAS_STATUS_EXECUTION_FAILED";
case CUBLAS_STATUS_INTERNAL_ERROR:
return "CUBLAS_STATUS_INTERNAL_ERROR";
}
return "<unknown>";
}
#endif
void gpu_blas_mmul(cublasHandle_t &handle, cudaStream_t &stream, const real_t *A, const real_t *B, real_t *C, const int m, const int k, const int n) {
int lda=m,ldb=k,ldc=m;
const real_t alf = 1;
const real_t bet = 0;
const real_t *alpha = &alf;
const real_t *beta = &bet;
cublasSetStream(handle, stream);
// Do the actual multiplication
cublasStatus_t err = GEMM(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
if (err != CUBLAS_STATUS_SUCCESS)
{
std::cout<<"CUBLAS err : "<<_cudaGetErrorEnum(err)<<"\n";
}
}
In a header file, GEMM is defined as
#define GEMM cublasDgemm
#define real_t double
The function is called like this:
gpu_blas_mmul(cublas[i], streams[P/2-i-1], A, B, C, N, N, N);
A, B and C are device memory locations and I am trying to multiply two NxN matrices (both stored in column-major format).
streams is an array of P/2 CUDA streams, cublas is an array of CUBLAS handles, and i counts up from 0 to P/2-1. Both arrays contain valid streams and handles respectively (no errors when creating them). I am compiling the code for sm_20, so double precision shouldn't be a problem.
The code works fine when called from one file. This section has its own cublasCreate and cublasDestroy calls. The same function when called from another location throws the error "CUBLAS_STATUS_ARCH_MISMATCH".
What could be wrong?
Thank you,
Thomas

It turns out that I WAS using invalid CUDA streams and/or CUBLAS handles: I was overrunning the bounds of the arrays that store the CUDA streams and CUBLAS handles.
The cryptic error message gave me no idea what was happening. However, starting again from a basic example and working up led me to the issue.
Hope someone finds this helpful! :)
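One way to make this kind of overrun fail loudly is to keep the handles and streams in std::vector and index them with at(). The sketch below is only an illustration: FakeHandle, FakeStream, HandleAndStream and pick are hypothetical names, and the two structs stand in for cublasHandle_t and cudaStream_t so that the code compiles without the CUDA toolkit.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for cublasHandle_t and cudaStream_t so that the
// sketch compiles without the CUDA toolkit.
struct FakeHandle { int id; };
struct FakeStream { int id; };

struct HandleAndStream {
    FakeHandle* handle;
    FakeStream* stream;
};

// Selects handle i and the stream at the mirrored position, as in
// cublas[i], streams[P/2-i-1]. std::vector::at() throws std::out_of_range
// instead of silently reading past the end of either array.
inline HandleAndStream pick(std::vector<FakeHandle>& cublas,
                            std::vector<FakeStream>& streams,
                            std::size_t i) {
    assert(cublas.size() == streams.size());
    HandleAndStream result;
    result.handle = &cublas.at(i);
    // If i >= streams.size(), the unsigned subtraction wraps around to a
    // huge index, which at() rejects as well.
    result.stream = &streams.at(streams.size() - 1 - i);
    return result;
}
```

With real CUDA types the same idea applies: an out-of-range index then raises an exception at the point of the mistake rather than surfacing later as an unrelated-looking CUBLAS status.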

Related

Avoid writing two versions of a function by hiding/unhiding lines of code

There are two versions of a function (the code below is a simplified version). Both versions are used in the program. In the actual function, the differences between the two versions can occur at two or three different places.
How can I avoid writing both versions in the code without sacrificing performance, through templates or other means? This is an attempt to make the code more readable.
Performance is critical because the function will be run many, many times, and I am writing a benchmark for different implementations.
(Also, is this an OK API if I am writing a library for a few people?)
Example:
int set_intersect(const int* A, const int s_a,
const int* B, const int s_b,
int* C = 0){
//if (int* C == 0), we are running version
//0 of the function.
//int* C is not known during compilation
//time for version 1.
int Count0 = 0;
//counter for version 0 of the function.
const int* const C_original(C);
//counter and pointer for version 1 of
//the function
int a = 0;
int b = 0;
int A_now;
int B_now;
while(a < s_a && b < s_b){
A_now = A[a];
B_now = B[b];
a += (A_now <= B_now);
b += (B_now <= A_now);
if (A_now == B_now){
if (C == 0){
Count0++;
} else {
C++;
*(C)=A_now;
}
}
}
if (C == 0){
return Count0;
}else{
return C - C_original;
}
}
Thanks.
Updates:
Conditional compile-time inclusion/exclusion of code based on template argument(s)
(some of those templates look so long)
Remove/Insert code at compile time without duplication in C++
(this is more similar to my case. my case is simpler though.)
I guess the following can work, but it adds a new argument.
int set_intersect(const int* A, const int s_a,
const int* B, const int s_b,
int* C = 0,
char flag);
put all code for version 0 into if (flag == '0') { /* version 0 code */ }
put all code for version 1 into if (flag == '1') { /* version 1 code */}
I could probably make the flag a template parameter (as Barmar suggested in the comments); that way it doesn't feel like adding another argument to the function. I could also replace the 0 and 1 with an enum (like enum class set_intersection_type {find_set, size_only}). Calling the function would look like set_intersect<find_set>(const int* A, const int s_a, const int* B, const int s_b, int* C) or set_intersect<size_only>(const int* A, const int s_a, const int* B, const int s_b). Hopefully this is more readable than before, and the compiler is smart enough to see what is going on.
Another problem: what if someone uses the find_set version (version 1) and forgets to pass C, leaving the default argument (int* C = 0) in place? It would still be possible to call the function as set_intersect<find_set>(const int* A, const int s_a, const int* B, const int s_b).
Maybe I can use dasblinkenlight's idea from the comments: create two wrapper functions (set_intersection, set_intersection_size). Each wrapper calls the actual function with different arguments. The actual function would be private so that no one can call it directly.
For the different implementations of set intersection, maybe I can create a common wrapper with templates. Calling the wrapper would look like set_intersection<basic>, set_intersection<binary_search>, or set_intersection_size<simd>, etc. This seems to look better.
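The two-wrapper idea with a compile-time flag could be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the detail namespace and set_intersect_impl are hypothetical names, and, unlike the code in the question, the sketch assumes the output is written starting at C[0].

```cpp
#include <cassert>

namespace detail {

// Shared loop. Store is a compile-time constant, so the compiler can drop
// the write entirely in the size-only instantiation.
template <bool Store>
int set_intersect_impl(const int* A, int s_a,
                       const int* B, int s_b,
                       int* C) {
    assert(s_a >= 0 && s_b >= 0);
    int count = 0;
    int a = 0;
    int b = 0;
    while (a < s_a && b < s_b) {
        const int A_now = A[a];
        const int B_now = B[b];
        a += (A_now <= B_now);
        b += (B_now <= A_now);
        if (A_now == B_now) {
            if (Store) {
                C[count] = A_now;  // assumption: output starts at C[0]
            }
            ++count;
        }
    }
    return count;
}

}  // namespace detail

// Version 0: only counts the common elements.
inline int set_intersection_size(const int* A, int s_a,
                                 const int* B, int s_b) {
    return detail::set_intersect_impl<false>(A, s_a, B, s_b, nullptr);
}

// Version 1: writes the common elements to C and returns how many were written.
inline int set_intersection(const int* A, int s_a,
                            const int* B, int s_b,
                            int* C) {
    assert(C != nullptr);
    return detail::set_intersect_impl<true>(A, s_a, B, s_b, C);
}
```

Because the size-only wrapper has no C parameter at all, the "forgot to pass C" mistake described above cannot be expressed by a caller.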
This seems doable in general; the question is whether you want to do it. I would say no. From what I can tell, the function does two different things:
Version 0 computes the size of the intersection
Version 1 computes the size of the intersection, and writes the intersection to the location past C*, assuming there is enough space to store it.
I would make two distinct functions, set_intersection and set_intersection_size, not only for speed but also for clarity. If you insist on having one, I would benchmark your code against std::set_intersection and, if possible, just redirect to the std version if C != 0.
In its current form I would not use your library. However, I would also be hard-pressed to come up with a situation where I would prefer a custom-made set_intersection to the STL version. If I ever needed better performance than the STL offers, I would expect to have identified that point in the code as the bottleneck, and I would not use a library call at all but write the code myself, possibly in assembly, unrolling the loop, etc.
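For reference, redirecting to the standard algorithm while keeping the pointer-and-size interface from the question might look like the sketch below. set_intersect_via_std is a hypothetical name, and both inputs are assumed to be sorted, which std::set_intersection requires.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical wrapper: same pointer-and-size interface as the question,
// implemented with std::set_intersection. Both inputs must be sorted.
inline int set_intersect_via_std(const int* A, int s_a,
                                 const int* B, int s_b,
                                 int* C) {
    assert(std::is_sorted(A, A + s_a) && std::is_sorted(B, B + s_b));
    // Returns a pointer one past the last element written to C.
    int* end = std::set_intersection(A, A + s_a, B, B + s_b, C);
    return static_cast<int>(end - C);
}
```

A wrapper like this also gives a baseline to benchmark the hand-written loop against.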
What bugs me a bit is how this is supposed to work:
const int* const Count1(C);
//counter and pointer for version 1 of
//the function
...
Count1++;
*(Count1)=A_now;
If it is known at compile time which version you want, you can use conditional compilation.
#define Version_0 //assuming you know this compilation is version 0
Then you can go:
int set_intersect(...)
#ifdef Version_0
//Version 0 of the code
#else
//Version 1 of the code
#endif
This way only one version of the code gets compiled.
If you don't know at compile time which version is needed, I suggest having two separate functions so that you don't have to check the version on every call.
Have a type that you specialize on a bool parameter:
template<bool b>
struct Counter
{
};
template<>
struct Counter<false>
{
int c;
Counter(int *)
: c(0)
{
}
int operator++() { return ++c; }
void storeA(const int a_now) {}
};
template<>
struct Counter<true>
{
int* c; // not const: the pointer is advanced and written through
Counter(int * c_orig)
: c(c_orig)
{
}
int* operator++() { return ++c; }
void storeA(const int a_now) { *c = a_now; }
};
Then specialize your algorithm on Counter as a template argument. Note that this will be exactly the same for both cases, that is, you don't need to specialize:
template<typename Counter>
struct SetIntersectHelper
{
static int set_intersect(const int* A, const int s_a,
const int* B, const int s_b,
int* C)
{
// your function's body, using Counter
}
};
Now, you're ready to add the generic method:
int set_intersect(const int* A, const int s_a,
const int* B, const int s_b,
int* C = 0)
{
return C ? SetIntersectHelper< Counter< true > >::set_intersect(A, s_a, B, s_b, C):
SetIntersectHelper< Counter< false > >::set_intersect(A, s_a, B, s_b, C);
}

How to use sba (sparse bundle adjustment)

I want to use sba for a bundle adjustment task, specifically sba-1.6 (http://users.ics.forth.gr/~lourakis/sba/). But the user manual does not explain exactly how to use it, and I am somewhat confused.
For example, I want to use the function sba_mot_levmar, which has a parameter p that I do not understand. The provided examples set the rotation part of p to 0, so what is p?
And after calling this function, what does p contain?
int sba_mot_levmar(
const int n, /* number of points */
const int m, /* number of images */
const int mcon,
char *vmask,
double *p, /* initial parameter vector p0: (a1, ..., am).
* aj are the image j parameters, size m*cnp */
const int cnp,/* number of parameters for ONE camera; e.g. 6 for Euclidean cameras */
double *x,
double *covx,
const int mnp,
void (*proj)(int j, int i, double *aj, double *xij, void *adata),
void (*projac)(int j, int i, double *aj, double *Aij, void *adata),
void *adata,
const int itmax,
const double opts[SBA_OPTSSZ],
double info[SBA_INFOSZ]
)
There are good tutorials on how to use sba with ROS, though I am not sure whether it is Lourakis' implementation:
-http://wiki.ros.org/sba/Tutorials/IntroductionToSBA
It walks through an example. I also recently found a Python wrapper for it (if you don't mind the language used):
-https://pypi.org/project/sba/
I believe these are easier to use and run than calling the library directly, as you describe.
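On the layout of p asked about above: going only by the comments in the quoted signature, p is a flat array of m*cnp doubles in which the parameters of camera j start at p + j*cnp. The sketch below illustrates just that indexing; camera_block is a hypothetical helper and not part of sba. What the cnp values of one camera mean (for example how rotation and translation are encoded) is decided by the proj and projac callbacks you supply, which this sketch does not cover. Whether p is overwritten with the refined estimate on return should be checked against the comments in your copy of the sba sources.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of sba): returns a pointer to the cnp
// parameters of camera j inside the flat vector p = (a1, ..., am).
inline double* camera_block(std::vector<double>& p, int j, int cnp) {
    assert(j >= 0 && cnp > 0);
    assert(static_cast<std::size_t>(j + 1) * static_cast<std::size_t>(cnp) <= p.size());
    return p.data() + static_cast<std::size_t>(j) * static_cast<std::size_t>(cnp);
}
```

For m = 3 cameras with cnp = 6, p would hold 18 doubles, and camera_block(p, 1, 6) would point at p[6].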

const correctness of fftw3

Summary
What can I do if a function of an external library expects a non-const pointer (double *), but it is known that the values it points to remain unchanged, so that for const correctness I should be able to pass a const pointer (const double *)?
The situation
I'd like to create a function that calculates the auto-correlation of a vector but unfortunately it seems that fftw3 (which is a C API) doesn't care about const correctness.
The function I'd like to call is:
fftw_plan fftw_plan_dft_r2c_1d(int n0,
double *in, fftw_complex *out,
unsigned flags);
And the code I'd like to create:
vector<double> autocorr(const vector<double>& data)
{
vector<double> ret(data.size(), 0);
// prepare other variables
fftw_plan a = fftw_plan_dft_r2c_1d(size, &data.front(), tmp, FFTW_ESTIMATE);
// do the rest of the work
return ret;
}
Of course, this will not work: the argument of my function is const vector<double>& data, so &data.front() is a const double* and cannot be passed where a double* is expected. What is the most appropriate solution to keep the const-correctness of my code?
If you're confronted with a C API that promises more than it shows with respect to const-correctness, it is time to const_cast:
vector<double> autocorr(const vector<double>& data)
{
vector<double> ret(data.size(), 0);
// prepare other variables
fftw_plan a = fftw_plan_dft_r2c_1d(size, const_cast<double*>(&data.front()), tmp, FFTW_ESTIMATE);
// do the rest of the work
return ret;
}
Also note this sentence in the documentation:
in and out point to the input and output arrays of the transform, which may be the same (yielding an in-place transform). These arrays are overwritten during planning, unless FFTW_ESTIMATE is used in the flags.
Since you're using the FFTW_ESTIMATE flag, you should be fine in this case.
The reason FFTW developers decided not to duplicate this function for the sake of const is that in C, const isn't at all a big deal, and FFTW is a C library.
First file a bug report against the library, then const_cast away the const-ness of &data.front(), for example const_cast<double*>(&data.front())
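The trade-off between casting and copying can be illustrated without FFTW. In the sketch below, legacy_sum is a hypothetical stand-in for a C function that takes double* but only reads from it; sum_with_cast and sum_with_copy are likewise made-up names. Casting away const is fine only as long as the callee really does not write; writing through the cast pointer to an object that was declared const would be undefined behavior.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a C API that takes double* but only reads.
inline double legacy_sum(double* in, int n) {
    assert(n == 0 || in != nullptr);
    double total = 0.0;
    for (int i = 0; i < n; ++i) {
        total += in[i];
    }
    return total;
}

// Option 1: const_cast. Acceptable only because legacy_sum never writes.
inline double sum_with_cast(const std::vector<double>& data) {
    return legacy_sum(const_cast<double*>(data.data()),
                      static_cast<int>(data.size()));
}

// Option 2: copy into a mutable scratch buffer. Costs a copy, needs no cast,
// and stays correct even if the callee does overwrite its input.
inline double sum_with_copy(const std::vector<double>& data) {
    std::vector<double> scratch(data);
    return legacy_sum(scratch.data(), static_cast<int>(scratch.size()));
}
```

The copying variant is the safer choice whenever the library is allowed to overwrite its input, as the quoted FFTW documentation says happens during planning without FFTW_ESTIMATE.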

const void* as a complex number for MKL Blas routine in C++

I got stuck calling the MKL BLAS function cblas_zgemv.
There are two coefficients, alpha and beta, which are complex numbers:
alpha
REAL for sgemv
DOUBLE PRECISION for dgemv
COMPLEX for cgemv, scgemv
DOUBLE COMPLEX for zgemv, dzgemv
But in the declaration of the function, they are passed as const void *:
void cblas_zgemv (const CBLAS_ORDER order, const CBLAS_TRANSPOSE TransA,
const MKL_INT M, const MKL_INT N, const void *alpha, const void *A,
const MKL_INT lda, const void *X, const MKL_INT incX, const void *beta,
void *Y, const MKL_INT incY);
I have tried to set alpha = complex(1.0,0), but this gives me an error:
error: no suitable conversion function from "complex<double>" to "const void *" exists
What can I do? I don't understand what this const void* is...
The function expects a pointer to the complex value, not the value itself. You'll need a variable to store the value in, and then pass the address of that:
std::complex<double> alpha(1,0);
cblas_zgemv(..., &alpha, ...);
I believe this is safe since lapack_complex_double is layout-compatible with (and, in C++, is an alias for) std::complex<double>. To be on the safe side, you might prefer to use lapack_complex_double when calling that library.
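To see why passing an address works, here is a small sketch that does not need MKL. read_scalar is a hypothetical stand-in for how a CBLAS-style routine might read a complex scalar through const void*; it relies on the standard guarantee that a std::complex<double> can be accessed as two consecutive doubles (real part first).

```cpp
#include <cassert>
#include <complex>

// std::complex<double> is stored as two consecutive doubles: real, imaginary.
static_assert(sizeof(std::complex<double>) == 2 * sizeof(double),
              "unexpected std::complex<double> layout");

// Hypothetical stand-in for a CBLAS-style routine: reads a complex scalar
// through an untyped pointer, the way alpha and beta are passed.
inline std::complex<double> read_scalar(const void* alpha) {
    assert(alpha != nullptr);
    const double* parts = static_cast<const double*>(alpha);
    return std::complex<double>(parts[0], parts[1]);
}
```

The caller keeps a named variable alive for the duration of the call and passes its address, for example read_scalar(&alpha); a temporary such as complex(1.0,0) has no address to pass, which is what the compiler error is pointing at.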

LAPACK on Win32

I have been exploring algorithms that require some work on matrices, and I have gotten some straightforward code working on my Linux machine. Here is an excerpt:
extern "C" {
// link w/ LAPACK
extern void dpptrf_(const char *uplo, const int *n, double *ap, int *info);
extern void dpptri_(const char *uplo, const int *n, double *ap, int *info);
// BLAS todo: get sse2 up in here (ATLAS?)
extern void dgemm_(const char *transa, const char *transb, const int *m,
const int *n, const int *k, const double *alpha, const double *a,
const int *lda, const double *b, const int *ldb, const double *beta,
double *c, const int *ldc);
}
// in-place: be sure that (N*(N+1)/2) doubles have been initialized
inline void invert_mat_sym_packed(double *vd, int n) {
int out = 0;
dpptrf_("U",&n,vd,&out);
ASSERT(!out);
dpptri_("U",&n,vd,&out);
ASSERT(!out);
}
// use with col-major ordering!!!
inline void mult_cm(double *a, double *b, double alpha, int m, int k, int n, double *c) {
int lda = m, ldb = k, ldc = m; double beta = 1.0;
dgemm_("N","N",&m,&n,&k,&alpha,a,&lda,b,&ldb,&beta,c,&ldc);
}
All I had to do was sudo apt-get install liblapack and link against the library.
I am now trying to get this code working from MinGW using the 32-bit DLLs from here, but I am seeing segfaults and invalid output. I will proceed with gdb to determine the location of the error, but I suspect there's a better, cleaner, more portable way to get this done.
What I did to get it to compile was install Fortran for MinGW (mingw-get install fortran) and link to the 32-bit BLAS and LAPACK DLLs from the earlier link.
I'm not sure how much I'm missing here... How does everybody else get their LAPACK going when coding with gcc for win32?
What I'm looking for is an easy-to-use C interface. I don't want wrapper classes all over the place.
I tried to find a download for Intel MKL... Ain't even free software!?
I solved the problem. It had nothing to do with the way I was calling the routines: I had failed to memset my buffers to zero before accumulating values into them.
Calling Fortran routines is basically just as straightforward as it is on Linux.
However, another rather serious problem has appeared: Once I use the lapack routines my program no longer handles exceptions. See here.
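The zeroing issue can be reproduced without linking LAPACK. In the sketch below, naive_gemm_cm is a hypothetical stand-in that follows dgemm's documented semantics for the "N","N" case (C := alpha*A*B + beta*C, column-major); zero_then_multiply is also a made-up name. With beta = 1.0, as in mult_cm above, whatever is already in C is added to the product, so an uninitialized buffer yields garbage.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in with dgemm's semantics for the "N","N" case:
// C := alpha * A * B + beta * C, all matrices stored column-major.
inline void naive_gemm_cm(const double* a, const double* b,
                          double alpha, double beta,
                          int m, int k, int n, double* c) {
    assert(a != nullptr && b != nullptr && c != nullptr);
    assert(m >= 0 && k >= 0 && n >= 0);
    for (int col = 0; col < n; ++col) {
        for (int row = 0; row < m; ++row) {
            double acc = 0.0;
            for (int i = 0; i < k; ++i) {
                acc += a[row + i * m] * b[i + col * k];
            }
            c[row + col * m] = alpha * acc + beta * c[row + col * m];
        }
    }
}

// Clears C first, then accumulates with beta = 1.0, so the result is
// exactly A * B regardless of what the buffer held before.
inline void zero_then_multiply(const double* a, const double* b,
                               int m, int k, int n, double* c) {
    std::fill(c, c + static_cast<std::size_t>(m) * n, 0.0);
    naive_gemm_cm(a, b, 1.0, 1.0, m, k, n, c);
}
```

If the output is not meant to accumulate, passing beta = 0.0 to the real dgemm_ is an alternative to clearing the buffer.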