OpenCV CUDA -- getCudaEnabledDeviceCount returns 0 - C++

I am new to OpenCV with CUDA.
I use OpenCV 2.4.6 and CUDA 4.2.
I have successfully compiled OpenCV with CUDA.
When I use the code:
int cuda_count;
cudaError_t error = cudaGetDeviceCount( &cuda_count );
it returns cudaSuccess and cuda_count = 1.
But when I use the code:
int num_devices = cv::gpu::getCudaEnabledDeviceCount();
num_devices is 0.
Why?
My complete code is:
int main()
{
    int num_devices = cv::gpu::getCudaEnabledDeviceCount();
    int cuda_count;
    cudaError_t error = cudaGetDeviceCount( &cuda_count );
    if (num_devices <= 0)
    {
        std::cerr << "no" << std::endl;
        return -1;
    }
    int enable_device_id = -1;
}

You must have compiled OpenCV without CUDA support.
From the documentation:
gpu::getCudaEnabledDeviceCount returns the number of installed
CUDA-enabled devices.
C++: int gpu::getCudaEnabledDeviceCount()
Use this function before any other GPU function calls. If OpenCV is
compiled without GPU support, this function returns 0.
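To double-check which case you are in, you can dump the build configuration at runtime and look for the CUDA line. A minimal sketch, assuming OpenCV 2.4.x (the exact wording of the "Use CUDA" line may vary between builds):

#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    // Prints the CMake configuration OpenCV was built with, including a line
    // such as "Use CUDA: YES (ver 4.2)" or "Use CUDA: NO".
    std::cout << cv::getBuildInformation() << std::endl;

    // Returns 0 if OpenCV was built without CUDA support,
    // even when a CUDA-capable device is present on the machine.
    std::cout << "CUDA-enabled devices seen by OpenCV: "
              << cv::gpu::getCudaEnabledDeviceCount() << std::endl;
    return 0;
}

If the build information says "Use CUDA: NO", the gpu module was built without CUDA and getCudaEnabledDeviceCount() will always return 0, regardless of what the CUDA runtime reports.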

Related

Pytorch inference time difference between CUDA 10.0 & 10.2

We have a working library that uses LibTorch 1.5.0, built with CUDA 10.0, which runs as expected.
We are working on upgrading to CUDA 10.2 for various non-PyTorch-related reasons. We noticed that when we run LibTorch inference with the newly compiled LibTorch (compiled exactly the same way, except switching to CUDA 10.2), the runtime is about 20x slower.
We also checked it using the precompiled binaries.
This was tested on 3 different machines using 3 different GPUs (Tesla T4, GTX 980 & P1000), and all give a consistent ~20x slowdown on CUDA 10.2
(both on Windows 10 & Ubuntu 16.04), all with the latest drivers and on 3 different torch scripts (of the same architecture).
I've simplified the code to be extremely minimal, with no external dependencies other than Torch:
int main(int argc, char** argv)
{
    // Initialize CUDA device 0
    cudaSetDevice(0);

    std::string networkPath = DEFAULT_TORCH_SCRIPT;
    if (argc > 1)
    {
        networkPath = argv[1];
    }

    auto jitModule = std::make_shared<torch::jit::Module>(torch::jit::load(networkPath, torch::kCUDA));
    if (jitModule == nullptr)
    {
        std::cerr << "Failed creating module" << std::endl;
        return EXIT_FAILURE;
    }

    // Meaningless data, just something to pass to the module to run on
    // PATCH_HEIGHT & WIDTH are defined as 256
    uint8_t* data = new uint8_t[PATCH_HEIGHT * PATCH_WIDTH * 3];
    memset(data, 0, PATCH_HEIGHT * PATCH_WIDTH * 3);

    auto stream = at::cuda::getStreamFromPool(true, 0);

    bool res = infer(jitModule, stream, data, PATCH_WIDTH, PATCH_HEIGHT);
    std::cout << "Warmed up" << std::endl;
    res = infer(jitModule, stream, data, PATCH_WIDTH, PATCH_HEIGHT);

    delete[] data;
    return 0;
}
// Inference function
bool infer(std::shared_ptr<JitModule>& jitModule, at::cuda::CUDAStream& stream, const uint8_t* inputData, int width, int height)
{
std::vector<torch::jit::IValue> tensorInput;
// This function simply uses cudaMemcpy to copy to device and create a torch::Tensor from that data
// I can paste it if it's relevant but didn't now to keep as clean as possible
if (!prepareInput(inputData, width, height, tensorInput, stream))
{
return false;
}
// Reduce memory usage, without gradients
torch::NoGradGuard noGrad;
{
at::cuda::CUDAStreamGuard streamGuard(stream);
auto totalTimeStart = std::chrono::high_resolution_clock::now();
jitModule->forward(tensorInput);
// The synchronize here is just for timing sake, not use in production
cudaStreamSynchronize(stream.stream());
auto totalTimeStop = std::chrono::high_resolution_clock::now();
printf("forward sync time = %.3f milliseconds\n",
std::chrono::duration<double, std::milli>(totalTimeStop - totalTimeStart).count());
}
return true;
}
When compiling this against Torch built with CUDA 10.0, we get a runtime of 18 ms; when we run it with Torch built with CUDA 10.2, we get a runtime of 430 ms.
Any thoughts on that?
This issue was also posted on PyTorch Forums.
Issue on GitHub
UPDATE
I profiled this small program using both CUDA versions.
It seems that the two use very different kernels:
96.5% of the CUDA 10.2 compute time goes to conv2d_grouped_direct_kernel, which takes ~60-100 ms on my P1000,
whereas the top kernels in the CUDA 10.0 run are:
47.1% - cudnn::detail::implicit_convolve_sgemm (~1.5 ms)
23.1% - maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt (~0.4 ms)
8.5% - maxwell_scudnn_128x32_relu_small_nn (~0.4 ms)
so it's easy to see where the time difference comes from. Now the question is: why?
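A sketch of one thing worth trying here, assuming the regression comes from cuDNN's convolution algorithm selection rather than from the kernels themselves: enable cuDNN benchmark mode in LibTorch, so cuDNN times the candidate algorithms for the fixed 256x256 input and caches the fastest one (at::globalContext().setBenchmarkCuDNN is assumed to be available in LibTorch 1.5.0):

#include <ATen/Context.h>

void enableCudnnAutotuning()
{
    // Ask cuDNN to benchmark the available convolution algorithms for each
    // new input shape and cache the fastest one. With fixed-size patches this
    // pays off after the first (warm-up) forward pass.
    at::globalContext().setBenchmarkCuDNN(true);

    // Make sure cuDNN is enabled at all (it is by default).
    at::globalContext().setUserEnabledCuDNN(true);
}

Calling this once before loading and warming up the module is enough. If the CUDA 10.2 build still picks the slow grouped-direct kernel with benchmarking on, that points more toward the cuDNN build shipped with that toolkit than toward the heuristic.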

OpenCV gives Assertion failed error when running on GPU using OpenCL

I have an Nvidia GTX 970M GPU and I am trying to run a face detection algorithm in C++ that runs on the GPU using OpenCL.
The function where this error occurs is :
ocl::OclCascadeClassifier::detectMultiScale()
The error I get is :
OpenCV Error: Assertion failed (localThreads[0] * localThreads[1] * localThreads[2] <= kernelWorkGroupSize) in cv::ocl::openCLVerifyKernel
I know that this problem is related to the GPU of the device, but I do not know how to fix it. I have tried OpenCV versions 2 and 3, but both give the same problem.
The problem was that it was trying to use the Intel HD Graphics GPU instead of the Nvidia GPU. I solved this by choosing the Nvidia GPU as the OpenCL Device.
The code I used was:
cv::ocl::DevicesInfo devInfo;
int res = cv::ocl::getOpenCLDevices(devInfo);
if (res == 0)
{
    std::cerr << "There is no OPENCL Here !" << std::endl;
}
else
{
    for (unsigned int i = 0; i < devInfo.size(); ++i)
    {
        std::cout << "Device : " << devInfo[i]->deviceName << " is present" << std::endl;
    }
}
cv::ocl::setDevice(devInfo[1]);
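Hardcoding devInfo[1] works on this particular machine, but the index of the Nvidia device can differ between systems. Here is a sketch of selecting the device by name instead, using the same OpenCV 2.4 ocl API as above (the substring match on deviceName is an assumption about how the driver reports the name):

#include <iostream>
#include <string>
#include <opencv2/ocl/ocl.hpp>

// Pick the first OpenCL device whose name contains "NVIDIA";
// fall back to device 0 if none is found.
bool selectNvidiaDevice()
{
    cv::ocl::DevicesInfo devInfo;
    if (cv::ocl::getOpenCLDevices(devInfo) == 0)
    {
        std::cerr << "No OpenCL devices found" << std::endl;
        return false;
    }

    for (size_t i = 0; i < devInfo.size(); ++i)
    {
        if (devInfo[i]->deviceName.find("NVIDIA") != std::string::npos)
        {
            cv::ocl::setDevice(devInfo[i]);
            std::cout << "Using OpenCL device: " << devInfo[i]->deviceName << std::endl;
            return true;
        }
    }

    cv::ocl::setDevice(devInfo[0]);
    std::cout << "NVIDIA device not found, using: " << devInfo[0]->deviceName << std::endl;
    return true;
}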

cuModuleLoadDataEx ignores all options

This question is similar to cuModuleLoadDataEx options, but I would like to bring the topic up again and, in addition, provide more information.
When loading a PTX string with the NVIDIA driver via cuModuleLoadDataEx, it seems to ignore all options altogether. I provide full working examples so that anyone interested can reproduce this directly and with no effort. First a small PTX kernel (save this as small.ptx), then the C++ program that loads the PTX kernel.
.version 3.1
.target sm_20, texmode_independent
.address_size 64
.entry main()
{
ret;
}
main.cc
#include <cstdlib>
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <map>
#include "cuda.h"

int main(int argc, char *argv[])
{
    CUdevice   cuDevice;
    CUcontext  cuContext;
    CUfunction func;
    CUresult   ret;
    CUmodule   cuModule;

    cuInit(0);

    std::cout << "trying to get device 0\n";
    ret = cuDeviceGet(&cuDevice, 0);
    if (ret != CUDA_SUCCESS) { exit(1); }

    std::cout << "trying to create a context\n";
    ret = cuCtxCreate(&cuContext, 0, cuDevice);
    if (ret != CUDA_SUCCESS) { exit(1); }

    std::cout << "loading PTX string from file " << argv[1] << "\n";
    std::ifstream ptxfile( argv[1] );
    std::stringstream buffer;
    buffer << ptxfile.rdbuf();
    ptxfile.close();
    std::string ptx_kernel = buffer.str();
    std::cout << "Loading PTX kernel with driver\n" << ptx_kernel;

    const unsigned int jitNumOptions = 3;
    CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
    void **jitOptVals = new void*[jitNumOptions];

    // set up size of compilation log buffer
    jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
    int jitLogBufferSize = 1024*1024;
    jitOptVals[0] = (void *)&jitLogBufferSize;

    // set up pointer to the compilation log buffer
    jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
    char *jitLogBuffer = new char[jitLogBufferSize];
    jitOptVals[1] = jitLogBuffer;

    // set up wall clock time
    jitOptions[2] = CU_JIT_WALL_TIME;
    float jitTime = -2.0;
    jitOptVals[2] = &jitTime;

    ret = cuModuleLoadDataEx( &cuModule, ptx_kernel.c_str(), jitNumOptions, jitOptions, (void **)jitOptVals );
    if (ret != CUDA_SUCCESS) { exit(1); }

    std::cout << "walltime: " << jitTime << "\n";
    std::cout << std::string(jitLogBuffer) << "\n";
}
Build (assuming CUDA is installed under /usr/local/cuda; I use CUDA 5.0):
g++ -I/usr/local/cuda/include -L/usr/local/cuda/lib64/ main.cc -o main -lcuda
If someone is able to extract any sensible information from the compilation process, that would be great! The CUDA Driver API documentation, where cuModuleLoadDataEx is explained (along with the options it is supposed to accept), is at http://docs.nvidia.com/cuda/cuda-driver-api/index.html
If I run this, the log is empty and jitTime wasn't even touched by the NV driver:
./main small.ptx
trying to get device 0
trying to create a context
loading PTX string from file empty.ptx
Loading PTX kernel with driver
.version 3.1
.target sm_20, texmode_independent
.address_size 64
.entry main()
{
ret;
}
walltime: -2
EDIT:
I managed to get the JIT compile time. However, it seems that the driver expects an array of 32-bit values as OptVals, not, as stated in the manual, an array of pointers (void *), which are 64 bits on my system. So this works:
const unsigned int jitNumOptions = 1;
CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
int *jitOptVals = new int[jitNumOptions];
jitOptions[0] = CU_JIT_WALL_TIME;
// here the call to cuModuleLoadDataEx
std::cout << "walltime: " << (float)jitOptVals[0] << "\n";
I believe that it is not possible to do the same with an array of void *. The following code does not work:
const unsigned int jitNumOptions = 1;
CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
void **jitOptVals = new void*[jitNumOptions];
jitOptions[0] = CU_JIT_WALL_TIME;
// here the call to cuModuleLoadDataEx
// here I would also have a problem casting a 64-bit void * to a float (32-bit)
EDIT
Looking at the JIT compilation time, jitOptVals[0] was misleading. As mentioned in the comments, the JIT compiler caches previous translations and won't update the JIT compile time if it finds a cached compilation. Since I was looking at whether this value changed or not, I assumed that the call ignores the options altogether. It doesn't. It works fine.
Your jitOptVals should not contain pointers to your values; instead, cast the values themselves to void*:
// set up size of compilation log buffer
jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
int jitLogBufferSize = 1024*1024;
jitOptVals[0] = (void *)jitLogBufferSize;

// set up pointer to the compilation log buffer
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
char *jitLogBuffer = new char[jitLogBufferSize];
jitOptVals[1] = jitLogBuffer;

// set up wall clock time
jitOptions[2] = CU_JIT_WALL_TIME;
float jitTime = -2.0;
// Keep jitOptVals[2] empty, as it is only an output value:
// jitOptVals[2] = (void*)jitTime;
and after cuModuleLoadDataEx, you get your jitTime back from jitOptVals[2], which the driver overwrites with a float value (e.g. memcpy(&jitTime, &jitOptVals[2], sizeof(jitTime));).
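Putting the answer together with the original program, here is a sketch of the corrected option setup (same three options as the question; the memcpy at the end is my assumption about how to read back the float that the driver writes into the CU_JIT_WALL_TIME value slot):

#include <cstdlib>
#include <cstring>
#include <iostream>
#include <string>
#include "cuda.h"

// ptx_kernel is the PTX string loaded from file, as in the question.
CUmodule loadModuleWithLog(const std::string& ptx_kernel)
{
    const unsigned int jitNumOptions = 3;
    CUjit_option jitOptions[jitNumOptions];
    void*        jitOptVals[jitNumOptions];

    // Log buffer size: pass the value itself, cast to void*.
    jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
    int jitLogBufferSize = 1024 * 1024;
    jitOptVals[0] = (void*)(size_t)jitLogBufferSize;

    // Log buffer: this one really is a pointer.
    jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
    char* jitLogBuffer = new char[jitLogBufferSize];
    jitOptVals[1] = jitLogBuffer;

    // Wall clock time: output-only, the driver overwrites the value slot.
    jitOptions[2] = CU_JIT_WALL_TIME;
    jitOptVals[2] = 0;

    CUmodule cuModule;
    CUresult ret = cuModuleLoadDataEx(&cuModule, ptx_kernel.c_str(),
                                      jitNumOptions, jitOptions, jitOptVals);
    if (ret != CUDA_SUCCESS) {
        std::cerr << "cuModuleLoadDataEx failed: " << ret << "\n";
        exit(1);
    }

    // Reinterpret the overwritten slot as a float (milliseconds).
    float jitTime = 0.0f;
    std::memcpy(&jitTime, &jitOptVals[2], sizeof(jitTime));

    std::cout << "walltime: " << jitTime << " ms\n";
    std::cout << jitLogBuffer << "\n";

    delete[] jitLogBuffer;
    return cuModule;
}

Note that a fresh PTX string (or a cleared JIT cache) is needed to see a non-zero wall time, for the caching reason described in the question's second EDIT.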

Octave c++ and VS2010

I'm trying to use Octave with Visual C++.
I have downloaded octave-3.6.1-vs2010-setup-1.exe, created a new project, added the Octave include folder to the include path, octinterp.lib and octave.lib to the lib path, and set the Octave bin folder as the running directory.
The program compiles and runs fine, except for the feval call, which causes this exception:
Microsoft C++ exception: octave_execution_exception at memory location 0x0012faef
and on the Octave side:
Invalid resizing operation or ambiguous assignment to an out-of-bounds array element.
What am I doing wrong?
Code for a standalone program:
#include <octave/octave.h>
#include <octave/oct.h>
#include <octave/parse.h>

int main(int argc, char **argv)
{
    if (octave_main (argc, argv, true))
    {
        ColumnVector NumRands(2);
        NumRands(0) = 10;
        NumRands(1) = 1;

        octave_value_list f_arg, f_ret;
        f_arg(0) = octave_value(NumRands);
        f_ret = feval("rand", f_arg, 1);
        Matrix unis(f_ret(0).matrix_value());
    }
    else
    {
        error ("Octave interpreter initialization failed");
    }
    return 0;
}
Thanks in advance.
I tried it myself, and the problem seems to originate from the feval line.
Now I don't have an explanation as to why, but the problem was solved by simply switching to the "Release" configuration instead of the "Debug" configuration.
I am using the Octave3.6.1_vs2010 build, with VS2010 on WinXP.
Here is the code I tested:
#include <iostream>
#include <octave/oct.h>
#include <octave/octave.h>
#include <octave/parse.h>

int main(int argc, char **argv)
{
    // Init Octave interpreter
    if (!octave_main(argc, argv, true)) {
        error("Octave interpreter initialization failed");
    }

    // x = rand(10,1)
    ColumnVector sz(2);
    sz(0) = 10; sz(1) = 1;
    octave_value_list in = octave_value(sz);
    octave_value_list out = feval("rand", in, 1);

    // print random numbers
    if (!error_state && out.length () > 0) {
        Matrix x( out(0).matrix_value() );
        std::cout << "x = \n" << x << std::endl;
    }
    return 0;
}
with an output:
x =
0.165897
0.0239711
0.957456
0.830028
0.859441
0.513797
0.870601
0.0643697
0.0605021
0.153486
I'd guess that it has actually stopped pointing at the next line, and the error really lies in this line:
f_arg(0) = octave_value(NumRands);
You seem to be attempting to take a value (which value?) from a vector and then assign it to element 0 of a variable that has not been defined as a vector.
I don't really know, though ... I've never tried writing Octave code like that. I'm just trying to work it out by translating the code to standard MATLAB/Octave code, and that line seems really odd to me ...

Unable to find source of exception: cudaError_enum at memory location

I am trying to identify the source of a Microsoft C++ exception:
First-chance exception at 0x770ab9bc in test_fft.exe: Microsoft C++ exception: cudaError_enum at memory location 0x016cf234...
My build environment is:
IDE: Microsoft Visual C++ 2010 Express
NVIDIA Driver: 301.27
CUDA: NVIDIA CUDA Toolkit v4.2 (32-bit)
SDK: NVIDIA GPU Computing SDK 4.2 (32-bit)
Problem scope: I am trying to wrap CUFFT behind a C++ class. This way I can hide from the calling code the conversion of my data type to cufftComplex, the execution of the FFT, and the memory transfers.
Class header:
#ifndef SIGNAL_PROCESSING_FFT_HPP
#define SIGNAL_PROCESSING_FFT_HPP

#include "signal_processing\types.hpp"
#include <boost/cstdint.hpp>
#include <cufft.h>
#include <vector>

namespace signal_processing {

class FFT {
public:
    FFT ( boost::uint32_t size );
    virtual ~FFT();

    void forward ( ComplexVectorT const& input, ComplexVectorT& output );
    void reverse ( ComplexVectorT const& input, ComplexVectorT& output );

private:
    cufftComplex*   m_device_data;
    cufftComplex*   m_host_data;
    cufftHandle     m_plan;
    boost::uint32_t m_size;
};

}

#endif // SIGNAL_PROCESSING_FFT_HPP
FFT constructor:
FFT::FFT ( boost::uint32_t size )
    : m_size ( size )
{
    CudaSafeCall ( cudaMalloc((void**)&m_device_data, sizeof(cufftComplex) * m_size ) );
    m_host_data = (cufftComplex*) malloc ( m_size * sizeof(cufftComplex) );
    CufftSafeCall ( cufftPlan1d ( &m_plan, m_size, CUFFT_C2C, 1 ) );
}
The Microsoft C++ exception is being thrown in the FFT constructor at the first line, where cudaMalloc is called. This error only seems to occur when I run the code using the FFT class under the Visual Studio debugger.
References
CudaSafeCall definition
#define CudaSafeCall(err) __cudaSafeCall ( err, __FILE__, __LINE__ )
__cudaSafeCall definition
inline void __cudaSafeCall ( cudaError err, const char* file, const int line )
{
#ifdef CUDA_ERROR_CHECK
    if ( cudaSuccess != err )
    {
        std::cerr << boost::format ( "cudaSafeCall() failed at %1$s:%2$i : %3$s\n" )
                     % file
                     % line
                     % cudaGetErrorString ( err );
        exit(-1);
    }
#endif
    return;
}
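One caveat with this helper: the check is compiled out entirely unless CUDA_ERROR_CHECK is defined, so the macro needs to be defined for any error to be reported at all. A minimal sketch:

// Define this before the header that declares __cudaSafeCall is included,
// otherwise CudaSafeCall() silently discards the returned error code.
#define CUDA_ERROR_CHECK
#define CudaSafeCall(err) __cudaSafeCall ( err, __FILE__, __LINE__ )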
The observation you are making has to do with an exception that is caught and handled properly within the CUDA libraries. It is, in some cases, a normal part of CUDA GPU operation. I believe your application is returning no API errors in this case. If you were not within the VS environment that can report this, you would not observe this at all.
This is considered normal behavior under CUDA. I believe there were some attempts to eliminate it in CUDA 5.5. You might wish to try that, although it's not considered an issue either way.
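One way to convince yourself that the first-chance exception is benign is to check the API return codes explicitly at the point where the debugger reports it. A minimal sketch (my own check, not from the answer above); it should print "no error" twice even though the debugger shows the first-chance cudaError_enum exception during context creation:

#include <cstdio>
#include <cuda_runtime.h>
#include <cufft.h>

int main()
{
    cufftComplex* device_data = NULL;
    const size_t count = 1024;

    // The first CUDA runtime call triggers context creation, which is where
    // the debugger typically reports the first-chance cudaError_enum exception.
    cudaError_t err = cudaMalloc((void**)&device_data, sizeof(cufftComplex) * count);
    printf("cudaMalloc returned: %s\n", cudaGetErrorString(err));

    // Confirm no error is left pending in the runtime either.
    printf("cudaGetLastError:    %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFree(device_data);
    return 0;
}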