I have been trying for days to get a Qt project running on a 32-bit Windows 7 system, in which I want to include CUDA code. This combination of things is either so simple that no one ever bothered to put an example online, or so difficult that nobody ever succeeded, it seems. Either way, the only helpful forum threads I found covered the same issue on Linux or Mac, or used Visual Studio on Windows.
All of these give all sorts of different errors, whether due to linking, clashing libraries, spaces in file names, or non-existent folders in the Windows version of the CUDA SDK.
Is there someone who has a clear .pro file to offer that does the trick?
I am aiming to compile a simple programme with ordinary C++ code in Qt style, using the Qt 4.8 libraries, that references several CUDA modules in .cu files. Something of the form:
TestCUDA \
TestCUDA.pro
main.cpp
test.cu
So I finally managed to assemble a .pro file that works on my system, and probably on all Windows systems. The following is a small project file plus test programme that works at least on my machine.
The file system looks as follows:
TestCUDA \
TestCUDA.pro
main.cpp
vectorAddition.cu
The project file reads:
TARGET = TestCUDA
# Define output directories
DESTDIR = release
OBJECTS_DIR = release/obj
CUDA_OBJECTS_DIR = release/cuda
# Source files
SOURCES += main.cpp
# This makes the .cu files appear in your project
OTHER_FILES += vectorAddition.cu
# CUDA settings <-- may change depending on your system
CUDA_SOURCES += vectorAddition.cu
CUDA_SDK = "C:/ProgramData/NVIDIA Corporation/NVIDIA GPU Computing SDK 4.2/C" # Path to cuda SDK install
CUDA_DIR = "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v4.2" # Path to cuda toolkit install
SYSTEM_NAME = Win32 # Depending on your system either 'Win32', 'x64', or 'Win64'
SYSTEM_TYPE = 32 # '32' or '64', depending on your system
CUDA_ARCH = sm_11 # Type of CUDA architecture, for example 'compute_10', 'compute_11', 'sm_10'
NVCC_OPTIONS = --use_fast_math
# include paths
INCLUDEPATH += $$CUDA_DIR/include \
               $$CUDA_SDK/common/inc/ \
               $$CUDA_SDK/../shared/inc/
# library directories
QMAKE_LIBDIR += $$CUDA_DIR/lib/$$SYSTEM_NAME \
                $$CUDA_SDK/common/lib/$$SYSTEM_NAME \
                $$CUDA_SDK/../shared/lib/$$SYSTEM_NAME
# Add the necessary libraries
LIBS += -lcuda -lcudart
# The following library conflicts with something in Cuda
QMAKE_LFLAGS_RELEASE = /NODEFAULTLIB:msvcrt.lib
QMAKE_LFLAGS_DEBUG = /NODEFAULTLIB:msvcrtd.lib
# The following makes sure all path names (which often include spaces) are put between quotation marks
CUDA_INC = $$join(INCLUDEPATH,'" -I"','-I"','"')
# Configuration of the Cuda compiler
CONFIG(debug, debug|release) {
    # Debug mode
    cuda_d.input = CUDA_SOURCES
    cuda_d.output = $$CUDA_OBJECTS_DIR/${QMAKE_FILE_BASE}_cuda.o
    cuda_d.commands = $$CUDA_DIR/bin/nvcc.exe -D_DEBUG $$NVCC_OPTIONS $$CUDA_INC $$LIBS --machine $$SYSTEM_TYPE -arch=$$CUDA_ARCH -c -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
    cuda_d.dependency_type = TYPE_C
    QMAKE_EXTRA_COMPILERS += cuda_d
}
else {
    # Release mode
    cuda.input = CUDA_SOURCES
    cuda.output = $$CUDA_OBJECTS_DIR/${QMAKE_FILE_BASE}_cuda.o
    cuda.commands = $$CUDA_DIR/bin/nvcc.exe $$NVCC_OPTIONS $$CUDA_INC $$LIBS --machine $$SYSTEM_TYPE -arch=$$CUDA_ARCH -c -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
    cuda.dependency_type = TYPE_C
    QMAKE_EXTRA_COMPILERS += cuda
}
Note the QMAKE_LFLAGS_RELEASE = /NODEFAULTLIB:msvcrt.lib: it took me a long time to figure out, but this library seems to clash with something in CUDA, which produces strange linking warnings and errors. If someone has an explanation for this, and potentially a prettier way around it, I'd like to hear it. (A later answer below avoids the clash differently, by passing -Xcompiler /MDd or /MD to nvcc so that the CUDA objects use the same MSVC runtime as the rest of the build.)
Also, since Windows file paths often include spaces (and NVIDIA's SDK by default does too), the include paths have to be wrapped in quotation marks. The $$join call above does exactly that: it turns INCLUDEPATH into a single string of the form -I"path1" -I"path2" ... before it is handed to nvcc. Again, if someone knows a more elegant way of solving this problem, I'd be interested to know.
The main.cpp file looks like this:
#include <cuda.h>
#include <builtin_types.h>
#include <drvapi_error_string.h>
#include <QtCore/QCoreApplication>
#include <QDebug>
// Forward declare the function in the .cu file
void vectorAddition(const float* a, const float* b, float* c, int n);
void printArray(const float* a, const unsigned int n) {
    QString s = "(";
    unsigned int ii;
    for (ii = 0; ii < n - 1; ++ii)
        s.append(QString::number(a[ii])).append(", ");
    s.append(QString::number(a[ii])).append(")");
    qDebug() << s;
}
int main(int argc, char* argv [])
{
    QCoreApplication app(argc, argv);
    int deviceCount = 0;
    int cudaDevice = 0;
    char cudaDeviceName [100];
    unsigned int N = 50;
    float *a, *b, *c;
    cuInit(0);
    cuDeviceGetCount(&deviceCount);
    cuDeviceGet(&cudaDevice, 0);
    cuDeviceGetName(cudaDeviceName, 100, cudaDevice);
    qDebug() << "Number of devices: " << deviceCount;
    qDebug() << "Device name:" << cudaDeviceName;
    a = new float [N]; b = new float [N]; c = new float [N];
    for (unsigned int ii = 0; ii < N; ++ii) {
        a[ii] = qrand();
        b[ii] = qrand();
    }
    // This is the function call in which the kernel is called
    vectorAddition(a, b, c, N);
    qDebug() << "input a:"; printArray(a, N);
    qDebug() << "input b:"; printArray(b, N);
    qDebug() << "output c:"; printArray(c, N);
    // the arrays were allocated with new[], so release them with delete[]
    delete[] a;
    delete[] b;
    delete[] c;
}
The CUDA file vectorAddition.cu, which implements a simple vector addition, looks like this:
#include <cuda.h>
#include <builtin_types.h>
extern "C"
__global__ void vectorAdditionCUDA(const float* a, const float* b, float* c, int n)
{
int ii = blockDim.x * blockIdx.x + threadIdx.x;
if (ii < n)
c[ii] = a[ii] + b[ii];
}
void vectorAddition(const float* a, const float* b, float* c, int n) {
    float *a_cuda, *b_cuda, *c_cuda;
    unsigned int nBytes = sizeof(float) * n;
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    // allocate and copy memory into the device
    cudaMalloc((void **)& a_cuda, nBytes);
    cudaMalloc((void **)& b_cuda, nBytes);
    cudaMalloc((void **)& c_cuda, nBytes);
    cudaMemcpy(a_cuda, a, nBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_cuda, b, nBytes, cudaMemcpyHostToDevice);
    vectorAdditionCUDA<<<blocksPerGrid, threadsPerBlock>>>(a_cuda, b_cuda, c_cuda, n);
    // load the answer back into the host
    cudaMemcpy(c, c_cuda, nBytes, cudaMemcpyDeviceToHost);
    cudaFree(a_cuda);
    cudaFree(b_cuda);
    cudaFree(c_cuda);
}
If you get this to work, then more complicated examples are self-evident, I think.
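One thing the example above leaves out is error checking: CUDA calls fail silently unless you inspect their return codes. Below is a minimal checking helper that could be added to vectorAddition.cu; this is a sketch of mine using only the standard runtime API (cudaGetErrorString, cudaGetLastError), and the macro name is my own invention, not part of the original code.
#include <cstdio>
// Report any CUDA runtime error together with the offending source location.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess)                                     \
            fprintf(stderr, "CUDA error '%s' at %s:%d\n",           \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
    } while (0)
// Usage: CUDA_CHECK(cudaMalloc((void **)&a_cuda, nBytes));
// After a kernel launch, check with: CUDA_CHECK(cudaGetLastError());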
Edit (24-1-2013): I added the QMAKE_LFLAGS_DEBUG = /NODEFAULTLIB:msvcrtd.lib line and the CONFIG(debug) block with the extra -D_DEBUG flag, so that it also compiles in debug mode.
Using MSVC 2010 I found that its linker does not accept the -l parameter, while nvcc needs it. Therefore I made a simple change in the .pro file:
# Add the necessary libraries
CUDA_LIBS = cuda cudart
# The following makes sure all path names (which often include spaces) are put between quotation marks
CUDA_INC = $$join(INCLUDEPATH,'" -I"','-I"','"')
# LIBRARIES IN FORMAT NEEDED BY NVCC
NVCC_LIBS = $$join(CUDA_LIBS,' -l','-l', '')
# LIBRARIES IN FORMAT NEEDED BY VISUAL C++ LINKER
LIBS += $$join(CUDA_LIBS,'.lib ', '', '.lib')
And the nvcc command (release version):
cuda.commands = $$CUDA_DIR/bin/nvcc.exe $$NVCC_OPTIONS $$CUDA_INC $$NVCC_LIBS --machine $$SYSTEM_TYPE -arch=$$CUDA_ARCH -c -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
$$NVCC_LIBS was inserted instead of $$LIBS: nvcc still gets the -lcuda -lcudart form it understands, while LIBS now expands to cuda.lib cudart.lib, the form the Visual C++ linker expects.
The whole .pro file, which works for me:
QT += core
QT -= gui
TARGET = TestCUDA
CONFIG += console
CONFIG -= app_bundle
TEMPLATE = app
# Define output directories
DESTDIR = release
OBJECTS_DIR = release/obj
CUDA_OBJECTS_DIR = release/cuda
# Source files
SOURCES += main.cpp
# This makes the .cu files appear in your project
OTHER_FILES += vectorAddition.cu
# CUDA settings <-- may change depending on your system
CUDA_SOURCES += vectorAddition.cu
#CUDA_SDK = "C:/ProgramData/NVIDIA Corporation/NVIDIA GPU Computing SDK 4.2/C" # Path to cuda SDK install
CUDA_DIR = "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v5.0" # Path to cuda toolkit install
SYSTEM_NAME = win32 # Depending on your system either 'Win32', 'x64', or 'Win64'
SYSTEM_TYPE = 32 # '32' or '64', depending on your system
CUDA_ARCH = sm_11 # Type of CUDA architecture, for example 'compute_10', 'compute_11', 'sm_10'
NVCC_OPTIONS = --use_fast_math
# include paths
INCLUDEPATH += $$CUDA_DIR/include
#$$CUDA_SDK/common/inc/ \
#$$CUDA_SDK/../shared/inc/
# library directories
QMAKE_LIBDIR += $$CUDA_DIR/lib/$$SYSTEM_NAME
#$$CUDA_SDK/common/lib/$$SYSTEM_NAME \
#$$CUDA_SDK/../shared/lib/$$SYSTEM_NAME
# The following library conflicts with something in Cuda
QMAKE_LFLAGS_RELEASE = /NODEFAULTLIB:msvcrt.lib
QMAKE_LFLAGS_DEBUG = /NODEFAULTLIB:msvcrtd.lib
# Add the necessary libraries
CUDA_LIBS = cuda cudart
# The following makes sure all path names (which often include spaces) are put between quotation marks
CUDA_INC = $$join(INCLUDEPATH,'" -I"','-I"','"')
NVCC_LIBS = $$join(CUDA_LIBS,' -l','-l', '')
LIBS += $$join(CUDA_LIBS,'.lib ', '', '.lib')
# Configuration of the Cuda compiler
CONFIG(debug, debug|release) {
    # Debug mode
    cuda_d.input = CUDA_SOURCES
    cuda_d.output = $$CUDA_OBJECTS_DIR/${QMAKE_FILE_BASE}_cuda.o
    cuda_d.commands = $$CUDA_DIR/bin/nvcc.exe -D_DEBUG $$NVCC_OPTIONS $$CUDA_INC $$NVCC_LIBS --machine $$SYSTEM_TYPE -arch=$$CUDA_ARCH -c -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
    cuda_d.dependency_type = TYPE_C
    QMAKE_EXTRA_COMPILERS += cuda_d
}
else {
    # Release mode
    cuda.input = CUDA_SOURCES
    cuda.output = $$CUDA_OBJECTS_DIR/${QMAKE_FILE_BASE}_cuda.o
    cuda.commands = $$CUDA_DIR/bin/nvcc.exe $$NVCC_OPTIONS $$CUDA_INC $$NVCC_LIBS --machine $$SYSTEM_TYPE -arch=$$CUDA_ARCH -c -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
    cuda.dependency_type = TYPE_C
    QMAKE_EXTRA_COMPILERS += cuda
}
I also added some essential declarations, such as QT += core for the app to work, and removed the SDK part, which I did not find useful in this case.
I tried to get this combination to work, but could not make it work due to a number of dependencies in my project.
My final solution was to break the application into two separate parts on Windows:
1) a CUDA application developed in VC, running as a service/DLL in Windows;
2) a GUI interface developed in Qt, using the DLL for the CUDA-related tasks.
Hope it saves others some time. A sketch of such a DLL boundary follows.
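For anyone taking the same route: the DLL boundary typically amounts to a small C-linkage header shared between the VC CUDA project and the Qt GUI. A hedged sketch, with purely illustrative names (this is not the poster's actual code):
// cuda_tasks.h -- included by both the CUDA DLL (built in VC) and the Qt GUI.
// extern "C" keeps the Qt side independent of C++ name mangling.
#ifdef CUDA_TASKS_EXPORTS            // defined only when building the DLL
#  define CUDA_TASKS_API __declspec(dllexport)
#else
#  define CUDA_TASKS_API __declspec(dllimport)
#endif
extern "C" {
    CUDA_TASKS_API int  cudaTasksInit(void);
    CUDA_TASKS_API void cudaTasksVectorAdd(const float *a, const float *b,
                                           float *c, int n);
}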
After a long troubleshooting session, I managed to make my small test program work with Qt Creator (from what I've seen, several people have run into trouble with this).
I'm sharing a solution here (see my answer); feel free to comment or correct things that could be improved, especially if someone has a solution to the problems mentioned below.
Two problems I've encountered:
There is probably a way to make everything compile and link at once, but when I try this, I always get a strange error stating that main.cpp cannot be found, even though all paths are correct in the Makefile.
Also, I don't know exactly why, but using the -dlink or -dc option to enable relocatable code yields a _cudaRegisterLinkedBinary unresolved external symbol error. Can relocatable code work with separate compilation? Is it a problem because of compile-time allocation?
Here is a possible solution.
The basic idea was to code a matrix multiplication routine. For this purpose, I simply used a matMul.cu file containing a wrapper function which calls the CUDA kernel. This file was then compiled with nvcc using the following command:
nvcc.exe -lib -o lib_cuda/matMul.lib -c matMul.cu
Having the .lib file, I could add the library in Qt using the 'Add library' tool with static linking, which automatically adds the last 8 lines of the .pro file (the project file below links a debug variant of it, matMul_d.lib).
Here are the project files:
.pro file:
QT -= gui
CONFIG += c++11 console
CONFIG -= app_bundle
SOURCES += main.cpp
OTHER_FILES += matMul.cu
# The following library conflicts with something in Cuda
QMAKE_LFLAGS_RELEASE = /NODEFAULTLIB:msvcrt.lib
QMAKE_LFLAGS_DEBUG = /NODEFAULTLIB:msvcrtd.lib
QMAKE_LFLAGS_DEBUG += /NODEFAULTLIB:libcmt.lib  # += so this does not overwrite the line above
# Used to avoid conflicting flags between CUDA and MSVC files, should make everything static
QMAKE_CFLAGS_DEBUG += /MTd
QMAKE_CFLAGS_RELEASE += /MT
QMAKE_CXXFLAGS_DEBUG += /MTd
QMAKE_CXXFLAGS_RELEASE += /MT
# CUDA settings <-- may change depending on your system
CUDA_DIR = "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v9.1" # Path to cuda toolkit install
SYSTEM_NAME = x64
# include paths
CUDA_INC += $$CUDA_DIR/include
INCLUDEPATH += $$CUDA_INC
# library directories
QMAKE_LIBDIR += $$CUDA_DIR/lib/$$SYSTEM_NAME
# Add the necessary CUDA libraries
LIBS += -lcuda -lcudart
# Add project related libraries containing kernels
win32: LIBS += -L$$PWD/lib_cuda/ -lmatMul_d
INCLUDEPATH += $$PWD/lib_cuda
DEPENDPATH += $$PWD/lib_cuda
win32:!win32-g++: PRE_TARGETDEPS += $$PWD/lib_cuda/matMul_d.lib
else:win32-g++: PRE_TARGETDEPS += $$PWD/lib_cuda/libmatMul_d.a
main.cpp:
#include <cmath>
#include <chrono>
#include <iostream>
#include <cuda.h>
#include <cuda_runtime.h>
typedef struct
{
    int width;
    int height;
    int stride;
    float *elements;
} Matrix;
void matMul_wrapper(Matrix &C, const Matrix &A, const Matrix &B, cudaDeviceProp devProp);
int main()
{
    int devCount;
    cudaGetDeviceCount(&devCount);
    cudaDeviceProp devProp;
    for(int i=0; i < devCount; ++i)
    {
        cudaGetDeviceProperties(&devProp, i);
        std::cout << "\nDevice: " << devProp.name << "\n";
        std::cout << " Compute capability: " << devProp.major << "\n";
        std::cout << " Max threads per block: " << devProp.maxThreadsPerBlock << "\n";
        std::cout << " Warp size: " << devProp.warpSize << "\n\n";
    }
    Matrix A {1000, 1000, 1, new float[1000*1000]};
    Matrix B {1000, 1000, 1, new float[1000*1000]};
    Matrix C {B.width, A.height, 1, new float[1000*1000]};
    for(int row=0; row < A.height; ++row)
    {
        for(int col=0; col < A.width; ++col)
            A.elements[row*A.width + col] = (float)(row*A.width + col) / (float)100000;
    }
    for(int row=0; row < B.height; ++row)
    {
        for(int col=0; col < B.width; ++col)
            B.elements[row*B.width + col] = (float)(row*B.width + col) / (float)100000;
    }
    std::cout << A.elements[20000] << '\n';
    matMul_wrapper(C, A, B, devProp);
    std::cout << A.elements[20000] << '\n';
    delete[] A.elements;
    delete[] B.elements;
    delete[] C.elements;
    return 0;
}
matMul.cu:
#include <cuda.h>
#include <cuda_runtime.h>
#define BLOCK_SIZE 16
typedef struct
{
    int width;
    int height;
    int stride;
    float *elements;
} Matrix;
__global__
void matMulKernel(Matrix C, const Matrix A, const Matrix B)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int idx = row*C.width + col;
    float out = 0;
    // the write must stay inside the bounds check, otherwise threads in
    // edge blocks write past the end of C
    if(row < C.height && col < C.width)
    {
        for(int j=0; j < A.width; ++j)
            out += A.elements[row*A.width + j] * B.elements[j*B.width + col];
        C.elements[idx] = out;
    }
}
void matMul_wrapper(Matrix &C, const Matrix &A, const Matrix &B, cudaDeviceProp devProp)
{
    dim3 block(BLOCK_SIZE, BLOCK_SIZE, 1);
    dim3 grid( (C.width + block.x - 1) / block.x,
               (C.height + block.y - 1) / block.y,
               1);
    Matrix d_A {A.width, A.height, A.stride};
    size_t size = A.height * A.width * sizeof(float);
    cudaMallocManaged(&d_A.elements, size);
    cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);
    Matrix d_B {B.width, B.height, B.stride};
    size = B.height * B.width * sizeof(float);
    cudaMallocManaged(&d_B.elements, size);
    cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);
    Matrix d_C {C.width, C.height, C.stride};
    size = C.height * C.width * sizeof(float);
    cudaMallocManaged(&d_C.elements, size);
    cudaMemcpy(d_C.elements, C.elements, size, cudaMemcpyHostToDevice);
    matMulKernel<<<grid, block>>>(d_C, d_A, d_B);
    cudaDeviceSynchronize();
    cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A.elements);
    cudaFree(d_B.elements);
    cudaFree(d_C.elements);
}
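As a quick sanity check (my addition, not part of the original answer), one element of the product can be recomputed on the host and compared with the GPU result. The helper below assumes the Matrix struct from main.cpp above; calling, say, spotCheck(C, A, B, 12, 34) right after matMul_wrapper should print two nearly identical numbers.
#include <cmath>
#include <iostream>
// Recompute C[row][col] on the CPU and print it next to the GPU value.
void spotCheck(const Matrix &C, const Matrix &A, const Matrix &B, int row, int col)
{
    float ref = 0.0f;
    for (int j = 0; j < A.width; ++j)
        ref += A.elements[row*A.width + j] * B.elements[j*B.width + col];
    float gpu = C.elements[row*C.width + col];
    std::cout << "CPU: " << ref << "  GPU: " << gpu
              << "  |diff|: " << std::fabs(ref - gpu) << '\n';
}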
Hope this will help.
I found a solution that works on Windows 10 with MSVC 2017.
It is essential to give the methods you call from C++ C linkage.
In my .cpp file, I use this declaration:
extern "C" void testCuda(uint8_t* output, uint8_t* input, int blockCount, int threadCount);
And my .cu file looks like this:
// System includes
#include <stdlib.h>
#include <stdio.h>
// CUDA runtime
#include <cuda_runtime.h>
// helper functions and utilities to work with CUDA
#include "include/helper_cuda.h"
#include "include/helper_functions.h"
__global__ void testInside(uint8_t* output, uint8_t* input)
{
    output[threadIdx.x + blockIdx.x * blockDim.x] = 256 - input[threadIdx.x + blockIdx.x * blockDim.x];
}
////////////////////////////////////////////////////////////////////////////////
//! Entry point for CUDA functionality on host side
//! @param argc  command line argument count
//! @param argv  command line arguments
//! @param data  data to process on the device
//! @param len   len of \a data
////////////////////////////////////////////////////////////////////////////////
extern "C" bool runTest(const int argc, const char** argv, char* data, int2* data_int2, unsigned int len)
{
    // Find the best CUDA device
    //findCudaDevice(argc, static_cast<const char**>(argv));
    return false;
}
extern "C" void testCuda(uint8_t* output, uint8_t* input, int grid, int threads)
{
    testInside<<<grid, threads>>>(output, input);
}
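For context, here is a minimal sketch of how that wrapper might be driven from the C++ side. The buffer names and the block/thread split are illustrative (not from the original answer), and n is assumed to be a multiple of the thread count.
#include <cstdint>
#include <cuda_runtime.h>
extern "C" void testCuda(uint8_t* output, uint8_t* input, int blockCount, int threadCount);
void invertBuffer(const uint8_t* hostIn, uint8_t* hostOut, int n)
{
    uint8_t *devIn = 0, *devOut = 0;
    cudaMalloc((void**)&devIn, n);
    cudaMalloc((void**)&devOut, n);
    cudaMemcpy(devIn, hostIn, n, cudaMemcpyHostToDevice);
    const int threads = 256;                 // one element per thread
    testCuda(devOut, devIn, n / threads, threads);
    cudaMemcpy(hostOut, devOut, n, cudaMemcpyDeviceToHost);
    cudaFree(devIn);
    cudaFree(devOut);
}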
In my .pro file, I use this code snippet:
CUDA_OBJECTS_DIR = $$OBJECTS_DIR/../cuda
# C++ flags
QMAKE_CXXFLAGS_RELEASE =-O3
# MSVCRT link option (static or dynamic, it must be the same with your Qt SDK link option)
MSVCRT_LINK_FLAG_DEBUG = "/MDd"
MSVCRT_LINK_FLAG_RELEASE = "/MD"
# CUDA settings
CUDA_DIR = $$(CUDA_PATH) # Path to cuda toolkit install
SYSTEM_NAME = x64 # Depending on your system either 'Win32', 'x64', or 'Win64'
SYSTEM_TYPE = 64 # '32' or '64', depending on your system
CUDA_ARCH = sm_50 # Type of CUDA architecture
NVCC_OPTIONS = --use_fast_math
# include paths
INCLUDEPATH += $$CUDA_DIR/include \
               $$CUDA_DIR/common/inc \
               $$CUDA_DIR/../shared/inc
# library directories
QMAKE_LIBDIR += $$CUDA_DIR/lib/$$SYSTEM_NAME \
                $$CUDA_DIR/common/lib/$$SYSTEM_NAME \
                $$CUDA_DIR/../shared/lib/$$SYSTEM_NAME
# The following makes sure all path names (which often include spaces) are put between quotation marks
CUDA_INC = $$join(INCLUDEPATH,'" -I"','-I"','"')
# Add the necessary libraries
CUDA_LIB_NAMES = cudart_static kernel32 user32 gdi32 winspool comdlg32 \
                 advapi32 shell32 ole32 oleaut32 uuid odbc32 odbccp32
# add freeglut glew32 to the list above if you need them
for(lib, CUDA_LIB_NAMES) {
    CUDA_LIBS += -l$$lib
}
LIBS += $$CUDA_LIBS
# Configuration of the Cuda compiler
CONFIG(debug, debug|release) {
    # Debug mode
    cuda_d.input = CUDA_SOURCES
    cuda_d.output = $$CUDA_OBJECTS_DIR/${QMAKE_FILE_BASE}_cuda.obj
    cuda_d.commands = $$CUDA_DIR/bin/nvcc.exe -D_DEBUG $$NVCC_OPTIONS $$CUDA_INC $$LIBS \
                      --machine $$SYSTEM_TYPE -arch=$$CUDA_ARCH \
                      --compile -cudart static -g -DWIN32 -D_MBCS \
                      -Xcompiler "/wd4819,/EHsc,/W3,/nologo,/Od,/Zi,/RTC1" \
                      -Xcompiler $$MSVCRT_LINK_FLAG_DEBUG \
                      -c -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
    cuda_d.dependency_type = TYPE_C
    QMAKE_EXTRA_COMPILERS += cuda_d
}
else {
    # Release mode
    cuda.input = CUDA_SOURCES
    cuda.output = $$CUDA_OBJECTS_DIR/${QMAKE_FILE_BASE}_cuda.obj
    cuda.commands = $$CUDA_DIR/bin/nvcc.exe $$NVCC_OPTIONS $$CUDA_INC $$LIBS \
                    --machine $$SYSTEM_TYPE -arch=$$CUDA_ARCH \
                    --compile -cudart static -DWIN32 -D_MBCS \
                    -Xcompiler "/wd4819,/EHsc,/W3,/nologo,/O2,/Zi" \
                    -Xcompiler $$MSVCRT_LINK_FLAG_RELEASE \
                    -c -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
    cuda.dependency_type = TYPE_C
    QMAKE_EXTRA_COMPILERS += cuda
}
I hope it helps someone.
I have a problem with debugging a simple C++ program (it calls some CUDA functions like cuInit(), cuDeviceGetCount(), ...). When I put a breakpoint into the C++ code and start debugging, I get this message:
This does not seem to be a "Debug" build.
When I remove all CUDA calls and do not link the program against cuda.lib and cudart.lib, the code is debuggable (it is possible to stop the program at the breakpoint and no error message is displayed).
Here is my CPP code:
#include <QtCore/QCoreApplication>
#include <QDebug>
#include <cuda.h>
#include <builtin_types.h>
int main(int argc, char* argv [])
{
    QCoreApplication app(argc, argv);
    int deviceCount = 0;
    int cudaDevice = 0;
    char cudaDeviceName [100];
    cuInit(0);
    cuDeviceGetCount(&deviceCount);
    cuDeviceGet(&cudaDevice, 0);
    cuDeviceGetName(cudaDeviceName, 100, cudaDevice);
    qDebug() << "Number of devices: " << deviceCount;
    qDebug() << "Device name:" << cudaDeviceName;
}
Here is my .pro file:
QT += core
QT -= gui
TARGET = cudatest
CONFIG += console
CONFIG -= app_bundle
TEMPLATE = app
SOURCES += main.cpp
#################################
# Begin CUDA configuration
win32 {
    CUDA_PATH = "C:/Program Files/NVIDIA GPU Computing Toolkit/CUDA/v6.5"
    CUDA_INC_DIR = $$CUDA_PATH/include
    contains(QMAKE_TARGET.arch, x86_64) {
        SYSTEMNAME = x64
        SYSTEMTYPE = 64
    } else {
        SYSTEMNAME = Win32
        SYSTEMTYPE = 32
    }
    CUDA_LIB_DIR = $$CUDA_PATH/lib/$$SYSTEMNAME
    QMAKE_CXXFLAGS_RELEASE -= -MD
    QMAKE_CXXFLAGS_RELEASE += -MT
    QMAKE_CXXFLAGS_DEBUG -= -MDd
    QMAKE_CXXFLAGS_DEBUG += -MTd
}
INCLUDEPATH += $$CUDA_INC_DIR
LIBS += -L$$CUDA_LIB_DIR -lcuda -lcudart
#End CUDA configuration
########################
Environment:
Qt Creator 3.2.2
CUDA v6.5
CPP Compiler: VC++ 2013 Express
Debugger: C:\Program Files (x86)\Windows Kits\8.1\Debuggers\x86\cdb.exe
Qt 5.3.2 (compiled by VC++ 2013, 32bit)
I tried the same with VC++ 2010 Professional, with the same result.
Can anyone give me a suggestion as to where the problem could be?
Thank you.
It's probably because you are using Visual Studio Express 2013. It says here that there is no compiler support for VS 2013 Express in CUDA v6.5 (under Table 2, Windows Compiler Support, in CUDA 6.5). You need to install the complete version of Visual Studio.
I am running Qt Creator on Mac... I want to start working with the Boost libraries... So I installed Boost using
brew install boost
After that I created a small Boost hello-world program and made the changes in the .pro file as follows:
TEMPLATE = app
CONFIG += console
CONFIG -= app_bundle
CONFIG -= qt
unix:INCLUDEPATH += "/usr/local/Cellar/boost/1.55.0_1/include/"
unix:LIBPATH += "-L/usr/local/Cellar/boost/1.55.0_1/lib/"
SOURCES += main.cpp
LIBS += \
    -lboost_date_time \
    -lboost_filesystem \
    -lboost_program_options \
    -lboost_regex \
    -lboost_signals \
    -lboost_system
I am still unable to build. What could be the reason? Please suggest what my mistake might be.
The errors are:
library not found for -lboost_data_time
linker command failed with exit code 1 (use -v to see invocation)
This takes a bit from Uflex's answer, as he missed something.
So keep the same code:
//make sure that there is a boost folder in your boost include directory
#include <boost/chrono.hpp>
#include <cmath>
#include <iostream> // needed for std::cout
int main()
{
    auto start = boost::chrono::system_clock::now();
    for ( long i = 0; i < 10000000; ++i )
        std::sqrt( 123.456L ); // burn some time
    auto sec = boost::chrono::system_clock::now() - start;
    std::cout << "took " << sec.count() << " seconds" << std::endl;
    return 0;
}
But let's change his .pro a bit:
TEMPLATE = app
CONFIG += console
CONFIG -= app_bundle
CONFIG -= qt
SOURCES += main.cpp
macx {
    QMAKE_CXXFLAGS += -std=c++11
    _BOOST_PATH = /usr/local/Cellar/boost/1.55.0_1
    INCLUDEPATH += "$${_BOOST_PATH}/include/"
    LIBS += -L$${_BOOST_PATH}/lib
    ## Use only one of these:
    LIBS += -lboost_chrono-mt -lboost_system # using dynamic lib (not sure if you need that "-mt" at the end or not)
    #LIBS += $${_BOOST_PATH}/lib/libboost_chrono-mt.a # using static lib
}
The only thing I added to this was Boost.System (-lboost_system).
That should solve the issue with his original version causing the undefined symbols, and allow you to add your other libraries, such as -lboost_date_time, which for me worked perfectly with the brew install.
Granted, my path is actually: /usr/local/Cellar/boost/1.55.0_2
Boost libraries are modularized; you just need to link against the libraries that you are using. Some libraries are header-only, so you don't need to link anything: having Boost reachable in your include path is enough.
You can try to compile this:
//make sure that there is a boost folder in your boost include directory
#include <boost/chrono.hpp>
#include <cmath>
#include <iostream> // needed for std::cout
int main()
{
    auto start = boost::chrono::system_clock::now();
    for ( long i = 0; i < 10000000; ++i )
        std::sqrt( 123.456L ); // burn some time
    auto sec = boost::chrono::system_clock::now() - start;
    std::cout << "took " << sec.count() << " seconds" << std::endl;
    return 0;
}
And in the .pro file:
TEMPLATE = app
CONFIG += console
CONFIG -= app_bundle
CONFIG -= qt
SOURCES += main.cpp
macx {
    QMAKE_CXXFLAGS += -std=c++11
    _BOOST_PATH = /usr/local/Cellar/boost/1.55.0_1
    INCLUDEPATH += "$${_BOOST_PATH}/include/"
    LIBS += -L$${_BOOST_PATH}/lib
    ## Use only one of these:
    LIBS += -lboost_chrono-mt # using dynamic lib (not sure if you need that "-mt" at the end or not)
    #LIBS += $${_BOOST_PATH}/lib/libboost_chrono-mt.a # using static lib
}
I have encountered this strange issue while debugging.
In my code, I can initialize a host array srcArr_h[totArrElm] in two ways:
1)
for(int ic=0; ic<totArrElm; ic++)
{
    srcArr_h[ic] = (float)(rand() % 256);
}
or
2) (half of the array elements will be set to zero at runtime)
for(int ic=0; ic<totArrElm; ic++)
{
    int randV = (rand() % 256);
    srcArr_h[ic] = randV%2;
}
If I use these arrays as input to a kernel function, I get drastically different timings. In particular, if totArrElm = ARRDIM*ARRDIM with ARRDIM = 8192, I get
Timing 1) 64599.3 ms
Timing 2) 9764.1 ms
What's the trick? Of course I did verify that the host initialization itself does not account for the big time difference. It seems very strange to me, but could it be due to optimization at runtime?
Here is my code:
#include <string>
#include <stdint.h>
#include <iostream>
#include <stdio.h>
using namespace std;
#define ARRDIM 8192
__global__ void gpuKernel
(
    float *sa, float *aux,
    size_t memPitchAux, int w,
    float *c_glob
)
{
    float c_loc[256];
    float sc_loc[256];
    float g0=0.0f;
    int tidx = blockIdx.x * blockDim.x + threadIdx.x; // x-coordinate of pixel = column in device memory
    int tidy = blockIdx.y * blockDim.y + threadIdx.y; // y-coordinate of pixel = row in device memory
    int idx = tidy * memPitchAux/4 + tidx;
    for(int ic=0; ic<256; ic++)
    {
        c_loc[ic] = 0.0f;
    }
    for(int ic=0; ic<255; ic++)
    {
        sc_loc[ic] = 0.0f;
    }
    for(int is=0; is<255; is++)
    {
        int ic = fabs(sa[tidy*w +tidx]);
        c_loc[ic] += 1.0f;
    }
    for(int ic=0; ic<255; ic++)
    {
        g0 += c_loc[ic];
    }
    aux[idx] = g0;
}
int main(int argc, char* argv[])
{
    float time, loop_time;
    cudaEvent_t start, stop;
    cudaEvent_t start_loop, stop_loop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0) ;
    /*
     * array src host and device
     */
    int heightSrc = ARRDIM;
    int widthSrc = ARRDIM;
    cudaSetDevice(0);
    float *srcArr_h, *srcArr_d;
    size_t nBytesSrcArr = sizeof(float)*heightSrc * widthSrc;
    srcArr_h = (float *)malloc(nBytesSrcArr); // Allocate array on host
    cudaMalloc((void **) &srcArr_d, nBytesSrcArr); // Allocate array on device
    cudaMemset((void*)srcArr_d,0,nBytesSrcArr); // set to zero
    int totArrElm = heightSrc*widthSrc;
    cudaEventCreate(&start_loop);
    cudaEventCreate(&stop_loop);
    cudaEventRecord(start_loop, 0) ;
    for(int ic=0; ic<totArrElm; ic++)
    {
        srcArr_h[ic] = (float)(rand() % 256); // case 1)
        // int randV = (rand() % 256); // case 2)
        // srcArr_h[ic] = randV%2;
    }
    cudaEventRecord(stop_loop, 0);
    cudaEventSynchronize(stop_loop);
    cudaEventElapsedTime(&loop_time, start_loop, stop_loop);
    printf("Timing LOOP: %3.1f ms\n", loop_time);
    cudaMemcpy( srcArr_d, srcArr_h,nBytesSrcArr,cudaMemcpyHostToDevice);
    /*
     * auxiliary buffer auxD to save final results
     */
    float *auxD;
    size_t auxDPitch;
    cudaMallocPitch((void**)&auxD,&auxDPitch,widthSrc*sizeof(float),heightSrc);
    cudaMemset2D(auxD, auxDPitch, 0, widthSrc*sizeof(float), heightSrc);
    /*
     * auxiliary buffer auxH allocation + initialization on host
     */
    size_t auxHPitch;
    auxHPitch = widthSrc*sizeof(float);
    float *auxH = (float *) malloc(heightSrc*auxHPitch);
    /*
     * kernel launch specs
     */
    int thpb_x = 16;
    int thpb_y = 16;
    int blpg_x = (int) widthSrc/thpb_x + 1;
    int blpg_y = (int) heightSrc/thpb_y +1;
    int num_threads = blpg_x * thpb_x + blpg_y * thpb_y;
    /* c_glob array */
    int cglob_w = 256;
    int cglob_h = num_threads;
    float *c_glob_d;
    size_t c_globDPitch;
    cudaMallocPitch((void**)&c_glob_d,&c_globDPitch,cglob_w*sizeof(float),cglob_h);
    cudaMemset2D(c_glob_d, c_globDPitch, 0, cglob_w*sizeof(float), cglob_h);
    /*
     * kernel launch
     */
    dim3 dimBlock(thpb_x,thpb_y, 1);
    dim3 dimGrid(blpg_x,blpg_y,1);
    gpuKernel<<<dimGrid,dimBlock>>>(srcArr_d,auxD, auxDPitch, widthSrc, c_glob_d);
    cudaThreadSynchronize();
    cudaMemcpy2D(auxH,auxHPitch, // to CPU (host)
                 auxD,auxDPitch, // from GPU (device)
                 auxHPitch, heightSrc, // size of data (image)
                 cudaMemcpyDeviceToHost);
    cudaThreadSynchronize();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&time, start, stop);
    printf("Timing: %3.1f ms\n", time);
    cudaFree(srcArr_d);
    cudaFree(auxD);
    cudaFree(c_glob_d);
}
My Makefile:
# OS Name (Linux or Darwin)
OSUPPER = $(shell uname -s 2>/dev/null | tr [:lower:] [:upper:])
OSLOWER = $(shell uname -s 2>/dev/null | tr [:upper:] [:lower:])
# Flags to detect 32-bit or 64-bit OS platform
OS_SIZE = $(shell uname -m | sed -e "s/i.86/32/" -e "s/x86_64/64/")
OS_ARCH = $(shell uname -m | sed -e "s/i386/i686/")
# These flags will override any settings
ifeq ($(i386),1)
OS_SIZE = 32
OS_ARCH = i686
endif
ifeq ($(x86_64),1)
OS_SIZE = 64
OS_ARCH = x86_64
endif
# Flags to detect either a Linux system (linux) or Mac OSX (darwin)
DARWIN = $(strip $(findstring DARWIN, $(OSUPPER)))
# Location of the CUDA Toolkit binaries and libraries
CUDA_PATH ?= /usr/local/cuda-5.0
CUDA_INC_PATH ?= $(CUDA_PATH)/include
CUDA_BIN_PATH ?= $(CUDA_PATH)/bin
ifneq ($(DARWIN),)
CUDA_LIB_PATH ?= $(CUDA_PATH)/lib
else
ifeq ($(OS_SIZE),32)
CUDA_LIB_PATH ?= $(CUDA_PATH)/lib
else
CUDA_LIB_PATH ?= $(CUDA_PATH)/lib64
endif
endif
# Common binaries
NVCC ?= $(CUDA_BIN_PATH)/nvcc
GCC ?= g++
# Extra user flags
EXTRA_NVCCFLAGS ?=
EXTRA_LDFLAGS ?=
EXTRA_CCFLAGS ?=
# CUDA code generation flags
# GENCODE_SM10 := -gencode arch=compute_10,code=sm_10
# GENCODE_SM20 := -gencode arch=compute_20,code=sm_20
# GENCODE_SM30 := -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35
GENCODE_SM10 := -gencode arch=compute_10,code=sm_10
GENCODE_SM20 := -gencode arch=compute_20,code=sm_20
GENCODE_SM30 := -gencode arch=compute_30,code=sm_30 -gencode arch=compute_35,code=sm_35
#GENCODE_FLAGS := $(GENCODE_SM20) $(GENCODE_SM10)
GENCODE_FLAGS := $(GENCODE_SM10) $(GENCODE_SM20) $(GENCODE_SM30)
# OS-specific build flags
ifneq ($(DARWIN),)
LDFLAGS := -Xlinker -rpath $(CUDA_LIB_PATH) -L$(CUDA_LIB_PATH) -lcudart
CCFLAGS := -arch $(OS_ARCH)
else
ifeq ($(OS_SIZE),32)
LDFLAGS := -L$(CUDA_LIB_PATH) -lcudart
CCFLAGS := -m32
else
LDFLAGS := -L$(CUDA_LIB_PATH) -lcudart
CCFLAGS := -m64
endif
endif
# OS-architecture specific flags
ifeq ($(OS_SIZE),32)
NVCCFLAGS := -m32
else
NVCCFLAGS := -m64
endif
# OpenGL specific libraries
ifneq ($(DARWIN),)
# Mac OSX specific libraries and paths to include
LIBPATH_OPENGL := -L../../common/lib/darwin -L/System/Library/Frameworks/OpenGL.framework/Libraries -framework GLUT -lGL -lGLU ../../common/lib/darwin/libGLEW.a
else
# Linux specific libraries and paths to include
LIBPATH_OPENGL := -L../../common/lib/linux/$(OS_ARCH) -L/usr/X11R6/lib -lGL -lGLU -lX11 -lXi -lXmu -lglut -lGLEW -lrt
endif
# Debug build flags
ifeq ($(dbg),1)
CCFLAGS += -g
NVCCFLAGS += -g -G
TARGET := debug
else
TARGET := release
endif
# Common includes and paths for CUDA
INCLUDES := -I$(CUDA_INC_PATH) -I. -I.. -I../../common/inc
LDFLAGS += $(LIBPATH_OPENGL)
# Target rules
all: build
build: stackOverflow
stackOverflow.o: stackOverflow.cu
	$(NVCC) $(NVCCFLAGS) $(EXTRA_NVCCFLAGS) $(GENCODE_FLAGS) $(INCLUDES) -o $@ -c $<
stackOverflow: stackOverflow.o
	$(GCC) $(CCFLAGS) -o $@ $+ $(LDFLAGS) $(EXTRA_LDFLAGS)
	mkdir -p ./bin/$(OSLOWER)/$(TARGET)
	cp $@ ./bin/$(OSLOWER)/$(TARGET)
run: build
	./stackOverflow
clean:
	rm -f stackOverflow.o stackOverflow *.pgm
CUDA 5.0 on a Tesla C1060, Ubuntu 12.04.
The Tesla C1060 GPU has compute capability 1.3, which means that every thread can use at most 128 32-bit registers. That is obviously not enough to fit all your local variables (two arrays of floats, 256 elements each, plus some more variables). Since the access to local memory in the following line
c_loc[ic] += 1.0f;
is highly spread over the whole range 0...255 in case (1), you probably observe register spilling, which means that your data is placed in local memory. Local memory is, in fact, located in global memory and therefore has the same throughput. The access can be cached, but due to the randomness in your algorithm, I bet caching is not very efficient. (EDIT: for compute capability 1.3 it is not even cached; it's just non-coalesced memory access.) A good presentation about local memory in CUDA and register spilling can be found here. There you can also find guidance on how to detect and solve the register spilling problem.
Consider reducing the amount of local data used by each thread, or using shared memory, which is located on the chip and hence much faster; a sketch follows.
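To make that last suggestion concrete, here is a sketch of mine (not the original poster's code, and not tested on their setup) of the counting loop with one shared 256-bin histogram per block instead of a 256-float array per thread. Note that the semantics change from per-thread to per-block counts, and shared-memory atomicAdd on int needs compute capability 1.2 or higher. Register and local-memory usage, including spill loads and stores, can also be inspected by compiling with nvcc -Xptxas -v.
__global__ void gpuKernelShared(const float *sa, int w, int h, int *c_out)
{
    __shared__ int c_blk[256];                  // on-chip, shared by the whole block

    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    for (int i = tid; i < 256; i += blockDim.x * blockDim.y)
        c_blk[i] = 0;                           // cooperative zero-initialization
    __syncthreads();

    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    if (tidx < w && tidy < h)
    {
        int ic = (int)fabsf(sa[tidy * w + tidx]);
        atomicAdd(&c_blk[ic], 1);               // shared-memory atomic, CC >= 1.2
    }
    __syncthreads();

    // write the per-block histogram out (one int per bin per block)
    int blockId = blockIdx.y * gridDim.x + blockIdx.x;
    for (int i = tid; i < 256; i += blockDim.x * blockDim.y)
        c_out[blockId * 256 + i] = c_blk[i];
}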
I have the following configuration in my .pro file:
INCLUDEPATH += /home/vickey/ossbuild-read-only/Shared/Build/Linux/x86/include/glib-2.0/
CONFIG += link_pkgconfig
PKGCONFIG += gstreamer-0.10
LIBS += -L/usr/lib `pkg-config --cflags --libs gstreamer-0.10`
LIBS += -L. -L/usr/lib -lphonon -lcurl -ltag -fopenmp -lsayonara_gstreamer
When I try to build the project, I get the following error:
/home/vickey/src/player/../../../../ossbuild-read-only/Shared/Build/Linux/x86/include/glib-2.0/glib/gthread.h:-1: In function 'gboolean g_once_init_enter(volatile gsize*)':
/home/vickey/src/player/../../../../ossbuild-read-only/Shared/Build/Linux/x86/include/glib-2.0/glib/gthread.h:348: error: size of array is negative
Double-clicking on the error takes me to the gthread.h file, pointing at the lines below:
g_once_init_enter (volatile gsize *value_location)
{
  if G_LIKELY ((gpointer) g_atomic_pointer_get (value_location) != NULL)
    return FALSE;
  else
    return g_once_init_enter_impl (value_location);
}
What seems to be the problem?
I had the same error compiling an ancient glib and pango for a 64-bit platform.
Here is how the g_atomic_pointer_get source looks in that version:
# define g_atomic_pointer_get(atomic) \
((void) sizeof (gchar [sizeof (*(atomic)) == sizeof (gpointer) ? 1 : -1]), \
(g_atomic_pointer_get) ((volatile gpointer G_GNUC_MAY_ALIAS *) (volatile void *) (atomic)))
So, here atomic is a gsize, which must have the same sizeof as gpointer, i.e. void*.
It helped me to redefine gsize and gssize to be 8 bytes for the 64-bit architecture in glibconfig.h.
Also update GLIB_SIZEOF_VOID_P, GLIB_SIZEOF_LONG and GLIB_SIZEOF_SIZE_T.
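A standalone illustration of the trick behind that error message (no glib required; the macro name is mine): an array type of negative size is ill-formed, so the build fails exactly when the two sizes differ, which is what happens when gsize is left 4 bytes wide on a 64-bit target.
// sizeof(x) == sizeof(void*) selects array size 1, otherwise -1.
#define ASSERT_POINTER_SIZED(x) \
    ((void) sizeof (char [sizeof (x) == sizeof (void *) ? 1 : -1]))

int main()
{
    void *p = 0;
    ASSERT_POINTER_SIZED(p);     // always compiles
    // short s = 0;
    // ASSERT_POINTER_SIZED(s);  // "error: size of array is negative"
    return 0;
}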