This question is similar to cuModuleLoadDataEx options, but I would like to bring the topic up again and provide additional information.
When loading a PTX string with the NV driver via cuModuleLoadDataEx, it seems to ignore all options altogether. I provide full working examples so that anyone interested can reproduce this directly and with no effort. First a small PTX kernel (save it as small.ptx), then the C++ program that loads the PTX kernel.
.version 3.1
.target sm_20, texmode_independent
.address_size 64
.entry main()
{
ret;
}
main.cc
#include<cstdlib>
#include<iostream>
#include<fstream>
#include<sstream>
#include<string>
#include<map>
#include "cuda.h"
int main(int argc,char *argv[])
{
CUdevice cuDevice;
CUcontext cuContext;
CUfunction func;
CUresult ret;
CUmodule cuModule;
cuInit(0);
std::cout << "trying to get device 0\n";
ret = cuDeviceGet(&cuDevice, 0);
if (ret != CUDA_SUCCESS) { exit(1);}
std::cout << "trying to create a context\n";
ret = cuCtxCreate(&cuContext, 0, cuDevice);
if (ret != CUDA_SUCCESS) { exit(1);}
std::cout << "loading PTX string from file " << argv[1] << "\n";
std::ifstream ptxfile( argv[1] );
std::stringstream buffer;
buffer << ptxfile.rdbuf();
ptxfile.close();
std::string ptx_kernel = buffer.str();
std::cout << "Loading PTX kernel with driver\n" << ptx_kernel;
const unsigned int jitNumOptions = 3;
CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
void **jitOptVals = new void*[jitNumOptions];
// set up size of compilation log buffer
jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
int jitLogBufferSize = 1024*1024;
jitOptVals[0] = (void *)&jitLogBufferSize;
// set up pointer to the compilation log buffer
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
char *jitLogBuffer = new char[jitLogBufferSize];
jitOptVals[1] = jitLogBuffer;
// set up wall clock time
jitOptions[2] = CU_JIT_WALL_TIME;
float jitTime = -2.0;
jitOptVals[2] = &jitTime;
ret = cuModuleLoadDataEx( &cuModule , ptx_kernel.c_str() , jitNumOptions, jitOptions, (void **)jitOptVals );
if (ret != CUDA_SUCCESS) { exit(1);}
std::cout << "walltime: " << jitTime << "\n";
std::cout << std::string(jitLogBuffer) << "\n";
}
Build (assuming CUDA is installed under /usr/local/cuda, I use CUDA 5.0):
g++ -I/usr/local/cuda/include -L/usr/local/cuda/lib64/ main.cc -o main -lcuda
If someone is able to extract any sensible information from the compilation process, that would be great! The CUDA driver API documentation, which explains cuModuleLoadDataEx and the options it is supposed to accept, is at http://docs.nvidia.com/cuda/cuda-driver-api/index.html
If I run this, the log is empty and jitTime isn't even touched by the NV driver:
./main small.ptx
trying to get device 0
trying to create a context
loading PTX string from file small.ptx
Loading PTX kernel with driver
.version 3.1
.target sm_20, texmode_independent
.address_size 64
.entry main()
{
ret;
}
walltime: -2
EDIT:
I managed to get the JIT compile time. However, it seems that the driver expects an array of 32-bit values as OptVals, not, as stated in the manual, an array of pointers (void *), which are 64 bits on my system. So this works:
const unsigned int jitNumOptions = 1;
CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
int *jitOptVals = new int[jitNumOptions];
jitOptions[0] = CU_JIT_WALL_TIME;
// here the call to cuModuleLoadDataEx
std::cout << "walltime: " << (float)jitOptions[0] << "\n";
I believe that it is not possible to do the same with an array of void *. The following code does not work:
const unsigned int jitNumOptions = 1;
CUjit_option *jitOptions = new CUjit_option[jitNumOptions];
void **jitOptVals = new void*[jitNumOptions];
jitOptions[0] = CU_JIT_WALL_TIME;
// here the call to cuModuleLoadDataEx
// here I also would have a problem casting a 64 bit void * to a float (32 bit)
EDIT 2:
Looking at the JIT compilation time in jitOptVals[0] was misleading. As mentioned in the comments, the JIT compiler caches previous translations and won't update the JIT compile time if it finds a cached compilation. Since I was only checking whether this value changed, I assumed that the call ignores the options altogether, which it doesn't. It works fine.
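If you want to rule out the cache when looking at CU_JIT_WALL_TIME, the JIT cache can be disabled for a single run via the documented CUDA_CACHE_DISABLE environment variable, which should force a fresh compilation and a non-trivial wall time:
CUDA_CACHE_DISABLE=1 ./main small.ptx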
Your jitOptVals should not contain pointers to your values; instead, cast the values themselves to void*:
// set up size of compilation log buffer
jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
int jitLogBufferSize = 1024*1024;
jitOptVals[0] = (void *)jitLogBufferSize;
// set up pointer to the compilation log buffer
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
char *jitLogBuffer = new char[jitLogBufferSize];
jitOptVals[1] = jitLogBuffer;
// set up wall clock time
jitOptions[2] = CU_JIT_WALL_TIME;
float jitTime = -2.0;
//Keep jitOptVals[2] empty as it is only an output value:
//jitOptVals[2] = (void*)jitTime;
and after cuModuleLoadDataEx you read jitTime back out of jitOptVals[2], into which the driver writes the float (see the sketch below for one way to reinterpret it).
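Putting the pieces together, the option setup from the question then becomes something like this (a sketch reusing the variables from main.cc above; the memcpy at the end is just one way to reinterpret the 32-bit float that the driver writes into the 64-bit option-value slot, assuming a little-endian x86_64 host, and it needs #include <cstring>):
const unsigned int jitNumOptions = 3;
CUjit_option jitOptions[jitNumOptions];
void *jitOptVals[jitNumOptions];
int jitLogBufferSize = 1024*1024;
char *jitLogBuffer = new char[jitLogBufferSize];
// option 0: pass the buffer size as the value itself, cast to void*
jitOptions[0] = CU_JIT_INFO_LOG_BUFFER_SIZE_BYTES;
jitOptVals[0] = (void *)(size_t)jitLogBufferSize;
// option 1: this one really is a pointer (to the log buffer)
jitOptions[1] = CU_JIT_INFO_LOG_BUFFER;
jitOptVals[1] = jitLogBuffer;
// option 2: output only, the driver overwrites the slot with a float (milliseconds)
jitOptions[2] = CU_JIT_WALL_TIME;
jitOptVals[2] = 0;
ret = cuModuleLoadDataEx(&cuModule, ptx_kernel.c_str(), jitNumOptions, jitOptions, jitOptVals);
float jitTime;
std::memcpy(&jitTime, &jitOptVals[2], sizeof(jitTime)); // pull the float out of the slot
std::cout << "walltime: " << jitTime << " ms\n";
std::cout << jitLogBuffer << "\n";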
Related
We have a working library that uses LibTorch 1.5.0, built with CUDA 10.0, which runs as expected.
We are working on upgrading to CUDA 10.2 for various non-PyTorch-related reasons. We noticed that when we run LibTorch inference with the newly compiled LibTorch (compiled exactly the same, except switching to CUDA 10.2), the runtime is about 20x slower.
We also checked it using the precompiled binaries.
This was tested on 3 different machines using 3 different GPUs (Tesla T4, GTX 980 & P1000), and all give a consistent ~20x slowdown on CUDA 10.2 (both on Windows 10 & Ubuntu 16.04), all with the latest drivers and on 3 different torch scripts (of the same architecture).
I've simplified the code to be extremely minimal without external dependencies other than Torch
int main(int argc, char** argv)
{
// Initialize CUDA device 0
cudaSetDevice(0);
std::string networkPath = DEFAULT_TORCH_SCRIPT;
if (argc > 1)
{
networkPath = argv[1];
}
auto jitModule = std::make_shared<torch::jit::Module>(torch::jit::load(networkPath, torch::kCUDA));
if (jitModule == nullptr)
{
std::cerr << "Failed creating module" << std::endl;
return EXIT_FAILURE;
}
// Meaningless data, just something to pass to the module to run on
// PATCH_HEIGHT & WIDTH are defined as 256
uint8_t* data = new uint8_t[PATCH_HEIGHT * PATCH_WIDTH * 3];
memset(data, 0, PATCH_HEIGHT * PATCH_WIDTH * 3);
auto stream = at::cuda::getStreamFromPool(true, 0);
bool res = infer(jitModule, stream, data, PATCH_WIDTH, PATCH_HEIGHT);
std::cout << "Warmed up" << std::endl;
res = infer(jitModule, stream, data, PATCH_WIDTH, PATCH_HEIGHT);
delete[] data;
return 0;
}
// Inference function
bool infer(std::shared_ptr<JitModule>& jitModule, at::cuda::CUDAStream& stream, const uint8_t* inputData, int width, int height)
{
std::vector<torch::jit::IValue> tensorInput;
// This function simply uses cudaMemcpy to copy to device and create a torch::Tensor from that data
// I can paste it if it's relevant, but didn't for now, to keep this as clean as possible (a hypothetical sketch of it follows after this listing)
if (!prepareInput(inputData, width, height, tensorInput, stream))
{
return false;
}
// Reduce memory usage, without gradients
torch::NoGradGuard noGrad;
{
at::cuda::CUDAStreamGuard streamGuard(stream);
auto totalTimeStart = std::chrono::high_resolution_clock::now();
jitModule->forward(tensorInput);
// The synchronize here is just for timing's sake, not used in production
cudaStreamSynchronize(stream.stream());
auto totalTimeStop = std::chrono::high_resolution_clock::now();
printf("forward sync time = %.3f milliseconds\n",
std::chrono::duration<double, std::milli>(totalTimeStop - totalTimeStart).count());
}
return true;
}
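The prepareInput helper is not shown above; purely for completeness, a hypothetical minimal version (my sketch, not the author's code) that copies the host buffer to the device on the given stream and wraps it in a CUDA tensor could look roughly like this:
// Hypothetical sketch of the omitted prepareInput (names and tensor layout are assumptions).
bool prepareInput(const uint8_t* inputData, int width, int height,
                  std::vector<torch::jit::IValue>& tensorInput,
                  at::cuda::CUDAStream& stream)
{
    const size_t numBytes = static_cast<size_t>(width) * height * 3;
    void* devPtr = nullptr;
    if (cudaMalloc(&devPtr, numBytes) != cudaSuccess)
        return false;
    // Copy the raw image to the GPU on the caller's stream
    cudaMemcpyAsync(devPtr, inputData, numBytes, cudaMemcpyHostToDevice, stream.stream());

    // from_blob does not take ownership; in real code devPtr must outlive the tensor
    auto options = torch::TensorOptions().dtype(torch::kUInt8).device(torch::kCUDA);
    auto tensor = torch::from_blob(devPtr, {1, height, width, 3}, options)
                      .permute({0, 3, 1, 2})
                      .to(torch::kFloat);
    tensorInput.emplace_back(tensor);
    return true;
}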
When compiling this against Torch built with CUDA 10.0 we get a runtime of 18 ms, and when we run it with Torch built with CUDA 10.2 we get a runtime of 430 ms.
Any thoughts on that?
This issue was also posted on PyTorch Forums.
Issue on GitHub
UPDATE
I profiled this small program using both CUDA versions.
It seems that the two runs use very different kernels:
96.5% of the compute time in the 10.2 run is spent in conv2d_grouped_direct_kernel, which takes ~60-100 ms on my P1000,
whereas the top kernels in the 10.0 run are
47.1% - cudnn::detail::implicit_convolve_sgemm (~1.5 ms)
23.1% - maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt (~0.4 ms)
8.5% - maxwell_scudnn_128x32_relu_small_nn (~0.4ms)
so it's easy to see where the time difference comes from. Now the question is: why?
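One thing that might be worth checking (purely speculation on my part, not a confirmed cause) is whether the cuDNN autotuner simply picks different convolution algorithms in the two builds; assuming the LibTorch build exposes at::globalContext().setBenchmarkCuDNN, it can be enabled before the warm-up call:
// Speculative check: let the cuDNN autotuner benchmark and pick the fastest conv algorithm
// (setBenchmarkCuDNN is assumed to be available on ATen's global context in this LibTorch version).
at::globalContext().setBenchmarkCuDNN(true);
// ... then run the warm-up forward() and the timed forward() as before ...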
I have a loop where I parse an ONNX model with TensorRT, create an engine, and run inference.
I make sure I call x->destroy() on all objects and call cudaFree for each cudaMalloc.
Yet I keep seeing an increase in memory usage in nvidia-smi over consecutive iterations.
I'm really not sure where the problem comes from; the cuda-memcheck tool reports no leaks either.
Running Ubuntu 18.04, TensorRT 7.0.0, CUDA 10.2 and using a GTX 1070.
The code and the ONNX file, along with a CMakeLists.txt, are available in this repo
Here's the code
#include <memory>
#include <iostream>
#include <cstdlib> // for atoi
#include <cuda_runtime_api.h>
#include <NvOnnxParser.h>
#include <NvInfer.h>
class Logger : public nvinfer1::ILogger
{
void log(Severity severity, const char* msg) override
{
// suppress info-level messages
if (severity != Severity::kINFO)
std::cout << msg << std::endl;
}
};
int main(int argc, char * argv[])
{
Logger gLogger;
auto builder = nvinfer1::createInferBuilder(gLogger);
const auto explicitBatch = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
auto network = builder->createNetworkV2(explicitBatch);
auto config = builder->createBuilderConfig();
auto parser = nvonnxparser::createParser(*network, gLogger);
parser->parseFromFile("../model.onnx", static_cast<int>(0));
builder->setMaxBatchSize(1);
config->setMaxWorkspaceSize(128 * (1 << 20)); // 128 MiB
auto engine = builder->buildEngineWithConfig(*network, *config);
builder->destroy();
network->destroy();
parser->destroy();
config->destroy();
for(int i=0; i< atoi(argv[1]); i++)
{
auto context = engine->createExecutionContext();
void* deviceBuffers[2]{0};
int inputIndex = engine->getBindingIndex("input_rgb:0");
constexpr int inputNumel = 1 * 128 * 64 * 3;
int outputIndex = engine->getBindingIndex("truediv:0");
constexpr int outputNumel = 1 * 128;
//TODO: Remove batch size hardcoding
cudaMalloc(&deviceBuffers[inputIndex], 1 * sizeof(float) * inputNumel);
cudaMalloc(&deviceBuffers[outputIndex], 1 * sizeof(float) * outputNumel);
cudaStream_t stream;
cudaStreamCreate(&stream);
float inBuffer[inputNumel] = {0};
float outBuffer[outputNumel] = {0};
cudaMemcpyAsync(deviceBuffers[inputIndex], inBuffer, 1 * sizeof(float) * inputNumel, cudaMemcpyHostToDevice, stream);
context->enqueueV2(deviceBuffers, stream, nullptr);
cudaMemcpyAsync(outBuffer, deviceBuffers[outputIndex], 1 * sizeof(float) * outputNumel, cudaMemcpyDeviceToHost, stream);
cudaStreamSynchronize(stream);
cudaFree(deviceBuffers[inputIndex]);
cudaFree(deviceBuffers[outputIndex]);
cudaStreamDestroy(stream);
context->destroy();
}
engine->destroy();
return 0;
}
Looks like the issue was coming from the repeated IExecutionContext creation, despite destroying it at the end of every iteration. Creating and destroying the context together with the engine, outside the loop, fixed the issue for me, as sketched below. Nevertheless, it could still be a bug where context creation leaks a little bit of memory and that leak accumulates over time. Filed a GitHub issue.
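Concretely, the fix amounts to hoisting the context out of the loop, something like this sketch (reusing the setup code from the question):
auto engine = builder->buildEngineWithConfig(*network, *config);
// Create the execution context once, next to the engine, instead of per iteration
auto context = engine->createExecutionContext();
for (int i = 0; i < atoi(argv[1]); i++)
{
    // ... cudaMalloc / cudaMemcpyAsync / enqueueV2 / cudaFree / stream handling as before ...
}
// Destroy the context together with the engine, after the loop
context->destroy();
engine->destroy();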
I am trying to read in a GeoPackage (gpkg) file to extract geo information such as streets and buildings.
Therefore I started with this code:
#include "gdal_priv.h"
#include <iostream>
int main() {
GDALDataset* poDataset;
GDALAllRegister();
std::cout << "driver# " << GetGDALDriverManager()->GetDriverCount()
<< std::endl;
for (int i = 0; i < GetGDALDriverManager()->GetDriverCount(); i++) {
auto driver = GetGDALDriverManager()->GetDriver(i);
auto info = driver->GetDescription();
std::cout << "driver " << i << ": " << info << std::endl;
}
auto driver = GetGDALDriverManager()->GetDriverByName("GPKG");
poDataset = (GDALDataset*)GDALOpen("Building_LoD1.gpkg", GA_ReadOnly);
if (poDataset == NULL) {
// ...;
}
return 0;
}
The driver list contains GPKG, but the read fails with an error saying the file is not recognized as a supported file format.
Running gdalinfo Building_LoD1.gpkg leads to the same error in the console, but I can open the file in QGIS.
And gdalsrsinfo Building_LoD1.gpkg reports:
PROJ.4 : +proj=somerc +lat_0=46.95240555555556 +lon_0=7.439583333333333 +k_0=1 +x_0=2600000 +y_0=1200000 +ellps=bessel +towgs84=674.374,15.056,405.346,0,0,0,0 +units=m +no_defs
OGC WKT :
PROJCS["CH1903+ / LV95",
GEOGCS["CH1903+",
DATUM["CH1903+",
SPHEROID["Bessel 1841",6377397.155,299.1528128,
AUTHORITY["EPSG","7004"]],
TOWGS84[674.374,15.056,405.346,0,0,0,0],
AUTHORITY["EPSG","6150"]],
PRIMEM["Greenwich",0,
AUTHORITY["EPSG","8901"]],
UNIT["degree",0.0174532925199433,
AUTHORITY["EPSG","9122"]],
AUTHORITY["EPSG","4150"]],
PROJECTION["Hotine_Oblique_Mercator_Azimuth_Center"],
PARAMETER["latitude_of_center",46.95240555555556],
PARAMETER["longitude_of_center",7.439583333333333],
PARAMETER["azimuth",90],
PARAMETER["rectified_grid_angle",90],
PARAMETER["scale_factor",1],
PARAMETER["false_easting",2600000],
PARAMETER["false_northing",1200000],
UNIT["metre",1,
AUTHORITY["EPSG","9001"]],
AXIS["Easting",EAST],
AXIS["Northing",NORTH],
AUTHORITY["EPSG","2056"]]
Does anyone know why a gpkg file might be reported as not supported?
The GDAL version is 2.3.2.
I figured out the problem. The reason for the message is not that the file format is not supported by GDAL, but that I used the wrong function to open the file.
If I want to read in a file that has vector information, then I need to use:
GDALDataset* poDS;
poDS = (GDALDataset*)GDALOpenEx( "Building_LoD1.gpkg", GDAL_OF_VECTOR, NULL, NULL, NULL);
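Once the dataset is opened with GDAL_OF_VECTOR, the vector layers (buildings, streets, ...) can be listed and read through the OGR classes; a minimal sketch (needs ogrsf_frmts.h in addition to gdal_priv.h):
GDALDataset* poDS = (GDALDataset*)GDALOpenEx("Building_LoD1.gpkg",
                                             GDAL_OF_VECTOR, NULL, NULL, NULL);
if (poDS == NULL) {
    return 1;
}
// Enumerate the layers; each layer then exposes its features and geometries
for (int i = 0; i < poDS->GetLayerCount(); i++) {
    OGRLayer* layer = poDS->GetLayer(i);
    std::cout << "layer " << i << ": " << layer->GetName()
              << " (" << layer->GetFeatureCount() << " features)" << std::endl;
}
GDALClose(poDS);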
I'm trying to write a simple "Hello, world!" program using Mach threads on x86_64. Unfortunately, the program crashes with a segmentation fault on my machine, and I can't seem to fix the problem. I couldn't find much documentation about Mach threads online, but I referred to the following C file which also makes use of Mach threads.
As far as I can tell, I'm doing everything correctly. I suspect that the segmentation fault is because I did not set up the thread's stack correctly, but I took the same approach as the reference file, which has the following code.
// This is for alignment. In particular, note that subtracting sizeof(void*) is
// necessary, since that slot would normally hold the return address (i.e. we are
// aligning the call frame to a 16-byte boundary as required by the ABI, but the
// stack pointer has to point to the byte beyond that). Not doing this leads to
// funny behavior: the first call to an external function will fail due to stack misalignment.
state.__rsp &= -16;
state.__rsp -= sizeof(void*);
Do you have any idea as to what I could be doing wrong?
#include <cstdint>
#include <iostream>
#include <system_error>
#include <unistd.h>
#include <mach/mach_init.h>
#include <mach/mach_types.h>
#include <mach/task.h>
#include <mach/thread_act.h>
#include <mach/thread_policy.h>
#include <mach/i386/thread_status.h>
void check(kern_return_t err)
{
if (err == KERN_SUCCESS) {
return;
}
auto code = std::error_code{err, std::system_category()};
switch (err) {
case KERN_FAILURE:
throw std::system_error{code, "failure"};
case KERN_INVALID_ARGUMENT:
throw std::system_error{code, "invalid argument"};
default:
throw std::system_error{code, "unknown error"};
}
}
void test()
{
std::cout << "Hello from thread." << std::endl;
}
int main()
{
auto page_size = ::getpagesize();
auto stack = new uint8_t[page_size];
auto thread = ::thread_t{};
auto task = ::mach_task_self();
check(::thread_create(task, &thread));
auto state = ::x86_thread_state64_t{};
auto count = ::mach_msg_type_number_t{x86_THREAD_STATE64_COUNT};
check(::thread_get_state(thread, x86_THREAD_STATE64,
(::thread_state_t)&state, &count));
auto stack_ptr = (uintptr_t)(stack + page_size);
stack_ptr &= -16;
stack_ptr -= sizeof(void*);
state.__rip = (uintptr_t)test;
state.__rsp = (uintptr_t)stack_ptr;
state.__rbp = (uintptr_t)stack_ptr;
check(::thread_set_state(thread, x86_THREAD_STATE64,
(::thread_state_t)&state, x86_THREAD_STATE64_COUNT));
check(::thread_resume(thread));
::sleep(1);
std::cout << "Done." << std::endl;
}
The reference file uses C++11; if compiling with GCC or Clang, you will need to supply the -std=c++11 flag.
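For example, on macOS (the source file name here is just a placeholder):
clang++ -std=c++11 mach_hello.cc -o mach_hello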
I am new to OpenCV with CUDA.
I use OpenCV 2.4.6 and CUDA 4.2.
I have successfully compiled OpenCV with CUDA support.
When I use this code:
int cuda_count;
cudaError_t error = cudaGetDeviceCount( &cuda_count );
it returns cudaSuccess and cuda_count = 1.
But when I use this code:
int num_devices = cv::gpu::getCudaEnabledDeviceCount();
num_devices is 0.
Why?
My complete code is:
#include <iostream>
#include <cuda_runtime_api.h>
#include <opencv2/gpu/gpu.hpp>
int main()
{
int num_devices = cv::gpu::getCudaEnabledDeviceCount();
int cuda_count;
cudaError_t error = cudaGetDeviceCount( &cuda_count );
if(num_devices <=0 )
{
std::cerr << "no" << std::endl;
return -1;
}
int enable_device_id = -1;
}
You must have compiled OpenCV without CUDA support. From the documentation:
gpu::getCudaEnabledDeviceCount Returns the number of installed
CUDA-enabled devices.
C++: int gpu::getCudaEnabledDeviceCount()
Use this function before any
other GPU function calls. If OpenCV is compiled without GPU support,
this function returns 0.
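A quick way to verify whether the OpenCV build you are linking against really has CUDA support is to print the build information and look for the CUDA entries (a small check program; cv::getBuildInformation() is available in OpenCV 2.4):
#include <iostream>
#include <opencv2/core/core.hpp>
#include <opencv2/gpu/gpu.hpp>

int main()
{
    // The build configuration lists whether CUDA support was compiled in
    std::cout << cv::getBuildInformation() << std::endl;
    std::cout << "CUDA-enabled devices seen by OpenCV: "
              << cv::gpu::getCudaEnabledDeviceCount() << std::endl;
    return 0;
}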