cuModuleGetFunction returns not found - c++

I want to compile CUDA kernels with the nvrtc JIT compiler to improve the performance of my application (I trade a somewhat higher number of instruction fetches for saving multiple array accesses).
The functions look, for example, like this and are generated by my function generator (not that important):
extern "C" __device__ void GetSumOfBranches(double* branches, double* outSum)
{
double sum = (branches[38])+(-branches[334])+(-branches[398])+(-branches[411]);
*outSum = sum;
}
I am compiling the code above with the following function:
CUfunction* FunctionGenerator::CreateFunction(const char* programText)
{
// When I comment this statement out, the output of the PTX file changes.
// What is the reason?! Bug?
std::string savedString = std::string(programText);
nvrtcProgram prog;
nvrtcCreateProgram(&prog, programText, "GetSumOfBranches.cu", 0, NULL, NULL);
const char *opts[] = {"--gpu-architecture=compute_52", "--fmad=false"};
nvrtcCompileProgram(prog, 2, opts);
// Obtain compilation log from the program.
size_t logSize;
nvrtcGetProgramLogSize(prog, &logSize);
char *log = new char[logSize];
nvrtcGetProgramLog(prog, log);
// Obtain PTX from the program.
size_t ptxSize;
nvrtcGetPTXSize(prog, &ptxSize);
char *ptx = new char[ptxSize];
nvrtcGetPTX(prog, ptx);
printf("%s", ptx);
CUdevice cuDevice;
CUcontext context;
CUmodule module;
CUfunction* kernel;
kernel = (CUfunction*)malloc(sizeof(CUfunction));
cuInit(0);
cuDeviceGet(&cuDevice, 0);
cuCtxCreate(&context, 0, cuDevice);
auto resultLoad = cuModuleLoadDataEx(&module, ptx, 0, 0, 0);
auto resultGetF = cuModuleGetFunction(kernel, module, "GetSumOfBranches");
return kernel;
}
Everything is working except that cuModuleGetFunction is returning CUDA_ERROR_NOT_FOUND. That error occurs because GetSumOfBranches cannot be found in the PTX file.
However the output of printf("%s", ptx); is this:
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19856038
// Cuda compilation tools, release 7.5, V7.5.17
// Based on LLVM 3.4svn
//
.version 4.3
.target sm_52
.address_size 64
// .globl GetSumOfBranches
.visible .func GetSumOfBranches(
.param .b64 GetSumOfBranches_param_0,
.param .b64 GetSumOfBranches_param_1
)
{
.reg .f64 %fd<8>;
.reg .b64 %rd<3>;
ld.param.u64 %rd1, [GetSumOfBranches_param_0];
ld.param.u64 %rd2, [GetSumOfBranches_param_1];
ld.f64 %fd1, [%rd1+304];
ld.f64 %fd2, [%rd1+2672];
sub.rn.f64 %fd3, %fd1, %fd2;
ld.f64 %fd4, [%rd1+3184];
sub.rn.f64 %fd5, %fd3, %fd4;
ld.f64 %fd6, [%rd1+3288];
sub.rn.f64 %fd7, %fd5, %fd6;
st.f64 [%rd2], %fd7;
ret;
}
In my opinion everything is fine and GetSumOfBranches should be found by cuModuleGetFunction. Can you explain why it is not?
Second Question
When I comment out std::string savedString = std::string(programText);, the output of the PTX is just:
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-19856038
// Cuda compilation tools, release 7.5, V7.5.17
// Based on LLVM 3.4svn
//
.version 4.3
.target sm_52
.address_size 64
and this is weird because savedString is not used at all...

What you are trying to do isn't supported. The host-side module management APIs and the device ELF format do not expose __device__ functions, only __global__ functions, which are callable via the kernel launch APIs.
You can compile device functions a priori or at runtime and link them with kernels in a JIT fashion, and you can retrieve those kernels and call them. But that is all you can do.
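For illustration, a minimal sketch (not your generator's output): if the generated function is emitted as an extern "C" __global__ kernel rather than a __device__ function, it becomes a .visible .entry in the PTX, cuModuleGetFunction can find it, and you can launch it with cuLaunchKernel:
extern "C" __global__ void GetSumOfBranches(double* branches, double* outSum)
{
    // same body as before, but compiled as a kernel entry point (.entry in the PTX)
    double sum = (branches[38]) + (-branches[334]) + (-branches[398]) + (-branches[411]);
    *outSum = sum;
}
If you want to keep the generated code as plain __device__ functions, you can still JIT-link them (cuLinkCreate / cuLinkAddData / cuLinkComplete) against a separately compiled __global__ kernel that calls them, but the name you pass to cuModuleGetFunction must always be that of a __global__ kernel.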

Related

Obtaining libc function name (e.g., scanf) with Intel PIN Instrumentation

I'm trying to obtain the routine name for libc functions (e.g., scanf in this case) with Intel PIN instrumentation. After trying a number of approaches, I believe this is difficult, but I figured I would ask first.
For the sake of brevity, below is a minimal example of the input binary's source code and the Intel PIN instrumentation code I have tried. I feel the result is close, yet missing the final detail.
Example source code (test.c):
#include <stdio.h>
#include <stdlib.h>
void foo(char *buffer) {
scanf("%s", buffer);
printf("Unsafe gets: %s\n", buffer);
}
int main() {
char *buf = malloc(sizeof(char)*50);
foo(buf);
return 0;
}
Intel PIN instrumentation source code (test_pin.so):
VOID do_call(ADDRINT addr)
{
printf("%p - %s\n", (void*)addr, RTN_FindNameByAddress(addr).c_str());
//fprintf(trace, "\n[%s]\n", RTN_FindNameByAddress(addr).c_str()));
//fflush(trace);
}
VOID Instruction(INS ins, VOID *v)
{
if (INS_IsCall(ins))
{
if (INS_IsDirectBranchOrCall(ins))
{
const ADDRINT addr = INS_DirectBranchOrCallTargetAddress(ins);
INS_InsertPredicatedCall(ins, IPOINT_BEFORE, AFUNPTR(do_call),
IARG_PTR, addr, IARG_END);
}
}
}
int main(int argc, char **argv)
{
/* initialize symbol processing */
PIN_InitSymbols();
/* initialize Pin; optimized branch */
if (unlikely(PIN_Init(argc, argv)))
/* Pin initialization failed */
goto err;
INS_AddInstrumentFunction(Instruction, 0);
/* start Pin */
PIN_StartProgram();
/* typically not reached; make the compiler happy */
return EXIT_SUCCESS;
err:
return EXIT_FAILURE;
}
The reason I say I feel somewhat close is that running the above Intel PIN instrumentation with the command pin -t test_pin.so -- ./test results in the following output:
...
0x7f27aac14280 - brk
0x7f27aac142e0 - sbrk
0x7f27aab97a70 - pthread_attr_setschedparam
0x55a6dcbb8189 - foo
0x55a6dcbb8090 - .plt.sec
0x7f27aab63a00 - psiginfo
0x7f27aab91c30 - __uflow
0x7f27aab91e80 - _IO_doallocbuf
...
where we can observe that two addresses start with 0x5 (indicating that they come from the main application rather than from a shared library, whose addresses start with 0x7):
0x55a6dcbb8189 - foo
0x55a6dcbb8090 - .plt.sec
Additional investigation with objdump or gdb on the above source code shows that the entry address of the foo function and the mystery target of .plt.sec reported by Intel PIN are (after removing the relocation base address):
0000000000001189 <foo>:
...
11ac: e8 df fe ff ff call 1090 <__isoc99_scanf@plt>
So instead of getting the output of .plt.sec, is it possible to somehow resolve this mysterious .plt.sec name into __isoc99_scanf? I would appreciate any suggestions.
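A hedged sketch of one possible direction (untested; do_indirect is a name made up here): in addition to the direct-call instrumentation above, instrument indirect branches/calls and resolve their run-time target with RTN_FindNameByAddress. For a .plt.sec stub, the run-time target of its indirect jump should be the bound libc entry (e.g. __isoc99_scanf) once the dynamic linker has resolved it:
VOID do_indirect(ADDRINT target)
{
    // target is the run-time branch target; for a PLT stub this should be the
    // resolved libc entry point (after the first call has gone through the resolver)
    PIN_LockClient();
    std::string name = RTN_FindNameByAddress(target);
    PIN_UnlockClient();
    printf("%p - %s\n", (void*)target, name.c_str());
}
// inside the existing Instruction() callback, alongside the direct-call case:
if (INS_IsIndirectBranchOrCall(ins))
{
    INS_InsertCall(ins, IPOINT_BEFORE, AFUNPTR(do_indirect),
                   IARG_BRANCH_TARGET_ADDR, IARG_END);
}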

How can I create an executable to run a kernel in a given PTX file?

As far as I know, you need host code (for the CPU) and device code (for the GPU); without them you can't run anything on a GPU.
I am learning the PTX ISA and I don't know how to execute it on Windows. Do I need a .cu file to run it, or is there another way?
TL;DR:
How can I assemble a .ptx file together with a host code file and make an executable?
You use the CUDA driver API. Relevant sample codes are vectorAddDrv (or perhaps any other driver API sample code) as well as ptxjit.
Do I need a .cu file to run it or is there another way to run it?
You do not need a .cu file (nor do you need nvcc) to use the driver API method, if you start with device code in PTX form.
Details:
The remainder of this answer is not intended to be a tutorial on driver API programming (use the references already given and the API reference manual here), nor is it intended to be a tutorial on PTX programming. For PTX programming I refer you to the PTX documentation.
To start with, we need an appropriate PTX kernel definition. (For that, rather than writing my own kernel PTX code, I will use the one from the vectorAddDrv sample code, from the CUDA 11.1 toolkit, converting that CUDA C++ kernel definition to an equivalent PTX kernel definition via nvcc -ptx vectorAdd_kernel.cu):
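For reference, the kernel in that sample is essentially the following CUDA C++ (paraphrased from the vectorAddDrv sample; the exact source in the toolkit may differ slightly):
// vectorAdd_kernel.cu
extern "C" __global__ void VecAdd_kernel(const float *A, const float *B, float *C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}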
vectorAdd_kernel.ptx:
.version 7.1
.target sm_52
.address_size 64
// .globl VecAdd_kernel
.visible .entry VecAdd_kernel(
.param .u64 VecAdd_kernel_param_0,
.param .u64 VecAdd_kernel_param_1,
.param .u64 VecAdd_kernel_param_2,
.param .u32 VecAdd_kernel_param_3
)
{
.reg .pred %p<2>;
.reg .f32 %f<4>;
.reg .b32 %r<6>;
.reg .b64 %rd<11>;
ld.param.u64 %rd1, [VecAdd_kernel_param_0];
ld.param.u64 %rd2, [VecAdd_kernel_param_1];
ld.param.u64 %rd3, [VecAdd_kernel_param_2];
ld.param.u32 %r2, [VecAdd_kernel_param_3];
mov.u32 %r3, %ntid.x;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r1, %r3, %r4, %r5;
setp.ge.s32 %p1, %r1, %r2;
@%p1 bra $L__BB0_2;
cvta.to.global.u64 %rd4, %rd1;
mul.wide.s32 %rd5, %r1, 4;
add.s64 %rd6, %rd4, %rd5;
cvta.to.global.u64 %rd7, %rd2;
add.s64 %rd8, %rd7, %rd5;
ld.global.f32 %f1, [%rd8];
ld.global.f32 %f2, [%rd6];
add.f32 %f3, %f2, %f1;
cvta.to.global.u64 %rd9, %rd3;
add.s64 %rd10, %rd9, %rd5;
st.global.f32 [%rd10], %f3;
$L__BB0_2:
ret;
}
We'll also need a driver API C++ source code file that does all the host-side work to load this kernel and launch it. Again I will use the source code from the vectorAddDrv sample project (the .cpp file), with modifications to load PTX instead of fatbin:
vectorAddDrv.cpp:
// Vector addition: C = A + B.
// Includes
#include <stdio.h>
#include <string>
#include <iostream>
#include <cstring>
#include <fstream>
#include <streambuf>
#include <cuda.h>
#include <cmath>
#include <vector>
#define CHK(X) if ((err = X) != CUDA_SUCCESS) printf("CUDA error %d at %d\n", (int)err, __LINE__)
// Variables
CUdevice cuDevice;
CUcontext cuContext;
CUmodule cuModule;
CUfunction vecAdd_kernel;
CUresult err;
CUdeviceptr d_A;
CUdeviceptr d_B;
CUdeviceptr d_C;
// Host code
int main(int argc, char **argv)
{
printf("Vector Addition (Driver API)\n");
int N = 50000, devID = 0;
size_t size = N * sizeof(float);
// Initialize
CHK(cuInit(0));
CHK(cuDeviceGet(&cuDevice, devID));
// Create context
CHK(cuCtxCreate(&cuContext, 0, cuDevice));
// Load PTX file
std::ifstream my_file("vectorAdd_kernel.ptx");
std::string my_ptx((std::istreambuf_iterator<char>(my_file)), std::istreambuf_iterator<char>());
// Create module from PTX
CHK(cuModuleLoadData(&cuModule, my_ptx.c_str()));
// Get function handle from module
CHK(cuModuleGetFunction(&vecAdd_kernel, cuModule, "VecAdd_kernel"));
// Allocate/initialize vectors in host memory
std::vector<float> h_A(N, 1.0f);
std::vector<float> h_B(N, 2.0f);
std::vector<float> h_C(N);
// Allocate vectors in device memory
CHK(cuMemAlloc(&d_A, size));
CHK(cuMemAlloc(&d_B, size));
CHK(cuMemAlloc(&d_C, size));
// Copy vectors from host memory to device memory
CHK(cuMemcpyHtoD(d_A, h_A.data(), size));
CHK(cuMemcpyHtoD(d_B, h_B.data(), size));
// Grid/Block configuration
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
void *args[] = { &d_A, &d_B, &d_C, &N };
// Launch the CUDA kernel
CHK(cuLaunchKernel(vecAdd_kernel, blocksPerGrid, 1, 1,
threadsPerBlock, 1, 1,
0,
NULL, args, NULL));
// Copy result from device memory to host memory
// h_C contains the result in host memory
CHK(cuMemcpyDtoH(h_C.data(), d_C, size));
// Verify result
for (int i = 0; i < N; ++i)
{
float sum = h_A[i] + h_B[i];
if (fabs(h_C[i] - sum) > 1e-7f)
{
printf("mismatch!");
break;
}
}
return 0;
}
(Note that I have stripped out various items such as deallocation calls. This is intended to demonstrate the overall method; the code above is merely a demonstrator.)
On Linux:
We can compile and run the code as follows:
$ g++ vectorAddDrv.cpp -o vectorAddDrv -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda
$ ./vectorAddDrv
Vector Addition (Driver API)
$
On Windows/Visual Studio: Create a new C++ project in Visual Studio. Add the above .cpp file to the project. Make sure the vectorAdd_kernel.ptx file is in the same directory as the built executable. You will also need to modify the project settings to point to the location of the CUDA include files and the CUDA library files. Here's what I did in VS2019:
File...New...Project...Console App...Create
Replace the contents of the given .cpp file with the .cpp file contents above
Change project target to x64
In project...properties
change platform to x64
in configuration properties...C/C++...General...Additional Include Directories, add the path to the CUDA toolkit include directory, on my machine it was C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\include
in configuration properties...Linker...General...Additional Library Directories, add the path to the CUDA toolkit library directory, on my machine it was C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v11.1\lib\x64
in configuration properties...Linker...Input...Additional Dependencies, add the cuda.lib file (for the driver API library)
Save the project properties, then do Build....Rebuild
From the console output, locate the built executable. Make sure the vectorAdd_kernel.ptx file is in that directory, and run the executable from that directory (i.e. open a command prompt, change to that directory, and run the application from there).
NOTE: If you are not using the CUDA 11.1 toolkit or newer, or if you are running on a GPU of compute capability 5.0 or lower, the above PTX code will not work, and so this example will not work verbatim. However the overall method will work, and this question is not about how to write PTX code.
EDIT: Responding to a question in the comments:
What if you wanted the binary to not have to build anything at runtime? i.e. assemble the PTX and stick it in a binary with the compiled host-side code?
I'm not aware of a method provided by the NVIDIA toolchain to do this. It's pretty much the domain of the runtime API to create these unified binaries, from my perspective.
However the basic process seems evident from the driver API flow already shown in the above example: whether we start with a .cubin or a .ptx file, either way the file is loaded into a string, and the string is handed off to cuModuleLoadData(). Therefore, it doesn't seem that difficult to build a string out of a .cubin binary with a utility, and then incorporate that in the build process.
I'm really just hacking around here; use this at your own risk, and there may be any number of factors that I haven't considered. I'm just going to demonstrate on Linux for this part. Here is the source code and build example for the utility:
$ cat f2s.cpp
// Includes
#include <stdio.h>
#include <string>
#include <iostream>
#include <cstring>
#include <fstream>
#include <streambuf>
int main(int argc, char **argv)
{
std::ifstream my_file("vectorAdd_kernel.cubin");
std::string my_bin((std::istreambuf_iterator<char>(my_file)), std::istreambuf_iterator<char>());
std::cout << "unsigned char my_bin[] = {";
for (int i = 0; i < my_bin.length()-1; i++) std::cout << (int)(unsigned char)my_bin[i] << ",";
std::cout << (int)(unsigned char)my_bin[my_bin.length()-1] << "};";
return 0;
}
$ g++ f2s.cpp -o f2s
$
The next step here is to create a .cubin file for use. In the above example, I created the ptx file via nvcc -ptx vectorAdd_kernel.cu. We can just change that to nvcc -cubin vectorAdd_kernel.cu or you can use whatever method you like to generate the .cubin file.
With the cubin file created, we need to convert that into something that can be sucked into our C++ code build process. That is the purpose of the f2s utility. You would use it like this:
./f2s > my_bin.h
(probably it would be good to allow the f2s utility to accept an input filename as a command-line argument. Exercise left to reader. This is just for demonstration/amusement.)
After the creation of the above header file, we need to modify our .cpp file as follows:
$ cat vectorAddDrv_bin.cpp
// Vector addition: C = A + B.
// Includes
#include <stdio.h>
#include <string>
#include <iostream>
#include <cstring>
#include <fstream>
#include <streambuf>
#include <cuda.h>
#include <cmath>
#include <vector>
#include <my_bin.h>
#define CHK(X) if ((err = X) != CUDA_SUCCESS) printf("CUDA error %d at %d\n", (int)err, __LINE__)
// Variables
CUdevice cuDevice;
CUcontext cuContext;
CUmodule cuModule;
CUfunction vecAdd_kernel;
CUresult err;
CUdeviceptr d_A;
CUdeviceptr d_B;
CUdeviceptr d_C;
// Host code
int main(int argc, char **argv)
{
printf("Vector Addition (Driver API)\n");
int N = 50000, devID = 0;
size_t size = N * sizeof(float);
// Initialize
CHK(cuInit(0));
CHK(cuDeviceGet(&cuDevice, devID));
// Create context
CHK(cuCtxCreate(&cuContext, 0, cuDevice));
// Create module from "binary string"
CHK(cuModuleLoadData(&cuModule, my_bin));
// Get function handle from module
CHK(cuModuleGetFunction(&vecAdd_kernel, cuModule, "VecAdd_kernel"));
// Allocate/initialize vectors in host memory
std::vector<float> h_A(N, 1.0f);
std::vector<float> h_B(N, 2.0f);
std::vector<float> h_C(N);
// Allocate vectors in device memory
CHK(cuMemAlloc(&d_A, size));
CHK(cuMemAlloc(&d_B, size));
CHK(cuMemAlloc(&d_C, size));
// Copy vectors from host memory to device memory
CHK(cuMemcpyHtoD(d_A, h_A.data(), size));
CHK(cuMemcpyHtoD(d_B, h_B.data(), size));
// Grid/Block configuration
int threadsPerBlock = 256;
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
void *args[] = { &d_A, &d_B, &d_C, &N };
// Launch the CUDA kernel
CHK(cuLaunchKernel(vecAdd_kernel, blocksPerGrid, 1, 1,
threadsPerBlock, 1, 1,
0,
NULL, args, NULL));
// Copy result from device memory to host memory
// h_C contains the result in host memory
CHK(cuMemcpyDtoH(h_C.data(), d_C, size));
// Verify result
for (int i = 0; i < N; ++i)
{
float sum = h_A[i] + h_B[i];
if (fabs(h_C[i] - sum) > 1e-7f)
{
printf("mismatch!");
break;
}
}
return 0;
}
$ g++ vectorAddDrv_bin.cpp -o vectorAddDrv_bin -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lcuda -I.
$ ./vectorAddDrv_bin
Vector Addition (Driver API)
$
It seems to work. YMMV. For further amusement, this approach seems to create a form of obfuscation:
$ cuobjdump -sass vectorAdd_kernel.cubin
code for sm_52
Function : VecAdd_kernel
.headerflags @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
/* 0x001cfc00e22007f6 */
/*0008*/ MOV R1, c[0x0][0x20] ; /* 0x4c98078000870001 */
/*0010*/ S2R R0, SR_CTAID.X ; /* 0xf0c8000002570000 */
/*0018*/ S2R R2, SR_TID.X ; /* 0xf0c8000002170002 */
/* 0x001fd842fec20ff1 */
/*0028*/ XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ ; /* 0x4f107f8000270003 */
/*0030*/ XMAD R2, R0.reuse, c[0x0] [0x8], R2 ; /* 0x4e00010000270002 */
/*0038*/ XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2 ; /* 0x5b30011800370000 */
/* 0x001ff400fd4007ed */
/*0048*/ ISETP.GE.AND P0, PT, R0, c[0x0][0x158], PT ; /* 0x4b6d038005670007 */
/*0050*/ NOP ; /* 0x50b0000000070f00 */
/*0058*/ @P0 EXIT ; /* 0xe30000000000000f */
/* 0x081fd800fea207f1 */
/*0068*/ SHL R6, R0.reuse, 0x2 ; /* 0x3848000000270006 */
/*0070*/ SHR R0, R0, 0x1e ; /* 0x3829000001e70000 */
/*0078*/ IADD R4.CC, R6.reuse, c[0x0][0x140] ; /* 0x4c10800005070604 */
/* 0x001fd800fe0207f2 */
/*0088*/ IADD.X R5, R0.reuse, c[0x0][0x144] ; /* 0x4c10080005170005 */
/*0090*/ { IADD R2.CC, R6, c[0x0][0x148] ; /* 0x4c10800005270602 */
/*0098*/ LDG.E R4, [R4] }
/* 0xeed4200000070404 */
/* 0x001fd800f62007e2 */
/*00a8*/ IADD.X R3, R0, c[0x0][0x14c] ; /* 0x4c10080005370003 */
/*00b0*/ LDG.E R2, [R2] ; /* 0xeed4200000070202 */
/*00b8*/ IADD R6.CC, R6, c[0x0][0x150] ; /* 0x4c10800005470606 */
/* 0x001fc420fe4007f7 */
/*00c8*/ IADD.X R7, R0, c[0x0][0x154] ; /* 0x4c10080005570007 */
/*00d0*/ FADD R0, R2, R4 ; /* 0x5c58000000470200 */
/*00d8*/ STG.E [R6], R0 ; /* 0xeedc200000070600 */
/* 0x001ffc00ffe007ea */
/*00e8*/ NOP ; /* 0x50b0000000070f00 */
/*00f0*/ EXIT ; /* 0xe30000000007000f */
/*00f8*/ BRA 0xf8 ; /* 0xe2400fffff87000f */
..........
$ cuobjdump -sass vectorAddDrv_bin
cuobjdump info : File 'vectorAddDrv_bin' does not contain device code
$
LoL

Pytorch inference time difference between CUDA 10.0 & 10.2

We have a working library that uses LibTorch 1.5.0, built with CUDA 10.0 which runs as expected.
We are working on upgrading to CUDA 10.2 for various non-PyTorch related reasons. We noticed that when we run LibTorch inference on the newly compiled LibTorch (compiled exactly the same, except changing to CUDA 10.2), the runtime is about 20x slower.
We also checked it using the precompiled binaries.
This was tested on 3 different machines using 3 different GPUs (Tesla T4, GTX 980 & P1000) and all give a consistent ~20x slowdown on CUDA 10.2
(both on Windows 10 & Ubuntu 16.04), all with the latest drivers and on 3 different torch scripts (of the same architecture).
I've simplified the code to be extremely minimal, without external dependencies other than Torch:
int main(int argc, char** argv)
{
// Initialize CUDA device 0
cudaSetDevice(0);
std::string networkPath = DEFAULT_TORCH_SCRIPT;
if (argc > 1)
{
networkPath = argv[1];
}
auto jitModule = std::make_shared<torch::jit::Module>(torch::jit::load(networkPath, torch::kCUDA));
if (jitModule == nullptr)
{
std::cerr << "Failed creating module" << std::endl;
return EXIT_FAILURE;
}
// Meaningless data, just something to pass to the module to run on
// PATCH_HEIGHT & WIDTH are defined as 256
uint8_t* data = new uint8_t[PATCH_HEIGHT * PATCH_WIDTH * 3];
memset(data, 0, PATCH_HEIGHT * PATCH_WIDTH * 3);
auto stream = at::cuda::getStreamFromPool(true, 0);
bool res = infer(jitModule, stream, data, PATCH_WIDTH, PATCH_HEIGHT);
std::cout << "Warmed up" << std::endl;
res = infer(jitModule, stream, data, PATCH_WIDTH, PATCH_HEIGHT);
delete[] data;
return 0;
}
// Inference function
bool infer(std::shared_ptr<JitModule>& jitModule, at::cuda::CUDAStream& stream, const uint8_t* inputData, int width, int height)
{
std::vector<torch::jit::IValue> tensorInput;
// This function simply uses cudaMemcpy to copy to the device and creates a torch::Tensor from that data
// I can paste it if it's relevant, but didn't for now to keep this as clean as possible
if (!prepareInput(inputData, width, height, tensorInput, stream))
{
return false;
}
// Reduce memory usage, without gradients
torch::NoGradGuard noGrad;
{
at::cuda::CUDAStreamGuard streamGuard(stream);
auto totalTimeStart = std::chrono::high_resolution_clock::now();
jitModule->forward(tensorInput);
// The synchronize here is just for timing sake, not use in production
cudaStreamSynchronize(stream.stream());
auto totalTimeStop = std::chrono::high_resolution_clock::now();
printf("forward sync time = %.3f milliseconds\n",
std::chrono::duration<double, std::milli>(totalTimeStop - totalTimeStart).count());
}
return true;
}
When compiling this with Torch that was compiled using CUDA 10.0, we get a runtime of 18 ms; when we run it with Torch compiled with CUDA 10.2, we get a runtime of 430 ms.
Any thoughts on that?
This issue was also posted on PyTorch Forums.
Issue on GitHub
UPDATE
I profiled this small program using both CUDA versions.
It seems that the two use very different kernels.
96.5% of the 10.2 compute time is spent in conv2d_grouped_direct_kernel, which takes ~60-100 ms on my P1000,
whereas the top kernels in the 10.0 run are:
47.1% - cudnn::detail::implicit_convolve_sgemm (~1.5 ms)
23.1% - maxwell_scudnn_winograd_128x128_ldg1_ldg4_tile148n_nt (~0.4 ms)
8.5% - maxwell_scudnn_128x32_relu_small_nn (~0.4 ms)
So it's easy to see where the time difference comes from. Now the question is: why?
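One thing worth ruling out first (a hedged guess based on the kernel names above, not a confirmed diagnosis): the 10.0 run clearly hits cuDNN convolution kernels, while the 10.2 run uses a native grouped-convolution kernel. It may be worth checking that the CUDA 10.2 LibTorch build actually has cuDNN enabled and, if so, experimenting with cuDNN benchmarking before the warm-up call:
#include <ATen/Context.h>
// ...
std::cout << "cuDNN available: " << at::globalContext().hasCuDNN() << std::endl;
at::globalContext().setUserEnabledCuDNN(true);  // make sure cuDNN has not been disabled
at::globalContext().setBenchmarkCuDNN(true);    // let cuDNN benchmark and cache the fastest conv algorithm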

GCC Plugin, add new optimizing pragma

I'm creating a GCC plugin.
I'm trying to create a plugin for a specific loop transformation: unroll a loop exactly N times (N given as a parameter).
I have installed plugins correctly and I can successfully register my pragma in the compilation process.
When I register the pragma with c_register_pragma, I can handle it during lexical analysis (with handle_my_pragma), but how can I find it afterwards?
I can also define my own pass and traverse GIMPLE, but there is no trace of any pragma.
So my question is: where is my pragma and how can I influence my code with it?
Or what would you suggest to reach my goal? It doesn't have to be done with a pragma, but it seemed like a good idea.
Also, I know about MELT, but as part of studying GCC I would prefer a pure plugin in C.
My code
static bool looplugin_gate(void)
{
return true;
}
static unsigned looplugin_exec(void)
{
printf( "===looplugin_exec===\n" );
basic_block bb;
gimple stmt;
gimple_stmt_iterator gsi;
FOR_EACH_BB(bb)
{
for (gsi = gsi_start_bb(bb); !gsi_end_p(gsi); gsi_next(&gsi))
{
stmt = gsi_stmt(gsi);
print_gimple_stmt (stdout, stmt, 0, TDF_SLIM);
}
}
return 0;
}
void handle_my_pragma(cpp_reader *ARG_UNUSED(dummy))
{
printf ("=======Handling loopragma=======\n" );
enum cpp_ttype token;
tree x;
int num = -1;
token = pragma_lex (&x);
if (TREE_CODE (x) != INTEGER_CST)
warning (0, "invalid constant in %<#pragma looppragma%> - ignored");
num = TREE_INT_CST_LOW (x);
printf( "Detected #pragma loopragma %d\n", num );
}
static void register_my_pragma (void *event_data, void *data)
{
warning (0, G_("Callback to register pragmas"));
c_register_pragma (NULL, "loopragma", handle_my_pragma);
}
static struct opt_pass myopt_pass =
{
.type = GIMPLE_PASS,
.name = "LoopPlugin",
.gate = looplugin_gate,
.execute = looplugin_exec
};
int plugin_init(struct plugin_name_args *info, /* Argument info */
struct plugin_gcc_version *ver) /* Version of GCC */
{
const char * plugin_name = info->base_name;
struct register_pass_info pass;
pass.pass = &myopt_pass;
pass.reference_pass_name = "ssa";
pass.ref_pass_instance_number = 1;
pass.pos_op = PASS_POS_INSERT_BEFORE;
register_callback( plugin_name, PLUGIN_PRAGMAS, register_my_pragma, NULL );
register_callback( plugin_name, PLUGIN_PASS_MANAGER_SETUP, NULL, &pass );
return 0;
}
PS: If there is someone familiar with GCC plugin development who has a good heart :), please contact me (mbukovy gmail com). I'm doing this for my final thesis (my own choice) and I would welcome any kindred spirit.
When I register pragma with function c_register_pragma, I can handle it in lexical analysis (with function handle_my_pragma), but how can I find it then?
There is an option (actually, a hack): create a fictive helper function call at the place of the pragma during parsing. Then you can detect this function by name in the intermediate representation, as sketched below.
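A rough sketch of that hack (assumptions: C front end, GCC-version-specific internals, untested; __loopplugin_marker is a made-up name): in the pragma handler, build a call to an external dummy function and append it to the current statement list; your GIMPLE pass can then look for calls with that name, read the constant, mark the enclosing loop, and delete the marker statement:
static void handle_my_pragma (cpp_reader *ARG_UNUSED (dummy))
{
  tree x;
  /* expect: #pragma loopragma <N> */
  if (pragma_lex (&x) != CPP_NUMBER || TREE_CODE (x) != INTEGER_CST)
    {
      warning (0, "invalid constant in %<#pragma loopragma%> - ignored");
      return;
    }
  /* declare an external marker function: void __loopplugin_marker (int); */
  tree fntype = build_function_type_list (void_type_node, integer_type_node, NULL_TREE);
  tree decl = build_fn_decl ("__loopplugin_marker", fntype);
  /* insert the fictive call at the pragma location; looplugin_exec can later
     recognize GIMPLE_CALLs to this name */
  add_stmt (build_call_expr (decl, 1, x));
}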
Also, several days ago there was a question on the GCC mailing list from felix.yang (huawei), "How to deliver loop-related pragma information from TREE to RTL?" - http://comments.gmane.org/gmane.comp.gcc.devel/135243 - check that thread.
Some recommendations from the list:
Look at how we implement #pragma ivdep (see replace_loop_annotate ()
and fortran/trans-stmt.c where it builds ANNOTATE_EXPR).
Patch with replace_loop_annotate() function addition and ivdep pragma implementation: "Re: Patch: Add #pragma ivdep support to the ME and C FE" by Tobias Burnus (2013-08-24).
I do not think registering a DEFERRED pragma in a plugin is possible, since the handler for deferred pragmas is not exposed at the GCC plugin level.
So your pragma is only processed during the preprocessing stage instead of the parsing stage, which makes it quite tricky to achieve an optimization goal.

LLVM NVPTX backend struct parameter zero size

I'm getting an obscure exception when loading the PTX assembly generated by LLVM's NVPTX backend. (I'm loading the PTX from ManagedCuda - http://managedcuda.codeplex.com/ )
ErrorNoBinaryForGPU: This indicates that there is no kernel image available that is suitable for the device. This can occur when a user specifies code generation options for a particular CUDA source file that do not include the corresponding device configuration.
Here is the LLVM IR for the module (it's a bit weird since it's generated by a tool)
; ModuleID = 'Module'
target triple = "nvptx64-nvidia-cuda"
%testStruct = type { i32 }
define void @kernel(i32 addrspace(1)*) {
entry:
%1 = alloca %testStruct
store %testStruct zeroinitializer, %testStruct* %1
%2 = load %testStruct* %1
call void @structtest(%testStruct %2)
ret void
}
define void @structtest(%testStruct) {
entry:
ret void
}
!nvvm.annotations = !{!0}
!0 = metadata !{void (i32 addrspace(1)*)* @kernel, metadata !"kernel", i32 1}
and here is the resulting PTX
//
// Generated by LLVM NVPTX Back-End
//
.version 3.1
.target sm_20
.address_size 64
// .globl kernel
.visible .func structtest
(
.param .b0 structtest_param_0
)
;
.visible .entry kernel(
.param .u64 kernel_param_0
)
{
.local .align 8 .b8 __local_depot0[8];
.reg .b64 %SP;
.reg .b64 %SPL;
.reg .s32 %r<2>;
.reg .s64 %rl<2>;
mov.u64 %rl1, __local_depot0;
cvta.local.u64 %SP, %rl1;
mov.u32 %r1, 0;
st.u32 [%SP+0], %r1;
// Callseq Start 0
{
.reg .b32 temp_param_reg;
// <end>}
.param .align 4 .b8 param0[4];
st.param.b32 [param0+0], %r1;
call.uni
structtest,
(
param0
);
//{
}// Callseq End 0
ret;
}
// .globl structtest
.visible .func structtest(
.param .b0 structtest_param_0
)
{
ret;
}
I have no idea how to read PTX, but I have a feeling the problem has to do with the .b0 bit of .param .b0 structtest_param_0 in the structtest function definition.
Passing non-structure values (like integers or pointers) works fine, and the .b0 bit of the function reads something sane like .b32 or .b64 when doing so.
Changing the triple to nvptx-nvidia-cuda (32-bit) does nothing, nor does including/excluding the data layout suggested in http://llvm.org/docs/NVPTXUsage.html.
Is this a bug in the NVPTX backend, or am I doing something wrong?
Update:
I'm looking through this - http://llvm.org/docs/doxygen/html/NVPTXAsmPrinter_8cpp_source.html - and it appears as if the type is falling through to line 01568, is obviously not a primitive type, and Ty->getPrimitiveSizeInBits() returns zero. (At least that's my guess, anyway)
Do I need to add a special case for checking to see if it's a structure, taking the address, making the argument byval, and dereferencing the struct afterwards? That seems like a hacky solution, but I'm not sure how else to fix it.
Have you tried to get the error message buffer from compilation? In managedCuda this would be something like:
CudaContext ctx = new CudaContext();
CudaJitOptionCollection options = new CudaJitOptionCollection();
CudaJOErrorLogBuffer err = new CudaJOErrorLogBuffer(1024);
options.Add(err);
try
{
ctx.LoadModulePTX("test.ptx", options);
}
catch
{
options.UpdateValues();
MessageBox.Show(err.Value);
}
When I run your ptx it says:
ptxas application ptx input, line 12; fatal : Parsing error near '.b0': syntax error
ptxas fatal : Ptx assembly aborted due to errors
which supports your guess about .b0.