I want an in-place memory transpose of a very large matrix. I am using mkl_simatcopy, but I am observing a performance issue when transposing in place. I am running on an Intel(R) Xeon(R) CPU E7-8867 v4 @ 2.40GHz with 72 physical cores, on Red Hat.
My observation is that when I perform the transpose, only a single core is used rather than all cores. I have tried the relevant environment variables, e.g. MKL_NUM_THREADS and MKL_DYNAMIC="FALSE". My compilation script is as follows:
gcc -std=c99 -m64 -I $MKLROOT/include transpose.c \
  ${MKLROOT}/lib/intel64/libmkl_scalapack_ilp64.a -Wl,--start-group \
  ${MKLROOT}/lib/intel64/libmkl_cdft_core.a \
  ${MKLROOT}/lib/intel64/libmkl_intel_ilp64.a \
  ${MKLROOT}/lib/intel64/libmkl_tbb_thread.a \
  ${MKLROOT}/lib/intel64/libmkl_core.a \
  ${MKLROOT}/lib/intel64/libmkl_blacs_openmpi_ilp64.a -Wl,--end-group -lstdc++ -lpthread -lm -ldl -o transpose.out
The timings obtained are as follows:

Sno.  Rows    Cols    Time (s)
1     16384    8192     16
2     16384   32768     68
3     32768   65536    233
The data type is float. Please let me know if there is an efficient way to transpose in place, or how this can be made to use multiple cores.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mkl.h"

/* initalizeData(), cpuSecond(), printdata() and writeDataFile() are helper
   functions defined elsewhere in my source file. */

int main(int argc, char *argv[])
{
    unsigned long noOfScan = atol(argv[1]);
    unsigned long noOfPix  = atol(argv[2]);
    size_t nEle = noOfScan * noOfPix;
    float *data = (float *)calloc(nEle, sizeof(float));

    initalizeData(data, noOfScan, noOfPix);
    //printdata(data,noOfScan,noOfPix);
    //writeDataFile((char *)data,"BeforeTranspose.img",nEle*sizeof(float));

    printf("After transpose \n\n");
    int nt = mkl_get_max_threads();
    printf("No Of threads are = %d \n", nt);
    //mkl_set_num_threads_local(nt);
    //mkl_set_num_threads(nt);

    double time1 = cpuSecond();
    mkl_simatcopy('R', 'T', noOfScan, noOfPix, 1, data, noOfPix, noOfScan);
    printf("Time elapsed is %lf \n", cpuSecond() - time1);

    memset(data, 0, nEle * sizeof(float));
    free(data);
    return 0;
}
The answer from Intel's forum is that mkl_simatcopy does not support multithreading:
"Yes, this routine is not threaded. If you really need this routine to be threaded, please submit a feature request to the Intel Online Service Center - https://supporttickets.intel.com/"
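Since the routine runs single-threaded, one workaround (my own sketch, not from the Intel answer) is to give up on strict in-place operation and do a cache-blocked out-of-place transpose parallelized with OpenMP. This assumes a second buffer of the same size can be afforded and that a 64x64 tile is a reasonable starting point:

#include <stddef.h>
#include <omp.h>

/* Blocked out-of-place transpose: dst (cols x rows) = transpose of src (rows x cols).
   BLK is a tuning parameter; 64 floats per side is a reasonable starting point. */
#define BLK 64

void transpose_blocked(float *dst, const float *src, size_t rows, size_t cols)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t i = 0; i < rows; i += BLK)
        for (size_t j = 0; j < cols; j += BLK) {
            size_t imax = (i + BLK < rows) ? i + BLK : rows;
            size_t jmax = (j + BLK < cols) ? j + BLK : cols;
            for (size_t ii = i; ii < imax; ++ii)
                for (size_t jj = j; jj < jmax; ++jj)
                    dst[jj * rows + ii] = src[ii * cols + jj];
        }
}

Compile with -fopenmp and control the thread count with OMP_NUM_THREADS. Note the memory trade-off: for the 32768 x 65536 case the extra float buffer costs 8 GiB, which is the price of getting all 72 cores involved.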
I am a bit rusty on my CUDA skills. I am attempting to generate a 5120x5120 image that consists entirely of random noise generated via cuRAND. I am using a single NVIDIA RTX A5000, whose compute capability is 8.6. My question is: what should the grid dimensions and threads per block be to squeeze the highest amount of efficiency out of this noise generation? For context, here are some of the hardware specs of the A5000:
- 64 SMs
- 8192 CUDA cores
- 128 cores per SM
- 16 blocks per SM
- 48 active warps per SM (1536 threads)
Both for the general goal of maximizing occupancy and for CURAND generator-state initialization, you would want to choose the number of blocks and threads per block so that their product is at most 64 SMs x 1536 threads per SM. Ideally you would target exactly that number.
You would start by writing the kernel that takes generator state that has already been initialized, and runs a grid-stride loop to use random number generation to write image points.
If the occupancy analysis (probably based on registers per thread) suggests that maximum thread load per SM is possible (for that kernel that you just wrote), then you would size your grids that way.
If the occupancy analysis indicated that your kernel cannot have the full 1536 threads per SM, then you would reduce your grid launch (size) accordingly.
Since the cc8.6 SM has a maximum of 1536 threads, don't size your threadblocks at 1024 threads. Choose 512, or some other number like 256.
With CURAND, in my view, the best practice is to launch the generator state initialization kernel (separately, first) before your image update kernel. Don't try to do generator state initialization in the same kernel that is doing the image generation.
Once you figure out what the maximum occupancy is, then you will size your generator state array to match that, and you will launch a grid-stride kernel to fill that array as your CURAND init kernel.
Then launch your image creation kernel.
Here is a simple example:
$ cat t2115.cu
#include <curand_kernel.h>
#include <curand.h>

const int imageW = 5120;
const int imageH = 5120;
using mt = float;

__global__ void setup_kernel(curandState *state, size_t N, const unsigned long long seed = 1, const unsigned long long offset = 0){
    for (size_t id = blockIdx.x * blockDim.x + threadIdx.x; id < N; id += gridDim.x*blockDim.x)
        curand_init(seed, id, offset, state+id);
}

template <typename T>
__global__ void image_gen(T *img, curandState *state, const size_t img_size){
    size_t id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState s = state[id];
    for (; id < img_size; id += gridDim.x*blockDim.x)
        img[id] = curand_uniform(&s);
}

int main(){
    size_t is = ((size_t)imageW)*imageH;
    int grid_dim = 64 * 1536;
    int bs = 512;
    int gs = grid_dim/bs;
    mt *img;
    curandState *s;
    cudaMalloc(&s, sizeof(curandState)*grid_dim);
    cudaMallocManaged(&img, sizeof(mt)*is);
    setup_kernel<<<gs, bs>>>(s, grid_dim);
    image_gen<<<gs, bs>>>(img, s, is);
    cudaDeviceSynchronize();
}
$ nvcc -Xptxas=-v -arch=sm_86 -o t2115 t2115.cu
ptxas info : 218048 bytes gmem, 72 bytes cmem[3]
ptxas info : Compiling entry function '_Z9image_genIfEvPT_P17curandStateXORWOWm' for 'sm_86'
ptxas info : Function properties for _Z9image_genIfEvPT_P17curandStateXORWOWm
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 18 registers, 376 bytes cmem[0]
ptxas info : Compiling entry function '_Z12setup_kernelP17curandStateXORWOWmyy' for 'sm_86'
ptxas info : Function properties for _Z12setup_kernelP17curandStateXORWOWmyy
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 39 registers, 384 bytes cmem[0]
$
We've compiled with the -Xptxas=-v switch, which causes the compiler to report register utilization. We see that the image_gen kernel uses 18 registers per thread. This is well below the limit that allows full occupancy, so we should be fine with the indicated launch sizes and should expect full occupancy. The sm_86 SM supports up to 65536 registers, so across 1536 threads this implies a limit of about 65536/1536 = 42 registers per thread. If we had a number larger than that, we would reduce the grid_dim variable accordingly.
This can be largely "automated" with the occupancy API and launch bounds functionality, but these basic ideas should be understood first.
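As a rough illustration of that automation (my own sketch, not part of the original answer), here is a helper that could be dropped into t2115.cu after the kernel definitions; it uses cudaOccupancyMaxActiveBlocksPerMultiprocessor to derive the grid size at runtime instead of hard-coding 64 x 1536:

// Query how many blocks of `bs` threads the occupancy calculator says can be
// resident per SM for image_gen<mt>, then size the grid to fill the device.
int occupancy_sized_grid(int bs)
{
    int dev = 0, numSMs = 0, blocksPerSM = 0;
    cudaGetDevice(&dev);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, dev);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, image_gen<mt>, bs, 0);
    return blocksPerSM * numSMs;   // number of blocks to launch
}

In main you would then compute gs = occupancy_sized_grid(bs) and grid_dim = gs * bs; the same value also sizes the curandState array, so the code adapts automatically if register pressure ever drops the kernel below full occupancy.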
In the course of optimising an inner loop I have come across strange performance behaviour that I'm having trouble understanding and correcting.
A pared-down version of the code follows; roughly speaking, there is one gigantic array which is divided up into 16-word chunks, and I simply add up the number of leading zeros of the words in each chunk. (In reality I'm using the popcnt code from Dan Luu, but here I picked a simpler instruction with similar performance characteristics for "brevity". Dan Luu's code is based on an answer to this SO question which, while it has tantalisingly similar strange results, does not seem to answer my questions here.)
// -*- compile-command: "gcc -O3 -march=native -Wall -Wextra -std=c99 -o clz-timing clz-timing.c" -*-
#include <stdint.h>
#include <time.h>
#include <stdlib.h>
#include <stdio.h>

#define ARRAY_LEN 16

// Return the sum of the leading zeros of each element of the ARRAY_LEN
// words starting at u.
static inline uint64_t clz_array(const uint64_t u[ARRAY_LEN]) {
    uint64_t c0 = 0;
    for (int i = 0; i < ARRAY_LEN; ++i) {
        uint64_t t0;
        __asm__ ("lzcnt %1, %0" : "=r"(t0) : "r"(u[i]));
        c0 += t0;
    }
    return c0;
}

// For each of the narrays blocks of ARRAY_LEN words starting at
// arrays, put the result of clz_array(arrays + i*ARRAY_LEN) in
// counts[i]. Return the time taken in milliseconds.
double clz_arrays(uint32_t *counts, const uint64_t *arrays, int narrays) {
    clock_t t = clock();
    for (int i = 0; i < narrays; ++i, arrays += ARRAY_LEN)
        counts[i] = clz_array(arrays);
    t = clock() - t;
    // Convert clock time to milliseconds
    return t * 1e3 / (double)CLOCKS_PER_SEC;
}

void print_stats(double t_ms, long n, double total_MiB) {
    double t_s = t_ms / 1e3, thru = (n/1e6) / t_s, band = total_MiB / t_s;
    printf("Time: %7.2f ms, %7.2f x 1e6 clz/s, %8.1f MiB/s\n", t_ms, thru, band);
}

int main(int argc, char *argv[]) {
    long n = 1 << 20;
    if (argc > 1)
        n = atol(argv[1]);
    long total_bytes = n * ARRAY_LEN * sizeof(uint64_t);
    uint64_t *buf = malloc(total_bytes);
    uint32_t *counts = malloc(sizeof(uint32_t) * n);
    double t_ms, total_MiB = total_bytes / (double)(1 << 20);
    printf("Total size: %.1f MiB\n", total_MiB);

    // Warm up
    t_ms = clz_arrays(counts, buf, n);
    //print_stats(t_ms, n, total_MiB); // (1)

    // Run it
    t_ms = clz_arrays(counts, buf, n); // (2)
    print_stats(t_ms, n, total_MiB);

    // Write something into buf
    for (long i = 0; i < n*ARRAY_LEN; ++i)
        buf[i] = i;

    // And again...
    (void) clz_arrays(counts, buf, n); // (3)
    t_ms = clz_arrays(counts, buf, n); // (4)
    print_stats(t_ms, n, total_MiB);

    free(counts);
    free(buf);
    return 0;
}
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory.
Here is the result of a typical run (compiler command is at the beginning of the source):
$ ./clz-timing 10000000
Total size: 1220.7 MiB
Time: 47.78 ms, 209.30 x 1e6 clz/s, 25548.9 MiB/s
Time: 77.41 ms, 129.19 x 1e6 clz/s, 15769.7 MiB/s
The CPU on which this was run is an "Intel(R) Core(TM) i7-6700HQ CPU @ 2.60GHz" which has a turbo boost of 3.5GHz. The latency of the lzcnt instruction is 3 cycles but it has a throughput of 1 operation per cycle (see Agner Fog's Skylake instruction tables) so, with 8-byte words (using uint64_t) at 3.5GHz the peak bandwidth should be 3.5e9 cycles/sec x 8 bytes/cycle = 28.0 GB/s (about 26.1 GiB/s), which is pretty close to what we see in the first number. Even at 2.6GHz we should get close to 20.8 GB/s.
The main question I have is,
Why is the bandwidth of call (4) always so far below the optimal value(s) obtained in call (2) and what can I do to guarantee optimal performance under a majority of circumstances?
Some points regarding what I've found so far:
According to extensive analysis with perf, the problem seems to be caused by LLC cache load misses in the slow cases that don't appear in the fast case. My guess was that maybe the fact that the memory on which we're performing the calculation hadn't been initialised meant that the compiler didn't feel obliged to load any particular values into memory, but the output of objdump -d clearly shows that the same code is being run each time. It's as though the hardware prefetcher was active the first time but not the second time, but in every case this array should be the easiest thing in the world to prefetch reliably.
The "warm up" calls at (1) and (3) are consistently as slow as the second printed bandwidth corresponding to call (4).
I've obtained much the same results on my desktop machine ("Intel(R) Xeon(R) CPU E5-2620 v3 # 2.40GHz").
Results were essentially the same between GCC 4.9, 7.0 and Clang 4.0. All tests run on Debian testing, kernel 4.14.
All of these results and observations can also be obtained with clz_array replaced by builtin_popcnt_unrolled_errata_manual from the Dan Luu post, mutatis mutandis.
Any help would be most appreciated!
The slightly peculiar thing about the code above is that the first and second times I call the clz_arrays function it is on uninitialised memory
Uninitialized memory that malloc gets from the kernel with mmap is all initially copy-on-write mapped to the same physical page of all zeros.
So you get TLB misses but not cache misses. If it used a 4k page, then you get L1D hits. If it used a 2M hugepage, then you only get L3 (LLC) hits, but that's still significantly better bandwidth than DRAM.
Single-core memory bandwidth is often limited by max_concurrency / latency, and often can't saturate DRAM bandwidth. (See Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?, and the "latency-bound platforms" section of this answer for more about this; it's much worse on many-core Xeon chips than on quad-core desktops/laptops.)
Your first warm-up run will suffer from page faults as well as TLB misses. Also, on a kernel with Meltdown mitigation enabled, any system call will flush the whole TLB. If you were adding extra print_stats to show the warm-up run performance, that would have made the run after slower.
You might want to loop multiple times over the same memory inside a timing run, so you don't need so many page-walks from touching so much virtual address space.
clock() is not a great way to measure performance. It records CPU time in CLOCKS_PER_SEC ticks, not core clock cycles. If you run your benchmark long enough you don't need really high precision, but you would need to control for CPU frequency to get accurate results. Calling clock() probably results in a system call, which (with Meltdown and Spectre mitigation enabled) flushes TLBs and branch prediction. It may be slow enough for Skylake to clock back down from max turbo. You don't do any warm-up work after that, and of course you can't, because anything after the first clock() is inside the timed interval.
Something based on wall-clock time which can use RDTSC as a time source instead of switching to kernel mode (like gettimeofday()) would be lower overhead, although then you'd be measuring wall-clock time instead of CPU time. That's basically equivalent if the machine is otherwise idle so your process doesn't get descheduled.
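As one possible shape for that (my sketch, not part of the original answer), here is a wall-clock timer built on clock_gettime(CLOCK_MONOTONIC), wrapped around several passes over the same buffer so page faults and TLB warm-up only hit the first pass. It assumes it sits in the same file as clz_array above and is compiled with -std=gnu99 (or with _POSIX_C_SOURCE defined) so clock_gettime is declared:

#include <stddef.h>
#include <stdint.h>
#include <time.h>

// Wall-clock timestamp in seconds; CLOCK_MONOTONIC is served from the vDSO
// on Linux, so the common path avoids an actual kernel transition.
static double wall_seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// Time `reps` passes over the same arrays and report the best pass in ms.
double clz_arrays_best_ms(uint32_t *counts, const uint64_t *arrays,
                          int narrays, int reps) {
    double best_ms = 1e30;
    for (int r = 0; r < reps; ++r) {
        double t0 = wall_seconds();
        for (int i = 0; i < narrays; ++i)
            counts[i] = clz_array(arrays + (size_t)i * ARRAY_LEN);
        double t_ms = (wall_seconds() - t0) * 1e3;
        if (t_ms < best_ms) best_ms = t_ms;
    }
    return best_ms;
}

Reporting the best of several passes is a common way to see the warmed-up speed; the spread between the best and worst pass then tells you how much of the slowdown is page-walk and frequency-ramp noise rather than the steady-state cost of the loop.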
For something that wasn't memory-bound, CPU performance counters to count core clock cycles can be very accurate, and without the inconvenience of having to control for CPU frequency. (Although these days you don't have to reboot to temporarily disable turbo and set the governor to performance.)
But with memory-bound stuff, changing core frequency changes the ratio of core to memory, making memory faster or slower relative to the CPU.
I am getting an unknown error in my CUDA program and it seems to be related to the atomicAdd function. I am coding on Windows in Visual Studio 2015. My calling code is as follows:
int regionWidth=32;
int regionHeight=32;
dim3 gridSize(765,765);
dim3 blockSize(regionWidth, regionHeight);
cudaMalloc((void **)&dev_count, sizeof(int));
count = 0;
cudaMemcpy(dev_count, &count, sizeof(int), cudaMemcpyHostToDevice);
crashFN << < gridSize, blockSize >> > (regionWidth, regionHeight, dev_count);
cudaMemcpy(&count, dev_count, sizeof(int), cudaMemcpyDeviceToHost);
printf("total number of threads that executed was: %d vs. %d called -> %s\n", count, gridSize.x*gridSize.y*blockSize.x*blockSize.y, (count==gridSize.x*gridSize.y*blockSize.x*blockSize.y)?"ok":"error");
then my global kernel function is
__global__
void crashFN(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;
    shared_sum = 0;
    sumGlobal(regionWidth, regionHeight, &shared_sum);
    atomicAdd(ct, 1);
}
with sumGlobal defined as
__device__
void sumGlobal(int regionWidth, int regionHeight, int* global_sum)
{
    // sum in nested loop
    for (int y = 0; y < regionHeight; y++)
        for (int x = 0; x < regionWidth; x++)
            atomicAdd(global_sum, 1);
}
The build output from the program is the following
1> H:\GPU\GPU_PROJECT_HZDR\targeterConsole>"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\bin\nvcc.exe"
   -gencode=arch=compute_50,code=\"sm_50,compute_50\" --use-local-env --cl-version 2015
   -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\bin\x86_amd64"
   -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include"
   -I"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v8.0\include"
   --keep-dir x64\Release -maxrregcount=0 --machine 64 --compile -cudart static
   -DWIN32 -DWIN64 -DNDEBUG -D_CONSOLE -D_MBCS -Xcompiler "/EHsc /W3 /nologo /O2 /FS /Zi /MD "
   -o x64\Release\targetDetectionGPU.cu.obj "H:\GPU\GPU_PROJECT_HZDR\targetDetectionGPU.cu"
It's a standard NVIDIA CUDA console project; I only changed the arch to sm_50,compute_50.
My program's output is the following (with debug information):
sharedMemBytes=36864
regionWidth=32 regionHeight=32 coDIMX=16 coDIMY=16 coDIMZ=32
gridSize.x=765 gridSize.y=765 blockSize.x=32 blockSize.y=32
There is 1 device supporting CUDA
Device 0: "GeForce GTX 1050 Ti"
CUDA Driver Version: 9.0
CUDA Runtime Version: 8.0
CUDA Capability Major revision number: 6
CUDA Capability Minor revision number: 1
Total amount of global memory: 0 bytes
Number of multiprocessors: 6
Number of cores: 288
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.39 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host
threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime
Version = 8.0, NumDevs = 1, Device = GeForce GTX 1050 Ti
Requested resources: gridSize.x=765 gridSize.y=765 blockSize.x=32
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 0 vs. 599270400 called -> error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 558 CUDA Runtime API
error (30): unknown error
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 573 CUDA Runtime API
error (30): unknown error
finshed cuda algorithm
With smaller grid sizes it seems to work better, so when I instead choose a 764 x 764 grid size I get
Requested resources: gridSize.x=764 gridSize.y=764 blockSize.x=32
blockSize.y=32 sharedMemory=36 MB
total number of threads that executed was: 597704704 vs. 597704704 called ->
ok
file=H:/GPU/GPU_PROJECT_HZDR/targetDetectionGPU.cu line 574 CUDA Runtime API
error (30): unknown error
With 750 x 750 the error was gone; with 760 x 760 the error was back.
The device specification allows much larger grid sizes than 765, or am I missing something here? I am not sure why a simple atomicAdd in a nested loop should cause these errors. Is it a bug?
OK, I have simplified the kernel now: I removed the function call and combined the loops into one, but the error still occurs at larger grid sizes. If I comment out the loop, it runs OK.
__global__
void crashFN(int regionWidth, int regionHeight, int* ct)
{
    __shared__ int shared_sum;
    shared_sum = 0;
    __syncthreads();

    for (int y = 0; y < regionHeight*regionWidth; y++)
        atomicAdd(&shared_sum, 1);
    __syncthreads();

    atomicAdd(ct, 1);
}
if I shorten the loop to
for (int y = 0; y < regionHeight; y++)
    atomicAdd(&shared_sum, 1);
then it works OK. It seems like a timeout issue, which is strange because I set the WDDM TDR timeout to 10 seconds with the Nsight Monitor.
If you get an "error (30): unknown error", suspect a TDR timeout, especially on Windows. Basically my test program was taking too long in the loops and causing a timeout. This is particularly the case when you are debugging using printf statements!
The solution is to increase the timeout by changing the TDR setting to more like 30 seconds or so; increasing this value is not a problem when you are not using the GPU card for the main display. With the TDR value increased, you can see more clearly that it is your program taking too long and not something else. Try to improve the code by removing loops, especially those containing atomic operations, or restructure it to use techniques like reduction.
http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf
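For illustration (my own sketch, not from the original answer), the per-thread atomics can be replaced by a shared-memory tree reduction so that each block issues a single global atomicAdd; the count seen by the host is still the total number of threads, but the kernel no longer spends its time serialised on one shared counter. It assumes the 32 x 32 block size from the question (a power of two, at most 1024 threads):

__global__
void countFN(int* ct)
{
    __shared__ int partial[1024];              // assumes blockDim.x*blockDim.y <= 1024
    int tid = threadIdx.y * blockDim.x + threadIdx.x;
    int nthreads = blockDim.x * blockDim.y;    // assumed to be a power of two

    partial[tid] = 1;                          // each thread contributes 1, as before
    __syncthreads();

    // Standard shared-memory tree reduction over the block.
    for (int s = nthreads / 2; s > 0; s >>= 1) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // One global atomic per block instead of one per thread.
    if (tid == 0)
        atomicAdd(ct, partial[0]);
}

Launched as countFN<<<gridSize, blockSize>>>(dev_count), the host-side check count == gridSize.x*gridSize.y*blockSize.x*blockSize.y still holds, and the kernel finishes far inside any reasonable TDR limit.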
Here is my OpenCL code:
#include <iostream>
#include <vector>
#include <cmath>
#include <cstdio>
#include <CL/cl.hpp>

int main(){
    std::vector<cl::Platform> all_platforms;
    cl::Platform::get(&all_platforms);
    cl::Platform default_platform = all_platforms[0];

    std::vector<cl::Device> all_devices;
    default_platform.getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
    cl::Device default_device = all_devices[0];
    std::cout << "Using device: " << default_device.getInfo<CL_DEVICE_NAME>() << "\n";

    cl_context_properties properties[] = { CL_CONTEXT_PLATFORM, (cl_context_properties)(default_platform)(), 0};
    cl::Context context = cl::Context(CL_DEVICE_TYPE_ALL, properties);

    cl::Program::Sources sources;
    std::string kernel_code =
        " void __kernel simple_tanh(__global const float *A, __global float *B){ "
        "     B[get_global_id(0)] = tanh(A[get_global_id(0)]); "
        " } ";
    sources.push_back({kernel_code.c_str(), kernel_code.length()});

    cl::Program program(context, sources);
    if(program.build({default_device}) != CL_SUCCESS){
        std::cout << " Error building: " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(default_device) << "\n";
        exit(1);
    }

    cl::Buffer buffer_A(context, CL_MEM_READ_WRITE, sizeof(float));
    cl::Buffer buffer_B(context, CL_MEM_READ_WRITE, sizeof(float));

    float A[1]; A[0] = 0.0595172755420207977294921875000000000000f;

    cl::CommandQueue queue(context, default_device);
    queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(float), A);
    queue.finish();

    cl::Kernel kernel = cl::Kernel(program, "simple_tanh");
    kernel.setArg(0, buffer_A);
    kernel.setArg(1, buffer_B);
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1), cl::NullRange);
    queue.finish();

    float B[1];
    queue.enqueueReadBuffer(buffer_B, CL_TRUE, 0, sizeof(float), B);
    printf("result: %.40f %.40f\n", tanh(A[0]), B[0]);
    return 0;
}
After I compile with g++ -std=c++0x hello.cc -lOpenCL -o hello and run it, I get different results from the tanh function.
Using device: Tahiti
result: 0.0594470988394579374913817559900053311139 0.0594470985233783721923828125000000000000
The first is the CPU result and the second is from the OpenCL kernel. Which one should I trust?
When a kernel cannot be vectorized by the OpenCL compiler, the generated instructions may be scalar; the x87 FPU then computes with 80-bit precision. SSE precision is more comparable to a GPU's, so you need float4 or float8 in your kernel so that the compiler can produce SSE/AVX code, whose precision is closer to the GPU's.
Generally Intel's OpenCL compiler vectorizes better (for some older CPUs at least). Which implementation are you using? There can be differences even between GPUs, but they all obey the rule of not exceeding the ULP limit. If you need more precision on the GPU (and with SSE/AVX), why not write your own series-expansion function? It would make learning slower, but still faster than a single FPU at least.
What is your CPU? Which OpenCL platform are you using? Did you check the code generated for the kernel with a profiler or kernel analyzer?
Above all, you shouldn't do this:
cl::NDRange(1)
unless it's for learning purposes. This will be 99% kernel launch overhead, 1% data copy overhead and close to zero compute latency. Maybe that's why it's using the 80-bit FPU instead of SSE (on the CPU). Try computing with multiple-of-8 NDRange values, or use float8 types in the kernel to let the compiler emit vectorized instructions.
When the global NDRange value is in the millions, it will have a significant effect on learning time, not on the number of learning iterations needed. If the CPU can finish learning in 1 day with 1M iterations, maybe the GPU can finish it in 1 hour even if it needs 10M iterations. Transcendental functions have a high compute-to-data ratio, so the speed-up over the CPU is higher the more of them you use.
If you derive your own series-expansion function to achieve more precision, it would still be much faster than a single CPU core in this embarrassingly parallel kernel code.
If the neural network has only a few neurons, then maybe you can train N networks at the same time and pick the best learner (if learning has any randomization), so it produces even better results than the CPU?
I am trying to learn CUDA and use it efficiently. I found some code on NVIDIA's website which shows how to determine the block size we should use for the most efficient use of the device. The code is as follows:
#include <iostream>

// Device code
__global__ void MyKernel(int *d, int *a, int *b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}

// Host code
int main()
{
    int numBlocks;       // Occupancy in terms of active blocks
    int blockSize = 32;

    // These variables are used to convert occupancy to warps
    int device;
    cudaDeviceProp prop;
    int activeWarps;
    int maxWarps;

    cudaGetDevice(&device);
    cudaGetDeviceProperties(&prop, device);

    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks,
        MyKernel,
        blockSize,
        0);

    activeWarps = numBlocks * blockSize / prop.warpSize;
    maxWarps = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    std::cout << "Occupancy: " << (double)activeWarps / maxWarps * 100 << "%" << std::endl;

    return 0;
}
However, when I compile it, I get the following error.
Compile line:
nvcc ben_deneme2.cu -arch=sm_35 -rdc=true -lcublas -lcublas_device -lcudadevrt -o my
Error :
ben_deneme2.cu(25): error: identifier "cudaOccupancyMaxActiveBlocksPerMultiprocessor" is undefined
1 error detected in the compilation of "/tmp/tmpxft_0000623d_00000000-8_ben_deneme2.cpp1.ii".
Should I include a library for this? I could not find a library name for it on the internet. Or am I doing something else wrong?
Thanks in advance
The cudaOccupancyMaxActiveBlocksPerMultiprocessor function was introduced in CUDA 6.5. You do not have access to that function if you have an earlier version of CUDA installed; for example, it will not work with CUDA 5.5.
If you want to use that function, you must update your CUDA version to at least 6.5.
People using older versions usually use the CUDA Occupancy Calculator.
One common heuristic used to choose a good block size is to aim for high occupancy, which is the ratio of the number of active warps per multiprocessor to the maximum number of warps that can be active on the multiprocessor at once. -- CUDA Pro Tip: Occupancy API Simplifies Launch Configuration
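As a small follow-on (my own sketch, not part of the quoted answer), once you are on CUDA 6.5 or newer, the companion call cudaOccupancyMaxPotentialBlockSize can suggest a block size directly rather than just scoring one you picked. A minimal, self-contained example reusing the MyKernel from the question:

#include <cstdio>

__global__ void MyKernel(int *d, int *a, int *b)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    d[idx] = a[idx] * b[idx];
}

int main()
{
    int minGridSize = 0;   // minimum grid size needed to reach full occupancy
    int blockSize = 0;     // suggested block size

    // Ask the runtime for an occupancy-maximizing block size for MyKernel
    // (0 bytes of dynamic shared memory, no block-size upper limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, MyKernel, 0, 0);

    printf("suggested block size: %d, minimum grid size: %d\n",
           blockSize, minGridSize);
    return 0;
}

You would then round your problem size up to a whole number of blocks of that size when launching, and fall back to the Occupancy Calculator spreadsheet only on toolkits older than 6.5.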