I'm writing a function that does a lot of BLAS gemv operations.
I would like to be able to do this on the GPU, and I've tried with cuBLAS.
My problem is that my matrices and vectors are rather small: 100x100 matrices and 100-element vectors. cuBLAS takes ages compared to the CPU, and I can see why: a mixture of fast cache on the CPU and the large overhead of making calls to the GPU.
Therefore I'm trying to figure out a smart way of measuring the time it takes to communicate the call to the GPU.
That is, the time it takes CUDA to set up the call and send it to the graphics processor -- not counting the time it actually takes to do the matrix-vector multiplication.
How would I go about doing this?
Update: The following results are for a hand-written FFT GPU algorithm on 2005 hardware (nVidia 7800 GTX), but they show the principle of CPU-GPU transfer bottlenecks.
The overhead is not the call per-se but compilation of the GPU program and transferring the data between the GPU and the host. The CPU is highly optimized for functions that can be performed entirely in cache and the latency of DDR3 memory is far lower than the PCI-Express bus which services the GPU. I have experienced this myself when writing GPU FFT routines (prior to CUDA). Please see this related question.
N          FFTW (ms)   GPUFFT (ms)   GPUFFT MFLOPS   GPUFFT Speedup
8          0           0.06          3.352705        0.006881
16         0.001       0.065         7.882117        0.010217
32         0.001       0.075         17.10887        0.014695
64         0.002       0.085         36.080118       0.026744
128        0.004       0.093         76.724324       0.040122
256        0.007       0.107         153.739856      0.066754
512        0.015       0.115         320.200892      0.134614
1024       0.034       0.125         657.735381      0.270512
2048       0.076       0.156         1155.151507     0.484331
4096       0.173       0.215         1834.212989     0.804558
8192       0.483       0.32          2664.042421     1.510011
16384      1.363       0.605         3035.4551       2.255411
32768      3.168       1.14          3450.455808     2.780041
65536      8.694       2.464         3404.628083     3.528726
131072     15.363      5.027         3545.850483     3.05604
262144     33.223      12.513        3016.885246     2.655183
524288     72.918      25.879        3079.443664     2.817667
1048576    173.043     76.537        2192.056517     2.260904
2097152    331.553     157.427       2238.01491      2.106081
4194304    801.544     430.518       1715.573229     1.861814
The table above shows timings of a GPU FFT implementation vs CPU implementation based on kernel size. For smaller sizes, the transfer of data to/from the GPU dominates. Smaller kernels can be performed on the CPU, some implementations/sizes entirely in the cache. This makes the CPU the best choice for small operations.
If on the other hand you need to perform large batches of work on data with minimal moves to/from the GPU then the GPU will beat the CPU hands down.
As far as measuring the effect in your example goes, I would suggest performing an experiment like the one above. Try to work out the FLOPS computed for each matrix size and run the test on the CPU and the GPU for varying sizes. Output the size, time and FLOPS for GPU vs CPU to a CSV file. For any profiling, make sure you run several hundred iterations of your code and time the whole thing, then divide the total time by the number of iterations to get the per-loop time. Also try differently shaped matrices if your algorithm allows it (e.g. 10x100 rather than 100x10).
Using this data you can get a feel for what the overheads are. To find out exactly repeat the same experiment but replace the inner shader code executed on the GPU with no-operation (simply copy from input to output).
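To make that concrete, here is a minimal sketch of the CPU side of such an experiment (my own illustration, function and variable names made up): it times many gemv calls per size and prints CSV rows of size, time per call and MFLOPS. The GPU column would come from wrapping the same harness around the cuBLAS call, including its host-device transfers.
#include <chrono>
#include <cstdio>
#include <vector>

// Plain CPU gemv: y = A * x for an n x n row-major matrix.
static void cpu_gemv(int n, const std::vector<float>& A,
                     const std::vector<float>& x, std::vector<float>& y)
{
    for (int i = 0; i < n; ++i) {
        float sum = 0.0f;
        for (int j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}

int main()
{
    std::printf("n,ms_per_call,mflops\n");
    for (int n = 16; n <= 1024; n *= 2) {
        std::vector<float> A(n * n, 1.0f), x(n, 1.0f), y(n, 0.0f);
        const int iterations = 1000;                     // time many calls, then divide
        auto t0 = std::chrono::steady_clock::now();
        for (int it = 0; it < iterations; ++it)
            cpu_gemv(n, A, x, y);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count() / iterations;
        double mflops = 2.0 * n * n / (ms * 1.0e3);      // gemv does ~2*n*n flops
        std::printf("%d,%f,%f\n", n, ms, mflops);
    }
    return 0;
}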
Hope this helps,
You can get the time in nanoseconds from the device when an event was queued, submitted, started, and finished by using clGetEventProfilingInfo on your buffer transfer event.
More info, and how to set it up, is here: http://www.khronos.org/registry/cl/sdk/1.0/docs/man/xhtml/clGetEventProfilingInfo.html
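A rough sketch of what that looks like in host code (my own example; transfer_event is assumed to come from a transfer enqueued on a command queue created with the CL_QUEUE_PROFILING_ENABLE property):
#include <CL/cl.h>
#include <cstdio>

// Print queued/submit/start/end timestamps for a buffer transfer event.
// All values are device timestamps in nanoseconds.
void print_transfer_timing(cl_event transfer_event)
{
    cl_ulong queued = 0, submitted = 0, started = 0, ended = 0;
    clWaitForEvents(1, &transfer_event);   // make sure the command has completed
    clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_QUEUED,
                            sizeof(queued), &queued, nullptr);
    clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_SUBMIT,
                            sizeof(submitted), &submitted, nullptr);
    clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_START,
                            sizeof(started), &started, nullptr);
    clGetEventProfilingInfo(transfer_event, CL_PROFILING_COMMAND_END,
                            sizeof(ended), &ended, nullptr);
    std::printf("queue->start (overhead): %llu ns, start->end (transfer): %llu ns\n",
                (unsigned long long)(started - queued),
                (unsigned long long)(ended - started));
}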
I think that for 100x100 matrices you may be better off sticking to the CPU for the crunching. Unless you have many to multiply at the same time, the benefit of the GPU will be hardly noticeable due to the (small) transfer overhead and the usually much lower clock speeds. Make sure you tweak your kernel to use as much of the local data as possible - on my hardware, there is 32KB per work group, which should be plenty to hold two 100x100 matrices. The built-in dot product functions should also be very handy.
There was an awesome talk about this at AFDS last year (see sessionId: 2908):
http://developer.amd.com/afds/pages/OLD/sessions.aspx
They talk in detail about optimizing the kernel, and hard-coding the optimal sizes.
Are your matrices already on the GPU?
If not, CUBLAS might transfer them for you (known as thunking), which is an additional overhead.
Also, GPUs do not really shine for such small computations, i.e. they will probably be slower than the CPU since you also have to transfer your result back.
If you can, use bigger matrices.
Otherwise you might want to use streams (cudaStream_t) to start multiple parallel computations on the GPU.
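As a hedged sketch of the streams idea (illustrative names only; the device buffers d_A, d_x, d_y are assumed to be allocated and filled already), each small gemv is issued on its own stream so the launches can overlap:
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Launch nBatches independent n x n gemv operations, one per stream.
void batched_gemv(cublasHandle_t handle, int nBatches, int n,
                  float* const* d_A, float* const* d_x, float* const* d_y)
{
    const float alpha = 1.0f, beta = 0.0f;
    std::vector<cudaStream_t> streams(nBatches);
    for (int i = 0; i < nBatches; ++i)
        cudaStreamCreate(&streams[i]);

    for (int i = 0; i < nBatches; ++i) {
        cublasSetStream(handle, streams[i]);   // route the next call to stream i
        cublasSgemv(handle, CUBLAS_OP_N, n, n, &alpha,
                    d_A[i], n, d_x[i], 1, &beta, d_y[i], 1);
    }
    cudaDeviceSynchronize();                   // wait for all streams to finish

    for (int i = 0; i < nBatches; ++i)
        cudaStreamDestroy(streams[i]);
}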
If you want to measure the execution time of a kernel in CUDA, you need to enclose it (or anything else that computes on the GPU) in events, like this when using the CUDA runtime API:
cudaEvent_t start, stop;
cudaEventCreate(&start);    // events must be created before they are recorded
cudaEventCreate(&stop);
cudaEventRecord(start, 0);  // record into the default stream
struct timeval cpuStart, cpuEnd;
gettimeofday(&cpuStart, 0); // get start time on CPU
// Do something with CUDA on the GPU, e.g. call kernels, transfer memory, ...
gettimeofday(&cpuEnd, 0);   // get end time on CPU
double seconds = cpuEnd.tv_sec - cpuStart.tv_sec;
double microseconds = cpuEnd.tv_usec - cpuStart.tv_usec;
double cpuDuration = (seconds * 1.0e6 + microseconds) / 1.0e3; // in milliseconds
cudaEventRecord(stop, 0);
// Wait until the stop event has occurred
cudaError_t eventResult;
do
{
    eventResult = cudaEventQuery(stop);
}
while (eventResult == cudaErrorNotReady);
// Assert there was no error; check the CUDA Toolkit Reference for further info
assert(cudaSuccess == eventResult); // requires #include <assert.h> or <cassert>
// Retrieve the elapsed GPU time
float gpuDuration = 0.0f; // in milliseconds
cudaEventElapsedTime(&gpuDuration, start, stop);
// Release the event objects
cudaEventDestroy(stop);
cudaEventDestroy(start);
You might want to check the error code of every call to CUDA (at least with an assert), as you may get errors from previous calls, resulting in hours of debugging...
(Note: I mostly use the CUDA driver API, so this might not work out of the box. Sorry for that.)
EDIT: Just saw that you want to measure the call itself, not the duration of the kernel.
You can do that by simply measuring the time on the CPU for the call - see the updated code above.
This works only on Linux because gettimeofday is not available for Windows (AFAIK).
To find the call overhead, call a CUDA kernel that does as little as possible.
struct timeval cpuStart, cpuEnd;
double totalMicroseconds = 0.0;
for (int i = 0; i < NLoops; i++) {
    gettimeofday(&cpuStart, 0); // get start time on CPU
    // Call minimal CUDA kernel here
    gettimeofday(&cpuEnd, 0);   // get end time on CPU
    totalMicroseconds += (cpuEnd.tv_sec - cpuStart.tv_sec) * 1.0e6
                       + (cpuEnd.tv_usec - cpuStart.tv_usec); // save elapsed time
}
double averageCallOverheadUs = totalMicroseconds / NLoops; // microseconds per call
Follow the code of Alex P. above.
The less processing you do in the kernel, the more the time difference will be only the call overhead.
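For instance, a do-nothing kernel is a reasonable stand-in (sketch; the name noopKernel is made up):
__global__ void noopKernel() { }   // does no work; only the launch cost remains

// inside the timing loop above:
noopKernel<<<1, 1>>>();
// Add cudaDeviceSynchronize() here only if you want launch plus completion;
// leave it out to measure just the asynchronous launch overhead.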
Do a little experimenting to find a good value for NLoops (maybe 1,000,000). Be sure that the elapsed time is longer than the interval of your timer, or you'll end up with all zeros. If that happens, write some kernel code that executes in a fixed time interval that you can predict: (n loops of x cycles each).
It's hard to remove all the non-CUDA computations that might occur between cpuStart and cpuEnd (like interrupt processing), but making several runs and averaging can give good results.
I'm running a simple kernel which adds two streams of double-precision complex values. I've parallelized it using OpenMP with custom scheduling: the slice_indices container contains different indices for different threads.
for (const auto& index : slice_indices)
{
auto* tens1_data_stream = tens1.get_slice_data(index);
const auto* tens2_data_stream = tens2.get_slice_data(index);
#pragma omp simd safelen(8)
for (auto d_index = std::size_t{}; d_index < tens1.get_slice_size(); ++d_index)
{
tens1_data_stream[d_index].real += tens2_data_stream[d_index].real;
tens1_data_stream[d_index].imag += tens2_data_stream[d_index].imag;
}
}
The target computer has an Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz with 24 cores, 32 KB L1 cache, 1 MB L2 cache and 33 MB L3 cache. The total memory bandwidth is 115 GB/s.
The following is how my code scales with problem size S = N x N x N.
Can anybody tell me with the information I've provided if:
it's scaling well, and/or
how I could go about finding out if it's utilizing all the resources which are available to it?
Thanks in advance.
EDIT:
Now I've plotted the performance in GFLOP/s with 24 cores and 48 cores (two NUMA nodes with the same processor). It looks like this:
And now the strong and weak scaling plots:
Note: I've measured the bandwidth and it turns out to be 105 GB/s.
Question: the meaning of the weird peak at 6 threads / problem size 90x90x90x16 B in the weak scaling plot is not obvious to me. Can anybody clear this up?
Your graph has roughly the right shape: tiny arrays should fit in the L1 cache, and therefore get very high performance. Arrays of a megabyte or so fit in L2 and get lower performance, beyond that you should stream from memory and get low performance. So the relation between problem size and runtime should indeed get steeper with increasing size. However, the resulting graph (btw, ops/sec is more common than mere runtime) should have a stepwise structure as you hit successive cache boundaries. I'd say you don't have enough data points to demonstrate this.
Also, typically you would repeat each "experiment" several times to 1. even out statistical hiccups and 2. make sure that data is indeed in cache.
Since you tagged this "openmp" you should also explore taking a given array size, and varying the core count. You should then get a more or less linear increase in performance, until the processor does not have enough bandwidth to sustain all the cores.
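A sketch of that experiment (my own illustration, not your actual code, using a plain complex add similar to yours): fix the array size, sweep the thread count and print the achieved bandwidth.
#include <chrono>
#include <complex>
#include <cstdio>
#include <omp.h>
#include <vector>

int main()
{
    const long long n = 1LL << 26;   // fixed problem size, ~1 GiB of complex<double> per array
    std::vector<std::complex<double>> a(n, {1.0, 1.0}), b(n, {2.0, -1.0});

    for (int threads = 1; threads <= omp_get_max_threads(); threads *= 2) {
        omp_set_num_threads(threads);

        // one untimed pass to warm up pages and caches
        #pragma omp parallel for
        for (long long i = 0; i < n; ++i)
            a[i] += b[i];

        auto t0 = std::chrono::steady_clock::now();
        #pragma omp parallel for
        for (long long i = 0; i < n; ++i)
            a[i] += b[i];
        auto t1 = std::chrono::steady_clock::now();

        double s = std::chrono::duration<double>(t1 - t0).count();
        // 48 bytes move per element: read a, read b, write a (16 bytes each)
        std::printf("%2d threads: %.1f GB/s\n", threads, 48.0 * n / s / 1.0e9);
    }
    return 0;
}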
A commenter brought up the concepts of strong/weak scaling. Strong scaling means: given a certain problem size, use more and more cores. That should give you increasing performance, but with diminishing returns as overhead starts to dominate. Weak scaling means: keep the problem size per process/thread/whatever constant, and increase the number of processing elements. That should give you almost linear increasing performance, until -- as I indicated -- you run out of bandwidth. What you seem to do is actually neither of these: you're doing "optimistic scaling": increase the problem size, with everything else constant. That should give you better and better performance, except for cache effects as I indicated.
So if you want to say "this code scales" you have to decide under what scenario. For what it's worth, your figure of 200Gb/sec is plausible. It depends on details of your architecture, but for a fairly recent Intel node that sounds reasonable.
How do I get a rough idea of the time a program will take on a normal PC, so that, based on the input size, I can tell whether my algorithm will get TLE (time limit exceeded) for a given time limit (2 sec, etc.)?
Suppose I have to traverse an array of size 10^6, 10^7, and so on.
I think it will take about 1 second to traverse an array of 10^6 elements.
Can anyone explain this clearly?
Check the instructions per cycle (IPC) of the processor in question, then look at the assembly code and calculate the number of cycles required.
Once you have the number of cycles, multiply it by the cycle time.
Several factors are needed to be considered to reach to any conclusion in this case.
Every machine/assembly instruction takes one or more clock cycles to complete.
After obtaining the assembly code for your program, you can calculate the total time using the following formula:
Execution time = total number of cycles * clock cycle time = instruction count * cycles per instruction * clock cycle time.
For example, a traversal that compiles to roughly 4*10^6 instructions at about 1 cycle per instruction on a 2.5 GHz CPU (0.4 ns per cycle) would take on the order of 1.6 ms - far below 1 second.
In general, you cannot directly assume that the total time to process an array of size 10^6 will be 1 second.
The time to execute a program may be dependent on the following factors:
Processor: To find the closest estimate, you can read the processor manual to get the cycles per instruction (different instructions take different numbers of cycles to retire) and use the above formula.
The data/operand: The size of the operand (in your case, the data in the array) has an effect on latency.
Caching: The time required to access data on the same cache line is the same, so the total time also depends on the number of cache lines the CPU needs to access in total.
Compiler optimizations: Modern compilers are very smart at optimizing code where read/write operations are not involved. In your case you are just traversing the array and not performing any operations on it, so the compiler may optimize the loop away entirely, and the traversal may take far less than 1 second.
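If you want a quick empirical number instead of counting cycles, you can also just time the traversal directly (a sketch; the volatile sink is there to stop the compiler from deleting the loop, as discussed above):
#include <chrono>
#include <cstdio>
#include <vector>

int main()
{
    const std::size_t n = 1000000;               // 10^6 elements
    std::vector<int> a(n, 1);
    volatile long long sink = 0;

    auto t0 = std::chrono::steady_clock::now();
    long long sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];
    auto t1 = std::chrono::steady_clock::now();
    sink = sum;                                   // keep the result observable

    std::printf("traversal took %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count());
    return 0;
}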
I want to measure the performance of a block of code using QueryPerformanceCounter on Windows. What I would like to know is whether there is anything I can do between runs to get equal measurements for the same data (I want to measure the performance of different sorting algorithms on arrays of different sizes containing POD or some custom objects). I know that the current process can be interrupted during execution by interrupts or I/O operations. I'm not doing any I/O, so it's only interrupts that may affect my measurement, and I'm assuming that the kernel also gives my process only a limited time slice, so it will be scheduled away as well.
How do people make accurate measurements through measuring the time of execution of a specific piece of code?
Time measurements are tricky because you need to find out why your algorithm is slower. That depends on the input data (e.g. presorted data; see Why is it faster to process a sorted array than an unsorted array?) and on the data set size (whether it fits into the L1, L2 or L3 cache; see http://igoro.com/archive/gallery-of-processor-cache-effects/).
That can hugely influence your measured times.
Also, the order of measurements can play a critical role. If you execute the sorting algorithms in a loop and each of them allocates some memory, the first test will most likely lose. Not because that algorithm is inferior, but because the first time you access newly allocated memory it is soft-faulted into your process working set. After the memory is freed, the heap allocator returns pooled memory, which will have an entirely different access performance. This becomes very noticeable if you sort larger (many MB) arrays.
Below are the touch times of a 2 GB array from different threads for the first and second time printed. Each page (4KB) of memory is only touched once.
Threads   Size_MB   Time_ms   us/Page   MB/s    Scenario
1         2000      355       0.693     5634    Touch 1
1         2000      11        0.021     N.a.    Touch 2
2         2000      276       0.539     7246    Touch 1
2         2000      12        0.023     N.a.    Touch 2
3         2000      274       0.535     7299    Touch 1
3         2000      13        0.025     N.a.    Touch 2
4         2000      288       0.563     6944    Touch 1
4         2000      11        0.021     N.a.    Touch 2
// Touch is from the compiler's point of view a nop operation with no observable side effect.
// This is true from a pure data content point of view, but performance-wise there is a huge
// difference. Turn optimizations off to prevent the compiler from outsmarting us.
#pragma optimize( "", off )
void Program::Touch(void *p, size_t N)
{
    char *pB = (char *)p;
    char tmp;
    for (size_t i = 0; i < N; i += 4096)   // touch one byte per 4KB page
    {
        tmp = pB[i];
    }
}
#pragma optimize("", on)
To truly judge the performance of an algorithm it is not sufficient to perform time measurements; you also need a profiler (e.g. the Windows Performance Toolkit, which is free, or VTune from Intel, which is not) to ensure that you have measured the right thing and not something entirely different.
I just went to a conference where Andrei Alexandrescu talked about Fastware, and he was addressing this exact issue: how to measure speed. Apparently taking the mean is a bad idea, BUT measuring many times is a great idea. So with that in mind, you measure a million times and remember the smallest measurement, because that is where you get the least amount of noise.
Means are awful because you're adding more of the noise's weight to the actual speed you're measuring. (These are not the only things you should consider when evaluating code speed; there's even more horrid stuff regarding where the code executes, and the overhead of starting execution on one core and finishing on another, but that's a different story and I don't think it applies to my sort.)
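A sketch of that minimum-of-many-runs approach with QueryPerformanceCounter (Windows-only; std::sort stands in here for whichever sorting algorithm you are measuring):
#include <windows.h>
#include <algorithm>
#include <vector>

// Returns the best (smallest) time in milliseconds over `runs` repetitions.
double time_best_of(int runs, const std::vector<int>& data)
{
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    double best = 1.0e300;

    for (int r = 0; r < runs; ++r) {
        std::vector<int> copy = data;            // identical input for every run
        LARGE_INTEGER t0, t1;
        QueryPerformanceCounter(&t0);
        std::sort(copy.begin(), copy.end());     // the code under test
        QueryPerformanceCounter(&t1);
        double ms = 1000.0 * static_cast<double>(t1.QuadPart - t0.QuadPart) / freq.QuadPart;
        best = (ms < best) ? ms : best;          // keep the least-noisy sample
    }
    return best;
}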
A good joke was: if you put Bill Gates into a bus, on average everybody in that bus is a millionaire :))
Cheers and thanks to all who provided input.
I'm using CUDA (in reality I'm using pyCUDA if the difference matters) and performing some computation over arrays. I'm launching a kernel with a grid of 320*600 threads. Inside the kernel I'm declaring two linear arrays of 20000 components using:
float test[20000];
float test2[20000];
With these arrays I perform simple calculations, for example filling them with constant values. The point is that the kernel executes normally and performs the computations correctly (you can see this by filling an array with a random component of test and sending that array from device to host).
The problem is that my NVIDIA card has only 2GB of memory and the total amount of memory to allocate the arrays test and test2 is 320*600*20000*4 bytes that is much more than 2GB.
Where is this memory coming from, and how can CUDA perform the computation in every thread?
Thank you for your time
The actual sizing of the local/stack memory requirement is not as you suppose (for the entire grid, all at once) but is based on a formula described by @njuffa here.
Basically, the local/stack memory requirement is sized based on the maximum instantaneous capacity of the device you are running on, rather than on the size of the grid.
Based on the information provided by njuffa, the available stack size limit (per thread) is the lesser of:
The maximum local memory size (512KB for cc2.x and higher)
available GPU memory/(#of SMs)/(max threads per SM)
For your first case:
float test[20000];
float test2[20000];
That total is 160KB (per thread) so we are under the maximum limit of 512KB per thread. What about the 2nd limit?
GTX 650m has 2 cc 3.0 (kepler) SMs (each Kepler SM has 192 cores). Therefore, the second limit above gives, if all the GPU memory were available:
2GB/2/2048 = 512KB
(kepler has 2048 max threads per multiprocessor)
so it is the same limit in this case. But this assumes all the GPU memory is available.
Since you're suggesting in the comments that this configuration fails:
float test[40000];
float test2[40000];
i.e. 320KB, I would conclude that your actual available GPU memory at the point of this bulk allocation attempt is somewhere above (160/512)*100%, i.e. above 31%, but below (320/512)*100%, i.e. below 62.5%, of 2GB. So your available GPU memory at the time of this bulk allocation request for the stack frame would be something less than 1.25GB.
You could try to see if this is the case by calling cudaMemGetInfo right before the kernel launch in question (although I don't know how to do this in pycuda). Even though your GPU starts out with 2GB, if you are running the display from it, you are likely starting with a number closer to 1.5GB. And dynamic (e.g. cudaMalloc) and/or static (e.g. __device__) allocations that occur prior to this bulk allocation request at kernel launch will all impact available memory.
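For reference, a minimal sketch of that check with the CUDA runtime API (function name and printout are just illustrative):
#include <cstdio>
#include <cuda_runtime.h>

// Call this right before the kernel launch to see how much device memory is left.
void report_free_device_memory()
{
    size_t freeBytes = 0, totalBytes = 0;
    cudaMemGetInfo(&freeBytes, &totalBytes);
    std::printf("free: %.1f MB of %.1f MB total\n",
                freeBytes / 1048576.0, totalBytes / 1048576.0);
}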
This is all to explain some of the specifics. The general answer to your question is that the "magic" arises due to the fact that the GPU does not necessarily allocate the stack frame and local memory for all threads in the grid, all at once. It need only allocate what is required for the maximum instantaneous capacity of the device (i.e. SMs * max threads per SM), which may be a number that is significantly less than what would be required for the whole grid.
I am using Intel TBB parallel_for to speed up a for loop doing some calculations:
tbb::parallel_for(tbb::blocked_range<int>(0,ListSize,1000),Calc);
Calc is an object of the class DoCalc:
class DoCalc
{
    vector<string> FileList;
public:
    void operator()(const tbb::blocked_range<int>& range) const {
        for (int i = range.begin(); i != range.end(); ++i) {
            // Do some calculations
        }
    }
    DoCalc(vector<string> ilist) : FileList(ilist) {}
};
It takes approx. 60 seconds with the standard serial form of the for loop and approx. 20 seconds with TBB's parallel_for to get the job done. With the standard for, the load on each core of my i5 CPU is at approx. 15% (according to the Windows Task Manager) and very inhomogeneous; with parallel_for it is at approx. 50% and very homogeneous.
I wonder if it's possible to get an even higher core load when using parallel_for. Are there any other parameters except grain_size? How can I boost the speed of parallel_for without changing the operations within the for loop (here as //Do some calculations in the code sample above).
The grainsize parameter is optional. When grainsize is not specified, a partitioner should be supplied to the algorithm template. A partitioner is an object that guides the chunking of a range. The auto_partitioner provides an alternative that heuristically chooses the grain size so that you do not have to specify one. The heuristic attempts to limit overhead while still providing ample opportunities for load balancing.
Go to the tbb website for more information. www.threadingbuildingblocks.org
As @Eugene Roader already suggested, you might want to use the auto_partitioner (which is the default since TBB version 2.2) for automatic chunking of the range:
tbb::parallel_for(tbb::blocked_range<int>(0,ListSize),Calc,tbb::auto_partitioner());
I assume that your i5 CPU has 4 cores, so you get a speedup of 3 (60s => 20s), which is already "quite nice", as there are certain overheads in the parallelization. One problem could be the maximum memory bandwidth of your CPU, which may already be saturated with 3 threads - or you might have a lot of allocations/deallocations within your loop which are/must be synchronized between the threads by the standard memory manager. One trick to tackle this problem without many code changes in the inner loop might be using a thread-local allocator, e.g. for FileList:
vector<string,tbb::scalable_allocator<string>> FileList;
Note that you should also try the tbb::scalable_allocator for all other containers used in the loop, in order to bring your parallelization speedup closer to the number of cores, 4.
The answer to your question also depends on the ratio between memory accesses and computation in your algorithm. If you do very few operations on a lot of data, your problem is memory bound and that will limit the core load. If on the other hand you compute a lot with little data, your chances of improving are better.