CUDA C++ overlapping SERIAL kernel execution and data transfer

This guide shows the general way to overlap kernel execution and data transfer:
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i) {
    cudaStreamCreate(&streams[i]);
    int offset = ...;
    cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, streams[i]);
    kernel<<<streamSize/blockSize, blockSize, 0, streams[i]>>>(d_a, offset);
    // edit: no deviceToHost copy
}
However, the kernel is serial. So it must process 0->1000, then 1000->2000, ... In short, the order to correctly perform this kernel while overlapping data transfer is:
copy[a->b] must happen before kernel[a->b]
kernel[a->b] must happen before kernel[b->c], where c > a, b
Is it possible to do this without using cudaDeviceSynchronize()? If not, what's the fastest way to do it?

So each kernel is dependent on (cannot begin until):
The associated H->D copy is complete
The previous kernel execution is complete
Ordinary stream semantics won't handle this case (2 separate dependencies, from 2 separate streams), so we'll need to put an extra interlock in there. We can use a set of events and cudaStreamWaitEvent() to handle it.
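Per chunk, the shape of the interlock is sketched below (done[] is a hypothetical cudaEvent_t array, one event per stream; everything else reuses the names from your snippet). Note that the full example further down uses cudaStreamSynchronize() so each event can be reused safely.
// sketch of the interlock; chunk i runs in streams[i % nStreams], done[] is hypothetical
cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, streams[i % nStreams]); // dependency 1: same-stream ordering with the kernel below
if (i > 0) cudaStreamWaitEvent(streams[i % nStreams], done[(i-1) % nStreams], 0);                      // dependency 2: previous chunk's kernel has finished
kernel<<<streamSize/blockSize, blockSize, 0, streams[i % nStreams]>>>(d_a, offset);
cudaEventRecord(done[i % nStreams], streams[i % nStreams]);                                            // consumed by chunk i+1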
For the most general case (no knowledge of the total number of chunks) I would recommend something like this:
$ cat t1783.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
template <typename T>
__global__ void process(const T * __restrict__ in, const T * __restrict__ prev, T * __restrict__ out, size_t ds){
    for (size_t i = threadIdx.x+blockDim.x*blockIdx.x; i < ds; i += gridDim.x*blockDim.x){
        out[i] = in[i] + prev[i];
    }
}
const int nTPB = 256;
typedef int mt;
const int chunk_size = 1048576;
const int data_size = 10*1048576;
const int ns = 3;
int main(){
    mt *din, *dout, *hin, *hout;
    cudaStream_t str[ns];
    cudaEvent_t evt[ns];
    for (int i = 0; i < ns; i++) {
        cudaStreamCreate(str+i);
        cudaEventCreate(evt+i);}
    cudaMalloc(&din, sizeof(mt)*data_size);
    cudaMalloc(&dout, sizeof(mt)*data_size);
    cudaHostAlloc(&hin, sizeof(mt)*data_size, cudaHostAllocDefault);
    cudaHostAlloc(&hout, sizeof(mt)*data_size, cudaHostAllocDefault);
    cudaMemset(dout, 0, sizeof(mt)*chunk_size); // for first loop iteration
    for (int i = 0; i < data_size; i++) hin[i] = 1;
    cudaEventRecord(evt[ns-1], str[ns-1]); // this event will immediately "complete"
    unsigned long long dt = dtime_usec(0);
    for (int i = 0; i < (data_size/chunk_size); i++){
        cudaStreamSynchronize(str[i%ns]); // so we can reuse event safely
        cudaMemcpyAsync(din+i*chunk_size, hin+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyHostToDevice, str[i%ns]);
        cudaStreamWaitEvent(str[i%ns], evt[(i>0)?(i-1)%ns:ns-1], 0); // wait for the previous chunk's kernel
        process<<<(chunk_size+nTPB-1)/nTPB, nTPB, 0, str[i%ns]>>>(din+i*chunk_size, dout+((i>0)?(i-1)*chunk_size:0), dout+i*chunk_size, chunk_size);
        cudaEventRecord(evt[i%ns], str[i%ns]); // record in this chunk's stream, after its kernel
        cudaMemcpyAsync(hout+i*chunk_size, dout+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyDeviceToHost, str[i%ns]);
    }
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    for (int i = 0; i < data_size; i++) if (hout[i] != (i/chunk_size)+1) {std::cout << "error at index: " << i << " was: " << hout[i] << " should be: " << (i/chunk_size)+1 << std::endl; return 0;}
    std::cout << "elapsed time: " << dt << " microseconds" << std::endl;
}
$ nvcc -o t1783 t1783.cu
$ ./t1783
elapsed time: 4366 microseconds
Good practice here would be to use a profiler to verify the expected overlap scenarios. However, we can take a shortcut based on the elapsed time measurement.
The loop is transferring a total of 40MB of data to the device, and 40MB back. The elapsed time is 4366us. This gives an average throughput for each direction of (40*1048576)/4366 or 9606 bytes/us which is 9.6GB/s. This is basically saturating the Gen3 link in both directions, therefore my chunk processing is approximately back-to-back, and I have essentially full overlap of D->H with H->D memcopies. The kernel here is trivial so it shows up as just slivers in the profile.
For your case, you indicated you didn't need the D->H copy, but it adds no extra complexity so I chose to show it. The desired behavior still occurs if you comment that line out of the loop (although this affects results checking later).
A possible criticism of this approach is that the cudaStreamSynchronize() call, which is necessary so we don't "overrun" the event interlock, means that the loop can only run ns iterations ahead of the work currently executing on the device. So it is not possible to launch more work asynchronously than that. If you wanted to launch all the work at once and then go on and do something else on the CPU, this method will not fully allow that (the CPU will only proceed past the loop once the stream processing is within ns iterations of the last one).
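If the total number of chunks is known up front, one way around that limitation is to create one event per chunk, so that no event ever needs to be reused and the host can issue the whole pipeline asynchronously. This is only a sketch, not part of the tested code above; it reuses the names from the listing plus a hypothetical num_chunks, and needs #include <vector>:
// sketch: one event per chunk removes the need for the in-loop cudaStreamSynchronize()
std::vector<cudaEvent_t> evts(num_chunks);
for (auto &e : evts) cudaEventCreateWithFlags(&e, cudaEventDisableTiming);
for (int i = 0; i < num_chunks; i++){
    cudaStream_t s = str[i % ns];
    cudaMemcpyAsync(din+i*chunk_size, hin+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyHostToDevice, s);
    if (i > 0) cudaStreamWaitEvent(s, evts[i-1], 0);   // previous chunk's kernel must be done
    process<<<(chunk_size+nTPB-1)/nTPB, nTPB, 0, s>>>(din+i*chunk_size, dout+((i>0)?(i-1)*chunk_size:0), dout+i*chunk_size, chunk_size);
    cudaEventRecord(evts[i], s);                       // never re-recorded, so no stream sync is needed
    cudaMemcpyAsync(hout+i*chunk_size, dout+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyDeviceToHost, s);
}
// the host is now free to do other work; synchronize only when the results are needed
The cost is one event object per chunk, which may or may not matter for your chunk counts.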
The code is presented to illustrate an approach, conceptually. It is not guaranteed to be defect free, nor do I claim it is suitable for any particular purpose.

Related

CUDA Unified Memory: Difference in behaviour on Windows and Linux

I am porting an application from Linux to Windows and discovered significant runtime differences of the same code on the same hardware between Windows and Linux.
A minimal working example:
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cuda.h>
constexpr unsigned int MB = 1000000;
constexpr unsigned int num_bytes = 20 * MB;
constexpr unsigned int repeats = 50;
constexpr unsigned int the_answer = 42;
constexpr unsigned int half_of_the_answer = the_answer / 2;
constexpr unsigned int array_index = 100;
__global__ void kernel(uint8_t* data){
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < num_bytes){
        data[i] = half_of_the_answer;
    }
}
void doSomethingOnGPU(uint8_t* data){
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachSingle);
    kernel<<<num_bytes/1000, 1000, 0, stream>>>(data);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaDeviceSynchronize();
}
void doSomethingOnCPU(uint8_t* pic_unpacked){
    for(unsigned int i=0; i < num_bytes; i++){
        pic_unpacked[i] = the_answer;
    }
}
int main() {
    uint8_t* data{};
    cudaMallocManaged(&data, num_bytes, cudaMemAttachHost);
    for(unsigned int i=0; i<repeats; i++){
        auto start_time_cpu = std::chrono::high_resolution_clock::now();
        doSomethingOnCPU(data);
        auto stop_time_cpu = std::chrono::high_resolution_clock::now();
        auto duration_cpu = std::chrono::duration_cast<std::chrono::milliseconds>(stop_time_cpu-start_time_cpu);
        std::cout << "CPU computation took "<< duration_cpu.count() << "ms, data[" << array_index << "]="
                  << static_cast<unsigned int>(data[array_index]) << std::endl;
        auto start_time_gpu = std::chrono::high_resolution_clock::now();
        doSomethingOnGPU(data);
        auto stop_time_gpu = std::chrono::high_resolution_clock::now();
        auto duration_gpu = std::chrono::duration_cast<std::chrono::milliseconds>(stop_time_gpu-start_time_gpu);
        std::cout << "GPU computation took "<< duration_gpu.count() << "ms, data[" << array_index << "]="
                  << static_cast<unsigned int>(data[array_index]) << std::endl << std::endl;
    }
    cudaFree(data);
    return 0;
}
This leads to the following output on Windows:
CPU computation took 216ms, data[100]=42
GPU computation took 29ms, data[100]=21
and to the following output on Linux:
CPU computation took 20ms, data[100]=42
GPU computation took 1ms, data[100]=21
Both are built in Release mode (Linux->GCC, Win->MSVC).
It seems to me that the automatic memory transfers do not work well under Windows.
Explicit memory transfers with
cudaMallocHost(&hostMem, size);
cudaMalloc(&cudaMem, size);
cudaMemcpy(hostMem, cudaMem, size, cudaMemcpyDeviceToHost);
cudaMemcpy(cudaMem, hostMem, size, cudaMemcpyHostToDevice);
run at more or less the same speed under Linux and Windows.
Why is there this big runtime difference between Linux and Windows when working with unified memory?
According to the documentation:
GPUs with SM architecture 6.x or higher (Pascal class or newer) provide additional Unified Memory features such as on-demand page migration and GPU memory oversubscription. [...] Applications running on Windows (whether in TCC or WDDM mode) will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.
Of the features explicitly mentioned here, I would think that "on-demand page migration" is very relevant for the increased performance under Linux.
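On Linux with a compute capability 6.x+ GPU you can also take the migration out of the fault path explicitly with cudaMemPrefetchAsync. This is only a sketch of how doSomethingOnGPU might use it, built on the names from your listing; per the quoted documentation, Windows only gets the basic Unified Memory model, so this path is not available there:
// sketch: prefetch the managed buffer instead of relying on per-page demand faults (Linux, cc 6.x+)
int dev = 0;
cudaGetDevice(&dev);
cudaMemPrefetchAsync(data, num_bytes, dev, stream);             // move the pages to the GPU before the kernel touches them
kernel<<<num_bytes/1000, 1000, 0, stream>>>(data);
cudaMemPrefetchAsync(data, num_bytes, cudaCpuDeviceId, stream); // and back to the host before the next CPU pass
cudaStreamSynchronize(stream);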

How can I test two algorithms and determine which is faster?

Whenever working on a specific problem, I may come across different solutions, and I'm not sure how to choose the better one. The first idea is to compare the complexity of the two solutions, but sometimes they share the same complexity, or they differ but the range of the input is small enough that the constant factor matters.
The second idea is to benchmark both solutions. However, I'm not sure how to time them in C++. I have found this question:
How to Calculate Execution Time of a Code Snippet in C++, but I don't know how to properly deal with compiler optimizations or processor inconsistencies.
In short: is the code provided in the question above sufficient for everyday tests? Are there options I should enable in the compiler before running the tests? (I'm using Visual C++.) How many tests should I do, and how much of a time difference between the two benchmarks matters?
Here is an example of the code I want to test. Which of these is faster? How can I calculate that myself?
unsigned long long fiborecursion(int rank){
    if (rank == 0) return 1;
    else if (rank < 0) return 0;
    return fiborecursion(rank-1) + fiborecursion(rank-2);
}
double sq5 = sqrt(5);
unsigned long long fiboconstant(int rank){
    return pow((1 + sq5) / 2, rank + 1) / sq5 + 0.5;
}
Using the clock from this answer
#include <iostream>
#include <chrono>
class Timer
{
public:
    Timer() : beg_(clock_::now()) {}
    void reset() { beg_ = clock_::now(); }
    double elapsed() const {
        return std::chrono::duration_cast<second_>
            (clock_::now() - beg_).count(); }
private:
    typedef std::chrono::high_resolution_clock clock_;
    typedef std::chrono::duration<double, std::ratio<1> > second_;
    std::chrono::time_point<clock_> beg_;
};
You can write a program to time both of your functions.
int main() {
    const int N = 10000;
    Timer tmr;
    tmr.reset();
    for (int i = 0; i < N; i++) {
        auto value = fiborecursion(i%50);
    }
    double time1 = tmr.elapsed();
    tmr.reset();
    for (int i = 0; i < N; i++) {
        auto value = fiboconstant(i%50);
    }
    double time2 = tmr.elapsed();
    std::cout << "Recursion"
              << "\n\tTotal: " << time1
              << "\n\tAvg: " << time1 / N
              << "\n"
              << "\nConstant"
              << "\n\tTotal: " << time2
              << "\n\tAvg: " << time2 / N
              << "\n";
}
I would try compiling with no compiler optimizations (-O0) and with max compiler optimizations (-O3), just to see what the differences are. Be aware that at max optimizations the compiler may eliminate the loops entirely, since their results are never used.
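One simple way (a sketch, with a hypothetical sink variable) to stop the optimizer from discarding the loops is to fold every result into a value the program actually uses afterwards, e.g. by modifying the timing loops above:
// accumulate into a sink and print it, so -O3 cannot discard the calls as dead code
unsigned long long sink = 0;
tmr.reset();
for (int i = 0; i < N; i++) {
    sink += fiborecursion(i % 50);
}
double time1 = tmr.elapsed();
tmr.reset();
for (int i = 0; i < N; i++) {
    sink += fiboconstant(i % 50);
}
double time2 = tmr.elapsed();
std::cout << "sink = " << sink << "\n";   // using the result keeps both loops alive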

Explanation of parallel code execution and further performance gain in a simple example

Playing around with multithreaded programming using C++11 threads, I wanted to verify that dividing an algorithm into data-independent parts and processing them in parallel decreases the overall runtime.
Let's say the task is to find the maximum in an array of integers, for which the parallelization is pretty simple: each thread finds a local maximum on a particular chunk of the data, and once all local maximums are found, we take the final maximum of the local ones. The runtime should therefore decrease roughly 3-4 times with 4 hardware threads (on my PC it is 4).
The code:
#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
using namespace std::chrono;
void max_el(
    std::vector<int>& v,
    std::vector<int>::value_type& max,
    const int& n_threads=1,
    const unsigned int& tid = 0)
{
    max = v[tid];
    for (size_t i = tid, end = v.size(); i < end; i += n_threads)
    {
        if (v[i] > max)
        {
            max = v[i];
        }
    }
}
void max_el_concurrent(std::vector<int>& v)
{
    int n_threads = std::thread::hardware_concurrency();
    std::cout << n_threads << " threads" << std::endl;
    std::vector<std::thread> workers(n_threads);
    std::vector<int> res(n_threads);
    for (size_t i = 0; i < n_threads; ++i)
    {
        workers[i] = std::thread(max_el, std::ref(v), std::ref(res[i]), n_threads, i);
    }
    for (auto& worker: workers)
    {
        worker.join();
    }
    std::vector<int>::value_type final_max;
    max_el(std::ref(res), std::ref(final_max));
    std::cout << final_max << std::endl;
}
void max_el_sequential(std::vector<int>& v)
{
    std::vector<int>::value_type max;
    std::cout << "sequential" << std::endl;
    max_el(v, max);
    std::cout << max << std::endl;
}
template< class Func, class Container >
void profile(Func func, Container cont)
{
    high_resolution_clock::time_point start, now;
    double runtime = 0.0f;
    start = high_resolution_clock::now();
    func(cont);
    now = high_resolution_clock::now();
    runtime = duration<double>(now - start).count();
    std::cout << "running time = " << runtime << " sec" << std::endl;
}
#define NUM_ELEMENTS 100000000
int main()
{
    std::vector<int> v;
    v.reserve(NUM_ELEMENTS + 100);
    // filling
    std::cout << "data is ready, running ... " << std::endl;
    profile(max_el_sequential, v); // 0.506731 sec
    profile(max_el_concurrent, v); // 0.26108 sec why only ~2 times faster !?
    return 0;
}
Although std::thread::hardware_concurrency returns 4, execution of this code shows only a 2x performance gain compared to the sequential algorithm.
Taking into account that /proc/cpuinfo shows 2 CPUs with 2 cores each, and that there is no locking, I/O, or thread-communication overhead in the code, I expected the theory to hold and to see at least a 3-4x runtime decrease; however, this is not happening in practice.
So why is there such behaviour? What exactly is going on?
On my system (Core i7-5820k), your application seems to be memory-bound.
The speedup I got was 2.9 (with 12 threads).
On my system, max DRAM bandwidth is 45GB/s. A single-threaded run of your application used around 16GB/s, while with 12 threads it reached the full 45GB/s (I had the same results and overall execution time with 3..11 threads).
The way you're striding over contiguous memory in this loop isn't too efficient:
for (size_t i = tid, end = v.size(); i < end; i += n_threads)
Memory is read into the L2 cache in contiguous blocks, so striding over it from several threads is wasteful: with a 64-byte cache line and a 4-byte int, every thread ends up touching every cache line, i.e. loading the whole array, for up to 16 threads. It is also very wasteful of the L2 cache itself, as only a small part of each cache line is actually used before eviction (assuming the threads aren't perfectly in sync and the distance between their active regions quickly exceeds the L2 size).
Additional remarks:
Do not time I/O (that includes std::cout), this will skew the results.
Try not to write to adjacent memory from different threads (like you do with the res vector), or your application will suffer from false sharing. You want to keep a distance of at least 64 bytes between memory written by different threads. As a quick fix, collect the local maximum into a local variable and write max only once at the end.
Fixing both of these had no significant effect on overall performance in this particular case, however.
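As an illustration of both remarks (one contiguous slice per thread, plus a single write of the result at the end), a possible rewrite of max_el might look like the sketch below; it assumes v.size() is at least n_threads and is otherwise untested against the benchmark above:
// sketch: contiguous slice per thread, local accumulation, single final store
void max_el_chunked(const std::vector<int>& v,
                    std::vector<int>::value_type& out,   // written exactly once, at the end
                    int n_threads = 1,
                    unsigned int tid = 0)
{
    // each thread scans one contiguous slice, so each cache line is consumed by a single thread
    const size_t chunk = (v.size() + n_threads - 1) / n_threads;
    const size_t begin = tid * chunk;
    const size_t end   = (begin + chunk < v.size()) ? (begin + chunk) : v.size();
    int local_max = v[begin];                            // accumulate in a local value, not through the shared reference
    for (size_t i = begin; i < end; ++i)
        if (v[i] > local_max) local_max = v[i];
    out = local_max;                                     // one store instead of one per new maximum
}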
Finally, your CPU (Core i5-5200) is a 2-core, hyper-threaded processor. According to Intel, the speedup of hyper-threading is on average 30%. That means that you should expect a max speedup of 2.6 (2 + 2*0.3) and not 4.0.

thrust::max_element slow in comparison to cublasIsamax - More efficient implementation?

I need a fast and efficient implementation for finding the index of the maximum value in an array in CUDA. This operation needs to be performed several times. I originally used cublasIsamax for this, however, it sadly returns the index of the maximum absolute value, which is not what I want. Instead, I'm using thrust::max_element, however the speed is rather slow in comparison to cublasIsamax. I use it in the following manner:
//d_vector is a pointer on the device pointing to the beginning of the vector, containing nrElements floats.
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(d_vector);
thrust::device_vector<float>::iterator d_it = thrust::max_element(d_ptr, d_ptr + nrElements);
max_index = d_it - (thrust::device_vector<float>::iterator)d_ptr;
The number of elements in the vector ranges between 10'000 and 20'000. The difference in speed between thrust::max_element and cublasIsamax is rather big. Perhaps I'm performing several memory transactions without knowing it?
A more efficient implementation would be to write your own max-index reduction code in CUDA. It's likely that cublasIsamax is using something like this under the hood.
We can compare 3 approaches:
thrust::max_element
cublasIsamax
custom CUDA kernel
Here's a fully worked example:
$ cat t665.cu
#include <cublas_v2.h>
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <iostream>
#include <stdlib.h>
#define DSIZE 10000
// nTPB should be a power-of-2
#define nTPB 256
#define MAX_KERNEL_BLOCKS 30
#define MAX_BLOCKS ((DSIZE/nTPB)+1)
#define MIN(a,b) ((a>b)?b:a)
#define FLOAT_MIN -1.0f
#include <time.h>
#include <sys/time.h>
unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
    timeval tv1;
    gettimeofday(&tv1,0);
    return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}
__device__ volatile float blk_vals[MAX_BLOCKS];
__device__ volatile int blk_idxs[MAX_BLOCKS];
__device__ int blk_num = 0;
template <typename T>
__global__ void max_idx_kernel(const T *data, const int dsize, int *result){
    __shared__ volatile T vals[nTPB];
    __shared__ volatile int idxs[nTPB];
    __shared__ volatile int last_block;
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    last_block = 0;
    T my_val = FLOAT_MIN;
    int my_idx = -1;
    // sweep from global memory
    while (idx < dsize){
        if (data[idx] > my_val) {my_val = data[idx]; my_idx = idx;}
        idx += blockDim.x*gridDim.x;}
    // populate shared memory
    vals[threadIdx.x] = my_val;
    idxs[threadIdx.x] = my_idx;
    __syncthreads();
    // sweep in shared memory
    for (int i = (nTPB>>1); i > 0; i>>=1){
        if (threadIdx.x < i)
            if (vals[threadIdx.x] < vals[threadIdx.x + i]) {vals[threadIdx.x] = vals[threadIdx.x+i]; idxs[threadIdx.x] = idxs[threadIdx.x+i]; }
        __syncthreads();}
    // perform block-level reduction
    if (!threadIdx.x){
        blk_vals[blockIdx.x] = vals[0];
        blk_idxs[blockIdx.x] = idxs[0];
        if (atomicAdd(&blk_num, 1) == gridDim.x - 1) // then I am the last block
            last_block = 1;}
    __syncthreads();
    if (last_block){
        idx = threadIdx.x;
        my_val = FLOAT_MIN;
        my_idx = -1;
        while (idx < gridDim.x){
            if (blk_vals[idx] > my_val) {my_val = blk_vals[idx]; my_idx = blk_idxs[idx]; }
            idx += blockDim.x;}
        // populate shared memory
        vals[threadIdx.x] = my_val;
        idxs[threadIdx.x] = my_idx;
        __syncthreads();
        // sweep in shared memory
        for (int i = (nTPB>>1); i > 0; i>>=1){
            if (threadIdx.x < i)
                if (vals[threadIdx.x] < vals[threadIdx.x + i]) {vals[threadIdx.x] = vals[threadIdx.x+i]; idxs[threadIdx.x] = idxs[threadIdx.x+i]; }
            __syncthreads();}
        if (!threadIdx.x)
            *result = idxs[0];
    }
}
int main(){
    int nrElements = DSIZE;
    float *d_vector, *h_vector;
    h_vector = new float[DSIZE];
    for (int i = 0; i < DSIZE; i++) h_vector[i] = rand()/(float)RAND_MAX;
    h_vector[10] = 10; // create definite max element
    cublasHandle_t my_handle;
    cublasStatus_t my_status = cublasCreate(&my_handle);
    cudaMalloc(&d_vector, DSIZE*sizeof(float));
    cudaMemcpy(d_vector, h_vector, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
    int max_index = 0;
    unsigned long long dtime = dtime_usec(0);
    //d_vector is a pointer on the device pointing to the beginning of the vector, containing nrElements floats.
    thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(d_vector);
    thrust::device_vector<float>::iterator d_it = thrust::max_element(d_ptr, d_ptr + nrElements);
    max_index = d_it - (thrust::device_vector<float>::iterator)d_ptr;
    cudaDeviceSynchronize();
    dtime = dtime_usec(dtime);
    std::cout << "thrust time: " << dtime/(float)USECPSEC << " max index: " << max_index << std::endl;
    max_index = 0;
    dtime = dtime_usec(0);
    my_status = cublasIsamax(my_handle, DSIZE, d_vector, 1, &max_index);
    cudaDeviceSynchronize();
    dtime = dtime_usec(dtime);
    std::cout << "cublas time: " << dtime/(float)USECPSEC << " max index: " << max_index << std::endl;
    max_index = 0;
    int *d_max_index;
    cudaMalloc(&d_max_index, sizeof(int));
    dtime = dtime_usec(0);
    max_idx_kernel<<<MIN(MAX_KERNEL_BLOCKS, ((DSIZE+nTPB-1)/nTPB)), nTPB>>>(d_vector, DSIZE, d_max_index);
    cudaMemcpy(&max_index, d_max_index, sizeof(int), cudaMemcpyDeviceToHost);
    dtime = dtime_usec(dtime);
    std::cout << "kernel time: " << dtime/(float)USECPSEC << " max index: " << max_index << std::endl;
    return 0;
}
$ nvcc -O3 -arch=sm_20 -o t665 t665.cu -lcublas
$ ./t665
thrust time: 0.00075 max index: 10
cublas time: 6.3e-05 max index: 11
kernel time: 2.5e-05 max index: 10
$
Notes:
CUBLAS returns an index 1 higher than the others because CUBLAS uses 1-based indexing.
CUBLAS might be quicker if you used CUBLAS_POINTER_MODE_DEVICE; however, for validation you would still have to copy the result back to the host.
CUBLAS with CUBLAS_POINTER_MODE_DEVICE should be asynchronous, so the cudaDeviceSynchronize() will be desirable for the host based timing I've shown here. In some cases, thrust can be asynchronous as well.
For convenience and results comparison between CUBLAS and the other methods, I am using all nonnegative values for my data. You may want to adjust the FLOAT_MIN value if you are using negative values as well.
If you're freaky about performance, you can try tuning the nTPB and MAX_KERNEL_BLOCKS parameters to see if you can max out performance on your specific GPU. The kernel code also arguably leaves some performance on the table by not switching carefully into a warp-synchronous mode for the final stages of the (two) threadblock reduction(s).
The threadblock reduction kernel uses a block-draining/last-block strategy to avoid the overhead of an additional kernel launch to perform the final reduction.
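For the pointer-mode note above, the device-pointer variant would look roughly like this (a sketch reusing my_handle, d_vector, DSIZE and max_index from the listing; d_result is new):
int *d_result;
cudaMalloc(&d_result, sizeof(int));
cublasSetPointerMode(my_handle, CUBLAS_POINTER_MODE_DEVICE);   // result is written to device memory; the call can return asynchronously
my_status = cublasIsamax(my_handle, DSIZE, d_vector, 1, d_result);
// the (1-based) index stays on the device until the host actually needs it
cudaMemcpy(&max_index, d_result, sizeof(int), cudaMemcpyDeviceToHost);
cublasSetPointerMode(my_handle, CUBLAS_POINTER_MODE_HOST);     // restore the default mode for later calls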

Why cublas on GTX Titan is slower than single threaded CPU code?

I am testing Nvidia Cublas Library on my GTX Titan. I have the following code:
#include "cublas.h"
#include <stdlib.h>
#include <conio.h>
#include <Windows.h>
#include <iostream>
#include <iomanip>
/* Vector size */
#define N (1024 * 1024 * 32)
/* Main */
int main(int argc, char** argv)
{
    LARGE_INTEGER frequency;
    LARGE_INTEGER t1, t2;
    float* h_A;
    float* h_B;
    float* d_A = 0;
    float* d_B = 0;
    /* Initialize CUBLAS */
    cublasInit();
    /* Allocate host memory for the vectors */
    h_A = (float*)malloc(N * sizeof(h_A[0]));
    h_B = (float*)malloc(N * sizeof(h_B[0]));
    /* Fill the vectors with test data */
    for (int i = 0; i < N; i++)
    {
        h_A[i] = rand() / (float)RAND_MAX;
        h_B[i] = rand() / (float)RAND_MAX;
    }
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&t1);
    /* Allocate device memory for the vectors */
    cublasAlloc(N, sizeof(d_A[0]), (void**)&d_A);
    cublasAlloc(N, sizeof(d_B[0]), (void**)&d_B);
    /* Initialize the device matrices with the host vectors */
    cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1);
    cublasSetVector(N, sizeof(h_B[0]), h_B, 1, d_B, 1);
    /* Performs operation using cublas */
    float res = cublasSdot(N, d_A, 1, d_B, 1);
    /* Memory clean up */
    cublasFree(d_A);
    cublasFree(d_B);
    QueryPerformanceCounter(&t2);
    double elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
    std::cout << "GPU time = " << std::setprecision(16) << elapsedTime << std::endl;
    std::cout << "GPU result = " << res << std::endl;
    QueryPerformanceFrequency(&frequency);
    QueryPerformanceCounter(&t1);
    float sum = 0.;
    for (int i = 0; i < N; i++) {
        sum += h_A[i] * h_B[i];
    }
    QueryPerformanceCounter(&t2);
    elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
    std::cout << "CPU time = " << std::setprecision(16) << elapsedTime << std::endl;
    std::cout << "CPU result = " << sum << std::endl;
    free(h_A);
    free(h_B);
    /* Shutdown */
    cublasShutdown();
    getch();
    return EXIT_SUCCESS;
}
When I run the code I get the following result:
GPU time = 164.7487009845991
GPU result = 8388851
CPU time = 45.22368030957917
CPU result = 7780599.5
Why is using the cublas library on a GTX Titan 3 times slower than the calculation on one 2.4GHz Ivy Bridge Xeon core?
When I increase or decrease the vector size, I get the same result: the GPU is slower than the CPU. Using double precision doesn't change it.
Because a dot product uses each vector element only once, the time to send the data to the video card is much greater than the time to calculate everything on the CPU, since PCI Express is much slower than RAM.
I think you should read this:
http://blog.theincredibleholk.org/blog/2012/12/10/optimizing-dot-product/
There are three main points, I will briefly comment on those:
GPUs are good at hiding latencies when there is a lot of computation per byte (if you can balance calculations against data transfers); here the memory is accessed a lot (a bandwidth-limited problem) and there isn't enough computation to hide the latencies, which is what kills your performance.
Furthermore, the data is read only once, so caching isn't exploited at all, while CPUs are extremely good at prefetching the data that will be accessed next.
On top of that you are also timing the allocations and transfers, i.e. PCI-E bus time, which is very slow compared to main memory accesses.
All of the above make the example you posted a case in which the CPU outperforms a massively parallel architecture like your GPU.
Optimizations for such a problem could be:
Keeping data on the device as much as possible
Having threads calculate more elements (and thus hide latencies)
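As a concrete illustration of the first point, and of not timing the PCI-E traffic, one option (a sketch using the same legacy cublas API and variable names as your code) is to move the allocation and cublasSetVector calls out of the timed region and time only the dot product itself:
/* allocation and host->device transfer done once, outside the timed region */
cublasAlloc(N, sizeof(d_A[0]), (void**)&d_A);
cublasAlloc(N, sizeof(d_B[0]), (void**)&d_B);
cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1);
cublasSetVector(N, sizeof(h_B[0]), h_B, 1, d_B, 1);
QueryPerformanceCounter(&t1);
float res = cublasSdot(N, d_A, 1, d_B, 1);   /* blocking call: the scalar result comes back to the host */
QueryPerformanceCounter(&t2);
/* only the reduction itself (plus one small result transfer) is measured now */
This way you measure the GPU work itself rather than the bus traffic, which is what the points above are about.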
Also: http://www.nvidia.com/object/nvidia_research_pub_001.html