I am porting an application from Linux to Windows and discovered significant runtime differences between Windows and Linux for the same code on the same hardware.
A minimal working example:
#include <iostream>
#include <chrono>
#include <cstdint>
#include <cuda_runtime.h>

constexpr unsigned int MB = 1000000;
constexpr unsigned int num_bytes = 20 * MB;
constexpr unsigned int repeats = 50;
constexpr unsigned int the_answer = 42;
constexpr unsigned int half_of_the_answer = the_answer / 2;
constexpr unsigned int array_index = 100;

__global__ void kernel(uint8_t* data) {
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_bytes) {
        data[i] = half_of_the_answer;
    }
}

void doSomethingOnGPU(uint8_t* data) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    cudaStreamAttachMemAsync(stream, data, 0, cudaMemAttachSingle);
    kernel<<<num_bytes / 1000, 1000, 0, stream>>>(data);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
    cudaDeviceSynchronize();
}

void doSomethingOnCPU(uint8_t* pic_unpacked) {
    for (unsigned int i = 0; i < num_bytes; i++) {
        pic_unpacked[i] = the_answer;
    }
}

int main() {
    uint8_t* data{};
    cudaMallocManaged(&data, num_bytes, cudaMemAttachHost);
    for (unsigned int i = 0; i < repeats; i++) {
        auto start_time_cpu = std::chrono::high_resolution_clock::now();
        doSomethingOnCPU(data);
        auto stop_time_cpu = std::chrono::high_resolution_clock::now();
        auto duration_cpu = std::chrono::duration_cast<std::chrono::milliseconds>(stop_time_cpu - start_time_cpu);
        std::cout << "CPU computation took " << duration_cpu.count() << "ms, data[" << array_index << "]="
                  << static_cast<unsigned int>(data[array_index]) << std::endl;

        auto start_time_gpu = std::chrono::high_resolution_clock::now();
        doSomethingOnGPU(data);
        auto stop_time_gpu = std::chrono::high_resolution_clock::now();
        auto duration_gpu = std::chrono::duration_cast<std::chrono::milliseconds>(stop_time_gpu - start_time_gpu);
        std::cout << "GPU computation took " << duration_gpu.count() << "ms, data[" << array_index << "]="
                  << static_cast<unsigned int>(data[array_index]) << std::endl << std::endl;
    }
    cudaFree(data);
    return 0;
}
This leads to the following output on Windows:
CPU computation took 216ms, data[100]=42
GPU computation took 29ms, data[100]=21
and to the following output on Linux:
CPU computation took 20ms, data[100]=42
GPU computation took 1ms, data[100]=21
Both are built in Release mode (Linux->GCC, Win->MSVC).
It seems to me that the automatic memory transfers do not work well under Windows.
Explicit memory transfers with
cudaMallocHost(&hostMem, size);
cudaMalloc(&cudaMem, size);
cudaMemcpy(hostMem, cudaMem, size, cudaMemcpyDeviceToHost);
cudaMemcpy(cudaMem, hostMem, size, cudaMemcpyHostToDevice);
work at more or less the same speed under Linux and Windows.
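For reference, a minimal sketch of what the explicit-transfer variant of doSomethingOnGPU looks like (illustrative only, not the exact code I benchmarked; hostMem is the pinned host buffer and cudaMem the plain device buffer from the allocation calls above):
void doSomethingOnGPUExplicit(uint8_t* hostMem, uint8_t* cudaMem) {
    cudaStream_t stream;
    cudaStreamCreate(&stream);
    // copy the input to the device, run the kernel, copy the result back
    cudaMemcpyAsync(cudaMem, hostMem, num_bytes, cudaMemcpyHostToDevice, stream);
    kernel<<<num_bytes / 1000, 1000, 0, stream>>>(cudaMem);
    cudaMemcpyAsync(hostMem, cudaMem, num_bytes, cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}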
Why is there this big runtime difference between Linux and Windows when working with unified memory?
According to the documentation:
GPUs with SM architecture 6.x or higher (Pascal class or newer) provide additional Unified Memory features such as on-demand page migration and GPU memory oversubscription. [...] Applications running on Windows (whether in TCC or WDDM mode) will use the basic Unified Memory model as on pre-6.x architectures even when they are running on hardware with compute capability 6.x or higher.
Of the features explicitly mentioned here, I would think that "on-demand page migration" is the one most relevant to the better performance under Linux.
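For completeness: on Linux the managed pages could also be migrated explicitly with cudaMemPrefetchAsync (a sketch, not part of my benchmark above; this relies on the full Unified Memory model and is therefore not available under the basic model used on Windows):
int device = 0;
cudaGetDevice(&device);
// hint the driver to migrate the managed pages before and after the kernel runs
cudaMemPrefetchAsync(data, num_bytes, device, stream);          // host -> GPU
kernel<<<num_bytes / 1000, 1000, 0, stream>>>(data);
cudaMemPrefetchAsync(data, num_bytes, cudaCpuDeviceId, stream); // GPU -> host
cudaStreamSynchronize(stream);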
Related
I wrote a program which uses std::thread::hardware_concurrency to find out how many threads my computer can run concurrently. Then I divide the array into N blocks and create N threads, each calculating the sum of one block. Here is the code:
#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>
#include <stdlib.h>
// longest per-thread accumulate time; updated by several threads without
// synchronization (a data race), so treat it only as a rough estimate
int64_t thread_cost_time = 0;
template <typename Iterator, typename T> struct accumulate_block {
void operator()(Iterator first, Iterator last, T &result) {
using namespace std::chrono;
auto start = std::chrono::high_resolution_clock::now();
result = std::accumulate(first, last, result);
auto stop = std::chrono::high_resolution_clock::now();
auto thread_time =
std::chrono::duration_cast<microseconds>(stop - start).count();
thread_cost_time = std::max(thread_time, thread_cost_time);
}
};
template <typename Iterator, typename T>
T parallel_accumulate(Iterator first, Iterator last, T &init, uint64_t num) {
uint64_t length = std::distance(first, last);
const uint64_t min_per_thread = 25;
// on my PC this assigns 12 to hardware_threads
const uint64_t hardware_threads = std::thread::hardware_concurrency();
const uint64_t max_threads = (length + min_per_thread - 1) / (min_per_thread);
// const uint64_t num_threads = std::min(hardware_threads != 0 ?
// hardware_threads : 2,
// max_threads);
const uint64_t num_threads = num;
const uint64_t block_size = length / num_threads;
std::vector<T> results(num_threads);
std::vector<std::thread> threads(num_threads - 1);
Iterator block_start = first;
for (uint64_t i = 0; i < num_threads - 1; i++) {
Iterator block_end = block_start;
std::advance(block_end, block_size);
// calculate the sum of block
threads[i] = std::thread{accumulate_block<Iterator, T>(), block_start,
block_end, std::ref(results[i])};
block_start = block_end;
}
accumulate_block<Iterator, T>()(block_start, last, results[num_threads - 1]);
std::for_each(threads.begin(), threads.end(),
std::mem_fn(&std::thread::join));
return std::accumulate(results.begin(), results.end(), init);
}
int main(int argc, char *argv[]) {
// constexpr const uint64_t sz = 1000000000;
for (int number = 2; number < 32; number++) {
int64_t parr = 0;
int64_t single = 0;
int64_t thread_trivial = 0;
std::cout
<< "--------------------------------------------------------------"
<< std::endl;
std::cout << "---------------------thread: " << number
<< "-----------------------" << std::endl;
int iter_times = 10;
for (int iter = 0; iter < iter_times; iter++) {
thread_cost_time = 0;
constexpr const uint64_t sz = 100000000 ;
std::vector<uint64_t> arr;
for (uint32_t i = 0; i < sz; i++) {
arr.emplace_back(i);
}
using namespace std::chrono;
auto start = std::chrono::high_resolution_clock::now();
uint64_t init = 0;
parallel_accumulate<decltype(arr.begin()), uint64_t>(
arr.begin(), arr.end(), std::ref(init), number);
auto stop = std::chrono::high_resolution_clock::now();
parr += std::chrono::duration_cast<microseconds>(stop - start).count();
thread_trivial +=
std::chrono::duration_cast<microseconds>(stop - start).count() -
thread_cost_time;
uint64_t init_ = 0;
uint64_t arr_sz = arr.size();
// uint64_t block_sz = arr.size() / 2;
start = std::chrono::high_resolution_clock::now();
init_ = std::accumulate(arr.begin(), arr.end(), init_);
// std::cout << init_ << std::endl; // use init_ somewhere so the call cannot be optimized away
stop = std::chrono::high_resolution_clock::now();
single += std::chrono::duration_cast<microseconds>(stop - start).count();
}
std::cout << "parallel " << parr / iter_times<< std::endl;
std::cout << "single thread " << single / iter_times<< std::endl;
std::cout << "parr is "
<< static_cast<double>(single) / static_cast<double>(parr)
<< "X fast" << std::endl;
std::cout << "thread create and destory time " << thread_trivial / iter_times
<< std::endl;
}
}
I recorded the times for the multithreaded and the single-threaded version.
I can only achieve at most a 6.57x speedup over a single thread, even though std::thread::hardware_concurrency tells me 12 threads can run simultaneously.
There is no lock contention in this program. I also recorded the time for creating and destroying the threads; even after subtracting it, I still cannot reach a 12x speedup.
Maybe thread scheduling slows the threads down, but with 12 hardware threads I would not expect the speedup to top out at 6.57x.
Maybe multithreading decreases the cache hit ratio, but I'm not sure.
So how can I achieve a 12x speedup over a single thread?
Here are the statistics from my program:

threads   parallel (µs)   single (µs)   speedup
   2         324868          633777       1.95
   3         218584          633777       2.87
   4         167169          633777       3.77
   5         136542          633777       4.64
   6         113207          633777       5.48
   7         147324          633777       4.27
   8         136768          633777       4.67
You can run my code to get the data for 2 to 31 threads.
Apparently, at least on my Intel Core i7, std::thread::hardware_concurrency() returns the number of hardware threads available. On hardware with simultaneous multi-threading, typically 2 hardware threads share time on a single hardware core. The hardware core switches transparently between the 2 hardware threads. That means you only get about half the speedup factor that you might expect based on the result of std::thread::hardware_concurrency().
In practice each hardware thread will stall from time to time for various reasons, e.g. waiting for data to arrive from memory, giving the other hardware thread extra processing time. Typically simultaneous multi-threading (or Hyper-threading as Intel calls it) will give you an extra 15% of performance that way, so you may expect a speedup factor of up to (12/2)*(115/100) = 6.9.
Overheads, including the one you mention, but also in my experience the increased working-set size, can further reduce the speed-up factor.
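If you want the work split to follow physical cores rather than hardware threads, here is a rough sketch (it assumes 2 hardware threads per physical core, which you should verify for your CPU):
#include <algorithm>
#include <thread>

// Rough heuristic only: assumes 2 hardware threads per physical core.
unsigned physical_core_estimate() {
    unsigned hw = std::thread::hardware_concurrency(); // logical hardware threads, e.g. 12
    return std::max(1u, hw / 2);                       // estimated physical cores, e.g. 6
}
// Expected upper bound with SMT: estimated cores * 1.15, e.g. 6 * 1.15 = 6.9x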
So this guide here shows the general way to overlap kernel execution and data transfer.
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i) {
    cudaStreamCreate(&streams[i]);
    int offset = ...;
    cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, streams[i]);
    kernel<<<streamSize/blockSize, blockSize, 0, streams[i]>>>(d_a, offset);
    // edit: no deviceToHost copy
}
However, the kernel is serial, so it must process 0->1000, then 1000->2000, ... In short, the ordering required to run this kernel correctly while overlapping data transfers is:
copy[a->b] must happen before kernel[a->b]
kernel[a->b] must happen before kernel[b->c], where c > a, b
Is it possible to do this without using cudaDeviceSynchronize()? If not, what's the fastest way to do it?
So each kernel is dependent on (cannot begin until):
The associated H->D copy is complete
The previous kernel execution is complete
Ordinary stream semantics won't handle this case (2 separate dependencies, from 2 separate streams), so we'll need to put an extra interlock in there. We can use a set of events and cudaStreamWaitEvent() to handle it.
For the most general case (no knowledge of the total number of chunks) I would recommend something like this:
$ cat t1783.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
template <typename T>
__global__ void process(const T * __restrict__ in, const T * __restrict__ prev, T * __restrict__ out, size_t ds){
for (size_t i = threadIdx.x+blockDim.x*blockIdx.x; i < ds; i += gridDim.x*blockDim.x){
out[i] = in[i] + prev[i];
}
}
const int nTPB = 256;
typedef int mt;
const int chunk_size = 1048576;
const int data_size = 10*1048576;
const int ns = 3;
int main(){
mt *din, *dout, *hin, *hout;
cudaStream_t str[ns];
cudaEvent_t evt[ns];
for (int i = 0; i < ns; i++) {
cudaStreamCreate(str+i);
cudaEventCreate( evt+i);}
cudaMalloc(&din, sizeof(mt)*data_size);
cudaMalloc(&dout, sizeof(mt)*data_size);
cudaHostAlloc(&hin, sizeof(mt)*data_size, cudaHostAllocDefault);
cudaHostAlloc(&hout, sizeof(mt)*data_size, cudaHostAllocDefault);
cudaMemset(dout, 0, sizeof(mt)*chunk_size); // for first loop iteration
for (int i = 0; i < data_size; i++) hin[i] = 1;
cudaEventRecord(evt[ns-1], str[ns-1]); // this event will immediately "complete"
unsigned long long dt = dtime_usec(0);
for (int i = 0; i < (data_size/chunk_size); i++){
cudaStreamSynchronize(str[i%ns]); // so we can reuse event safely
cudaMemcpyAsync(din+i*chunk_size, hin+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyHostToDevice, str[i%ns]);
cudaStreamWaitEvent(str[i%ns], evt[(i>0)?(i-1)%ns:ns-1], 0);
process<<<(chunk_size+nTPB-1)/nTPB, nTPB, 0, str[i%ns]>>>(din+i*chunk_size, dout+((i>0)?(i-1)*chunk_size:0), dout+i*chunk_size, chunk_size);
cudaEventRecord(evt[i%ns], str[i%ns]); // record in this stream so the event completes when this chunk's kernel completes
cudaMemcpyAsync(hout+i*chunk_size, dout+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyDeviceToHost, str[i%ns]);
}
cudaDeviceSynchronize();
dt = dtime_usec(dt);
for (int i = 0; i < data_size; i++) if (hout[i] != (i/chunk_size)+1) {std::cout << "error at index: " << i << " was: " << hout[i] << " should be: " << (i/chunk_size)+1 << std::endl; return 0;}
std::cout << "elapsed time: " << dt << " microseconds" << std::endl;
}
$ nvcc -o t1783 t1783.cu
$ ./t1783
elapsed time: 4366 microseconds
Good practice here would be to use a profiler to verify the expected overlap scenarios. However, we can take a shortcut based on the elapsed time measurement.
The loop is transferring a total of 40MB of data to the device, and 40MB back. The elapsed time is 4366us. This gives an average throughput for each direction of (40*1048576)/4366 or 9606 bytes/us which is 9.6GB/s. This is basically saturating the Gen3 link in both directions, therefore my chunk processing is approximately back-to-back, and I have essentially full overlap of D->H with H->D memcopies. The kernel here is trivial so it shows up as just slivers in the profile.
For your case, you indicated you didn't need the D->H copy, but it adds no extra complexity so I chose to show it. The desired behavior still occurs if you comment that line out of the loop (although this affects results checking later).
A possible criticism of this approach is that the cudaStreamSynchronize() call, which is necessary so we don't "overrun" the event interlock, means that the loop will only proceed to ns iterations beyond the one that is currently executing on the device. So it is not possible to launch more work asynchronously than that. If you wanted to launch all the work at once and then go on and do something else on the CPU, this method will not fully allow that (the CPU will only proceed past the loop once the stream processing has reached a point within ns iterations of the last one).
The code is presented to illustrate an approach, conceptually. It is not guaranteed to be defect free, nor do I claim it is suitable for any particular purpose.
I was trying to benchmark my first CUDA application that adds two arrays first using the CPU and then using the GPU.
Here is the program.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include<iostream>
#include<chrono>
using namespace std;
using namespace std::chrono;
// add two arrays
void add(int n, float *x, float *y) {
for (int i = 0; i < n; i++) {
y[i] += x[i];
}
}
__global__ void addParallel(int n, float *x, float *y) {
int i = threadIdx.x;
if (i < n)
y[i] += x[i];
}
void printElapseTime(std::chrono::microseconds elapsed_time) {
cout << "completed in " << elapsed_time.count() << " microseconds" << endl;
}
int main() {
// generate two arrays of million float values each
cout << "Generating two lists of a million float values ... ";
int n = 1 << 28;
float *x, *y;
cudaMallocManaged(&x, sizeof(float)*n);
cudaMallocManaged(&y, sizeof(float)*n);
// begin benchmark array generation
auto begin = high_resolution_clock::now();
for (int i = 0; i < n; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// end benchmark array generation
auto end = high_resolution_clock::now();
auto elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
// begin benchmark addition cpu
begin = high_resolution_clock::now();
cout << "Adding both arrays using CPU ... ";
add(n, x, y);
// end benchmark addition cpu
end = high_resolution_clock::now();
elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
// begin benchmark addition gpu
begin = high_resolution_clock::now();
cout << "Adding both arrays using GPU ... ";
addParallel<<<1, 1024>>>(n, x, y);
cudaDeviceSynchronize();
// end benchmark addition gpu
end = high_resolution_clock::now();
elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
cudaFree(x);
cudaFree(y);
return 0;
}
Surprisingly though, the program is generating the following output.
Generating two lists of a million float values ... completed in 13343211 microseconds
Adding both arrays using CPU ... completed in 543994 microseconds
Adding both arrays using GPU ... completed in 3030147 microseconds
I wonder where exactly I am going wrong. Why is the GPU computation taking 6 times longer than the one running on the CPU?
For your reference, I'm running Windows 10 on Intel i7 8750H and Nvidia GTX 1060.
Note that your unified memory array contains 268 million floats, meaning you're transferring about 1 GB of data to the device when you invoke your kernel. Use a GPU profiler (nvprof, nvvp, or nsight) and you should see a HtoD transfer taking the bulk of your computation time.
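If you want a quick cross-check without opening a profiler, you can time just the kernel with CUDA events and compare that against your wall-clock number (a sketch built around your existing call; error checking omitted):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
addParallel<<<1, 1024>>>(n, x, y);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float kernel_ms = 0.0f;
cudaEventElapsedTime(&kernel_ms, start, stop);
// If kernel_ms is tiny compared to the std::chrono measurement, the difference
// is dominated by managed-memory migration and launch overhead, not by the kernel.
cudaEventDestroy(start);
cudaEventDestroy(stop);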
I need a fast and efficient implementation for finding the index of the maximum value in an array in CUDA. This operation needs to be performed several times. I originally used cublasIsamax for this, however, it sadly returns the index of the maximum absolute value, which is not what I want. Instead, I'm using thrust::max_element, however the speed is rather slow in comparison to cublasIsamax. I use it in the following manner:
//d_vector is a pointer on the device pointing to the beginning of the vector, containing nrElements floats.
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(d_vector);
thrust::device_vector<float>::iterator d_it = thrust::max_element(d_ptr, d_ptr + nrElements);
max_index = d_it - (thrust::device_vector<float>::iterator)d_ptr;
The number of elements in the vector range between 10'000 and 20'000. The difference in speed between thrust::max_element and cublasIsamax is rather big. Perhaps I'm performing several memory transactions without knowing?
A more efficient implementation would be to write your own max-index reduction code in CUDA. It's likely that cublasIsamax is using something like this under the hood.
We can compare 3 approaches:
thrust::max_element
cublasIsamax
custom CUDA kernel
Here's a fully worked example:
$ cat t665.cu
#include <cublas_v2.h>
#include <thrust/extrema.h>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <iostream>
#include <stdlib.h>
#define DSIZE 10000
// nTPB should be a power-of-2
#define nTPB 256
#define MAX_KERNEL_BLOCKS 30
#define MAX_BLOCKS ((DSIZE/nTPB)+1)
#define MIN(a,b) ((a>b)?b:a)
#define FLOAT_MIN -1.0f
#include <time.h>
#include <sys/time.h>
unsigned long long dtime_usec(unsigned long long prev){
#define USECPSEC 1000000ULL
timeval tv1;
gettimeofday(&tv1,0);
return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}
__device__ volatile float blk_vals[MAX_BLOCKS];
__device__ volatile int blk_idxs[MAX_BLOCKS];
__device__ int blk_num = 0;
template <typename T>
__global__ void max_idx_kernel(const T *data, const int dsize, int *result){
__shared__ volatile T vals[nTPB];
__shared__ volatile int idxs[nTPB];
__shared__ volatile int last_block;
int idx = threadIdx.x+blockDim.x*blockIdx.x;
last_block = 0;
T my_val = FLOAT_MIN;
int my_idx = -1;
// sweep from global memory
while (idx < dsize){
if (data[idx] > my_val) {my_val = data[idx]; my_idx = idx;}
idx += blockDim.x*gridDim.x;}
// populate shared memory
vals[threadIdx.x] = my_val;
idxs[threadIdx.x] = my_idx;
__syncthreads();
// sweep in shared memory
for (int i = (nTPB>>1); i > 0; i>>=1){
if (threadIdx.x < i)
if (vals[threadIdx.x] < vals[threadIdx.x + i]) {vals[threadIdx.x] = vals[threadIdx.x+i]; idxs[threadIdx.x] = idxs[threadIdx.x+i]; }
__syncthreads();}
// perform block-level reduction
if (!threadIdx.x){
blk_vals[blockIdx.x] = vals[0];
blk_idxs[blockIdx.x] = idxs[0];
if (atomicAdd(&blk_num, 1) == gridDim.x - 1) // then I am the last block
last_block = 1;}
__syncthreads();
if (last_block){
idx = threadIdx.x;
my_val = FLOAT_MIN;
my_idx = -1;
while (idx < gridDim.x){
if (blk_vals[idx] > my_val) {my_val = blk_vals[idx]; my_idx = blk_idxs[idx]; }
idx += blockDim.x;}
// populate shared memory
vals[threadIdx.x] = my_val;
idxs[threadIdx.x] = my_idx;
__syncthreads();
// sweep in shared memory
for (int i = (nTPB>>1); i > 0; i>>=1){
if (threadIdx.x < i)
if (vals[threadIdx.x] < vals[threadIdx.x + i]) {vals[threadIdx.x] = vals[threadIdx.x+i]; idxs[threadIdx.x] = idxs[threadIdx.x+i]; }
__syncthreads();}
if (!threadIdx.x)
*result = idxs[0];
}
}
int main(){
int nrElements = DSIZE;
float *d_vector, *h_vector;
h_vector = new float[DSIZE];
for (int i = 0; i < DSIZE; i++) h_vector[i] = rand()/(float)RAND_MAX;
h_vector[10] = 10; // create definite max element
cublasHandle_t my_handle;
cublasStatus_t my_status = cublasCreate(&my_handle);
cudaMalloc(&d_vector, DSIZE*sizeof(float));
cudaMemcpy(d_vector, h_vector, DSIZE*sizeof(float), cudaMemcpyHostToDevice);
int max_index = 0;
unsigned long long dtime = dtime_usec(0);
//d_vector is a pointer on the device pointing to the beginning of the vector, containing nrElements floats.
thrust::device_ptr<float> d_ptr = thrust::device_pointer_cast(d_vector);
thrust::device_vector<float>::iterator d_it = thrust::max_element(d_ptr, d_ptr + nrElements);
max_index = d_it - (thrust::device_vector<float>::iterator)d_ptr;
cudaDeviceSynchronize();
dtime = dtime_usec(dtime);
std::cout << "thrust time: " << dtime/(float)USECPSEC << " max index: " << max_index << std::endl;
max_index = 0;
dtime = dtime_usec(0);
my_status = cublasIsamax(my_handle, DSIZE, d_vector, 1, &max_index);
cudaDeviceSynchronize();
dtime = dtime_usec(dtime);
std::cout << "cublas time: " << dtime/(float)USECPSEC << " max index: " << max_index << std::endl;
max_index = 0;
int *d_max_index;
cudaMalloc(&d_max_index, sizeof(int));
dtime = dtime_usec(0);
max_idx_kernel<<<MIN(MAX_KERNEL_BLOCKS, ((DSIZE+nTPB-1)/nTPB)), nTPB>>>(d_vector, DSIZE, d_max_index);
cudaMemcpy(&max_index, d_max_index, sizeof(int), cudaMemcpyDeviceToHost);
dtime = dtime_usec(dtime);
std::cout << "kernel time: " << dtime/(float)USECPSEC << " max index: " << max_index << std::endl;
return 0;
}
$ nvcc -O3 -arch=sm_20 -o t665 t665.cu -lcublas
$ ./t665
thrust time: 0.00075 max index: 10
cublas time: 6.3e-05 max index: 11
kernel time: 2.5e-05 max index: 10
$
Notes:
CUBLAS returns an index 1 higher than the others because CUBLAS uses 1-based indexing.
CUBLAS might be quicker if you used CUBLAS_POINTER_MODE_DEVICE; however, for validation you would still have to copy the result back to the host.
CUBLAS with CUBLAS_POINTER_MODE_DEVICE should be asynchronous, so the cudaDeviceSynchronize() will be desirable for the host-based timing I've shown here. In some cases, thrust can be asynchronous as well. A sketch of that device-pointer-mode variant follows these notes.
For convenience and results comparison between CUBLAS and the other methods, I am using all nonnegative values for my data. You may want to adjust the FLOAT_MIN value if you are using negative values as well.
If you're freaky about performance, you can try tuning the nTPB and MAX_KERNEL_BLOCKS parameters to see if you can max out performance on your specific GPU. The kernel code also arguably leaves some performance on the table by not switching carefully into a warp-synchronous mode for the final stages of the (two) threadblock reduction(s).
The threadblock reduction kernel uses a block-draining/last-block strategy to avoid the overhead of an additional kernel launch to perform the final reduction.
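For reference, a sketch of the CUBLAS_POINTER_MODE_DEVICE variant mentioned in the notes (not benchmarked above; d_idx is an illustrative name, and the index is left in device memory and copied back only when the host needs it):
int *d_idx;
cudaMalloc(&d_idx, sizeof(int));
cublasSetPointerMode(my_handle, CUBLAS_POINTER_MODE_DEVICE);
my_status = cublasIsamax(my_handle, DSIZE, d_vector, 1, d_idx); // can return without waiting for the result
// ... queue more device work here if desired ...
cudaMemcpy(&max_index, d_idx, sizeof(int), cudaMemcpyDeviceToHost); // note: still a 1-based index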
I am testing Nvidia Cublas Library on my GTX Titan. I have the following code:
#include "cublas.h"
#include <stdlib.h>
#include <conio.h>
#include <Windows.h>
#include <iostream>
#include <iomanip>
/* Vector size */
#define N (1024 * 1024 * 32)
/* Main */
int main(int argc, char** argv)
{
LARGE_INTEGER frequency;
LARGE_INTEGER t1, t2;
float* h_A;
float* h_B;
float* d_A = 0;
float* d_B = 0;
/* Initialize CUBLAS */
cublasInit();
/* Allocate host memory for the vectors */
h_A = (float*)malloc(N * sizeof(h_A[0]));
h_B = (float*)malloc(N * sizeof(h_B[0]));
/* Fill the vectors with test data */
for (int i = 0; i < N; i++)
{
h_A[i] = rand() / (float)RAND_MAX;
h_B[i] = rand() / (float)RAND_MAX;
}
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&t1);
/* Allocate device memory for the vectors */
cublasAlloc(N, sizeof(d_A[0]), (void**)&d_A);
cublasAlloc(N, sizeof(d_B[0]), (void**)&d_B);
/* Initialize the device matrices with the host vectors */
cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1);
cublasSetVector(N, sizeof(h_B[0]), h_B, 1, d_B, 1);
/* Performs operation using cublas */
float res = cublasSdot(N, d_A, 1, d_B, 1);
/* Memory clean up */
cublasFree(d_A);
cublasFree(d_B);
QueryPerformanceCounter(&t2);
double elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
std::cout << "GPU time = " << std::setprecision(16) << elapsedTime << std::endl;
std::cout << "GPU result = " << res << std::endl;
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&t1);
float sum = 0.;
for (int i = 0; i < N; i++) {
sum += h_A[i] * h_B[i];
}
QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
std::cout << "CPU time = " << std::setprecision(16) << elapsedTime << std::endl;
std::cout << "CPU result = " << sum << std::endl;
free(h_A);
free(h_B);
/* Shutdown */
cublasShutdown();
getch();
return EXIT_SUCCESS;
}
When I run the code I get the following result:
GPU time = 164.7487009845991
GPU result = 8388851
CPU time = 45.22368030957917
CPU result = 7780599.5
Why is using the cuBLAS library on a GTX Titan 3 times slower than the calculation on one 2.4 GHz Xeon Ivy Bridge core?
When I increase or decrease the vector size, I get the same result: the GPU is slower than the CPU. Switching to double precision doesn't change it.
Because the dot product uses each vector element only once. That means the time to send the data to the video card is much greater than the time to compute everything on the CPU, because PCI Express is much slower than RAM.
I think you should read this:
http://blog.theincredibleholk.org/blog/2012/12/10/optimizing-dot-product/
There are three main points; I will briefly comment on them:
GPUs are good at hiding latency with lots of computation (if you can balance calculation against data transfer). Here the memory is accessed a lot (a bandwidth-limited problem) and there isn't enough computation to hide the latencies, which is what kills your performance.
Furthermore, the data is read only once, so caching isn't exploited at all, while CPUs are extremely good at predicting which data will be accessed next.
Plus, you're also timing the allocations and the PCI-E bus transfers, which are very slow compared to main-memory accesses.
All of the above makes the example you posted a case in which the CPU outperforms a massively parallel architecture like your GPU.
Optimizations for such a problem could be:
Keeping data on the device as much as possible
Having threads calculate more elements (and thus hide latencies)
Also: http://www.nvidia.com/object/nvidia_research_pub_001.html
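As a quick experiment along those lines, you could time only the cublasSdot call once the data already sits on the device (a sketch reusing the names and the legacy API from your code, not a drop-in replacement):
/* pay the PCI-E transfer cost outside the timed region */
cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1);
cublasSetVector(N, sizeof(h_B[0]), h_B, 1, d_B, 1);

QueryPerformanceCounter(&t1);
float gpu_dot = cublasSdot(N, d_A, 1, d_B, 1); /* only the reduction itself; blocks until the result is back */
QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
std::cout << "GPU dot-only time = " << elapsedTime << ", result = " << gpu_dot << std::endl;
/* this number should be far smaller than the 164 ms above, which is dominated
   by cublasAlloc and the two cublasSetVector transfers */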