Why cublas on GTX Titan is slower than single threaded CPU code? - c++

I am testing Nvidia Cublas Library on my GTX Titan. I have the following code:
#include "cublas.h"
#include <stdlib.h>
#include <conio.h>
#include <Windows.h>
#include <iostream>
#include <iomanip>
/* Vector size */
#define N (1024 * 1024 * 32)
/* Main */
int main(int argc, char** argv)
{
LARGE_INTEGER frequency;
LARGE_INTEGER t1, t2;
float* h_A;
float* h_B;
float* d_A = 0;
float* d_B = 0;
/* Initialize CUBLAS */
cublasInit();
/* Allocate host memory for the vectors */
h_A = (float*)malloc(N * sizeof(h_A[0]));
h_B = (float*)malloc(N * sizeof(h_B[0]));
/* Fill the vectors with test data */
for (int i = 0; i < N; i++)
{
h_A[i] = rand() / (float)RAND_MAX;
h_B[i] = rand() / (float)RAND_MAX;
}
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&t1);
/* Allocate device memory for the vectors */
cublasAlloc(N, sizeof(d_A[0]), (void**)&d_A);
cublasAlloc(N, sizeof(d_B[0]), (void**)&d_B);
/* Initialize the device matrices with the host vectors */
cublasSetVector(N, sizeof(h_A[0]), h_A, 1, d_A, 1);
cublasSetVector(N, sizeof(h_B[0]), h_B, 1, d_B, 1);
/* Performs operation using cublas */
float res = cublasSdot(N, d_A, 1, d_B, 1);
/* Memory clean up */
cublasFree(d_A);
cublasFree(d_B);
QueryPerformanceCounter(&t2);
double elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
std::cout << "GPU time = " << std::setprecision(16) << elapsedTime << std::endl;
std::cout << "GPU result = " << res << std::endl;
QueryPerformanceFrequency(&frequency);
QueryPerformanceCounter(&t1);
float sum = 0.;
for (int i = 0; i < N; i++) {
sum += h_A[i] * h_B[i];
}
QueryPerformanceCounter(&t2);
elapsedTime = (t2.QuadPart - t1.QuadPart) * 1000.0 / frequency.QuadPart;
std::cout << "CPU time = " << std::setprecision(16) << elapsedTime << std::endl;
std::cout << "CPU result = " << sum << std::endl;
free(h_A);
free(h_B);
/* Shutdown */
cublasShutdown();
getch();
return EXIT_SUCCESS;
}
When I run the code I get the following result:
GPU time = 164.7487009845991
GPU result = 8388851
CPU time = 45.22368030957917
CPU result = 7780599.5
Why using cublas library on GTX Titan is 3 times slower than calculations on one Xeon 2.4GHz IvyBridge core?
When I increase or decrease the vector sizes, I get the same results: GPU is slower than CPU. Double precision doesn't change it.

Because dot product is a function that uses each vector element only once. That means that the time to send it to the video card is much greater than to calculate everything on cpu, because PCIExpress is much slower than RAM.

I think you should read this:
http://blog.theincredibleholk.org/blog/2012/12/10/optimizing-dot-product/
There are three main points, I will briefly comment on those:
GPUs are good at hiding latencies with lots of computations (if you can balance between calculations and data transfers), here the memory is accessed a lot (bandwidth limited problem)and there isn't enough computation to hide latencies that, indeed, kill your performances.
Furthermore data is read only once so caching capabilities aren't exploited at all while CPUs are extremely good at predicting which data will be accessed next.
Plus you're also timing the allocation times.. that means PCI-E bus time which is very slow compared to main memory accesses.
All of the above render the example you just posted a case in which CPU outperform a massive parallel architecture like your GPU.
Optimizations for such a problem could be:
Keeping data on the device as much as possible
Having threads calculate more elements (and thus hide latencies)
Also: http://www.nvidia.com/object/nvidia_research_pub_001.html

Related

Why doesn't my OpenMP program scale with number of threads?

I write a program to calculate the sum of an array of 1M numbers where all elements = 1. I use OpenMP for multithreading. However, the run time doesn't scale with the number of threads. Here is the code:
#include <iostream>
#include <omp.h>
#define SIZE 1000000
#define N_THREADS 4
using namespace std;
int main() {
int* arr = new int[SIZE];
long long sum = 0;
int n_threads = 0;
omp_set_num_threads(N_THREADS);
double t1 = omp_get_wtime();
#pragma omp parallel
{
if (omp_get_thread_num() == 0) {
n_threads = omp_get_num_threads();
}
#pragma omp for schedule(static, 16)
for (int i = 0; i < SIZE; i++) {
arr[i] = 1;
}
#pragma omp for schedule(static, 16) reduction(+:sum)
for (int i = 0; i < SIZE; i++) {
sum += arr[i];
}
}
double t2 = omp_get_wtime();
cout << "n_threads " << n_threads << endl;
cout << "time " << (t2 - t1)*1000 << endl;
cout << sum << endl;
}
The run time (in milliseconds) for different values of N_THREADS is as follows:
n_threads 1
time 3.6718
n_threads 2
time 2.5308
n_threads 3
time 3.4383
n_threads 4
time 3.7427
n_threads 5
time 2.4621
I used schedule(static, 16) to use chunks of 16 iterations per thread to avoid false sharing problem. I thought the performance issue was related to false sharing, but I now think it's not. What could possibly be the problem?
Your code is memory bound, not computation expensive. Its speed depends on the speed of memory access (cache utilization, number of memory channels, etc), therefore it is not expected to scale well with the number of threads.
UPDATE, I run this code using 1000x bigger SIZE (i.e. #define SIZE 100000000) (g++ -fopenmp -O3 -mavx2)
Here are the results, it still scales badly with number of threads:
n_threads 1
time 652.656
time 657.207
time 608.838
time 639.168
1000000000
n_threads 2
time 422.621
time 373.995
time 425.819
time 386.511
time 466.632
time 394.198
1000000000
n_threads 3
time 394.419
time 391.283
time 470.925
time 375.833
time 442.268
time 449.611
time 370.12
time 458.79
1000000000
n_threads 4
time 421.89
time 402.363
time 424.738
time 414.368
time 491.843
time 429.757
time 431.459
time 497.566
1000000000
n_threads 8
time 414.426
time 430.29
time 494.899
time 442.164
time 458.576
time 449.313
time 452.309
1000000000
5 threads contending for same accumulator for reduction or having only 16 chunk size must be inhibiting efficient pipelining of loop iterations. Try coarser region per thread.
Maybe more importantly, you need multiple repeats of benchmark programmatically to get an average and to heat CPU caches/cores into higher frequencies to have better measurement.
The benchmark results saying 1MB/s. Surely the worst RAM will do 1000 times better than that. So memory is not bottleneck (for now). 1 million elements per 4 second is like locking contention or non-heated benchmark. Normally even a Pentium 1 would make more bandwidth than that. You sure you are compiling with O3 optimization?
I have reimplemented the test as a Google Benchmark with different values:
#include <benchmark/benchmark.h>
#include <memory>
#include <omp.h>
constexpr int SCALE{32};
constexpr int ARRAY_SIZE{1000000};
constexpr int CHUNK_SIZE{16};
void original_benchmark(benchmark::State& state)
{
const int num_threads{state.range(0)};
const int array_size{state.range(1)};
const int chunk_size{state.range(2)};
auto arr = std::make_unique<int[]>(array_size);
long long sum = 0;
int n_threads = 0;
omp_set_num_threads(num_threads);
// double t1 = omp_get_wtime();
#pragma omp parallel
{
if (omp_get_thread_num() == 0) {
n_threads = omp_get_num_threads();
}
#pragma omp for schedule(static, chunk_size)
for (int i = 0; i < array_size; i++) {
arr[i] = 1;
}
#pragma omp for schedule(static, chunk_size) reduction(+:sum)
for (int i = 0; i < array_size; i++) {
sum += arr[i];
}
}
// double t2 = omp_get_wtime();
// cout << "n_threads " << n_threads << endl;
// cout << "time " << (t2 - t1)*1000 << endl;
// cout << sum << endl;
state.counters["n_threads"] = n_threads;
}
static void BM_original_benchmark(benchmark::State& state) {
for (auto _ : state) {
original_benchmark(state);
}
}
BENCHMARK(BM_original_benchmark)
->Args({1, ARRAY_SIZE, CHUNK_SIZE})
->Args({1, SCALE * ARRAY_SIZE, CHUNK_SIZE})
->Args({1, ARRAY_SIZE, SCALE * CHUNK_SIZE})
->Args({2, ARRAY_SIZE, CHUNK_SIZE})
->Args({2, SCALE * ARRAY_SIZE, CHUNK_SIZE})
->Args({2, ARRAY_SIZE, SCALE * CHUNK_SIZE})
->Args({4, ARRAY_SIZE, CHUNK_SIZE})
->Args({4, SCALE * ARRAY_SIZE, CHUNK_SIZE})
->Args({4, ARRAY_SIZE, SCALE * CHUNK_SIZE})
->Args({8, ARRAY_SIZE, CHUNK_SIZE})
->Args({8, SCALE * ARRAY_SIZE, CHUNK_SIZE})
->Args({8, ARRAY_SIZE, SCALE * CHUNK_SIZE})
->Args({16, ARRAY_SIZE, CHUNK_SIZE})
->Args({16, SCALE * ARRAY_SIZE, CHUNK_SIZE})
->Args({16, ARRAY_SIZE, SCALE * CHUNK_SIZE});
BENCHMARK_MAIN();
I only have access to Compiler Explorer at the moment which will not execute the complete suite of benchmarks. However, it looks like increasing the chunk size will improve the performance. Obviously, benchmark and optimize for your own system.

CUDA C++ overlapping SERIAL kernel execution and data transfer

So this guide here shows the general way to overlap kernel execution and data transfer.
cudaStream_t streams[nStreams];
for (int i = 0; i < nStreams; ++i) {
cudaStreamCreate(&streams[i]);
int offset = ...;
cudaMemcpyAsync(&d_a[offset], &a[offset], streamBytes, cudaMemcpyHostToDevice, stream[i]);
kernel<<<streamSize/blockSize, blockSize, 0, stream[i]>>>(d_a, offset);
// edit: no deviceToHost copy
}
However, the kernel is serial. So it must process 0->1000, then 1000->2000, ... In short, the order to correctly perform this kernel while overlapping data transfer is:
copy[a->b] must happen before kernel[a->b]
kernel [a->b] must happen before kernel[b->c], where c > a, b
Is it possible to do this without using cudaDeviceSynchronize() ? If not, what's the fastest way to do it?
So each kernel is dependent on (cannot begin until):
The associated H->D copy is complete
The previous kernel execution is complete
Ordinary stream semantics won't handle this case (2 separate dependencies, from 2 separate streams), so we'll need to put an extra interlock in there. We can use a set of events and cudaStreamWaitEvent() to handle it.
For the most general case (no knowledge of the total number of chunks) I would recommend something like this:
$ cat t1783.cu
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
template <typename T>
__global__ void process(const T * __restrict__ in, const T * __restrict__ prev, T * __restrict__ out, size_t ds){
for (size_t i = threadIdx.x+blockDim.x*blockIdx.x; i < ds; i += gridDim.x*blockDim.x){
out[i] = in[i] + prev[i];
}
}
const int nTPB = 256;
typedef int mt;
const int chunk_size = 1048576;
const int data_size = 10*1048576;
const int ns = 3;
int main(){
mt *din, *dout, *hin, *hout;
cudaStream_t str[ns];
cudaEvent_t evt[ns];
for (int i = 0; i < ns; i++) {
cudaStreamCreate(str+i);
cudaEventCreate( evt+i);}
cudaMalloc(&din, sizeof(mt)*data_size);
cudaMalloc(&dout, sizeof(mt)*data_size);
cudaHostAlloc(&hin, sizeof(mt)*data_size, cudaHostAllocDefault);
cudaHostAlloc(&hout, sizeof(mt)*data_size, cudaHostAllocDefault);
cudaMemset(dout, 0, sizeof(mt)*chunk_size); // for first loop iteration
for (int i = 0; i < data_size; i++) hin[i] = 1;
cudaEventRecord(evt[ns-1], str[ns-1]); // this event will immediately "complete"
unsigned long long dt = dtime_usec(0);
for (int i = 0; i < (data_size/chunk_size); i++){
cudaStreamSynchronize(str[i%ns]); // so we can reuse event safely
cudaMemcpyAsync(din+i*chunk_size, hin+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyHostToDevice, str[i%ns]);
cudaStreamWaitEvent(str[i%ns], evt[(i>0)?(i-1)%ns:ns-1], 0);
process<<<(chunk_size+nTPB-1)/nTPB, nTPB, 0, str[i%ns]>>>(din+i*chunk_size, dout+((i>0)?(i-1)*chunk_size:0), dout+i*chunk_size, chunk_size);
cudaEventRecord(evt[i%ns]);
cudaMemcpyAsync(hout+i*chunk_size, dout+i*chunk_size, sizeof(mt)*chunk_size, cudaMemcpyDeviceToHost, str[i%ns]);
}
cudaDeviceSynchronize();
dt = dtime_usec(dt);
for (int i = 0; i < data_size; i++) if (hout[i] != (i/chunk_size)+1) {std::cout << "error at index: " << i << " was: " << hout[i] << " should be: " << (i/chunk_size)+1 << std::endl; return 0;}
std::cout << "elapsed time: " << dt << " microseconds" << std::endl;
}
$ nvcc -o t1783 t1783.cu
$ ./t1783
elapsed time: 4366 microseconds
Good practice here would be to use a profiler to verify the expected overlap scenarios. However, we can take a shortcut based on the elapsed time measurement.
The loop is transferring a total of 40MB of data to the device, and 40MB back. The elapsed time is 4366us. This gives an average throughput for each direction of (40*1048576)/4366 or 9606 bytes/us which is 9.6GB/s. This is basically saturating the Gen3 link in both directions, therefore my chunk processing is approximately back-to-back, and I have essentially full overlap of D->H with H->D memcopies. The kernel here is trivial so it shows up as just slivers in the profile.
For your case, you indicated you didn't need the D->H copy, but it adds no extra complexity so I chose to show it. The desired behavior still occurs if you comment that line out of the loop (although this affects results checking later).
A possible criticism of this approach is that the cudaStreamSynchronize() call, which is necessary so we don't "overrun" the event interlock, means that the loop will only proceed to ns number of iterations beyond the one that is currently executing on the device. So it is not possible to launch more work asynchronously than that. If you wanted to launch all the work at once and go on and do something else on the CPU, this method will not fully allow that (the CPU will proceed past the loop when the stream processing has reach ns iterations from the last one).
The code is presented to illustrate an approach, conceptually. It is not guaranteed to be defect free, nor do I claim it is suitable for any particular purpose.

How to benchmark CUDA programs?

I was trying to benchmark my first CUDA application that adds two arrays first using the CPU and then using the GPU.
Here is the program.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include<iostream>
#include<chrono>
using namespace std;
using namespace std::chrono;
// add two arrays
void add(int n, float *x, float *y) {
for (int i = 0; i < n; i++) {
y[i] += x[i];
}
}
__global__ void addParallel(int n, float *x, float *y) {
int i = threadIdx.x;
if (i < n)
y[i] += x[i];
}
void printElapseTime(std::chrono::microseconds elapsed_time) {
cout << "completed in " << elapsed_time.count() << " microseconds" << endl;
}
int main() {
// generate two arrays of million float values each
cout << "Generating two lists of a million float values ... ";
int n = 1 << 28;
float *x, *y;
cudaMallocManaged(&x, sizeof(float)*n);
cudaMallocManaged(&y, sizeof(float)*n);
// begin benchmark array generation
auto begin = high_resolution_clock::now();
for (int i = 0; i < n; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// end benchmark array generation
auto end = high_resolution_clock::now();
auto elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
// begin benchmark addition cpu
begin = high_resolution_clock::now();
cout << "Adding both arrays using CPU ... ";
add(n, x, y);
// end benchmark addition cpu
end = high_resolution_clock::now();
elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
// begin benchmark addition gpu
begin = high_resolution_clock::now();
cout << "Adding both arrays using GPU ... ";
addParallel << <1, 1024 >> > (n, x, y);
cudaDeviceSynchronize();
// end benchmark addition gpu
end = high_resolution_clock::now();
elapsed_time = duration_cast<microseconds>(end - begin);
printElapseTime(elapsed_time);
cudaFree(x);
cudaFree(y);
return 0;
}
Surprisingly though, the program is generating the following output.
Generating two lists of a million float values ... completed in 13343211 microseconds
Adding both arrays using CPU ... completed in 543994 microseconds
Adding both arrays using GPU ... completed in 3030147 microseconds
I wonder where exactly I am going wrong. Why is the GPU computation taking 6 times longer than the one that is running on the CPU.
For your reference, I'm running Windows 10 on Intel i7 8750H and Nvidia GTX 1060.
Note that your unified memory array contains 268 million floats, meaning you're transferring about 1 GB of data to the device when you invoke your kernel. Use a GPU profiler (nvprof, nvvp, or nsight) and you should see a HtoD transfer taking the bulk of your computation time.

How can i make GPU process much faster than CPU process with CUDA 10.0 in Visual Studio 2017?

Smart developer!
I am the beginner of CUDA programming and I have a big problem with my code.
Following code is a sample code from Nvidia and I changed a little bit for showing the GPU process much faster than from CPU process. However, after compiling this code, I got a unexpected result from that CPU process is much faster than GPU process.
This is my laptop gpu info.
This is my cuda code for Visual Studio 2017.
===========================================================================
#define N 10
This is add2 function() from GPU process
`___global____ void add2(int *a, int *b, int *c) {`
// GPU block from grid sector
//int tid = blockIdx.x; // checking the data of index = if you
insert min of N, you will get slow result from CPU. But if you put big number, this show much faster than CPU
// GPU thread
//int tid = threadIdx.x; // Same result as blockIdx.x
// GPU unexpected vector // Same result as above
int tid = threadIdx.x + blockIdx.x*blockDim.x;
if (tid < N) {
c[tid] = a[tid] + b[tid];
}
}
This is add function() from CPU process
`void add(int *a, int *b, int *c) {
int tid = 0;
while (tid < N) {
c[tid] = a[tid] + b[tid];
tid += 1;
}
}
This is Main function()
int main() {
// Values for time duration
LARGE_INTEGER tFreq, tStart, tEnd;
cudaEvent_t start, stop;
float tms, ms;
int a[N], b[N], c[N]; // CPU values
int *dev_a, *dev_b, *dev_c; // GPU values----------------------------------------------
// Creating alloc for GPU--------------------------------------------------------------
cudaMalloc((void**)&dev_a, N * sizeof(int));
cudaMalloc((void**)&dev_b, N * sizeof(int));
cudaMalloc((void**)&dev_c, N * sizeof(int));
// Fill 'a' and 'b' from CPU
for (int i = 0; i < N; i++) {
a[i] = -i;
b[i] = i * i;
}
// Copy values of CPU to GPU values----------------------------------------------------
cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
//////////////////////////////////////
QueryPerformanceFrequency(&tFreq); // Frequency set
QueryPerformanceCounter(&tStart); // Time count Start
// CPU operation
add(a, b, c);
//////////////////////////////////////
QueryPerformanceCounter(&tEnd); // TIme count End
tms = ((tEnd.QuadPart - tStart.QuadPart) / (float)tFreq.QuadPart) * 1000;
//////////////////////////////////////
// show result of CPU
cout << fixed;
cout.precision(10);
cout << "CPU Time=" << tms << endl << endl;
for (int i = 0; i < N; i++) {
printf("CPU calculate = %d + %d = %d\n", a[i], b[i], c[i]);
}
cout << endl;
///////////////////////////////////////
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
// GPU operatinog---------------------------------------------------------------------
//add2 <<<N,1 >>> (dev_a, dev_b, dev_c); // block
//add2 << <1,N >> > (dev_a, dev_b, dev_c); // Thread
add2 << <N/32+1, 32 >> > (dev_a, dev_b, dev_c); // grid
///////////////////////////////////////
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&ms, start, stop);
///////////////////////////////////////
// show result of GPU
cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
cout << fixed;
cout.precision(10);
cout << "GPU Time=" << ms << endl << endl;
for (int i = 0; i < N; i++) {
printf("GPU calculate = %d + %d = %d\n", a[i], b[i], c[i]);
}
//Free GPU values
cudaFree(dev_a);
cudaFree(dev_b);
cudaFree(dev_c);
return 0;
}
This is result of compiling this code.
I want to make GPU process much faster than CPU process.
The GPU is generally actually slower than the CPU for running a single operation. Additionally it takes time to send data to the GPU and read it back again.
The advantage of the GPU is it can execute many operations in parallel.
As you have defined N to be 10 it probably takes longer to upload and download the data than to execute on the CPU. In order to see the advantage of the GPU increase your problem size to something much larger. Ideally you want to execute a minimum of a few operations on each GPU core before you start seeing some benefit. For example with your GPU's 1280 cores you would want to execute something like 4000 operations or more at once to get the benefit of the GPU.

Why is the cuda version 2x slower than the cpu code?

This is a little baffling to me why the cuda code runs about twice as slow as the cpu version. The cpu code is commented out above the main. I am just counting all the primes from 0 to (512 * 512 * 512). The cpu version executed in about 97 seconds whereas the gpu version took 182 seconds. I have an intel i7 running at 4 Ghz and an nvidia GTX 960. Any ideas why?
#include <cuda.h>
#include <iostream>
#include <cstdint>
#include <stdio.h>
#include <ctime>
#include <vector>
#include <cstdlib>
#include <climits>
using namespace std;
__host__ __device__ bool is_prime(uint32_t n)
{
if(n == 2)
return true;
if(n % 2 == 0)
return false;
uint32_t sr = sqrtf(n);
for(uint32_t i = 3; i <= sr; i += 2)
if(n % i == 0)
return false;
return true;
}
__global__ void prime_sum(unsigned int* count)
{
uint32_t n = (blockIdx.y * gridDim.y + blockIdx.x) * blockDim.x + threadIdx.x;
if(is_prime(n))
atomicAdd(count, 1);
}
int main()
{
/* CPU VERSION
time_t start = time(0);
int pcount = 0;
for(uint32_t i = 0; i < (512 * 512 * 512); i++)
{
if(is_prime(i)) pcount++;
}
start = time(0) - start;
std::cout << pcount << "\t" << start << std::endl;
*/
//CUDA VERSION
time_t start = time(0);
unsigned int* sum_d;
cudaMalloc(&sum_d, sizeof(unsigned int));
cudaMemset(sum_d, 0, sizeof(unsigned int));
prime_sum<<< dim3(512, 512), 512 >>>(sum_d);
unsigned int sum = 0;
cudaMemcpy(&sum, sum_d, sizeof(unsigned int), cudaMemcpyDeviceToHost);
start = time(0) - start;
std::cout << sum << "\t" << start << std::endl;
cudaFree(sum_d);
return 0;
}
Here is one idea. The efficiency of the is_prime function comes from being able to exit quickly most of the time because most numbers will be divisible by 2 or lower numbers so when executed in serial most of the time the loop exits fast. However due to warps each group of 32 threads must wait for the worst to finish. Also I am including evens so half the threads will be eliminated by the first if.
First, GPUs generally have good floating point computing power but not integer computing power, and modular (and division) operation is very slow.
Second, global atmoic operations are slow before Kelper architecture, but you have a GTX 960, so I think it's not the problem.
Third, for CPU version, each integer can exit the loop right after it is not a prime. However for GPU, an integer must wait until all of its 32 neighbor threads exit. In your code, the even threads exit right after they enter the kernel but they must wait until the odd threads finish their loop.
BTW, why do you use <<< dim3(512,512), 512>>>? I think 1D work dimension <<<512*512,512>>> is fairly enough.