Intel HD GPU vs Intel CPU performance comparison - c++

I am a newbie in OpenCL and currently have some questions about its performance.
I have an Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz + Ubuntu + Beignet (Intel's open source OpenCL library, see: http://arrayfire.com/opencl-on-intel-hd-iris-graphics-on-linux/ and http://www.freedesktop.org/wiki/Software/Beignet/).
I have a simple benchmark:
#define __CL_ENABLE_EXCEPTIONS
#include "CL/cl.hpp"
#include <vector>
#include <iostream>
#include <iterator>
#include <algorithm>
using namespace cl;
using namespace std;
void CPUadd(vector<float> & A, vector<float> & B, vector<float> & C)
{
    for (int i = 0; i < A.size(); i++)
    {
        C[i] = A[i] + B[i];
    }
}

int main(int argc, char* argv[]) {
    Context(CL_DEVICE_TYPE_GPU);
    static const unsigned elements = 1000000;
    vector<float> data(elements, 6);
    Buffer a(begin(data), end(data), true, false);
    Buffer b(begin(data), end(data), true, false);
    Buffer c(CL_MEM_READ_WRITE, elements * sizeof(float));
    Program addProg(R"d(
    kernel
    void add( global const float * restrict const a,
              global const float * restrict const b,
              global float * restrict const c) {
        unsigned idx = get_global_id(0);
        c[idx] = a[idx] + b[idx] + a[idx] * b[idx] + 5;
    }
    )d", true);
    auto add = make_kernel<Buffer, Buffer, Buffer>(addProg, "add");
#if 1
    for (int i = 0; i < 4000; i++)
    {
        add(EnqueueArgs(elements), a, b, c);
    }
    vector<float> result(elements);
    cl::copy(c, begin(result), end(result));
#else
    vector<float> result(elements);
    for (int i = 0; i < 4000; i++)
    {
        CPUadd(data, data, result);
    }
#endif
    //std::copy(begin(result), end(result), ostream_iterator<float>(cout, ", "));
}
According to my measurements, the Intel HD GPU is 20x faster than a single CPU core (see the benchmark above). That seems too small to me, because if I used all 4 CPU cores I would get only a 5x speed-up from the GPU. Did I write a correct benchmark, and is this speed-up realistic? Unfortunately, clinfo in my case does not list the CPU as an OpenCL device, so I can't do a direct comparison.
UPDATE
Measurements
CPU version:
$ g++ -o main main.cpp -lOpenCL -std=c++11
$ time ./main
real 0m37.316s
user 0m37.280s
sys 0m0.016s
GPU version:
$ g++ -o main main.cpp -lOpenCL -std=c++11
$ time ./main
real 0m2.349s
user 0m0.524s
sys 0m0.624s
Total: 2.349 - 0.524 = 1.825 for GPU
37.316 - 0.524 = 36.724 for CPU
36.724 / 1.825 = 20.12x faster than single CPU => 5x faster than full CPU.

The two implementations you are comparing are not functionally equivalent.
Your CPU implementation needs about 30% less memory bandwidth (which may explain the performance): it accesses only arrays A and B, while the GPU kernel uses three arrays, a, b and c.
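For illustration only (a sketch of mine, not code from the question; CPUadd_equivalent is a name I made up), a CPU loop that actually matches the GPU kernel's arithmetic and memory traffic would look like this:
#include <vector>
#include <cstddef>

// Sketch: CPU loop equivalent to the OpenCL kernel above
// (reads a and b, writes c, and performs the same arithmetic).
void CPUadd_equivalent(const std::vector<float> &a,
                       const std::vector<float> &b,
                       std::vector<float> &c)
{
    for (std::size_t i = 0; i < a.size(); i++)
        c[i] = a[i] + b[i] + a[i] * b[i] + 5;
}
Comparing against a loop like this (still single-threaded) would at least make both sides compute the same result.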

Related

Why Eigen C++ with MKL doesn't use multi-threading for this large matrix multiplication?

I am doing some calculations that involve QR decomposition of a large number (~40000 per execution) of 4x4 matrices with complex double elements (drawn from a random distribution). I started by writing the code directly with Intel MKL functions. But after some research it seems that working with Eigen will be much simpler and result in code that is easier to maintain (partly because I find it difficult to work with 2D arrays in Intel MKL, and because of the care needed for memory management).
Before shifting to Eigen, I started with some performance checks. I took code (from an earlier similar question on SO) that multiplies a 10000x10000 matrix by a 10000x1000 matrix (sizes chosen large enough to show the effect of parallelization). I ran it on a 36-core node. When I checked the stats, Eigen without the Intel MKL directive (but compiled with -O3 -fopenmp) used all the cores and completed the task within ~7 sec.
On the other hand, with
#define EIGEN_USE_MKL_ALL
#define EIGEN_VECTORIZE_SSE4_2
the code takes 28 s and uses only a single core.
Here are my compilation commands:
g++ -m64 -std=c++17 -fPIC -c -I. -I/apps/IntelParallelStudio/mkl/include -O2 -DNDEBUG -Wall -Wno-unused-variable -O3 -fopenmp -I /home/bart/work/eigen-3.4.0 -o eigen_test.o eigen_test.cc
g++ -m64 -std=c++17 -fPIC -I. -I/apps/IntelParallelStudio/mkl/include -O2 -DNDEBUG -Wall -Wno-unused-variable -O3 -fopenmp -I /home/bart/work/eigen-3.4.0 eigen_test.o -o eigen_test -L/apps/IntelParallelStudio/linux/lib/intel64 -lmkl_intel_lp64 -lmkl_intel_thread -lmkl_rt -lmkl_core -liomp5 -lpthread
The code is here,
//#define EIGEN_USE_MKL_ALL // Determine if use MKL
//#define EIGEN_VECTORIZE_SSE4_2
#include <iostream>
#include <Eigen/Dense>
using namespace Eigen;
int main()
{
    int n_a_rows = 10000;
    int n_a_cols = 10000;
    int n_b_rows = n_a_cols;
    int n_b_cols = 1000;

    MatrixXi a(n_a_rows, n_a_cols);
    for (int i = 0; i < n_a_rows; ++i)
        for (int j = 0; j < n_a_cols; ++j)
            a(i, j) = n_a_cols * i + j;

    MatrixXi b(n_b_rows, n_b_cols);
    for (int i = 0; i < n_b_rows; ++i)
        for (int j = 0; j < n_b_cols; ++j)
            b(i, j) = n_b_cols * i + j;

    MatrixXi d(n_a_rows, n_b_cols);

    clock_t begin = clock();
    d = a * b;
    clock_t end = clock();

    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "Time taken : " << elapsed_secs << std::endl;
}
In the previous question related to this topic, the difference in speed was found to be due to turbo boost (and the difference was not as huge). I know that for small matrices Eigen may work better than MKL. But I can't understand why Eigen+MKL refuses to use multiple cores even when I pass -liomp5 during compilation.
Thank you in advance.
(CentOS 7 with GCC 7.4.0, Eigen 3.4.0)
Please set the following shell variables before executing your program.
export MKL_NUM_THREADS="$(nproc)"
export OMP_NUM_THREADS="$(nproc)"
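The same limits can also be pinned from inside the program (a sketch of mine, not part of the original answer; Eigen::setNbThreads and omp_set_num_threads are standard calls, and mkl_set_num_threads becomes available once mkl.h is included):
// Sketch: set thread counts programmatically instead of via shell variables.
// Assumes the program is built with -fopenmp; the MKL call is only relevant
// when EIGEN_USE_MKL_ALL is defined and mkl.h is on the include path.
#include <omp.h>
#include <Eigen/Core>
// #include <mkl.h>

int main() {
    const int n = omp_get_max_threads();  // roughly what $(nproc) reports
    omp_set_num_threads(n);               // OpenMP threads
    Eigen::setNbThreads(n);               // Eigen's internal GEMM threads
    // mkl_set_num_threads(n);            // MKL threads, when building against MKL
    return 0;
}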
Also, the build commands below (the first one run with #define EIGEN_USE_MKL_ALL commented out)
. /opt/intel/oneapi/setvars.sh
$CXX -I /usr/local/include/eigen3 eigen.cpp -o eigen_test -lblas
$CXX -I /usr/local/include/eigen3 eigen.cpp -o eigen_test -lmkl_rt
work fine with CXX set to clang++, g++ or icpx. Setting the environment as shown above is important; in that case -lmkl_rt is plenty. A little bit of adjustment to the code shows the net benefit in wall-clock time:
#define EIGEN_USE_BLAS
#define EIGEN_USE_MKL_ALL
#include <iostream>
#include <chrono>
#include <Eigen/Dense>
using namespace Eigen;
using namespace std::chrono;
int main()
{
    int n_a_rows = 10000;
    int n_a_cols = 10000;
    int n_b_rows = n_a_cols;
    int n_b_cols = 1000;

    MatrixXd a(n_a_rows, n_a_cols);
    for (int i = 0; i < n_a_rows; ++i)
        for (int j = 0; j < n_a_cols; ++j)
            a(i, j) = n_a_cols * i + j;

    MatrixXd b(n_b_rows, n_b_cols);
    for (int i = 0; i < n_b_rows; ++i)
        for (int j = 0; j < n_b_cols; ++j)
            b(i, j) = n_b_cols * i + j;

    MatrixXd d(n_a_rows, n_b_cols);

    using wall_clock_t = std::chrono::high_resolution_clock;
    auto const start = wall_clock_t::now();
    clock_t begin = clock();
    d = a * b;
    clock_t end = clock();
    auto const wall = std::chrono::duration<double>(wall_clock_t::now() - start);

    double elapsed_secs = double(end - begin) / CLOCKS_PER_SEC;
    std::cout << "CPU time  : " << elapsed_secs << std::endl;
    std::cout << "Wall time : " << wall.count() << std::endl;
    std::cout << "Speed up  : " << elapsed_secs / wall.count() << std::endl;
}
The runtime on my 8 core i7-4790K @ 4GHz shows perfect parallelisation:
With on-board BLAS:
CPU time : 12.5134
Wall time : 1.69036
Speed up : 7.40277
With MKL:
> ./eigen_test
CPU time : 11.4391
Wall time : 1.52542
Speed up : 7.49898

Compute reduction sum of a device array with thrust

I know we can compute the sum of a CPU (host) array with thrust like this:
int data[6] = {1, 0, 2, 2, 1, 3};
int result = thrust::reduce(data, data + 6, 0);
Can we find the sum of a GPU array with thrust, without doing a cudaMemcpy to a CPU array?
Suppose I have a device array created using cudaMalloc like this,
cudaMalloc(&gpuspeed, n * sizeof(int));
and I did some modifications to gpuspeed with some kernels. Now can I find the sum of that with thrust? If we can, what changes do I have to make?
Yes, you can do that with thrust.
You can pass device pointers to thrust, and thrust will do the right thing if you explicitly specify the device execution path using thrust execution policies.
Alternatively, you can use thrust::device_ptr to refer to your data, and thrust will also do the right thing, even without explicitly specifying the device execution path.
This answer covers both approaches, albeit with inclusive_scan.
Here is an example:
$ cat t137.cu
#include <thrust/reduce.h>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <iostream>
__global__ void k(int *d, int n){
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    if (idx < n)
        d[idx] = idx;
}

const int ds = 10;
const int nTPB = 256;

int main(){
    int *d, r1, r2;
    cudaMalloc(&d, ds * sizeof(d[0]));
    k<<<(ds + nTPB - 1) / nTPB, nTPB>>>(d, ds);
    thrust::device_ptr<int> tdp = thrust::device_pointer_cast(d);
    r1 = thrust::reduce(tdp, tdp + ds);              // via thrust::device_ptr
    r2 = thrust::reduce(thrust::device, d, d + ds);  // via execution policy + raw pointer
    std::cout << "r1: " << r1 << " r2: " << r2 << std::endl;
}
$ nvcc -std=c++14 -o t137 t137.cu
$ ./t137
r1: 45 r2: 45
$

FFT calculation using GPU: unable to compile program with recursion

I am trying to learn programming a GPU. My system environment is as follows:
OS: windows 10 pro
GPU: NVIDIA GTX 1080 Ti (the display does not run on this card; there is another GPU for that)
CUDA toolkit: v9.1
I wrote this simple program using CUDA to calculate an FFT from scratch on a GPU. The algorithm follows the Wikipedia example of the Cooley-Tukey algorithm. The code uses recursive kernel launches to calculate the FFT of an array of complex values.
#include <iostream>
#include <string>
#include "conio.h"
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <thrust\complex.h>
#include <cstdio>
#include <fstream>
using namespace std;
#define winSize 2048
#define winShift 1024
#define M_PI 3.14159265358979323846
__device__ void separate(thrust::complex<double>* a, int n)
{
    thrust::complex<double>* b = new thrust::complex<double>[n / 2]; // get temp heap storage
    for (int i = 0; i < n / 2; i++)   // copy all odd elements to heap storage
        b[i] = a[i * 2 + 1];
    for (int i = 0; i < n / 2; i++)   // copy all even elements to lower-half of a[]
        a[i] = a[i * 2];
    for (int i = 0; i < n / 2; i++)   // copy all odd (from heap) to upper-half of a[]
        a[i + n / 2] = b[i];
    delete [] b;                      // delete heap storage allocated with new[]
}

// N must be a power-of-2, or bad things will happen.
// Currently no check for this condition.
//
// N input samples in X[] are FFT'd and results left in X[].
// Because of Nyquist theorem, N samples means
// only first N/2 FFT results in X[] are the answer.
// (upper half of X[] is a reflection with no new information).
__global__ void fft2(thrust::complex<double>* X, int N)
{
    if (N < 2)
    {
        // bottom of recursion.
        // Do nothing here, because already X[0] = x[0]
    }
    else
    {
        separate(X, N);                   // all evens to lower half, all odds to upper half
        fft2<<<1, 1>>>(X, N / 2);         // recurse even items
        fft2<<<1, 1>>>(X + N / 2, N / 2); // recurse odd items
        // combine results of two half recursions
        for (int k = 0; k < N / 2; k++)
        {
            thrust::complex<double> e = X[k];         // even
            thrust::complex<double> o = X[k + N / 2]; // odd
            // w is the "twiddle-factor"
            thrust::complex<double> w = exp(thrust::complex<double>(0, -2. * M_PI * k / N));
            X[k] = e + w * o;
            X[k + N / 2] = e - w * o;
        }
    }
}

int main()
{
    const int nSamples = 64;
    double nSeconds = 0.02;                        // total time for sampling
    double sampleRate = nSamples / nSeconds;       // n Hz = n / second
    double freqResolution = sampleRate / nSamples; // freq step in FFT result
    thrust::complex<double> x[nSamples];           // storage for sample data
    thrust::complex<double> X[nSamples];           // storage for FFT answer
    thrust::complex<double> *d_arr1;
    const int nFreqs = 5;
    double freq[nFreqs] = { 2, 4, 8, 32, 72 };     // known freqs for testing
    size_t n_byte = nSamples * sizeof(complex<double>);

    // generate samples for testing
    for (int i = 0; i < nSamples; i++)
    {
        x[i] = thrust::complex<double>(0., 0.);
        // sum several known sinusoids into x[]
        for (int j = 0; j < nFreqs; j++)
            x[i] += sin(2 * M_PI * freq[j] * i); // / nSamples);
        X[i] = x[i]; // copy into X[] for FFT work & result
    }

    // compute fft for this data
    cudaMalloc((void**)&d_arr1, n_byte);
    cudaMemcpy(d_arr1, X, n_byte, cudaMemcpyHostToDevice);
    //launchKernel<<<1, 1>>>(d_arr1, nSamples);
    fft2<<<1, 1>>>(d_arr1, nSamples);
    cudaMemcpy(X, d_arr1, n_byte, cudaMemcpyDeviceToHost);

    printf(" n\tx[]\tX[]\tf\n"); // header line
    // loop to print values
    for (int i = 0; i < nSamples; i++)
    {
        printf("% 3d\t%+.3f\t%+.3f\t%g\n",
               i, x[i].real(), abs(X[i]), i * freqResolution);
    }

    ofstream myfile("example_cuda.txt");
    printf("I am trying to write to file\n");
    if (myfile.is_open())
    {
        for (int count = 0; count < nSamples; count++)
        {
            myfile << x[count].real() << "," << abs(X[count]) << "," << count * freqResolution << "\n";
        }
        myfile.close();
    }
}
I used the following command to compile the code from a VS2015 command prompt:
nvcc -o fft_Wiki2.exe -c -arch=compute_35 -rdc=true
--expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.cu
The compilation itself doesn't show any errors or warnings, but the executable does not run. When I try
fft_Wiki2.exe
it simply says that the version of this executable is incompatible with the 64-bit Windows version and so cannot execute. But I am using the --machine 64 option to force the executable's target.
How do I get this program to execute?
How do I get this program to execute?
It isn't a program you are trying to run, it is an object file.
In your compilation command you pass -c:
nvcc -o fft_Wiki2.exe -c -arch=compute_35 -rdc=true --expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.cu
which means only compilation and no linking. What you would need to do is something like this:
nvcc -o fft_Wiki2.obj -c -arch=compute_35 -rdc=true --expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.cu
nvcc -o fft_Wiki2.exe -arch=compute_35 --expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.obj
[Note I don't have access to a Windows development platform to check the accuracy of the commands]
The first command compiles and emits an object file. The second performs both host and device code linking and emits an executable which you should be able to run.
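Alternatively (equally untested on Windows from my side), dropping -c should let nvcc compile, device-link and host-link in a single step; since the kernel launches itself from device code, the CUDA documentation also shows passing -lcudadevrt explicitly for such builds:
nvcc -o fft_Wiki2.exe -arch=compute_35 -rdc=true --expt-relaxed-constexpr --machine 64 -Xcompiler "/wd4819" fftWiki_2.cu -lcudadevrt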

Thrust Complex Inner Product Run on GPU Slower Than STL Implementation on CPU

I've got the following two implementations of computing a complex inner product, one using STL libraries running on the CPU and one using Thrust running on the GPU:
CPU Implementation
#include <vector>
#include <numeric>
#include <complex>
int main(int argc, char **argv)
{
int vec_size = atoi(argv[1]);
std::vector< std::complex<float> > host_x( vec_size );
std::generate(host_x.begin(), host_x.end(), std::rand);
std::vector< std::complex<float> > host_y( vec_size );
std::generate(host_y.begin(), host_y.end(), std::rand);
std::complex<float> z = std::inner_product(host_x.begin(), host_x.end(), host_y.begin(), std::complex<float>(0.0f,0.0f) );
return 0;
}
GPU Implementation
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/complex.h>
int main(int argc, char **argv)
{
int vec_size = atoi(argv[1]);
thrust::host_vector< thrust::complex<float> > host_x( vec_size );
thrust::generate(host_x.begin(), host_x.end(), rand);
thrust::host_vector< thrust::complex<float> > host_y( vec_size );
thrust::generate(host_y.begin(), host_y.end(), rand);
thrust::device_vector< thrust::complex<float> > device_x = host_x;
thrust::device_vector< thrust::complex<float> > device_y = host_y;
thrust::complex<float> z = thrust::inner_product(device_x.begin(), device_x.end(), device_y.begin(), thrust::complex<float>(0.0f,0.0f) );
return 0;
}
I'm compiling the CPU implementation using g++ and the GPU implementation using nvcc. Both have -O3 optimizations on. I run both implementations with 3,000,000 elements in each vector and get the following timing results:
CPU:
real 0m0.159s
user 0m0.100s
sys 0m0.048s
GPU:
real 0m0.284s
user 0m0.190s
sys 0m0.083s
I'm using the following pieces of software:
$ gcc -v
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.9.sdk/usr/include/c++/4.2.1
Apple LLVM version 5.1 (clang-503.0.40) (based on LLVM 3.4svn)
Target: x86_64-apple-darwin13.3.0
Thread model: posix
$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2013 NVIDIA Corporation
Built on Thu_Sep__5_10:17:14_PDT_2013
Cuda compilation tools, release 5.5, V5.5.0
Along with the latest version of Thrust from the GitHub repo.
My CPU is a 2.4 GHz Intel Core 2 Duo and my GPU is a NVIDIA GeForce 320M 256 MB.
Question:
I'm new to the use of Thrust, but shouldn't my GPU implementation be significantly faster than my CPU implementation? I realize that there are memory transaction costs with GPUs, but I guess I'm trying to figure out if I'm using Thrust correctly to execute the inner product on the GPU since the timing results are unexpectedly reversed in my opinion.
EDIT:
Per everyone's suggestions I made the number of iterations configurable and changed the granularity of the timing as follows:
#include <stdio.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/complex.h>
#include <thrust/execution_policy.h>
int main(int argc, char **argv)
{
int vec_size = atoi(argv[1]);
int iterations = atoi(argv[2]);
float milliseconds = 0;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
thrust::host_vector< thrust::complex<float> > host_x( vec_size );
thrust::generate(host_x.begin(), host_x.end(), rand);
thrust::host_vector< thrust::complex<float> > host_y( vec_size );
thrust::generate(host_y.begin(), host_y.end(), rand);
printf("vector size = %lu bytes\n", vec_size * sizeof(thrust::complex<float>));
cudaEventRecord(start);
thrust::device_vector< thrust::complex<float> > device_x = host_x;
thrust::device_vector< thrust::complex<float> > device_y = host_y;
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);
printf("copy (device)\t\t%f ms\n", milliseconds);
cudaEventRecord(start);
for(int i = 0; i < iterations; ++i)
{
thrust::inner_product(thrust::cuda::par, device_x.begin(), device_x.end(), device_y.begin(), thrust::complex<float>(0.0f,0.0f) );
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);
printf("inner_product (device)\t%f ms\n", milliseconds/iterations);
cudaEventRecord(start);
for(int i = 0; i < iterations; ++i)
{
thrust::inner_product(thrust::host, host_x.begin(), host_x.end(), host_y.begin(), thrust::complex<float>(0.0f,0.0f) );
}
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);
printf("inner_product (host)\t%f ms\n", milliseconds/iterations);
return 0;
}
On a Tegra K1 I got the following:
$ nvcc complex_inner_product.cu -O3 -arch=sm_32 -o cip
$ ./cip 3100000 1000
vector size = 24800000 bytes
copy (device) 45.741653 ms
inner_product (device) 10.595121 ms
inner_product (host) 1.807912 ms
On an Intel Core 2 Duo 2.4 GHz and GeForce 320M I got the following results:
$ nvcc complex_inner_product.cu -O3 -arch=sm_12 -o cip
$ ./cip 3100000 1000
vector size = 24800000 bytes
copy (device) 227.765213 ms
inner_product (device) 42.180416 ms
inner_product (host) 0.000018 ms
On an Intel Core i5 3.3 GHz and GeForce GT 755M:
$ nvcc complex_inner_product.cu -O3 -arch=sm_30 -o cip
$ ./cip 3100000 1000
vector size = 24800000 bytes
copy (device) 22.930016 ms
inner_product (device) 6.249663 ms
inner_product (host) 0.000003 ms
So no matter what compute capability or hardware I use, the host processor is at least 10x faster than the GPU. Any ideas?
There are a number of things to consider with your benchmarking approach. I'm not arguing whether your results are valid; that's a matter of opinion, based on what you consider important. But some things to consider are:
CUDA startup time is included in your measurement.
Data transfer times are included in your measurement.
You are doing only one measurement pass.
You are using a very low end GPU.
Your choice of function to test is not very compute-intensive (a few flops per float quantity).
If you just time the computation portion, I expect you'll find the GPU looking a little better. Here's a fully worked example:
$ cat t489.cu
#include <vector>
#include <numeric>
#include <complex>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/complex.h>
#include <time.h>
#include <sys/time.h>
#include <iostream>
int main(int argc, char **argv)
{
timeval tv1, tv2;
int vec_size = atoi(argv[1]);
std::vector< std::complex<float> > cpu_x( vec_size );
std::generate(cpu_x.begin(), cpu_x.end(), std::rand);
std::vector< std::complex<float> > cpu_y( vec_size );
std::generate(cpu_y.begin(), cpu_y.end(), std::rand);
gettimeofday(&tv1, 0);
std::complex<float> cpu_z = std::inner_product(cpu_x.begin(), cpu_x.end(), cpu_y.begin(), std::complex<float>(0.0f,0.0f) );
gettimeofday(&tv2, 0);
std::cout <<"CPU result: " << cpu_z.real() << "," << cpu_z.imag() << std::endl;
unsigned t2 = (tv2.tv_sec*1000000) + tv2.tv_usec;
unsigned t1 = (tv1.tv_sec*1000000) + tv1.tv_usec;
float et = (t2-t1)/(float) 1000;
std::cout << "CPU elapsed time: " << et << "ms" << std::endl;
thrust::host_vector< thrust::complex<float> > host_x( vec_size );
thrust::generate(host_x.begin(), host_x.end(), rand);
thrust::host_vector< thrust::complex<float> > host_y( vec_size );
thrust::generate(host_y.begin(), host_y.end(), rand);
thrust::device_vector< thrust::complex<float> > device_x = host_x;
thrust::device_vector< thrust::complex<float> > device_y = host_y;
gettimeofday(&tv1, 0);
thrust::complex<float> z = thrust::inner_product(device_x.begin(), device_x.end(), device_y.begin(), thrust::complex<float>(0.0f,0.0f) );
gettimeofday(&tv2, 0);
std::cout <<"GPU result: " << z.real() << "," << z.imag() << std::endl;
t2 = (tv2.tv_sec*1000000) + tv2.tv_usec;
t1 = (tv1.tv_sec*1000000) + tv1.tv_usec;
et = (t2-t1)/(float) 1000;
std::cout << "GPU elapsed time: " << et << "ms" << std::endl;
return 0;
}
$ nvcc -arch=sm_20 -O3 -o t489 t489.cu
$ ./t489 3000000
CPU result: 3.45238e+24,0
CPU elapsed time: 19.294ms
GPU result: 3.46041e+24,0
GPU elapsed time: 3.426ms
$
This was run with a Quadro5000 GPU (considerably more powerful than your GT320M), RHEL 5.5, CUDA 6.5RC, Thrust 1.8 (master branch)
So which numbers matter? That's up to you. If you were just intending to do this single inner product on the GPU, and no other computations or any activity on the GPU, it would be senseless to use the GPU. But in the context of a larger problem, where inner product is just one of the pieces, the GPU may well be faster than the CPU.
(The results don't match because the program is generating differing starting values in each case.)

Faster form for hamming distance in c++ (potentially taking advantage of standard library)?

I have two int vectors like a[100], b[100].
The simple way to calculate their hamming distance is:
std::vector<int> a(100);
std::vector<int> b(100);
double dist = 0;
for(int i = 0; i < 100; i++){
if(a[i] != b[i])
dist++;
}
dist /= a.size();
I would like to ask: is there a faster way to do this calculation in C++, or a way to use the STL to do the same job?
You asked for a faster way. This is an embarrassingly parallel problem, so with C++ you can take advantage of that in two ways: thread-level parallelism, and vectorization through optimization.
//The following flags allow cpu specific vectorization optimizations on *my cpu*
//clang++ -march=corei7-avx hd.cpp -o hd -Ofast -pthread -std=c++1y
//g++ -march=corei7-avx hd.cpp -o hd -Ofast -pthread -std=c++1y
#include <vector>
#include <thread>
#include <future>
#include <numeric>
template<class T, class I1, class I2>
T hamming_distance(size_t size, I1 b1, I2 b2) {
return std::inner_product(b1, b1 + size, b2, T{},
std::plus<T>(), std::not_equal_to<T>());
}
template<class T, class I1, class I2>
T parallel_hamming_distance(size_t threads, size_t size, I1 b1, I2 b2) {
if(size < 1000)
return hamming_distance<T, I1, I2>(size, b1, b2);
if(threads > size)
threads = size;
const size_t whole_part = size / threads;
const size_t remainder = size - threads * whole_part;
std::vector<std::future<T>> bag;
bag.reserve(threads + (remainder > 0 ? 1 : 0));
for(size_t i = 0; i < threads; ++i)
bag.emplace_back(std::async(std::launch::async,
hamming_distance<T, I1, I2>,
whole_part,
b1 + i * whole_part,
b2 + i * whole_part));
if(remainder > 0)
bag.emplace_back(std::async(std::launch::async,
hamming_distance<T, I1, I2>,
remainder,
b1 + threads * whole_part,
b2 + threads * whole_part));
T hamming_distance = 0;
for(auto &f : bag) hamming_distance += f.get();
return hamming_distance;
}
#include <ratio>
#include <random>
#include <chrono>
#include <iostream>
#include <cinttypes>
int main() {
using namespace std;
using namespace chrono;
random_device rd;
mt19937 gen(rd());
uniform_int_distribution<> random_0_9(0, 9);
const auto size = 100 * mega::num;
vector<int32_t> v1(size);
vector<int32_t> v2(size);
for(auto &x : v1) x = random_0_9(gen);
for(auto &x : v2) x = random_0_9(gen);
cout << "naive hamming distance: ";
const auto naive_start = high_resolution_clock::now();
cout << hamming_distance<int32_t>(v1.size(), begin(v1), begin(v2)) << endl;
const auto naive_elapsed = high_resolution_clock::now() - naive_start;
const auto n = thread::hardware_concurrency();
cout << "parallel hamming distance: ";
const auto parallel_start = high_resolution_clock::now();
cout << parallel_hamming_distance<int32_t>(
n,
v1.size(),
begin(v1),
begin(v2)
)
<< endl;
const auto parallel_elapsed = high_resolution_clock::now() - parallel_start;
auto count_microseconds =
[](const high_resolution_clock::duration &elapsed) {
return duration_cast<microseconds>(elapsed).count();
};
cout << "naive delay: " << count_microseconds(naive_elapsed) << endl;
cout << "parallel delay: " << count_microseconds(parallel_elapsed) << endl;
}
Notice that I'm not dividing by the vector size.
Results for my machine (which show that it didn't gain much on a machine with only 2 physical cores...):
$ clang++ -march=corei7-avx hd.cpp -o hd -Ofast -pthread -std=c++1y -stdlib=libc++ -lcxxrt -ldl
$ ./hd
naive hamming distance: 89995190
parallel hamming distance: 89995190
naive delay: 52758
parallel delay: 47227
$ clang++ hd.cpp -o hd -O3 -pthread -std=c++1y -stdlib=libc++ -lcxxrt -ldl
$ ./hd
naive hamming distance: 90001042
parallel hamming distance: 90001042
naive delay: 53851
parallel delay: 46887
$ g++ -march=corei7-avx hd.cpp -o hd -Ofast -pthread -std=c++1y -Wl,--no-as-needed
$ ./hd
naive hamming distance: 90001825
parallel hamming distance: 90001825
naive delay: 55229
parallel delay: 49355
$ g++ hd.cpp -o hd -O3 -pthread -std=c++1y -Wl,--no-as-needed
$ ./hd
naive hamming distance: 89996171
parallel hamming distance: 89996171
naive delay: 54189
parallel delay: 44928
Also, I see no effect from auto-vectorization; I may have to check the assembly...
For a sample about vectorization and compiler options check this blog post of mine.
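A quicker check than reading the assembly (my suggestion; -Rpass=loop-vectorize is clang's vectorization-report flag and -ftree-vectorizer-verbose=1 is GCC's) is to ask the compilers for a vectorization report:
clang++ -march=corei7-avx -Ofast -pthread -std=c++1y -Rpass=loop-vectorize -Rpass-missed=loop-vectorize hd.cpp -o hd
g++ -march=corei7-avx -Ofast -pthread -std=c++1y -ftree-vectorizer-verbose=1 hd.cpp -o hd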
There is a very simple way to optimize this.
int disti = 0;
for(int i = 0; i < n; i++) disti += (a[i] != b[i]);
double dist = 1.0*disti/a.size();
This skips the branch and uses the fact that a conditional test returns 1 or 0. Additionally, it is auto-vectorized by GCC (use -ftree-vectorizer-verbose=1 to check), while the version in the question is not.
Edit:
I went ahead and tested this out with the function in the question, which I called hamming_distance, the simple fix I suggested, which I call hamming_distance_fix, and a version which uses the fix as well as OpenMP, which I call hamming_distance_fix_omp. Here are the times:
hamming_distance 1.71 seconds
hamming_distance_fix 0.38 seconds //SIMD
hamming_distance_fix_omp 0.12 seconds //SIMD + MIMD
Here is the code. I did not use much syntactic sugar, but it should be very easy to convert this to use the STL and so forth... You can see the results here: http://coliru.stacked-crooked.com/a/31293bc88cff4794
//g++-4.8 -std=c++11 -O3 -fopenmp -msse2 -Wall -pedantic -pthread main.cpp && ./a.out
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
double hamming_distance(int* a, int*b, int n) {
double dist = 0;
for(int i=0; i<n; i++) {
if (a[i] != b[i]) dist++;
}
return dist/n;
}
double hamming_distance_fix(int* a, int* b, int n) {
int disti = 0;
for(int i=0; i<n; i++) {
disti += (a[i] != b[i]);
}
return 1.0*disti/n;
}
double hamming_distance_fix_omp(int* a, int* b, int n) {
int disti = 0;
#pragma omp parallel for reduction(+:disti)
for(int i=0; i<n; i++) {
disti += (a[i] != b[i]);
}
return 1.0*disti/n;
}
int main() {
const int n = 1<<16;
const int repeat = 10000;
int *a = new int[n];
int *b = new int[n];
for(int i=0; i<n; i++)
{
a[i] = rand()%10;
b[i] = rand()%10;
}
double dtime, dist;
dtime = omp_get_wtime();
for(int i=0; i<repeat; i++) dist = hamming_distance(a,b,n);
dtime = omp_get_wtime() - dtime;
printf("dist %f, time (s) %f\n", dist, dtime);
dtime = omp_get_wtime();
for(int i=0; i<repeat; i++) dist = hamming_distance_fix(a,b,n);
dtime = omp_get_wtime() - dtime;
printf("dist %f, time (s) %f\n", dist, dtime);
dtime = omp_get_wtime();
for(int i=0; i<repeat; i++) dist = hamming_distance_fix_omp(a,b,n);
dtime = omp_get_wtime() - dtime;
printf("dist %f, time (s) %f\n", dist, dtime);
}
As an observation, working with double is very slow, even for incrementing, so you should use an int inside the loop (for counting) and then use a double for the final division.
As a speed-up, one way worth testing that I could think of is to use SSE instructions:
Pseudocode:
distance = 0
SSE register e1
SSE register e2
for each 4 elements in vectors
    load 4 members from a in e1
    load 4 members from b in e2
    if e1 == e2
        continue
    else
        check each 4 members individually (using e1 and e2)
dist /= 4
In a real (not pseudocode) program, this can be tweaked so that the compiler can use cmov instructions instead of branches.
The main advantage here is that we have 4 times fewer reads from memory.
A disadvantage is that we have an extra check for each 4 checks we had previously.
Depending on how this gets implemented in assembly, via cmovs or branches, this might be even faster for vectors that have many adjacent positions with the same value in the two vectors.
I really can't tell how this will perform compared with the standard solution, but at the very least it is worth testing.
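For completeness, here is a minimal SSE2 sketch of that idea written by me (illustrative only; hamming_distance_sse2 is a name I made up, it assumes 32-bit ints and a length that is a multiple of 4, and it accumulates an equal-lane count rather than branching):
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdio>

// Counts differing positions four 32-bit ints at a time.
// Assumes n is a multiple of 4; a remainder loop would be needed otherwise.
double hamming_distance_sse2(const int* a, const int* b, int n) {
    __m128i acc = _mm_setzero_si128();
    for (int i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        __m128i eq = _mm_cmpeq_epi32(va, vb); // -1 in lanes that are equal, 0 otherwise
        acc = _mm_sub_epi32(acc, eq);         // subtracting -1 adds 1 per equal lane
    }
    int lanes[4];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(lanes), acc);
    const int equal = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    return double(n - equal) / n;             // fraction of differing positions
}

int main() {
    int a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    int b[8] = {1, 0, 3, 0, 5, 6, 0, 8};
    std::printf("%f\n", hamming_distance_sse2(a, b, 8)); // 3 of 8 differ -> 0.375
}
Note that, as the other answers point out, a plain disti += (a[i] != b[i]) loop compiled with -O3 is usually auto-vectorized into something very similar, so hand-written intrinsics mainly make the vectorization explicit.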