Numpy matrix inverse faster than CLAPACK with Accelerate framework - c++

I ran some experiments with matrix inversion to understand how matrix algorithms perform on different programming platforms. For a 4000x4000 positive definite real matrix, the following NumPy code
import numpy as np
import time
tic = time.perf_counter()
sts4098_inverse = np.linalg.inv(sts4098)
toc = time.perf_counter()
print(f"Numpy inversion in {toc - tic:0.4f} seconds")
Numpy inversion in 0.9607 seconds
takes around 1 second. I then tried the CLAPACK interface from Apple's Accelerate framework, like this:
cout << "Using LAPACK to perform Cholesky Inversion... " << endl;
char* UPLO = "U";
int N, LDA, ret;
N = n;
LDA = n;
int len_chol = n * (n + 1) / 2;
double* ap = (double*)malloc(len_chol * sizeof(double));
system_clock::time_point t1, t2;
t1 = system_clock::now();
ret = dpotrf_(UPLO, &N, a, &LDA, &ret);
int ptr = 0;
// Column major expression in LAPACK!
for (int i = 0; i < n; i++) {
for (int j = 0; j <= i; j++) {
ap[ptr] = a[n * i + j];
ptr++;
}
}
ret = dpptri_(UPLO, &N, ap, &ret);
t2 = system_clock::now();
duration<double, milli> ms = t2 - t1;
double time = ms.count() / 1000;
cout << fixed;
cout << "LAPACK Cholesky Inversion takes " << time << "s" << endl;
return ap;
LAPACK Cholesky Inversion takes 5.539074s
The same inversion takes more than 5 s according to chrono::system_clock. This result seems implausible to me, since I compiled with the options "-Ofast" and "-mcpu=apple-m1". My laptop is an M1 Max model.
Is there anything that I missed? How is it possible that numpy is that much faster?

Related

Slow std::vector vs [] in C++ - Why?

I am a bit rusty with C++ - having used it 20 years ago. I am trying to understand why std::vector is so much slower than native arrays in the following code. Can anyone explain it to me? I would much prefer using the standard libraries but not at the cost of this performance penalty:
Vector:
const int grid_e_rows = 50;
const int grid_e_cols = 50;
int H(std::vector<std::vector<int>> &sigma) {
    int h = 0;
    for (int r = 0; r < grid_e_rows; ++r) {
        int r2 = (r + 1) % grid_e_rows;
        for (int c = 0; c < grid_e_cols; ++c) {
            int c2 = (c + 1) % grid_e_cols;
            h += 1 * sigma[r][c] * sigma[r][c2] + 1 * sigma[r][c] * sigma[r2][c];
        }
    }
    return -h;
}
int main() {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::vector<int>> sigma_a(grid_e_rows, std::vector<int>(grid_e_cols));
    for (int i = 0; i < 600000; i++)
        H(sigma_a);
    auto end = std::chrono::steady_clock::now();
    std::cout << "Calculation completed in "
              << std::chrono::duration_cast<std::chrono::seconds>(end - start).count()
              << " seconds";
    return 0;
}
Output is:
Calculation completed in 23 seconds
Array:
const int grid_e_rows = 50;
const int grid_e_cols = 50;
typedef int (*Sigma)[grid_e_rows][grid_e_cols];
int H(Sigma sigma) {
    int h = 0;
    for (int r = 0; r < grid_e_rows; ++r) {
        int r2 = (r + 1) % grid_e_rows;
        for (int c = 0; c < grid_e_cols; ++c) {
            int c2 = (c + 1) % grid_e_cols;
            h += 1 * (*sigma)[r][c] * (*sigma)[r][c2] + 1 * (*sigma)[r][c] * (*sigma)[r2][c];
        }
    }
    return -h;
}
int main() {
    auto start = std::chrono::steady_clock::now();
    int sigma_a[grid_e_rows][grid_e_cols];
    for (int i = 0; i < 600000; i++)
        H(&sigma_a);
    auto end = std::chrono::steady_clock::now();
    std::cout << "Calculation completed in "
              << std::chrono::duration_cast<std::chrono::seconds>(end - start).count()
              << " seconds";
    return 0;
}
Output is:
Calculation completed in 6 seconds
Any help would be appreciated.
First, you're timing the initialization. For the array case, there is none (the array is completely uninitialized). In the vector case, the vector is initialized to zero and then copied into each row.
But the primary reason is cache locality. The array case is a single block of 50*50 integers, all contiguous in memory, which trivially fits in the L1D cache. In the vector case, each row is allocated dynamically, which means the rows' addresses are almost certainly not contiguous and are instead spread all over the program's address space. Accessing one row does not pull the adjacent rows into the cache.
Also, because the rows are relatively small, cache space is wasted on adjacent unrelated data, meaning that even after you've touched everything to pull it into memory, it may no longer fit in L1. And lastly, the access pattern is a lot less linear, and it may be beyond the capability of the hardware prefetcher to predict.
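If you want to keep std::vector but recover that contiguous layout, a common fix (my sketch, not code from the question) is a single flat vector indexed by row and column:
int H_flat(const std::vector<int> &sigma) {
    int h = 0;
    for (int r = 0; r < grid_e_rows; ++r) {
        int r2 = (r + 1) % grid_e_rows;
        for (int c = 0; c < grid_e_cols; ++c) {
            int c2 = (c + 1) % grid_e_cols;
            int s = sigma[r * grid_e_cols + c]; // row-major: one contiguous block
            h += s * sigma[r * grid_e_cols + c2] + s * sigma[r2 * grid_e_cols + c];
        }
    }
    return -h;
}
// usage: std::vector<int> sigma(grid_e_rows * grid_e_cols); H_flat(sigma);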
You are not compiling with optimizations.
Compare: with vector of vector vs. with array.
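For reference, enabling optimizations is just a compiler flag; something like the following (the exact flag set is my assumption):
g++ -O2 -std=c++17 main.cpp -o main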
To give you a small taste of what the optimizer might be doing for you, consider the following modification to your H() function for the vector of vector case.
int H(std::vector<std::vector<int>> &arg) {
    int h = 0;
    auto sigma = arg.data();
    for (int r = 0; r < grid_e_rows; ++r) {
        int r2 = (r + 1) % grid_e_rows;
        auto sr = sigma[r].data();
        auto sr2 = sigma[r2].data();
        for (int c = 0; c < grid_e_cols; ++c) {
            int c2 = (c + 1) % grid_e_cols;
            h += 1 * sr[c] * sr[c2] + 1 * sr[c] * sr2[c];
        }
    }
    return -h;
}
You will find that without optimizations, this version will run closer to the performance of your array version.

Compiling c++ OpenACC parallel CPU code using GCC (G++)

When compiling OpenACC code with GCC 9.3.0 (g++), configured with --enable-languages=c,c++,lto --disable-multilib, the following code does not use multiple cores, whereas the same code compiled with the pgc++ compiler does use multiple cores.
g++ compilation: g++ -lgomp -Ofast -o jsolve -fopenacc jsolvec.cpp
pgc++ compilation: pgc++ -o jsolvec.exe jsolvec.cpp -fast -Minfo=opt -ta=multicore
Code from OpenACC Tutorial1/solver https://github.com/OpenACCuserGroup/openacc-users-group.git:
// Jacobi iterative method for solving a system of linear equations
// This is guaranteed to converge if the matrix is diagonally dominant,
// so we artificially force the matrix to be diagonally dominant.
// See https://en.wikipedia.org/wiki/Jacobi_method
//
// We solve for vector x in Ax = b
// Rewrite the matrix A as a
// lower triangular (L),
// upper triangular (U),
// and diagonal matrix (D).
//
// Ax = (L + D + U)x = b
//
// rearrange to get: Dx = b - (L+U)x --> x = (b-(L+U)x)/D
//
// we can do this iteratively: x_new = (b-(L+U)x_old)/D
// build with TYPE=double (default) or TYPE=float
// build with TOLERANCE=0.001 (default) or TOLERANCE= any other value
// three arguments:
// vector size
// maximum iteration count
// frequency of printing the residual (every n-th iteration)
#include <cmath>
#include <omp.h>
#include <cstdlib>
#include <iostream>
#include <iomanip>
using std::cout;
#ifndef TYPE
#define TYPE double
#endif
#define TOLERANCE 0.001
void
init_simple_diag_dom(int nsize, TYPE* A)
{
    int i, j;
    // In a diagonally-dominant matrix, the diagonal element
    // is greater than the sum of the other elements in the row.
    // Scale the matrix so the sum of the row elements is close to one.
    for (i = 0; i < nsize; ++i) {
        TYPE sum;
        sum = (TYPE)0;
        for (j = 0; j < nsize; ++j) {
            TYPE x;
            x = (rand() % 23) / (TYPE)1000;
            A[i*nsize + j] = x;
            sum += x;
        }
        // Fill diagonal element with the sum
        A[i*nsize + i] += sum;
        // scale the row so the final matrix is almost an identity matrix
        for (j = 0; j < nsize; j++)
            A[i*nsize + j] /= sum;
    }
} // init_simple_diag_dom
int
main(int argc, char **argv)
{
    int nsize; // A[nsize][nsize]
    int i, j, iters, max_iters, riter;
    double start_time, elapsed_time;
    TYPE residual, err, chksum;
    TYPE *A, *b, *x1, *x2, *xnew, *xold, *xtmp;
    // set matrix dimensions and allocate memory for matrices
    nsize = 0;
    if (argc > 1)
        nsize = atoi(argv[1]);
    if (nsize <= 0)
        nsize = 1000;
    max_iters = 0;
    if (argc > 2)
        max_iters = atoi(argv[2]);
    if (max_iters <= 0)
        max_iters = 5000;
    riter = 0;
    if (argc > 3)
        riter = atoi(argv[3]);
    if (riter <= 0)
        riter = 200;
    cout << "nsize = " << nsize << ", max_iters = " << max_iters << "\n";
    A = new TYPE[nsize*nsize];
    b = new TYPE[nsize];
    x1 = new TYPE[nsize];
    x2 = new TYPE[nsize];
    // generate a diagonally dominant matrix
    init_simple_diag_dom(nsize, A);
    // zero the x vectors, random values to the b vector
    for (i = 0; i < nsize; i++) {
        x1[i] = (TYPE)0.0;
        x2[i] = (TYPE)0.0;
        b[i] = (TYPE)(rand() % 51) / 100.0;
    }
    start_time = omp_get_wtime();
    //
    // jacobi iterative solver
    //
    residual = TOLERANCE + 1.0;
    iters = 0;
    xnew = x1; // swap these pointers in each iteration
    xold = x2;
    while ((residual > TOLERANCE) && (iters < max_iters)) {
        ++iters;
        // swap input and output vectors
        xtmp = xnew;
        xnew = xold;
        xold = xtmp;
        #pragma acc parallel loop
        for (i = 0; i < nsize; ++i) {
            TYPE rsum = (TYPE)0;
            #pragma acc loop reduction(+:rsum)
            for (j = 0; j < nsize; ++j) {
                if (i != j) rsum += A[i*nsize + j] * xold[j];
            }
            xnew[i] = (b[i] - rsum) / A[i*nsize + i];
        }
        //
        // test convergence, sqrt(sum((xnew-xold)**2))
        //
        residual = 0.0;
        #pragma acc parallel loop reduction(+:residual)
        for (i = 0; i < nsize; i++) {
            TYPE dif;
            dif = xnew[i] - xold[i];
            residual += dif * dif;
        }
        residual = sqrt((double)residual);
        if (iters % riter == 0) cout << "Iteration " << iters << ", residual is " << residual << "\n";
    }
    elapsed_time = omp_get_wtime() - start_time;
    cout << "\nConverged after " << iters << " iterations and " << elapsed_time << " seconds, residual is " << residual << "\n";
    //
    // test answer by multiplying my computed value of x by
    // the input A matrix and comparing the result with the
    // input b vector.
    //
    err = (TYPE)0.0;
    chksum = (TYPE)0.0;
    for (i = 0; i < nsize; i++) {
        TYPE tmp;
        xold[i] = (TYPE)0.0;
        for (j = 0; j < nsize; j++)
            xold[i] += A[i*nsize + j] * xnew[j];
        tmp = xold[i] - b[i];
        chksum += xnew[i];
        err += tmp * tmp;
    }
    err = sqrt((double)err);
    cout << "Solution error is " << err << "\n";
    if (err > TOLERANCE)
        cout << "****** Final Solution Out of Tolerance ******\n" << err << " > " << TOLERANCE << "\n";
    // arrays allocated with new[] must be released with delete[]
    delete[] A;
    delete[] b;
    delete[] x1;
    delete[] x2;
    return 0;
}
It's not yet supported in GCC to use OpenACC to schedule parallel loops onto multicore CPUs. Using OpenMP works for that, of course, and you can have code with mixed OpenACC (for GPU offloading, as already present in your code) and OpenMP directives (for CPU parallelization, not yet present in your code), so that the respective mechanism will be used depending on whether compiling with -fopenacc vs. -fopenmp.
As PGI does, it certainly can be supported in GCC, and we'll certainly be able to implement that, but it has not yet been scheduled or funded for GCC.
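For illustration, a minimal sketch of such mixed annotations on the question's main Jacobi loop (the #if guards and the OpenMP clause are my assumption, not tested tutorial code); _OPENACC and _OPENMP are the standard feature-test macros, so whichever model is enabled at compile time takes effect:
#if defined(_OPENACC)
#pragma acc parallel loop
#elif defined(_OPENMP)
#pragma omp parallel for private(j) // j is declared outside the loop, so it must be privatized
#endif
for (i = 0; i < nsize; ++i) {
    TYPE rsum = (TYPE)0;
    for (j = 0; j < nsize; ++j) {
        if (i != j) rsum += A[i*nsize + j] * xold[j];
    }
    xnew[i] = (b[i] - rsum) / A[i*nsize + i];
}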

10 dimensional Monte Carlo integration with openmp

I am trying to learn parallelization with OpenMP. I have written a C++ program which computes a 10-dimensional integral by Monte Carlo for the function:
F = x1 + x2 + x3 + ... + x10
Now I am trying to convert it to work with OpenMP with 4 threads. My serial code gives sensible output, so I am fairly convinced that it works fine.
Here is my serial code. I want one line of output for every 4^k sample points, where N is the total number of sample points:
/* compile with
$ g++ -o monte ND_MonteCarlo.cpp
$ ./monte N
unsigned long long int for i, N
Maximum value for UNSIGNED LONG LONG INT 18446744073709551615
*/
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <ctime>
using namespace std;
//define multivariate function F(x1, x2, ...xk)
double f(double x[], int n)
{
    double y;
    int j;
    y = 0.0;
    for (j = 0; j < n; j = j+1)
    {
        y = y + x[j];
    }
    return y;
}
//define function for Monte Carlo Multidimensional integration
double int_mcnd(double(*fn)(double[],int), double a[], double b[], int n, int m)
{
    double r, x[n], v;
    int i, j;
    r = 0.0;
    v = 1.0;
    // step 1: calculate the common factor V
    for (j = 0; j < n; j = j+1)
    {
        v = v*(b[j]-a[j]);
    }
    // step 2: integration
    for (i = 1; i <= m; i=i+1)
    {
        // calculate random x[] points
        for (j = 0; j < n; j = j+1)
        {
            x[j] = a[j] + (rand()) / ((RAND_MAX/(b[j]-a[j])));
        }
        r = r + fn(x,n);
    }
    r = r*v/m;
    return r;
}
double f(double[], int);
double int_mcnd(double(*)(double[],int), double[], double[], int, int);
int main(int argc, char **argv)
{
    /* define how many integrals */
    const int n = 10;
    double b[n] = {5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0};
    double a[n] = {-5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0};
    double result, mean;
    int m;
    unsigned long long int i, N;
    // initial seed value (use system time)
    srand(time(NULL));
    cout.precision(6);
    cout.setf(ios::fixed | ios::showpoint);
    // current time in seconds (begin calculations)
    time_t seconds_i;
    seconds_i = time(NULL);
    m = 4; // initial number of intervals
    // convert command-line input to N = number of points
    N = atoi(argv[1]);
    for (i = 0; i <= N/pow(4,i); i++)
    {
        result = int_mcnd(f, a, b, n, m);
        mean = result/(pow(10,10));
        cout << setw(30) << m << setw(30) << result << setw(30) << mean << endl;
        m = m*4;
    }
    // current time in seconds (end of calculations)
    time_t seconds_f;
    seconds_f = time(NULL);
    cout << endl << "total elapsed time = " << seconds_f - seconds_i << " seconds" << endl << endl;
    return 0;
}
and output:
N integral mean_integral
4 62061079725.185936 6.206108
16 33459275100.477665 3.345928
64 -2204654740.788784 -0.220465
256 4347440045.990804 0.434744
1024 -1265056243.116922 -0.126506
4096 681660387.953380 0.068166
16384 -799507050.896809 -0.079951
65536 -462592561.594820 -0.046259
262144 50902035.836772 0.005090
1048576 -91104861.129695 -0.009110
4194304 3746742.588701 0.000375
16777216 -32967862.853915 -0.003297
67108864 17730924.602974 0.001773
268435456 -416824.977687 -0.00004
1073741824 2843188.477219 0.000284
But I think my parallel code is not working at all, and I know I'm doing something silly, of course. Since my number of threads is 4, I wanted to divide the results by 4, but the output is ridiculous.
here is a parallel version of the same code:
/* compile with
$ g++ -fopenmp -Wunknown-pragmas -std=c++11 -o mcOMP parallel_ND_MonteCarlo.cpp -lm
$ ./mcOMP N
unsigned long long int for i, N
Maximum value for UNSIGNED LONG LONG INT 18446744073709551615
*/
#include <iostream>
#include <fstream>
#include <iomanip>
#include <cmath>
#include <cstdlib>
#include <ctime>
#include <omp.h>
using namespace std;
//define multivariate function F(x1, x2, ...xk)
double f(double x[], int n)
{
    double y;
    int j;
    y = 0.0;
    for (j = 0; j < n; j = j+1)
    {
        y = y + x[j];
    }
    return y;
}
//define function for Monte Carlo Multidimensional integration
double int_mcnd(double(*fn)(double[],int), double a[], double b[], int n, int m)
{
    double r, x[n], v;
    int i, j;
    r = 0.0;
    v = 1.0;
    // step 1: calculate the common factor V
    #pragma omp for
    for (j = 0; j < n; j = j+1)
    {
        v = v*(b[j]-a[j]);
    }
    // step 2: integration
    #pragma omp for
    for (i = 1; i <= m; i=i+1)
    {
        // calculate random x[] points
        for (j = 0; j < n; j = j+1)
        {
            x[j] = a[j] + (rand()) / ((RAND_MAX/(b[j]-a[j])));
        }
        r = r + fn(x,n);
    }
    r = r*v/m;
    return r;
}
double f(double[], int);
double int_mcnd(double(*)(double[],int), double[], double[], int, int);
int main(int argc, char **argv)
{
    /* define how many integrals */
    const int n = 10;
    double b[n] = {5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0};
    double a[n] = {-5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0};
    double result, mean;
    int m;
    unsigned long long int i, N;
    int NumThreads = 4;
    // initial seed value (use system time)
    srand(time(NULL));
    cout.precision(6);
    cout.setf(ios::fixed | ios::showpoint);
    // current time in seconds (begin calculations)
    time_t seconds_i;
    seconds_i = time(NULL);
    m = 4; // initial number of intervals
    // convert command-line input to N = number of points
    N = atoi(argv[1]);
    #pragma omp parallel private(result, mean) shared(N, m) num_threads(NumThreads)
    for (i = 0; i <= N/pow(4,i); i++)
    {
        result = int_mcnd(f, a, b, n, m);
        mean = result/(pow(10,10));
        #pragma omp master
        cout << setw(30) << m/4 << setw(30) << result/4 << setw(30) << mean/4 << endl;
        m = m*4;
    }
    // current time in seconds (end of calculations)
    time_t seconds_f;
    seconds_f = time(NULL);
    cout << endl << "total elapsed time = " << seconds_f - seconds_i << " seconds" << endl << endl;
    return 0;
}
I want only the master thread to output the values.
I compiled with:
g++ -fopenmp -Wunknown-pragmas -std=c++11 -o mcOMP parallel_ND_MonteCarlo.cpp -lm
Your help and suggestions to fix the code are most appreciated. Thanks a lot.
Let's see what your program does. At omp parallel, your threads are spawned, and they execute the remaining code in parallel. Operations like:
m = m * 4;
are then undefined (and make no sense anyway, as they are executed four times per iteration).
Further, when those threads encounter an omp for, they share the work of the loop, i.e. each iteration is executed only once, by some thread. Since int_mcnd is executed within a parallel region, all its local variables are private. You have no construct in your code to actually collect those private results (result and mean are private, too).
The correct approach is to use a parallel for loop with a reduction clause, indicating that there are variables (r and v) that are aggregated throughout the execution of the loop.
To allow this, the reduction variables need to be declared as shared, outside the scope of the parallel region. The easiest solution is to move the parallel region inside int_mcnd. This also avoids the race condition on m.
There is one more hurdle: rand uses global state, and at least in my implementation it takes a lock. Since most of the time is spent in rand, your code would scale horribly. The solution is to use explicit thread-private state via rand_r. (See also this question.)
Piecing it together, the modified code looks like this:
double int_mcnd(double (*fn)(double[], int), double a[], double b[], int n, int m)
{
    // Reduction variables need to be shared
    double r = 0.0;
    double v = 1.0;
    #pragma omp parallel
    // All variables declared inside are private
    {
        // step 1: calculate the common factor V
        #pragma omp for reduction(* : v)
        for (int j = 0; j < n; j = j + 1)
        {
            v = v * (b[j] - a[j]);
        }
        // step 2: integration
        unsigned int private_seed = omp_get_thread_num();
        #pragma omp for reduction(+ : r)
        for (int i = 1; i <= m; i = i + 1)
        {
            // Note: x MUST be private, otherwise you have race conditions again
            double x[n];
            // calculate random x[] points
            for (int j = 0; j < n; j = j + 1)
            {
                x[j] = a[j] + (rand_r(&private_seed)) / ((RAND_MAX / (b[j] - a[j])));
            }
            r = r + fn(x, n);
        }
    }
    r = r * v / m;
    return r;
}
double f(double[], int);
double int_mcnd(double (*)(double[], int), double[], double[], int, int);
int main(int argc, char** argv)
{
    /* define how many integrals */
    const int n = 10;
    double b[n] = { 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0, 5.0 };
    double a[n] = { -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0, -5.0 };
    int m;
    unsigned long long int i, N;
    int NumThreads = 4;
    // initial seed value (use system time)
    srand(time(NULL));
    cout.precision(6);
    cout.setf(ios::fixed | ios::showpoint);
    // current time in seconds (begin calculations)
    time_t seconds_i;
    seconds_i = time(NULL);
    m = 4; // initial number of intervals
    // convert command-line input to N = number of points
    N = atoi(argv[1]);
    for (i = 0; i <= N / pow(4, i); i++)
    {
        double result = int_mcnd(f, a, b, n, m);
        double mean = result / (pow(10, 10));
        cout << setw(30) << m << setw(30) << result << setw(30) << mean << endl;
        m = m * 4;
    }
    // current time in seconds (end of calculations)
    time_t seconds_f;
    seconds_f = time(NULL);
    cout << endl << "total elapsed time = " << seconds_f - seconds_i << " seconds" << endl << endl;
    return 0;
}
Note that I removed the division by four, and the output is now done outside of the parallel region. The results should be similar (except for randomness, of course) to the serial version.
I observe perfect 16x speedup on a 16 core system with -O3.
A few more remarks:
Declare variables as locally as possible.
If thread overhead were a problem, you could move the parallel region outside, but then you would need to think more carefully about the parallel execution and find a solution for the shared reduction variables. Given the embarrassingly parallel nature of Monte Carlo codes, you could stick more closely to your initial solution by removing the omp for directives, which means each thread executes all loop iterations. Then you could manually sum up the result variable and print that (see the sketch below). But I don't really see the point.
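For completeness, a rough sketch of that alternative (my code, assuming int_mcnd is the rand_r-based version above so the sampling stays thread-safe):
double total = 0.0;
#pragma omp parallel num_threads(NumThreads) reduction(+ : total)
{
    // every thread runs the full sampler independently;
    // the independent estimates are then averaged
    total += int_mcnd(f, a, b, n, m);
}
double result = total / NumThreads;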
I will not go into details, but will give some pointers on where to look.
Take for example this part of the code:
// step 1: calculate the common factor V
#pragma omp for
for (j = 0; j < n; j = j+1)
{
    v = v*(b[j]-a[j]);
}
If you look at the variable v, there is a clear race condition: every thread updates it concurrently. You have to declare v private to each thread (call it local_v, say) and then gather the partial values into a global_v through a reduction operation, as sketched below.
In general, I would advise you to read up on race conditions, critical regions, and the OpenMP concepts of shared and private memory.
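A small sketch of that pattern (the names local_v and global_v follow the wording above; this is illustrative, not a fix of the full program, and is equivalent to reduction(*:v)):
double global_v = 1.0;
#pragma omp parallel
{
    double local_v = 1.0;          // private partial product per thread
    #pragma omp for nowait
    for (int j = 0; j < n; j = j+1)
        local_v *= (b[j] - a[j]);
    #pragma omp critical           // gather the partial products
    global_v *= local_v;
}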

Simple Thrust code performs about half as fast as my naive cuda kernel. Am I using Thrust wrong?

I'm pretty new to CUDA and Thrust, but my impression was that Thrust, when used well, is supposed to offer better performance than naively written CUDA kernels. Am I using Thrust in a sub-optimal way? Below is a complete, minimal example that takes an array u of length N+2 and, for each i between 1 and N, computes the average 0.5*(u[i-1] + u[i+1]) and puts the result in uNew[i]. (uNew[0] is set to u[0] and uNew[N+1] is set to u[N+1], so that the boundary terms don't change.) The code performs this averaging a large number of times to get reasonable times for timing tests. On my hardware, the Thrust computation takes roughly twice as long as the naive code. Is there a way to improve my Thrust code?
#include <iostream>
#include <thrust/device_vector.h>
#include <boost/timer.hpp>
#include <thrust/device_malloc.h>
typedef double numtype;
template <typename T> class NeighborAverageFunctor{
    int N;
public:
    NeighborAverageFunctor(int _N){
        N = _N;
    }
    template <typename Tuple>
    __host__ __device__ void operator()(Tuple t){
        T uL = thrust::get<0>(t);
        T uR = thrust::get<1>(t);
        thrust::get<2>(t) = 0.5*(uL + uR);
    }
    int getN(){
        return N;
    }
};
template <typename T> void thrust_sweep(thrust::device_ptr<T> u, thrust::device_ptr<T> uNew, NeighborAverageFunctor<T>& op){
    int N = op.getN();
    thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(u, u + 2, uNew + 1)),
                     thrust::make_zip_iterator(thrust::make_tuple(u + N, u + N+2, uNew + N+1)), op);
    // Propagate boundary values without changing them
    uNew[0] = u[0];
    uNew[N+1] = u[N+1];
}
template <typename T> __global__ void initialization_kernel(int n, T* u){
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if(i < n+2){
        if(i == 0){
            u[i] = 1.0;
        }
        else{
            u[i] = 0.0;
        }
    }
}
template <typename T> __global__ void sweep_kernel(int n, T /* hSquared, unused */, T* u, T* uNew){
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i >= 1 && i < n-1){
        uNew[i] = 0.5*(u[i+1] + u[i-1]);
    }
    else if(i == 0 || i == n-1){ // n-1 is the last valid index (the original i == n+1 was never reached)
        uNew[i] = u[i];
    }
}
int main(void){
    int sweeps = 2000;
    int N = 4096*2048;
    numtype h = 1.0/N;
    numtype hSquared = pow(h, 2);
    NeighborAverageFunctor<numtype> op(N);
    thrust::device_ptr<numtype> u_d = thrust::device_malloc<numtype>(N+2);
    thrust::device_ptr<numtype> uNew_d = thrust::device_malloc<numtype>(N+2);
    thrust::device_ptr<numtype> uTemp_d;
    thrust::fill(u_d, u_d + (N+2), 0.0);
    u_d[0] = 1.0;
    boost::timer timer1; // boost/timer.hpp provides boost::timer (not boost::timer::timer)
    for(int k = 0; k < sweeps; k++){
        thrust_sweep<numtype>(u_d, uNew_d, op);
        uTemp_d = u_d;
        u_d = uNew_d;
        uNew_d = uTemp_d;
    }
    double thrust_time = timer1.elapsed();
    thrust::host_vector<numtype> u_h(N+2);
    thrust::copy(u_d, u_d + N+2, u_h.begin());
    for(int i = 0; i < 10; i++){
        std::cout << u_h[i] << " ";
    }
    std::cout << std::endl;
    thrust::device_free(u_d);
    thrust::device_free(uNew_d);
    numtype * u_raw_d, * uNew_raw_d, * uTemp_raw_d;
    cudaMalloc(&u_raw_d, (N+2)*sizeof(numtype));
    cudaMalloc(&uNew_raw_d, (N+2)*sizeof(numtype));
    numtype * u_raw_h = (numtype*)malloc((N+2)*sizeof(numtype));
    int block_size = 256;
    int grid_size = ((N+2) + block_size - 1) / block_size;
    initialization_kernel<numtype><<<grid_size, block_size>>>(N, u_raw_d);
    boost::timer timer2;
    for(int k = 0; k < sweeps; k++){
        sweep_kernel<numtype><<<grid_size, block_size>>>(N+2, hSquared, u_raw_d, uNew_raw_d);
        uTemp_raw_d = u_raw_d;
        u_raw_d = uNew_raw_d;
        uNew_raw_d = uTemp_raw_d;
    }
    double raw_time = timer2.elapsed();
    cudaMemcpy(u_raw_h, u_raw_d, (N+2)*sizeof(numtype), cudaMemcpyDeviceToHost);
    for(int i = 0; i < 10; i++){
        std::cout << u_raw_h[i] << " ";
    }
    std::cout << std::endl;
    std::cout << "Thrust: " << thrust_time << " s" << std::endl;
    std::cout << "Raw: " << raw_time << " s" << std::endl;
    free(u_raw_h);
    cudaFree(u_raw_d);
    cudaFree(uNew_raw_d);
    return 0;
}
According to my testing, these lines:
uNew[0] = u[0];
uNew[N+1] = u[N+1];
are killing your thrust performance relative to the kernel method. When I eliminate them, the results don't seem to be any different. Compared to how your kernel is handling the boundary cases, the thrust code is using a very expensive method (cudaMemcpy operations, under the hood) to perform the boundary handling.
Since your thrust functor never actually writes to the boundary positions, it should be sufficient to write these values only once, rather than in a loop.
You can speed up your Thrust performance significantly by handling the boundary cases better, as sketched below.
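A sketch of that change, based on the question's own variables (u_d, uNew_d, op, N, sweeps); this is my illustration of the suggestion, not verbatim code from the answer:
// set the fixed boundary entries of BOTH buffers once, before the timed loop
u_d[0] = 1.0;   uNew_d[0] = 1.0;
u_d[N+1] = 0.0; uNew_d[N+1] = 0.0;
for (int k = 0; k < sweeps; k++) {
    // interior-only sweep: thrust_sweep without its two boundary assignments
    thrust::for_each(thrust::make_zip_iterator(thrust::make_tuple(u_d, u_d + 2, uNew_d + 1)),
                     thrust::make_zip_iterator(thrust::make_tuple(u_d + N, u_d + N+2, uNew_d + N+1)),
                     op);
    std::swap(u_d, uNew_d); // swapping device_ptr handles is host-side and cheap
}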

Why can I not view the run time (nanoseconds)?

I am trying to measure the run time of my code. The code is my attempt at Project Euler Problem 5. When I try to output the run time, it gives 0 ns.
#include <iostream>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#define MAX_DIVISOR 20
bool isDivisible(long, int);
int main() {
    auto begin = std::chrono::high_resolution_clock::now();
    int d = 2;
    long inc = 1;
    long i = 1;
    while (d < (MAX_DIVISOR + 1)) {
        if ((i % d) == 0) {
            inc = i;
            i = inc;
            d++;
        }
        else {
            i += inc;
        }
    }
    auto end = std::chrono::high_resolution_clock::now();
    printf("Run time: %llu ns\n", (std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count())); // Gives 0 here.
    std::cout << "ANS: " << i << std::endl;
    system("pause");
    return 0;
}
The timing resolution of std::chrono::high_resolution_clock::now() is system dependent.
You can find out its order of magnitude with the small piece of code below (edit: here you have a more accurate version):
chrono::nanoseconds mn(1000000000); // assuming the resolution is higher
for (int i = 0; i < 5; i++) {
    using namespace std::chrono;
    nanoseconds dt;
    long d = 1000 * pow(10, i);
    for (long e = 0; e < 10; e++) {
        long j = d + e*pow(10, i)*100;
        cout << j << " ";
        auto begin = high_resolution_clock::now();
        while (j>0)
            k = ((j-- << 2) + 1) % (rand() + 100);
        auto end = high_resolution_clock::now();
        dt = duration_cast<nanoseconds>(end - begin);
        cout << dt.count() << "ns = "
             << duration_cast<milliseconds>(dt).count() << " ms" << endl;
        if (dt > nanoseconds(0) && dt < mn)
            mn = dt;
    }
}
cout << "Minimum resolution observed: " << mn.count() << "ns\n";
where k is a global volatile long, so that the optimizer cannot interfere too much with the busy loop.
Under Windows, I obtain 15 ms here. Then there are platform-specific alternatives. For Windows, there is a high-performance clock that enables you to measure time below the 10 µs range (see http://msdn.microsoft.com/en-us/library/windows/desktop/dn553408%28v=vs.85%29.aspx), but still not in the nanosecond range.
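For reference, a minimal sketch of that high-performance counter (QueryPerformanceCounter from the WinAPI; the busy loop is just a placeholder workload of my own):
#include <windows.h>
#include <cstdio>
int main() {
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);      // ticks per second
    QueryPerformanceCounter(&t0);
    volatile long sink = 0;                // placeholder workload
    for (long j = 0; j < 1000000; ++j) sink += j;
    QueryPerformanceCounter(&t1);
    double us = (t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart;
    printf("elapsed: %.3f microseconds\n", us);
    return 0;
}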
If you want to time your code very accurately, you could re-execute it in a big loop and divide the total time by the number of iterations.
The estimate you get this way is not precise; a better approach is to measure the CPU time your program consumes, because other processes run concurrently with yours, so the wall time you are trying to measure can be greatly affected by CPU-intensive tasks running in parallel with your process.
So my advice: use an existing profiler if you want to estimate your code's performance.
For your task, if the OS doesn't provide the needed timing precision, you need to increase the total time you are measuring. The easiest way is to run the program n times and compute the average; averaging has the added advantage of smoothing out errors caused by CPU-intensive tasks running concurrently with your process.
Here is a code snippet of how I see the possible implementation:
#include <iostream>
#include <chrono>
#include <cstdio>
#include <cstdlib>
using namespace std;
#define MAX_DIVISOR 20
long doRoutine()
{
    int d = 2;
    long inc = 1;
    long i = 1;
    while (d < (MAX_DIVISOR + 1))
    {
        if ((i % d) == 0) // inlined, as in the question; the original called an isDivisible() that was never defined
        {
            inc = i;
            i = inc;
            d++;
        }
        else
        {
            i += inc;
        }
    }
    return i; // return the answer so main can print it
}
int main() {
    auto begin = std::chrono::high_resolution_clock::now();
    const int nOfTrials = 1000000;
    long ans = 0;
    for (int i = 0; i < nOfTrials; ++i)
        ans = doRoutine();
    auto end = std::chrono::high_resolution_clock::now();
    // average over all trials; no longer rounds down to 0 ns
    printf("Run time: %llu ns\n",
           (unsigned long long)(std::chrono::duration_cast<std::chrono::nanoseconds>(end - begin).count() / nOfTrials));
    std::cout << "ANS: " << ans << std::endl;
    system("pause");
    return 0;
}