Assertion error in a simple C++ program using boost::odeint - c++

I'm sorry if this is immediately obvious, but I am very new to C++ coming from a Python / MATLAB / Mathematica background. I've written a simple solver for the classic 1D heat equation using a finite difference spatial discretization in order to play around with the capabilities of the Odeint library and compare the performance with other libraries. The code should be quite self-explanatory:
#include <iostream>
#include <boost/math/constants/constants.hpp>
#include <boost/array.hpp>
#include <boost/numeric/odeint.hpp>
using namespace std;
using namespace boost::numeric::odeint;
const double a_sq = 1;
const int p = 10;
const double h = 1 / p;
double pi = boost::math::constants::pi<double>();
typedef boost::array<double, p> state_type;
void heat_equation(const state_type &x, state_type &dxdt, double t)
{
int i;
for (i=1; i<p; i++)
{
dxdt[i] = a_sq * (dxdt[i+1] - 2*dxdt[i] + dxdt[i-1]) / h / h;
}
dxdt[0] = 0;
dxdt[p] = 0;
}
void initial_conditions(state_type &x)
{
int i;
for (i=0; i<=p; i++)
{
x[i] = sin(pi*i*h);
}
}
void write_heat_equation(const state_type &x, const double t)
{
cout << t << '\t' << x[0] << '\t' << x[p] << '\t' << endl;
}
int main()
{
state_type x;
initial_conditions(x);
integrate(heat_equation, x, 0.0, 10.0, 0.1, write_heat_equation);
}
This compiles just fine on Ubuntu 14.04 using g++ 4.8.2 and the latest boost library from the Ubuntu repository. When I run the resulting executable, however, I get the following error:
***** Internal Program Error - assertion (i < N) failed in T& boost::array<T, N>::operator[](boost::array<T, N>::size_type) [with T = double; long unsigned int N = 10ul; boost::array<T, N>::reference = double&; boost::array<T, N>::size_type = long unsigned int]:
/usr/include/boost/array.hpp(123): out of range
Aborted (core dumped)
Unfortunately, this isn't particularly helpful to my novice brain and I'm at a loss as to how to fix this. What's causing the error?

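Array and vector indices start at zero, so a container with N elements has valid indices 0 through N-1. Your state has p elements, which means dxdt[p] is out of range, and so is dxdt[i+1] when i reaches p-1. Change the for loop to run from 1 to p-2 and replace dxdt[p] = 0; with dxdt[p-1] = 0;. The same applies to the i<=p loop in initial_conditions and to x[p] in write_heat_equation. Note also that the right-hand side should read from x rather than from dxdt, and that 1 / p is integer division (it evaluates to 0), so h needs a floating-point division such as 1.0 / p:
for (i=1; i<p-1; i++)
{
dxdt[i] = a_sq * (x[i+1] - 2*x[i] + x[i-1]) / h / h;
}
dxdt[0] = 0;
dxdt[p-1] = 0;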
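To make the index bounds concrete, here is a minimal sketch of the right-hand side and the initial condition with every access kept inside 0..p-1. It uses std::array instead of boost::array so that it compiles without Boost, and it assumes the p grid points span [0, 1] (hence h = 1/(p-1)); that grid choice is my assumption, not something stated in the question.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t p = 10;          // number of grid points, as in the question
constexpr double a_sq = 1.0;
constexpr double h = 1.0 / (p - 1);    // assumption: the p points cover [0, 1]
using state_type = std::array<double, p>;

// Finite-difference right-hand side; the boundary points are held fixed.
void heat_equation(const state_type& x, state_type& dxdt, double /*t*/)
{
    for (std::size_t i = 1; i + 1 < p; ++i) {   // interior points 1 .. p-2
        dxdt[i] = a_sq * (x[i + 1] - 2.0 * x[i] + x[i - 1]) / (h * h);
    }
    dxdt[0] = 0.0;
    dxdt[p - 1] = 0.0;                          // last valid index is p-1
}

// Initial profile sin(pi * x) sampled on the grid.
void initial_conditions(state_type& x)
{
    const double pi = std::acos(-1.0);
    for (std::size_t i = 0; i < p; ++i) {       // i < p, not i <= p
        x[i] = std::sin(pi * i * h);
    }
}
```

The same two functions should be usable with integrate as in the question, provided write_heat_equation prints x[p-1] instead of x[p].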

Related

Armadillo C++ bad performance ifft

I have the following test code:
#include <iostream>
#define ARMA_DONT_USE_WRAPPER
#include <armadillo>
using namespace std::complex_literals;
int main()
{
arma::cx_mat testMat { };
testMat.set_size(40, 19586);
auto nPositions = static_cast<arma::sword>(floor(19586/2));
arma::cx_rowvec a_vec {19586, arma::fill::randu};
arma::cx_rowvec b_vec {19586, arma::fill::randu};
arma::cx_rowvec c_vec {19586, arma::fill::randu};
for (size_t nCo=0; nCo < 3; nCo++) {
arma::rowvec d {19586, arma::fill::randu};
for(size_t iDop = 0; iDop < 40; ++iDop)
{
arma::cx_rowvec signalFi = (b_vec % arma::exp(-1i*M_PI*a_vec));
testMat.row(iDop) += arma::ifft(arma::shift(arma::fft(signalFi), nPositions).eval() % c_vec).eval();
}
}
return 0;
}
I am trying to perform some computation.
StopWatch reports around 300 ms for each iteration, which is too slow for my needs.
Can someone explain what I am doing wrong, or suggest some tricks to improve the performance?
I used .eval() to force 'eager' evaluation.
gcc 11.2
armadillo 10.8.2
Release mode, -O3
Updated version: is it possible to redesign the ifft function?
Test code:
#include <iostream>
#include <fftw3.h>
#include <armadillo>
#include "StopWatch.h"
using namespace std;
inline arma::cx_mat ifftshift(arma::cx_mat const &axx)
{
return arma::shift(axx, -ceil(axx.n_rows/2), 0);
}
void ifft(arma::cx_mat &inMat, arma::cx_mat &outMat)
{
size_t N = inMat.n_rows;
size_t n_cols = inMat.n_cols;
for (size_t index = 0; index < n_cols; ++index)
{
fftw_complex *in1 = reinterpret_cast<fftw_complex *>(inMat.colptr(index));
fftw_complex *out1 = reinterpret_cast<fftw_complex *>(outMat.colptr(index));
fftw_plan pl_ifft_cx1 = fftw_plan_dft_1d(N, in1, out1, FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute_dft(pl_ifft_cx1, in1, out1);
}
outMat /= N;
}
int main()
{
arma::cx_mat B;
B << std::complex<double>(+1.225e-01,+8.247e-01) << std::complex<double>(+4.078e-01,+5.632e-01) << std::complex<double>(+8.866e-01,+8.386e-01) << arma::endr
<< std::complex<double>(+5.958e-01,+1.015e-01) << std::complex<double>(+7.857e-01,+4.267e-01) << std::complex<double>(+7.997e-01,+9.176e-01) << arma::endr
<< std::complex<double>(+1.877e-01,+3.378e-01) << std::complex<double>(+2.921e-01,+9.651e-01) << std::complex<double>(+1.056e-01,+6.901e-01) << arma::endr
<< std::complex<double>(+2.322e-01,+6.990e-01) << std::complex<double>(+1.547e-01,+4.256e-01) << std::complex<double>(+9.094e-01,+1.194e-01) << arma::endr
<< std::complex<double>(+3.917e-01,+3.886e-01) << std::complex<double>(+2.166e-01,+4.962e-01) << std::complex<double>(+9.777e-01,+4.464e-01) << arma::endr;
arma::cx_mat output(5,3);
arma::cx_mat shifted = ifftshift(B);
arma::cx_mat arma_result = arma::ifft(shifted);
B.print("B");
arma_result.print("arma_result");
ifft(shifted, output);
output.print("output");
return 0;
}
I just tried a similar operation with my own library and, according to my measurements, you are correct that each iteration of the loop shouldn't take more than 1 millisecond (instead of 300 ms).
This is the equivalent code. Sorry that this is not an Armadillo answer; I am just pointing out the concrete goals, which are minimizing operations and allocations.
#include<multi/adaptors/fftw.hpp>
#include<multi/array.hpp>
namespace fftw = multi::fftw;
int main() {
multi::array<std::complex<double>, 1> const arr = n_random_complex<double>(19586);
multi::array<std::complex<double>, 1> res(arr.extensions()); // output allocated only once
fftw::plan fdft{arr, res, fftw::forward}; // fftw plan and internal buffers allocated only once
auto const N = 40;
for(int i = 0; i != N; ++i) { // each iteration takes ~1ms in an intel-i7
fdft(arr.base(), res.base()); // fft operation with precalculated plan
std::rotate(res.begin(), res.begin() + res.size()/2, res.end()); // rotation (shift on size/2) done in place, no allocation either
}
}
The full code and library is here: https://gitlab.com/correaa/boost-multi/-/blob/master/adaptors/fftw/test/shift.cpp#L45-58 (the extra code is for the timing measurement).
What is also telling is that I tried to make every possible mistake to pessimize the code, to mimic what I think Armadillo is doing "wrong": allocating inside the loop and making copies all the time. Even then, each iteration takes about 1.5 milliseconds.
My conclusion is that something is terribly wrong in your Armadillo usage or in the library itself. The pessimized loop:
multi::array<std::complex<double>, 1> const arr = n_random_complex<double>(19586); BOOST_REQUIRE(arr.size() == 19586);
auto const N = 40;
for(int i = 0; i != N; ++i) {
multi::array<std::complex<double>, 1> res(arr.extensions(), 0.);
fftw::plan fdft{arr, res, fftw::forward};
fdft(arr.base(), res.base());
multi::array<std::complex<double>, 1> res_copy(arr.extensions(), 0.);
std::rotate_copy(res.begin(), res.begin() + res.size()/2, res.end(), res_copy.begin());
}

Checking which of the modules is the closest

Hello. I am given an array of numbers for which I need to calculate the average (that part is done), but then I have to find the array element that is closest to the average (in absolute value). The code is pasted below (the form of main() is imposed):
#include <iostream>
using namespace std;
double* aver(double* arr, size_t size, double& average){
double count;
for(int p = 0; p < size; p++)
count += arr[p];
count /= size;
double * pointer;
pointer = &count;
average = *pointer;
}
int main() {
double arr[] = {1,2,3,4,5,7};
size_t size = sizeof(arr)/sizeof(arr[0]);
double average = 0;
double* p = aver(arr,size,average);
cout << p << " " << average << endl;
}
The program should give a result
4 3.66667
I have no idea how to check which element is nearest to another, and substitute it into *p
I will be very grateful for any help.
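Okay, this is not the answer to your problem, since you already got a couple of them.
How about trying something new?
Use std::accumulate, std::sort and std::partition to achieve the same goal. After sorting, std::partition returns a pointer to the first element that is not less than the average; the closest element is either that one or the one just before it, so both are compared.
#include <algorithm>
#include <numeric> // std::accumulate
//...
struct comp
{
double avg;
comp(double x):avg(x){}
bool operator()(const double &x) const
{
return x < avg;
}
};
std::sort(arr,arr+size);
average =std::accumulate(arr, arr+size, 0.0) / size;
double *p= std::partition(arr, arr+size, comp(average)); // first element >= average
// step back if there is no such element or the predecessor is at least as close
if (p == arr+size || (p != arr && average - *(p-1) <= *p - average)) --p;
std::cout<<"Average :"<<average <<" Closest : "<<*p<<std::endl;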
This algorithm is based on the fact that std::map keeps its elements sorted (using operator<):
#include <map>
#include <iostream>
#include <math.h>
using namespace std;
double closest_to_avg(double* arr, size_t size, double avg) {
std::map<double,double> disturbances;
for(int p = 0; p < size; p++) {
disturbances[fabs(avg-arr[p])]=arr[p]; //if two elements are equally
} //distant from avg we take
return disturbances.begin()->second; //a new one
}
Since everybody is doing the kids homework...
#include <iostream>
using namespace std;
double min(double first, double second){
return first < second ? first : second;
}
double abs(double first){
return 0 < first ? first : -first;
}
double* aver(double* arr, size_t size, double& average){
double count = 0; // must be initialized before accumulating
for(int p = 0; p < size; p++)
count += arr[p];
average = count/size;
int closest_index = 0;
for(int p = 0; p < size; p++)
if( abs(arr[p] - average) <
abs(arr[closest_index] - average) )
closest_index = p;
return &arr[closest_index];
}
int main() {
double arr[] = {1,2,3,4,5,7};
size_t size = sizeof(arr)/sizeof(arr[0]);
double average = 0;
double* p = aver(arr,size,average);
cout << *p << " " << average << endl;
//Above ^^ gives the expected behavior,
//Without it you'll get nothing but random memory
}
I insist that you need the * before the p: it gives the value that the pointer is pointing to. Without the *, what gets printed is the address of the memory location, which is indeterminate in this case. Ask your professor/teacher whether the specification is correct, because it isn't.
Try to understand the style and functions involved. It isn't complicated, and writing like this can go a long way toward making your grader's job easier.
Also, that interface is a very leaky one; in real work, consider some of the standard library algorithms and containers instead.
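As one illustration of the standard-library route, here is a minimal sketch that finds the element closest to the mean with std::accumulate and std::min_element. The function name closest_to_average and the use of std::vector are my own choices for the example; the imposed main() in the question uses a raw array instead.

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Hypothetical helper: writes the mean into `average` and returns an iterator
// to the element whose distance to the mean is smallest (v.end() if v is empty).
std::vector<double>::const_iterator
closest_to_average(const std::vector<double>& v, double& average)
{
    average = v.empty() ? 0.0
                        : std::accumulate(v.begin(), v.end(), 0.0) / v.size();
    const double avg = average;
    // Order elements by their absolute distance to the mean.
    return std::min_element(v.begin(), v.end(),
        [avg](double a, double b) {
            return std::fabs(a - avg) < std::fabs(b - avg);
        });
}
```

Unlike the sort-based variants, this leaves the input untouched and makes a single pass for the sum and a single pass for the search.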

Rounding errors giving incorrect results in DFT?

I have been beating my head against the wall on this DFT. It should print out: 8,0,0,0,0,0,0,0 but instead I get 8 and then very very tiny numbers. Are these rounding errors? Is there anything I can do? My Radix2 FFT gives correct results, it seems silly a DFT could not also work.
I started with complex numbers so I know there is a good bit missing, I tried to strip it down to illustrate the problem.
#include <cstdlib>
#include <math.h>
#include <iostream>
#include <complex>
#include <cassert>
#define SIZE 8
#define M_PI 3.14159265358979323846
void fft(const double src[], double dst[], const unsigned int n)
{
for(int i=0; i < SIZE; i++)
{
const double ph = -(2*M_PI) / n;
const int gid = i;
double res = 0.0f;
for (int k = 0; k < n; k++) {
double t = src[k];
const double val = ph * k * gid;
double cs = cos(val);
double sn = sin(val);
res += ((t * cs) - (t * sn));
int a = 1;
}
dst[i] = res;
std::cout << dst[i] << std::endl;
}
}
int main(void)
{
double array1[SIZE];
double array2[SIZE];
for(int i=0; i < SIZE; i++){
array1[i] = 1;
array2[i] = 0;
}
fft(array1, array2, SIZE);
return 666;
}
An FFT can actually produce more accurate results than a straight DFT calculation, as the fewer arithmetic ops usually allow fewer opportunities for arithmetic quantization errors to accumulate. There's a paper by one of the FFTW authors on this topic.
Since the DFT/FFT deal with a transcendental basis function, the results will never (except perhaps in a few special cases, or by lucky accident) be exactly correct using any non-symbolic and finite computer number format. So values very close (within a few LSB) to zero should simply be ignored as noise, or considered to be the same as zero.
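To illustrate treating near-zero values as zero, here is a minimal sketch: a naive complex DFT followed by a step that zeroes any component that is tiny relative to the largest magnitude in the output. The function names and the relative tolerance of 1e-12 are assumptions chosen for the example, not a recommended universal threshold.

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

// Naive O(n^2) DFT of a real input sequence.
std::vector<std::complex<double>> naive_dft(const std::vector<double>& src)
{
    const std::size_t n = src.size();
    const double pi = std::acos(-1.0);
    std::vector<std::complex<double>> dst(n);
    for (std::size_t i = 0; i < n; ++i) {
        std::complex<double> sum(0.0, 0.0);
        for (std::size_t k = 0; k < n; ++k) {
            const double angle =
                -2.0 * pi * static_cast<double>(i * k) / static_cast<double>(n);
            sum += src[k] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        dst[i] = sum;
    }
    return dst;
}

// Zero every real or imaginary part smaller than rel_tol times the
// largest magnitude found in v.
void snap_to_zero(std::vector<std::complex<double>>& v, double rel_tol = 1e-12)
{
    double scale = 0.0;
    for (const auto& z : v) scale = std::max(scale, std::abs(z));
    const double cutoff = rel_tol * scale;
    for (auto& z : v) {
        const double re = std::fabs(z.real()) < cutoff ? 0.0 : z.real();
        const double im = std::fabs(z.imag()) < cutoff ? 0.0 : z.imag();
        z = std::complex<double>(re, im);
    }
}
```

For an input of eight ones, the first bin stays at 8 and the remaining bins, which come out as rounding noise, are snapped to zero.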

Multi-threaded Simulated Annealing

I wrote a multithreaded simulated annealing program, but it's not running. I am not sure whether the code is correct. It compiles, but when I run it, it crashes with a run-time error.
#include <stdio.h>
#include <time.h>
#include <iostream>
#include <stdlib.h>
#include <math.h>
#include <string>
#include <vector>
#include <algorithm>
#include <fstream>
#include <ctime>
#include <windows.h>
#include <process.h>
using namespace std;
typedef vector<double> Layer; //defines a vector type
typedef struct {
Layer Solution1;
double temp1;
double coolingrate1;
int MCL1;
int prob1;
}t;
//void SA(Layer Solution, double temp, double coolingrate, int MCL, int prob){
double Rand_NormalDistri(double mean, double stddev) {
//Random Number from Normal Distribution
static double n2 = 0.0;
static int n2_cached = 0;
if (!n2_cached) {
// choose a point x,y in the unit circle uniformly at random
double x, y, r;
do {
// scale two random integers to doubles between -1 and 1
x = 2.0*rand()/RAND_MAX - 1;
y = 2.0*rand()/RAND_MAX - 1;
r = x*x + y*y;
} while (r == 0.0 || r > 1.0);
{
// Apply Box-Muller transform on x, y
double d = sqrt(-2.0*log(r)/r);
double n1 = x*d;
n2 = y*d;
// scale and translate to get desired mean and standard deviation
double result = n1*stddev + mean;
n2_cached = 1;
return result;
}
} else {
n2_cached = 0;
return n2*stddev + mean;
}
}
double FitnessFunc(Layer x, int ProbNum)
{
int i,j,k;
double z;
double fit = 0;
double sumSCH;
if(ProbNum==1){
// Ellipsoidal function
for(j=0;j< x.size();j++)
fit+=((j+1)*(x[j]*x[j]));
}
else if(ProbNum==2){
// Schwefel's function
for(j=0; j< x.size(); j++)
{
sumSCH=0;
for(i=0; i<j; i++)
sumSCH += x[i];
fit += sumSCH * sumSCH;
}
}
else if(ProbNum==3){
// Rosenbrock's function
for(j=0; j< x.size()-1; j++)
fit += 100.0*(x[j]*x[j] - x[j+1])*(x[j]*x[j] - x[j+1]) + (x[j]-1.0)*(x[j]-1.0);
}
return fit;
}
double probl(double energychange, double temp){
double a;
a= (-energychange)/temp;
return double(min(1.0,exp(a)));
}
int random (int min, int max){
int n = max - min + 1;
int remainder = RAND_MAX % n;
int x;
do{
x = rand();
}while (x >= RAND_MAX - remainder);
return min + x % n;
}
//void SA(Layer Solution, double temp, double coolingrate, int MCL, int prob){
void SA(void *param){
t *args = (t*) param;
Layer Solution = args->Solution1;
double temp = args->temp1;
double coolingrate = args->coolingrate1;
int MCL = args->MCL1;
int prob = args->prob1;
double Energy;
double EnergyNew;
double EnergyChange;
Layer SolutionNew(50);
Energy = FitnessFunc(Solution, prob);
while (temp > 0.01){
for ( int i = 0; i < MCL; i++){
for (int j = 0 ; j < SolutionNew.size(); j++){
SolutionNew[j] = Rand_NormalDistri(5, 1);
}
EnergyNew = FitnessFunc(SolutionNew, prob);
EnergyChange = EnergyNew - Energy;
if(EnergyChange <= 0){
Solution = SolutionNew;
Energy = EnergyNew;
}
if(probl(EnergyChange ,temp ) > random(0,1)){
//cout<<SolutionNew[i]<<endl;
Solution = SolutionNew;
Energy = EnergyNew;
cout << temp << "=" << Energy << endl;
}
}
temp = temp * coolingrate;
}
}
int main ()
{
srand ( time(NULL) ); //seed for getting different numbers each time the prog is run
Layer SearchSpace(50); //declare a vector of 20 dimensions
//for(int a = 0;a < 10; a++){
for (int i = 0 ; i < SearchSpace.size(); i++){
SearchSpace[i] = Rand_NormalDistri(5, 1);
}
t *arg1;
arg1 = (t *)malloc(sizeof(t));
arg1->Solution1 = SearchSpace;
arg1->temp1 = 1000;
arg1->coolingrate1 = 0.01;
arg1->MCL1 = 100;
arg1->prob1 = 3;
//cout << "Test " << ""<<endl;
_beginthread( SA, 0, (void*) arg1);
Sleep( 100 );
//SA(SearchSpace, 1000, 0.01, 100, 3);
//}
return 0;
}
Please help.
Thanks
Avinesh
As leftaroundabout pointed out, you're using malloc in C++ code. This is the source of your crash.
malloc allocates a block of memory, but since it was designed for C, it doesn't call any C++ constructors. In this case, the vector<double> member is never constructed. When
arg1->Solution1 = SearchSpace;
is executed, the member variable "Solution1" is in an undefined state and the assignment operator crashes.
Instead of malloc try
arg1 = new t;
This will accomplish roughly the same thing but the "new" keyword also calls any necessary constructors to ensure the vector<double> is properly initialized.
This also brings up another minor issue: the memory you've allocated with new also needs to be deleted somewhere. In this case, since arg1 is passed to another thread, it should probably be cleaned up like
delete args;
by your "SA" function after it's done with the args variable.
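A minimal sketch of that ownership hand-off, under some assumptions: AnnealArgs is a hypothetical stand-in for the struct t in the question, worker stands in for SA, and g_last_solution_size exists only so the example has something observable. The argument block is created with new (so the vector is constructed), passed as a raw void*, and the receiving function takes ownership so the block is deleted when it finishes.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for the struct `t` in the question.
struct AnnealArgs {
    std::vector<double> solution;
    double temp = 0.0;
    double coolingrate = 0.0;
    int mcl = 0;
    int prob = 0;
};

// Demo-only global so the effect of worker() can be observed.
std::size_t g_last_solution_size = 0;

// Builds the argument block with new, which runs the vector's constructor.
std::unique_ptr<AnnealArgs> make_args(const std::vector<double>& start)
{
    std::unique_ptr<AnnealArgs> args(new AnnealArgs);
    args->solution = start;   // safe: the vector member is fully constructed
    args->temp = 1000;
    args->coolingrate = 0.01;
    args->mcl = 100;
    args->prob = 3;
    return args;
}

// Entry point with the void(void*) shape a C-style thread API expects.
void worker(void* param)
{
    // Take ownership back; the block is deleted when `args` goes out of scope.
    std::unique_ptr<AnnealArgs> args(static_cast<AnnealArgs*>(param));
    g_last_solution_size = args->solution.size();
}
```

The caller would pass args.release() as the void* argument, so that exactly one side owns the block at any time.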
While I don't know the actual cause for your crashes I'm not really surprised that you end up in trouble. For instance, those "cached" static variables in Rand_NormalDistri are obviously vulnerable to data races. Why don't you use std::normal_distribution? It's almost always a good idea to use standard library routines when they're available, and even more so when you need to consider multithreading trickiness.
Even worse, you're heavily mixing C and C++. malloc is something you should virtually never use in C++ code – it doesn't know about RAII, which is one of the few intrinsically safe things you can cling onto in C++.
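For reference, a minimal sketch of what a std::normal_distribution replacement for Rand_NormalDistri could look like. It keeps one engine per thread via thread_local, so there is no shared static state to race on. The function name rand_normal and the fixed seed are assumptions for the example; real code would more likely seed each thread from std::random_device.

```cpp
#include <random>

// Draws one sample from a normal distribution with the given mean and
// standard deviation, using an engine that is private to the calling thread.
double rand_normal(double mean, double stddev)
{
    // Assumption: a fixed seed, chosen only to keep the example reproducible.
    thread_local std::mt19937 engine(12345u);
    std::normal_distribution<double> dist(mean, stddev);
    return dist(engine);
}
```

A call such as rand_normal(5, 1) would take the place of Rand_NormalDistri(5, 1) in the question's loops.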

Generating random numbers with uniform distribution using Thrust

I need to generate a vector with random numbers between 0.0 and 1.0 using Thrust. The only documented example I could find produces very large random numbers (thrust::generate(myvector.begin(), myvector.end(), rand)).
I'm sure the answer is simple, but I would appreciate any suggestions.
Thrust has random generators you can use to produce sequences of random numbers. To use them with a device vector you will need to create a functor which returns a different element of the random generator sequence. The most straightforward way to do this is using a transformation of a counting iterator. A very simple complete example (in this case generating random single precision numbers between 1.0 and 2.0) could look like:
#include <thrust/random.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
struct prg
{
float a, b;
__host__ __device__
prg(float _a=0.f, float _b=1.f) : a(_a), b(_b) {};
__host__ __device__
float operator()(const unsigned int n) const
{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(a, b);
rng.discard(n);
return dist(rng);
}
};
int main(void)
{
const int N = 20;
thrust::device_vector<float> numbers(N);
thrust::counting_iterator<unsigned int> index_sequence_begin(0);
thrust::transform(index_sequence_begin,
index_sequence_begin + N,
numbers.begin(),
prg(1.f,2.f));
for(int i = 0; i < N; i++)
{
std::cout << numbers[i] << std::endl;
}
return 0;
}
In this example, the functor prg takes the lower and upper bounds of the random number as arguments, with (0.f,1.f) as the default. Note that in order to get a different vector each time you call the transform operation, you should use a counting iterator initialised to a different starting value.
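The effect of the starting value can be seen in a host-only analogue written with the standard library rather than Thrust. This is a sketch under assumptions: prg_host and fill_uniform are hypothetical names, and std::minstd_rand is used here as a plain stand-in engine. Each element n is produced by discarding n values from a freshly constructed engine, so the offset decides which part of the sequence ends up in the vector.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Host-only analogue of the prg functor, using standard-library types.
struct prg_host {
    float a, b;
    explicit prg_host(float a_ = 0.f, float b_ = 1.f) : a(a_), b(b_) {}

    float operator()(unsigned int n) const
    {
        std::minstd_rand rng;                             // same default seed every call
        std::uniform_real_distribution<float> dist(a, b);
        rng.discard(n);                                   // jump to position n in the sequence
        return dist(rng);
    }
};

// Fills a vector with the values at positions offset, offset+1, ...
std::vector<float> fill_uniform(std::size_t count, unsigned int offset,
                                float a = 0.f, float b = 1.f)
{
    std::vector<float> out(count);
    const prg_host gen(a, b);
    for (std::size_t i = 0; i < count; ++i) {
        out[i] = gen(offset + static_cast<unsigned int>(i));
    }
    return out;
}
```

Calling fill_uniform(20, 0) twice gives the same vector, whereas fill_uniform(20, 20) continues where the first call stopped.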
It might not be a direct answer to your question, but the cuRAND library is quite powerful in this context. You can generate random numbers on both the GPU and the CPU, and it provides many distributions (normal distribution, etc.).
Search for the title: "An NVIDIA CURAND implementation" on this link: http://adnanboz.wordpress.com/tag/nvidia-curand/
//Create a new generator
curandCreateGenerator(&m_prng, CURAND_RNG_PSEUDO_DEFAULT);
//Set the generator options
curandSetPseudoRandomGeneratorSeed(m_prng, (unsigned long) mainSeed);
//Generate random numbers
curandGenerateUniform(m_prng, d_randomData, dataCount);
One note: do not create the generator again and again, since creation involves some precalculation. Calling curandGenerateUniform is quite fast and produces values between 0.0 and 1.0.
The approach suggested by @talonmies has a number of useful characteristics. Here's another approach that mimics the example you quoted:
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <iostream>
#define DSIZE 5
__host__ static __inline__ float rand_01()
{
return ((float)rand()/RAND_MAX);
}
int main(){
thrust::host_vector<float> h_1(DSIZE);
thrust::generate(h_1.begin(), h_1.end(), rand_01);
std::cout<< "Values generated: " << std::endl;
for (unsigned i=0; i<DSIZE; i++)
std::cout<< h_1[i] << " : ";
std::cout<<std::endl;
return 0;
}
Similar to the example you quoted, this uses rand(), and therefore can only be used to generate host vectors. Likewise, it will produce the same sequence each time unless you re-seed rand() appropriately.
There are already satisfactory answers to this question. In particular, the OP and Robert Crovella have dealt with thrust::generate while talonmies has proposed using thrust::transform.
I think there is another possibility, namely, using thrust::for_each, so I'm posting a fully worked example using such a primitive, just for the record.
I'm also timing the different solutions.
THE CODE
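#include <iostream>
#include <cstdio>   // printf
#include <cstdlib>  // rand, RAND_MAX
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <thrust/for_each.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <thrust/random.h>
#include "TimingCPU.h"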
/**************************************************/
/* RANDOM NUMBERS GENERATION STRUCTS AND FUNCTION */
/**************************************************/
template<typename T>
struct rand_01 {
__host__ T operator()(T& VecElem) const { return (T)rand() / RAND_MAX; }
};
template<typename T>
struct rand_01_for_each {
__host__ void operator()(T& VecElem) const { VecElem = (T)rand() / RAND_MAX; }
};
template<typename T>
__host__ T rand_01_fcn() { return ((T)rand() / RAND_MAX); }
struct prg
{
float a, b;
__host__ __device__
prg(float _a = 0.f, float _b = 1.f) : a(_a), b(_b) {};
__host__ __device__
float operator()(const unsigned int n) const
{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(a, b);
rng.discard(n);
return dist(rng);
}
};
/********/
/* MAIN */
/********/
int main() {
TimingCPU timerCPU;
const int N = 2 << 18;
//const int N = 64;
const int numIters = 50;
thrust::host_vector<double> h_v1(N);
thrust::host_vector<double> h_v2(N);
thrust::host_vector<double> h_v3(N);
thrust::host_vector<double> h_v4(N);
printf("N = %d\n", N);
double timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::transform(thrust::host, h_v1.begin(), h_v1.end(), h_v1.begin(), rand_01<double>());
timing = timing + timerCPU.GetCounter();
}
printf("Timing using transform = %f\n", timing / numIters);
timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::counting_iterator<unsigned int> index_sequence_begin(0);
thrust::transform(index_sequence_begin,
index_sequence_begin + N,
h_v2.begin(),
prg(0.f, 1.f));
timing = timing + timerCPU.GetCounter();
}
printf("Timing using transform and internal Thrust random generator = %f\n", timing / numIters);
timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::for_each(h_v3.begin(), h_v3.end(), rand_01_for_each<double>());
timing = timing + timerCPU.GetCounter();
}
timerCPU.StartCounter();
printf("Timing using for_each = %f\n", timing / numIters);
//std::cout << "Values generated: " << std::endl;
//for (int k = 0; k < N; k++)
// std::cout << h_v3[k] << " : ";
//std::cout << std::endl;
timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::generate(h_v4.begin(), h_v4.end(), rand_01_fcn<double>);
timing = timing + timerCPU.GetCounter();
}
timerCPU.StartCounter();
printf("Timing using generate = %f\n", timing / numIters);
//std::cout << "Values generated: " << std::endl;
//for (int k = 0; k < N; k++)
// std::cout << h_v4[k] << " : ";
//std::cout << std::endl;
//std::cout << "Values generated: " << std::endl;
//for (int k = 0; k < N * 2; k++)
// std::cout << h_v[k] << " : ";
//std::cout << std::endl;
return 0;
}
On a laptop Core i5 platform, I had the following timings
N = 2097152
Timing using transform = 33.202298
Timing using transform and internal Thrust random generator = 264.508662
Timing using for_each = 33.155237
Timing using generate = 35.309399
The timings are equivalent, apart from the second one which uses Thrust's internal random number generator instead of rand().
Please note that, unlike the other solutions, the one using thrust::generate is somewhat more rigid, since the function used to generate the random numbers is called with no input arguments. So, for example, it is not possible to scale an input argument by a constant.
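One possible way around that rigidity, sketched here with std::generate from the standard library rather than thrust::generate, is to store the parameters in a function object instead of passing them per call. The names scaled_uniform and make_random, and the default seed, are assumptions for the example; I have only checked this pattern on the host with the standard algorithm.

```cpp
#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

// Generator with no call arguments: the range and the seed live in the object.
struct scaled_uniform {
    std::mt19937 engine;
    std::uniform_real_distribution<double> dist;

    scaled_uniform(double lo, double hi, unsigned int seed)
        : engine(seed), dist(lo, hi) {}

    double operator()() { return dist(engine); }
};

// Fills a vector of n values drawn uniformly from [lo, hi].
std::vector<double> make_random(std::size_t n, double lo, double hi,
                                unsigned int seed = 42u)
{
    std::vector<double> v(n);
    std::generate(v.begin(), v.end(), scaled_uniform(lo, hi, seed));
    return v;
}
```

The generator is still invoked without arguments, so it cannot see the element it overwrites; it only carries its own configuration.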