CUDA: how to do a matrix multiplication using thrust?

I'm new to CUDA and Thrust and I'm trying to implement a matrix multiplication. I want to achieve this using only Thrust algorithms, because I want to avoid writing a kernel manually.
Is there a way I can achieve this efficiently? (At least without using 2 nested for loops.)
Or do I have to give up and call a CUDA kernel?
//My data
thrust::device_vector<float> data(n*m);
thrust::device_vector<float> other(m*r);
thrust::device_vector<float> result(n*r);
// To make indexing faster, not really needed
transpose(other);
// My current approach
for (int i = 0; i < n; ++i)
{
    for (int j = 0; j < r; ++j)
    {
        result[i*r + j] = thrust::inner_product(data.begin()+(i*m), data.begin()+((i+1)*m), other.begin()+(j*m), 0.0f);
    }
}

If you are interested in performance (usually why people use GPUs for computing tasks) you should not use thrust and you should not call or write your own CUDA kernel. You should use the CUBLAS library. For a learning exercise, if you want to write your own CUDA kernel, you can refer to a first-level-optimized version in the shared memory section of the CUDA programming guide. If you really want to do it with a single thrust call, it is possible.
The basic idea is to use an element-wise operation like thrust::transform as described here. The per-output-array-element dot product is computed with a functor consisting of a loop.
Here's a worked example comparing 3 methods: your original double-nested-loop method (relatively slow), a single-thrust-call method (faster), and the cublas method (fastest, certainly for larger matrix sizes). The code below only runs method 1 for matrix side dimensions of 200 or less, because it is so slow. Here is an example run on a Tesla P100:
$ cat t463.cu
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/inner_product.h>
#include <thrust/execution_policy.h>
#include <thrust/equal.h>
#include <thrust/fill.h>
#include <thrust/iterator/constant_iterator.h>
#include <thrust/iterator/counting_iterator.h>
#include <cublas_v2.h>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#include <cstdlib>
#define USECPSEC 1000000ULL
long long dtime_usec(unsigned long long start){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
struct dp
{
    float *A, *B;
    int m, n, r;
    dp(float *_A, float *_B, int _m, int _n, int _r): A(_A), B(_B), m(_m), n(_n), r(_r) {};
    __host__ __device__
    float operator()(size_t idx){
        float sum = 0.0f;
        int row = idx/r;
        int col = idx - (row*r); // cheaper modulo
        // dot product of a "column" of A (stride n) with a "column" of B (stride r)
        for (int i = 0; i < m; i++)
            sum += A[row + i*n] * B[col + i*r];
        return sum;}
};
const int dsd = 200;
int main(int argc, char *argv[]){
int ds = dsd;
if (argc > 1) ds = atoi(argv[1]);
const int n = ds;
const int m = ds;
const int r = ds;
// data setup
thrust::device_vector<float> data(n*m,1);
thrust::device_vector<float> other(m*r,1);
thrust::device_vector<float> result(n*r,0);
// method 1
//let's pretend that other is (already) transposed for efficient memory access by thrust
// therefore each dot-product is formed using a row of data and a row of other
long long dt = dtime_usec(0);
if (ds < 201){
for (int i = 0; i < n; ++i)
{
for (int j = 0; j < r;++j)
{
result[i*r+ j] = thrust::inner_product(data.begin()+(i*m), data.begin()+((i+1)*m),other.begin()+(j*m), 0.0f);
}
}
cudaDeviceSynchronize();
dt = dtime_usec(dt);
if (thrust::equal(result.begin(), result.end(), thrust::constant_iterator<float>(m)))
std::cout << "method 1 time: " << dt/(float)USECPSEC << "s" << std::endl;
else
std::cout << "method 1 failure" << std::endl;
}
thrust::fill(result.begin(), result.end(), 0);
cudaDeviceSynchronize();
// method 2
//let's pretend that data is (already) transposed for efficient memory access by thrust
// therefore each dot-product is formed using a column of data and a column of other
dt = dtime_usec(0);
thrust::transform(thrust::counting_iterator<int>(0), thrust::counting_iterator<int>(n*r), result.begin(), dp(thrust::raw_pointer_cast(data.data()), thrust::raw_pointer_cast(other.data()), m, n, r));
cudaDeviceSynchronize();
dt = dtime_usec(dt);
if (thrust::equal(result.begin(), result.end(), thrust::constant_iterator<float>(m)))
std::cout << "method 2 time: " << dt/(float)USECPSEC << "s" << std::endl;
else
std::cout << "method 2 failure" << std::endl;
// method 3
// once again, let's pretend the data is ready to go for CUBLAS
cublasHandle_t h;
cublasCreate(&h);
thrust::fill(result.begin(), result.end(), 0);
float alpha = 1.0f;
float beta = 0.0f;
cudaDeviceSynchronize();
dt = dtime_usec(0);
cublasSgemm(h, CUBLAS_OP_T, CUBLAS_OP_T, n, r, m, &alpha, thrust::raw_pointer_cast(data.data()), n, thrust::raw_pointer_cast(other.data()), m, &beta, thrust::raw_pointer_cast(result.data()), n);
cudaDeviceSynchronize();
dt = dtime_usec(dt);
if (thrust::equal(result.begin(), result.end(), thrust::constant_iterator<float>(m)))
std::cout << "method 3 time: " << dt/(float)USECPSEC << "s" << std::endl;
else
std::cout << "method 3 failure" << std::endl;
}
$ nvcc -o t463 t463.cu -lcublas
$ ./t463
method 1 time: 20.1648s
method 2 time: 6.3e-05s
method 3 time: 5.7e-05s
$ ./t463 1024
method 2 time: 0.008063s
method 3 time: 0.000458s
$
For the default dimension 200 case, the single thrust call and cublas method are fairly close, but are much faster than the loop method. For a side dimension of 1024, the cublas method is almost 20x faster than the single thrust call method.
Note that I have chosen "optimal" transpose configurations for all 3 methods. For method 1, the best-case timing is when the inner_product is using a "row" from each input matrix (effectively the transpose of the 2nd input matrix). For method 2, the best-case timing is when the functor is traversing a "column" from each input matrix (effectively the transpose of the first input matrix). For method 3, the choice of CUBLAS_OP_T for both input matrices seems to be fastest. In reality, only the cublas method has the flexibility to be useful for a variety of input cases with good performance.
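If the input matrices are stored row-major and you don't want to pretend anything is transposed, a common way to use cublas (which assumes column-major storage) is to compute the product with the operand order reversed, so that the column-major result it writes is exactly the row-major matrix you want. The following is only a rough sketch of that idea (my own addition, not part of the timed code above; error checking omitted), reusing h, data, other and result from the example:
// Sketch: result (n x r, row-major) = data (n x m, row-major) * other (m x r, row-major).
// In cublas' column-major view this is computed as other^T * data^T = result^T,
// which, laid out column-major, is exactly result in row-major order.
float alpha = 1.0f, beta = 0.0f;
cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
            r, n, m,
            &alpha,
            thrust::raw_pointer_cast(other.data()), r,   // first operand, leading dimension r
            thrust::raw_pointer_cast(data.data()),  m,   // second operand, leading dimension m
            &beta,
            thrust::raw_pointer_cast(result.data()), r); // output, leading dimension r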

Related

Eigen very slow when chaining operations

While trying to compute the variance of row vectors in large matrices I've noticed odd behavior with Eigen. If I chain all the required operations I get extremely slow performance, while computing a partial result and then performing the exact same operations yields much faster results. This behavior actually seems to go against the Eigen docs/FAQ, which say to avoid temporaries.
So my question is whether there is some kind of known pitfall in the library I should avoid, and how to spot situations where this type of slowdown might occur.
Here's the code I've used to test this. I've tried compiling it with MSVC (-O2 optimizations) and MinGW GCC (-O3) on Windows. The "row variance with partial eval" version runs at around 560 ms with GCC and 1 s with MSVC, while the version without the partial takes around 90 s with GCC and 104 s with MSVC, a pretty absurd difference. I didn't try it, but I imagine even a sequence of naive for loops would be a lot faster than 90 seconds...
#include <iostream>
#include <vector>
#include <chrono>
#include <random>
#include <functional>
#include "Eigen/Dense"
void printTimespan(std::chrono::nanoseconds timeSpan)
{
using namespace std::chrono;
std::cout << "Timing ended:\n"
<< "\t ms: " << duration_cast<milliseconds>(timeSpan).count() << '\n'
<< "\t us: " << duration_cast<microseconds>(timeSpan).count() << '\n'
<< "\t ns: " << timeSpan.count() << '\n';
}
class Timer
{
std::chrono::steady_clock::time_point start_;
public:
void start()
{
start_ = std::chrono::steady_clock::now();
}
void stop()
{
timings.push_back((std::chrono::steady_clock::now() - start_).count());
}
std::vector<long long> timings;
};
std::vector<float> buildBuffer(size_t rows, size_t cols)
{
std::vector<float> buffer;
buffer.reserve(rows * cols);
for (size_t i = 0; i < rows; i++)
{
for (size_t j = 0; j < cols; j++)
{
buffer.push_back(std::rand() % 1000);
}
}
return buffer;
}
using EigenArr = Eigen::Array<float, -1, -1, Eigen::RowMajor>;
using EigenMap = Eigen::Map<EigenArr>;
std::vector<float> benchmark(std::function<EigenArr(const EigenMap&)> func)
{
constexpr size_t rows = 2000, cols = 200, repetitions = 1000;
std::vector<float> buffer = buildBuffer(rows, cols);
EigenMap map(buffer.data(), rows, cols);
EigenArr res;
std::vector<float> means; //just to prevent the compiler from not computing anything because the results aren't used
Timer timer;
for (size_t i = 0; i < repetitions; i++)
{
timer.start();
res = func(map);
timer.stop();
means.push_back(res.mean());
}
Eigen::Map<Eigen::Vector<long long, -1>> timingsMap(timer.timings.data(), timer.timings.size());
printTimespan(std::chrono::nanoseconds(timingsMap.sum()));
return means;
}
int main()
{
std::cout << "mean center rows\n";
benchmark([](const EigenMap& map)
{
return (map.colwise() - map.rowwise().mean()).eval();
});
std::cout << "squared deviations\n";
benchmark([](const EigenMap& map)
{
return (map.colwise() - map.rowwise().mean()).square().eval();
});
std::cout << "row variance with partial eval\n";
benchmark([](const EigenMap& map)
{
EigenArr partial = (map.colwise() - map.rowwise().mean()).square().eval();
return (partial.rowwise().sum() / (map.cols() - 1)).eval();
});
std::cout << "row variance\n";
benchmark([](const EigenMap& map)
{
return ((map.colwise() - map.rowwise().mean()).square().rowwise().sum() / (map.cols() - 1)).eval();
});
}
I suspect it's the double rowwise() on the slower one.
A lot of operations in Eigen are computed on demand and don't create temporaries; this is done to prevent unnecessary copies of the data. But I suspect that every time the outer rowwise() is asked for an element, it recomputes the inner portion, squaring the number of operations. Saving a copy once prevents each cell from being evaluated multiple times.
You could also do it on one line by calling .eval() after the square().
The other possibility is just a cache issue, if it's being forced to skip around in memory a lot.
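For reference, a sketch of that one-line variant (my own wording of the suggestion above, reusing the benchmark helper and EigenMap alias from the question) could be:
std::cout << "row variance with forced partial eval\n";
benchmark([](const EigenMap& map)
{
    // .eval() after square() materializes the squared deviations once,
    // so the second rowwise() pass does not re-evaluate the inner expression
    return ((map.colwise() - map.rowwise().mean()).square().eval()
                .rowwise().sum() / (map.cols() - 1)).eval();
});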

How to use thrust to accumulate array based on index?

I am trying to accumulate an array based on an index. My inputs are two vectors of the same length: the first vector holds the indices, the second holds the values. My goal is to accumulate the values based on the index. I have similar code in C++, but I am new to Thrust coding. Could I achieve this with Thrust device code, and which function could I use? I found no "map"-like functions. Is it more efficient than the CPU (host) code?
My c++ version mini sample code.
int a[10]={1,2,3,4,5,1,1,3,4,4};
vector<int> key(a,a+10);
double b[10]={1,2,3,4,5,1,2,3,4,5};
vector<double> val(b,b+10);
unordered_map<size_t,double> M;
for (size_t i = 0;i< 10 ;i++)
{
M[key[i]] = M[key[i]]+val[i];
}
As indicated in the comment, the canonical way to do this would be to reorder the data (keys, values) so that like keys are grouped together. You can do this with thrust::sort_by_key; thrust::reduce_by_key then does the accumulation.
It is possible, in a slightly un-thrust-like way, to also solve the problem without reordering, using a functor provided to thrust::for_each that performs an atomic add.
The following illustrates both:
$ cat t27.cu
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/fill.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/for_each.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>
#include <unordered_map>
#include <vector>
// this functor only needed for the non-reordering case
// requires compilation for a cc6.0 or higher GPU e.g. -arch=sm_60
struct my_func {
double *r;
my_func(double *_r) : r(_r) {};
template <typename T>
__host__ __device__
void operator()(T t) {
atomicAdd(r+thrust::get<0>(t)-1, thrust::get<1>(t)); // assumes consecutive keys starting at 1
}
};
int main(){
int a[10]={1,2,3,4,5,1,1,3,4,4};
std::vector<int> key(a,a+10);
double b[10]={1,2,3,4,5,1,2,3,4,5};
std::vector<double> val(b,b+10);
std::unordered_map<size_t,double> M;
for (size_t i = 0;i< 10 ;i++)
{
M[key[i]] = M[key[i]]+val[i];
}
for (int i = 1; i < 6; i++) std::cout << M[i] << " ";
std::cout << std::endl;
int size_a = sizeof(a)/sizeof(a[0]);
thrust::device_vector<int> d_a(a, a+size_a);
thrust::device_vector<double> d_b(b, b+size_a);
thrust::device_vector<double> d_r(5); //assumes only 5 keys, for illustration
thrust::device_vector<int> d_k(5); // assumes only 5 keys, for illustration
// method 1, without reordering
thrust::for_each_n(thrust::make_zip_iterator(thrust::make_tuple(d_a.begin(), d_b.begin())), size_a, my_func(thrust::raw_pointer_cast(d_r.data())));
thrust::host_vector<double> r = d_r;
thrust::copy(r.begin(), r.end(), std::ostream_iterator<double>(std::cout, " "));
std::cout << std::endl;
thrust::fill(d_r.begin(), d_r.end(), 0.0);
// method 2, with reordering
thrust::sort_by_key(d_a.begin(), d_a.end(), d_b.begin());
thrust::reduce_by_key(d_a.begin(), d_a.end(), d_b.begin(), d_k.begin(), d_r.begin());
thrust::copy(d_r.begin(), d_r.end(), r.begin());
thrust::copy(r.begin(), r.end(), std::ostream_iterator<double>(std::cout, " "));
std::cout << std::endl;
}
$ nvcc -o t27 t27.cu -std=c++14 -arch=sm_70
$ ./t27
4 2 6 13 5
4 2 6 13 5
4 2 6 13 5
$
I make no statements about relative performance of these approaches. It would probably depend on the actual data set size, and possibly the GPU being used and other factors.

How to fill a sparse matrix efficiently?

I use the Eigen library to perform sparse matrix operations, in particular to fill a sparse matrix. But the numbers of rows and cols are very large in our case, which results in a long time for filling the sparse matrix. Is there any efficient way to do this (maybe with another library)?
Below is my code:
SparseMatrix<double> mat(rows,cols);
mat.reserve(VectorXi::Constant(cols,6));
for each i,j such that v_ij != 0
    mat.insert(i,j) = v_ij;
mat.makeCompressed();
The order in which a SparseMatrix is filled can make an enormous difference in computation time. To fill a SparseMatrix matrix quickly, the elements should be addressed in a sequence that corresponds to the storage order of the SparseMatrix. By default, the storage order in Eigen's SparseMatrix is column major, but it is easy to change this.
The following code demonstrates the time difference between a rowwise filling of two sparse matrices with different storage order. The square sparse matrices are relatively small and nominally identical. While the RowMajor matrix is almost instantly filled, it takes a much longer time (about 30 seconds on my desktop computer) in the case of ColMajor storage format.
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/SparseCore>
#include <random>
using namespace Eigen;
typedef SparseMatrix<double, RowMajor> SpMat_RM;
typedef SparseMatrix<double, ColMajor> SpMat_CM;
// compile with -std=c++11 -O3
int main() {
const int n = 1e4;
const int nnzpr = 50;
std::random_device rd;
std::mt19937 gen(rd());
std::uniform_int_distribution<> randInt(0, n-1);
SpMat_RM m_RM(n,n);
m_RM.reserve(n);
SpMat_CM m_CM(n,n);
m_CM.reserve(n);
std::cout << "Row-wise filling of [" << n << " x " << n << "] sparse matrix (RowMajor) ..." << std::flush;
for (int i = 0; i < n; ++i) {
for (int j = 0; j < nnzpr; ++j) {
int col = randInt(gen);
double val = 1. ; // v_ij
m_RM.coeffRef(i,col) = val ;
}
}
m_RM.makeCompressed();
std::cout << "done." << std::endl;
std::cout << "Row-wise filling of [" << n << " x " << n << "] sparse matrix (ColMajor) ..." << std::flush;
for (int i = 0; i < n; ++i) {
for (int j = 0; j < nnzpr; ++j) {
int col = randInt(gen);
double val = 1.; // v_ij
m_CM.coeffRef(i,col) = val ;
}
}
m_CM.makeCompressed();
std::cout << "done." << std::endl;
}
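If the entries cannot conveniently be produced in storage order, another commonly used option (shown here only as a sketch of my own, not timed above; it additionally needs <vector>) is to collect the entries as triplets and build the matrix in one call with setFromTriplets, which sorts and accumulates duplicates internally:
// Sketch: build the same kind of matrix from a triplet list.
// setFromTriplets sorts the entries and sums duplicates, so the
// insertion order does not matter, and the result is already compressed.
std::vector<Eigen::Triplet<double>> entries;
entries.reserve(n * nnzpr);
for (int i = 0; i < n; ++i) {
    for (int j = 0; j < nnzpr; ++j) {
        entries.emplace_back(i, randInt(gen), 1.); // (row, col, value)
    }
}
SpMat_CM m_T(n, n);
m_T.setFromTriplets(entries.begin(), entries.end());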

Call functor for all combinations in Cuda/Thrust

I have two index sets, one in the range [0, N], one in the range [0, M], where N != M. The indices are used to refer to values in different thrust::device_vectors.
Essentially, I want to create one GPU thread for every combination of these indices, so N*M threads. Each thread should compute a value based on the index-combination and store the result in another thrust::device_vector, at a unique index also based on the input combination.
This seems to be a fairly standard problem, but I was unable to find a way to do this in thrust. The documentation only ever mentions problems where element i of a vector needs to compute something with element i of another vector. There is thrust::permutation_iterator, but as far as I understand it only gives me the option to reorder data, and I have to specify the order as well.
Some code:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>
int main()
{
// Initialize some data
const int N = 2;
const int M = 3;
thrust::host_vector<int> vec1_host(N);
thrust::host_vector<int> vec2_host(M);
vec1_host[0] = 1;
vec1_host[1] = 5;
vec2_host[0] = -3;
vec2_host[1] = 42;
vec2_host[2] = 9;
// Copy to device
thrust::device_vector<int> vec1_dev = vec1_host;
thrust::device_vector<int> vec2_dev = vec2_host;
// Allocate device memory to copy results to
thrust::device_vector<int> result_dev(vec1_host.size() * vec2_host.size());
// Create functor I want to call on every combination
struct myFunctor
{
thrust::device_vector<int> const& m_vec1;
thrust::device_vector<int> const& m_vec2;
thrust::device_vector<int>& m_result;
myFunctor(thrust::device_vector<int> const& vec1, thrust::device_vector<int> const& vec2, thrust::device_vector<int>& result)
: m_vec1(vec1), m_vec2(vec2), m_result(result)
{
}
__host__ __device__
void operator()(size_t i, size_t j) const
{
m_result[i + j * m_vec1.size()] = m_vec1[i] + m_vec1[j];
}
} func(vec1_dev, vec2_dev, result_dev);
// How do I create N*M threads, each of which calls func(i, j) ?
// Copy results back
thrust::host_vector<int> result_host = result_dev;
for(int i : result_host)
std::cout << i << ", ";
std::cout << std::endl;
// Expected output:
// -2, 2, 43, 47, 10, 14
return 0;
}
I'm fairly sure this is very easy to achieve, I guess I'm just missing the right search terms. Anyways, all help appreciated :)
Presumably in your functor operator instead of this:
m_result[i + j * m_vec1.size()] = m_vec1[i] + m_vec1[j];
                                       ^           ^
you meant this:
m_result[i + j * m_vec1.size()] = m_vec1[i] + m_vec2[j];
                                       ^           ^
I think there are probably many ways to tackle this, but so as to not argue about things that are not germane to the question, I'll try and stay as close to your given code as I can.
Operations like [] on a vector are not possible in device code. Therefore we must convert your functor to work on raw data pointers, rather than thrust vector operations directly.
With those caveats, and a slight modification in how we handle your i and j indices, I think what you're asking is not difficult.
The basic strategy is to create a result vector that is of length N*M just as you suggest, then create the indices i and j within the functor operator. In so doing, we need only pass one index to the functor, using e.g. thrust::transform or thrust::for_each to create our output:
$ cat t79.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <iostream>
struct myFunctor
{
const int *m_vec1;
const int *m_vec2;
int *m_result;
size_t v1size;
myFunctor(thrust::device_vector<int> const& vec1, thrust::device_vector<int> const& vec2, thrust::device_vector<int>& result)
{
m_vec1 = thrust::raw_pointer_cast(vec1.data());
m_vec2 = thrust::raw_pointer_cast(vec2.data());
m_result = thrust::raw_pointer_cast(result.data());
v1size = vec1.size();
}
__host__ __device__
void operator()(const size_t x) const
{
size_t i = x%v1size;
size_t j = x/v1size;
m_result[i + j * v1size] = m_vec1[i] + m_vec2[j];
}
};
int main()
{
// Initialize some data
const int N = 2;
const int M = 3;
thrust::host_vector<int> vec1_host(N);
thrust::host_vector<int> vec2_host(M);
vec1_host[0] = 1;
vec1_host[1] = 5;
vec2_host[0] = -3;
vec2_host[1] = 42;
vec2_host[2] = 9;
// Copy to device
thrust::device_vector<int> vec1_dev = vec1_host;
thrust::device_vector<int> vec2_dev = vec2_host;
// Allocate device memory to copy results to
thrust::device_vector<int> result_dev(vec1_host.size() * vec2_host.size());
// How do I create N*M threads, each of which calls func(i, j) ?
thrust::for_each_n(thrust::device, thrust::counting_iterator<size_t>(0), (N*M), myFunctor(vec1_dev, vec2_dev, result_dev));
// Copy results back
thrust::host_vector<int> result_host = result_dev;
for(int i : result_host)
std::cout << i << ", ";
std::cout << std::endl;
// Expected output:
// -2, 2, 43, 47, 10, 14
return 0;
}
$ nvcc -std=c++11 -arch=sm_61 -o t79 t79.cu
$ ./t79
-2, 2, 43, 47, 10, 14,
$
In retrospect, I think this is more or less exactly what @eg0x20 was suggesting.
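For completeness, the same thing could be done with thrust::transform instead of thrust::for_each_n, by having the functor return the output value and letting thrust write it. The following is only a sketch of my own (myTransformFunctor is a hypothetical name, and it additionally requires <thrust/transform.h>):
// Hypothetical variant: the functor computes and returns element x,
// and thrust::transform stores it into result_dev.
struct myTransformFunctor
{
    const int *m_vec1;
    const int *m_vec2;
    size_t v1size;
    myTransformFunctor(const int *vec1, const int *vec2, size_t _v1size)
        : m_vec1(vec1), m_vec2(vec2), v1size(_v1size) {}
    __host__ __device__
    int operator()(const size_t x) const
    {
        return m_vec1[x % v1size] + m_vec2[x / v1size];
    }
};
// usage, with the same vectors as above:
// thrust::transform(thrust::counting_iterator<size_t>(0),
//                   thrust::counting_iterator<size_t>(N*M),
//                   result_dev.begin(),
//                   myTransformFunctor(thrust::raw_pointer_cast(vec1_dev.data()),
//                                      thrust::raw_pointer_cast(vec2_dev.data()),
//                                      vec1_dev.size()));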

Generating random numbers with uniform distribution using Thrust

I need to generate a vector with random numbers between 0.0 and 1.0 using Thrust. The only documented example I could find produces very large random numbers (thrust::generate(myvector.begin(), myvector.end(), rand)).
I'm sure the answer is simple, but I would appreciate any suggestions.
Thrust has random generators you can use to produce sequences of random numbers. To use them with a device vector you will need to create a functor which returns a different element of the random generator sequence. The most straightforward way to do this is using a transformation of a counting iterator. A very simple complete example (in this case generating random single precision numbers between 1.0 and 2.0) could look like:
#include <thrust/random.h>
#include <thrust/device_vector.h>
#include <thrust/transform.h>
#include <thrust/iterator/counting_iterator.h>
#include <iostream>
struct prg
{
float a, b;
__host__ __device__
prg(float _a=0.f, float _b=1.f) : a(_a), b(_b) {};
__host__ __device__
float operator()(const unsigned int n) const
{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(a, b);
rng.discard(n);
return dist(rng);
}
};
int main(void)
{
const int N = 20;
thrust::device_vector<float> numbers(N);
thrust::counting_iterator<unsigned int> index_sequence_begin(0);
thrust::transform(index_sequence_begin,
index_sequence_begin + N,
numbers.begin(),
prg(1.f,2.f));
for(int i = 0; i < N; i++)
{
std::cout << numbers[i] << std::endl;
}
return 0;
}
In this example, the functor prg takes the lower and upper bounds of the random number as arguments, with (0.f,1.f) as the default. Note that in order to get a different vector each time you call the transform operation, you should use a counting iterator initialised to a different starting value, as sketched below.
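As a small sketch of that last point (reusing N, numbers and prg from the example above), a second, different batch can be produced by starting the counting sequence where the previous one ended:
// second call: offset the counting iterator by N so each element discards
// to a different point in the engine's sequence than in the first call
thrust::counting_iterator<unsigned int> index_sequence_second(N);
thrust::transform(index_sequence_second,
                  index_sequence_second + N,
                  numbers.begin(),
                  prg(1.f,2.f));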
It might not be a direct answer to your question, but the cuRAND library is quite powerful for this purpose. You can generate random numbers on both the GPU and the CPU, and it contains many distribution functions (normal distribution, etc.).
Search for the title: "An NVIDIA CURAND implementation" on this link: http://adnanboz.wordpress.com/tag/nvidia-curand/
//Create a new generator
curandCreateGenerator(&m_prng, CURAND_RNG_PSEUDO_DEFAULT);
//Set the generator options
curandSetPseudoRandomGeneratorSeed(m_prng, (unsigned long) mainSeed);
//Generate random numbers
curandGenerateUniform(m_prng, d_randomData, dataCount);
One note: do not create the generator again and again, because creation performs some precalculations. Calling curandGenerateUniform is quite fast and produces values between 0.0 and 1.0.
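A minimal self-contained sketch of that approach (my own, filling a thrust::device_vector with uniform floats; compile with -lcurand, error checking omitted) could look like:
// Sketch: fill a device vector with uniform random floats using cuRAND.
#include <curand.h>
#include <thrust/device_vector.h>
#include <iostream>
int main(){
    const int dataCount = 16;
    thrust::device_vector<float> d_randomData(dataCount);
    curandGenerator_t m_prng;
    // create the generator once and reuse it; creation does some precalculation
    curandCreateGenerator(&m_prng, CURAND_RNG_PSEUDO_DEFAULT);
    curandSetPseudoRandomGeneratorSeed(m_prng, 1234ULL);
    // generate dataCount uniform values in (0.0, 1.0] directly on the device
    curandGenerateUniform(m_prng, thrust::raw_pointer_cast(d_randomData.data()), dataCount);
    curandDestroyGenerator(m_prng);
    for (int i = 0; i < dataCount; i++) std::cout << d_randomData[i] << std::endl;
    return 0;
}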
The approach suggested by @talonmies has a number of useful characteristics. Here's another approach that mimics the example you quoted:
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <iostream>
#define DSIZE 5
__host__ static __inline__ float rand_01()
{
return ((float)rand()/RAND_MAX);
}
int main(){
thrust::host_vector<float> h_1(DSIZE);
thrust::generate(h_1.begin(), h_1.end(), rand_01);
std::cout<< "Values generated: " << std::endl;
for (unsigned i=0; i<DSIZE; i++)
std::cout<< h_1[i] << " : ";
std::cout<<std::endl;
return 0;
}
Similar to the example you quoted, this uses rand(), and therefore can only be used to generate host vectors. Likewise, it will produce the same sequence each time unless you re-seed rand() appropriately.
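If you need a different sequence on each run, or need the data on the device, a small sketch of the necessary additions (requiring <ctime> and <thrust/device_vector.h> as well) would be:
srand((unsigned)time(NULL));                 // re-seed so each run differs
thrust::host_vector<float> h_2(DSIZE);
thrust::generate(h_2.begin(), h_2.end(), rand_01);
thrust::device_vector<float> d_2 = h_2;      // host-to-device copy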
There are already satisfactory answers to this question. In particular, the OP and Robert Crovella have dealt with thrust::generate, while talonmies has proposed using thrust::transform.
I think there is another possibility, namely, using thrust::for_each, so I'm posting a fully worked example using such a primitive, just for the record.
I'm also timing the different solutions.
THE CODE
#include <iostream>
#include <thrust/host_vector.h>
#include <thrust/generate.h>
#include <thrust/transform.h>
#include <thrust/for_each.h>
#include <thrust/execution_policy.h>
#include <thrust/random.h>
#include <thrust/iterator/counting_iterator.h>
#include "TimingCPU.h"
/**************************************************/
/* RANDOM NUMBERS GENERATION STRUCTS AND FUNCTION */
/**************************************************/
template<typename T>
struct rand_01 {
__host__ T operator()(T& VecElem) const { return (T)rand() / RAND_MAX; }
};
template<typename T>
struct rand_01_for_each {
__host__ void operator()(T& VecElem) const { VecElem = (T)rand() / RAND_MAX; }
};
template<typename T>
__host__ T rand_01_fcn() { return ((T)rand() / RAND_MAX); }
struct prg
{
float a, b;
__host__ __device__
prg(float _a = 0.f, float _b = 1.f) : a(_a), b(_b) {};
__host__ __device__
float operator()(const unsigned int n) const
{
thrust::default_random_engine rng;
thrust::uniform_real_distribution<float> dist(a, b);
rng.discard(n);
return dist(rng);
}
};
/********/
/* MAIN */
/********/
int main() {
TimingCPU timerCPU;
const int N = 2 << 18;
//const int N = 64;
const int numIters = 50;
thrust::host_vector<double> h_v1(N);
thrust::host_vector<double> h_v2(N);
thrust::host_vector<double> h_v3(N);
thrust::host_vector<double> h_v4(N);
printf("N = %d\n", N);
double timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::transform(thrust::host, h_v1.begin(), h_v1.end(), h_v1.begin(), rand_01<double>());
timing = timing + timerCPU.GetCounter();
}
printf("Timing using transform = %f\n", timing / numIters);
timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::counting_iterator<unsigned int> index_sequence_begin(0);
thrust::transform(index_sequence_begin,
index_sequence_begin + N,
h_v2.begin(),
prg(0.f, 1.f));
timing = timing + timerCPU.GetCounter();
}
printf("Timing using transform and internal Thrust random generator = %f\n", timing / numIters);
timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::for_each(h_v3.begin(), h_v3.end(), rand_01_for_each<double>());
timing = timing + timerCPU.GetCounter();
}
timerCPU.StartCounter();
printf("Timing using for_each = %f\n", timing / numIters);
//std::cout << "Values generated: " << std::endl;
//for (int k = 0; k < N; k++)
// std::cout << h_v3[k] << " : ";
//std::cout << std::endl;
timing = 0.;
for (int k = 0; k < numIters; k++) {
timerCPU.StartCounter();
thrust::generate(h_v4.begin(), h_v4.end(), rand_01_fcn<double>);
timing = timing + timerCPU.GetCounter();
}
timerCPU.StartCounter();
printf("Timing using generate = %f\n", timing / numIters);
//std::cout << "Values generated: " << std::endl;
//for (int k = 0; k < N; k++)
// std::cout << h_v4[k] << " : ";
//std::cout << std::endl;
//std::cout << "Values generated: " << std::endl;
//for (int k = 0; k < N * 2; k++)
// std::cout << h_v[k] << " : ";
//std::cout << std::endl;
return 0;
}
On a laptop Core i5 platform, I had the following timings
N = 2097152
Timing using transform = 33.202298
Timing using transform and internal Thrust random generator = 264.508662
Timing using for_each = 33.155237
Timing using generate = 35.309399
The timings are equivalent, apart from the second one which uses Thrust's internal random number generator instead of rand().
Please note that, differently from the other solutions, the thrust::generate one is somewhat more rigid, since the function used to generate the random numbers cannot have input parameters. So, for example, it is not possible to scale the generated values by a constant through arguments to the generator function.
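One possible workaround, sketched below (my own addition, not benchmarked above), is to give thrust::generate a functor object that stores the parameters as state, so the call operator itself still takes no arguments:
// Sketch: carry parameters (here, the bounds a and b) into thrust::generate
// through functor state rather than function arguments.
template<typename T>
struct rand_ab {
    T a, b;
    rand_ab(T _a, T _b) : a(_a), b(_b) {}
    __host__ T operator()() const { return a + (b - a) * ((T)rand() / RAND_MAX); }
};
// usage: thrust::generate(h_v4.begin(), h_v4.end(), rand_ab<double>(0., 10.));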