Fastest way to convert a multidimensional std::vector into one array

I want to copy as little as possible. At the moment I'm using num_t* array = new num_t[..] and then copying each value of the multidimensional vector into array in a for-loop.
I'd like to find a better way to do this.

For arithmetic types you can use std::memcpy to copy each row in one call. For example:
#include <iostream>
#include <vector>
#include <cstring>
int main()
{
std::vector<std::vector<int>> v =
{
{ 1 },
{ 1, 2 },
{ 1, 2, 3 },
{ 1, 2, 3, 4 }
};
for ( const auto &row : v )
{
for ( int x : row ) std::cout << x << ' ';
std::cout << std::endl;
}
std::cout << std::endl;
size_t n = 0;
for ( const auto &row : v ) n += row.size();
int *a = new int[n];
int *p = a;
for ( const auto &row : v )
{
std::memcpy( p, row.data(), row.size() * sizeof( int ) );
p += row.size();
}
for ( p = a; p != a + n; ++p ) std::cout << *p << ' ';
std::cout << std::endl;
delete []a;
}
The program output is
1
1 2
1 2 3
1 2 3 4
1 1 2 1 2 3 1 2 3 4

As you stated in the comments, the inner vectors of your vector<vector<T>> structure all have the same size. So what you are actually trying to do is store an m x n matrix.
Usually such matrices are not stored in multi-dimensional structures but in linear memory. The position (row, column) of a given element is then derived from an indexing scheme; row-major and column-major order are the most common.
Since you already stated that you will copy this data onto a GPU, the copy then simply transfers the linear vector as a whole.
You then use the same indexing scheme on the GPU and on the host.
If you are using CUDA, have a look at Thrust. It provides thrust::host_vector<T> and thrust::device_vector<T> and simplifies copying even further:
thrust::host_vector<int> hostVec(100); // 10 x 10 matrix
thrust::device_vector<int> deviceVec = hostVec; // copies hostVec to GPU

Related

How to do a reduction over one dimension of 2D data in Thrust

I'm new to CUDA and the Thrust library. I'm learning and trying to implement a function that calls a Thrust algorithm inside a for loop. Is there a way to convert this loop into another Thrust call? Or should I use a CUDA kernel to achieve this?
I have come up with code like this
// thrust functor
struct GreaterthanX
{
const float _x;
GreaterthanX(float x) : _x(x) {}
__host__ __device__ bool operator()(const float &a) const
{
return a > _x;
}
};
int main(void)
{
// fill a device_vector with
// 3 2 4 5
// 0 -2 3 1
// 9 8 7 6
int row = 3;
int col = 4;
thrust::device_vector<int> vec(row * col);
thrust::device_vector<int> count(row);
vec[0] = 3;
vec[1] = 2;
vec[2] = 4;
vec[3] = 5;
vec[4] = 0;
vec[5] = -2;
vec[6] = 3;
vec[7] = 1;
vec[8] = 9;
vec[9] = 8;
vec[10] = 7;
vec[11] = 6;
// Goal: For each row, count the number of elements greater than 2.
// And then find the row with the max count
// count the element greater than 2 in vec
for (int i = 0; i < row; i++)
{
count[i] = thrust::count_if(vec.begin() + i * col, vec.begin() + (i + 1) * col, GreaterthanX(2)); // row i only
}
thrust::device_vector<int>::iterator result = thrust::max_element(count.begin(), count.end());
int max_val = *result;
unsigned int position = result - count.begin();
printf("result = %d at position %d\r\n", max_val, position);
// result = 4 at position 2
return 0;
}
My goal is to find the row that has the most elements greater than 2. I'm struggling at how to do this without a loop. Any suggestions would be very appreciated. Thanks.

Solution using Thrust
Here is an implementation using thrust::reduce_by_key in conjunction with multiple "fancy iterators".
I also took the liberty of sprinkling in some const, auto and lambdas for elegance and readability. Because of the lambdas, you will need to pass the -extended-lambda flag to nvcc.
#include <cassert>
#include <cstdio>
#include <thrust/reduce.h>
#include <thrust/device_vector.h>
#include <thrust/distance.h>
#include <thrust/extrema.h> // thrust::max_element
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/discard_iterator.h>
#include <thrust/iterator/transform_iterator.h>
int main(void)
{
// fill a device_vector with
// 3 2 4 5
// 0 -2 3 1
// 9 8 7 6
int const row = 3;
int const col = 4;
thrust::device_vector<int> vec(row * col);
vec[0] = 3;
vec[1] = 2;
vec[2] = 4;
vec[3] = 5;
vec[4] = 0;
vec[5] = -2;
vec[6] = 3;
vec[7] = 1;
vec[8] = 9;
vec[9] = 8;
vec[10] = 7;
vec[11] = 6;
thrust::device_vector<int> count(row);
// Goal: For each row, count the number of elements greater than 2.
// And then find the row with the max count
// count the element greater than 2 in vec
// counting iterator avoids read from global memory, gives index into vec
auto keys_in_begin = thrust::make_counting_iterator(0);
auto keys_in_end = thrust::make_counting_iterator(row * col);
// transform vec on the fly
auto vals_in_begin = thrust::make_transform_iterator(
vec.cbegin(),
[] __host__ __device__ (int val) { return val > 2 ? 1 : 0; });
// discard to avoid write to global memory
auto keys_out_begin = thrust::make_discard_iterator();
auto vals_out_begin = count.begin();
// transform keys (indices) into row indices and then compare
// the divisions are one reason one might rather
// use MatX for higher dimensional data
auto binary_predicate = [col] __host__ __device__ (int i, int j){
return i / col == j / col;
};
// this function returns a new end for count
// b/c the final number of elements is often not known beforehand
auto new_ends = thrust::reduce_by_key(keys_in_begin, keys_in_end,
vals_in_begin,
keys_out_begin,
vals_out_begin,
binary_predicate);
// make sure that we didn't provide too small of an output vector
assert(thrust::get<1>(new_ends) == count.end());
auto const result = thrust::max_element(count.begin(), count.end());
int const max_val = *result;
auto const position = thrust::distance(count.begin(), result);
// position is a signed difference type, so convert it before printing with %d
std::printf("result = %d at position %d\r\n", max_val, static_cast<int>(position));
// result = 4 at position 2
return 0;
}
Bonus solution using MatX
As mentioned in the comments, NVIDIA has released a new high-level C++17 library called MatX, which targets problems involving (dense) multi-dimensional data (i.e. tensors). The library tries to unify multiple low-level libraries like cuFFT, cuSOLVER and CUTLASS behind one Python-/MATLAB-like interface. At the time of writing (v0.2.2) the library is still in initial development and therefore probably doesn't guarantee a stable API. Because of this, because its performance is not yet as optimized as that of the more mature Thrust library, and because the documentation and samples are not yet exhaustive, MatX should not be used in production code yet. While constructing this solution I actually stumbled upon a bug, which was fixed immediately. So this code only works on the main branch and not with the current release v0.2.2, and some of the features used might not appear in the documentation yet.
A solution using MatX looks like this:
#include <iostream>
#include <matx.h>
int main(void)
{
int const row = 3;
int const col = 4;
auto tensor = matx::make_tensor<int, 2>({row, col});
tensor.SetVals({{3, 2, 4, 5},
{0, -2, 3, 1},
{9, 8, 7, 6}});
// tensor.Print(0,0); // print full tensor
auto count = matx::make_tensor<int, 1>({row});
// count.Print(0); // print full count
// Goal: For each row, count the number of elements greater than 2.
// And then find the row with the max count
// the kind of reduction is determined through the shapes of tensor and count
matx::sum(count, matx::as_int(tensor > 2));
// A single value (scalar) is a tensor of rank 0:
auto result_idx = matx::make_tensor<matx::index_t>();
auto result = matx::make_tensor<int>();
matx::argmax(result, result_idx, count);
cudaDeviceSynchronize();
std::cout << "result = " << result()
<< " at position " << result_idx() << "\r\n";
// result = 4 at position 2
return 0;
}
As MatX employs deferred execution operators, matx::as_int(tensor > 2) is effectively fused into the kernel, achieving the same as using a thrust::transform_iterator in Thrust.
Because MatX knows about the regularity of the problem while Thrust does not, the MatX solution could potentially be more performant than the Thrust solution. It certainly is more elegant. It is also possible to construct tensors in already allocated memory, so one can mix the libraries, e.g. by constructing a tensor in the memory of a thrust::device_vector named vec via passing thrust::raw_pointer_cast(vec.data()) to the constructor of the tensor.

Eigen Map returns partly garbage after deleting original data

I'm trying to use Eigen::Map to convert a pointer to raw data to a matrix and then free the original data, but keep getting some weird results as if the data in the Eigen::Map itself is deleted. I thought Eigen::Map performed a deep copy, though maybe this only happens after you convert the Eigen::Map to a matrix?
Here is some test code:
#include <Eigen/Dense>
#include <iostream>
int main(int argc, char const *argv[])
{
double* data = new double[4];
data[0] = 1;
data[1] = 2;
data[2] = 3;
data[3] = 4;
Eigen::Map<Eigen::Matrix<double, 2, 2, Eigen::RowMajor>> M(data);
Eigen::Matrix<double, 2, 2, Eigen::RowMajor> N = M.matrix();
std::cout << M << std::endl;
std::cout << N << std::endl;
delete[] data;
std::cout << M << std::endl;
std::cout << N << std::endl;
return 0;
}
Which results in this for me:
1 2
3 4
1 2
3 4
0 4.67506e-310
3 4
1 2
3 4
Is there something I'm doing wrong to get this behaviour from M in my example? Or are you supposed to convert it to N, like I've done? And is this inefficient, or does Eigen handle the assignment N = M.matrix() in some smart way?

Reshaping flat array to complex Eigen type

How can I reshape data of size 1×2N into a P×Q complex matrix in Eigen, with N complex numbers and P×Q=N? In data, the real and imaginary parts are stored right next to each other. I would like to reshape data dynamically, as it can have different sizes. I am trying to avoid copying and just map the data to a complex type.
int N = 9;
int P = 3;
int Q = 6;
float *data = new float[2*N];
for(int i = 0; i < 2*N; i++)
data[i] = i + 1; // data = {1, 2, 3, 4, ..., 17, 18};
Eigen::Map<Eigen::MatrixXcf> A(data, P, Q); // trying to have something like this.
// Desired reshaping:
// A = [
// 1 + 2i 7 + 8i 13 + 14i
// 3 + 4i 9 + 10i 15 + 16i
// 5 + 6i 11 + 12i 17 + 18i
// ]
I tried to first convert data to a complex Eigen array (to ultimately convert to MatrixXcf), which does not work either:
Eigen::Map<Eigen::ArrayXf> Arr(data, N); // this works
Eigen::Map<Eigen::ArrayXcf> Arrc(A.data(), N); // trying to map data to an Eigen complex array.
Could stride in Eigen::Map be helpful?
The simplest solution is to loop through all the elements and copy data into an array std::complex<float> *datac = new std::complex<float>[N];. I was wondering whether Eigen can map data to datac instead. Thanks in advance.

Here is an MCVE answer with some extra examples of how you can use the stride to get different outcomes:
#include "Eigen/Core"
#include <iostream>
#include <complex>
int main()
{
int N = 9;
int P = 3;
int Q = 6;
float *data = new float[20*N];
for(int i = 0; i < 20*N; i++)
data[i] = i + 1; // data = {1, 2, 3, 4, ..., 179, 180};
// Produces the output of the "Desired reshaping"
Eigen::Map<Eigen::MatrixXcf>
A((std::complex<float>*)(data), P, P);
std::cout << A << "\n\n";
// Produces what you originally wrote (plus a cast so it works)
Eigen::Map<Eigen::MatrixXcf>
B((std::complex<float>*)(data), P, Q);
std::cout << B << "\n\n";
// Start each column at the 10xJ position
Eigen::Map<Eigen::MatrixXcf, 0, Eigen::OuterStride<>>
C((std::complex<float>*)(data), P, Q, Eigen::OuterStride<>(10));
std::cout << C << "\n\n";
// Skip every other value
Eigen::Map<Eigen::MatrixXcf, 0, Eigen::InnerStride<>>
D((std::complex<float>*)(data), P, Q, Eigen::InnerStride<>(2));
std::cout << D << "\n\n";
delete [] data;
return 0;
}
The output is:
(1,2) (7,8) (13,14)
(3,4) (9,10) (15,16)
(5,6) (11,12) (17,18)
(1,2) (7,8) (13,14) (19,20) (25,26) (31,32)
(3,4) (9,10) (15,16) (21,22) (27,28) (33,34)
(5,6) (11,12) (17,18) (23,24) (29,30) (35,36)
(1,2) (21,22) (41,42) (61,62) (81,82) (101,102)
(3,4) (23,24) (43,44) (63,64) (83,84) (103,104)
(5,6) (25,26) (45,46) (65,66) (85,86) (105,106)
(1,2) (13,14) (25,26) (37,38) (49,50) (61,62)
(5,6) (17,18) (29,30) (41,42) (53,54) (65,66)
(9,10) (21,22) (33,34) (45,46) (57,58) (69,70)

Adding elements to std::vector in a repeated way

I want to copy values from one vector into another in a specific order; the second vector will contain more elements than the first one.
For example:
vector<int> temp;
temp.push_back(2);
temp.push_back(0);
temp.push_back(1);
int size1 = temp.size();
int size2 = 4;
vector<int> temp2(size1 * size2);
And now I would like to fill temp2 like this: {2, 2, 2, 2, 0, 0, 0, 0, 1, 1, 1, 1}.
Is it possible to do this using only algorithms (e.g. fill)?

Yes, it is possible using the std::generate_n algorithm:
#include <algorithm>
#include <iostream>
#include <iterator>
#include <vector>
int main() {
std::vector<int> base{1, 0, 2};
const int factor = 4;
std::vector<int> out{};
std::generate_n(std::back_inserter(out), base.size() * factor,
[&base, counter=0]() mutable {
return base[counter++ / factor];
});
for(const auto i : out) {
std::cout << i << ' ';
}
}
This code prints: 1 1 1 1 0 0 0 0 2 2 2 2
The key is the lambda passed to std::generate_n. It keeps an internal counter and, on each call, returns base[counter / factor] before incrementing the counter, so every element of base is emitted factor times in a row.

No, this is quite a specific use case, but you can trivially implement it yourself.
#include <vector>
#include <iostream>
std::vector<int> Elongate(const std::vector<int>& src, const size_t factor)
{
std::vector<int> result;
result.reserve(src.size() * factor);
for (const auto& el : src)
result.insert(result.end(), factor, el);
return result;
}
int main()
{
std::vector<int> temp{2, 0, 1};
std::vector<int> real = Elongate(temp, 4);
for (const auto& el : real)
std::cerr << el << ' ';
std::cerr << '\n';
}

Is there an example for accumarray() in C/C++

We are trying to understand MATLAB's accumarray function and want to write C/C++ code for it to aid our understanding. Can someone help us with sample code or pseudocode?

According to the documentation,
The function processes the input as follows:
1. Find out how many unique indices there are in subs. Each unique index defines a bin in the output array. The maximum index value in subs determines the size of the output array.
2. Find out how many times each index is repeated. This determines how many elements of vals are going to be accumulated at each bin in the output array.
3. Create an output array. The output array is of size max(subs) or of size sz.
4. Accumulate the entries in vals into bins using the values of the indices in subs and apply fun to the entries in each bin.
5. Fill the values in the output for positions not referred to by subs. Default fill value is zero; use fillval to set a different value.
So, translating to C++ (this is untested code; indices are taken as 0-based here, and out_it must be a random-access iterator),
#include <algorithm>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>
template< typename sub_it, typename val_it, typename out_it,
typename fun = std::plus< typename std::iterator_traits< val_it >::value_type >,
typename T = typename std::iterator_traits< val_it >::value_type >
out_it accumarray( sub_it first_index, sub_it last_index,
val_it first_value, // val_it last_value, -- 1 value per index
out_it first_out,
fun f = fun(), T fillval = T() ) {
if ( first_index == last_index ) return first_out; // nothing to accumulate
std::size_t sz = * std::max_element( first_index, last_index ) + 1; // 1. Get size.
std::vector< bool > used_indexes( sz, false ); // 2. remember which indexes are used
std::fill_n( first_out, sz, T() ); // 3. initialize output
while ( first_index != last_index ) {
std::size_t index = * first_index;
used_indexes[ index ] = true; // 2. remember that this index was used
first_out[ index ] = f( first_out[ index ], * first_value ); // 4. accumulate
++ first_value;
++ first_index;
}
// 5. If fill is different from zero, reinitialize untouched values
if ( fillval != T() ) {
out_it fill_it = first_out;
for ( std::vector< bool >::iterator used_it = used_indexes.begin();
used_it != used_indexes.end(); ++ used_it, ++ fill_it ) {
if ( ! * used_it ) * fill_it = fillval; // only bins that were never referenced
}
}
return first_out + sz;
}
This has a few shortcomings, for example the accumulation function is called repeatedly instead of once with the entire column vector. The output is placed in pre-allocated storage referenced by first_out. The index vector must be the same size as the value vector. But most of the features should be captured pretty well.

Many thanks for your response. We were able to fully understand and implement the same in C++ (we used armadillo). Here is the code:
colvec TestProcessing::accumarray(icolvec cf, colvec T, double nf, int p)
{
/* ******* Description *******
cf holds the (1-based) indices.
T holds the values to be accumulated
in the output array S.
If T is not given (or is a scalar), accumarray reduces
to computing a histogram of the input data.
nf is the size of the output array;
nf >= max(cf),
so pass the argument accordingly.
p is not used in the function.
********************************/
colvec S; // output Array
S.set_size(int(nf)); // preallocate the output array
for(int i = 0 ; i < (int)nf ; i++)
{
// find the indices in cf corresponding to 1 to nf
// and store in unsigned integer array q1
uvec q1 = find(cf == (i+1));
vec q ;
double sum1 = 0 ;
if(!q1.is_empty())
{
q = T.elem(q1) ; // find the elements in T having indices in q1
// make sure q1 is not empty
sum1 = arma::sum(q); // calculate the sum and store in output array
S(i) = sum1;
}
// if q1 is empty array just put 0 at that particular location
else
{
S(i) = 0 ;
}
}
return S;
}
Hope this will help others too!
Thanks again to everybody who contributed :)

Here's what I came up with. Note: I went for readability (since you wanted to understand best), rather than being optimized. Oh, and I've never used MATLAB, I was just going off of this sample I saw just now:
val = 101:105;
subs = [1; 2; 4; 2; 4]
subs =
1
2
4
2
4
A = accumarray(subs, val)
A =
101 % A(1) = val(1) = 101
206 % A(2) = val(2)+val(4) = 102+104 = 206
0 % A(3) = 0
208 % A(4) = val(3)+val(5) = 103+105 = 208
Anyway, here's the code sample:
#include <iostream>
#include <cstdlib> // abs
#include <vector>
#include <map>
class RangeValues
{
public:
RangeValues(int startValue, int endValue)
{
int range = endValue - startValue;
// Reserve all needed space up front
values.resize(abs(range) + 1);
// Fill every slot, including the one for endValue
int i = startValue;
for ( unsigned int index = 0; index < values.size(); iterateByDirection(range, i), ++index )
{
values[index] = i;
}
}
std::vector<int> GetValues() const { return values; }
private:
void iterateByDirection(int range, int& value)
{
( range < 0 ) ? --value : ++value;
}
private:
std::vector<int> values;
};
typedef std::map<unsigned int, int> accumMap;
accumMap accumarray( const RangeValues& rangeVals )
{
accumMap aMap;
std::vector<int> values = rangeVals.GetValues();
unsigned int index = 0;
std::vector<int>::const_iterator itr = values.begin();
for ( itr; itr != values.end(); ++itr, ++index )
{
aMap[index] = (*itr);
}
return aMap;
}
int main()
{
// Our value range will be from -10 to 10
RangeValues values(-10, 10);
accumMap aMap = accumarray(values);
// Now iterate through and check out what values map to which indices.
accumMap::const_iterator itr = aMap.begin();
for ( itr; itr != aMap.end(); ++itr )
{
std::cout << "Index: " << itr->first << ", Value: " << itr->second << '\n';
}
//Or much like the MATLAB Example:
std::cout << aMap[5] << '\n'; // -5, since our range was from -10 to 10
}

In addition to Vicky Budhiraja's Armadillo example, here is a 2D version of accumarray with semantics similar to the MATLAB function:
arma::mat accumarray (arma::mat& subs, arma::vec& val, arma::rowvec& sz)
{
arma::u32 ar = sz.col(0)(0);
arma::u32 ac = sz.col(1)(0);
arma::mat A; A.set_size(ar, ac);
for (arma::u32 r = 0; r < ar; ++r)
{
for (arma::u32 c = 0; c < ac; ++c)
{
arma::uvec idx = arma::find(subs.col(0) == r &&
subs.col(1) == c);
if (!idx.is_empty())
A(r, c) = arma::sum(val.elem(idx));
else
A(r, c) = 0;
}
}
return A;
}
The sz input is a two-element row vector that contains the number of rows and the number of columns of the output matrix A. The subs matrix has two columns and the same number of rows as val. The number of rows of val is basically sz rows times sz cols.
The sz (size) input is not strictly necessary; it can easily be deduced by taking the maximum of each column of subs:
arma::u32 sz_rows = arma::max(subs.col(0)) + 1;
arma::u32 sz_cols = arma::max(subs.col(1)) + 1;
or
arma::u32 sz_rows = arma::max(subs.col(0)) + 1;
arma::u32 sz_cols = val.n_elem / sz_rows;
The output matrix is now:
arma::mat A (sz_rows, sz_cols);
The accumarray function becomes:
arma::mat accumarray (arma::mat& subs, arma::vec& val)
{
arma::u32 sz_rows = arma::max(subs.col(0)) + 1;
arma::u32 sz_cols = arma::max(subs.col(1)) + 1;
arma::mat A (sz_rows, sz_cols);
for (arma::u32 r = 0; r < sz_rows; ++r)
{
for (arma::u32 c = 0; c < sz_cols; ++c)
{
arma::uvec idx = arma::find(subs.col(0) == r &&
subs.col(1) == c);
if (!idx.is_empty())
A(r, c) = arma::sum(val.elem(idx));
else
A(r, c) = 0;
}
}
return A;
}
For example:
arma::vec val = arma::regspace(101, 106);
arma::mat subs;
subs << 0 << 0 << arma::endr
<< 1 << 1 << arma::endr
<< 2 << 1 << arma::endr
<< 0 << 0 << arma::endr
<< 1 << 1 << arma::endr
<< 3 << 0 << arma::endr;
arma::mat A = accumarray (subs, val);
A.raw_print("A =");
This produces the result:
A =
205 0
0 207
0 103
106 0
This example is taken from http://fr.mathworks.com/help/matlab/ref/accumarray.html?requestedDomain=www.mathworks.com
except for the indices in subs: Armadillo uses 0-based indices whereas MATLAB uses 1-based indices.
Unfortunately, the previous code is not suitable for large matrices. Two nested for-loops with a find over the whole vector inside them scale badly. The code is good for understanding the concept, but it can be reduced to a single loop like this one:
arma::mat accumarray(arma::mat& subs, arma::vec& val)
{
arma::u32 ar = arma::max(subs.col(0)) + 1;
arma::u32 ac = arma::max(subs.col(1)) + 1;
arma::mat A(ar, ac);
A.zeros();
for (arma::u32 r = 0; r < subs.n_rows; ++r)
A(subs(r, 0), subs(r, 1)) += val(r);
return A;
}
The only changes are:
initialize the output matrix with zeros
loop over the rows of subs to get the output indices
accumulate val into the output (subs and val are row-synchronized)
A 1-D (vector) version of the function could look like this:
arma::vec accumarray (arma::ivec& subs, arma::vec& val)
{
arma::u32 num_elems = arma::max(subs) + 1;
arma::vec A (num_elems);
A.zeros();
for (arma::u32 r = 0; r < subs.n_rows; ++r)
A(subs(r)) += val(r);
return A;
}
To test the 1-D version:
arma::vec val = arma::regspace(101, 105);
arma::ivec subs;
subs << 0 << 2 << 3 << 2 << 3;
arma::vec A = accumarray(subs, val);
A.raw_print("A =");
The result agrees with the MATLAB example (see the previous link):
A =
101
0
206
208
This is not a strict copy of MATLAB's accumarray function. For example, the MATLAB function can produce a vector or matrix whose size is defined by sz and is larger than the intrinsic size implied by subs and val.
Maybe this could be an idea for an addition to the Armadillo API, allowing a single interface for different dimensions and types.
