I am trying to accumulate an array based on index. My inputs are two vectors of the same length: the first vector holds the indices and the second holds the values. My goal is to accumulate the values by index. I have similar code in C++, but I am new to Thrust. Can I achieve this with Thrust device code, and which function should I use? I found no "map"-like functions. Is it more efficient than the CPU (host) code?
My C++ mini sample code:
int a[10]={1,2,3,4,5,1,1,3,4,4};
vector<int> key(a,a+10);
double b[10]={1,2,3,4,5,1,2,3,4,5};
vector<double> val(b,b+10);
unordered_map<size_t,double> M;
for (size_t i = 0; i < 10; i++)
{
    M[key[i]] = M[key[i]] + val[i];
}
As indicated in the comment, the canonical way to do this would be to reorder the data (keys, values) so that like keys are grouped together. You can do this with sort_by_key; reduce_by_key then solves the accumulation.
It is also possible, in a slightly un-thrust-like way, to solve the problem without reordering, using a functor passed to for_each that performs an atomic add.
The following illustrates both:
$ cat t27.cu
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/fill.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/for_each.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>
#include <unordered_map>
#include <vector>
// this functor is only needed for the non-reordering case
// requires compilation for a cc6.0 or higher GPU (double atomicAdd), e.g. -arch=sm_60
struct my_func {
    double *r;
    my_func(double *_r) : r(_r) {};
    template <typename T>
    __device__
    void operator()(T t) {
        atomicAdd(r + thrust::get<0>(t) - 1, thrust::get<1>(t)); // assumes consecutive keys starting at 1
    }
};
int main(){

    int a[10]    = {1,2,3,4,5,1,1,3,4,4};
    std::vector<int> key(a, a+10);
    double b[10] = {1,2,3,4,5,1,2,3,4,5};
    std::vector<double> val(b, b+10);
    std::unordered_map<size_t,double> M;
    for (size_t i = 0; i < 10; i++)
    {
        M[key[i]] = M[key[i]] + val[i];
    }
    for (int i = 1; i < 6; i++) std::cout << M[i] << " ";
    std::cout << std::endl;

    int size_a = sizeof(a)/sizeof(a[0]);
    thrust::device_vector<int>    d_a(a, a+size_a);
    thrust::device_vector<double> d_b(b, b+size_a);
    thrust::device_vector<double> d_r(5); // assumes only 5 keys, for illustration
    thrust::device_vector<int>    d_k(5); // assumes only 5 keys, for illustration

    // method 1: without reordering
    thrust::for_each_n(thrust::make_zip_iterator(thrust::make_tuple(d_a.begin(), d_b.begin())), size_a, my_func(thrust::raw_pointer_cast(d_r.data())));
    thrust::host_vector<double> r = d_r;
    thrust::copy(r.begin(), r.end(), std::ostream_iterator<double>(std::cout, " "));
    std::cout << std::endl;
    thrust::fill(d_r.begin(), d_r.end(), 0.0);

    // method 2: with reordering
    thrust::sort_by_key(d_a.begin(), d_a.end(), d_b.begin());
    thrust::reduce_by_key(d_a.begin(), d_a.end(), d_b.begin(), d_k.begin(), d_r.begin());
    thrust::copy(d_r.begin(), d_r.end(), r.begin());
    thrust::copy(r.begin(), r.end(), std::ostream_iterator<double>(std::cout, " "));
    std::cout << std::endl;
}
$ nvcc -o t27 t27.cu -std=c++14 -arch=sm_70
$ ./t27
4 2 6 13 5
4 2 6 13 5
4 2 6 13 5
$
I make no statements about relative performance of these approaches. It would probably depend on the actual data set size, and possibly the GPU being used and other factors.
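If you do want to measure, a minimal sketch with CUDA events could look like the following; it reuses the vectors and functor from t27.cu above and times method 1 only (the same pattern, with the sort_by_key/reduce_by_key pair inside the timed region, covers method 2):
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start);
thrust::for_each_n(thrust::make_zip_iterator(thrust::make_tuple(d_a.begin(), d_b.begin())), size_a, my_func(thrust::raw_pointer_cast(d_r.data())));
cudaEventRecord(stop);
cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop); // elapsed device time in milliseconds
cudaEventDestroy(start);
cudaEventDestroy(stop);
A warm-up call before the timed region avoids counting one-time CUDA initialization in the first measurement.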
I have a 2d boost matrix (boost::numeric::ublas::matrix) of shape (n,m), with the first column being the timestamp. However, the data I'm getting is out of order. How can I sort it with respect to the first column, and what would be the most efficient way to do so? Speed is critical in this particular application.
As I commented, ublas::matrix might not be the most natural choice for a task like this. Trying the naive approach using matrix_row and some range magic:
Live on Coliru
#define _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS
#include <boost/numeric/ublas/io.hpp>
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/matrix_proxy.hpp>
#include <boost/range/adaptors.hpp>
#include <boost/range/irange.hpp>
#include <boost/range/algorithm.hpp>
#include <iomanip>
#include <iostream>
using namespace boost::adaptors;
using Matrix = boost::numeric::ublas::matrix<float>;
using Row = boost::numeric::ublas::matrix_row<Matrix>;
static auto by_col0 = [](Row const& a, Row const& b) { return a(0) < b(0); };
int main()
{
    constexpr int nrows = 3, ncols = 4;

    Matrix m(nrows, ncols);
    for (unsigned i = 0; i < m.size1(); ++i)
        for (unsigned j = 0; j < m.size2(); ++j)
            m(i, j) = (10 - 3.f * i) + j;

    std::cout << "before: " << m << "\n";

    auto getrow = [&](int i) { return Row(m, i); };
    sort(boost::irange(nrows) | transformed(getrow), by_col0);

    std::cout << "after: " << m << "\n";
}
This sadly confirms that the abstraction of the proxy doesn't hold:
before: [3,4]((10,11,12,13),(7,8,9,10),(4,5,6,7))
after: [3,4]((10,11,12,13),(10,11,12,13),(10,11,12,13))
Oops.
Analysis?
I can't say I know what's wrong. std::sort is defined in terms of ValueSwappable which at first glance seems to work fine for matrix_row:
auto r0 = Row(m, 0);
auto r1 = Row(m, 1);
using std::swap;
swap(r0, r1);
This prints the two rows correctly swapped (Live On Coliru), so swapping individual matrix_row proxies does seem to work in isolation.
Maybe this starting point gives you something helpful. Since it's this tricky, I'd strongly consider using another data structure that is more conducive to your task (boost::multi_array[_ref] comes to mind).
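For contrast, a container whose rows are real values rather than proxies sorts without any of these issues. Purely as a sketch of such an alternative (a plain vector of fixed-size rows, not multi_array itself), assuming four float columns with the timestamp in column 0:
#include <algorithm>
#include <array>
#include <iostream>
#include <vector>
int main()
{
    // each row: timestamp followed by three data columns
    std::vector<std::array<float, 4>> rows = {
        {10, 11, 12, 13},
        { 7,  8,  9, 10},
        { 4,  5,  6,  7},
    };
    // rows are values, so std::sort can move them freely
    std::sort(rows.begin(), rows.end(),
              [](auto const& a, auto const& b) { return a[0] < b[0]; });
    for (auto const& r : rows) {
        for (float v : r) std::cout << v << ' ';
        std::cout << '\n';
    }
}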
The objective is to return all pairs of integers from a given array of integers that have a difference of 2. The result array should be sorted in ascending order of values. Assume there are no duplicate integers in the array. The order of the integers in the input array should not matter.
My code here:
#include <utility>
#include <bits/stdc++.h>
#include <vector>
using namespace std;
std::vector<std::pair<int, int>> twos_difference(const std::vector<int> &vec) {
    vector<int> v1 = vec;
    vector<pair<int,int>> pairs;
    sort(v1.begin(), v1.end(), greater<int>());
    for (size_t i = 0; i < vec.size() - 1; i++) {
        pair<int,int> due;
        for (size_t j = 1; j < vec.size(); j++) {
            if (v1[i] - v1[j] == 2) {
                due.first = v1[j];
                due.second = v1[i];
                pairs.push_back(due);
                break;
            }
        }
    }
    sort(pairs.begin(), pairs.end());
    return pairs;
}
It fails with "Execution Timed Out (12000 ms)". Why?
We are here to help.
I analysed the code and checked what could be the issue here.
From this, you can derive measures for improvement.
Please check the list below; maybe it gives you an idea for further development.
You pass the input vector by reference, but then copy all the data. There is no need to do that; use the given input vector as is, except, of course, if the source data is still needed.
You define your resulting std::vector "pairs" but do not reserve memory for it. Using push_back on this std::vector will do a lot of memory reallocation and copying. Please add pairs.reserve(vec.size() / 2); after the definition of "pairs".
No need to hand over "greater" to the sort function; std::sort sorts in ascending order by default, which is the order you want to return anyway.
Nested loops are not needed. Your values are already sorted, so you can use a single loop and compare values[i] with the next one or two elements (with unique sorted values, a partner differing by exactly 2 can be at most two positions away; see the sketch right after this list). This reduces the complexity from quadratic to linear.
At the end, you sort the pairs again. This is not necessary, as the values were already sorted.
Please compile with all optimizations on, e.g. "Release Mode" or -O3.
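Applied directly to your function (keeping its const-reference signature, so the input is copied once), a minimal sketch along the lines of the points above might look like this:
#include <algorithm>
#include <utility>
#include <vector>
std::vector<std::pair<int, int>> twos_difference(const std::vector<int>& vec) {
    std::vector<int> v1 = vec;         // copy needed: the parameter is const
    std::sort(v1.begin(), v1.end());   // ascending, the order to be returned
    std::vector<std::pair<int, int>> pairs;
    pairs.reserve(v1.size() / 2);
    for (size_t i = 0; i + 1 < v1.size(); ++i) {
        // unique sorted values: a partner differing by 2 is at most two slots away
        if (v1[i + 1] - v1[i] == 2)
            pairs.push_back({ v1[i], v1[i + 1] });
        else if (i + 2 < v1.size() && v1[i + 2] - v1[i] == 2)
            pairs.push_back({ v1[i], v1[i + 2] });
    }
    return pairs;                      // already in ascending order
}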
See the example below, which first creates 10,000,000 random, unique test values and then measures the execution time of finding the required pairs.
Execution time is below 1 second on my machine.
#include <iostream>
#include <vector>
#include <algorithm>
#include <utility>
#include <random>
#include <unordered_set>
#include <limits>
#include <chrono>
#include <fstream>
constexpr size_t MaxValues = 10'000'000;
std::vector<int> createTestValues() {
    std::random_device rd;   // will be used to obtain a seed for the random number engine
    std::mt19937 gen(rd());  // standard mersenne_twister_engine seeded with rd()
    std::uniform_int_distribution<> distrib(std::numeric_limits<int>::min(), std::numeric_limits<int>::max());
    std::unordered_set<int> values{};
    while (values.size() < MaxValues)
        values.insert(distrib(gen));
    return { values.begin(), values.end() };
}
int main() {
    std::vector values = createTestValues();

    auto start = std::chrono::system_clock::now();
    // Main algorithm ------------------------------------------
    std::vector<std::pair<int, int>> result;
    result.reserve(values.size() / 2);
    std::sort(values.begin(), values.end());
    for (size_t k{}; k + 1 < values.size(); ++k) {
        // with unique sorted values, a partner differing by 2 is at most two positions away
        if (values[k + 1] - values[k] == 2)
            result.push_back({ values[k], values[k + 1] });
        else if (k + 2 < values.size() && values[k + 2] - values[k] == 2)
            result.push_back({ values[k], values[k + 2] });
    }
    // ------------------------------------------------------
    auto elapsed = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::system_clock::now() - start);
    std::cout << "Elapsed time: " << elapsed.count() << " ms\n\n";

    std::ofstream ofs{ "test.txt" };
    for (const auto& [v1, v2] : result) ofs << v1 << '\t' << v2 << '\n';
}
I'm trying to perform a computation over a one-dimensional array A[] using Intel's TBB. The problem is that, by default, an algorithm like tbb::parallel_for cuts the array in half recursively, sending each chunk to the task pool for the threads to steal.
However, I want all threads to "scan" the array in a linear way. For example, using 4 threads, they would first compute A[0], A[1], A[2] and A[3] in parallel, in any order, and then compute the set A[4], A[5], A[6] and A[7], in any order.
Right now, after a couple of recursive splits, parallel_for first computes A[0], A[2], A[4] and A[6], and then A[1], A[3], A[5] and A[7] (or something similar).
I'm using C++14 and Intel's Threading Building Blocks. Algorithms like parallel_reduce or parallel_scan split the iteration space in a similar fashion, so they haven't been much help.
My guess is that I have to define my own iteration space object, but I can't figure out how exactly. The docs give this definition:
class R {
    // True if range is empty
    bool empty() const;
    // True if range can be split into non-empty subranges
    bool is_divisible() const;
    // Splits r into subranges r and *this
    R( R& r, split );
    // Splits r into subranges r and *this in proportion p
    R( R& r, proportional_split p );
    // Allows usage of proportional splitting constructor
    static const bool is_splittable_in_proportion = true;
    ...
};
It all boils down to this code:
#include <mutex>
#include <iostream>
#include <thread>
#include <chrono>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task_scheduler_init.h>
std::mutex cout_mutex;

int main()
{
    auto N = 8;
    tbb::task_scheduler_init init(4);

    tbb::parallel_for(tbb::blocked_range<int>(0, N),
        [&](const tbb::blocked_range<int>& r)
        {
            for (int j = r.begin(); j < r.end(); ++j) {
                // Compute A[j]
                std::this_thread::sleep_for(std::chrono::seconds(1));

                cout_mutex.lock();
                std::cout << std::this_thread::get_id() << ", " << j << std::endl;
                cout_mutex.unlock();
            }
        }
    );
}
The above code gives:
140455557347136, 0
140455526110976, 4
140455521912576, 2
140455530309376, 6
140455526110976, 5
140455557347136, 1
140455521912576, 3
140455530309376, 7
but I wanted something like:
140455557347136, 0
140455526110976, 1
140455521912576, 2
140455530309376, 3
140455526110976, 5
140455557347136, 4
140455521912576, 6
140455530309376, 7
Any suggestions on the iteration object or is there a better solution?
Consider using an external atomic counter, e.g. (// !!! marks the changed lines):
#include <mutex>
#include <iostream>
#include <thread>
#include <chrono>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <tbb/task_scheduler_init.h>
#include <atomic> // !!!
std::mutex cout_mutex;

int main()
{
    auto N = 8;
    tbb::task_scheduler_init init(4);

    std::atomic<int> monotonic_begin{0}; // !!!

    tbb::parallel_for(tbb::blocked_range<int>(0, N),
        [&](const tbb::blocked_range<int>& r)
        {
            int s = static_cast<int>(r.size());   // !!!
            int b = monotonic_begin.fetch_add(s); // !!!
            int e = b + s;                        // !!!
            for (int j = b; j < e; ++j) {         // !!!
                // Compute A[j]
                std::this_thread::sleep_for(std::chrono::seconds(1));

                cout_mutex.lock();
                std::cout << std::this_thread::get_id() << ", " << j << std::endl;
                cout_mutex.unlock();
            }
        }
    );
}
The approach gives:
15084, 0
15040, 3
12400, 2
11308, 1
15084, 4
15040, 5
12400, 6
11308, 7
Why is it important to have monotonic behavior? If the reason is dependencies between the computations, you may want to consider parallel_pipeline or the flow graph to specify them explicitly.
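As one possible sketch, using the classic TBB API to match the task_scheduler_init usage above: a parallel_pipeline whose serial_in_order input stage hands out indices 0, 1, 2, ... and whose parallel stage does the work, with at most 4 indices in flight at a time. This does not reproduce the exact "groups of 4" pattern, but it guarantees that indices are assigned in increasing order.
#include <tbb/pipeline.h>
#include <tbb/task_scheduler_init.h>
int main()
{
    const int N = 8;
    tbb::task_scheduler_init init(4);
    int next = 0; // only touched by the serial input stage

    tbb::parallel_pipeline(
        4, // maximum number of indices in flight
        tbb::make_filter<void, int>(tbb::filter::serial_in_order,
            [&](tbb::flow_control& fc) -> int {
                if (next >= N) { fc.stop(); return 0; }
                return next++; // indices are produced in increasing order
            }) &
        tbb::make_filter<int, void>(tbb::filter::parallel,
            [](int j) {
                // Compute A[j]
            }));
}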
I am trying to compare the performance of std::sort (using a std::vector of doubles) vs. Intel IPP sort.
I am running this test on an Intel Xeon processor, model name: Intel(R) Xeon(R) CPU X5670 @ 2.93GHz.
I am sorting a vector of 2,000,000 elements, 200 times. I have tried two different IPP sort routines, viz. ippsSortDescend_64f_I and ippsSortRadixDescend_64f_I. In all cases, the IPP sort was at least 5 to 10 times slower than std::sort. I was expecting the IPP sort to perhaps be slower for small arrays, but otherwise it should generally be faster than std::sort. Am I missing something here? What am I doing wrong?
std::sort is consistently faster in all my test cases.
Here is my program
#include <array>
#include <iostream>
#include <algorithm>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#include <sys/timeb.h>
#include <vector>
#include <chrono>
#include "ipp.h"
using namespace std;
const int SIZE = 2000000;
const int ITERS = 200;
//Chrono typedefs
typedef std::chrono::high_resolution_clock Clock;
typedef std::chrono::microseconds microseconds;
//////////////////////////////////// std ///////////////////////////////////
typedef vector<double> myList;
void initialize(myList & l, Ipp64f* ptr)
{
    double randomNum;
    for (int i = 0; i < SIZE; i++)
    {
        randomNum = 1.0 * rand() / (RAND_MAX / 2) - 1;
        l.push_back(randomNum);
        ptr[i] = randomNum;
    }
}

void test_sort()
{
    array<myList, ITERS> list;
    array<Ipp64f*, ITERS> ippList;

    // allocate
    for (int i = 0; i < ITERS; i++)
    {
        list[i].reserve(SIZE);
        ippList[i] = ippsMalloc_64f(SIZE);
    }

    // initialize
    for (int i = 0; i < ITERS; i++)
    {
        initialize(list[i], ippList[i]);
    }

    cout << "\n\nTest Case 1: std::sort\n";
    cout << "========================\n";

    // sort vector
    Clock::time_point t0 = Clock::now();
    for (int i = 0; i < ITERS; i++)
    {
        std::sort(list[i].begin(), list[i].end());
    }
    Clock::time_point t1 = Clock::now();
    microseconds ms = std::chrono::duration_cast<microseconds>(t1 - t0);
    std::cout << ms.count() << " micros" << std::endl;

    ////////////////////////////////// IPP ////////////////////////////////////////
    cout << "\n\nTest Case 2: ipp::sort\n";
    cout << "========================\n";

    // sort ipp
    Clock::time_point t2 = Clock::now();
    for (int i = 0; i < ITERS; i++)
    {
        ippsSortAscend_64f_I(ippList[i], SIZE);
    }
    Clock::time_point t3 = Clock::now();
    microseconds ms1 = std::chrono::duration_cast<microseconds>(t3 - t2);
    std::cout << ms1.count() << " micros" << std::endl;

    for (int i = 0; i < ITERS; i++)
    {
        ippsFree( ippList[i] );
    }
}

///////////////////////////////////////////////////////////////////////////////////////
int main()
{
    srand(time(NULL));
    cout << "Test for sorting an array of structures.\n" << endl;
    cout << "Test case: \nSort an array of structs (" << ITERS << " iterations) with double of length " << SIZE << ". \n";

    IppStatus status = ippInit();
    test_sort();
    return 0;
}
/////////////////////////////////////////////////////////////////////////////
The compilation command is:
/share/intel/bin/icc -O2 -I$(IPPROOT)/include sorting.cpp -lrt -L$(IPPROOT)/lib/intel64 -lippi -lipps -lippvm -lippcore -std=c++0x
Program output:
Test for sorting an array of structures.
Test case:
Sort an array of structs (200 iterations) with double of length 2000000.
Test Case 1: std::sort
========================
38117024 micros
Test Case 2: ipp::sort
========================
48917686 micros
I have run your code on my computer (Core i7 860).
std::sort 32,763,268 (~33s)
ippsSortAscend_64f_I 34,217,517 (~34s)
ippsSortRadixAscend_64f_I 15,319,053 (~15s)
These are the expected results. std::sort is inlined and highly optimized, while ippsSort_* has function-call overhead plus the internal checks performed by all IPP functions; this explains the small slowdown of the ippsSortAscend function. Radix sort is still twice as fast, as expected, since it is not a comparison-based sort.
For a more accurate result you should:
compare sorting of exactly the same distribution of random numbers (see the sketch after this list);
keep the random-number generation out of the timed region;
use the ippsSort*_32f functions to sort 'float' (rather than 'double') in the IPP case.
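As a minimal sketch of the first two points, reusing SIZE, Clock and microseconds from the question: one master dataset feeds both libraries, the copies are made outside the timed regions, and only the sorts themselves are measured.
// same values for both libraries, generated once
std::vector<double> master(SIZE);
for (double& x : master) x = 1.0 * rand() / (RAND_MAX / 2) - 1;
// std::sort on a copy made outside the timer
std::vector<double> v = master;
Clock::time_point s0 = Clock::now();
std::sort(v.begin(), v.end());
Clock::time_point s1 = Clock::now();
std::cout << std::chrono::duration_cast<microseconds>(s1 - s0).count() << " micros\n";
// IPP sort on the same values, also copied outside the timer
Ipp64f* p = ippsMalloc_64f(SIZE);
std::copy(master.begin(), master.end(), p);
Clock::time_point s2 = Clock::now();
ippsSortAscend_64f_I(p, SIZE);
Clock::time_point s3 = Clock::now();
std::cout << std::chrono::duration_cast<microseconds>(s3 - s2).count() << " micros\n";
ippsFree(p);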
I guess you've forgotten to call ippInit() before the measurement.
I have two arrays values and keys both of the same length.
I want to sort-by-key the values array using the keys array as keys.
I have been told that Boost's zip iterator is just the right tool for locking two arrays together and doing stuff to them at the same time.
Here is my attempt at using boost::zip_iterator to solve the sorting problem; it fails to compile with gcc. Can someone help me fix this code?
The problem lies in the line
std::sort ( boost::make_zip_iterator( keys, values ), boost::make_zip_iterator( keys+N , values+N ));
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <ctime>
#include <vector>
#include <algorithm>
#include <boost/iterator/zip_iterator.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/tuple/tuple_comparison.hpp>
int main(int argc, char *argv[])
{
    int N = 10;
    int keys[N];
    double values[N];

    int M = 100;

    // Create the arrays.
    for (int i = 0; i < N; ++i)
    {
        keys[i]   = rand() % M;
        values[i] = 1.0 * rand() / RAND_MAX;
    }

    // Now we use the boost zip iterator to zip the two arrays and sort them "simultaneously".
    // I want to sort-by-key the keys and values arrays
    std::sort ( boost::make_zip_iterator( keys, values ),
                boost::make_zip_iterator( keys+N , values+N )
              );

    // The values array and the corresponding keys in ascending order.
    for (int i = 0; i < N; ++i)
    {
        std::cout << keys[i] << "\t" << values[i] << std::endl;
    }

    return 0;
}
NOTE: error message on compilation
g++ -g -Wall boost_test.cpp
boost_test.cpp: In function ‘int main(int, char**)’:
boost_test.cpp:37:56: error: no matching function for call to ‘make_zip_iterator(int [(((unsigned int)(((int)N) + -0x00000000000000001)) + 1)], double [(((unsigned int)(((int)N) + -0x00000000000000001)) + 1)])’
boost_test.cpp:38:64: error: no matching function for call to ‘make_zip_iterator(int*, double*)’
You can't sort a pair of zip_iterators.
Firstly, make_zip_iterator takes a tuple of iterators as input, so you could call:
boost::make_zip_iterator(boost::make_tuple( ... ))
but that won't compile either, because keys and keys+N don't have the same type. We need to force keys to decay to a pointer:
std::sort(boost::make_zip_iterator(boost::make_tuple(+keys, +values)),
          boost::make_zip_iterator(boost::make_tuple(keys+N, values+N)));
This will compile, but the sorted result is still wrong, because a zip_iterator only models a Readable iterator, while std::sort also needs the input to be Writable, as described here. So you can't sort using zip_iterator.
A very good discussion of this problem can be found here: https://web.archive.org/web/20120422174751/http://www.stanford.edu/~dgleich/notebook/2006/03/sorting_two_arrays_simultaneou.html
Here's a possible duplicate of this question: Sorting zipped (locked) containers in C++ using boost or the STL
The approach in the link above uses std::sort and no extra space. It doesn't employ boost::zip_iterator, just Boost tuples and the Boost iterator facade. std::tuple should also work if you have an up-to-date compiler.
If you are happy to have one extra vector (of size_t elements), then the following approach works in O(n log n) average time. It's fairly simple, but there will be better approaches out there if you search for them.
#include <vector>
#include <iostream>
#include <algorithm>
#include <iterator>
using namespace std;
template <typename T1, typename T2>
void sortByPerm(vector<T1>& list1, vector<T2>& list2) {
    const auto len = list1.size();
    if (!len || len != list2.size()) throw;

    // create permutation vector
    vector<size_t> perms;
    for (size_t i = 0; i < len; i++) perms.push_back(i);
    sort(perms.begin(), perms.end(),
         [&](size_t a, size_t b) { return list1[a] < list1[b]; });

    // order input vectors by permutation
    for (size_t i = 0; i < len - 1; i++) {
        swap(list1[i], list1[perms[i]]);
        swap(list2[i], list2[perms[i]]);

        // adjust permutation vector if required
        if (i < perms[i]) {
            auto d = distance(perms.begin(), find(perms.begin() + i, perms.end(), i));
            swap(perms[i], perms[d]);
        }
    }
}
int main() {
    vector<int> ints = {32, 12, 40, 8, 9, 15};
    vector<double> doubles = {55.1, 33.3, 66.1, 11.1, 22.1, 44.1};

    sortByPerm(ints, doubles);

    copy(ints.begin(), ints.end(), ostream_iterator<int>(cout, " "));
    cout << endl;
    copy(doubles.begin(), doubles.end(), ostream_iterator<double>(cout, " "));
    cout << endl;
}
After seeing another of your comments in another answer:
I thought I would point you to std::map. This is a key-value container that preserves key order (it is basically a binary tree, usually a red-black tree, but that isn't important here).
const int M = 100;
size_t elements = 10;
std::map<int, double> map_;
for (size_t i = 0; i < elements; ++i)
{
    map_[rand() % M] = 1.0 * rand() / RAND_MAX;
}
// for every element in the map; with C++11 this can be much cleaner
for (std::map<int, double>::const_iterator it = map_.begin();
     it != map_.end(); ++it)
{
    std::cout << it->first << "\t" << it->second << std::endl;
}
Untested, but any errors should be simple syntax errors.
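For reference, the cleaner C++11 range-for version of the printing loop mentioned in the comment would be:
for (const auto& kv : map_)
    std::cout << kv.first << "\t" << kv.second << std::endl;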
boost::make_zip_iterator takes a boost::tuple.
#include <boost/iterator/zip_iterator.hpp>
#include <boost/tuple/tuple.hpp>
#include <boost/tuple/tuple_comparison.hpp>
#include <iostream>
#include <iomanip>
#include <cstdlib>
#include <ctime>
#include <vector>
#include <algorithm>
int main(int argc, char *argv[])
{
    std::vector<int> keys(10); // let's not waste time with arrays
    std::vector<double> values(10);

    const int M = 100;

    // Create the vectors.
    for (size_t i = 0; i < values.size(); ++i)
    {
        keys[i]   = rand() % M;
        values[i] = 1.0 * rand() / RAND_MAX;
    }

    // Now we use the boost zip iterator to zip the two vectors and sort them "simultaneously".
    // I want to sort-by-key the keys and values arrays
    std::sort ( boost::make_zip_iterator(
                    boost::make_tuple(keys.begin(), values.begin())),
                boost::make_zip_iterator(
                    boost::make_tuple(keys.end(), values.end()))
              );

    // The values array and the corresponding keys in ascending order.
    for (size_t i = 0; i < values.size(); ++i)
    {
        std::cout << keys[i] << "\t" << values[i] << std::endl;
    }

    return 0;
}