I'm trying to perform a computation over a one-dimensional array A[] using Intel's TBB. The problem is that, by default, an algorithm like tbb::parallel_for cuts the array in half recursively, sending each chunk to the task pool for the threads to steal.
However, I want all threads to "scan" the array linearly. For example, using 4 threads, have them compute, in parallel, first A[0], A[1], A[2] and A[3] in any order, and then the set A[4], A[5], A[6] and A[7], in any order.
Right now, parallel_for, after a couple of recursive splits, would compute first A[0], A[2], A[4] and A[6], and then A[1], A[3], A[5] and A[7] (or something similar).
I'm using C++14 and Intel's Threading Building Blocks. Algorithms like parallel_reduce and parallel_scan split the iteration space in the same fashion, so they haven't been much help.
My guess is that I have to define my own iteration space object, but I can't figure out how exactly. The docs give this definition:
class R {
// True if range is empty
bool empty() const;
// True if range can be split into non-empty subranges
bool is_divisible() const;
// Splits r into subranges r and *this
R( R& r, split );
// Splits r into subranges r and *this in proportion p
R( R& r, proportional_split p );
// Allows usage of proportional splitting constructor
static const bool is_splittable_in_proportion = true;
...
};
It all boils down to this code:
#include <mutex>
#include <iostream>
#include <thread>
#include <tbb/parallel_for.h>
#include <tbb/task_scheduler_init.h>
std::mutex cout_mutex;
int main()
{
auto N = 8;
tbb::task_scheduler_init init(4);
tbb::parallel_for(tbb::blocked_range<int>(0, N),
[&](const tbb::blocked_range<int>& r)
{
for (int j = r.begin(); j < r.end(); ++j) {
// Compute A[j]
std::this_thread::sleep_for(std::chrono::seconds(1));
cout_mutex.lock();
std::cout << std::this_thread::get_id()<< ", " << j << std::endl;
cout_mutex.unlock();
}
}
);
}
The above code gives:
140455557347136, 0
140455526110976, 4
140455521912576, 2
140455530309376, 6
140455526110976, 5
140455557347136, 1
140455521912576, 3
140455530309376, 7
but I wanted something like:
140455557347136, 0
140455526110976, 1
140455521912576, 2
140455530309376, 3
140455526110976, 5
140455557347136, 4
140455521912576, 6
140455530309376, 7
Any suggestions on the iteration space object, or is there a better solution?
Consider using an external atomic (// !!! marks changed lines):
#include <mutex>
#include <iostream>
#include <thread>
#include <tbb/parallel_for.h>
#include <tbb/task_scheduler_init.h>
#include <atomic> // !!!
std::mutex cout_mutex;
int main()
{
auto N = 8;
tbb::task_scheduler_init init(4);
std::atomic<int> monotonic_begin{0}; // !!!
tbb::parallel_for(tbb::blocked_range<int>(0, N),
[&](const tbb::blocked_range<int>& r)
{
int s = static_cast<int>(r.size()); // !!!
int b = monotonic_begin.fetch_add(s); // !!!
int e = b + s; // !!!
for (int j = b; j < e; ++j) { // !!!
// Compute A[j]
std::this_thread::sleep_for(std::chrono::seconds(1));
cout_mutex.lock();
std::cout << std::this_thread::get_id() << ", " << j << std::endl;
cout_mutex.unlock();
}
}
);
}
The approach gives:
15084, 0
15040, 3
12400, 2
11308, 1
15084, 4
15040, 5
12400, 6
11308, 7
Why is it important to have monotonic behavior? If the goal is to specify calculation dependencies, you may want to consider parallel_pipeline or a flow graph instead.
I have a 2d boost matrix (boost::numeric::ublas::matrix) of shape (n,m), with the first column being the timestamp. However, the data I'm getting is out of order. How can I sort it with respect to the first column, and what would be the most efficient way to do so? Speed is critical in this particular application.
As I commented, ublas::matrix might not be the most natural choice for a task like this. Trying the naive approach using matrix_row and some range magic:
#define _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS
#include <boost/numeric/ublas/io.hpp>
#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/matrix_proxy.hpp>
#include <boost/range/adaptors.hpp>
#include <boost/range/irange.hpp>
#include <boost/range/algorithm.hpp>
#include <iomanip>
#include <iostream>
using namespace boost::adaptors;
using Matrix = boost::numeric::ublas::matrix<float>;
using Row = boost::numeric::ublas::matrix_row<Matrix>;
static auto by_col0 = [](Row const& a, Row const& b) { return a(0) < b(0); };
int main()
{
constexpr int nrows = 3, ncols = 4;
Matrix m(nrows, ncols);
for (unsigned i = 0; i < m.size1(); ++i)
for (unsigned j = 0; j < m.size2(); ++j)
m(i, j) = (10 - 3.f * i) + j;
std::cout << "before: " << m << "\n";
auto getrow = [&](int i) { return Row(m, i); };
sort(boost::irange(nrows) | transformed(getrow), by_col0);
std::cout << "after: " << m << "\n";
}
This sadly confirms that the abstraction of the proxy doesn't hold:
before: [3,4]((10,11,12,13),(7,8,9,10),(4,5,6,7))
after: [3,4]((10,11,12,13),(10,11,12,13),(10,11,12,13))
Oops.
Analysis?
I can't say I know what's wrong. std::sort is defined in terms of ValueSwappable which at first glance seems to work fine for matrix_row:
auto r0 = Row(m, 0);
auto r1 = Row(m, 1);
using std::swap;
swap(r0, r1);
That compiles and appears to swap correctly when tested in isolation.
Maybe this starting point gives you something helpful. Since it's tricky like this, I'd highly consider using another data structure that is more conducive to your task (boost::multi_array[_ref] comes to mind).
I have a vector in which I save coordinates.
I perform a series of calculations on each coordinate, which is why I have a limit on the vector's size.
Right now I clear the vector when the limit is reached.
I'm searching for a method that lets me keep the previous values and only erases the very first value in the vector.
Simplified, something like this (if the maximum size of the vector were 4):
vector<int> vec = {1, 2, 3, 4};
vec.push_back(5);
// vec is now {2, 3, 4, 5}
Is this possible?
As suggested by @paddy, you can use std::deque; it is the most performant way to keep the last N elements if you .push_back(...) a new (last) element and .pop_front() the first element.
std::deque gives O(1) complexity for both operations, unlike std::vector, for which erasing the first element costs O(N).
#include <deque>
#include <iostream>
int main() {
std::deque<int> d = {1, 2, 3, 4};
for (int i = 5; i <= 9; ++i) {
d.push_back(i);
d.pop_front();
// Print
for (auto x: d)
std::cout << x << " ";
std::cout << std::endl;
}
}
Output:
2 3 4 5
3 4 5 6
4 5 6 7
5 6 7 8
6 7 8 9
I think you should properly encapsulate this behaviour in your own vector class, as a std::vector wrapper. You could pass the max capacity as an argument to your constructor, reimplement the methods that may cause "overflow", and reuse the std::vector ones for the rest.
To simplify, to achieve what you intend for the push_back case, using a free function and a global variable, you could:
check against a max capacity and,
if that capacity is already reached, rotate the vector contents left by one position, then simply overwrite the last element;
otherwise do a normal push_back.
#include <algorithm> // rotate
#include <iostream> // cout
#include <vector>
const size_t max_capacity{4};
void push_back(std::vector<int>& v, int n)
{
if (v.size() == max_capacity)
{
// Rotate 1 left
std::rotate(std::begin(v), std::begin(v) + 1, std::end(v));
v[v.size() - 1] = n;
}
else
{
v.push_back(n);
}
}
int main()
{
std::vector<int> v{};
for (auto i{1}; i < 9; i++)
{
push_back(v, i);
for (auto&& n : v) { std::cout << n << " "; }
std::cout << "\n";
}
}
I am trying to accumulate an array based on an index. My inputs are two vectors of the same length: the first vector holds the keys, the second holds the values. My goal is to accumulate the values by key. I have similar code in C++, but I am new to thrust coding. Could I achieve this with thrust device code? Which function could I use? I found no "map"-like functions. Is it more efficient than the CPU (host) code?
My C++ mini sample:
#include <unordered_map>
#include <vector>
using namespace std;

int main()
{
    int a[10] = {1,2,3,4,5,1,1,3,4,4};
    vector<int> key(a, a+10);
    double b[10] = {1,2,3,4,5,1,2,3,4,5};
    vector<double> val(b, b+10);
    unordered_map<size_t, double> M;
    for (size_t i = 0; i < 10; ++i)
        M[key[i]] += val[i];
}
As indicated in the comments, the canonical way to do this would be to reorder the data (keys, values) so that like keys are grouped together. You can do that with sort_by_key; reduce_by_key then solves the problem.
It is also possible, in a slightly un-thrust-like way, to solve the problem without reordering, using a functor provided to for_each that performs an atomic add.
The following illustrates both:
$ cat t27.cu
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <thrust/fill.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/for_each.h>
#include <thrust/copy.h>
#include <iostream>
#include <iterator>
#include <unordered_map>
#include <vector>
// this functor only needed for the non-reordering case
// requires compilation for a cc6.0 or higher GPU e.g. -arch=sm_60
struct my_func {
double *r;
my_func(double *_r) : r(_r) {};
template <typename T>
__host__ __device__
void operator()(T t) {
atomicAdd(r+thrust::get<0>(t)-1, thrust::get<1>(t)); // assumes consecutive keys starting at 1
}
};
int main(){
int a[10]={1,2,3,4,5,1,1,3,4,4};
std::vector<int> key(a,a+10);
double b[10]={1,2,3,4,5,1,2,3,4,5};
std::vector<double> val(b,b+10);
std::unordered_map<size_t,double> M;
for (size_t i = 0;i< 10 ;i++)
{
M[key[i]] = M[key[i]]+val[i];
}
for (int i = 1; i < 6; i++) std::cout << M[i] << " ";
std::cout << std::endl;
int size_a = sizeof(a)/sizeof(a[0]);
thrust::device_vector<int> d_a(a, a+size_a);
thrust::device_vector<double> d_b(b, b+size_a);
thrust::device_vector<double> d_r(5); //assumes only 5 keys, for illustration
thrust::device_vector<int> d_k(5); // assumes only 5 keys, for illustration
// method 1, without reordering
thrust::for_each_n(thrust::make_zip_iterator(thrust::make_tuple(d_a.begin(), d_b.begin())), size_a, my_func(thrust::raw_pointer_cast(d_r.data())));
thrust::host_vector<double> r = d_r;
thrust::copy(r.begin(), r.end(), std::ostream_iterator<double>(std::cout, " "));
std::cout << std::endl;
thrust::fill(d_r.begin(), d_r.end(), 0.0);
// method 2, with reordering
thrust::sort_by_key(d_a.begin(), d_a.end(), d_b.begin());
thrust::reduce_by_key(d_a.begin(), d_a.end(), d_b.begin(), d_k.begin(), d_r.begin());
thrust::copy(d_r.begin(), d_r.end(), r.begin());
thrust::copy(r.begin(), r.end(), std::ostream_iterator<double>(std::cout, " "));
std::cout << std::endl;
}
$ nvcc -o t27 t27.cu -std=c++14 -arch=sm_70
$ ./t27
4 2 6 13 5
4 2 6 13 5
4 2 6 13 5
$
I make no statements about relative performance of these approaches. It would probably depend on the actual data set size, and possibly the GPU being used and other factors.
I have two index sets, one in the range [0, N], one in the range [0, M], where N != M. The indices are used to refer to values in different thrust::device_vectors.
Essentially, I want to create one GPU thread for every combination of these indices, so N*M threads. Each thread should compute a value based on the index-combination and store the result in another thrust::device_vector, at a unique index also based on the input combination.
This seems to be a fairly standard problem, but I was unable to find a way to do this in thrust. The documentation only ever mentions problems where element i of one vector is combined with element i of another vector. There is thrust::permutation_iterator, but as far as I understand it only lets me reorder data, and I have to specify the order as well.
Some code:
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <iostream>
int main()
{
// Initialize some data
const int N = 2;
const int M = 3;
thrust::host_vector<int> vec1_host(N);
thrust::host_vector<int> vec2_host(M);
vec1_host[0] = 1;
vec1_host[1] = 5;
vec2_host[0] = -3;
vec2_host[1] = 42;
vec2_host[2] = 9;
// Copy to device
thrust::device_vector<int> vec1_dev = vec1_host;
thrust::device_vector<int> vec2_dev = vec2_host;
// Allocate device memory to copy results to
thrust::device_vector<int> result_dev(vec1_host.size() * vec2_host.size());
// Create functor I want to call on every combination
struct myFunctor
{
thrust::device_vector<int> const& m_vec1;
thrust::device_vector<int> const& m_vec2;
thrust::device_vector<int>& m_result;
myFunctor(thrust::device_vector<int> const& vec1, thrust::device_vector<int> const& vec2, thrust::device_vector<int>& result)
: m_vec1(vec1), m_vec2(vec2), m_result(result)
{
}
__host__ __device__
void operator()(size_t i, size_t j) const
{
m_result[i + j * m_vec1.size()] = m_vec1[i] + m_vec1[j];
}
} func(vec1_dev, vec2_dev, result_dev);
// How do I create N*M threads, each of which calls func(i, j) ?
// Copy results back
thrust::host_vector<int> result_host = result_dev;
for(int i : result_host)
std::cout << i << ", ";
std::cout << std::endl;
// Expected output:
// -2, 2, 43, 47, 10, 14
return 0;
}
I'm fairly sure this is very easy to achieve, I guess I'm just missing the right search terms. Anyways, all help appreciated :)
Presumably in your functor operator instead of this:
m_result[i + j * m_vec1.size()] = m_vec1[i] + m_vec1[j];
^ ^
you meant this:
m_result[i + j * m_vec1.size()] = m_vec1[i] + m_vec2[j];
^ ^
I think there are probably many ways to tackle this, but so as to not argue about things that are not germane to the question, I'll try and stay as close to your given code as I can.
Operations like [] on a device_vector are not possible in device code. Therefore we must convert your functor to work on raw data pointers rather than on thrust vectors directly.
With those caveats, and a slight modification in how we handle your i and j indices, I think what you're asking is not difficult.
The basic strategy is to create a result vector that is of length N*M just as you suggest, then create the indices i and j within the functor operator. In so doing, we need only pass one index to the functor, using e.g. thrust::transform or thrust::for_each to create our output:
$ cat t79.cu
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/for_each.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/execution_policy.h>
#include <iostream>
struct myFunctor
{
const int *m_vec1;
const int *m_vec2;
int *m_result;
size_t v1size;
myFunctor(thrust::device_vector<int> const& vec1, thrust::device_vector<int> const& vec2, thrust::device_vector<int>& result)
{
m_vec1 = thrust::raw_pointer_cast(vec1.data());
m_vec2 = thrust::raw_pointer_cast(vec2.data());
m_result = thrust::raw_pointer_cast(result.data());
v1size = vec1.size();
}
__host__ __device__
void operator()(const size_t x) const
{
size_t i = x%v1size;
size_t j = x/v1size;
m_result[i + j * v1size] = m_vec1[i] + m_vec2[j];
}
};
int main()
{
// Initialize some data
const int N = 2;
const int M = 3;
thrust::host_vector<int> vec1_host(N);
thrust::host_vector<int> vec2_host(M);
vec1_host[0] = 1;
vec1_host[1] = 5;
vec2_host[0] = -3;
vec2_host[1] = 42;
vec2_host[2] = 9;
// Copy to device
thrust::device_vector<int> vec1_dev = vec1_host;
thrust::device_vector<int> vec2_dev = vec2_host;
// Allocate device memory to copy results to
thrust::device_vector<int> result_dev(vec1_host.size() * vec2_host.size());
// How do I create N*M threads, each of which calls func(i, j) ?
thrust::for_each_n(thrust::device, thrust::counting_iterator<size_t>(0), (N*M), myFunctor(vec1_dev, vec2_dev, result_dev));
// Copy results back
thrust::host_vector<int> result_host = result_dev;
for(int i : result_host)
std::cout << i << ", ";
std::cout << std::endl;
// Expected output:
// -2, 2, 43, 47, 10, 14
return 0;
}
$ nvcc -std=c++11 -arch=sm_61 -o t79 t79.cu
$ ./t79
-2, 2, 43, 47, 10, 14,
$
In retrospect, I think this is more or less exactly what @eg0x20 was suggesting.
I am curious why an implementation with a lambda function is, in my case, so much faster than an implementation with an equivalent object.
To give you an idea of the scale: with 10^4 values, the fast one takes much less than a second while the slow one takes tens of seconds. With 10^5 values, the fast one still completes in under a second, but the slow one takes minutes.
I want to sort the values of two arrays in the same way as if I would sort one of them. It's easier to understand with an example:
[5 1 2 0] becomes [0 1 2 5]
[3 5 6 7] becomes [7 5 6 3]
There are various ways around the internet how to do that, but that's not what I want to ask.
I did two implementations: one using an object with overloaded operator() and one with a lambda function as "Compare".
The code below has the lambda function version uncommented. To use the compare object, just comment out what is in "compare using lambda function" and uncomment "compare using compare object".
#include <iostream>
#include <vector>
#include <algorithm>
#include <cstdlib>
#include <ctime>
void sortTwoVectorsByFirstVector(std::vector< float >& sortBySelf, std::vector< float >& sortByOther)
{
// init sort indices
std::vector < uint32_t > sortIndices(sortBySelf.size());
for (uint32_t i = 0; i < sortIndices.size(); ++i) {
sortIndices[i] = i;
}
//******** begin: compare using compare object
// struct CompareClass {
// std::vector< float > m_values;
// inline bool operator()(size_t i, size_t j)
// {
// return (m_values[i] < m_values[j]);
// }
// } compareObject { sortBySelf };
// std::sort(sortIndices.begin(), sortIndices.end(), compareObject);
//******* end: compare using compare object
//******** begin: compare using lambda function
std::sort(sortIndices.begin(), sortIndices.end(), [&sortBySelf](size_t i, size_t j) {return sortBySelf[i] < sortBySelf[j];});
//******** end: compare using lambda function
// collect the sorted elements using the indices
std::vector< float > sortedBySelf_sorted;
std::vector< float > sortByOther_sorted;
sortedBySelf_sorted.resize(sortBySelf.size());
sortByOther_sorted.resize(sortBySelf.size());
for (uint32_t i = 0; i < sortBySelf.size(); ++i) {
sortedBySelf_sorted[i] = sortBySelf[sortIndices[i]];
sortByOther_sorted[i] = sortByOther[sortIndices[i]];
}
sortBySelf.swap(sortedBySelf_sorted);
sortByOther.swap(sortByOther_sorted);
}
float RandomNumber()
{
return std::rand();
}
int main()
{
int vectorSize = 100000;
std::vector< float > a(vectorSize);
std::vector< float > b(vectorSize);
std::srand(100);
std::generate(a.begin(), a.end(), RandomNumber);
std::generate(b.begin(), b.end(), RandomNumber);
std::cout << "started" << std::endl;
sortTwoVectorsByFirstVector(a, b);
std::cout << "finished" << std::endl;
}
It would be cool, if someone could make clear, where this huge performance gap comes from.
Your manually written class copies the vector:
std::vector< float > m_values; //<< By value
The lambda expression merely references it:
[&sortBySelf](size_t i, size_t j) {return sortBySelf[i] < sortBySelf[j];}
If you took sortBySelf by copy (without the &) then they would likely have similar performance.