Given a block size N and a vector of integers of length k * N which can be viewed as k blocks of N integers, I want to create a new vector of length k whose elements are the sums of the blocks of the original vector.
E.g. block size 2, vector {1,2,3,4,5,6} would give a result of {3,7,11}.
E.g. block size 3, vector {0,0,0,1,1,1} would give a result of {0,3}.
A simple approach that works:
std::vector<int> sum_blocks(int block_size, const std::vector<int>& input)
{
    std::vector<int> ret(input.size() / block_size, 0);
    for (unsigned int i = 0; i < ret.size(); ++i)
    {
        for (unsigned int j = 0; j < block_size; ++j)
            ret[i] += input[block_size * i + j];
    }
    return ret;
}
However I'd be interested in knowing if there is a neater or more efficient way of doing this, possibly using the algorithm library.
If you can use the range-v3 library, you could write the function like this:
namespace rv = ranges::views;
namespace rs = ranges;

auto sum_blocks(int block_size, std::vector<int> const& input)
{
    return input
         | rv::chunk(block_size)
         | rv::transform([](auto const& block) {
               return rs::accumulate(block, 0);
           })
         | rs::to<std::vector<int>>;
}
which is quite self-explanatory, and avoids doing any arithmetic like block_size * i + j, which is error-prone.
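If you can target C++23 instead of range-v3, the same shape is available from the standard library alone. A minimal sketch, assuming a standard library that already ships std::views::chunk and std::ranges::to:

#include <numeric>
#include <ranges>
#include <vector>

auto sum_blocks(int block_size, std::vector<int> const& input)
{
    namespace rv = std::views;
    return input
         | rv::chunk(block_size)                 // C++23: view of block-sized subranges
         | rv::transform([](auto block) {
               return std::accumulate(block.begin(), block.end(), 0);
           })
         | std::ranges::to<std::vector<int>>();  // C++23: materialize into a vector
}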
Neat is often orthogonal to efficient, if not antagonistic. Many pedal-to-the-metal approaches end up looking incomprehensible compared to an initial solution; e.g. you could use Duff's device to unroll the inner loops based on your block size, or use vector instructions to speed up the additions.
That being said, your problem falls "neatly" into a stencil pattern; put simply, when the result space is some combination of neighbors in the origin space, you have a stencil. In this case your stencil is a 1-D "window" of size = block_size and the "combination" operation is addition. This is the bread and butter of GPU computing, and depending on the problem size it may be beneficial to treat it as a stencil in the CPU/C++ world as well.
Even though the sizes mentioned in the comments are hardly large enough to benefit from parallelization, here's an example of how to do this:
// A stencil captures the src and result arrays.
// Each element of the result depends on block_size elements of the source.
auto stencil = [&src, &ret, blockSz](size_t start, size_t end) {
    for (size_t i(start); i < end; ++i)
    {
        ret[i] = std::accumulate(
            src.begin() + i * blockSz,
            src.begin() + i * blockSz + blockSz, 0);
    }
};
Now, having such a utility means that you can divide your work among workers, assigning a range of the result to each one of them. Here's a dummy example of how to do this; note that a thread pool would be much more efficient than on-the-fly assignment of work via std::async:
#include <future>   // std::async, std::future
#include <numeric>  // std::accumulate
#include <vector>

auto block_reduce(std::vector<int> src, size_t blockSz, size_t nWorkers)
{
    std::vector<int> ret(src.size() / blockSz, 0);
    auto stencil = [&src, &ret, blockSz](size_t start, size_t end) {
        for (size_t i(start); i < end; ++i)
        {
            ret[i] = std::accumulate(
                src.begin() + i * blockSz,
                src.begin() + i * blockSz + blockSz, 0);
        }
    };
    size_t chunk = ret.size() / nWorkers;
    size_t iStart = 0, iEnd = chunk;
    std::vector<std::future<void>> results;
    while (iEnd < ret.size())
    {
        results.push_back(std::async(stencil, iStart, iEnd));
        iStart = iEnd;
        iEnd += chunk;
    }
    // The main thread handles the last (possibly uneven) chunk itself.
    stencil(iStart, ret.size());
    while (!results.empty())
    {
        results.back().wait();
        results.pop_back();
    }
    return ret;
}
Note that an "industrial" implementation would mandate careful consideration of corner cases (e.g. more requested workers than actual data, or deciding the number of workers at runtime), like in this GPU-inspired version.
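As a minimal sketch of such a guard (pick_workers and requested are hypothetical names, not part of the answer above):

#include <algorithm> // std::min, std::max
#include <cstddef>
#include <thread>

// Hypothetical guard: never spawn more workers than there are result
// elements, and fall back to 1 if hardware_concurrency() reports 0.
size_t pick_workers(size_t requested, size_t result_size)
{
    size_t hw = std::thread::hardware_concurrency(); // may legally return 0
    if (hw == 0) hw = 1;
    return std::max<size_t>(1, std::min({requested, result_size, hw}));
}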
I'd first of all like to note that Nikos posted a great answer regarding efficiency. This problem is "embarrassingly" parallel, as you can start summing one block regardless of the state of the summation of the other blocks; basically, you can sum all blocks independently. So, if performance is of the essence, your GPU is about to sweat.
Anyway, here's a more beginner-friendly C++ answer that uses a single for-loop and the numeric library's accumulate function. Sadly, OP, I couldn't see a simple way to use the algorithm library.
std::vector<int> sum_blocks(int block_size, const std::vector<int>& input)
{
    std::vector<int> result(input.size() / block_size, 0);
    for (std::size_t i = 0; i < result.size(); i++) {
        result[i] = std::accumulate(input.begin() + i * block_size,
                                    input.begin() + (i + 1) * block_size, 0);
    }
    return result;
}
I've also written a quick example that uses iterators only:
std::vector<int> sum_blocks(int block_size, const std::vector<int>& input)
{
    std::vector<int> result(input.size() / block_size, 0);
    auto it1 = result.begin();
    auto it2 = input.begin();
    while (it1 != result.end()) {
        *it1 = std::accumulate(it2, it2 + block_size, 0);
        std::advance(it1, 1);          // same as it1++;
        std::advance(it2, block_size); // same as it2 += block_size;
    }
    return result;
}
In the example I've used the function advance to keep the code more uniform, but it's identical to the code in the comments.
Once again, this answer is oriented towards the neatness request; it is not the efficient way (especially not compared to parallelization techniques).
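For what it's worth, here is one way to press the algorithm library into service after all: std::generate invokes its generator sequentially, so a lambda can walk the input one block per call. This stateful-iterator sketch is my own construction, not something claimed in the answers above:

#include <algorithm> // std::generate
#include <numeric>   // std::accumulate
#include <vector>

std::vector<int> sum_blocks(int block_size, const std::vector<int>& input)
{
    std::vector<int> result(input.size() / block_size);
    auto it = input.begin();
    // The lambda advances the captured iterator one block per call.
    std::generate(result.begin(), result.end(), [&it, block_size] {
        auto first = it;
        std::advance(it, block_size);
        return std::accumulate(first, it, 0);
    });
    return result;
}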
I have a vector of observations and an equal length vector of offsets assigning observations to a set of bins. The value of each bin should be the sum of all observations assigned to that bin, and I'm wondering if there's a vectorized method to do the reduction.
A naive implementation is below:
const int N_OBS = 100'000'000;
const int N_BINS = 16;
double obs[N_OBS];     // observations
int8_t offsets[N_OBS]; // bin assignment for each observation
double acc[N_BINS] = {0};
for (int i = 0; i < N_OBS; ++i) {
    acc[offsets[i]] += obs[i]; // accumulate obs value into its assigned bin
}
Is this possible using SIMD/AVX intrinsics? Something similar to the above will be run millions of times. I've looked at scatter/gather approaches, but I can't seem to figure out a good way to get it done.
Modern CPUs are surprisingly good at running your naïve version. On AMD Zen 3, I'm getting 48 ms for 100M random numbers on input; that's 18 GB/sec of RAM read bandwidth, or about 35% of the hard bandwidth limit on my computer (dual-channel DDR4-3200).
No SIMD is going to help, I'm afraid. Still, the best version I got is the following. Compile with OpenMP support; the switch depends on your C++ compiler (-fopenmp for GCC and Clang, /openmp for MSVC).
#include <cstdint>
#include <cstring>      // memset
#include <immintrin.h>  // AVX intrinsics

void computeHistogramScalarOmp(const double* rsi, const int8_t* indices, size_t length, double* rdi)
{
    // Count of OpenMP threads = CPU cores to use
    constexpr int ompThreadsCount = 4;
    // Use an independent set of accumulators per thread, otherwise concurrency would corrupt data.
    // Aligning by 64 = cache line: we want to assign whole cache lines to CPU cores; sharing them is extremely expensive.
    alignas(64) double accumulators[16 * ompThreadsCount];
    memset(&accumulators, 0, sizeof(accumulators));

    // Minimize OMP overhead by dispatching very few large tasks
#pragma omp parallel for schedule(static, 1)
    for (int t = 0; t < ompThreadsCount; t++)
    {
        // Grab a slice of the output buffer
        double* const acc = &accumulators[t * 16];
        // Compute a slice of the source data for this thread
        const size_t first = t * length / ompThreadsCount;
        const size_t last = (t + 1) * length / ompThreadsCount;
        // Accumulate into the thread-local portion of the buffer
        for (size_t i = first; i < last; i++)
        {
            const int8_t idx = indices[i];
            acc[idx] += rsi[i];
        }
    }

    // Reduce 16*N scalars to 16 with a few AVX instructions
    for (int i = 0; i < 16; i += 4)
    {
        __m256d v = _mm256_load_pd(&accumulators[i]);
        for (int j = 1; j < ompThreadsCount; j++)
        {
            __m256d v2 = _mm256_load_pd(&accumulators[i + j * 16]);
            v = _mm256_add_pd(v, v2);
        }
        _mm256_storeu_pd(rdi + i, v);
    }
}
The above version runs in 20.5 ms, which translates to 88% of the RAM bandwidth limit.
P.S. I have no idea why the optimal thread count is 4 here; I have 8 cores/16 threads in the CPU. Both lower and higher values decrease the bandwidth. The constant is probably CPU-specific.
If the offsets indeed do not change for thousands (probably even tens) of runs, it is likely worthwhile to "transpose" them, i.e., to store all indices which need to be added to acc[0], then all indices which need to be added to acc[1], etc.
Essentially, what you are doing originally is a sparse-matrix times dense-vector product, with the matrix in compressed-column-storage (CCS) format (without explicitly storing the 1-values).
As shown in this answer, sparse GEMV products are usually faster if the matrix is stored in compressed-row-storage (CRS) format instead (even without AVX2's gather instruction, you don't need to load and store the accumulated value every time).
Untested example implementation:
#include <cstdint>
#include <vector>

using sparse_matrix = std::vector<std::vector<int>>;

// call this once:
sparse_matrix transpose(uint8_t const* offsets, int n_bins, int n_obs)
{
    sparse_matrix res;
    res.resize(n_bins);
    // collect the observation indices belonging to each bin:
    for (int i = 0; i < n_obs; ++i) {
        // assert(offsets[i] < n_bins);
        res[offsets[i]].push_back(i);
    }
    return res;
}

void accumulate(double acc[], sparse_matrix const& indexes, double const* obs)
{
    for (std::size_t row = 0; row < indexes.size(); ++row) {
        double sum = 0;
        for (int col : indexes[row]) {
            // you can manually vectorize this using _mm256_i32gather_pd,
            // but clang/gcc should autovectorize it with -ffast-math -O3 -march=native
            sum += obs[col];
        }
        acc[row] = sum;
    }
}
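A hypothetical driver, just to make the intended call pattern explicit (n_iterations is an assumed name, and note that this transpose takes uint8_t const* while the question declared the offsets as int8_t):

// Transpose once (cast needed because the question's offsets are int8_t).
sparse_matrix indexes = transpose(
    reinterpret_cast<uint8_t const*>(offsets), N_BINS, N_OBS);
for (int iter = 0; iter < n_iterations; ++iter) {
    // obs may change between iterations; offsets must not.
    accumulate(acc, indexes, obs);
}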
I currently have a bool mask vector generated in Eigen. I would like to use this binary mask similarly to Python's numpy, where depending on the True values I get a sub-matrix or a sub-vector that I can do further calculations on.
To achieve this in Eigen, I currently "convert" the mask vector into another vector containing the indices, by simply iterating over the mask:
Eigen::Array<bool, Eigen::Dynamic, 1> mask = ...; // e.g.: [0, 1, 1, 1, 0, 1]
Eigen::Array<uint32_t, Eigen::Dynamic, 1> mask_idcs(mask.count(), 1);
int z_idx = 0;
for (int z = 0; z < mask.rows(); z++) {
    if (mask(z)) {
        mask_idcs(z_idx++) = z;
    }
}
// do further calculations on vector(mask_idcs),
// e.g.: vector(mask_idcs)*3 + another_vector
However, I want to optimize this further and am wondering if Eigen3 provides a more elegant solution for this, something like vector(from_bin_mask(mask)), which may benefit from the library's optimizations.
There are already some questions here on SO, but none seems to answer this simple use case (1, 2). Some refer to the select function, which returns an equally sized vector/matrix/array, but I want to discard elements via a mask and only work further with a smaller vector/matrix/array.
Is there a way to do this more elegantly? Can this be optimized otherwise?
(I am using the Eigen::Array type since most of the calculations are element-wise in my use case.)
As far as I'm aware, there is no off-the-shelf solution using Eigen's methods. However, it is interesting to notice that (at least for Eigen versions greater than or equal to 3.4.0) you can use a std::vector<int> for indexing (see this section of the documentation). Therefore the code you've written could be simplified to
Eigen::Array<bool, Eigen::Dynamic, 1> mask = ...; // e.g.: [0, 1, 1, 1, 0, 1]
std::vector<int> mask_idcs;
for (int z = 0; z < mask.rows(); z++) {
    if (mask(z)) {
        mask_idcs.push_back(z);
    }
}
// do further calculations on vector(mask_idcs),
// e.g.: vector(mask_idcs)*3 + another_vector
If you're using C++20, you could use an alternative implementation based on std::ranges, avoiding raw for-loops:
using std::ranges::begin;
using std::ranges::end;
using std::views::filter;
using std::views::iota;

int const N = mask.size();
auto c = iota(0, N) | filter([&mask](auto const& i) { return mask[i]; });
auto masked_indices = std::vector(begin(c), end(c));
// ... use it as vector(masked_indices) ...
I've implemented some minimal examples in Compiler Explorer in case you'd like to check them out. I honestly wish there were a simpler way to initialize the std::vector from the raw range, but it's currently not so simple. Therefore I'd suggest you wrap the code into a helper function, for example
auto filtered_indices(auto const& mask) // or, as you've suggested, from_bin_mask(auto const& mask)
{
    using std::ranges::begin;
    using std::ranges::end;
    using std::views::filter;
    using std::views::iota;

    int const N = mask.size();
    auto c = iota(0, N) | filter([&mask](auto const& i) { return mask[i]; });
    return std::vector(begin(c), end(c));
}
and then use it as, for example,
Eigen::ArrayXd F(5);
F << 0.0, 1.1548, 0.0, 0.0, 2.333;
auto mask = (F > 1e-15).eval();
auto D = (F(filtered_indices(mask)) + 3).eval();
It's not as clean as in numpy, but it's something :)
I have found another way, which seems more elegant than checking each element for equality with 0:
Eigen::SparseMatrix<bool> mask_sparse = mask.matrix().sparseView();
for (uint32_t k = 0; k < mask.outerSize(); ++k) {
    for (Eigen::SparseMatrix<bool>::InnerIterator it(mask_sparse, k); it; ++it) {
        std::cout << it.row() << std::endl; // row index
        std::cout << it.col() << std::endl; // col index
        // do stuff or build up an array
    }
}
Here we can at least build up a vector (or multiple vectors, if we have more dimensions) and then later use it to "mask" a vector or matrix. (This is adapted from the documentation.)
So, applied to this specific use case, we simply do:
Eigen::Array<uint32_t, Eigen::Dynamic, 1> mask_idcs(mask.count(), 1);
Eigen::SparseVector<bool> mask_sparse = mask.matrix().sparseView();
int z_idx = 0;
for (Eigen::SparseVector<bool>::InnerIterator it(mask_sparse); it; ++it) {
    mask_idcs(z_idx++) = it.index();
}
// do stuff like vector(mask_idcs)*3 + another_vector
However, I do not know which version is faster for large masks containing thousands of elements.
I am given a filled array of size WxH and need to create a new array by scaling both the width and the height by a power of 2. For example, 2x3 becomes 8x12 when scaled by 4, 2^2. My goal is to make sure all the old values in the array are placed in the new array such that 1 value in the old array fills up multiple new corresponding parts in the scaled array. For example:
old_array = [[1,2],
[3,4]]
becomes
new_array = [[1,1,2,2],
[1,1,2,2],
[3,3,4,4],
[3,3,4,4]]
when scaled by a factor of 2. Could someone explain to me the logic on how I would go about programming this?
It's actually very simple. I use a vector of vectors for simplicity, noting that 2D matrices of this kind are not efficient. However, any 2D matrix class using [] indexing syntax can, and for efficiency should, be substituted.
#include <vector>
using std::vector;

int main()
{
    vector<vector<int>> vin{ {1,2}, {3,4}, {5,6} };
    size_t scaleW = 2;
    size_t scaleH = 3;
    vector<vector<int>> vout(scaleH * vin.size(), vector<int>(scaleW * vin[0].size()));
    for (size_t i = 0; i < vout.size(); i++)
        for (size_t ii = 0; ii < vout[0].size(); ii++)
            vout[i][ii] = vin[i / scaleH][ii / scaleW];
    auto x = vout[8][3]; // last element s/b 6
}
Here is my take. It is very similar to Tudor's, but I figure between our two answers you can pick what you like or understand best.
First, let's define a suitable 2D array type, because C++'s standard library is very lacking in this regard. I've limited myself to a rather simple struct, in case you don't feel comfortable with object-oriented programming.
#include <vector>
// using std::vector

struct Array2d
{
    unsigned rows, cols;
    std::vector<int> data;
};
This print function should give you an idea of how the indexing works:
#include <cstdio>
// using std::putchar, std::printf, std::fputs

void print(const Array2d& arr)
{
    std::putchar('[');
    for (std::size_t row = 0; row < arr.rows; ++row) {
        std::putchar('[');
        for (std::size_t col = 0; col < arr.cols; ++col)
            std::printf("%d, ", arr.data[row * arr.cols + col]);
        std::fputs("]\n ", stdout);
    }
    std::fputs("]\n", stdout);
}
Now to the heart, the array scaling. The amount of nesting is … bothersome.
Array2d scale(const Array2d& in, unsigned rowfactor, unsigned colfactor)
{
    Array2d out;
    out.rows = in.rows * rowfactor;
    out.cols = in.cols * colfactor;
    out.data.resize(std::size_t(out.rows) * out.cols);
    for (std::size_t inrow = 0; inrow < in.rows; ++inrow) {
        for (unsigned rowoff = 0; rowoff < rowfactor; ++rowoff) {
            std::size_t outrow = inrow * rowfactor + rowoff;
            for (std::size_t incol = 0; incol < in.cols; ++incol) {
                std::size_t in_idx = inrow * in.cols + incol;
                int inval = in.data[in_idx];
                for (unsigned coloff = 0; coloff < colfactor; ++coloff) {
                    std::size_t outcol = incol * colfactor + coloff;
                    std::size_t out_idx = outrow * out.cols + outcol;
                    out.data[out_idx] = inval;
                }
            }
        }
    }
    return out;
}
Let's pull it all together for a little demonstration:
int main()
{
    Array2d in;
    in.rows = 2;
    in.cols = 3;
    in.data.resize(in.rows * in.cols);
    for (std::size_t i = 0; i < in.rows * in.cols; ++i)
        in.data[i] = static_cast<int>(i);
    print(in);
    print(scale(in, 3, 2));
}
This prints
[[0, 1, 2, ]
[3, 4, 5, ]
]
[[0, 0, 1, 1, 2, 2, ]
[0, 0, 1, 1, 2, 2, ]
[0, 0, 1, 1, 2, 2, ]
[3, 3, 4, 4, 5, 5, ]
[3, 3, 4, 4, 5, 5, ]
[3, 3, 4, 4, 5, 5, ]
]
To be honest, I'm incredibly bad at algorithms, but I gave it a shot.
I am not sure if this can be done using only one matrix, or if it can be done in less time complexity.
Edit: You can estimate the number of operations this will make as W*H*S*S, where S is the scale factor, W is the width and H is the height of the input matrix.
I used two matrices, m and r, where m is your input and r is your result/output. All that needs to be done is to copy each element from m at position [i][j] and turn it into a square of elements with the same value, of size scale_factor, inside r.
Simply put:
int main()
{
    Matrix<int> m(2, 2);
    // initial values in your example
    m[0][0] = 1;
    m[0][1] = 2;
    m[1][0] = 3;
    m[1][1] = 4;
    m.Print();

    // pick some scale factor and create the new matrix
    unsigned long scale = 2;
    Matrix<int> r(m.rows * scale, m.columns * scale);

    // i know this is bad but it is the most
    // straightforward way of doing this
    // it is also the only way i can think of :(
    for (unsigned long i1 = 0; i1 < m.rows; i1++)
        for (unsigned long j1 = 0; j1 < m.columns; j1++)
            for (unsigned long i2 = i1 * scale; i2 < (i1 + 1) * scale; i2++)
                for (unsigned long j2 = j1 * scale; j2 < (j1 + 1) * scale; j2++)
                    r[i2][j2] = m[i1][j1];

    // the output in your example
    std::cout << "\n\n";
    r.Print();
    return 0;
}
I do not think it is relevant to the question, but I used a class Matrix to store all the elements of the extended matrix. I know it is a distraction, but this is still C++ and we have to manage memory. And what you are trying to achieve with this algorithm needs a lot of memory if the scale_factor is big, so I wrapped it up using this:
#include <iostream>

template <typename type_t>
class Matrix
{
private:
    type_t** Data;
public:
    // should be private and have getters, but
    // that would make the code larger...
    unsigned long rows;
    unsigned long columns;

    // 2D arrays get big pretty fast with what you are
    // trying to do.
    Matrix(unsigned long rows, unsigned long columns)
    {
        this->rows = rows;
        this->columns = columns;
        Data = new type_t*[rows];
        for (unsigned long i = 0; i < rows; i++)
            Data[i] = new type_t[columns];
    }

    // It is true, a copy constructor is needed,
    // as HolyBlackCat pointed out
    Matrix(const Matrix& m)
    {
        rows = m.rows;
        columns = m.columns;
        Data = new type_t*[rows];
        for (unsigned long i = 0; i < rows; i++)
        {
            Data[i] = new type_t[columns];
            for (unsigned long j = 0; j < columns; j++)
                Data[i][j] = m.Data[i][j]; // operator[] is non-const, so read Data directly
        }
    }

    ~Matrix()
    {
        for (unsigned long i = 0; i < rows; i++)
            delete[] Data[i];
        delete[] Data;
    }

    void Print()
    {
        for (unsigned long i = 0; i < rows; i++)
        {
            for (unsigned long j = 0; j < columns; j++)
                std::cout << Data[i][j] << " ";
            std::cout << "\n";
        }
    }

    type_t* operator[](unsigned long row)
    {
        return Data[row];
    }
};
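Since the class manages raw memory, the rule of three also calls for a copy assignment operator. A minimal copy-and-swap sketch of what could be added to Matrix (my addition, not part of the original answer):

// Inside class Matrix: taking the parameter by value reuses the copy
// constructor; swapping hands our old buffers to the temporary, whose
// destructor then frees them. Needs #include <utility> for std::swap.
Matrix& operator=(Matrix m)
{
    std::swap(rows, m.rows);
    std::swap(columns, m.columns);
    std::swap(Data, m.Data);
    return *this;
}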
First of all, having a suitable 2D matrix class is presumed, but it is not the question. I don't know the API of yours, so I'll illustrate with something typical:
struct coord {
    size_t x; // x position or column count
    size_t y; // y position or row count
};

template <typename T>
class Matrix2D {
    ⋮ // implementation details
public:
    ⋮ // all needed special members (ctors, dtor, assignment)
    Matrix2D (coord dimensions);
    coord dimensions() const;             // return height and width
    const T& cell (coord position) const; // read-only access
    T& cell (coord position);             // read-write access
    // handy synonyms:
    const T& operator[](coord position) const { return cell(position); }
    T& operator[](coord position)             { return cell(position); }
};
I just showed the public members I need: create a matrix with a given size, query the size, and indexed access to the individual elements.
So, given that, your problem description is:
template<typename T>
Matrix2D<T> scale_pow2 (const Matrix2D<T>& input, size_t pow)
{
    const auto scale_factor = 1 << pow;
    const auto size_in = input.dimensions();
    Matrix2D<T> result ({size_in.x * scale_factor, size_in.y * scale_factor});
    ⋮
    ⋮ // fill up result
    ⋮
    return result;
}
OK, so now the problem is precisely defined: what code goes in the big blank immediately above?
Each cell in the input gets put into a bunch of cells in the output. So you can either iterate over the input and write a clump of cells in the output all having the same value, or you can iterate over the output and each cell you need the value for is looked up in the input.
The latter is simpler since you don't need a nested loop (or pair of loops) to write a clump.
for (coord outpos : /* ?? every cell of the output ?? */) {
    coord frompos {
        outpos.x >> pow,
        outpos.y >> pow };
    result[outpos] = input[frompos];
}
Now that's simple!
Calculating the source position for a given output cell must match the way the scale was defined: the low pow bits give the position within a clump, and the higher bits are the index of where that clump came from. For example, with pow = 2 (scale factor 4), output column 13 maps to input column 13 >> 2 == 3, and the low bits 13 & 3 == 1 locate it within that clump.
Now, we want to set outpos to every legal position in the output matrix's index space. That's all I need. How to actually do that is another sub-problem, and can be pushed off with top-down decomposition.
a bit more advanced
Maybe nested loops are the easiest way to get that done, but I won't put them directly into this code, pushing my nesting level even deeper. And looping over 0..max is not the simplest thing to write in bare C++ without libraries, so that would just be distracting. Besides, if you're working with matrices, this is something you'll have a general need for, including (say) printing out the answer!
So here's the double-loop, put into its own code:
struct all_positions {
    coord current {0,0};
    coord end;
    all_positions (coord end) : end{end} {}
    bool next() {
        if (++current.x < end.x) return true; // not reached the end of the row yet
        current.x = 0;                        // reset to the start of the row
        if (++current.y < end.y) return true;
        return false;                         // I don't have a valid position now.
    }
};
This does not follow the iterator/collection API that would let you use it in a range-based for loop. For information on how to do that, see my article on Code Project, or use the Ranges stuff in the C++20 standard library.
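For the standard-library route, here is a minimal sketch. It assumes a C++23 compiler, since std::views::cartesian_product arrived after C++20, and it reuses the coord struct from above:

#include <cstddef>
#include <ranges>

// Visit every {x, y} cell of a dims.x-by-dims.y output, row by row:
// the first (y) range varies slowest, so rows are scanned in order.
void visit_all(coord dims)
{
    namespace rv = std::views;
    for (auto [y, x] : rv::cartesian_product(
             rv::iota(std::size_t{0}, dims.y),
             rv::iota(std::size_t{0}, dims.x))) {
        coord outpos {x, y};
        // ... use outpos as in the loop body above ...
    }
}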
Given this "old fashioned" iteration helper, I can write the loop as:
all_positions scanner {output.dimensions()}; // starts at {0,0}
const auto& outpos = scanner.current;
do {
    ⋮
} while (scanner.next());
Because of the simple implementation, it starts at {0,0}, and advancing it also tests at the same time, returning false when it can't advance any more. Thus, you have to declare it (which gives the first cell), use it, then advance-and-test: a test-at-the-end loop. A for loop in C++ checks the condition before each use and advances at the end, using different functions, so making this compatible with a for loop is more work; surprisingly, making it work with the range-based for is not much more. Separating out the test and the advance the right way is the real work; the rest is just naming conventions.
As long as this is "custom", you can further modify it for your needs. For example, add a flag inside to tell you when the row changed, or that it's the first or last of a row, to make it handy for pretty-printing.
summary
You need a bunch of things working in addition to the little piece of code you actually want to write. Here, it's a usable Matrix class. Very often, it's prompting for input, opening files, handling command-line options, and that kind of stuff. It distracts from the real problem, so get that out of the way first.
Write your code (the real code you came for) in its own function, separate from any other stuff you also need in order to house it. Get it elsewhere if you can; it's not part of the lesson and just serves as a distraction. Worse, it may be "hard" in ways you are not prepared for (or to do well) as it's unrelated to the actual lesson being worked on.
Figure out the algorithm (flowchart, pseudocode, whatever) in a general way before translating that to legal syntax and API on the objects you are using. If you're just learning C++, don't get bogged down in the formal syntax when you are trying to figure out the logic. Until you naturally start to think in C++ when doing that kind of planning, don't force it. Use whiteboard doodles, tinkertoys, whatever works for you.
Get feedback and review of the idea, the logic of how to make it happen, from your peers and mentors if available, before you spend time coding. Why write up an idea that doesn't work? Fix the logic, not the code.
Finally, sketch the needed control flow, functions and data structures you need. Use pseudocode and placeholder notes.
Then fill in the placeholders and replace the pseudo with the legal syntax. You already planned it out, so now you can concentrate on learning the syntax and library details of the programming language. You can concentrate on "how do I express (some tiny detail) in C++" rather than keeping the entire program in your head. More generally, isolate a part that you will be learning; be learning/practicing one thing without worrying about the entire edifice.
To a large extent, some of those ideas translate to the code as well. Top-Down Design means you state things at a high level and then implement that elsewhere, separately. It makes code readable and maintainable, as well as easier to write in the first place. Functions should be written this way: the function explains how to do (what it does) as a list of details that are just one level of detail further down. Each of those steps then becomes a new function. Functions should be short and expressed at one semantic level of abstraction. Don't dive down into the most primitive details inside the function that explains the task as a set of simpler steps.
Good luck, and keep it up!
I am trying to write a sorting function and a summation function in OpenCL/C++. However, while both functions work fine on smaller datasets, neither works on any dataset of notable length. The dataset I'm trying to use is about 2 million entries long, but the functions stop working consistently at about 500 entries. Any help on why this is would be appreciated. OpenCL code below.
EDIT: Only the code fully relevant to the sum is now shown (as per request).
kernel void sum(global const double* A, global double* B) {
    int id = get_global_id(0);
    int N = get_global_size(0);
    B[id] = A[id];
    barrier(CLK_GLOBAL_MEM_FENCE);
    for (int i = 1; i < N / 2; i *= 2) { // i is a stride
        if (!(id % (i * 2)) && ((id + i) < N))
            B[id] += B[id + i];
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
}
And the C++ code:
std::vector<double> temps(100000, 1);

// Load functions
cl::Kernel kernel_sum = cl::Kernel(program, "sum");

// Set up variables
size_t elements = temps.size();
size_t size = temps.size() * sizeof(double);
size_t workgroup_size = 10;
size_t padding_size = elements % workgroup_size;

// Sum
if (padding_size) {
    std::vector<double> temps_padding(workgroup_size - padding_size, 0);
    temps.insert(temps.end(), temps_padding.begin(), temps_padding.end());
}
std::vector<double> temps_sum(elements);
size_t output_size = temps_sum.size() * sizeof(double);

cl::Buffer sum_buffer_1(context, CL_MEM_READ_ONLY, size);
cl::Buffer sum_buffer_2(context, CL_MEM_READ_WRITE, output_size);
queue.enqueueWriteBuffer(sum_buffer_1, CL_TRUE, 0, size, &temps[0]);
queue.enqueueFillBuffer(sum_buffer_2, 0, 0, output_size);

kernel_sum.setArg(0, sum_buffer_1);
kernel_sum.setArg(1, sum_buffer_2);
queue.enqueueNDRangeKernel(kernel_sum, cl::NullRange, cl::NDRange(elements), cl::NDRange(workgroup_size));
queue.enqueueReadBuffer(sum_buffer_2, CL_TRUE, 0, output_size, &temps_sum[0]);

double summed = temps_sum[0];
std::cout << "SUMMED: " << summed << std::endl;
I have tried looking around everywhere but I'm completely stuck.
You're trying to use barriers for synchronisation across work groups. This won't work; barriers are for synchronising within a work group.
Work groups don't run in a well-defined order relative to one another, so you can only use this sort of reduction algorithm within a work group. You will probably need a second kernel pass to combine the results from individual work groups, or do this part on the host CPU. (Or modify your algorithm to use atomics in some way, etc.)
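As a rough sketch of that host-side option, assuming the kernel were changed to reduce within each work group only (local ids and CLK_LOCAL_MEM_FENCE instead of global ones), so that each group leaves its partial sum at the start of its slice of B:

// Hypothetical final pass on the host, after enqueueReadBuffer as in the
// question: sum one partial result per work group.
double total = 0.0;
for (size_t g = 0; g < temps_sum.size() / workgroup_size; ++g)
    total += temps_sum[g * workgroup_size];
std::cout << "SUMMED: " << total << std::endl;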
I'm working with OpenGL at the moment, creating a 'texture cache' which handles loading images and buffering them with OpenGL. In the event that an image file can't be loaded, it needs to fall back to a default texture, which I've hard-coded in the constructor.
What I basically need to do is create a texture of a uniform colour. This is not too difficult; it's just an array of size pixels * colour channels.
I am currently using a std::vector to hold the initial data before I upload it to OpenGL. The problem I'm having is that I can't find any information on the best way to initialize a vector with a repeating pattern.
The first way that occurred to me was to use a loop.
std::vector<unsigned char> blue_texture;
for (int iii = 0; iii < width * height; iii++)
{
    blue_texture.push_back(0);
    blue_texture.push_back(0);
    blue_texture.push_back(255);
    blue_texture.push_back(255);
}
However, this seems inefficient since the vector will have to resize itself numerous times. Even if I reserve space first and perform the loop it's still not efficient since the contents will be zeroed before the loop which means two writes for each unsigned char.
Currently I'm using the following method:
struct colour {unsigned char r; unsigned char g; unsigned char b; unsigned char a;};
colour blue = {0, 0, 255, 255};
std::vector<colour> texture((width * height), blue);
I then extract the data using:
reinterpret_cast<unsigned char*>(texture.data());
Is there a better way than this? I'm new to C/C++ and I'll be honest, casting pointers scares me.
Your loop solution is the right way to go, in my opinion. To make it efficient by removing the repeated reallocations, use blue_texture.reserve(width * height * 4).
The reserve call increases the allocation, a.k.a. capacity, to that size without zero-filling it. (Note that the operating system may still zero the memory, if it pulls the pages from mmap for example.) It does not change the size of the vector, so push_back and friends still work the same way.
You can use reserve to pre-allocate the vector; this will avoid the reallocations. You can also define a small sequence (probably as a C-style array; note it should be unsigned char, since 255 does not fit in a signed char):
unsigned char const init[] = { 0, 0, 255, 255 };
and loop, inserting that at the end of the vector:
for ( int i = 0; i < pixelCount; ++ i ) {
    v.insert( v.end(), std::begin( init ), std::end( init ) );
}
This is only marginally more efficient than using the four push_back calls in the loop, but it is more succinct, and perhaps makes it clearer what you're doing, albeit only marginally: the big advantage might be being able to give a name to the initialization sequence (e.g. something like defaultBackground).
The most efficient way is the way that does the least work.
Unfortunately, push_back(), insert() and the like have to maintain the size() of the vector as they work, which is redundant when performed in a tight loop.
Therefore the most efficient way is to allocate the memory once and then copy the data directly into it without maintaining any other variables.
It's done like this:
#include <algorithm> // std::copy
#include <array>
#include <cstdint>
#include <vector>

using colour_fill = std::array<uint8_t, 4>;
using pixel_map = std::vector<uint8_t>;

pixel_map make_colour_texture(size_t width, size_t height, colour_fill colour)
{
    // allocate the buffer
    pixel_map pixels(width * height * sizeof(colour_fill));

    // copy the 4-byte pattern into the buffer, one pixel at a time
    auto current = pixels.data();
    auto last = current + pixels.size();
    while (current != last) {
        current = std::copy(begin(colour), end(colour), current);
    }
    return pixels;
}

auto main() -> int
{
    colour_fill blue { 0, 0, 255, 255 };
    auto blue_bits = make_colour_texture(100, 100, blue);
    return 0;
}
I would reserve the entire size that you need and then use the insert function to repeatedly add the pattern into the vector.
std::array<unsigned char, 4> pattern {0, 0, 255, 255};
std::vector<unsigned char> blue_texture;
blue_texture.reserve(width * height * 4);
for (int i = 0; i < (width * height); ++i)
{
    blue_texture.insert(blue_texture.end(), pattern.begin(), pattern.end());
}
I made this template function, which modifies its input container to contain count repetitions of what it already contains.
#include <algorithm>
#include <iostream>
#include <iterator> // std::next, std::advance
#include <vector>

template<typename Container>
void repeat_pattern(Container& data, std::size_t count) {
    auto pattern_size = data.size();
    if (count == 0 or pattern_size == 0) {
        return;
    }
    data.resize(pattern_size * count);
    const auto pbeg = data.begin();
    const auto pend = std::next(pbeg, pattern_size);
    auto it = std::next(data.begin(), pattern_size);
    for (std::size_t k = 1; k < count; ++k) {
        std::copy(pbeg, pend, it);
        std::advance(it, pattern_size);
    }
}

template<typename Container>
void show(const Container& data) {
    for (const auto& item : data) {
        std::cout << item << " ";
    }
    std::cout << std::endl;
}

int main() {
    std::vector<int> v{1, 2, 3, 4};
    repeat_pattern(v, 3);
    // should show three repetitions of 1, 2, 3, 4
    show(v);
}
Output (compiled with g++ example.cpp -std=c++14 -Wall -Wextra):
1 2 3 4 1 2 3 4 1 2 3 4