Adding 3D vectors using SIMD intrinsics - c++

I've got two streams of 3D vectors which I'd like to add using x86 AVX2 intrinsics. I'm using the GNU compiler 11.1.0. Hopefully, the code illustrates what I want to do:
// Example program
#include <cstddef> // std::size_t
#include <immintrin.h>
struct v3
{
float data[3] = {};
};
void add(const v3* a, const v3* b, v3* c, const std::size_t& n)
{
// c <- a + b
for (auto i = std::size_t{}; i < n; i += 2) // 2 vector3s at a time ~6 data
{
// masking
// a[i] is loaded into lanes 0-2 ([95:0]) of one register, a[i+1] into lanes 3-5 ([191:96]) of *another* 256-bit register
// ^same with b[i]
static const auto p1_mask = _mm256_setr_epi32(-1, -1, -1, 0, 0, 0, 0, 0);
static const auto p2_mask = _mm256_setr_epi32(0, 0, 0, -1, -1, -1, 0, 0);
const auto p1_leftop_packed = _mm256_maskload_ps(a[i].data, p1_mask);
const auto p2_leftop_packed = _mm256_maskload_ps(a[i].data, p2_mask);
const auto p1_rightop_packed = _mm256_maskload_ps(b[i].data, p1_mask);
const auto p2_rightop_packed = _mm256_maskload_ps(b[i].data, p2_mask);
// addition is being done inefficiently with 2 AVX2 instructions!
const auto result1_packed = _mm256_add_ps(p1_leftop_packed, p1_rightop_packed);
const auto result2_packed = _mm256_add_ps(p2_leftop_packed, p2_rightop_packed);
// store them back
_mm256_maskstore_ps(c[i].data, p1_mask, result1_packed);
_mm256_maskstore_ps(c[i].data, p2_mask, result2_packed);
}
}
int main()
{
// data
const auto n = std::size_t{1000};
v3 a[n] = {};
v3 b[n] = {};
v3 c[n] = {};
// run
add(a, b, c, n);
return 0;
}
The above code works but the performance is quite terrible. To correct it, I think I need a version which looks approximately like the following:
// c <- a + b
for (auto i = std::size_t{}; i < n; i += 2) // 2 vector3s at a time ~6 data
{
// masking
// [95:0] of a[i] move into [255:128], [95:0] of a[i+1] in [127:0]
const auto leftop_packed = /*code required here*/;
const auto rightop_packed = /*code required here*/;
// addition is being done with only 1 AVX2 instruction
const auto result_packed = _mm256_add_ps(leftop_packed, rightop_packed);
// store them back
// [95:0] of result_packed move into c[i], [223:128] of result_packed into c[i+1]
/*code required here*/
}
How do I achieve this? I will gladly provide any additional information when needed. Any help would be much appreciated.

The following two comments say the same thing, and they are right. Do as they say.
I think you can just load 8 floats at a time and then if you have anything left over at the end you can do a masked store (not sure about this part). – LHLaurini
Use char*, float*, or __m256* to work in 32-byte or 8-float chunks, ignoring vector boundaries since you're just doing pure vertical addition. float* should be good for cleanup of the last up-to-7 floats – Peter Cordes
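What the two comments describe could look roughly like the sketch below. It treats both streams as one flat run of 3*n floats, adds eight floats per iteration with unaligned loads and stores, and finishes the last up-to-7 floats with a scalar loop. The name add_flat is made up for this illustration, and the sketch assumes v3 has no padding (checked with a static_assert) and that the code is built with AVX enabled (for example -mavx); without AVX it falls back to the scalar loop only.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

struct v3
{
    float data[3] = {};
};

// The flat view below is only valid if v3 is exactly three packed floats.
static_assert(sizeof(v3) == 3 * sizeof(float), "v3 must have no padding");

// c <- a + b, ignoring vector boundaries (hypothetical helper name).
void add_flat(const v3* a, const v3* b, v3* c, std::size_t n)
{
    const float* pa = reinterpret_cast<const float*>(a);
    const float* pb = reinterpret_cast<const float*>(b);
    float* pc = reinterpret_cast<float*>(c);
    const std::size_t total = 3 * n; // number of floats in each stream

    std::size_t i = 0;
#if defined(__AVX__)
    // Main loop: 8 floats (one 256-bit register) per iteration.
    for (; i + 8 <= total; i += 8)
    {
        const __m256 va = _mm256_loadu_ps(pa + i);
        const __m256 vb = _mm256_loadu_ps(pb + i);
        _mm256_storeu_ps(pc + i, _mm256_add_ps(va, vb));
    }
#endif
    // Cleanup: the remaining floats (at most 7 when the AVX loop ran).
    for (; i < total; ++i)
        pc[i] = pa[i] + pb[i];
}
```

Because the addition is purely vertical, no masking or shuffling is needed; the only special case is the tail.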

The Eigen library supports vectorization. It also has a lot of the vector/matrix math algorithms already implemented, and quite efficiently too. If you can, I'd recommend looking into using it instead of rolling your own logic.

Related

More efficient way to get indices of a binary mask in Eigen3?

I currently have a bool mask vector generated in Eigen. I would like to use this binary mask as in Python's NumPy, where the True values select a sub-matrix or sub-vector on which I can do further calculations.
To achieve this in Eigen, I currently "convert" the mask vector into another vector containing the indices by simply iterating over the mask:
Eigen::Array<bool, Eigen::Dynamic, 1> mask = ... // E.G.: [0, 1, 1, 1, 0, 1];
Eigen::Array<uint32_t, Eigen::Dynamic, 1> mask_idcs(mask.count(), 1);
int z_idx = 0;
for (int z = 0; z < mask.rows(); z++) {
if (mask(z)) {
mask_idcs(z_idx++) = z;
}
}
// do further calculations on vector(mask_idcs)
// E.G.: vector(mask_idcs)*3 + another_vector
However, I want to optimize this further and am wondering whether Eigen3 provides a more elegant solution, something like vector(from_bin_mask(mask)), which may benefit from the library's optimizations.
There are already some questions here on SO, but none seems to answer this simple use case
(1, 2). Some refer to the select function, which returns an equally sized vector/matrix/array, but I want to discard elements via a mask and only work further with a smaller vector/matrix/array.
Is there a way to do this in a more elegant way? Can this be optimized otherwise?
(I am using the Eigen::Array-type since most of the calculations are element-wise in my use-case)
As far as I'm aware, there is no off-the-shelf solution using Eigen's methods. However, it is interesting to note that (at least for Eigen 3.4.0 and later) you can use a std::vector<int> for indexing (see this section). Therefore the code you've written could be simplified to
Eigen::Array<bool, Eigen::Dynamic, 1> mask = ... // E.G.: [0, 1, 1, 1, 0, 1];
std::vector<int> mask_idcs;
for (int z = 0; z < mask.rows(); z++) {
if (mask(z)) {
mask_idcs.push_back(z);
}
}
// do further calculations on vector(mask_idcs)
// E.G.: vector(mask_idcs)*3 + another_vector
If you're using C++20, you could use an alternative implementation based on std::ranges, without raw for-loops:
int const N = mask.size();
auto c = iota(0, N) | filter([&mask](auto const& i) { return mask[i]; });
auto masked_indices = std::vector(begin(c), end(c));
// ... Use it as vector(masked_indices) ...
I've implemented some minimal examples in Compiler Explorer in case you'd like to check them out. I honestly wish there were a simpler way to initialize the std::vector from the raw range, but it's currently not so simple. Therefore I'd suggest wrapping the code in a helper function, for example
auto filtered_indices(auto const& mask) // or as you've suggested from_bin_mask(auto const& mask)
{
using std::ranges::begin;
using std::ranges::end;
using std::views::filter;
using std::views::iota;
int const N = mask.size();
auto c = iota(0, N) | filter([&mask](auto const& i) { return mask[i]; });
return std::vector(begin(c), end(c));
}
and then use it as, for example,
Eigen::ArrayXd F(5);
F << 0.0, 1.1548, 0.0, 0.0, 2.333;
auto mask = (F > 1e-15).eval();
auto D = (F(filtered_indices(mask)) + 3).eval();
It's not as clean as in numpy, but it's something :)
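For compilers without C++20 ranges, the same helper could be sketched with a plain loop. The name indices_where is made up for this illustration; the sketch only assumes the mask type offers size() and operator[], which holds for std::vector<bool> (used here so the example needs no Eigen) and should also hold for a one-dimensional Eigen array.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: collect the positions where mask[i] is true.
template <typename Mask>
std::vector<int> indices_where(const Mask& mask)
{
    const int n = static_cast<int>(mask.size());
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(n)); // upper bound, avoids reallocations
    for (int i = 0; i < n; ++i)
        if (mask[i])
            out.push_back(i);
    return out;
}
```

With the mask from the question, [0, 1, 1, 1, 0, 1], this yields the indices 1, 2, 3, 5.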
I have found another way, which seems more elegant than comparing each element against 0:
Eigen::SparseMatrix<bool> mask_sparse = mask.matrix().sparseView();
for (int k = 0; k < mask_sparse.outerSize(); ++k) {
for (Eigen::SparseMatrix<bool>::InnerIterator it(mask_sparse, k); it; ++it) {
std::cout << it.row() << std::endl; // row index
std::cout << it.col() << std::endl; // col index
// Do stuff or build up an array
}
}
Here we can at least build up a vector (or multiple vectors, if we have more dimensions) and then later use it to "mask" a vector or matrix. (This is taken from the documentation).
So applied to this specific usecase, we simply do:
Eigen::Array<uint32_t, Eigen::Dynamic, 1> mask_idcs(mask.count(), 1);
Eigen::SparseVector<bool> mask_sparse = mask.matrix().sparseView();
int z_idx = 0;
for (Eigen::SparseVector<bool>::InnerIterator it(mask_sparse); it; ++it) {
mask_idcs(z_idx++) = it.index();
}
// do Stuff like vector(mask_idcs)*3 + another_vector
However, I do not know which version is faster for large masks containing thousands of elements.

Having a hard time figuring out logic behind array manipulation

I am given a filled array of size WxH and need to create a new array by scaling both the width and the height by a power of 2. For example, 2x3 becomes 8x12 when scaled by 4, 2^2. My goal is to make sure all the old values in the array are placed in the new array such that 1 value in the old array fills up multiple new corresponding parts in the scaled array. For example:
old_array = [[1,2],
[3,4]]
becomes
new_array = [[1,1,2,2],
[1,1,2,2],
[3,3,4,4],
[3,3,4,4]]
when scaled by a factor of 2. Could someone explain to me the logic on how I would go about programming this?
It's actually very simple. I use a vector of vectors for simplicity, noting that a vector of vectors is not an efficient 2D matrix. Any 2D matrix class using [] indexing syntax can be substituted, and for efficiency it should be.
#include <vector>
using std::vector;
int main()
{
vector<vector<int>> vin{ {1,2},{3,4},{5,6} };
size_t scaleW = 2;
size_t scaleH = 3;
vector<vector<int>> vout(scaleH * vin.size(), vector<int>(scaleW * vin[0].size()));
for (size_t i = 0; i < vout.size(); i++)
for (size_t ii = 0; ii < vout[0].size(); ii++)
vout[i][ii] = vin[i / scaleH][ii / scaleW];
auto x = vout[8][3]; // last element s/b 6
}
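Wrapped into a function, the same integer-division idea can be checked against the example from the question. This is only a sketch: the names Grid and scale_grid are made up here, and the input is assumed to be rectangular (every row as long as the first).

```cpp
#include <cstddef>
#include <vector>

using Grid = std::vector<std::vector<int>>;

// Each output cell (r, c) reads the input cell (r / scaleH, c / scaleW).
Grid scale_grid(const Grid& in, std::size_t scaleH, std::size_t scaleW)
{
    if (in.empty() || scaleH == 0 || scaleW == 0)
        return {};
    Grid out(scaleH * in.size(), std::vector<int>(scaleW * in[0].size()));
    for (std::size_t r = 0; r < out.size(); ++r)
        for (std::size_t c = 0; c < out[r].size(); ++c)
            out[r][c] = in[r / scaleH][c / scaleW];
    return out;
}
```

Calling scale_grid({{1, 2}, {3, 4}}, 2, 2) gives the 4x4 new_array shown in the question.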
Here is my take. It is very similar to @Tudor's but I figure between our two, you can pick what you like or understand best.
First, let's define a suitable 2D array type because C++'s standard library is very lacking in this regard. I've limited myself to a rather simple struct, in case you don't feel comfortable with object oriented programming.
#include <vector>
// using std::vector
struct Array2d
{
unsigned rows, cols;
std::vector<int> data;
};
This print function should give you an idea how the indexing works:
#include <cstdio>
// using std::putchar, std::printf, std::fputs
void print(const Array2d& arr)
{
std::putchar('[');
for(std::size_t row = 0; row < arr.rows; ++row) {
std::putchar('[');
for(std::size_t col = 0; col < arr.cols; ++col)
std::printf("%d, ", arr.data[row * arr.cols + col]);
std::fputs("]\n ", stdout);
}
std::fputs("]\n", stdout);
}
Now to the heart, the array scaling. The amount of nesting is … bothersome.
Array2d scale(const Array2d& in, unsigned rowfactor, unsigned colfactor)
{
Array2d out;
out.rows = in.rows * rowfactor;
out.cols = in.cols * colfactor;
out.data.resize(std::size_t(out.rows) * out.cols);
for(std::size_t inrow = 0; inrow < in.rows; ++inrow) {
for(unsigned rowoff = 0; rowoff < rowfactor; ++rowoff) {
std::size_t outrow = inrow * rowfactor + rowoff;
for(std::size_t incol = 0; incol < in.cols; ++incol) {
std::size_t in_idx = inrow * in.cols + incol;
int inval = in.data[in_idx];
for(unsigned coloff = 0; coloff < colfactor; ++coloff) {
std::size_t outcol = incol * colfactor + coloff;
std::size_t out_idx = outrow * out.cols + outcol;
out.data[out_idx] = inval;
}
}
}
}
return out;
}
Let's pull it all together for a little demonstration:
int main()
{
Array2d in;
in.rows = 2;
in.cols = 3;
in.data.resize(in.rows * in.cols);
for(std::size_t i = 0; i < in.rows * in.cols; ++i)
in.data[i] = static_cast<int>(i);
print(in);
print(scale(in, 3, 2));
}
This prints
[[0, 1, 2, ]
[3, 4, 5, ]
]
[[0, 0, 1, 1, 2, 2, ]
[0, 0, 1, 1, 2, 2, ]
[0, 0, 1, 1, 2, 2, ]
[3, 3, 4, 4, 5, 5, ]
[3, 3, 4, 4, 5, 5, ]
[3, 3, 4, 4, 5, 5, ]
]
To be honest, I'm incredibly bad at algorithms, but I gave it a shot.
I am not sure if this can be done using only one matrix, or with lower time complexity.
Edit: You can estimate the number of operations this will make as W*H*S*S, where S is the scale factor, W is the width and H is the height of the input matrix.
I used 2 matrixes m and r, where m is your input and r is your result/output. All that needs to be done is to copy each element from m at positions [i][j] and turn it into a square of elements with the same value of size scale_factor inside r.
Simply put:
int main()
{
Matrix<int> m(2, 2);
// initial values in your example
m[0][0] = 1;
m[0][1] = 2;
m[1][0] = 3;
m[1][1] = 4;
m.Print();
// pick some scale factor and create the new matrix
unsigned long scale = 2;
Matrix<int> r(m.rows*scale, m.columns*scale);
// i know this is bad but it is the most
// straightforward way of doing this
// it is also the only way i can think of :(
for(unsigned long i1 = 0; i1 < m.rows; i1++)
for(unsigned long j1 = 0; j1 < m.columns; j1++)
for(unsigned long i2 = i1*scale; i2 < (i1+1)*scale; i2++)
for(unsigned long j2 = j1*scale; j2 < (j1+1)*scale; j2++)
r[i2][j2] = m[i1][j1];
// the output in your example
std::cout << "\n\n";
r.Print();
return 0;
}
I do not think it is relevant to the question, but I used a class Matrix to store all the elements of the extended matrix. I know it is a distraction, but this is still C++ and we have to manage memory. What you are trying to achieve with this algorithm needs a lot of memory if scale_factor is big, so I wrapped it up using this:
template <typename type_t>
class Matrix
{
private:
type_t** Data;
public:
// should be private and have Getters but
// that would make the code larger...
unsigned long rows;
unsigned long columns;
// 2d Arrays get big pretty fast with what you are
// trying to do.
Matrix(unsigned long rows, unsigned long columns)
{
this->rows = rows;
this->columns = columns;
Data = new type_t*[rows];
for(unsigned long i = 0; i < rows; i++)
Data[i] = new type_t[columns];
}
// It is true, a copy constructor is needed
// as HolyBlackCat pointed out
Matrix(const Matrix& m)
{
rows = m.rows;
columns = m.columns;
Data = new type_t*[rows];
for(unsigned long i = 0; i < rows; i++)
{
Data[i] = new type_t[columns];
for(unsigned long j = 0; j < columns; j++)
Data[i][j] = m.Data[i][j]; // m is const and operator[] is non-const, so read the member directly
}
}
~Matrix()
{
for(unsigned long i = 0; i < rows; i++)
delete [] Data[i];
delete [] Data;
}
void Print()
{
for(unsigned long i = 0; i < rows; i++)
{
for(unsigned long j = 0; j < columns; j++)
std::cout << Data[i][j] << " ";
std::cout << "\n";
}
}
type_t* operator [] (unsigned long row)
{
return Data[row];
}
};
First of all, having a suitable 2D matrix class is presumed but not the question. But I don't know the API of yours, so I'll illustrate with something typical:
struct coord {
size_t x; // x position or column count
size_t y; // y position or row count
};
template <typename T>
class Matrix2D {
⋮ // implementation details
public:
⋮ // all needed special members (ctors dtor, assignment)
Matrix2D (coord dimensions);
coord dimensions() const; // return height and width
const T& cell (coord position) const; // read-only access
T& cell (coord position); // read-write access
// handy synonym:
const T& operator[](coord position) const { return cell(position); }
T& operator[](coord position) { return cell(position); }
};
I just showed the public members I need: create a matrix with a given size, query the size, and indexed access to the individual elements.
So, given that, your problem description is:
template<typename T>
Matrix2D<T> scale_pow2 (const Matrix2D<T>& input, size_t pow)
{
const auto scale_factor = size_t{1} << pow; // 2^pow, computed as size_t rather than int
const auto size_in = input.dimensions();
Matrix2D<T> result ({size_in.x*scale_factor,size_in.y*scale_factor});
⋮
⋮ // fill up result
⋮
return result;
}
OK, so now the problem is precisely defined: what code goes in the big blank immediately above?
Each cell in the input gets put into a bunch of cells in the output. So you can either iterate over the input and write a clump of cells in the output all having the same value, or you can iterate over the output and each cell you need the value for is looked up in the input.
The latter is simpler since you don't need a nested loop (or pair of loops) to write a clump.
for (coord outpos : /* ?? every cell of the output ?? */) {
coord frompos {
outpos.x >> pow,
outpos.y >> pow };
result[outpos] = input[frompos];
}
Now that's simple!
Calculating the from position for a given output position must match the way the scale was defined: the low pow bits give the position within the clump, and the higher bits are the index of the input cell that clump came from.
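As a small illustration of that bit split along one axis, consider the two helpers in this sketch. The names source_index and offset_in_clump are made up here and are not part of the Matrix2D interface above.

```cpp
#include <cstddef>

// Index of the input cell that output index `out` is copied from
// when the axis is scaled by 2^pow: drop the low `pow` bits.
constexpr std::size_t source_index(std::size_t out, unsigned pow)
{
    return out >> pow;
}

// Position of `out` inside its clump: keep only the low `pow` bits.
constexpr std::size_t offset_in_clump(std::size_t out, unsigned pow)
{
    return out & ((std::size_t{1} << pow) - 1);
}
```

With pow = 2 (scale factor 4), output indices 0..3 map to input index 0, indices 4..7 to input index 1, and so on; for example output index 11 comes from input index 2 at offset 3 within its clump.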
Now, we want to set outpos to every legal position in the output matrix indexes. That's what I need. How to actually do that is another sub-problem and can be pushed off with top-down decomposition.
a bit more advanced
Maybe nested loops is the easiest way to get that done, but I won't put those directly into this code, pushing my nesting level even deeper. And looping 0..max is not the simplest thing to write in bare C++ without libraries, so that would just be distracting. And, if you're working with matrices, this is something you'll have a general need for, including (say) printing out the answer!
So here's the double-loop, put into its own code:
struct all_positions {
coord current {0,0};
coord end;
all_positions (coord end) : end{end} {}
bool next() {
if (++current.x < end.x) return true; // not reached the end yet
current.x = 0; // reset to the start of the row
if (++current.y < end.y) return true;
return false; // I don't have a valid position now.
}
};
This does not follow the iterator/collection API that you could use in a range-based for loop. For information on how to do that, see my article on Code Project or use the Ranges stuff in the C++20 standard library.
Given this "old fashioned" iteration helper, I can write the loop as:
all_positions scanner {result.dimensions()}; // starts at {0,0}
const auto& outpos= scanner.current;
do {
⋮
} while (scanner.next());
Because of the simple implementation, it starts at {0,0} and advancing it also tests at the same time, and it returns false when it can't advance any more. Thus, you have to declare it (gives the first cell), use it, then advance&test. That is, a test-at-the-end loop. A for loop in C++ checks the condition before each use, and advances at the end, using different functions. So, making it compatible with the for loop is more work, and surprisingly making it work with the ranged-for is not much more work. Separating out the test and advance the right way is the real work; the rest is just naming conventions.
As long as this is "custom", you can further modify it for your needs. For example, add a flag inside to tell you when the row changed, or that it's the first or last of a row, to make it handy for pretty-printing.
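Putting the pieces together, a complete version might look like the sketch below. Grid2D is a minimal stand-in invented for this illustration (row-major std::vector<int> storage, only the members the loop needs), not the Matrix2D class outlined earlier; the scanner is the same test-at-the-end helper.

```cpp
#include <cstddef>
#include <vector>

struct coord {
    std::size_t x; // column
    std::size_t y; // row
};

// Hypothetical minimal matrix: row-major storage, indexed by coord.
struct Grid2D {
    coord dims;
    std::vector<int> cells;
    explicit Grid2D(coord d) : dims(d), cells(d.x * d.y) {}
    int& operator[](coord p) { return cells[p.y * dims.x + p.x]; }
    const int& operator[](coord p) const { return cells[p.y * dims.x + p.x]; }
};

struct all_positions {
    coord current{0, 0};
    coord end;
    explicit all_positions(coord e) : end(e) {}
    bool next() {
        if (++current.x < end.x) return true; // still inside the row
        current.x = 0;                        // wrap to the start of the next row
        return ++current.y < end.y;
    }
};

Grid2D scale_pow2(const Grid2D& input, unsigned pow)
{
    Grid2D result(coord{input.dims.x << pow, input.dims.y << pow});
    if (result.cells.empty())
        return result; // the do/while loop needs at least one valid cell
    all_positions scanner(result.dims);
    do {
        const coord out = scanner.current;
        result[out] = input[coord{out.x >> pow, out.y >> pow}];
    } while (scanner.next());
    return result;
}
```

The empty check is there because a test-at-the-end loop always runs its body once, which would be an out-of-range access for a matrix with no cells.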
summary
You need a bunch of things working in addition to the little piece of code you actually want to write. Here, it's a usable Matrix class. Very often, it's prompting for input, opening files, handling command-line options, and that kind of stuff. It distracts from the real problem, so get that out of the way first.
Write your code (the real code you came for) in its own function, separate from any other stuff you also need in order to house it. Get it elsewhere if you can; it's not part of the lesson and just serves as a distraction. Worse, it may be "hard" in ways you are not prepared for (or to do well) as it's unrelated to the actual lesson being worked on.
Figure out the algorithm (flowchart, pseudocode, whatever) in a general way before translating that to legal syntax and API on the objects you are using. If you're just learning C++, don't get bogged down in the formal syntax when you are trying to figure out the logic. Until you naturally start to think in C++ when doing that kind of planning, don't force it. Use whiteboard doodles, tinkertoys, whatever works for you.
Get feedback and review of the idea, the logic of how to make it happen, from your peers and mentors if available, before you spend time coding. Why write up an idea that doesn't work? Fix the logic, not the code.
Finally, sketch the needed control flow, functions and data structures you need. Use pseudocode and placeholder notes.
Then fill in the placeholders and replace the pseudo with the legal syntax. You already planned it out, so now you can concentrate on learning the syntax and library details of the programming language. You can concentrate on "how do I express (some tiny detail) in C++" rather than keeping the entire program in your head. More generally, isolate a part that you will be learning; be learning/practicing one thing without worrying about the entire edifice.
To a large extent, some of those ideas translate to the code as well. Top-Down Design means you state things at a high level and then implement that elsewhere, separately. It makes code readable and maintainable, as well as easier to write in the first place. Functions should be written this way: the function explains how to do (what it does) as a list of details that are just one level of detail further down. Each of those steps then becomes a new function. Functions should be short and expressed at one semantic level of abstraction. Don't dive down into the most primitive details inside the function that explains the task as a set of simpler steps.
Good luck, and keep it up!

Vectorization of pcl_ros::transformPointCloud

I just noticed that the function pcl_ros::transformPointCloud is not vectorized. Below is the code snippet copied from here.
void transformPointCloud(
const Eigen::Matrix4f& transform,
const sensor_msgs::PointCloud2& in,
sensor_msgs::PointCloud2& out)
{
int x_idx = pcl::getFieldIndex(in, "x");
int y_idx = pcl::getFieldIndex(in, "y");
int z_idx = pcl::getFieldIndex(in, "z");
Eigen::Array4i xyz_offset(
in.fields[x_idx].offset,
in.fields[y_idx].offset,
in.fields[z_idx].offset, 0);
// most of the code is not shown here
for (size_t i = 0; i < in.width * in.height; ++i)
{
Eigen::Vector4f pt(*(float*)&in.data[xyz_offset[0]],
*(float*)&in.data[xyz_offset[1]],
*(float*)&in.data[xyz_offset[2]], 1);
Eigen::Vector4f pt_out;
pt_out = transform * pt;
// write the transformed coordinates back, then advance to the next point
memcpy(&out.data[xyz_offset[0]], &pt_out[0], sizeof(float));
memcpy(&out.data[xyz_offset[1]], &pt_out[1], sizeof(float));
memcpy(&out.data[xyz_offset[2]], &pt_out[2], sizeof(float));
xyz_offset += in.point_step;
}
}
The code above iterates over each point in the point cloud and multiplies it by the transformation.
I am wondering if this can be vectorized so as to minimize the elapsed time.
I am looking for suggestions to implement/incorporate the same. I am using ROS Indigo (PCL 1.7.1) on Ubuntu 14.04 LTS PC.
Assuming the x, y and z fields sit at byte offsets 0, 4 and 8 within each point and you don't care about the special-case handling of non-finite data, etc., you can simplify the inner loop to something like this:
void foo(char* data_out, Eigen::Index N, int out_step, const Eigen::Matrix4f& T, const char* data_in, int in_step)
{
for(Eigen::Index i=0; i<N; ++i)
{
Eigen::Vector3f::Map((float*)(data_out + i*out_step)).noalias()
= (T * Eigen::Vector3f::Map((const float*)(data_in + i*in_step)).homogeneous()).head<3>();
}
}
N would be in.width * in.height and out_step and in_step would be the corresponding point_step members. Minor possible improvement: You can copy T into a local variable so it does not need to be loaded from memory every time.
If point_step is a multiple of sizeof(float) you could also reduce this to a single assignment, using out_stride = out.point_step / sizeof(float), etc. However, this usually generates less efficient code than the version above (may change in future versions of Eigen).
void foo2(float* data_out, Eigen::Index N, int out_stride, const Eigen::Matrix4f& T, const float* data_in, int in_stride)
{
Eigen::Matrix3Xf::Map(data_out, 3, N, Eigen::OuterStride<>(out_stride)).noalias()
= (T *
Eigen::Matrix3Xf::Map(data_in, 3, N, Eigen::OuterStride<>(in_stride))
.colwise().homogeneous()
).topRows<3>();
}
Godbolt-Link
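Without Eigen, the same strided loop could be sketched in plain C++ roughly as follows. The function name transform_points and the row-major float[16] layout for the transform are assumptions made for this illustration (Eigen::Matrix4f is column-major by default), and every point is assumed to start with three consecutive floats x, y, z. Whether the compiler vectorizes this loop depends on the compiler and flags, so it is worth inspecting the generated code.

```cpp
#include <cstddef>
#include <cstring>

// Apply a row-major 4x4 transform T to n points stored at a fixed byte stride.
// in_step / out_step play the role of point_step in the message.
void transform_points(unsigned char* out, std::size_t out_step,
                      const float (&T)[16],
                      const unsigned char* in, std::size_t in_step,
                      std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        float p[3];
        // memcpy sidesteps alignment and aliasing problems of casting the byte buffer
        std::memcpy(p, in + i * in_step, sizeof p);

        float q[3];
        for (int r = 0; r < 3; ++r) // homogeneous coordinate w is taken as 1
            q[r] = T[4 * r + 0] * p[0] + T[4 * r + 1] * p[1]
                 + T[4 * r + 2] * p[2] + T[4 * r + 3];

        std::memcpy(out + i * out_step, q, sizeof q);
    }
}
```

Only the first 12 bytes of each output point are written, so any other fields of the point have to be copied separately.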

Eigen: Efficient way of referencing ArrayWrapper

I am interfacing some code with raw pointers, so I make extensive use of the Map class:
void foo(T* raw_pointer){
const int rows = ...;
const int cols = ...;
Map<Matrix<T, rows, cols>> mat(raw_pointer);
// DO some stuff with "mat"
}
Now I want to apply some cwise operations in foo, which I accomplish using .array(). The code works, however, it looks very messy due to all of the .array() calls strewn in the function. For instance, for the sake of argument, let's suppose that the function looked like this:
void foo(T* raw_pointer){
const int rows = ...;
const int cols = ...;
Map<Matrix<T, rows, cols>> mat(raw_pointer);
for (int i = 0 ; i < 1000 ; ++i)
... something = i * mat.row(1).array() * sin(mat.row(4).array()) + mat.col(1).array();
}
Part of the problem with this is that it is very unclear what the code is actually doing. It would be much nicer if I gave the variables names:
void foo(T* raw_pointer){
const int rows = ...;
const int cols = ...;
Map<Matrix<T, rows, cols>> mat(raw_pointer);
Matrix<T, 1, cols> thrust = mat.row(1);
Matrix<T, 1, cols> psi = mat.row(4);
Matrix<T, 1, cols> bias = mat.row(2);
for (int i = 0 ; i < 1000 ; ++i)
... something = i * thrust.array() * sin(psi.array()) + bias.array();
}
But it would be even nicer if I could directly get a reference to the ArrayWrappers so that we aren't making any copies. However, the only way I can figure out how to get that to work is by using auto:
void foo(T* raw_pointer){
const int rows = ...;
const int cols = ...;
Map<Matrix<T, rows, cols>> mat(raw_pointer);
auto thrust = mat.row(1).array();
auto psi = mat.row(4).array();
auto bias = mat.row(2).array();
for (int i = 0 ; i < 1000 ; ++i)
... something = i * thrust * sin(psi) + bias;
}
This code works, and upon testing appears to reference the entries in the pointer (as opposed to making copies like in the previous snippet). However, I am concerned about its efficiency since the Eigen documentation explicitly suggests NOT doing this. So could somebody please explain what the preferred way to define the types for the variables is in such a circumstance?
It seems to me like I should be using a Ref here, but I can't figure out how to get that to work. Specifically, I have tried replacing auto with
Eigen::Ref<Eigen::Array<T, 1, cols>>
and
Eigen::Ref<Eigen::ArrayWrapper<Eigen::Matrix<T, 1, cols>>>
but the compiler doesn't like either of those.
To avoid having to write array() every time you use the Map<Eigen::Matrix... you can use a Map<Eigen::Array... instead/in addition. This will use the default element-wise operators instead of the matrix operators. To use a matrix operator instead, you can use map.matrix() (similar to what you have in your post mat.array()).
auto thrust = [](auto&&mat){return mat.row(1).array();};
auto psi = [](auto&&mat){return mat.row(4).array();};
auto bias = [](auto&&mat){return mat.row(2).array();};
for (int i = 0 ; i < 1000 ; ++i)
... something = i * thrust(mat) * sin(psi(mat)) + bias(mat);
This way the expressions have names, and the array wrappers don't persist.

Initializing std::vector with a repeating pattern

I'm working with OpenGL at the moment, creating a 'texture cache' which handles loading images and buffering them with OpenGL. In the event an image file can't be loaded it needs to fall back to a default texture which I've hard-coded in the constructor.
What I basically need to do is create a texture of a uniform colour. This is not too difficult, it's just an array of size Pixels * Colour Channels.
I am currently using a std::vector to hold the initial data before I upload it to OpenGL. The problem I'm having is that I can't find any information on the best way to initialize a vector with a repeating pattern.
The first way that occurred to me was to use a loop.
std::vector<unsigned char> blue_texture;
for (int iii = 0; iii < width * height; iii++)
{
blue_texture.push_back(0);
blue_texture.push_back(0);
blue_texture.push_back(255);
blue_texture.push_back(255);
}
However, this seems inefficient since the vector will have to resize itself numerous times. Even if I reserve space first and perform the loop it's still not efficient since the contents will be zeroed before the loop which means two writes for each unsigned char.
Currently I'm using the following method:
struct colour {unsigned char r; unsigned char g; unsigned char b; unsigned char a;};
colour blue = {0, 0, 255, 255};
std::vector<colour> texture((width * height), blue);
I then extract the data using:
reinterpret_cast<unsigned char*>(texture.data());
Is there a better way than this? I'm new to C/C++ and I'll be honest, casting pointers scares me.
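The cast in this approach relies on colour being exactly four packed bytes. A sketch that makes this assumption explicit with compile-time checks could look like this; the helper names make_uniform_texture and as_bytes are made up for the illustration.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

struct colour { unsigned char r, g, b, a; };

// Fail to compile rather than upload garbage if the layout assumption breaks.
static_assert(sizeof(colour) == 4, "colour must be four packed bytes");
static_assert(std::is_trivially_copyable<colour>::value,
              "colour must be plain data");

std::vector<colour> make_uniform_texture(std::size_t width, std::size_t height,
                                         colour fill)
{
    return std::vector<colour>(width * height, fill); // every pixel written once
}

// View the pixels as raw bytes, e.g. for an upload call taking unsigned char*.
const unsigned char* as_bytes(const std::vector<colour>& texture)
{
    return reinterpret_cast<const unsigned char*>(texture.data());
}
```

Reading an object through an unsigned char pointer is allowed by the language's aliasing rules, so the remaining risk is padding, which the static_assert covers.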
Your loop solution is the right way to go in my opinion. To make it efficient by removing repeated realloc calls, use blue_texture.reserve(width * height * 4).
The reserve call will increase the allocation (the capacity) to that size without zero-filling it. (Note that the operating system may still zero it, if it pulls the memory from mmap for example.) It does not change the size of the vector, so push_back and friends still work the same way.
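A minimal sketch of the loop with reserve, assuming four channels per pixel as in the question (the function name make_blue_texture is made up here):

```cpp
#include <cstddef>
#include <vector>

std::vector<unsigned char> make_blue_texture(std::size_t width, std::size_t height)
{
    std::vector<unsigned char> texture;
    texture.reserve(width * height * 4); // one allocation; size() is still 0 here
    for (std::size_t i = 0; i < width * height; ++i)
    {
        texture.push_back(0);   // red
        texture.push_back(0);   // green
        texture.push_back(255); // blue
        texture.push_back(255); // alpha
    }
    return texture;
}
```

Since the capacity is already large enough, none of the push_back calls reallocates, and each byte is written exactly once by the program.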
You can use reserve to pre-allocate the vector; this will avoid the reallocations. You can also define a small sequence (probably a C-style array):
unsigned char const init[] = { 0, 0, 255, 255 };
and loop inserting that into the end of the vector:
for ( int i = 0; i < pixelCount; ++ i ) {
v.insert( v.end(), std::begin( init ), std::end( init ) );
}
This is only marginally more efficient than using the four push_back calls in the loop, but it is more succinct and perhaps makes it clearer what you're doing, albeit only marginally. The big advantage might be being able to give a name to the initialization sequence (e.g. something like defaultBackground).
The most efficient way is the way that does least work.
Unfortunately, push_back(), insert() and the like have to maintain the size() of the vector as they work, which are redundant operations when performed in a tight loop.
Therefore the most efficient way is allocate the memory once and then copy data directly into it without maintaining any other variables.
It's done like this:
#include <algorithm> // std::copy
#include <array>
#include <cstddef>   // size_t
#include <cstdint>   // uint8_t
#include <vector>
using colour_fill = std::array<uint8_t, 4>;
using pixel_map = std::vector<uint8_t>;
pixel_map make_colour_texture(size_t width, size_t height, colour_fill colour)
{
// allocate the buffer
std::vector<uint8_t> pixels(width * height * sizeof(colour_fill));
auto current = pixels.data();
auto last = current + pixels.size();
while (current != last) {
current = std::copy(begin(colour), end(colour), current);
}
return pixels;
}
auto main() -> int
{
colour_fill blue { 0, 0, 255, 255 };
auto blue_bits = make_colour_texture(100, 100, blue);
return 0;
}
I would reserve the entire size that you need and then use the insert function to repeatedly add the pattern into the vector.
std::array<unsigned char, 4> pattern{0, 0, 255, 255};
std::vector<unsigned char> blue_texture;
blue_texture.reserve(width * height * 4);
for (int i = 0; i < (width * height); ++i)
{
blue_texture.insert(blue_texture.end(), pattern.begin(), pattern.end());
}
I made this template function which will modify its input container to contain count times what it already contains.
#include <algorithm> // std::copy
#include <cstddef>   // std::size_t
#include <iostream>
#include <iterator>  // std::next, std::advance
#include <vector>
template<typename Container>
void repeat_pattern(Container& data, std::size_t count) {
auto pattern_size = data.size();
if(count == 0 or pattern_size == 0) {
return;
}
data.resize(pattern_size * count);
const auto pbeg = data.begin();
const auto pend = std::next(pbeg, pattern_size);
auto it = std::next(data.begin(), pattern_size);
for(std::size_t k = 1; k < count; ++k) {
std::copy(pbeg, pend, it);
std::advance(it, pattern_size);
}
}
template<typename Container>
void show(const Container& data) {
for(const auto & item : data) {
std::cout << item << " ";
}
std::cout << std::endl;
}
int main() {
std::vector<int> v{1, 2, 3, 4};
repeat_pattern(v, 3);
// should show three repetitions of 1, 2, 3, 4
show(v);
}
Output (compiled as g++ example.cpp -std=c++14 -Wall -Wextra):
1 2 3 4 1 2 3 4 1 2 3 4