Simultaneously multiply all struct-elements with a scalar - c++

I have a struct that represents a vector. This vector consists of two one-byte integers. I use them to keep values from 0 to 255.
typedef unsigned char uint8_T;

struct Vector
{
    uint8_T x;
    uint8_T y;
};
Now, the main use case in my program is to multiply both elements of the vector by a 32-bit float value:
typedef float real32_T;

// (assumes Vector also has a constructor taking two uint8_T values)
Vector Vector::operator * ( const real32_T f ) const
{
    return Vector( (uint8_T)(x * f), (uint8_T)(y * f) );
}
This needs to be performed very often. Is there a way that these two multiplications can be performed simultaneously? Maybe by vectorization, SSE or similar? Or is the Visual Studio compiler already doing this simultaneously?
Another use case is to interpolate between two Vectors.
Vector Vector::interpolate(const Vector& rhs, real32_T z) const
{
    return Vector(
        (uint8_T)(x + z * (rhs.x - x)),
        (uint8_T)(y + z * (rhs.y - y))
    );
}
This already uses an optimized interpolation approach (https://stackoverflow.com/a/4353537/871495).
But again the values of the vectors are multiplied by the same scalar value.
Is there a possibility to improve the performance of these operations?
Thanks
(I am using Visual Studio 2010 with a 64-bit compiler)

In my experience, Visual Studio (especially an older version like VS2010) does not do a lot of vectorization on its own. They have improved it in the newer versions, so if you can, you might see if a change of compiler speeds up your code.
Depending on the code that uses these functions and the optimization the compiler does, it may not even be the calculations that slow down your program. Function calls and cache misses may hurt a lot more.
You could try the following:
If not already done, define the functions in the header file, so the compiler can inline them
If you use these functions in a tight loop, try doing the calculations 'by hand' without any function calls (temporarily expose the variables) and see if it makes a speed difference
If you have a lot of vectors, look at how they are laid out in memory. Store them contiguously to minimize cache misses.
For SSE to work really well, you'd have to work with 4 values at once - so multiply 2 vectors with 2 floats. In a loop, use a step of 2 and write a static function that calculates 2 vectors at once using SSE instructions (see the sketch after this list). Because your vectors are not aligned (and hardly ever will be with 8-bit variables), the code could even run slower than what you have now, but it's worth a try.
If applicable, and if you don't depend on the truncation that occurs with your cast from float to uint8_T (e.g. if your floats are in range [0,1]), try using float everywhere. This may allow the compiler to do far better optimization.
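For reference, here is a minimal, untested sketch of the "two vectors at once" idea using SSE4.1 intrinsics. The helper name mulTwoVectors is purely illustrative; it assumes the two Vector structs are stored back-to-back in memory, and it truncates on the way back to uint8_T just like the cast in the question. Whether it actually beats the scalar version needs measuring.
#include <smmintrin.h>  // SSE4.1 intrinsics
#include <cstring>

// Hypothetical helper: multiplies two adjacent Vectors (bytes x0,y0,x1,y1)
// by the scalars f0 and f1 in a single SSE pass.
static void mulTwoVectors(const Vector* in, Vector* out, real32_T f0, real32_T f1)
{
    unsigned int packed;
    std::memcpy(&packed, in, 4);                                // x0 y0 x1 y1
    __m128i bytes = _mm_cvtsi32_si128((int)packed);
    __m128  vals  = _mm_cvtepi32_ps(_mm_cvtepu8_epi32(bytes));  // widen to 4 floats
    __m128  scale = _mm_set_ps(f1, f1, f0, f0);                 // lane order: f0 f0 f1 f1
    __m128i res   = _mm_cvttps_epi32(_mm_mul_ps(vals, scale));  // truncate, like the cast
    res = _mm_packus_epi16(_mm_packus_epi32(res, res), res);    // narrow back to bytes
    packed = (unsigned int)_mm_cvtsi128_si32(res);
    std::memcpy(out, &packed, 4);
}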

You haven't shown the full algorithm, but conversions between integer and floating-point numbers are very slow operations. Eliminating those conversions and using only one type (preferably integers, if possible) can greatly improve performance.
Alternatively, you can use lrint() to do the conversion as explained here.
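For illustration only, the multiply operator from the question rewritten around lrint() might look like the sketch below. Note that lrint() rounds according to the current rounding mode instead of truncating, and it may not be available in older MSVC runtimes such as the one shipped with VS2010.
#include <cmath>

// Sketch: same operator as in the question, but with lrint() instead of a cast.
Vector Vector::operator * ( const real32_T f ) const
{
    return Vector( (uint8_T)std::lrint(x * f), (uint8_T)std::lrint(y * f) );
}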

Related

Can OpenMP's SIMD directive vectorize indexing operations?

Say I have an MxN matrix (SIG) and a list of Nx1 fractional indices (idxt). Each fractional index in idxt uniquely corresponds to the column of SIG at the same position. I would like to index into SIG at the appropriate value using the indices stored in idxt, take that value and save it in another Nx1 vector. Since the indices in idxt are fractional, I need to interpolate in SIG. Here is an implementation that uses linear interpolation:
void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
               const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG,
               double& bfVal) {
    Eigen::VectorXd bfPTVec(idxt.size());
    #pragma omp simd
    for (int j = 0; j < idxt.size(); j++) {
        int vIDX = static_cast<int>(idxt(j));
        double interp1 = vIDX + 1 - idxt(j);
        double interp2 = idxt(j) - vIDX;
        bfPTVec(j) = (SIG(vIDX,j)*interp1 + SIG(vIDX+1,j)*interp2);
    }
    bfVal = ((idxt.array() > 0.0).select(bfPTVec,0.0)).sum();
}
I suspect there is a better way to implement the body of the loop here that would help the compiler better exploit SIMD operations. For example, as I understand it, forcing the compiler to cast between types, both explicitly as the first line does and implicitly as some of the mathematical operations do is not a vectorizable operation.
Additionally, by making the access to SIG dependent on values in idxt which are calculated at runtime I'm not clear if the type of memory read-write I'm performing here is vectorizable, or how it could be vectorized. Looking at the big picture description of my problem where each idxt corresponds to the same "position" column as SIG, I get a sense that it should be a vectorizable operation, but I'm not sure how to translate that into good code.
Clarification
Thanks to the comments, I realized I hadn't specified that certain values that I don't want contributing to the final summation in idxt are set to zero when idxt is initialized outside of this method. Hence the last line in the example given above.
Theoretically, it should be possible, assuming the processor supports this operation. However, in practice, this is not the case, for many reasons.
First of all, mainstream x86-64 processors supporting the AVX2 (or AVX-512) instruction set do have instructions for that: gather SIMD instructions. Unfortunately, they are quite limited: you can only fetch 32-bit/64-bit values from memory based on 32-bit/64-bit indices. Moreover, these instructions are not very efficiently implemented on mainstream processors yet. Indeed, a gather fetches every item separately, which is not faster than scalar code, but it can still be useful if the rest of the code is vectorized, since reading many scalar values to fill a SIMD register manually tends to be a bit less efficient (although it was surprisingly faster on old processors due to a quite inefficient early implementation of gather instructions). Note that if the SIG matrix is big, cache misses will significantly slow down the code.
Additionally, AVX2 is not enabled by default on mainstream processors because not all x86-64 processors support it. Thus, you need to enable it (e.g. using -mavx2) so that compilers can vectorize the loop efficiently. Unfortunately, this is not enough. Indeed, most compilers currently fail to automatically detect when this instruction can/should be used. Even if they could, the fact that IEEE-754 floating-point operations are not associative and that values can be infinity or NaN generally does not help them generate efficient code (although it should be fine here). Note that you can tell your compiler to assume the operations are associative and that you use only finite/basic real numbers (e.g. using -ffast-math, which can be unsafe). The same thing applies to Eigen types/operators if compilers fail to completely inline all the functions (which is the case for ICC).
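For example, an illustrative GCC command line combining the flags mentioned above (calc.cpp is a placeholder file name, not from the original post) would look roughly like:
g++ -O3 -mavx2 -fopenmp-simd -ffast-math calc.cpp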
To speed up the code, you can try to change the type of the SIG variable to a matrix reference containing int32_t items. Another possible optimization is to split the loop into small fixed-size chunks (e.g. 32 items) and compute the indirections in a separate inner loop, so compilers can vectorize at least some of the loops. Some compilers like Clang are able to do that automatically for you: they generate a fast SIMD implementation for part of the loop and do the indirections using scalar instructions. If this is not enough (which appears to be the case so far), then you certainly need to vectorize the loop yourself using SIMD intrinsics (or possibly use a SIMD library that does that for you).
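As an illustration of that chunk-splitting idea, here is a rough, untested sketch; the chunk size of 32 and the temporary gather buffers are arbitrary choices, and std::min requires <algorithm>:
constexpr int CHUNK = 32;                        // arbitrary chunk size
Eigen::VectorXd bfPTVec(idxt.size());
for (Eigen::Index j0 = 0; j0 < idxt.size(); j0 += CHUNK) {
    const int n = (int)std::min<Eigen::Index>(CHUNK, idxt.size() - j0);
    short lo[CHUNK], hi[CHUNK];
    // Scalar pass: resolve the data-dependent accesses (the "gathers").
    for (int k = 0; k < n; k++) {
        const int vIDX = static_cast<int>(idxt(j0 + k));
        lo[k] = SIG(vIDX,     j0 + k);
        hi[k] = SIG(vIDX + 1, j0 + k);
    }
    // Dense pass: straight-line arithmetic that compilers can vectorize.
    #pragma omp simd
    for (int k = 0; k < n; k++) {
        const double frac = idxt(j0 + k) - static_cast<int>(idxt(j0 + k));
        bfPTVec(j0 + k) = lo[k] * (1.0 - frac) + hi[k] * frac;
    }
}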
Probably not, but I would expect a manually vectorized version to be faster.
Below is an example of that inner loop body, untested. It doesn't use AVX, only SSE up to 4.1, and should be compatible with these Eigen matrices you have there.
The pIndex input pointer should point to the j-th element of your idxt vector, and pSignsColumn should point to the start of the j-th column of the SIG matrix. It assumes your SIG matrix is column-major. That's normally the default memory layout in Eigen, but it's possible to override it with template arguments, and probably with macros as well.
#include <immintrin.h>  // intrinsics header; only SSE4.1 and below are used here
#include <cstdint>

inline double computePoint( const double* pIndex, const int16_t* pSignsColumn )
{
    // Load the index value into both lanes of the vector
    __m128d idx = _mm_loaddup_pd( pIndex );
    // Convert into int32 with truncation; this assumes the number there ain't negative.
    const int iFloor = _mm_cvttsd_si32( idx );
    // Compute fractional part
    idx = _mm_sub_pd( idx, _mm_floor_pd( idx ) );
    // Compute interpolation coefficients, they are [ 1.0 - idx, idx ]
    idx = _mm_addsub_pd( _mm_set_sd( 1.0 ), idx );
    // Load two int16_t values from sequential addresses
    const __m128i signsInt = _mm_loadu_si32( pSignsColumn + iFloor );
    // Upcast them to int32, then to fp64
    const __m128d signs = _mm_cvtepi32_pd( _mm_cvtepi16_epi32( signsInt ) );
    // Compute the result
    __m128d res = _mm_mul_pd( idx, signs );
    res = _mm_add_sd( res, _mm_unpackhi_pd( res, res ) );
    // The above 2 lines (3 instructions) can be replaced with the following one:
    // const __m128d res = _mm_dp_pd( idx, signs, 0b110001 );
    // It may or may not be better, the dppd instruction is not particularly fast.
    return _mm_cvtsd_f64( res );
}
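For completeness, the surrounding loop might look roughly like this (untested sketch; it assumes SIG is column-major as described above, that short and int16_t are the same type on your platform, and it reproduces the idxt > 0 filtering of the original code):
double bfVal = 0.0;
for (int j = 0; j < idxt.size(); j++)
{
    // Skip entries flagged with a non-positive index, as the original select() does.
    if (idxt(j) > 0.0)
        bfVal += computePoint( &idxt(j), &SIG(0, j) );
}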

Dealing with a contiguous vector of fixed-size matrices for both storage layouts in Eigen

An external library gives me a raw pointer of doubles that I want to map to an Eigen type. The raw array is logically a big ordered collection of small dense fixed-size matrices, all of the same size. The main issue is that the small dense matrices may be in row-major or column-major ordering and I want to accommodate them both.
My current approach is as follows. Note that all the entries of a small fixed-size block (in the array of blocks) need to be contiguous in memory.
template<int bs, class Mattype>
void block_operation(double *const vals, const int numblocks)
{
    Eigen::Map<Mattype> mappedvals(vals,
            Mattype::IsRowMajor ? numblocks*bs : bs,
            Mattype::IsRowMajor ? bs : numblocks*bs
        );
    for(int i = 0; i < numblocks; i++)
        if(Mattype::IsRowMajor)
            mappedvals.template block<bs,bs>(i*bs,0) = block_operation_rowmajor(mappedvals);
        else
            mappedvals.template block<bs,bs>(0,i*bs) = block_operation_colmajor(mappedvals);
}
The calling function first figures out the Mattype (out of 2 options) and then calls the above function with the correct template parameter.
Thus all my algorithms need to be written twice and my code is interspersed with these layout checks. Is there a way to do this in a layout-agnostic way? Keep in mind that this code needs to be as fast as possible.
Ideally, I would Map the data just once and use it for all the operations needed. However, the only solution I could come up with was invoking the Map constructor once for every small block, whenever I need to access the block.
template<int bs, StorageOptions layout>
inline Map<Matrix<double,bs,bs,layout>> extractBlock(double *const vals,
                                                     const int bindex)
{
    return Map<Matrix<double,bs,bs,layout>>(vals + bindex*bs*bs);
}
Would this function be optimized away to nothing (by a modern compiler like GCC 7.3 or Intel 2017 under -std=c++14 -O3), or would I be paying a small penalty every time I invoke this function (once for each block, and there are a LOT of small blocks)? Is there a better way to do this?
Your extractBlock is fine; a simpler but somewhat uglier solution is to use a reinterpret_cast at the start of block_operation:
using BlockType = Matrix<double,bs,bs,layout|DontAlign>;
BlockType* blocks = reinterpret_cast<BlockType*>(vals);
for(int i = 0; i < numblocks; i++)
    blocks[i] = ...;
This will work for fixed-size matrices only. Also note the DontAlign option, which is important unless you can guarantee that vals is aligned on 16 or even 32 bytes, depending on the presence of AVX and on bs... so just use DontAlign!
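A self-contained sketch of that reinterpret_cast variant is shown below; the per-block operation used here, transposeInPlace, is just a placeholder for whatever block_operation_rowmajor/colmajor actually do:
#include <Eigen/Dense>
using namespace Eigen;

template<int bs, StorageOptions layout>
void block_operation(double *const vals, const int numblocks)
{
    // DontAlign because vals comes from an external library and may not be
    // aligned for vectorized loads.
    typedef Matrix<double, bs, bs, layout | DontAlign> BlockType;
    BlockType* blocks = reinterpret_cast<BlockType*>(vals);
    for(int i = 0; i < numblocks; i++)
        blocks[i].transposeInPlace();   // placeholder for the real block operation
}
A call such as block_operation<4, ColMajor>(vals, numblocks); would then instantiate it for 4x4 column-major blocks.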

SIMD optimisation for cross-pattern access

I'm trying to write a Monte Carlo simulation of the Ising model, and I was wondering if it was possible to use SIMD optimisations for accessing data in a cross pattern.
I basically want to know if there's any way of speeding up this function.
// up/down/left/right stencil accumulation
float lattice::compute_point_energy(int row, int col) {
    int accumulator = 0;
    accumulator += get(row ? row-1 : size_-1, col);
    accumulator += get((row+1) % size_, col);
    accumulator += get(row, col ? col-1 : size_-1);
    accumulator += get(row, (col+1) % size_);
    return -get(row, col) * (accumulator * J_ + H_);
}
get(i, j) is a method that accesses a flat std::vector of shorts. I see that there might be a few problems: the access has lots of ternary logic going on (for periodic boundary conditions), and none of the vector elements are adjacent. Is it possible to make SIMD optimisations for this chunk, or should I keep digging? Re-implementing the adjacency matrix and/or using a different container (e.g. an array, or a vector of a different type) are options.
SIMD is the last thing you'll want to try with this function.
I think you're trying to use an up/down/left/right 4-stencil for your computation. If so, your code should have a comment noting this.
You're losing a lot of speed in this function because of the potential for branching at your ternary operators and because modulus is relatively slow.
You'd do well to surround the two-dimensional space you're operating over with a ring of cells set to values appropriate for handling edge effects. This allows you to eliminate checks for edge effects.
For accessing your stencil, I find it often works to use something like the following:
const int width  = 10;
const int height = 10;
const int offset[4] = {-1, 1, -width, width};

double accumulator = 0;
for(int i = 0; i < 4; i++)
    accumulator += get(current_loc + offset[i]);
Notice that the mini-array has precalculated offsets to the neighbouring cells in your domain. A good compiler will likely unroll the foregoing loop.
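Combining the halo ring with the offset table, the stencil function might reduce to something like the sketch below. This is illustrative only: padded_ is an assumed flat std::vector<short> of size (size_+2)*(size_+2), and the halo cells must be refreshed with the wrapped edge values whenever the lattice changes.
float lattice::compute_point_energy(int row, int col) {
    const int W = size_ + 2;                       // padded width, including the halo
    const int loc = (row + 1) * W + (col + 1);     // position inside the padded grid
    const int offset[4] = { -1, 1, -W, W };
    int accumulator = 0;
    for (int i = 0; i < 4; i++)
        accumulator += padded_[loc + offset[i]];   // no ternaries, no modulus
    return -padded_[loc] * (accumulator * J_ + H_);
}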
Once you've done all this, appropriate choice of optimization flags may lead to automatic vectorization.
As it is, the branching and mods in your code are likely preventing auto-vectorization. You can check this by enabling appropriate flags. For the Intel C++ Compiler (icc), you'll want:
-qopt-report=5 -qopt-report-phase:vec
For GCC you'll want (if I recall correctly):
-fopt-info-vec -fopt-info-missed

Filling unordered_set is too slow

We have a given 3D mesh and we are trying to eliminate identical vertices. For this we are using a self-defined struct containing the coordinates of a vertex and the corresponding normal.
struct vertice
{
    float p1, p2, p3, n1, n2, n3;

    bool operator == (const vertice& vert) const
    {
        return (p1 == vert.p1 && p2 == vert.p2 && p3 == vert.p3);
    }
};
After filling the vertex with data, it is added to an unordered_set to remove the duplicates.
struct hashVertice
{
    size_t operator () (const vertice& vert) const
    {
        return (7*vert.p1 + 13*vert.p2 + 11*vert.p3);
    }
};

std::unordered_set<vertice, hashVertice> verticesSet;

vertice vert;
while(i < (scene->mMeshes[0]->mNumVertices)) {
    vert.p1 = (float)scene->mMeshes[0]->mVertices[i].x;
    vert.p2 = (float)scene->mMeshes[0]->mVertices[i].y;
    vert.p3 = (float)scene->mMeshes[0]->mVertices[i].z;
    vert.n1 = (float)scene->mMeshes[0]->mNormals[i].x;
    vert.n2 = (float)scene->mMeshes[0]->mNormals[i].y;
    vert.n3 = (float)scene->mMeshes[0]->mNormals[i].z;

    verticesSet.insert(vert);
    i = i+1;
}
We discovered that it is too slow for data amounts like 3,000,000 vertices. Even after 15 minutes of running, the program wasn't finished. Is there a bottleneck we don't see, or is another data structure better for such a task?
What happens if you just remove verticesSet.insert(vert); from the loop?
If it speeds-up dramatically (as I expect it would), your bottleneck is in the guts of the std::unordered_set, which is a hash-table, and the main potential performance problem with hash tables is when there are excessive hash collisions.
In your current implementation, if p1, p2 and p3 are small, the number of distinct hash codes will be small (since you "collapse" float to integer) and there will be lots of collisions.
If the above assumptions turn out to be true, I'd try to implement the hash function differently (e.g. multiply with much larger coefficients).
Other than that, profile your code, as others have already suggested.
Hashing floating point can be tricky. In particular, your hash routine calculates the hash as a floating point value, then converts it to an unsigned integral type. This has serious problems if the vertices can be small: if all of the vertices are in the range [0...1.0), for example, your hash function will never return anything greater than 7 + 13 + 11 = 31. Converted to an unsigned integer, that means there will be at most a few dozen distinct hash codes.
The usual way to hash floating point is to hash the binary image, checking for the special cases first. (0.0 and -0.0 have different binary images, but must hash the same. And it's an open question what you do with NaNs.) For float this is particularly simple, since it usually has the same size as int, and you can reinterpret_cast:
size_t
hash( float f )
{
    assert( /* not a NaN */ );
    return f == 0.0 ? 0 : reinterpret_cast<unsigned&>( f );
}
I know, formally, this is undefined behavior. But if float and int have the same size, and unsigned has no trapping representations (the case on most general purpose machines today), then a compiler which gets this wrong is being intentionally obtuse.
You then use any combining algorithm to merge the three results; the one you use is as good as any other (in this case, at least; it's not a good generic algorithm).
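As a concrete, untested sketch of that combination: the bit-image hash from above done with memcpy (to stay within defined behaviour), plus a boost::hash_combine-style mixing step, packaged as a drop-in replacement for the question's hashVertice. The mixing constant is the usual golden-ratio one; any reasonable mixer would do.
#include <cassert>
#include <cstddef>
#include <cstring>

// Bit-image hash of a single float.
inline size_t hashFloat( float f )
{
    assert( f == f );              // no NaNs expected here
    if ( f == 0.0f ) return 0;     // +0.0 and -0.0 must hash the same
    unsigned u;
    std::memcpy( &u, &f, sizeof u );
    return u;
}

// hash_combine-style mixer (illustrative).
inline size_t combine( size_t seed, size_t h )
{
    return seed ^ ( h + 0x9e3779b9 + ( seed << 6 ) + ( seed >> 2 ) );
}

struct hashVertice
{
    size_t operator () ( const vertice& vert ) const
    {
        size_t h = hashFloat( vert.p1 );
        h = combine( h, hashFloat( vert.p2 ) );
        h = combine( h, hashFloat( vert.p3 ) );
        return h;
    }
};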
I might add that while some of the comments insist on profiling (and this is generally good advice), if you're taking 15 minutes for 3 million values, the problem can really only be a poor hash function, which results in lots of collisions. Nothing else will cause that bad of performance. And unless you're familiar with the internal implementation of std::unordered_set, the usual profiler output will probably not give you much information.
On the other hand, std::unordered_set does have functions like bucket_count and bucket_size, which allow analysing the quality of the hash function. In your case, if you cannot create an unordered_set with 3 million entries, your first step should be to create a much smaller one, and use these functions to evaluate the quality of your hash code.
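A small sketch of such a check (illustrative only): it prints the bucket statistics mentioned above, so you can see whether a few buckets hold most of the elements. A good hash gives mostly small, similar bucket sizes; a handful of huge buckets means lots of collisions.
#include <algorithm>
#include <cstdio>
#include <unordered_set>

void reportBuckets( const std::unordered_set<vertice, hashVertice>& s )
{
    size_t maxSize = 0;
    for ( size_t b = 0; b < s.bucket_count(); ++b )
        maxSize = std::max( maxSize, s.bucket_size( b ) );
    std::printf( "buckets: %zu  elements: %zu  largest bucket: %zu\n",
                 s.bucket_count(), s.size(), maxSize );
}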
If there is a bottleneck, you are definitely not seeing it, because you don't include any kind of timing measures.
Measure the timing of your algorithm, either with a profiler or just manually. This will let you find the bottleneck - if there is one.
This is the correct way to proceed. Expecting yourself, or alternatively, StackOverflow users to spot bottlenecks by eye inspection instead of actually measuring time in your program is, from my experience, the most common cause of failed attempts at optimization.

What is the better Matrix4x4 class design c++ newbie

What would be better to use as a way to store matrix values?
float m1,m2,m3 ... ,m16
or
float[4][4].
I first tried float[16], but when I'm debugging and testing, VS won't show what is inside the array :( I could implement a cout and try to read the answer from a console test application.
Then I tried using float m1, m2, m3, etc.; under testing and debugging the values could be read in VS, so it seemed easier to work with.
My question, because I'm fairly new to C++, is: what is the better design?
I find the float m1, m2, ..., m16 easier to work with when debugging.
I would also love it if someone with experience or benchmark data could say which has better performance. My gut says it shouldn't really matter, because the matrix data should be laid out the same in memory, right?
Edit:
Some more info: it's a column-major matrix.
As far as I know, I only need a 4x4 matrix for the view transformation pipeline.
So nothing bigger, and so I have some constant values.
I'm busy writing a simple software renderer as a way to learn more C++, get some more experience, and learn/improve my linear algebra skills. I will probably only go up to per-fragment shading and some simple lighting models, and from what I have seen so far, a 4x4 matrix is the biggest I will need for rendering.
Edit2:
I found out why I couldn't read the array data: it was a float pointer I used, and the debugging window only showed the pointer value. I did discover a way to see the array values in the watch window, where you have to do "pointer, n", where n = the number of elements you want to see.
Thanks to everybody that answered; I will use the Vector4 m[4] answer for now.
You should consider a Vector4 with float [4] members, and a Matrix4 with Vector4 [4] members. With operator [], you have two useful classes, and maintain the ability to access elements with: [i][j] - in most cases, the element data will be contiguous, provided you don't use virtual methods.
You can also benefit from vector (SIMD) instructions this way, e.g., in Vector4
union alignas(16) { __m128 _v; float _s[4]; }; // members
inline float & operator [] (int i) { return _s[i]; }
inline const float & operator [] (int i) const { return _s[i]; }
and in Matrix4
Vector4 _m[4]; // members
inline Vector4 & operator [] (int i) { return _m[i]; }
inline const Vector4 & operator [] (int i) const { return _m[i]; }
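Put together, a minimal sketch of those two classes could look like this (constructors, arithmetic operators and error handling are left out; the member names mirror the snippets above):
#include <xmmintrin.h>  // SSE, for __m128

struct Vector4
{
    union alignas(16) { __m128 _v; float _s[4]; };  // members

    inline float & operator [] (int i)             { return _s[i]; }
    inline const float & operator [] (int i) const { return _s[i]; }
};

struct Matrix4
{
    Vector4 _m[4];   // four columns, matching the column-major layout in the question

    inline Vector4 & operator [] (int i)             { return _m[i]; }
    inline const Vector4 & operator [] (int i) const { return _m[i]; }
};
With this layout, m[2][3] reads element 3 of column 2, and in practice &m[0][0] gives you 16 contiguous floats that can be handed to an API expecting a float*.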
The float m1, m2 .. m16; becomes very awkward to deal with when it comes to using loops to iterate through things. Using arrays of some sort is much easier. And, most likely, the compiler will generate AT LEAST as efficient code when you use loops as if you "hand-code", unless you actually write inline assembler or use SSE intrinsics.
The 16 float solution is fine as long as the code doesn't evolve (it is a hassle to maintain and it is not really readable)
The float[4][4] is a way better design (in terms of size parametrization) but you have to understand the notion of pointers.
I would use an array of 16 floats like float m[16]; with the sole reason being that it is very easy to pass it to a library like OpenGL, using the Matrix4fv-suffix functions.
A 2D array like float m[4][4]; should also be laid out in memory identically to float m[16] (see May I treat a 2D array as a contiguous 1D array?), and using it would be more convenient as far as having [row][col] (or [col][row], I am not sure which is correct in terms of OpenGL) indexing (compare m[1][1] vs m[5]).
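To make the equivalence concrete, a tiny illustration (the linked question discusses whether treating the 2D array as flat is strictly allowed, but mainstream compilers lay it out exactly this way):
#include <cstdio>

int main()
{
    float m[4][4] = {};
    m[1][1] = 42.0f;

    // &m[0][0] is what you would hand to the Matrix4fv-style functions.
    const float* flat = &m[0][0];
    std::printf("%f %f\n", m[1][1], flat[1 * 4 + 1]);   // same element, printed twice
    return 0;
}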
Using separate variables for matrix elements may prove to be problematic. What are you planning to do when dealing with big matrices like 100x100?
Of course you need to use some array-like structure, and I strongly recommend you to at least use arrays.