Fast dot product using SSE/AVX intrinsics - c++

I am looking for a fast way to calculate the dot product of vectors with 3 or 4 components. I tried several things, but most examples online use an array of floats while our data structure is different.
We use structs which are 16 byte aligned. Code excerpt (simplified):
struct float3 {
float x, y, z, w; // 4th component unused here
};
struct float4 {
float x, y, z, w;
};
In previous tests (using the SSE4 dot product intrinsic or FMA) I could not get a speedup compared to the following regular C++ code.
float dot(const float3 a, const float3 b) {
return a.x*b.x + a.y*b.y + a.z*b.z;
}
Tests were done with gcc and clang on Intel Ivy Bridge / Haswell. It seems that the time spent loading the data into the SIMD registers and pulling it out again kills all the benefits.
I would appreciate some help and ideas on how the dot product can be calculated efficiently using our float3/4 data structures. SSE4, AVX or even AVX2 is fine.
Editor's note: for the 4-element case, see "How to Calculate single-vector Dot Product using SSE intrinsic functions in C". The same approach, with the unused fourth component masked out, may be good for the 3-element case, too.

Algebraically, efficient SIMD looks almost identical to scalar code. So the right way to do the dot product is to compute four dot products at once with SSE (eight with AVX), where each SIMD register holds the same component of four different vectors.
Consider constructing your code like this:
#include <x86intrin.h>
struct float4 {
__m128 xmm;
float4 () {};
float4 (__m128 const & x) { xmm = x; }
float4 & operator = (__m128 const & x) { xmm = x; return *this; }
float4 & load(float const * p) { xmm = _mm_loadu_ps(p); return *this; }
operator __m128() const { return xmm; }
};
static inline float4 operator + (float4 const & a, float4 const & b) {
return _mm_add_ps(a, b);
}
static inline float4 operator * (float4 const & a, float4 const & b) {
return _mm_mul_ps(a, b);
}
struct block3 {
float4 x, y, z;
};
struct block4 {
float4 x, y, z, w;
};
static inline float4 dot(block3 const & a, block3 const & b) {
return a.x*b.x + a.y*b.y + a.z*b.z;
}
static inline float4 dot(block4 const & a, block4 const & b) {
return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
}
Notice that the last two functions look almost identical to your scalar dot function except that float becomes float4 and float4 becomes block3 or block4. This will do the dot product most efficiently.
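If the input stays in the question's array-of-structs form, one possible way to feed a block-style dot product is to load four vectors and transpose them in registers. The sketch below assumes an x86 target with SSE; `Vec3` and `dot3x4` are hypothetical names, and raw `__m128` values are used instead of the `float4` wrapper to keep it short.

```cpp
#include <xmmintrin.h>  // SSE intrinsics; x86 / x86-64 only

// Hypothetical stand-in for the question's 16-byte aligned float3.
struct alignas(16) Vec3 {
    float x, y, z, w;  // w is unused padding
};

// Computes out[i] = dot(a[i], b[i]) for i = 0..3.
// Assumes a and b each point to four 16-byte aligned Vec3 values.
inline void dot3x4(const Vec3* a, const Vec3* b, float* out) {
    // One register per input vector: (x, y, z, w).
    __m128 a0 = _mm_load_ps(&a[0].x), a1 = _mm_load_ps(&a[1].x);
    __m128 a2 = _mm_load_ps(&a[2].x), a3 = _mm_load_ps(&a[3].x);
    __m128 b0 = _mm_load_ps(&b[0].x), b1 = _mm_load_ps(&b[1].x);
    __m128 b2 = _mm_load_ps(&b[2].x), b3 = _mm_load_ps(&b[3].x);

    // Transpose in place: a0 now holds the four x values, a1 the y values,
    // a2 the z values (a3 holds the unused w values). Same for b.
    _MM_TRANSPOSE4_PS(a0, a1, a2, a3);
    _MM_TRANSPOSE4_PS(b0, b1, b2, b3);

    // Same shape as the scalar formula: x*x + y*y + z*z, four lanes at once.
    __m128 sum = _mm_add_ps(_mm_add_ps(_mm_mul_ps(a0, b0), _mm_mul_ps(a1, b1)),
                            _mm_mul_ps(a2, b2));
    _mm_storeu_ps(out, sum);
}
```

The transpose costs extra shuffles, so whether this beats the scalar version depends on how much other work is done on the data while it is in SoA form; measuring on the target CPU is the only reliable guide.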

To get the best out of AVX intrinsics, you have to think in a different dimension.
Instead of doing one dot product, do 8 dot products in a single go.
Look up the difference between SoA and AoS.
If your vectors are in SoA (structures of arrays) format, your data looks like this in memory:
// eight 3d vectors, called a.
float ax[8];
float ay[8];
float az[8];
// eight 3d vectors, called b.
float bx[8];
float by[8];
float bz[8];
Then to multiply all 8 a vectors with all 8 b vectors, you use three simd multiplications, one for each of x,y,z.
For dot, you still need to add the three products afterwards, of course, but in SoA form these are ordinary element-wise SIMD additions; no horizontal sum across a register is needed. Multiplication, subtraction, and addition of vectors using SoA are pretty easy, and really fast. When AVX-512 is available, you can do 16 3d vector multiplications in just 3 instructions.
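As a minimal sketch of the SoA idea, the following uses plain C++ rather than intrinsics; the names `Vec3SoA` and `dot_soa` are hypothetical. With optimization enabled, compilers can often vectorize a loop of this shape on their own, though that is not guaranteed.

```cpp
#include <cstddef>

// Hypothetical SoA container for N 3D vectors.
// N = 8 matches the number of floats in one AVX register.
template <std::size_t N>
struct Vec3SoA {
    float x[N];
    float y[N];
    float z[N];
};

// Computes out[i] = dot(a_i, b_i) for all N vector pairs.
// Each iteration touches only lane i, so there is no horizontal add.
template <std::size_t N>
void dot_soa(const Vec3SoA<N>& a, const Vec3SoA<N>& b, float (&out)[N]) {
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
    }
}
```

The loop body is the scalar dot product unchanged; only the data layout differs, which is what lets one SIMD instruction process the same component of several vectors.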

Related

Squared Euclidean distance with Row Major Matrix Eigen C++

Because I plan to pass numpy arrays into my C++ code with pybind11, I would naturally like to compute with row-major matrices. I found a one-liner implementation of the squared Euclidean distance on Stack Overflow:
typedef Eigen::MatrixXd Matrix;
void squared_dist(const Matrix& X1, const Matrix& X2, Matrix& D) {
D = ((-2 * X1.transpose() * X2).colwise() + X1.colwise().squaredNorm().transpose()).rowwise() + X2.colwise().squaredNorm();
}
But this requires X1, X2, and D to be the default column-major matrices. How would I implement a similar one-liner for row-major matrices?
You can use a templated version of that one-liner function so that it can accept RowMajor as well as ColMajor Eigen::Matrix arguments:
template<class T>
void squared_dist(const T& X1, const T& X2, T& D) {
D = ((-2 * X1.transpose() * X2).colwise() + X1.colwise().squaredNorm().transpose()).rowwise() + X2.colwise().squaredNorm();
}
This godbolt demo shows how that function can be used in code.
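The one-liner relies on the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, with each column of X1 and X2 treated as one point. The following is a plain C++ sketch of the same computation on row-major data, without Eigen; the function name, the flat `std::vector` storage, and the explicit size parameters are assumptions made for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper, not part of Eigen.
// X1 is d x n1 and X2 is d x n2, both stored row-major in flat vectors;
// each column is one point. Returns the n1 x n2 matrix D (row-major)
// with D(i, j) = squared distance between point i of X1 and point j of X2.
std::vector<double> squared_dist_rowmajor(const std::vector<double>& X1, std::size_t n1,
                                          const std::vector<double>& X2, std::size_t n2,
                                          std::size_t d) {
    std::vector<double> D(n1 * n2, 0.0);
    for (std::size_t i = 0; i < n1; ++i) {
        for (std::size_t j = 0; j < n2; ++j) {
            double norm1 = 0.0, norm2 = 0.0, cross = 0.0;
            for (std::size_t k = 0; k < d; ++k) {
                const double a = X1[k * n1 + i];  // element (k, i) of X1
                const double b = X2[k * n2 + j];  // element (k, j) of X2
                norm1 += a * a;
                norm2 += b * b;
                cross += a * b;
            }
            // ||a||^2 + ||b||^2 - 2 a.b, the same three terms as the one-liner.
            D[i * n2 + j] = norm1 + norm2 - 2.0 * cross;
        }
    }
    return D;
}
```

A loop like this is mainly useful as a reference to check the Eigen expression against; the Eigen version computes the cross term as a single matrix product, which is usually much faster for large inputs.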

C++ vector class definition

I'm only beginning in C++ and I'm struggling to understand some code from a custom vector class in an article I'm working through. The author writes it as:
class vec3
{
public:
vec3() {}
vec3(float e0, float e1, float e2)
{
e[0] = e0;
e[1] = e1;
e[2] = e2;
}
(...)
But so far I've only seen class definitions where the types of data it holds are defined, such as:
class vec3
{
public:
float m_x;
float m_y;
float m_z;
vec3(float x, float y, float z) : m_x(x), m_y(y), m_z(z)
{}
My guess was that the code in the article is creating an empty vector which it then populates with floats, or that something was assumed in the definition. Is this just a syntax difference or is there something more fundamental that I'm missing? Apologies for what seems like a basic question, but I couldn't find any similar questions; it might be too basic for that! I just wanted to understand it before I moved on.
Thanks,
Paddy
In the code you've posted, you are correct that there is no declaration for the variable e anywhere. I'm not sure if this is because you didn't post that part of the code from the book, or if the book omitted that for brevity.
Without knowing what the book author was meaning by e, I don't want to suggest a completion of the code. There are several things that e could be declared as that would be compatible with the code you've posted.
It defines e as just a float array (float e[3]).
With this information added, there is no relevant difference between three separate members and an array. In the end, both variants require the same amount of memory, and in most cases the (optimised!) code generated by the compiler will be exactly the same:
float getY()
{
return m_y;
// will result in address of object + offset to m_y
}
float getY()
{
return e[1];
// will result in address of object + offset to e + offset to second element
// as both offsets are constant, they will be joined to one single summand
// and we are back at first variant...
}
One of the few things you can do with arrays but not with separate members is having a loop, such as the following:
float magnitude()
{
float sum = 0.0F;
for(auto f : e)
sum += f*f;
return sqrtf(sum);
}
However, for such short loops, loop unrolling is pretty likely, and the code generated again is with high probability equivalent to the separate member variant:
float magnitude()
{
return sqrtf(m_x * m_x + m_y * m_y + m_z * m_z);
}
With an array, you could pass all three members in one single parameter to other functions (as pointer to first element), with separate members, you'd have to pass all of them separately (well, there are ways around, but they either require extra effort or are "dirty"...).
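Putting these pieces together, a minimal sketch of the article's class might look as follows, assuming the elided member really is `float e[3]`. The accessors `y()` and `data()` and the free function `sum3` are hypothetical additions for illustration.

```cpp
#include <cmath>

// Sketch of the article's class, assuming the omitted member is `float e[3]`.
class vec3 {
public:
    vec3() : e{0.0f, 0.0f, 0.0f} {}
    vec3(float e0, float e1, float e2) : e{e0, e1, e2} {}

    // Same constant-offset access as a named member m_y would give.
    float y() const { return e[1]; }

    // The array member allows a range-based loop over the components.
    float magnitude() const {
        float sum = 0.0f;
        for (float f : e) sum += f * f;
        return std::sqrt(sum);
    }

    // Pointer to the first element, for passing all components at once.
    const float* data() const { return e; }

private:
    float e[3];  // the data member the article's excerpt left out
};

// Hypothetical function that takes the three components via one pointer.
inline float sum3(const float* p) { return p[0] + p[1] + p[2]; }
```

For example, `sum3(v.data())` passes all three components in a single argument, which is the convenience the array layout offers over three separate members.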

std::vector making struct sort slow? c++

I have a list of structs that I am sorting by one of the members. I am using std::sort with my own comparison function, that part is fine. However, I notice a (very) large performance gap when I change the struct from:
struct square
{
float x;
float y;
float z;
float scale;
float angle;
GLuint texture;
};
to
struct square
{
float x;
float y;
float z;
float scale;
float angle;
GLuint texture;
std::vector <float> color;
};
I have since used an entirely different method, and I realize that using a vector like this is a bad idea (I know the size of array - rgb) but I was wondering why I got the performance hit. I was comparing the z values to sort.
Here is my sorting function and struct list:
std::vector <square> square_list;
//Then add a bunch of squares
bool sort (square a,square b)
{
return a.z < b.z;
}
//Here is the sort that is slow
std::sort (square_list.begin(),square_list.end(),sort);
I wonder if it has anything to do with re-ordering the list of structs as their size is significantly bigger in the second case?
Thanks for any responses.
bool sort (square a,square b)
This copies both structs on every comparison, including the vectors. Copying a std::vector allocates heap memory and copies its elements, which is much slower than copying a plain array member. You should use this instead.
bool sort (const square& a, const square& b)
If you are using C++11, you can replace the vector with std::array as the size is constant.
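A minimal sketch of both suggestions combined could look like this. `unsigned int` stands in for `GLuint` so that the example does not need the OpenGL headers (an assumption), and `by_z` and `sort_by_z` are hypothetical names.

```cpp
#include <algorithm>
#include <array>
#include <vector>

struct square {
    float x;
    float y;
    float z;
    float scale;
    float angle;
    unsigned int texture;        // stand-in for GLuint (assumption)
    std::array<float, 3> color;  // fixed-size rgb, stored inline, no heap allocation
};

// Takes both arguments by const reference, so a comparison copies nothing.
inline bool by_z(const square& a, const square& b) {
    return a.z < b.z;
}

// Sorts the squares in ascending order of z.
inline void sort_by_z(std::vector<square>& squares) {
    std::sort(squares.begin(), squares.end(), by_z);
}
```

With `std::array` the struct stays trivially copyable, so even the element moves that std::sort performs internally are plain memory copies.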
In addition to taking the parameters by const reference, you could use a functor for the comparison. That is often faster because functors are easier for the compiler to inline.
std::vector <square> square_list;
//Then add a bunch of squares
struct sort
{
bool operator() (const square& a, const square& b) const {
return a.z < b.z;
}
};
//Pass an instance of the functor, not the type name
std::sort (square_list.begin(),square_list.end(),sort());
The comparison function copies your values every time it is called, and copying a std::vector member means allocating memory and copying its contents, so the time spent copying is much larger.
Did you try storing pointers instead of the whole struct in your vector?
std::vector <square*> square_list;
//Then add a bunch of squares
bool sort (square* a,square* b)
{
return a->z < b->z;
}
//This sort only moves pointers around, not whole structs
std::sort (square_list.begin(),square_list.end(),sort);

3D vertices class or struct

I am writing a small program for learning C++ and 3D.
I have already written a vertex class with useful methods (like dot, cross, etc.).
class cVector {
...
float x, y, z;
...
float dot(cVector& v);
cVector cross(cVector& v);
...
};
Now I realize OpenGL expects buffers where elements are more like a struct (VBO).
struct sVector {
float x, y, z;
};
So my vertex class is no longer useful, because if I want to manipulate data in the buffer:
1 - I need to extract the data of the elements in the buffer.
2 - Create a temporary instance of the vertex class with the data.
3 - Use a vertex class method (dot, cross, etc.).
4 - Put the data back into the buffer.
It's not very efficient :(.
I wonder whether I should instead use a struct to organize my vectors and create global functions that take a pointer to a struct as an argument.
I could handle data buffers more efficiently (just moving a pointer), but I feel I would lose the "convenient power" of C++.
In every 3D C++ source code I have seen, a class is used for vertices, but I don't understand how they can manipulate a large number of vertices in a "struct-like" buffer.
Can you help me understand? What is the best approach?
The most common approach in a language like C++ is actually neither of these things.
You are more likely to encounter the following:
struct Vector3 {
union {
struct {
float x,y,z;
};
float v [3];
};
...
Vector3 (float x_, float y_, float z_) : x (x_), y (y_), z (z_) { };
float Norm (void) { return sqrt ((x * x) + (y * y) + (z * z)); }
void Normalize (void) {
float norm = Norm ();
v [0] /= norm;
v [1] /= norm;
v [2] /= norm;
}
};
The reason for this is because using anonymous unions and structs, you can treat the data as either an array of floats (v [...]) or reference the individual components by their name (x, y, z) without a lot of muss or fuss. You get the best of both worlds by using the language more intelligently.
As for the difference between a struct and a class in this particular case, there is none from the perspective of memory representation. The only real difference between a class and a struct in C++ is the default access; struct has public access by default.
When GL needs to access the object's internal memory, you would accomplish this by passing it the pointer: Vector3::v or the individual components, depending on the particular function.
For instance:
Vector3 vec (1.0f, 2.0f, 3.0f);
---------------------------------
glVertex3fv (vec.v);
and
glVertex3f (vec.x, vec.y, vec.z);
are equivalent
On a side-note, anonymous structures are a non-standard extension to C++ but supported virtually everywhere. In the case that you have a compiler that does not support them, you may have to qualify access to x, y, and z by giving the struct a name.
struct Vector3 {
union {
struct {
float x,y,z;
} s;
float v [3];
};
};
If you write your struct this way, then:
Vector3 vec;
assert (vec.v [0] == vec.s.x);
It is messier to have to qualify x that way (using an anonymous struct you can use vec.x).
There is exactly one difference between struct and class: for class the default access is private, while for struct it is public.
So
class cVector {
...
float x, y, z; // data
...
float dot(cVector& v); // just a function
cVector cross(cVector& v); // just a function
...
};
and
struct sVector {
float x, y, z; // data
};
have exactly the same memory layout (given that x,y,z are the only member variables of cVector and it has no virtual functions).
You can use &v.x to get a pointer to (x,y,z) for OpenGL, e.g. glVertex3fv(&v.x);.
You can even do the following to get a pointer to a contiguous sequence of vertices for usage with OpenGL:
std::vector<cVector> vertices(100);
const float* data = &(vertices[0].x);
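The layout assumption can be checked at compile time. The sketch below is one way to do that, not a definitive implementation; the `dot` body and the helper `vertex_data` are hypothetical, and treating the vertices as one flat run of floats relies on the class having no padding, which the `static_assert`s verify for the compiler in use.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

class cVector {
public:
    float x, y, z;  // data

    // Member functions do not change the memory layout.
    float dot(const cVector& v) const { return x * v.x + y * v.y + z * v.z; }
};

// Fail the build if the class stops being a plain block of three floats,
// for example after adding a virtual function or another data member.
static_assert(std::is_standard_layout<cVector>::value,
              "cVector must be standard-layout");
static_assert(sizeof(cVector) == 3 * sizeof(float),
              "cVector must contain exactly three floats and no padding");

// Hypothetical helper: pointer to the first float of a contiguous vertex
// buffer, suitable for handing to an API that expects x,y,z,x,y,z,...
inline const float* vertex_data(const std::vector<cVector>& vertices) {
    return vertices.empty() ? nullptr : &vertices[0].x;
}
```

With this in place the class keeps its convenient methods while the `std::vector<cVector>` itself serves as the buffer, so no extract-and-copy-back step is needed.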