Cache performance of vectors, matrices and quaternions - c++

I've noticed on a number of occasions in the past, C and C++ code that uses the following format for these structures:
class Vector3
{
float components[3];
//etc.
}
class Matrix4x4
{
float components[16];
//etc.
}
class Quaternion
{
float components[4];
//etc.
}
My question is, will this lead to any better cache performance than say, this:
class Quaternion
{
float x;
float y;
float z;
//etc.
}
...Since I'd assume the class members and functions are in contiguous memory space, anyway? I currently use the latter form because I find it more convenient (however I can also see the practical sense in the array form, since it allows one to treat axes as arbitrary dependant on the operation being performed).
Afer taking some advice from the respondents, I tested the difference and it is actually slower with the array -- I get about 3% difference in framerate. I implemented operator[] to wrap the array access inside the Vector3. Not sure if this has anything to do with it, but I doubt it since that should be inlined anyway. The only factor I could see was that I could no longer use a constructor initializer list on Vector3(x, y, z). However when I took the original version and changed it to no longer use constructor initialiser lists, it ran very marginally slower than before (less than 0.05%). No clue, but at least now I know the original approach was faster.

These declarations are not equivalent with respect to memory layout.
class Quaternion
{
float components[4];
//etc.
}
The above guarantees that the elements are continuous in memory, while, if they are individual members like in your last example, the compiler is allowed to insert padding between them (for instance to align the members with certain address-patterns).
Whether or not this results in better or worse performance depends on your mostly on your compiler, so you'd have to profile it.

I imagine the performance difference from an optimization like this is minimal. I would say something like this falls into premature optimization for most code. However, if you plan to do vector processing over your structs, say by using CUDA, struct composition makes an important difference. Look at page 23 on this if interested: http://www.eecis.udel.edu/~mpellegr/eleg662-09s/li.pdf

I am not sure if the compiler manages to optimize code better when using an array in this context (think at unions for example), but when using APIs like OpenGL, it can be an optimisation when calling functions like
void glVertex3fv(const GLfloat* v);
instead of calling
void glVertex3f(GLfloat x, GLfloat y, GLfloat z);
because, in the later case, each parameter is passed by value, whereas in the first example, only a pointer to the whole array is passed and the function can decide what to copy and when, this way reducing unnecessary copy operations.

Related

Efficiency of Eigen::Ref compared to Eigen::VectorXd as function arguments

I have a long vector Eigen::VectorXd X;, and I would like to update it segment-by-segment using one of the following functions:
void Foo1(Eigen::Ref<Eigen::VectorXd> x) {
// Update x.
}
Eigen::VectorXd Foo2() {
Eigen::VectorXd x;
// Update x.
return x;
}
int main() {
const int LARGE_NUMBER = ...; // Approximately in the range ~[600, 1000].
const int SIZES[] = {...}; // Entries roughly in the range ~[2, 20].
Eigen::VectorXd X{LARGE_NUMBER};
int j = 0;
for (int i = 0; i < LARGE_NUMBER; i += SIZES[j]) {
// Option (1).
Foo1(X.segment(i, SIZES[j]));
// Option (2)
X.segment(i, SIZES[j]) = Foo2();
++j;
}
return 0;
}
Given the above specifications, which option would be the most efficient? I would say (1) because it would directly modify the memory without creating any temporaries. However, compiler optimizations could potentially make (2) perform better -- e.g., see this post.
Secondly, consider the following functions:
void Foo3(const Eigen::Ref<const Eigen::VectorXd>& x) {
// Use x.
}
void Foo4(const Eigen::VectorXd& x) {
// Use x.
}
Is calling Foo3 with segments of X guaranteed to always be at least as efficient as calling Foo4 with the same segments? That is, is Foo3(X.segment(...)) always at least as efficient as Foo4(X.segment(...))?
Given the above specifications, which option would be the most efficient?
Most likely option 1 as you have guessed. It depends on what the update entails, of course. So you may need some benchmarking. But in general the cost of allocation is significant compared to the minor optimizations allowed by allocating a new object. Plus, option 2 incurs the additional cost of copying the result.
Is calling Foo3 with segments of X guaranteed to always be at least as efficient as calling Foo4 with the same segments?
If you call Foo4(x.segment(...)) it allocates a new vector and copies the segment into it. That is significantly more expensive than Foo3. And the only thing you gain is that the vector will be properly aligned. This is only a minor benefit on modern CPUs. So I would expect Foo3 to be more efficient.
Note that there is one option that you have not considered: Use templates.
template<class Derived>
void Foo1(const Eigen::MatrixBase<Derived>& x) {
Eigen::MatrixBase<Derived>& mutable_x = const_cast<Eigen::MatrixBase<Derived>&>(x);
// Update mutable_x.
}
The const-cast is annoying but harmless. Please refer to Eigen's documentation on that topic.
https://eigen.tuxfamily.org/dox/TopicFunctionTakingEigenTypes.html
Overall, this will allow approximately the same performance as if you inlined the function body. In your particular case, it may not be any faster than an inlined version of Foo1, though. This is because a general segment and a Ref object have basically the same performance.
Efficiency of accessing Ref vs. Vector
Let's look at the performance in more detail between computations on an Eigen::Vector, an Eigen::Ref<Vector>, an Eigen::Matrix and an Eigen::Ref<Matrix>. Eigen::Block (the return type for Vector.segment() or Matrix.block()) is functionally identical to Ref, so I don't bother mentioning it further.
Vector and Matrix guarantee that the array as a whole is aligned to 16 byte boundaries. That allows operations to use aligned memory accesses (e.g. movapd in this instance).
Ref does not guarantee alignment and therefore requires unaligned accesses (e.g. movupd). On very old CPUs this used to have a significant performance penalty. These days it is less relevant. It is nice to have alignment but it is no longer the be-all-end-all for vectorization. To quote Agner on that topic [1]:
Some microprocessors have a penalty of several clock cycles when accessing misaligned data that cross a cache line boundary.
Most XMM instructions without VEX prefix that read or write 16-byte memory operands require that the operand is aligned by 16. Instructions that accept unaligned 16-byte operands can be quite inefficient on older processors. However, this restriction is largely relieved with the AVX and later instruction sets. AVX instructions do not require alignment of memory operands, except for the explicitly aligned instructions. Processors that support the
AVX instruction set generally handle misaligned memory operands very efficiently.
All four data types guarantee that the inner dimension (only dimension in vector, single column in matrix) is stored consecutively. So Eigen can vectorize along this dimension
Ref does not guarantee that elements along the outer dimension are stored consecutively. There may be a gap from one column to the next. This means that scalar operations like Matrix+Matrix or Matrix*Scalar can use a single loop over all elements in all rows and columns while Ref+Ref need a nested loop with an outer loop over all columns and an inner loop over all rows.
Neither Ref nor Matrix guarantee proper alignment for a specific column. Therefore most matrix operations such as matrix-vector products need to use unaligned accesses.
If you create a vector or matrix inside a function, this may help escape and alias analysis. However, Eigen already assumes no aliasing in most instances and the code that Eigen creates leaves little room for the compiler to add anything. Therefore it is rarely a benefit.
There are differences in the calling convention. For example in Foo(Eigen::Ref<Vector>), the object is passed by value. Ref has a pointer, a size, and no destructor. So it will be passed in two registers. This is very efficient. It is less good for Ref<Matrix> which consumes 4 registers (pointer, rows, columns, outer stride). Foo(const Eigen::Ref<const Vector>&) would create a temporary object on the stack and pass the pointer to the function. Vector Foo() returns an object that has a destructor. So the caller allocates space on the stack, then passes a hidden pointer to the function. Usually, these differences are not significant but of course they exist and may be relevant in code that does very little computation with many function calls
With these differences in mind, let's look at the specific case at hand. You have not specified what the update method does, so I have to make some assumptions.
The computations will always be the same so we only have to look at memory allocations and accesses.
Example 1:
void Foo1(Eigen::Ref<Eigen::VectorXd> x) {
x = Eigen::VectorXd::LinSpaced(x.size(), 0., 1.);
}
Eigen::VectorXd Foo2(int n) {
return Eigen::VectorXd::LinSpaced(n, 0., 1.);
}
x.segment(..., n) = Foo2(n);
Foo1 does one unaligned memory write. Foo2 does one allocation and one aligned memory write into the temporary vector. Then it copies to the segment. That will use one aligned memory read and an unaligned memory write. Therefore Foo1 is clearly better in all circumstances.
Example 2:
void Foo3(Eigen::Ref<Eigen::VectorXd> x)
{
x = x * x.maxCoeff();
}
Eigen::VectorXd Foo4(const Eigen::Ref<Eigen::VectorXd>& x)
{
return x * x.maxCoeff();
}
Eigen::VectorXd Foo5(const Eigen::Ref<Eigen::VectorXd>& x)
{
Eigen::VectorXd rtrn = x;
rtrn = rtrn * rtrn.maxCoeff();
return rtrn;
}
Both Foo3 and 4 do two unaligned memory reads from x (one for the maxCoeff, one for the multiplication). After that, they behave the same as Foo1 and 2. Therefore Foo3 is always better than 4.
Foo5 does one unaligned memory read and one aligned memory write for the initial copy, then two aligned reads and one aligned write for the computation. After that follow the copy outside the function (same as Foo2). This is still a lot more than what Foo3 does but if you do a lot more memory accesses to the vector, it may be worthwhile at some point. I doubt it, but cases may exist.
The main take-away is this: Since you ultimately want to store the results in segments of an existing vector, you can never fully escape the unaligned memory accesses. So it is not worth worrying about them too much.
Template vs. Ref
A quick rundown of the differences:
The templated version will (if written properly) work on all data types and all memory layouts. For example if you pass a full vector or matrix, it can exploit the alignment.
There are cases where Ref will simply not compile, or work differently than expected. As written above, Ref guarantees that the inner dimension is stored consecutively. The call Foo1(Matrix.row(1)) will not work, because a matrix row is not stored consecutively in Eigen. And if you call a function with const Eigen::Ref<const Vector>&, Eigen will copy the row into a temporary vector.
The templated version will work in these cases, but of course it cannot vectorize.
The Ref version has some benefits:
It is clearer to read and has fewer chances to go wrong with unexpected inputs
You can put it in a cpp file and it creates less redundant code. Depending on your use case, more compact code may be more beneficial or appropriate
[1] https://www.agner.org/optimize/optimizing_assembly.pdf

Using memmove to initialize entire object in constructor in C++

Is it safe to use memmove/memcpy to initialize an object with constructor parameters?
No-one seems to use this method but it works fine when I tried it.
Does parameters being passed in a stack cause problems?
Say I have a class foo as follows,
class foo
{
int x,y;
float z;
foo();
foo(int,int,float);
};
Can I initialize the variables using memmove as follows?
foo::foo(int x,int y,float z)
{
memmove(this,&x, sizeof(foo));
}
This is undefined behavior.
The shown code does not attempt to initialize class variables. It attempts to memmove() onto the class pointer, and assumes that the size of the class is 2*sizeof(int)+sizeof(float). The C++ standard does not guarantee that.
Furthermore, the shown code also assumes the layout of the parameters that are passed to the constructor will be the same layout as the layout of the members of this POD. That, again, is not specified by the C++ standard.
It is safe to use memmove to initialize individual class members. For example, the following is safe:
foo::foo(int x_,int y_,float z_)
{
memmove(&x, &x_, sizeof(x));
memmove(&y, &y_, sizeof(y));
memmove(&z, &z_, sizeof(z));
}
Of course, this does nothing useful, but this would be safe.
No it is not safe, because based on the standard the members are not guaranteed to be immediately right after each other due to alignment/padding.
After your update, this is even worse because the location of passed arguments and their order are not safe to use.
We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. - Donald Knuth
You should not try to optimize a code you are not sure you need to. I would suggest you to profile your code before you are able to perform this kind of optimizations. This way you don't lose time improving the performance of some code that is not going to impact the overall performance of your application.
Usually, compilers are smart enough to guess what are you trying to do with your code, and generate high efficient code that will keep the same functionality. For that purpose, you should be sure that you are enabling compiler optimizations (-Olevel flag or toggling individual ones through compiler command arguments).
For example, I've seen that some compilers transform std::copy into a memcpy when the compiler is sure that doing so is straightforward (e.g. data is contiguous).
No it is not safe. It is undefined behavior.
And the code
foo::foo(int x,int y,float z)
{
memmove(this,&x, sizeof(foo));
}
is not even saving you any typing compared to using an initializer list
foo::foo(int x,int y,float z) : x(x), y(y), z(z)
{ }

C++ - Loop Efficiency: Storing temporarly values of a class member Vs pointer to this member

Within a class method, I'm accessing private attributes - or attributes of a nested class. Moreover, I'm looping over these attributes.
I was wondering what is the most efficient way in terms of time (and memory) between:
copying the attributes and accessing them within the loop
Accessing the attributes within the loop
Or maybe using an iterator over the attribute
I feel my question is related to : Efficiency of accessing a value through a pointer vs storing as temporary value. But in my case, I just need to access a value, not change it.
Example
Given two classes
class ClassA
{
public:
vector<double> GetAVector() { return AVector; }
private:
vector<double> m_AVector;
}
and
class ClassB
{
public:
void MyFunction();
private:
vector<double> m_Vector;
ClassA m_A;
}
I. Should I do:
1.
void ClassB::MyFunction()
{
vector<double> foo;
for(int i=0; i<... ; i++)
{
foo.push_back(SomeFunction(m_Vector[i]));
}
/// do something ...
}
2.
void ClassB::MyFunction()
{
vector<double> foo;
vector<double> VectorCopy = m_Vector;
for(int i=0; i<... ; i++)
{
foo.push_back(SomeFunction(VectorCopy[i]));
}
/// do something ...
}
3.
void ClassB::MyFunction()
{
vector<double> foo;
for(vector<double>::iterator it = m_Vector.begin(); it != m_Vector.end() ; it++)
{
foo.push_back(SomeFunction((*it)));
}
/// do something ...
}
II. What if I'm not looping over m_vector but m_A.GetAVector()?
P.S. : I understood while going through other posts that it's not useful to 'micro'-optimize at first but my question is more related to what really happens and what should be done - as for standards (and coding-style)
You're in luck: you can actually figure out the answer all by yourself, by trying each approach with your compiler and on your operating system, and timing each approach to see how long it takes.
There is no universal answer here, that applies to every imaginable C++ compiler and operating system that exists on the third planet from the sun. Each compiler, and hardware is different, and has different runtime characteristics. Even different versions of the same compiler will often result in different runtime behavior that might affect performance. Not to mention various compilation and optimization options. And since you didn't even specify your compiler and operating system, there's literally no authoritative answer that can be given here.
Although it's true that for some questions of this type it's possible to arrive at the best implementation with a high degree of certainty, for most use cases, this isn't one of them. The only way you can get the answer is to figure it out yourself, by trying each alternative yourself, profiling, and comparing the results.
I can categorically say that 2. is less efficient than 1. Copying to a local copy, and then accessing it like you would the original would only be of potential benefit if accessing a stack variable is quicker than accessing a member one, and it's not, so it's not (if you see what I mean).
Option 3. is trickier, since it depends on the implementation of the iter() method (and end(), which may be called once per loop) versus the implementation of the operator [] method. I could irritate some C++ die-hards and say there's an option 4: ask the Vector for a pointer to the array and use a pointer or array index on that directly. That might just be faster than either!
And as for II, there is a double-indirection there. A good compiler should spot that and cache the result for repeated use - but otherwise it would only be marginally slower than not doing so: again, depending on your compiler.
Without optimizations, option 2 would be slower on every imaginable platform, becasue it will incur copy of the vector, and the access time would be identical for local variable and class member.
With optimization, depending on SomeFunction, performance might be the same or worse for option 2. Same performance would happen if SomeFunction is either visible to compiler to not modify it's argument, or it's signature guarantees that argument will not be modified - in this case compiler can optimize away the copy altogether. Otherwise, the copy will remain.

C++ matrix and vector design

I'm making custom vector and matrix class for numerical calculations.
I want to treat each row and column of the matrix as a vector. Also, I do not want to use extra memory, therefore, I made VectorView class which uses data in matrix directly(Like GSL library). Here is the outline of my matrix class.
class Matrix{
priavte:
T data[];
....
public:
VectorView row(int n);
VectorView colum(int n);
};
And I define a function which uses VectorView.
myFunc(VectorView& v);
My VectorView class has some extra data, therefore I want to use VectorView as a reference to save memory.
However, I got a problem when I calling a function like this.
Matrix m;
...
...
myFunc(m.row(i));
The problem is that m.row(i) returns temporary object therefore I cannot use reference type to treat it. But
auto v = m.row(i);
myFunc(v);
this does not makes a error even though it is exactly same but not clear reason to use v. I want to use the above one. Is there an brilliant solution for this type of problem?
row returns an rvalue reference (VectorView&&), which cannot be passed as a non-const lvalue reference (VectorView&). You can either redefine myFunc as myFunc(const VectorView& v) or myFunc(VectorView&& v), depending on your requirements and the behaviour of VectorView.
If myFunc needs access to non-const members of VectorView, you'll need to define the latter, which will pass the returned value from row into myFunc using move semantics. However, since VectorView is just a "view" onto the original data, it possibly doesn't have (or need) any non-const members, in which case, you should use the former.
use valarray and gslice
http://www.cplusplus.com/reference/valarray/gslice/
N-D (including 2D) matrices are why Bjarne added gslices (AFAIK)
Don't reinvent the wheel and use Eigen
eigen.tuxfamily.org
is a header-only c++ matrix library with very good support and performance
Guess your vector view contains only a pointer to original data and a integer for step (1 for row, n for column). In that case treat it as a value object is fine (as long as you make sure the life cycle of matrix is good). So if you want the syntax, you can just use value in myFunc. Like: myFunc(VectorView) ...
Write two VectorViews: VectorView and ConstVectorView. The first holds a view of a slice of data, and a method is const iff it does not change which slice you are looking at. Changing members is ok in a const method of VectorView.
ConstVectorView is a view of a vector where changing the value of elements is illegal. You can change what you are viewing with non-const methods, and you can access elements to read with const methods.
You should be able to construct a ConstVectorView from a VectorView.
Then, when you return a VectorViewand pass it to a function, the function should either take it by value, or take it by const&. If the function doesn't modify its contents, it takes a ConstVectorView.
Make your life simple, and stick to C-style semantics (this allows you to use the wealth of C linear algebra code available out there, and easy to use the fortran ones as well).
You state that you are concerned about using extra memory, but if you are simply concerned and do not have tight bounds, ensure you store the matrix in both row and column format. This will be crucial for getting any kind of performance out of common matrix operations (as you will be using whole cache lines at a time).
class Matrix{
private:
T rowData[];
T colData[];
...
public:
T const * row(int n) const;
T const * colum(int n) const;
...
};

C memset seems to not write to every member

I wrote a small coordinate class to handle both int and float coordinates.
template <class T>
class vector2
{
public:
vector2() { memset(this, 0, sizeof(this)); }
T x;
T y;
};
Then in main() I do:
vector2<int> v;
But according to my MSVC debugger, only the x value is set to 0, the y value is untouched. Ive never used sizeof() in a template class before, could that be whats causing the trouble?
No don't use memset -- it zeroes out the size of a pointer (4 bytes on my x86 Intel machine) bytes starting at the location pointed by this. This is a bad habit: you will also zero out virtual pointers and pointers to virtual bases when using memset with a complex class. Instead do:
template <class T>
class vector2
{
public:
// use initializer lists
vector2() : x(0), y(0) {}
T x;
T y;
};
As others are saying, memset() is not the right way to do this.
There are some subtleties, however, about why not.
First, your attempt to use memset() is only clearing sizeof(void *) bytes. For your sample case, that apparently is coincidentally the bytes occupied by the x member.
The simple fix would be to write memset(this, 0, sizeof(*this)), which in this case would set both x and y.
However, if your vector2 class has any virtual methods and the usual mechanism is used to represent them by your compiler, then that memset will destroy the vtable and break the instance by setting the vtable pointer to NULL. Which is bad.
Another problem is that if the type T requires some constructor action more complex than just settings its bits to 0, then the constructors for the members are not called, but their effect is ruined by overwriting the content of the members with memset().
The only correct action is to write your default constructor as
vector2(): x(0), y(0), {}
and to just forget about trying to use memset() for this at all.
Edit: D.Shawley pointed out in a comment that the default constructors for x and y were actually called before the memset() in the original code as presented. While technically true, calling memset() overwrites the members, which is at best really, really bad form, and at worst invokes the demons of Undefined Behavior.
As written, the vector2 class is POD, as long as the type T is also plain old data as would be the case if T were int or float.
However, all it would take is for T to be some sort of bignum value class to cause problems that could be really hard to diagnose. If you were lucky, they would manifest early through access violations from dereferencing the NULL pointers created by memset(). But Lady Luck is a fickle mistress, and the more likely outcome is that some memory is leaked, and the application gets "shaky". Or more likely, "shakier".
The OP asked in a comment on another answer "...Isn't there a way to make memset work?"
The answer there is simply, "No."
Having chosen the C++ language, and chosen to take full advantage of templates, you have to pay for those advantages by using the language correctly. It simply isn't correct to bypass the constructor (in the general case). While there are circumstances under which it is legal, safe, and sensible to call memset() in a C++ program, this just isn't one of them.
The problem is this is a Pointer type, which is 4 bytes (on 32bit systems), and ints are 4 bytes (on 32bit systems). Try:
sizeof(*this)
Edit: Though I agree with others that initializer lists in the constructor are probably the correct solution here.
Don't use memset. It'll break horribly on non-POD types (and won't necessarily be easy to debug), and in this case, it's likely to be much slower than simply initializing both members to zero (two assignments versus a function call).
Moreover, you do not usually want to zero out all members of a class. You want to zero out the ones for which zero is a meaningful default value. And you should get into the habit of initializing your members to a meaningful value in any case. Blanket zeroing everything and pretending the problem doesn't exist just guarantees a lot of headaches later. If you add a member to a class, decide whether that member should be initialized, and how.
If and when you do want memset-like functionality, at least use std::fill, which is compatible with non-POD types.
If you're programming in C++, use the tools C++ makes available. Otherwise, call it C.
dirkgently is correct. However rather that constructing x and y with 0, an explicit call to the default constructor will set intrinsic types to 0 and allow the template to be used for structs and classes with a default constructor.
template <class T>
class vector2
{
public:
// use initializer lists
vector2() : x(), y() {}
T x;
T y;
};
Don't try to be smarter than the compiler. Use the initializer lists as intended by the language. The compiler knows how to efficiently initialize basic types.
If you would try your memset hack on a class with virtual functions you would most likely overwrite the vtable ending up in a disaster. Don't use hack like that, they are a maintenance nightmare.
This might work instead:
char buffer[sizeof(vector2)];
memset(buffer, 0, sizeof(buffer));
vector2 *v2 = new (buffer) vector2();
..or replacing/overriding vector2::new to do something like that.
Still seems weird to me though.
Definitely go with
vector2(): x(0), y(0), {}