Efficiency of Eigen::Ref compared to Eigen::VectorXd as function arguments - c++

I have a long vector Eigen::VectorXd X, and I would like to update it segment-by-segment using one of the following functions:
void Foo1(Eigen::Ref<Eigen::VectorXd> x) {
    // Update x.
}

Eigen::VectorXd Foo2() {
    Eigen::VectorXd x;
    // Update x.
    return x;
}

int main() {
    const int LARGE_NUMBER = ...;  // Roughly in the range [600, 1000].
    const int SIZES[] = {...};     // Entries roughly in the range [2, 20].
    Eigen::VectorXd X(LARGE_NUMBER);  // Parentheses set the size; braces would create a 1-element vector in Eigen 3.4+.
    int j = 0;
    for (int i = 0; i < LARGE_NUMBER; i += SIZES[j]) {
        // Option (1).
        Foo1(X.segment(i, SIZES[j]));
        // Option (2).
        X.segment(i, SIZES[j]) = Foo2();
        ++j;
    }
    return 0;
}
Given the above specifications, which option would be the most efficient? I would say (1) because it would directly modify the memory without creating any temporaries. However, compiler optimizations could potentially make (2) perform better -- e.g., see this post.
Secondly, consider the following functions:
void Foo3(const Eigen::Ref<const Eigen::VectorXd>& x) {
    // Use x.
}

void Foo4(const Eigen::VectorXd& x) {
    // Use x.
}
Is calling Foo3 with segments of X guaranteed to always be at least as efficient as calling Foo4 with the same segments? That is, is Foo3(X.segment(...)) always at least as efficient as Foo4(X.segment(...))?

Given the above specifications, which option would be the most efficient?
Most likely option (1), as you have guessed. It depends on what the update entails, of course, so you may need some benchmarking. But in general, the cost of the allocation is significant compared to the minor optimizations that a freshly allocated object enables, and option (2) additionally incurs the cost of copying the result.
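If you do benchmark, a minimal sketch along these lines could be a starting point. It is assumption-laden: the update bodies (setConstant/Constant) and the fixed segment size SEG are placeholders for the actual update logic and the SIZES array.

#include <chrono>
#include <iostream>
#include <Eigen/Dense>

void Foo1(Eigen::Ref<Eigen::VectorXd> x) { x.setConstant(1.0); }          // placeholder update
Eigen::VectorXd Foo2(int n) { return Eigen::VectorXd::Constant(n, 1.0); } // placeholder update

int main() {
    const int N = 800, SEG = 8, REPS = 10000;  // made-up sizes
    Eigen::VectorXd X(N);
    auto t0 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < REPS; ++rep)
        for (int i = 0; i + SEG <= N; i += SEG)
            Foo1(X.segment(i, SEG));           // option (1): write in place
    auto t1 = std::chrono::steady_clock::now();
    for (int rep = 0; rep < REPS; ++rep)
        for (int i = 0; i + SEG <= N; i += SEG)
            X.segment(i, SEG) = Foo2(SEG);     // option (2): allocate, fill, copy back
    auto t2 = std::chrono::steady_clock::now();
    std::cout << "option 1: " << std::chrono::duration<double>(t1 - t0).count() << " s\n"
              << "option 2: " << std::chrono::duration<double>(t2 - t1).count() << " s\n";
}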
Is calling Foo3 with segments of X guaranteed to always be at least as efficient as calling Foo4 with the same segments?
If you call Foo4(X.segment(...)), it allocates a new vector and copies the segment into it. That is significantly more expensive than Foo3, and the only thing you gain is that the temporary vector will be properly aligned. That is only a minor benefit on modern CPUs, so I would expect Foo3 to be more efficient.
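To make the difference concrete, a hedged sketch (the function bodies are placeholders): passing a segment to a const Eigen::VectorXd& parameter triggers an implicit conversion that allocates a temporary vector, while the Ref binds to the existing memory.

#include <Eigen/Dense>

void Foo3(const Eigen::Ref<const Eigen::VectorXd>& x) { /* use x */ }
void Foo4(const Eigen::VectorXd& x) { /* use x */ }

int main() {
    Eigen::VectorXd X = Eigen::VectorXd::Random(1000);
    Foo3(X.segment(10, 20)); // Ref wraps the segment's memory: no allocation, no copy
    Foo4(X.segment(10, 20)); // implicit conversion: allocates a temporary and copies 20 doubles
}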
Note that there is one option that you have not considered: Use templates.
template<class Derived>
void Foo1(const Eigen::MatrixBase<Derived>& x) {
    Eigen::MatrixBase<Derived>& mutable_x = const_cast<Eigen::MatrixBase<Derived>&>(x);
    // Update mutable_x.
}
The const-cast is annoying but harmless. Please refer to Eigen's documentation on that topic.
https://eigen.tuxfamily.org/dox/TopicFunctionTakingEigenTypes.html
Overall, this will allow approximately the same performance as if you inlined the function body. In your particular case, it may not be any faster than an inlined version of Foo1, though. This is because a general segment and a Ref object have basically the same performance.
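For illustration, a sketch of calling the templated version; the setZero body is a placeholder update:

#include <Eigen/Dense>

template<class Derived>
void Foo1(const Eigen::MatrixBase<Derived>& x) {
    Eigen::MatrixBase<Derived>& mutable_x = const_cast<Eigen::MatrixBase<Derived>&>(x);
    mutable_x.setZero();  // placeholder update
}

int main() {
    Eigen::VectorXd X(100);
    Foo1(X);                 // full vector: the instantiation can exploit alignment
    Foo1(X.segment(3, 10));  // segment: same template, no temporary, no copy
}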
Efficiency of accessing Ref vs. Vector
Let's compare in more detail the performance of computations on an Eigen::Vector, an Eigen::Ref<Vector>, an Eigen::Matrix, and an Eigen::Ref<Matrix>. Eigen::Block (the return type of vector.segment() or matrix.block()) is functionally identical to Ref, so I don't bother mentioning it further.
Vector and Matrix guarantee that the array as a whole is aligned to 16 byte boundaries. That allows operations to use aligned memory accesses (e.g. movapd in this instance).
Ref does not guarantee alignment and therefore requires unaligned accesses (e.g. movupd). On very old CPUs this used to have a significant performance penalty. These days it is less relevant. It is nice to have alignment but it is no longer the be-all-end-all for vectorization. To quote Agner on that topic [1]:
Some microprocessors have a penalty of several clock cycles when accessing misaligned data that cross a cache line boundary.
Most XMM instructions without VEX prefix that read or write 16-byte memory operands require that the operand is aligned by 16. Instructions that accept unaligned 16-byte operands can be quite inefficient on older processors. However, this restriction is largely relieved with the AVX and later instruction sets. AVX instructions do not require alignment of memory operands, except for the explicitly aligned instructions. Processors that support the AVX instruction set generally handle misaligned memory operands very efficiently.
All four data types guarantee that the inner dimension (the only dimension of a vector, a single column of a matrix) is stored consecutively, so Eigen can vectorize along this dimension.
Ref does not guarantee that elements along the outer dimension are stored consecutively. There may be a gap from one column to the next. This means that scalar operations like Matrix+Matrix or Matrix*Scalar can use a single loop over all elements in all rows and columns while Ref+Ref need a nested loop with an outer loop over all columns and an inner loop over all rows.
Neither Ref nor Matrix guarantees proper alignment for a specific column. Therefore most matrix operations, such as matrix-vector products, need to use unaligned accesses.
If you create a vector or matrix inside a function, this may help escape and alias analysis. However, Eigen already assumes no aliasing in most instances and the code that Eigen creates leaves little room for the compiler to add anything. Therefore it is rarely a benefit.
There are differences in the calling convention. For example, in Foo(Eigen::Ref<Vector>), the object is passed by value. Ref holds a pointer and a size and has no destructor, so it is passed in two registers. This is very efficient. It is less good for Ref<Matrix>, which consumes four registers (pointer, rows, columns, outer stride). Foo(const Eigen::Ref<const Vector>&) creates a temporary object on the stack and passes its pointer to the function. Vector Foo() returns an object that has a destructor, so the caller allocates space on the stack and passes a hidden pointer to the function. Usually these differences are not significant, but they exist and may be relevant in code that does very little computation across many function calls.
With these differences in mind, let's look at the specific case at hand. You have not specified what the update method does, so I have to make some assumptions.
The computations will always be the same so we only have to look at memory allocations and accesses.
Example 1:
void Foo1(Eigen::Ref<Eigen::VectorXd> x) {
    x = Eigen::VectorXd::LinSpaced(x.size(), 0., 1.);
}

Eigen::VectorXd Foo2(int n) {
    return Eigen::VectorXd::LinSpaced(n, 0., 1.);
}

x.segment(..., n) = Foo2(n);
Foo1 does one unaligned memory write. Foo2 does one allocation and one aligned memory write into the temporary vector. Then it copies to the segment. That will use one aligned memory read and an unaligned memory write. Therefore Foo1 is clearly better in all circumstances.
Example 2:
void Foo3(Eigen::Ref<Eigen::VectorXd> x)
{
    x = x * x.maxCoeff();
}

Eigen::VectorXd Foo4(const Eigen::Ref<const Eigen::VectorXd>& x)
{
    return x * x.maxCoeff();
}

Eigen::VectorXd Foo5(const Eigen::Ref<const Eigen::VectorXd>& x)
{
    Eigen::VectorXd rtrn = x;
    rtrn = rtrn * rtrn.maxCoeff();
    return rtrn;
}
Both Foo3 and Foo4 do two unaligned memory reads of x (one for the maxCoeff, one for the multiplication). After that, they behave the same as Foo1 and Foo2, so Foo3 is always better than Foo4.
Foo5 does one unaligned memory read and one aligned memory write for the initial copy, then two aligned reads and one aligned write for the computation. After that follows the copy outside the function (same as Foo2). This is still a lot more than what Foo3 does, but if you perform many more memory accesses on the vector, it may become worthwhile at some point. I doubt it, but cases may exist.
The main take-away is this: Since you ultimately want to store the results in segments of an existing vector, you can never fully escape the unaligned memory accesses. So it is not worth worrying about them too much.
Template vs. Ref
A quick rundown of the differences:
The templated version will (if written properly) work on all data types and all memory layouts. For example if you pass a full vector or matrix, it can exploit the alignment.
There are cases where Ref will simply not compile, or work differently than expected. As written above, Ref guarantees that the inner dimension is stored consecutively. The call Foo1(Matrix.row(1)) will not compile, because a matrix row is not stored consecutively in Eigen (with the default column-major storage). And if you call such a function with const Eigen::Ref<const Vector>&, Eigen will silently copy the row into a temporary vector.
The templated version will work in these cases, but of course it cannot vectorize.
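A small sketch of those cases (the function names here are made up for the example):

#include <Eigen/Dense>

void ByRef(Eigen::Ref<Eigen::VectorXd> x) { x.setZero(); }

void ByConstRef(const Eigen::Ref<const Eigen::VectorXd>& x) { double s = x.sum(); (void)s; }

template<class Derived>
void ByTemplate(const Eigen::MatrixBase<Derived>& x) { double s = x.sum(); (void)s; }

int main() {
    Eigen::MatrixXd m = Eigen::MatrixXd::Random(4, 4);
    // ByRef(m.row(1));   // does not compile: a row is not stored consecutively
    ByConstRef(m.row(1)); // compiles, but copies the row into a temporary vector
    ByTemplate(m.row(1)); // compiles without a copy, but cannot vectorize across the stride
}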
The Ref version has some benefits:
It is clearer to read and has fewer chances to go wrong with unexpected inputs.
You can put it in a .cpp file, and it creates less redundant code. Depending on your use case, more compact code may be more beneficial or appropriate.
[1] https://www.agner.org/optimize/optimizing_assembly.pdf

Related

Block Structure memory allocation for variables

for (int i = 0; i < 10; i++)
{
    int x = 0;
    printf("%d", x);
    {
        int x = 10;
        printf("%d", x);
    }
    printf("%d", x);
}
Here I want to know whether memory for the variable x will be allocated twice, or whether the value is just reset after exiting the second block and memory is allocated only once (for x)?
From the point of view of the C programming model, the two definitions of x are two completely different objects. The assignment in the inner block will not affect the value of x in the outer block.
Moreover, the definitions for each iteration of the loop count as different objects too. Assigning a value to either x in one iteration will not affect the x in subsequent iterations.
As far as real implementations are concerned, there are two common scenarios, assuming no optimisation is done. With optimisation turned on, the variables themselves are likely to disappear entirely, because it's quite easy for the compiler to compute the printed values at compile time.
The two common scenarios are
The variables are stored on the stack. In this scenario, the compiler will reserve a slot on the stack for the outer x and a slot on the stack for the inner x. In theory it ought to allocate the slots at the beginning of the scope and deallocate at the end of the scope, but that just wastes time, so it'll reuse the slots on each iteration.
The variables are stored in registers. This is the more likely option on modern 64-bit architectures. Again, the compiler ought to "allocate" (allocate is not really the right word) a register at the beginning of the scope and "deallocate" it at the end, but it'll just reuse the same registers in real life.
In both cases, you will note that the value from each iteration will be preserved to the next iteration because the compiler uses the same storage space. However, never do this
for (int i = 0; i < 10; ++i)
{
    int x;
    if (i > 0)
    {
        printf("Before %d\n", x); // UNDEFINED BEHAVIOUR
    }
    x = i;
    printf("After %d\n", x);
}
If you compile and run the above (with no optimisation), you'll probably find it prints sensible values, but each time you go round the loop, x is theoretically a completely new object, so the first printf accesses an uninitialised variable. This is undefined behaviour, so the program may give you the value from the previous iteration because it is using the same storage, or it may firebomb your house and sell your daughter into slavery.
Setting aside compiler optimization that might remove these unused variables, the answer is twice.
The second definition of x masks (technical term) the other definition in its scope following its declaration.
But the first definition is visible again after that scope.
So logically (forgetting about optimization) the first value of x (x=0) has to be held somewhere while x=10 is 'in play'. So two pieces of storage are (logically) required.
Execute the C program below. Typical partial output:
A0 x==0 0x7ffc1c47a868
B0 x==0 0x7ffc1c47a868
C0 x==10 0x7ffc1c47a86c
D0 x==0 0x7ffc1c47a868
A1 x==0 0x7ffc1c47a868
B1 x==0 0x7ffc1c47a868
C1 x==10 0x7ffc1c47a86c
//Etc...
Notice how only point C sees the variable x with value 10 and the variable with value 0 is visible again at point D. Also see how the two versions of x are stored at different addresses.
Theoretically the addresses could be different for each iteration, but I'm not aware of an implementation that actually does that, because it is unnecessary. However, if you made these non-trivial C++ objects, their constructors and destructors would get called on each loop iteration, though they would still reside at the same addresses (in practice); see the C++ sketch after the program below.
It is obviously confusing to human readers to hide variables like this and it's not recommended.
#include <stdio.h>

int main(void) {
    for (int i = 0; i < 10; i++)
    {
        int x = 0;
        printf("A%d x==%d %p\n", i, x, (void*)&x);
        {
            printf("B%d x==%d %p\n", i, x, (void*)&x);
            int x = 10;
            printf("C%d x==%d %p\n", i, x, (void*)&x);
        }
        printf("D%d x==%d %p\n", i, x, (void*)&x);
    }
}
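As a C++ illustration of the point about constructors and destructors, a minimal sketch (the Noisy type is made up): the object is constructed and destroyed on every iteration, yet in practice occupies the same address each time.

#include <cstdio>

struct Noisy {
    Noisy()  { std::printf("ctor at %p\n", static_cast<void*>(this)); }
    ~Noisy() { std::printf("dtor at %p\n", static_cast<void*>(this)); }
};

int main() {
    for (int i = 0; i < 3; ++i) {
        Noisy n;  // constructor runs here on every iteration
    }             // destructor runs here on every iteration, same address in practice
}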
This is an implementation-specific detail.
For example, during the code optimization stage the compiler might detect that the variables are not used, so no space will be allocated for them at all.
Even without that optimization, there can be cases where two separate storage locations are not allocated.
Note that braces don't necessarily imply separate memory or stack space; they define scope. A variable might also be kept in a CPU register.
So you can't say anything in general. What you can say is that the two variables have different scopes.
I would expect that most compilers will use memory on the stack for variables of this type, if any memory is needed at all. In some cases a CPU register might be used for one or both of the x's. Both will have their own storage, but it's compiler-dependent whether the lifetime of that storage is the same as the scope of the variables as declared in the source. So, for example, the memory used for the "inner" x might continue to be in use beyond the point at which that variable is out of scope -- this really depends on the compiler implementation.

Is there an easy and efficient way to dynamically change a data type once that data type's limit has been reached?

I would like to increase the size of a data type once that data type's limit has been reached. For example, let's say I have a class:
struct Counter {
    unsigned short x;
    void IncrementCount() { x++; }
};
Once x reaches the limit of unsigned short, I would like it to be promoted to an unsigned int instead of an unsigned short.
int main() {
    Counter c;
    c.x = ~0; // c.x is now the max unsigned short value, 2^16 - 1.
    c.IncrementCount(); // Desired: c.x becomes an unsigned int with value 2^16, NOT an unsigned short wrapping to 0.
}
The reason for me doing this is so that I store as little memory in my Counter class as possible.
Obviously I want as little impact to performance and readability as possible.
I've considered using a void pointer and casting to the appropriate data type (but how would I track the type without using extra memory?), changing the IncrementCount() method on my Counter class and creating a new Counter object (with an int instead of a short) when the unsigned short's limit is reached (but this will add some complexity, as well as an extra if on every increment). Maybe a bitmap which increases in size every time it needs an extra bit? Still adds complexity and some performance hit.
If I have 10 million Counters, I don't want to use an extra 2 bytes for an int (as that would require 20 million extra bytes). I know alignment might fix the issue for short -> int but I want this to also be valid for int32 -> int64 and other types. It is also worth noting that this is a runtime problem (I don't know the size at compile time).
Is there a simpler way in C++ to do this?
Data types in C and C++ have to be totally defined at compile time. You can't have a short variable that then gets promoted to an int.
Programming languages like Python attach the data type to values, not variables. That's the reason why you can do:
a = 1
a = "hi"
Because the data type is attached to the value 1 and then to the value "hi".
The price to pay for this is that the bookkeeping for "a" is usually high, with at least one pointer to a dynamically allocated block of memory, bookkeeping for dynamic memory allocation, and a type tag on every value identifying its data type. This mechanism makes it possible to determine data types at runtime, at the cost of lower runtime efficiency.
There are at least two ways to implement this. Historically this has been done as a variant data type. See here for more information. The alternative is an object-oriented way, where you have a base class Object with all possible operations. This is more or less what Python does, except in C instead of C++, using function pointers instead of virtual functions. For example:
using Object_ptr = class Object*;

class Object {
public:
    virtual ~Object() = default;
    virtual Object_ptr add(Object_ptr other) = 0;
    virtual int dataType() const = 0; // type tag used by the examples below
};
For example, the + operation for ints is the usual arithmetic add, where 2+2=4, but for strings it is concatenation, where "Hello " + "World" = "Hello World". Following the same logic we can do:
class IntObject : public Object {
public:
    int value;
    explicit IntObject(int v) : value(v) {}
    int dataType() const override { return DATATYPE_INT; }
    Object_ptr add(Object_ptr other) override {
        if (other->dataType() == DATATYPE_INT) {
            return new IntObject(value + static_cast<IntObject*>(other)->value);
        }
        raiseError("dataTypeError");
        return nullptr;
    }
};
Python also has the nice feature of arbitrary-size long integers. You can have a value with as many bits as you have memory. For example, in Python:
>>> 1<<128
340282366920938463463374607431768211456L
Internally, Python detects that the number won't fit in an int32/int64 and upgrades the data type at runtime, more or less like this:
class IntObject : public Object {
public:
    int value;
    Object_ptr add(Object_ptr other) override {
        if (other->dataType() == DATATYPE_INT) {
            if (operationWillOverflow(other)) {
                auto a = new LongIntObject(this);
                auto b = new LongIntObject(other);
                return a->add(b);
            }
            return new IntObject(value + static_cast<IntObject*>(other)->value);
        }
        raiseError("dataTypeError");
        return nullptr;
    }
};
The only way to do that is, as you said, to check on each increment whether the incremented value still fits the current type; if not, create a new object with, say, uint32 instead of uint16 and copy the old value over. However, in that case, if the objects are polymorphic they must contain a pointer to the virtual function table, which requires an additional 8 bytes of memory.
Sure, you could store the count in a different data structure rather than a fundamental type, but such structures usually carry extra bookkeeping that requires more memory (and memory is what matters in your case), so that is not an option. You also can't move the value to the free store, because a pointer is 32-64 bits (depending on the machine), which is the same as storing a uint64.
The point of all this is that you can't avoid copying, for the same reason you have to copy when reallocating memory for a vector. You can never know what is located next to your counter in memory, so there is no way to just "resize" a type in place.
How can you figure out that the counter value is too big for the current type and a new type is needed? Very simple: just check whether x + 1 equals 0. If it does, an overflow took place and the current type is too small.
To summarize, copying is the only way you can "resize" an object. And if you accept that, you will need to write some code for checking and copying, and you can't avoid the overhead of an overflow check on every increment.
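To make the check-and-copy idea concrete, here is a minimal sketch. The tagged-union layout is an assumption, not the asker's code, and note that the tag itself costs memory, which is exactly the trade-off discussed above.

#include <cstdint>
#include <cstdio>

// Hypothetical promoting counter: starts as a uint16_t, widens on overflow.
struct Counter {
    bool wide = false;  // tag: which union member is active (itself costs extra memory)
    union {
        std::uint16_t small;
        std::uint32_t big;
    };
    Counter() : small(0) {}
    void IncrementCount() {
        if (wide) { ++big; return; }
        if (static_cast<std::uint16_t>(small + 1) == 0) {  // x + 1 == 0: would wrap
            std::uint32_t old = small;
            big = old + 1;  // copy the old value into the wider type
            wide = true;
        } else {
            ++small;
        }
    }
};

int main() {
    Counter c;
    for (std::uint32_t i = 0; i < 70000; ++i) c.IncrementCount();
    std::printf("%u\n", static_cast<unsigned>(c.big));  // 70000, past the uint16_t limit
}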
C++ is a statically typed language. Every data structure has a compile-time size which cannot be changed.
You can have a class allocate more memory. But that is a separate allocation from the class's own storage (and as noted in the comments, it would be just as memory-costly, if not more so, than an int). A specific C++ class will only ever have a single size.
Have a look at your data storage policy for the 10 million values. There are two basic ways:
If you store objects in an array, their size and type are set in stone. That is Nicol's answer. It's what has been recommended in the comments: use the largest integer type needed and be done with it.
If you store pointers to your objects, you are free to replace an object with another type that has a larger value space. But you have a memory overhead built in from the start, probably two pointers per value (the second one being the hidden vtable pointer inside the polymorphic object). That doesn't make sense here.
If you want to change the underlying integer type of the same object, you must allocate the integer dynamically and hold a pointer to that memory in your object. That just moves the pointer (this time pointing to an integer of varying size) into the object, with the same overhead, plus the memory overhead of heap administration, which becomes dominant for tiny chunks like these.

Setting size of custom C++ container as template parameter vs constructor

I've written a fixed-size container (a ring buffer, to be exact) in C++. Currently I'm setting the size of the container in the constructor and then allocate the actual buffer on the heap. However, I've been thinking about moving the size parameter out of the constructor and into the template.
Going from this (a RingBuffer holding 100 integers)
RingBuffer<int> buffer(100);
to this
RingBuffer<int, 100> buffer;
This would allow me to allocate the whole buffer on the stack, which is faster than heap allocation, as far as I know. Mainly it's a matter of readability and maintainability though. These buffers often appear as members of classes. I have to initialize them with a size, so I have to initialize them in the initializer-list of every single constructor of the class. That means if I want to change the capacity of the RingBuffer I have to either remember to change it in every initializer-list or work with awkward static const int BUFFER_SIZE = 100; member variables.
My question is, is there any downside to specifying the container size as a template parameter as opposed to in the constructor? What are the pros and cons of either method?
As far as I know the compiler will generate a new type for each differently-sized RingBuffer. This could turn out to be quite a few. Does that hurt compile times much? Does it bloat the code or prevent optimizations? Of course I'm aware that much of this depends on the exact use case but what are the things I need to be aware of when making this decision?
My question is, is there any downside to specifying the container size as a template parameter as opposed to in the constructor? What are the pros and cons of either method?
If you give the size as a template parameter, then it needs to be a constant expression (known at compile time). Thus your buffer size cannot depend on any runtime characteristics (like user input).
Being a compile-time constant opens the door for some optimizations (loop unrolling and constant folding come to mind) to be more effective.
As far as I know the compiler will generate a new type for each differently-sized RingBuffer.
This is true. But I wouldn't worry about that, as having many different types per se won't have any impact on performance or code size (but probably on compile time).
Does that hurt compile times much?
It will make compilation slower. Though I doubt that in your case (this is a pretty simple template) this will even be noticeable. Thus it depends on your definition of "much".
Does it bloat the code or prevent optimizations?
Prevent optimizations? No. Bloat the code? Possibly. That depends on both how exactly you implement your class and what your compiler does. Example:
#include <array>
#include <cstddef>
#include <functional>

// Defined before Buffer so the call inside the template finds it.
void doIt(char const* data, std::size_t size, std::function<void(char)> f) {
    for (std::size_t i = 0; i < size; ++i) {
        f(data[i]);
    }
}

template<std::size_t N>
struct Buffer {
    std::array<char, N> data;

    void doSomething(std::function<void(char)> f) {
        for (std::size_t i = 0; i < N; ++i) {
            f(data[i]);
        }
    }

    void doSomethingDifferently(std::function<void(char)> f) {
        doIt(data.data(), N, f);
    }
};
doSomething might get compiled to (perhaps completely) unrolled loop code, and you'd have a Buffer<100>::doSomething, a Buffer<200>::doSomething and so on, each a possibly large function. doSomethingDifferently might get compiled to not much more than a simple jump instruction, so having multiple of those wouldn't be much of an issue. Though your compiler could also implement doSomething similarly to doSomethingDifferently, or the other way around.
So in the end:
Don't try to make this decision depend on performance, optimizations, compile time or code bloat. Decide what's more meaningful in your situation. Will there only ever be buffers with compile time known sizes?
Also:
These buffers often appear as members of classes. I have to initialize them with a size, so I have to initialize them in the initializer-list of every single constructor of the class.
Do you know "delegating constructors"?
As Daniel Jour already said, code bloat is not a huge issue and can be dealt with if needed.
The good thing about having the size as a compile-time constant is that it lets you detect some errors at compile time that would otherwise surface at runtime.
This would allow me to allocate the whole buffer on the stack, which is faster than heap allocation, as far as I know.
These buffers often appear as members of classes
This will happen only if the owning class is itself allocated in automatic memory, which is often not the case. Consider the following example:
struct A {
    int myArray[10];
};

struct B {
    B() : dynamic(new A()) {}
    A automatic;  // should be in the "stack"
    A* dynamic;   // should be in the "heap"
};

int main() {
    B b1;
    b1;                        // automatic memory
    b1.automatic;              // automatic memory
    b1.automatic.myArray;      // automatic memory
    b1.dynamic;                // automatic memory (the pointer itself)
    (*b1.dynamic);             // dynamic memory
    (*b1.dynamic).myArray;     // dynamic memory

    B* b2 = new B();
    b2;                        // automatic memory (the pointer itself)
    (*b2);                     // dynamic memory
    (*b2).automatic;           // dynamic memory
    (*b2).automatic.myArray;   // dynamic memory
    (*b2).dynamic;             // dynamic memory
    (*(*b2).dynamic).myArray;  // dynamic memory
}

C++ - Loop Efficiency: Storing temporary values of a class member vs. a pointer to this member

Within a class method, I'm accessing private attributes, or attributes of a member object. Moreover, I'm looping over these attributes.
I was wondering which is the most efficient way in terms of time (and memory):
copying the attributes and accessing the copy within the loop,
accessing the attributes directly within the loop,
or maybe using an iterator over the attribute.
I feel my question is related to : Efficiency of accessing a value through a pointer vs storing as temporary value. But in my case, I just need to access a value, not change it.
Example
Given two classes
class ClassA
{
public:
    vector<double> GetAVector() { return m_AVector; }
private:
    vector<double> m_AVector;
};
and
class ClassB
{
public:
    void MyFunction();
private:
    vector<double> m_Vector;
    ClassA m_A;
};
I. Should I do:
1.
void ClassB::MyFunction()
{
    vector<double> foo;
    for (int i = 0; i < ...; i++)
    {
        foo.push_back(SomeFunction(m_Vector[i]));
    }
    /// do something ...
}
2.
void ClassB::MyFunction()
{
    vector<double> foo;
    vector<double> VectorCopy = m_Vector;
    for (int i = 0; i < ...; i++)
    {
        foo.push_back(SomeFunction(VectorCopy[i]));
    }
    /// do something ...
}
3.
void ClassB::MyFunction()
{
    vector<double> foo;
    for (vector<double>::iterator it = m_Vector.begin(); it != m_Vector.end(); it++)
    {
        foo.push_back(SomeFunction(*it));
    }
    /// do something ...
}
II. What if I'm not looping over m_Vector but over m_A.GetAVector()?
P.S.: I understood from other posts that it's not useful to micro-optimize at first, but my question is more about what really happens and what should be done, as a matter of standards and coding style.
You're in luck: you can actually figure out the answer all by yourself, by trying each approach with your compiler and on your operating system, and timing each approach to see how long it takes.
There is no universal answer here, that applies to every imaginable C++ compiler and operating system that exists on the third planet from the sun. Each compiler, and hardware is different, and has different runtime characteristics. Even different versions of the same compiler will often result in different runtime behavior that might affect performance. Not to mention various compilation and optimization options. And since you didn't even specify your compiler and operating system, there's literally no authoritative answer that can be given here.
Although it's true that for some questions of this type it's possible to arrive at the best implementation with a high degree of certainty, for most use cases, this isn't one of them. The only way you can get the answer is to figure it out yourself, by trying each alternative yourself, profiling, and comparing the results.
I can categorically say that 2. is less efficient than 1. Copying to a local copy, and then accessing it like you would the original would only be of potential benefit if accessing a stack variable is quicker than accessing a member one, and it's not, so it's not (if you see what I mean).
Option 3. is trickier, since it depends on the implementation of the begin() method (and of end(), which may be called once per loop iteration) versus the implementation of the operator[] method. I could irritate some C++ die-hards and say there's an option 4: ask the vector for a pointer to its data and use a pointer or array index on that directly. That might just be faster than either!
And as for II, there is a double-indirection there. A good compiler should spot that and cache the result for repeated use - but otherwise it would only be marginally slower than not doing so: again, depending on your compiler.
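For case II specifically, since GetAVector() in the question returns by value, a hedged alternative is to call it once and reuse the result, rather than relying on the compiler to cache it (this reuses the question's classes; SomeFunction remains a placeholder):

void ClassB::MyFunction()
{
    vector<double> foo;
    const vector<double> a = m_A.GetAVector();  // one copy, made once before the loop
    for (size_t i = 0; i < a.size(); i++)
    {
        foo.push_back(SomeFunction(a[i]));
    }
    /// do something ...
}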
Without optimizations, option 2 would be slower on every imaginable platform, because it incurs a copy of the vector, and the access time is identical for a local variable and a class member.
With optimization, depending on SomeFunction, performance might be the same or worse for option 2. The performance would be the same if SomeFunction is either visible to the compiler as not modifying its argument, or its signature guarantees that the argument will not be modified; in that case the compiler can optimize away the copy altogether. Otherwise, the copy will remain.

Cache performance of vectors, matrices and quaternions

I've noticed on a number of occasions in the past, C and C++ code that uses the following format for these structures:
class Vector3
{
    float components[3];
    // etc.
};

class Matrix4x4
{
    float components[16];
    // etc.
};

class Quaternion
{
    float components[4];
    // etc.
};
My question is, will this lead to any better cache performance than, say, this:
class Quaternion
{
    float x;
    float y;
    float z;
    // etc.
};
...Since I'd assume the class members and functions are in contiguous memory space anyway? I currently use the latter form because I find it more convenient (however, I can also see the practical sense in the array form, since it allows one to treat axes as arbitrary, dependent on the operation being performed).
After taking some advice from the respondents, I tested the difference and it is actually slower with the array -- I get about a 3% difference in framerate. I implemented operator[] to wrap the array access inside Vector3. Not sure if this has anything to do with it, but I doubt it, since that should be inlined anyway. The only factor I could see was that I could no longer use a constructor initializer list for Vector3(x, y, z). However, when I took the original version and changed it to no longer use constructor initializer lists, it ran very marginally slower than before (less than 0.05%). No clue, but at least now I know the original approach was faster.
These declarations are not equivalent with respect to memory layout.
class Quaternion
{
    float components[4];
    // etc.
};
The above guarantees that the elements are contiguous in memory, while, if they are individual members as in your last example, the compiler is allowed to insert padding between them (for instance, to align the members with certain address patterns).
Whether or not this results in better or worse performance depends mostly on your compiler, so you'd have to profile it.
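If code is written to rely on the member version being laid out exactly like the array version, one common guard is a compile-time check. This is a general technique, not something from the original answer:

#include <cstddef>

struct Quaternion {
    float x, y, z, w;
};

// Compilation fails if the compiler inserted padding between the members.
static_assert(sizeof(Quaternion) == 4 * sizeof(float),
              "Quaternion is not packed like float[4]");
static_assert(offsetof(Quaternion, w) == 3 * sizeof(float),
              "unexpected member offset");

int main() {}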
I imagine the performance difference from an optimization like this is minimal. I would say something like this falls into premature optimization for most code. However, if you plan to do vector processing over your structs, say by using CUDA, struct composition makes an important difference. Look at page 23 on this if interested: http://www.eecis.udel.edu/~mpellegr/eleg662-09s/li.pdf
I am not sure if the compiler manages to optimize code better when using an array in this context (think of unions, for example), but when using APIs like OpenGL, it can be an optimisation when calling functions like
void glVertex3fv(const GLfloat* v);
instead of calling
void glVertex3f(GLfloat x, GLfloat y, GLfloat z);
because, in the latter case, each parameter is passed by value, whereas in the first example only a pointer to the whole array is passed, and the function can decide what to copy and when, thereby reducing unnecessary copy operations.
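As a sketch of that point, assuming the question's array-based Vector3 with an accessible components member:

#include <GL/gl.h>

struct Vector3 {
    float components[3];
};

void emitVertex(const Vector3& v) {
    glVertex3fv(v.components);  // passes a single pointer to the whole array
    // versus passing three floats by value:
    // glVertex3f(v.components[0], v.components[1], v.components[2]);
}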