What does this structure actually do? - c++

I found this structure code in a Julia Set example from a book on CUDA. I'm a newbie C programmer and cannot get my head around what it's doing, nor have I found the right thing to read on the web to clear it up. Here's the structure:
struct cuComplex {
    float r;
    float i;
    cuComplex( float a, float b ) : r(a), i(b) {}
    float magnitude2( void ) { return r * r + i * i; }
    cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
    }
};
and it's called very simply like this:
cuComplex c(-0.8, 0.156);
cuComplex a(jx, jy);
int i = 0;
for (i=0; i<200; i++) {
    a = a * a + c;
    if (a.magnitude2() > 1000)
        return 0;
}
return 1;
So, the code did what? Defined something of structure type 'cuComplex' giving the real and imaginary parts of a number. (-0.8 & 0.156) What is getting returned? (Or placed in the structure?) How do I work through the logic of the operator stuff in the struct to understand what is actually calculated and held there?
I think that it's probably doing recursive calls back into the structure
float magnitude2 (void) { return r * r + i * i; }
probably calls the '*' operator for r and again for i, and then the results of those two operations call the '+' operator? Is this correct and what gets returned at each step?
Just plain confused.
Thanks!

r and i are members of the struct declared as float. The expression in the magnitude2 function simply does standard float arithmetic with the values stored in those members.
The operator functions defined in the struct are used when the operators * and + are applied to variables of the struct type, for instance in the line a = a * a + c.
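For example, using the cuComplex struct from the question (the values here are made up, just to trace through the arithmetic):
cuComplex a(1.0f, 2.0f);  // 1 + 2i
cuComplex c(0.5f, 0.25f); // 0.5 + 0.25i
cuComplex t = a * a;      // calls a.operator*(a): t.r = 1*1 - 2*2 = -3, t.i = 2*1 + 1*2 = 4
cuComplex u = t + c;      // calls t.operator+(c): u.r = -3 + 0.5 = -2.5, u.i = 4 + 0.25 = 4.25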

It's a C++ implementation of a complex number, providing a method to return the magnitude and operator overloads for addition and multiplication. The real (r) and imaginary (i) parts are stored separately.
a * a + c calls the overloaded methods: (a.operator*(a)).operator+(c)
It appears you have very little grasp of even C syntax (return r * r + i * i; returns r times r plus i times i), so I suggest you read a beginner's introduction to C/C++, followed by an introduction to complex numbers and then read up on operator overloading in C++.

This is not a simple struct but a class (which is basically a struct with member functions), and it is C++.
cuComplex c(-0.8, 0.156);
Here he creates an instance of this class and sets the two values by calling the constructor (a special function that initializes the instance fields of the class).
This probably won't make enough sense so I suggest you study a C++ book. Accelerated C++ is a good choice if you already know some programming.

The multiplication operator takes the real and imaginary parts of the argument a, combines them with the real and imaginary parts of the object the operator is called on according to the rule for complex multiplication, and returns a new complex number object holding the result. I've added the this pointer to clarify:
cuComplex operator*(const cuComplex& a) {
    // first constructor argument is the real part, second is the imaginary part
    return cuComplex(this->r*a.r - this->i*a.i, this->i*a.r + this->r*a.i);
}
The same goes for the addition operator. Again a new object of type cuComplex is created and returned, this time with its real and imaginary parts being the sums of the respective fields of this object and the argument.
cuComplex operator+(const cuComplex& a) {
    return cuComplex(this->r+a.r, this->i+a.i);
}
As for the loop, the complex number a is multiplied by itself (resulting in a rotation and scaling in the Re-Im plane) and the constant complex number c is added, in each iteration, until the magnitude (length) of the result exceeds a certain threshold.
Note that operator*, operator+ and the function magnitude2() are all members of the structure cuComplex, and thus the this pointer is available.
Hope that helps.

Like you said, cuComplex holds two values for the real (r) and imaginary (i) parts of a number.
The constructor simply assigns the values to r and i.
The * and + operators work on cuComplex numbers; they will only be called if you combine two cuComplex instances.
They are simply there to simplify your code. Without the operators you would have to do the add operation yourself:
cuComplex c(-0.8, 0.156);
cuComplex a(jx, jy);
// Add, both operations are equivalent.
cuComplex result1 = cuComplex(c.r + a.r, c.i + a.i);
cuComplex result2 = c + a;
As for the code
cuComplex c(-0.8, 0.156); // Create cuComplex instance called c
cuComplex a(jx, jy); // Create cuComplex instance called a
int i = 0;
for (i=0; i<200; i++) {
    a = a * a + c; // Use the * and + operator of cuComplex
    if (a.magnitude2() > 1000)
        return 0;
}
return 1;
I think that it's probably doing
recursive calls back into the structure
float magnitude2 (void) { return r * r + i * i; }
It is not, since r and i are float. The * and + operators are overloaded for cuComplex, not for float.

You should tag this question as C++, not C (even if you have a struct, this one has a constructor and redefines operators which are typical object-oriented concepts).
This structure defines complex numbers, and lets you multiply them (via operator*), add them (via operator+) and get their squared modulus (via magnitude2).
At the beginning you have one constant complex number, c, and a, another complex number which is not constant, given by the user, probably via the coordinates jx and jy. At each iteration of the loop, a is multiplied by itself and c is added to the result.
If at some point the squared modulus of a exceeds 1000, the loop ends. I guess this is a test program to see how the modulus of a grows according to the initial conditions given by the user.

If you are familiar with the concept of classes, replace the word "struct" with "class"; it makes it much easier to understand.
The "class" contains two variables r and i, a constructor that takes two float args, an operator to multiply, an operator to add, and a function to calculate the magnitude.

In C++ simply use std::complex<float>.
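For comparison, a minimal sketch of the same iteration written with std::complex<float> (the function name julia_point is my own; jx and jy stand for whatever coordinates the caller passes in):
#include <complex>

int julia_point(float jx, float jy) {
    std::complex<float> c(-0.8f, 0.156f);
    std::complex<float> a(jx, jy);
    for (int i = 0; i < 200; i++) {
        a = a * a + c;               // complex multiply and add, just like the overloads above
        if (std::norm(a) > 1000.0f)  // std::norm is the squared magnitude, like magnitude2()
            return 0;                // the point escapes
    }
    return 1;                        // the point stayed bounded
}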


A polynomial class

Here is the problem I am trying to solve:
Using dynamic arrays, implement a polynomial class with polynomial addition,
subtraction, and multiplication. Discussion: A variable in a polynomial does nothing but act as a placeholder for
the coefficients. Hence, the only interesting thing about polynomials is the array
of coefficients and the corresponding exponent. Think about the polynomial
x*x*x + x + 1
Where is the term in x*x? One simple way to implement the polynomial class is to
use an array of doubles to store the coefficients. The index of the array is the
exponent of the corresponding term. If a term is missing, then it simply has a zero
coefficient.
There are techniques for representing polynomials of high degree with many missing
terms. These use so-called sparse matrix techniques. Unless you already know
these techniques, or learn very quickly, do not use these techniques.
Provide a default constructor, a copy constructor, and a parameterized constructor
that enables an arbitrary polynomial to be constructed.
Supply an overloaded operator = and a destructor.
Provide these operations:
polynomial + polynomial, constant + polynomial, polynomial + constant,
polynomial - polynomial, constant - polynomial, polynomial - constant.
polynomial * polynomial, constant * polynomial, polynomial * constant,
Supply functions to assign and extract coefficients, indexed by exponent.
Supply a function to evaluate the polynomial at a value of type double .
You should decide whether to implement these functions as members, friends, or standalone functions.
This is not for a class, I am just trying to teach myself C++ because I need it as I will start my graduate studies in financial mathematics at FSU this fall. Here is my code thus far:
class Polynomial
{
private:
    double *coefficients; //this will be the array where we store the coefficients
    int degree; //this is the degree of the polynomial (i.e. one less than the length of the array of coefficients)
public:
    Polynomial(); //the default constructor to initialize a polynomial equal to 0
    Polynomial(double coeffs[], int nterms); //the constructor to initialize a polynomial with the given coefficient array and degree
    Polynomial(Polynomial&); //the copy constructor
    Polynomial(double); //the constructor to initialize a polynomial equal to the given constant
    ~Polynomial() { delete[] coefficients; } //the destructor to clean up the allocated memory
    //the operations to define for the Polynomial class
    Polynomial operator+(Polynomial p) const;
    Polynomial operator-(Polynomial p) const;
    Polynomial operator*(Polynomial p) const;
};
//This is the default constructor
Polynomial::Polynomial() {
    degree = 0;
    coefficients = new double[degree + 1];
    coefficients[0] = 0;
}
//Initialize a polynomial with the given coefficient array and degree
Polynomial::Polynomial(double coeffs[], int nterms) {
    degree = nterms;
    coefficients = new double[degree]; //array to hold coefficient values
    for(int i = 0; i < degree; i++)
        coefficients[i] = coeffs[i];
}
Polynomial::Polynomial(Polynomial&) {
}
//The constructor to initialize a polynomial equal to the given constant
Polynomial::Polynomial(double) {
}
Polynomial Polynomial::operator*(Polynomial p) const {
}
Polynomial Polynomial::operator+(Polynomial p) const {
}
Polynomial Polynomial::operator-(Polynomial p) const {
}
I am just wondering if I am on the right track, if there is a better way of doing this please let me know. Any comments or suggestions would be greatly appreciated.
This is not a full answer but a starting point for you. I used std::set because it keeps its elements ordered, so I implemented a comparison functor and used it for my set. Now, elements in the set will be sorted based on my comparison functor. In the current implementation, terms will be ordered in descending order of exponent.
#include <set>

struct Term
{
    int coefficient;
    int exponent;
    Term(int coef, int exp) : coefficient{ coef }, exponent{ exp } {}
};

struct TermComparator
{
    bool operator()(const Term& lhs, const Term& rhs) const {
        return lhs.exponent > rhs.exponent; // descending order of exponents
    }
};

class Polynomial
{
private:
    std::set<Term, TermComparator> terms;
public:
    Polynomial();
    ~Polynomial();
    Polynomial operator+(Polynomial p);
};
My implementation lets you store sparse, high-order polynomials efficiently.
I have implemented addition for you. It is not in the best shape in terms of OOP, but you can refactor it.
Polynomial Polynomial::operator+(Polynomial p)
{
    auto my_it = terms.begin();
    auto p_it = p.terms.begin();
    Polynomial result;
    while (my_it != terms.end() && p_it != p.terms.end())
    {
        if (my_it->exponent > p_it->exponent)
        {
            result.terms.insert(*my_it);
            ++my_it;
        }
        else if (my_it->exponent == p_it->exponent)
        {
            result.terms.insert(Term(my_it->coefficient + p_it->coefficient, my_it->exponent));
            ++my_it;
            ++p_it;
        }
        else
        {
            result.terms.insert(*p_it);
            ++p_it;
        }
    }
    //only one of the for loops will be effective
    for (; my_it != terms.end(); ++my_it) result.terms.insert(*my_it);
    for (; p_it != p.terms.end(); ++p_it) result.terms.insert(*p_it);
    return result;
}
State of your code
Your code is correct so far, if by nterms you mean the maximum degree of your polynomial.
One simple way to implement the polynomial class is to use an array of doubles to store the coefficients.
This is what you did
The index of the array is the exponent of the corresponding term
This is why I told you your array size should be equal to the degree + 1.
This way you access the coefficient (the value in your array) using its exponent (which acts as the key).
If a term is missing, then it simply has a zero coefficient.
Notice that in the example given, x² doesn't exist, but coefficients[2] exists in your code and is equal to 0
Provide these operations: polynomial + polynomial, constant + polynomial, polynomial + constant, polynomial - polynomial, constant - polynomial, polynomial - constant. polynomial * polynomial, constant * polynomial, polynomial * constant, Supply functions to assign and extract coefficients, indexed by exponent. Supply a function to evaluate the polynomial at a value of type double
As you mentioned, you are still missing some of those operator overloads.
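As an illustration of one of the missing pieces, here is a minimal sketch of the evaluation function the assignment asks for, assuming coefficients[k] holds the coefficient of x^k and the array has degree + 1 entries as discussed above (the name evaluate and its signature are my choices, and it would still need to be declared in your class):
// Evaluate a[degree]*x^degree + ... + a[1]*x + a[0] using Horner's method.
double Polynomial::evaluate(double x) const
{
    double result = 0.0;
    for (int i = degree; i >= 0; i--)           // start from the highest exponent
        result = result * x + coefficients[i];  // fold in one coefficient per step
    return result;
}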
To go further
Here's a non-exhaustive list of what could be done to get some more experience with C++ when you are done with this exercise:
- You could implement an expression parser
- Handle more complex polynomials (e.g. x² + y² + 2xy + 1)
- Use a map to store your coefficients (a map may not count as a dynamic array for this exercise, but it could be fun to play with) or another data structure to get rid of the zeros in your coefficients (cf. sparse matrix/array techniques); see the sketch after this list
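For example, here is a minimal sketch of the map idea (the class and method names are my own, just to illustrate; this is not part of the exercise):
#include <cmath>
#include <map>

// Sparse polynomial: only non-zero coefficients are stored, keyed by exponent.
class SparsePolynomial
{
    std::map<int, double> coeffs; // exponent -> coefficient
public:
    void setCoefficient(int exponent, double value) { coeffs[exponent] = value; }

    double evaluate(double x) const
    {
        double result = 0.0;
        for (const auto& term : coeffs)                      // term.first = exponent, term.second = coefficient
            result += term.second * std::pow(x, term.first);
        return result;
    }
};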
Have fun in your future studies!

C++: Why accessing class data members is so slow compared to accessing global variables?

I'm implementing a computationally expensive program and over the last few days I have spent a lot of time getting familiar with object-oriented design, design patterns and SOLID principles. I need to implement several metrics in my program, so I designed a simple interface to get it done:
class Metric {
public:
    typedef ... Vector;
    virtual ~Metric() {}
    virtual double distance(const Vector& a, const Vector& b) const = 0;
};
the first metric I implemented was the Minkowski metric,
class MinkowskiMetric : public Metric {
public:
    MinkowskiMetric(double p) : p(p) {}
    double distance(const Vector& a, const Vector& b) const {
        const double POW = this->p; /** hot spot */
        return std::pow((std::pow(std::abs(a - b), POW)).sum(), 1.0 / POW);
    }
private:
    const double p;
};
Using this implementation the code ran really slowly, so I tried a global variable instead of accessing the data member. My latest implementation doesn't really get the job done, but it looks like this:
namespace parameters {
    const double p = 2.0; /** for instance */
}
And the hot spot line looks like:
...
const double POW = parameters::p; /** hot spot */
return ...
Just making that change, the code runs at least 275 times faster in my machine, using either gcc-4.8 or clang-3.4 with optimization flags in Ubuntu 14.04.1.
Is this problem a common pitfall?
Is there any way around it?
Am I just missing something?
The difference between the two version is that in one case, the compiler has to load p and perform some computation with it, while in the other, you're using a global constant, which the compiler can probably just substitute directly. So in one case, the resulting code probably does this:
Load p.
Call abs(a - b), name the result c
Call pow(c, p), name the result d
Call d.sum() (whatever that means), name the result e
Calculate 1.0 / p, name the result i
Call pow(e, i).
That's a bunch of library calls, and library calls are slow. Also, pow is slow.
When you use the global constant, the compiler can do some calculations by itself.
Call abs(a - b), name the result c.
pow(c, 2.0) is more efficiently calculated as c * c, name the result d
Call d.sum(), name the result e
1.0 / 2.0 is 0.5, and pow(e, 0.5) can be translated to the more efficient sqrt(e).
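To make that concrete, here is a rough sketch of what the constant-p version effectively boils down to, using std::valarray<double> as a stand-in for the question's Vector typedef (the function name and the valarray choice are my assumptions):
#include <cmath>
#include <valarray>

using Vector = std::valarray<double>;

// Roughly what the compiler can generate once it knows p == 2.0 at compile time.
double euclidean_distance(const Vector& a, const Vector& b)
{
    const Vector d  = std::abs(a - b); // element-wise absolute difference
    const Vector sq = d * d;           // pow(x, 2.0) folded to x * x
    return std::sqrt(sq.sum());        // pow(sum, 0.5) folded to sqrt(sum)
}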
Let's have a look at what is going on here:
...
Metric *metric = new MinkowskiMetric(2.0);
metric->distance(a, b);
Since distance is a virtual function, the runtime has to dereference the metric pointer to load the virtual function table pointer, and then use that to look up the address of the distance function for your object.
This is probably incidental to what is happening next:
double distance(const Vector& a, const Vector& b) const {
const double POW = this->p; /** hot spot */
The function then has to follow the this pointer (which happens to be written out explicitly here) to know from which location to load the value of p. Compare that to the version which uses a global variable:
double distance(const Vector& a, const Vector& b) const {
const double POW = parameters::p; /** hot spot */
...
namespace parameters {
const double p = 2.0; /** for instance */
}
This version of p always lives at the same address, so loading its value is only ever a single operation. It removes a level of indirection which is almost certainly causing a cache miss and making the CPU stall while waiting for data to be loaded from RAM.
So how can you avoid this? Try to allocate objects on the stack as much as possible. This gives you a form of locality of reference known as spatial locality, which means your data is much more likely to already be in the CPU's cache when it needs to be loaded. You can see Herb Sutter discussing this issue in the middle of this talk.
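For instance, in the context of the classes above, keeping the metric object on the stack looks roughly like this (a and b stand for the two vectors from your hot loop, and whether you can drop the base-class pointer depends on how the rest of your program selects the metric):
// Heap allocation behind a base-class pointer: an extra indirection plus a
// virtual dispatch on every call.
// Metric *metric = new MinkowskiMetric(2.0);
// double d = metric->distance(a, b);

// Stack object with a concrete type: better locality, and the compiler can
// devirtualise and inline the call.
MinkowskiMetric metric(2.0);
double d = metric.distance(a, b);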
If you want to use OOP in code that should be somewhat performant, you'll still have to minimise the number of memory accesses. This means a change in design. Taking your example (assuming you're evaluating the metric many times):
double MinkowskiMetric::distance(const Vector& a, const Vector& b) const {
    const double POW = this->p; /** hot spot */
    return std::pow((std::pow(std::abs(a - b), POW)).sum(), 1.0 / POW);
}
can be turned into
template<class VectorIter, class OutIter>
void MinkowskiMetric::distance(VectorIter aBegin, VectorIter aEnd, VectorIter bBegin, OutIter rBegin) const {
    const double pow = this->p, powInv = 1.0 / pow;
    while(aBegin != aEnd) {
        Vector a = *aBegin++;
        Vector b = *bBegin++;
        *rBegin++ = std::pow((std::pow(std::abs(a - b), pow)).sum(), powInv);
    }
}
Now you'll access the location of the virtual function and the members of this exactly once for a set of Vector pairs - adjust your algorithm accordingly to make use of this optimisation.

Error: Expression must be a modifiable lvalue

I have been getting this error come up in the for loop when I try to assign values to x_dev, y_dev, and pearson. As far as I can see they should all be modifiable. Can anyone see where I have gone wrong?
class LoopBody
{
    double *const x_data;
    double *const y_data;
    double const x_mean;
    double const y_mean;
    double x_dev;
    double y_dev;
    double pearson;
public:
    LoopBody(double *x, double *y, double xmean, double ymean, double xdev, double ydev, double pear)
        : x_data(x), y_data(y), x_mean(xmean), y_mean(ymean), x_dev(xdev), y_dev(ydev), pearson(pear) {}

    void operator() (const blocked_range<size_t> &r) const {
        for(size_t i = r.begin(); i != r.end(); i++)
        {
            double x_temp = x_data[i] - x_mean;
            double y_temp = y_data[i] - y_mean;
            x_dev += x_temp * x_temp;
            y_dev += y_temp * y_temp;
            pearson += x_temp * y_temp;
        }
    }
};
Having followed @Bathsheba's advice I have overcome these problems. However, when running a parallel_for, operator() runs but the for loop is never entered.
This is where I call the parallel_for:
parallel_for(blocked_range<size_t>(0,n), LoopBody(x, y, x_mean, y_mean, x_dev, y_dev, pearson), auto_partitioner());
The () operator is marked const, and you're attempting to modify class member data (e.g. x_dev, y_dev and pearson). That is not allowed and is why you're getting the compile-time error.
You probably want to drop the const from the method.
Alternatively you can mark the member data that you want to modify as mutable, but this is not the preferred solution as it makes code brittle, difficult to read and can wreak havoc with multi-threading.
Seemingly you want to do a reduction, i.e. compute some aggregate values over the data.
For that, TBB offers a special function template: parallel_reduce. Unlike parallel_for, which you are perhaps using now, parallel_reduce does not require operator() of a body class to be const, because an instance of that class accumulates partial results. However, it imposes other requirements on the class: it needs a special splitting constructor as well as a method to merge partial results from another body instance.
More information can be found in the Intel(R) TBB User Guide: http://www.threadingbuildingblocks.org/docs/help/tbb_userguide/parallel_reduce.htm
Also there is an overload of parallel_reduce which takes two functors - one for body and another one for merging partial results - as well as a special "identity" value used to initialize accumulators. But you are computing three aggregate values at once, so you would still need to have a struct or class to store all three values in a single variable.
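To make that concrete, here is a minimal sketch of what such a reduction body could look like for your three sums, reusing the arrays and means from your question (the class name SumsBody and the zero-initialised accumulators are my assumptions, not code from the TBB documentation):
#include <tbb/blocked_range.h>
#include <tbb/parallel_reduce.h>

struct SumsBody {
    const double *x_data, *y_data;
    double x_mean, y_mean;
    double x_dev, y_dev, pearson;              // partial results live in the body

    SumsBody(const double *x, const double *y, double xm, double ym)
        : x_data(x), y_data(y), x_mean(xm), y_mean(ym),
          x_dev(0.0), y_dev(0.0), pearson(0.0) {}

    // splitting constructor: a fresh body with zeroed accumulators
    SumsBody(SumsBody &other, tbb::split)
        : x_data(other.x_data), y_data(other.y_data),
          x_mean(other.x_mean), y_mean(other.y_mean),
          x_dev(0.0), y_dev(0.0), pearson(0.0) {}

    // note: not const, unlike a parallel_for body
    void operator()(const tbb::blocked_range<size_t> &r) {
        for (size_t i = r.begin(); i != r.end(); ++i) {
            double xt = x_data[i] - x_mean;
            double yt = y_data[i] - y_mean;
            x_dev   += xt * xt;
            y_dev   += yt * yt;
            pearson += xt * yt;
        }
    }

    // merge partial results computed by another body instance
    void join(const SumsBody &rhs) {
        x_dev   += rhs.x_dev;
        y_dev   += rhs.y_dev;
        pearson += rhs.pearson;
    }
};
Usage would then be along the lines of SumsBody body(x, y, x_mean, y_mean); tbb::parallel_reduce(tbb::blocked_range<size_t>(0, n), body); after which body.x_dev, body.y_dev and body.pearson hold the totals.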

Core Dumped While Multiplying Iteratively

I am trying to do something very simple. I have a class for functions, and a class for polynomials derived from the function class. In the polynomial, I am overloading the *= operator. But, when I invoke this operator, the program dumps the core and crashes.
Polynomial& Polynomial::operator*= (double c)
{
    for(int i = 0; i <= degree; i++)
        a[i] = a[i] * c;
    return *this;
}
The polynomial class holds the coefficients in array a. The index of a directly relates to the power of x for that particular coefficient. Function main hands us the constant c, which we then multiply each coefficient by.
The prototype for the function is part of an assignment, or I would change it. I'm assuming there's something I'm doing wrong with respect to the return type. Any help is appreciated.
I am willing to provide more code if requested.
The return type is fine, I'm guessing the problem is i <= degree instead of i < degree. Arrays in C++ are 0-based.
EDIT: or perhaps you want to keep that as <= for consistency with the polynomial, in which case you need to allocate degree+1 items for your array.
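For example, a minimal sketch of a constructor that makes the <= loop safe, assuming the coefficient array is called a as in your snippet (this constructor is hypothetical, not taken from your code):
Polynomial::Polynomial(int deg)
{
    degree = deg;
    a = new double[degree + 1];       // indices 0..degree are all valid now
    for (int i = 0; i <= degree; i++) // same <= convention as operator*=
        a[i] = 0.0;
}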

Efficient operator+

I have to compute large sums of 3D vectors, and a comparison of using a vector class with overloaded operator+ and operator* versus summing up separate components shows a performance difference of about a factor of three. I now assume the difference must be due to the construction of temporary objects in the overloaded operators.
How can one avoid the construction and improve performance?
I'm especially puzzled, because the following is AFAIK basically the standard way to do it, and I would expect the compiler to optimize this. In real life, the sums are not going to be done within a loop but in quite large expressions (several tens of MBs in total per executable) summing up different vectors; this is why operator+ is used below.
class Vector
{
    double x, y, z;
    ...
    Vector&
    operator+=(const Vector &v)
    {
        x += v.x;
        y += v.y;
        z += v.z;
        return *this;
    }
    Vector
    operator+(const Vector &v)
    {
        return Vector(*this) += v; // bad: construction and copy(?)
    }
    ...
};
// comparison
double xx[N], yy[N], zz[N];
Vector vec[N];
// assume xx, yy, zz and vec are properly initialized

Vector sum(0, 0, 0);
for(int i = 0; i < N; ++i)
{
    sum = sum + vec[i];
}

// this is a factor 3 faster than the above loop
double sumxx = 0;
double sumyy = 0;
double sumzz = 0;
for(int i = 0; i < N; ++i)
{
    sumxx = sumxx + xx[i];
    sumyy = sumyy + yy[i];
    sumzz = sumzz + zz[i];
}
Any help is greatly appreciated.
EDIT:
Thank you all for your great input, I have the performance now at the same level.
@Dima's and especially @Xeo's answers did the trick. I wish I could mark more than one answer "accepted". I'll test some of the other suggestions too.
This article has a really good discussion of how to optimize operators such as +, -, *, /.
Implement the operator+ as a free function like this in terms of operator+=:
Vector operator+(Vector lhs, Vector const& rhs){
return lhs += rhs;
}
Notice how the lhs Vector is a copy and not a reference. This allows the compiler to make optimizations such as copy elision.
The general rule that article conveys: if you need a copy, make it in the parameters, so the compiler can optimize. The article doesn't use this exact example; it uses operator= and the copy-and-swap idiom.
Why not replace
sum = sum + vec[i];
with
sum += vec[i];
... that should eliminate two calls to the copy constructor and one call to the assignment operator for each iteration.
But as always, profile and know where the expense is coming from instead of guessing.
You might be interested in expression templates.
I implemented most of the optimizations being proposed here and compared them with the performance of a function call like
void Vector::isSumOf(Vector v1, Vector v2)
{
    x = v1.x + v2.x;
    ...
}
Repeatedly executing the same loop with a few billion vector summations for every method, in alternating order, did not result in the promised gains.
In the case of the member function posted by bbtrb, that method took 50% more time than the isSumOf() function call.
The free, non-member operator+ (Xeo's) method needed up to double the time (100% more) of the isSumOf() function.
(gcc 4.6.3 -O3)
I am aware that this was not representative testing, but since I could not reproduce any performance gains by using operators at all, I suggest avoiding them if possible.
Usually, operator + looks like:
return Vector (x + v.x, y + v.y, z + v.z);
with a suitably defined constructor. This allows the compiler to do return value optimisation.
But if you're compiling for IA32, then SIMD would be worth considering, along with changes to the algorithms to take advantage of the SIMD nature. Other processors may have SIMD style instructions.
I think the difference in performance is caused by the compiler optimization here. Adding up elements of arrays in a loop can be vectorized by the compiler. Modern CPUs have instructions for adding multiple numbers in a single clock tick, such as SSE, SSE2, etc. This seems to be a likely explanation for the factor of 3 difference that you are seeing.
In other words, adding corresponding elements of two arrays in a loop may generally be faster than adding corresponding members of a class. If you represent the vector as an array inside your class, rather than x, y, and z, you may get the same speedup for your overloaded operators.
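For instance, a rough sketch of what that array-based layout could look like (the member name v and this exact shape are my own, just to illustrate the idea):
class Vector
{
    double v[3]; // components stored contiguously instead of separate x, y, z members
public:
    Vector(double x = 0.0, double y = 0.0, double z = 0.0)
    {
        v[0] = x; v[1] = y; v[2] = z;
    }
    Vector& operator+=(const Vector& o)
    {
        for (int k = 0; k < 3; ++k) // a tight loop the compiler has a chance to vectorize
            v[k] += o.v[k];
        return *this;
    }
};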
Are the implementations of your Vector operator functions directly in the header file, or are they in a separate cpp file? In the header file they would typically be inlined in an optimized build. But if they are compiled in a different translation unit, then they often won't be (depending on your build settings). If the functions aren't inlined, then the compiler won't be able to do the type of optimization you are looking for.
In cases like these, have a look at the disassembly. Even if you don't know much about assembly code it's usually pretty easy to figure out what's different in simple cases like these.
Actually, if you look at any real matrix code, operator+ and operator+= don't do that.
Because of the copying involved, they introduce a pseudo object into the expression and only do the real work when the assignment is executed. Using lazy evaluation like this also allows null operations (no-ops) to be removed during expression evaluation:
class Matrix;
class MatrixOp
{
public:
    virtual void DoOperation(Matrix& resultInHere) = 0;
};
class Matrix
{
public:
    void operator=(MatrixOp* op)
    {
        // No copying has been done.
        // You have built an operation tree.
        // Now you are going to evaluate the expression and put the
        // result into *this
        op->DoOperation(*this);
    }
    MatrixOp* operator+(Matrix& rhs)   { return new MatrixOpPlus(*this, rhs); }
    MatrixOp* operator+(MatrixOp* rhs) { return new MatrixOpPlus(*this, rhs); }
    // etc
};
Of course this is a lot more complex than I have portrayed here in this simplified example. But if you use a library that has been designed for matrix operations then it will have already been done for you.
Your Vector implementation:
Implement the operator+() like this:
Vector
Vector::operator+(const Vector &v)
{
    return Vector(x + v.x, y + v.y, z + v.z);
}
and declare the operator inline in your class definition (this avoids the stack pushes and pops of the return address and method arguments for each method call, if the compiler finds it useful).
Then add this constructor:
Vector::Vector(const double &x, const double &y, const double &z)
    : x(x), y(y), z(z)
{
}
which lets you construct a new vector very efficiently (like you would do in my operator+() suggestion)!
In the code using your Vector:
You did:
for(int i = 0; i < N; ++i)
{
    sum = sum + vec[i];
}
Unroll this kind of loop! Doing only one operation per iteration (when several could be optimized using the SSE2/3 extensions or something similar) in a very large loop is very inefficient. You should rather do something like this:
//Unrolled loop:
for (int i = 0; i <= N - 10; i += 10)
{
    sum = sum + vec[i]
              + vec[i+1]
              + vec[i+2]
              + vec[i+3]
              + vec[i+4]
              + vec[i+5]
              + vec[i+6]
              + vec[i+7]
              + vec[i+8]
              + vec[i+9];
}
//Doing the "rest":
for (int i = (N / 10) * 10; i < N; ++i)
{
    sum = sum + vec[i];
}
(Note that this code is untested and may contain an "off-by-one" error or two...)
Note that you are measuring different things, because the data is not laid out in the same way in memory. When using the Vector array the coordinates are interleaved ("x1,y1,z1,x2,y2,z2,..."), while with the double arrays you have "x1,x2,..., y1,y2,..., z1,z2,...". I suppose this could have an impact on compiler optimizations or on how the caching handles it.
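In other words, the two layouts being compared look roughly like this (the type names are mine, and N is assumed to be a compile-time constant as in the question):
// Array of structures (the Vector array): the components of one vector are adjacent.
struct VectorAoS { double x, y, z; };
VectorAoS aos[N];                // memory: x1, y1, z1, x2, y2, z2, ...

// Structure of arrays (the separate double arrays): each component is contiguous.
struct VectorsSoA
{
    double x[N], y[N], z[N];     // memory: x1, x2, ..., then y1, y2, ..., then z1, z2, ...
};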