Most of the time, the way you would make a grid of objects in C++ is like this:
class Grid {
private:
    GridObject grid[5][5];
};
But is there a way to save memory and not use tiles that have nothing in them? Say you had a chess board. What if you didn't want to use memory in unoccupied spaces? I was thinking of ways to do this but they are inefficient.
I would have a structure like:
class GridObjectWithIndex
{
public:
size_t index;
GridObject obj;
};
Where index is a value of the form x + y * GRIDSIZE, so that you can easily decode its x and y.
And in your Grid class have:
class Grid
{
std::vector<GridObjectWithIndex> grid_elements;
};
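For illustration, here is a minimal sketch of how a lookup by (x, y) could work against that vector; the find_at name and the linear scan are my own assumptions, not part of the question:

#include <cstddef>
#include <vector>

// Hypothetical helper: scan the sparse element list for cell (x, y).
GridObject* find_at(std::vector<GridObjectWithIndex>& elements,
                    std::size_t x, std::size_t y, std::size_t grid_size)
{
    const std::size_t target = x + y * grid_size;  // same encoding as above
    for (GridObjectWithIndex& e : elements)
        if (e.index == target)
            return &e.obj;  // occupied cell
    return nullptr;         // empty cell: no memory was spent on it
}

If grid_elements is kept sorted by index, a std::lower_bound over it would turn this linear scan into an O(log n) lookup.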
The first thing which comes to my mind is storing pointers GridObject* in the grid. It will still consume sizeof(GridObject*) bytes for each non-existent element, but it saves you from the nightmare of managing the data structure. If it's not absolutely critical, don't make it more complex than needed. It's definitely not worth it for chess (8x8).
fritzone's solution will be good if you have a small number of elements over a wide range of indices. However, think about read/write performance.
If you want to save space while sending over the network or saving to a file, you may want to consider compression algorithms, although that only makes sense for large sets of data.
What you are asking for is called a sparse data structure, the most striking examples being sparse matrices.
The idea is to trade memory for CPU: use less memory at the cost of more operations. In general it means:
having a structure only representing the "unusual" values (with a fast access by coordinates)
keeping the "usual" value aside
As a proof of concept, you can use a std::map< std::pair<size_t, size_t>, Object > for the sparse structure and just have another Object on the side for whenever it's not present in the map.
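As a rough sketch of that proof of concept (SparseGrid and the stand-in Object type are illustrative names, not from any particular library):

#include <cstddef>
#include <map>
#include <utility>

struct Object { int value = 0; };  // stand-in cell type, just for the sketch

struct SparseGrid {
    std::map<std::pair<std::size_t, std::size_t>, Object> cells;  // the "unusual" values
    Object usual;                                                 // the "usual" value, kept aside

    const Object& at(std::size_t x, std::size_t y) const {
        auto it = cells.find({x, y});
        return it != cells.end() ? it->second : usual;
    }
};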
Hi, I am a graduate student studying scientific computing using C++. Some of our research focuses on the speed of an algorithm, therefore it is important to construct array structures that are fast enough.
I've seen two ways of constructing 3D Arrays.
The first one is to use the vector library.
vector<vector<vector<double>>> a(isize, vector<vector<double>>(jsize, vector<double>(ksize, 0.0)));
This gives 3D array structure of size isize x jsize x ksize.
The other one is to construct a structure containing a 1D array of size isize * jsize * ksize using
new double[isize*jsize*ksize]. To access the specific location (i,j,k) easily, operator overloading is necessary (am I right?).
And from what I have experienced, the first one is much faster since it can access location (i,j,k) easily, while the latter has to compute the location and return the value. But I have seen some people prefer the latter over the first. Why do they prefer the latter setup, and is there any disadvantage to using the first one?
Thanks in advance.
Main difference between those will be the layout:
vector<vector<vector<T>>>
This will get you a 1D array of vector<vector<T>>.
Each item will be a 1D array of vector<T>.
And each item of those 1D arrays will be a 1D array of T.
The point is, vector itself does not store its content. It manages a chunk of memory, and stores the content there. This has a number of bad consequences:
For a matrix of dimension X·Y·Z, you will end up allocating 1 + X + X·Y memory chunks. That's horribly slow, and will trash the heap. Imagine: a cube matrix of size 20 would trigger 421 calls to new!
To access a cell, you have 3 levels of indirection:
You must access the vector<vector<vector<T>>> object to get pointer to top-level memory chunk.
You must then access the vector<vector<T>> object to get pointer to second-level memory chunk.
You must then access the vector<T> object to get pointer to the leaf memory chunk.
Only then can you access the T data.
Those memory chunks will be spread around the heap, causing a lot of cache misses and slowing the overall computation.
Should you get it wrong at some point, it is possible to end up with some lines in your matrix having different lengths. After all, they're independent 1-d arrays.
Having a contiguous memory block (like new T[X * Y * Z]) on the other hand gives:
You allocate 1 memory chunk. No heap trashing, O(1).
You only need to access the pointer to the memory chunk, then can go straight for desired element.
All matrix is contiguous in memory, which is cache-friendly.
These days, a single cache miss means dozens or hundreds of lost computing cycles; do not underestimate the cache-friendliness aspect.
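To make the contrast concrete, here is a minimal sketch of the contiguous alternative, assuming row-major order (the demo wrapper and the 42.0 write are purely illustrative):

#include <cstddef>

void demo(std::size_t X, std::size_t Y, std::size_t Z,
          std::size_t i, std::size_t j, std::size_t k)
{
    double* data = new double[X * Y * Z];  // one allocation, one chunk
    data[(i * Y + j) * Z + k] = 42.0;      // element (i, j, k): a single index computation
    delete[] data;                         // one matching delete[]
}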
By the way, there is probably a better way you didn't mention: using one of the numerous matrix libraries that handle this for you automatically and provide nice support tools (like SSE-accelerated matrix operations). One such library is Eigen, but there are plenty of others.
→ You want to do scientific computing? Let a lib handle the boilerplate and the basics so you can focus on the scientific computing part.
From my point of view, std::vector has a great many advantages over plain arrays.
In short here are some:
It is much harder to create memory leaks with std::vector. This point alone is one of the biggest advantages. This has nothing to do with performance, but should be considered all the time.
std::vector is part of the STL, one of the most heavily used parts of C++. Thousands of people use the STL, so it gets "tested" every day. Over the last years it has been optimized so radically that it doesn't lack performance anymore. (Please correct me if I see this wrong.)
Handling a std::vector is as easy as 1, 2, 3. No pointer handling, no nothing... just access it via its methods or the []-operator.
First of all, the idea that you access (i,j,k) in your vec^3 directly is somewhat flawed. What you have is a structure of pointers where you need to dereference three pointers along the way. Note that I have no idea whether that is faster or slower than computing the position within a one-dimensional array, though. You'd need to test that and it might depend on the size of your data (especially whether it fits in a chunk).
Second, the vector^3 requires pointers and vector sizes, which require more memory. In many cases this will be irrelevant (as the image grows cubically but the memory difference only quadratically), but if your algorithm is really going to fill out any memory available, it can matter.
Third, the raw array stores everything in consecutive memory, which is good for streaming and can be good for certain algorithms because of quick cache accesses. For example when you add one 3D image to another.
Note that all of this is about hyper-optimization that you might not need. The advantages of vectors that skratchi.at pointed out in his answer are quite strong, and I add the advantage that vectors usually increase readability. If you do not have very good reasons not to use vectors, then use them.
If you should decide for the raw array, in any case, make sure that you wrap it well and keep the class small and simple, in order to counter problems regarding leaks and such.
Welcome to SO.
If those two alternatives are all you have, then the first one could be better.
Prefer using STL array or vector instead of a C array
You should avoid using plain C++ arrays, since you need to manage the memory allocation/deallocation yourself with new/delete, plus other boilerplate code like keeping track of the size and checking bounds. In clear words: "C arrays are less safe, and have no advantages over array and vector."
However, there are some important drawbacks in the first alternative. Something I would like to highlight is that:
std::vector<std::vector<std::vector<T>>>
is not a 3-D matrix. In a matrix, all the rows must have the same size. On the other hand, in a "vector of vectors" there is no guarantee that all the nested vectors have the same length. The reason is that a vector is a linear 1-D structure, as pointed out in @spectras' answer. Hence, to avoid all sorts of bad or unexpected behaviour, you must include guards in your code to maintain the rectangular invariant.
Luckily, the first alternative is not the only one you may have in hands.
For example, you can replace the C-style array with a std::array (note that std::array needs its size at compile time):
constexpr int n = i_size * j_size * k_size;
std::array<double, n> myFlattenedMatrix;
or use std::vector in case your matrix dimensions can change.
Accessing element by its 3 coordinates
Regarding your question
To access the specific location of (i,j,k) easily, operator overloading is necessary (am I right?).
Not exactly. Since there is no 3-parameter operator[] for either std::vector or std::array, you can't overload it that way. But you can create a template class or function to wrap it for you. In any case you will have to dereference the 3 vectors or calculate the flattened index of the element in the linear storage.
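A minimal wrapper sketch, assuming row-major order in a single flat buffer (the Matrix3D name and interface are illustrative):

#include <cstddef>
#include <vector>

template <typename T>
class Matrix3D {
public:
    Matrix3D(std::size_t ni, std::size_t nj, std::size_t nk)
        : nj_(nj), nk_(nk), data_(ni * nj * nk, T{}) {}

    T& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data_[(i * nj_ + j) * nk_ + k];  // flattened row-major index
    }

private:
    std::size_t nj_, nk_;
    std::vector<T> data_;  // single contiguous buffer
};

With this, Matrix3D<double> a(isize, jsize, ksize); lets you write a(i, j, k) while keeping the contiguous layout.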
Consider not using a third-party matrix library like Eigen for your experiments
You aren't coding it for production, but for research purposes instead. In particular, your research is exactly about the performance of algorithms. In that case, I would not recommend using a third-party library like Eigen at all. Of course it depends a lot on what kind of "speed of an algorithm" metrics you are willing to gather, but Eigen, for instance, will do a lot of things under the hood (like vectorization) which will have a tremendous influence on your experiments. Since it will be hard for you to control those unseen optimizations, the library's features may lead you to wrong conclusions about your algorithms.
Algorithm's performance and big-o notation
Usually, the performance of algorithms is analysed using the big-O approach, where factors like the actual time spent, hardware speed, or programming language traits aren't taken into account. The book "Data Structures and Algorithms in C++" by Adam Drozdek provides more details about it.
I have to read a file that stores a matrix of cars (1 = BlueCar, 2 = RedCar, 0 = Empty).
I need to write an algorithm to move the cars of the matrix in that way:
blue ones move downward;
red ones move rightward;
there is a turn in which all the blue ones move and a turn to move all the red ones.
Before the file is read I don't know the matrix size and if it's dense or sparse, so I have to implement two data structures (one for dense and one for sparse) and two algorithms.
I need to reach the best time and space complexity possible.
Due to the unknown matrix size, I plan to store the data on the heap.
If the matrix is dense, I plan to use something like:
short int** M = new short int*[m];
short int* M_data = new short int[m*n];
for(int i=0; i< m; ++i)
{
M[i] = M_data + i * n;
}
With this structure I can allocate a contiguous block of memory that is also simple to access with M[i][j].
Now the problem is choosing a structure for the sparse case, and I also have to consider how I can move the cars through the algorithm in the simplest way: for example, when I evaluate a car, I need to find easily whether the next position (downward or rightward) holds another car or is empty.
Initially I thought to define BlueCar and RedCar objects that inherit from a general Car object. In these objects I can save the matrix coordinates and then put them in:
std::vector<BlueCar> sparseBlue;
std::vector<RedCar> sparseRed;
Otherwise I can do something like:
std::vector<std::tuple<int, int, short>> sparseMatrix; // (row, column, value)
But the problem of finding what's in the next position still remains.
Probably this is not the best way to do it, so how can I implement the sparse case in an efficient way (ideally with a single structure for the sparse case)?
Why not simply create a memory mapping directly over the file? (assuming your data 0,1,2 is stored in contiguous bytes (or bits) in the file, and the position of those bytes also represents the coordinates of the cars)
This way you don't need to allocate extra memory and read in all the data, and the data can simply and efficiently be accessed with M[i][j].
Going over the rows would be L1-cache friendly.
In case of very sparse data, you could scan through the data once and keep a list of the empty regions/blocks in memory (only need to store startpos and size), which you could then skip (and adjust where needed) in further runs.
With memory mapping, only frequently accessed pages are kept in memory. This means that once you have scanned for the empty regions, memory will only be allocated for the frequently accessed non-empty regions (all this will be done automagically by the kernel - no need to keep track of it yourself).
Another benefit is that you are accessing the OS disk cache directly. Thus no need to keep copying and moving data between kernel space and user space.
To further optimize space and memory usage, the cars could be stored in 2 bits each in the file.
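A rough POSIX-only sketch of the idea, assuming one byte per cell stored row-major in the file; the file name, n_cols, and the omitted error handling are all assumptions:

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

unsigned char read_cell(const char* path, std::size_t i, std::size_t j, std::size_t n_cols)
{
    int fd = open(path, O_RDWR);
    struct stat st;
    fstat(fd, &st);
    auto* cells = static_cast<unsigned char*>(
        mmap(nullptr, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    unsigned char car = cells[i * n_cols + j];  // 0 = empty, 1 = blue, 2 = red
    munmap(cells, st.st_size);                  // in real code, keep the mapping
    close(fd);                                  // alive for the whole simulation
    return car;
}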
Update:
I'll have to move cars with openMP and MPI... Will the memory mapping work also with concurrent threads?
You could certainly use multithreading, but I'm not sure openMP would be the best solution here, because if you work on different parts of the data at the same time, you may need to check some overlapping regions (i.e. a car could move from one block to another).
Or you could let the threads work on the middle parts of the blocks, and then start other threads to do the boundaries (with red cars that would be one byte, with blue cars a full row).
You would also need a locking mechanism for adjusting the list of the sparse regions. I think the best way would be to launch separate threads (depending on the size of the data of course).
In a somewhat similar task, I simply made use of Compressed Row Storage.
The Compressed Row and Column (in the next section) Storage formats
are the most general: they make absolutely no assumptions about the
sparsity structure of the matrix, and they don't store any unnecessary
elements. On the other hand, they are not very efficient, needing an
indirect addressing step for every single scalar operation in a
matrix-vector product or preconditioner solve.
You will need to be a bit more specific about time and space complexity requirements. CSR requires an extra indexing step for simple operations, but that is a minor amount of overhead if you're just doing simple matrix operations.
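For illustration, here is a minimal CRS sketch for the car matrix; the names and the per-row scan are my own, not taken from the linked implementation:

#include <cstddef>
#include <vector>

struct CRSMatrix {
    std::vector<short>       values;  // nonzero cells: 1 = blue, 2 = red
    std::vector<std::size_t> colIdx;  // column of each nonzero
    std::vector<std::size_t> rowPtr;  // rowPtr[r]..rowPtr[r+1] bound row r's nonzeros

    short at(std::size_t r, std::size_t c) const {  // the indirect addressing step
        for (std::size_t n = rowPtr[r]; n < rowPtr[r + 1]; ++n)
            if (colIdx[n] == c)
                return values[n];
        return 0;                                   // empty cell
    }
};

Since colIdx is sorted within each row, a std::lower_bound over that range would make the lookup logarithmic instead of linear.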
There's already an existing C++ implementation available online as well.
I have a class that implements two simple, pre-sized stacks; those are stored as members of the class of type vector pre-sized by the constructor. They are small and cache line size friendly objects.
Those two stacks are constant in size, persisted and updated lazily, and are often accessed together by some computationally cheap methods that, however, can be called a large number of times (tens to hundreds of thousands of times per second).
All objects are already in good state (code is clean and does what it's supposed to do), and all sizes are kept under control (64k to 128k in most cases for the whole chain of ops including results, rarely getting close to 256k, so at worst an L2 lookup and often L1).
Some auto-vectorization comes into play, but other than that it's single-threaded code throughout.
The class, minus some minor things and padding, looks like this:
class Curve{
private:
std::vector<ControlPoint> m_controls;
std::vector<Segment> m_segments;
unsigned int m_cvCount;
unsigned int m_sgCount;
std::vector<unsigned int> m_sgSampleCount;
unsigned int m_maxIter;
unsigned int m_iterSamples;
float m_lengthTolerance;
float m_length;
};
Curve::Curve(){
m_controls = std::vector<ControlPoint>(CONTROL_CAP);
m_segments = std::vector<Segment>( (CONTROL_CAP-3) );
m_cvCount = 0;
m_sgCount = 0;
m_sgSampleCount = std::vector<unsigned int>(CONTROL_CAP-3); // assign the member, not a shadowing local
m_maxIter = 3;
m_iterSamples = 20;
m_lengthTolerance = 0.001;
m_length = 0.0;
}
Curve::~Curve(){}
Bear with the verbosity, please, I'm trying to educate myself and make sure I'm not operating by some half-arsed knowledge:
Given the operations that are run on those and their actual use, performance is largely memory I/O bound.
I have a few questions related to optimal positioning of the data, keep in mind this is on Intel CPUs (Ivy and a few Haswell) and with GCC 4.4, I have no other use cases for this:
I'm assuming that if the actual storage of controls and segments is contiguous with the instance of Curve, that's an ideal scenario for the cache (size-wise, the lot can easily fit on my target CPUs).
A related assumption is that if the vectors are distant from the instance of the Curve, and from each other, then as methods alternately access the contents of those two members, there will be more frequent eviction and re-population of the L1 cache.
1) Is that correct (data is pulled for an entire cache-line-sized stretch from the address first looked up on a new operation, and not in convenient multiple segments of appropriate size), or am I misunderstanding the caching mechanism, and the cache can pull and preserve multiple smaller stretches of RAM?
2) Following the above: so far, by pure circumstance, all my tests end up with the class instance and the vectors contiguous, but I assume that's just dumb luck, however statistically probable. Normally, instancing the class reserves only the space for that object, and then the vectors are allocated in the next free contiguous chunk available, which is not guaranteed to be anywhere near my Curve instance if the instance previously found a small emptier niche in memory.
Is this correct?
3) Assuming 1 and 2 are correct, or close enough functionally speaking, I understand that to guarantee performance I'd have to write an allocator of sorts to make sure the class object itself is large enough on instancing, and then copy the vectors in there myself and from there on refer to those.
I can probably hack my way to something like that if it's the only way to work through the problem, but I'd rather not hack it horribly if there are nice/smart ways to go about something like that. Any pointers on best practices and suggested methods would be hugely helpful (beyond "don't use malloc it's not guaranteed contiguous", that one I already have down :) ).
If the Curve instances fit into a cache line and the data of the two vectors also fits a cache line each, the situation is not that bad, because you have four constant cache lines then. If every element were accessed indirectly and randomly positioned in memory, every access to an element might cost you a fetch operation, which is avoided in that case. In the case that both Curve and its elements fit into fewer than four cache lines, you would reap benefits from putting them into contiguous storage.
True.
If you used std::array, you would have the guarantee that the elements are embedded in the owning class and there is no dynamic allocation (which in and of itself costs you memory space and bandwidth). You would then even avoid the indirect access that you would still have if you used a special allocator that puts the vector contents in contiguous storage with the Curve instance.
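A sketch of that std::array variant, assuming CONTROL_CAP is a compile-time constant (as its use in the constructor suggests):

#include <array>

class Curve {
private:
    std::array<ControlPoint, CONTROL_CAP> m_controls;          // embedded in the object itself
    std::array<Segment, CONTROL_CAP - 3> m_segments;           // no separate heap allocation
    std::array<unsigned int, CONTROL_CAP - 3> m_sgSampleCount;
    unsigned int m_cvCount;
    unsigned int m_sgCount;
    // ... remaining members as before ...
};

The payload now lives inside each Curve instance, so the instance and its data occupy one stretch of memory by construction rather than by allocator luck.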
BTW: Short style remark:
Curve::Curve()
{
m_controls = std::vector<ControlPoint>(CONTROL_CAP, ControlPoint());
m_segments = std::vector<Segment>(CONTROL_CAP - 3, Segment());
...
}
...should be written like this:
Curve::Curve():
m_controls(CONTROL_CAP),
m_segments(CONTROL_CAP - 3)
{
...
}
This is called "initializer list", search for that term for further explanations. Also, a default-initialized element, which you provide as second parameter, is already the default, so no need to specify that explicitly.
I am working on an OpenGL based game, and this problem arises in my search for a robust way to pass entity information to the rendering and networking functions.
Something like this:
vector <DynamicEntity> DynamicEntities;
vector <StaticEntity> StaticEntities;
vector <Entity> *Entities;
Entities.push_back(&DynamicEntities);
Entities.push_back(&StaticEntities);
Graphics.draw(Entities);
Server.bufferData(Entities);
The only information Graphics.draw and Server.bufferData need is contained by Entity, which both DynamicEntity and StaticEntity inherit from.
I know this would work:
vector <Entity*> Entities;
for(uint32_t iEntity=0;iEntity<iNumEntities;iEntity++)
{
Entities.push_back(&DynamicEntities[iEntity]);
}
//(ad nauseam for other entity types)
but that seems awfully inefficient, and would have to account for changing numbers of entities as the game progresses. I know relatively little of the underpinnings of data structures; so this may be a stupid question, but I would appreciate some insight.
That copy operation in the for loop you show is likely to be so fast as to be unmeasurable as compared to the cost of drawing the graphics. You may be able to make it faster still by doing Entities.resize(...) before the loop and using array indexing in the loop.
The solution you propose in the first part of your question will not work, because the data layout for a vector<StaticEntity> likely differs from the layout for a vector<DynamicEntity>. Arrays of objects are not polymorphic, and the storage managed by vector<> is essentially an array with some dynamic management around it.
In fact, because the vector storage is array-like, if you replace push_back with array indexing, the body of the for loop will compile to something resembling:
*p1++ = p2++;
which the compiler may even auto-vectorize!
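A sketch of that resize-plus-indexing variant (entity vectors as in the question; the counter name is illustrative):

std::vector<Entity*> Entities;
Entities.resize(DynamicEntities.size() + StaticEntities.size());
std::size_t n = 0;
for (std::size_t i = 0; i < DynamicEntities.size(); ++i)
    Entities[n++] = &DynamicEntities[i];  // derived-to-base pointer conversion
for (std::size_t i = 0; i < StaticEntities.size(); ++i)
    Entities[n++] = &StaticEntities[i];

Keep in mind the stored pointers dangle if either source vector reallocates, so rebuild Entities after insertions (or simply every frame).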
Okay, here's sample code comparing an Object Oriented Programming (OOP) solution vs a Data Oriented Design (DOD) solution of updating a bunch of balls.
#include <cstddef>
#include <cstdio>
#include <vector>

const size_t ArraySize = 1000;
class Ball
{
public:
float x,y,z;
Ball():
x(0),
y(0),
z(0)
{
}
void Update()
{
x += 5;
y += 5;
z += 5;
}
};
std::vector<Ball> g_balls(ArraySize);
class Balls
{
public:
std::vector<float> x;
std::vector<float> y;
std::vector<float> z;
Balls():
x(ArraySize,0),
y(ArraySize,0),
z(ArraySize,0)
{
}
void Update()
{
const size_t num = x.size();
if(num == 0)
{
return;
}
const float* lastX = &x[num - 1];
float* pX = &x[0];
float* pY = &y[0];
float* pZ = &z[0];
for( ; pX <= lastX; ++pX, ++pY, ++pZ)
{
*pX += 5;
*pY += 5;
*pZ += 5;
}
}
};
int main()
{
Balls balls;
Timer time1; // Timer: a simple timing utility assumed to be defined elsewhere
time1.Start();
balls.Update();
time1.Stop();
Timer time2;
time2.Start();
const size_t arrSize = g_balls.size();
if(arrSize > 0)
{
const Ball* lastBall = &g_balls[arrSize - 1];
Ball* pBall = &g_balls[0];
for( ; pBall <= lastBall; ++pBall)
{
pBall->Update();
}
}
time2.Stop();
printf("Data Oriented design time: %f\n",time1.Get_Microseconds());
printf("OOB oriented design time: %f\n",time2.Get_Microseconds());
return 0;
}
Now, this does compile and run in Visual Studio, though I'm wondering whether I'm allowed to do this, i.e. whether I can reliably do this:
const float* lastX = &x[num - 1];//remember, x is a std::vector of floats
float* pX = &x[0];//remember, x is a std::vector of floats
float* pY = &y[0];//remember, y is a std::vector of floats
float* pZ = &z[0];//remember, z is a std::vector of floats
for( ; pX <= lastX; ++pX, ++pY, ++pZ)
{
*pX += 5;
*pY += 5;
*pZ += 5;
}
From my understanding, the data in a std::vector is supposed to be contiguous, though I'm not sure, given how it's stored internally, whether this could be an issue on another platform or whether it breaks the standard. Also, this was the only way I was able to get the DOD solution to outdo the OOP solution; any other way of iterating wasn't as good. I could use iterators, though I'm pretty sure they might only beat OOP with optimizations enabled, i.e. in release mode.
So, is this a good way to do DOD (best way?), and is this legal c++?
[EDIT]
Okay, for DOD this is a poor example; the x, y, z should be packaged in a Vector3. So, while DOD ran faster in debug than OOP, in release it was another story. Again, this is a bad example of how you would want to use DOD efficiently, though it does show its shortcomings if you need to access a bunch of data at the same time. The key to using DOD properly is to "design data based on access patterns".
The question with all the code and such is a bit convoluted, so let's try to see if I understand what you really need:
From my understanding the data in a std::vector are supposed to be contiguous
It is. The standard mandates that the data in the vector is stored contiguously, which means that this will be the case in all platforms / compilers that conform to the standard.
this was the only way I was able to get the DOD solution to outdo the OOP solution
I don't know what you mean by DOD.
I could use iterators, though I'm pretty sure that might only be quicker with optimizations
Actually, iterators in this case (assuming you have debug iterators disabled in VS) will be as fast, if not faster, than direct modification through pointers. An iterator into a vector can be implemented as a plain pointer to the element. Again, note that by default in VS, iterators do extra work to help debugging.
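For comparison, the same inner loop written with iterators (a sketch; in release mode with debug iterators off it should compile down to the pointer version):

auto itX = x.begin();
auto itY = y.begin();
auto itZ = z.begin();
for (; itX != x.end(); ++itX, ++itY, ++itZ)
{
    *itX += 5;
    *itY += 5;
    *itZ += 5;
}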
The next thing to consider is that the memory layout of the two approaches differs, which means that if at a later stage you need to access all of x, y and z for a single element, in the first case they will most probably fall in a single cache line, while in the three-vectors approach it will require pulling memory from three different locations.
As pointed out already, vector storage was contiguous in practice even before C++11 (and the standard has guaranteed it since C++03); C++11 additionally introduced a data() method which returns a direct pointer to the internal array used by the vector. Here's your ISO C++ standard quote:
23.2.6 Class template vector [vector]
[...] The elements of a vector are stored contiguously, meaning that if v is a vector where T is some type other than bool, then it obeys the identity &v[n] == &v[0] + n for all 0 <= n < v.size().
That said, I wanted to jump in mainly because of the way you're benchmarking and using "DOD":
So, while DOD ran faster in debug than OOP, in release it was another story.
This kind of sentence doesn't make much sense because DOD is not synonymous with using SoAs for everything, especially if that leads to a performance degradation.
Data-oriented design is just a generalized approach to design where you consider how to store and efficiently access data in advance. It becomes one of the first things to consider upfront when approaching designs using this mindset. The opposite is starting off, say, designing an architecture trying to figure out all the functionality it should provide along with objects and abstractions and pure interfaces and so forth and then leaving the data as an implementation detail to be filled out later. DOD would start off with the data as a fundamental thing to consider in the design stage and not an implementation detail to be filled in as an afterthought. It's useful in performance-critical cases where performance is a fundamental design-level requirement demanded by customers, and not just an implementation luxury.
In some cases the efficient representation of the data structures actually leads to new features, somewhat allowing the data itself to design the software. Git is an example of such software, where its features revolve around the changeset data structure to some degree, and whose efficiency actually led to new features being conceived. In those cases the software's features and user-end design actually evolve out of its efficiency, opening up new doors because the efficiency allows things to be done interactively, say, that were previously thought to be too computationally expensive to do in any reasonable amount of time. Another example is ZBrush, which reshaped my VFX industry by allowing things people thought were impossible a couple of decades ago, like interactively sculpting 20-million-polygon meshes with a sculpting brush to achieve models of a level of detail no one had seen before in the late 90s and early 2000s. Another is voxel cone tracing, which allows even games on PlayStation to have indirect lighting with diffuse reflections; something people would otherwise still think requires minutes or hours to render a single frame, not 60+ frames per second. So sometimes an effective DOD approach will actually yield new features in the software that people formerly thought were not possible, because it broke the analogical sound barrier.
A DOD mindset could still lead to a design that utilizes an AoS representation if that's deemed to be more efficient. An AoS would often excel in cases where you need random-access, for example, and all or most interleaved fields are hot and frequently accessed and/or modified together.
Also, this is just my spin on it, but in my opinion DOD doesn't have to arrive at the most efficient data representation upfront. What it does need is to arrive at the most efficient interface designs upfront, to leave sufficient breathing room to optimize as needed. An example of software which seems to lack the foresight that a DOD mindset would provide is a video compositing application that represents data like this:
class IPixel
{
public:
virtual ~IPixel() {}
...
};
Just one glance at the code above reveals a significant lack of foresight as to how to design things for efficient data representation and access. For starters, if you consider a 32-bit RGBA pixel, the cost of the virtual pointer, assuming 64-bit pointer size and alignment, would quadruple the size of a single pixel (64-bit vptr + 32-bit pixel data + 32 bits of padding for alignment of the vptr). So anyone applying a DOD mindset would generally steer clear of such interface designs like the plague. They might still benefit from an abstraction, however, like being able to utilize the same code for images with many different pixel formats. But in that case I'd expect this:
class IImage
{
public:
virtual ~IImage() {}
...
};
... which trivializes the overhead of a vptr, virtual dispatch, possible loss of contiguity, etc. to the level of an entire image (possibly millions of pixels) and not something paid per-pixel.
Typically a DOD mindset does tend to lead to coarser, not granular, interface designs (interfaces for whole containers, as with the case of an image interface representing a container of pixels, or sometimes even containers of containers). The main reason is that you don't have much breathing room to centrally optimize if you have a codebase like this:
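Something like this hypothetical spread of call sites, each coupled to the granular Ball type (an illustration of mine, not code from the question):

// Hypothetical illustration: every subsystem is written against one Ball at a time.
void stepPhysics(Ball& b);          // physics calls b.Update() per ball
void drawBall(const Ball& b);       // renderer reads b.x, b.y, b.z per ball
void serializeBall(const Ball& b);  // network code sends one ball at a time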
Because now let's say you want to multithread the processing of many balls at once everywhere. You can't without rewriting the entire codebase that uses balls individually. As another example, let's say you want to change the representation of a ball from AoS to SoA. That would require rewriting Ball to become Balls, along with the entire codebase using the former Ball design. Similar thing if you want to process balls on the GPU. So typically a DOD mindset would tend to favor a coarser design: a Balls interface that owns and processes the whole collection.
In this second case, you can apply all the optimizations you ever need to process balls in parallel, represent them with SoAs, etc. -- whatever you want without rewriting the codebase. But that said, the Balls implementation might still privately store each individual Ball using an AoS:
class Balls
{
public:
...
private:
struct Ball
{
...
};
vector<Ball> balls;
};
... or not. It doesn't matter nearly as much at this point because you're free to change the private implementation of Balls all you like now without affecting the rest of the codebase.
Finally for your benchmark, what does it do? It's basically looping through a bunch of single-precision floats and adding 5 to them. It doesn't make any real difference in that case whether you store one array of floats or a thousand. If you store more arrays, then inevitably that adds some overhead with no benefit whatsoever if all you're going to be doing is looping through all floats and adding 5 to them.
To get use of an SoA representation, you can't just write code that does exactly the same thing to all fields. SoAs typically excel in a sequential access pattern on non-trivial input sizes when you actually need to do something different with each field, like transform each x/y/z data field using a transformation matrix with efficient SIMD instructions (either handwritten or generated by your optimizer) transforming 4+ balls at once, not simply adding 5 to a boatload of floats. They especially excel when not all fields are hot, like a physics system not being interested in the sprite field of a particle (which would be wasteful to load into a cache line only to not use it). So to test the differences between an SoA and AoS rep, you need a sufficiently real-world kind of benchmark to see practical differences.
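As a hedged sketch of that hot/cold idea, assuming a particle system along the lines described (all names illustrative):

#include <cstddef>
#include <vector>

struct Particles {
    std::vector<float> x, y, z;  // hot: read and written every physics tick
    std::vector<int> spriteId;   // cold: only the renderer touches this

    void integrate(float dx, float dy, float dz) {
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += dx;          // sprite data never enters these cache lines
            y[i] += dy;
            z[i] += dz;
        }
    }
};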
Yes, you can do this.
Vector containers are implemented as dynamic arrays; Just as regular arrays, vector containers have their elements stored in contiguous storage locations, which means that their elements can be accessed not only using iterators but also using offsets on regular pointers to elements.
http://cplusplus.com/reference/stl/vector/