In my C and C++ programs, I have three arrays: one for the x coordinates, one for the y coordinates, and one for the location in the list. I was wondering if it would take less memory if I put all three in the same array. Any ideas? Also, is it possible to define only one dimension and leave the other variable?
Thank you for the help.
It's unlikely to take less memory to put them together; depending on the structure packing and alignment, it might even take more. That's doubtful in this case though. The most likely outcome is that they'll be the same.
One thing that will be affected is cache locality. If you usually access all of the values together at the same time, it will be slightly more efficient to have them close together. On the other hand, if you typically zip through one array at a time, it will be more efficient to keep them separated.
P.S. If it wasn't obvious, I'm advocating putting the values in a structure rather than a 2D array. The memory layout would be similar but the syntax is different.
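For illustration, here is a minimal sketch of the two layouts under discussion (the names Point, xs, ys and positions are made up for the example); with these member types the struct needs no padding, so the total memory comes out the same either way.

#include <cstddef>

constexpr std::size_t N = 1000;   // example element count

// Option 1: three separate ("parallel") arrays.
float xs[N];
float ys[N];
int   positions[N];               // location in the list

// Option 2: one array of a struct that groups the three values.
struct Point {
    float x;
    float y;
    int   position;               // location in the list
};
Point points[N];

// Accessing element i:
//   option 1: xs[i], ys[i], positions[i]  - each component is contiguous on its own
//   option 2: points[i].x, points[i].y, points[i].position - one element is contiguous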
It will take exactly the same amount of memory; it will just be arranged differently.
It is (arguably, generally) better style to have one array of complex things, if those things go together. One reason is that you can't accidentally end up with a different number of each component.
A multi-dimensional array is really just a one-dimensional array. With this in mind, you would want to use data locality principles (when you access a data point in memory, it is cheap to load the data points near it).
Take into consideration whether you'll be accessing your x's in one go then your y's then your locations, or whether you'll be accessing x y and location data all at once.
Look more into how multi-dimensional arrays are represented in memory to decide how you'd "group" your data.
Related
Hi, I am a graduate student studying scientific computing using C++. Some of our research focuses on the speed of an algorithm, so it is important to construct an array structure that is fast enough.
I've seen two ways of constructing 3D Arrays.
The first one is to use the vector library.
vector<vector<vector<double>>> a(isize, vector<vector<double>>(jsize, vector<double>(ksize, 0)));
This gives 3D array structure of size isize x jsize x ksize.
The other one is to construct a structure containing a 1D array of size isize * jsize * ksize using
new double[isize*jsize*ksize]. To access the specific location (i,j,k) easily, operator overloading is necessary (am I right?).
And from what I have experienced, the first one is much faster since it can access location (i,j,k) directly, while the latter one has to compute the location and return the value. But I have seen some people prefer the latter over the first one. Why do they prefer the latter setup, and is there any disadvantage to using the first one?
Thanks in advance.
The main difference between those will be the layout:
vector<vector<vector<T>>>
This will get you a 1D array of vector<vector<T>>.
Each item will be a 1D array of vector<T>.
And each item of those 1D arrays will be a 1D array of T.
The point is, vector itself does not store its content. It manages a chunk of memory, and stores the content there. This has a number of bad consequences:
For a matrix of dimension X·Y·Z, you will end up allocating 1 + X + X·Y memory chunks. That's horribly slow, and will trash the heap. Imagine: a cube matrix of size 20 would trigger 421 calls to new!
To access a cell, you have 3 levels of indirection:
You must access the vector<vector<vector<T>>> object to get a pointer to the top-level memory chunk.
You must then access the vector<vector<T>> object to get a pointer to the second-level memory chunk.
You must then access the vector<T> object to get a pointer to the leaf memory chunk.
Only then can you access the T data.
Those memory chunks will be spread around the heap, causing a lot of cache misses and slowing the overall computation.
Should you get it wrong at some point, it is possible to end up with some lines in your matrix having different lengths. After all, they're independent 1-d arrays.
Having a contiguous memory block (like new T[X * Y * Z]) on the other hand gives:
You allocate 1 memory chunk. No heap trashing, O(1).
You only need to access the pointer to the memory chunk, then can go straight for desired element.
The whole matrix is contiguous in memory, which is cache-friendly.
These days a single cache miss means dozens or hundreds of lost compute cycles, so do not underestimate the cache-friendliness aspect.
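To make the allocation difference concrete, here is a minimal sketch (the names and the cube-of-20 size follow the example above) comparing the nested-vector construction with a single contiguous block and the usual row-major index computation:

#include <cstddef>
#include <vector>

int main() {
    const std::size_t isize = 20, jsize = 20, ksize = 20;

    // Nested vectors: 1 + isize + isize*jsize separate allocations (421 here).
    std::vector<std::vector<std::vector<double>>> nested(
        isize, std::vector<std::vector<double>>(jsize, std::vector<double>(ksize, 0.0)));

    // One contiguous block: a single allocation.
    std::vector<double> flat(isize * jsize * ksize, 0.0);

    // Element (i, j, k) of the flat block, row-major:
    const std::size_t i = 3, j = 4, k = 5;
    double value = flat[(i * jsize + j) * ksize + k];

    (void)nested;
    (void)value;
}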
By the way, there is probably a better way you didn't mention: using one of the numerous matrix libraries that will handle this for you automatically and provide nice supporting tools (like SSE-accelerated matrix operations). One such library is Eigen, but there are plenty of others.
→ You want to do scientific computing? Let a lib handle the boilerplate and the basics so you can focus on the scientific computing part.
From my point of view, there are too many advantages that std::vector has over plain arrays to ignore.
In short here are some:
It is much harder to create memory leaks with std::vector. This point alone is one of the biggest advantages. This has nothing to do with performance, but should be considered all the time.
std::vector is part of the STL. This part of C++ is one of the most used ones. Thousands of people use the STL, so it gets "tested" every day. Over the last years it has been optimized so radically that it doesn't lack any performance anymore. (Please correct me if I see this wrong.)
Handling a std::vector is as easy as 1, 2, 3. No pointer handling, no nothing... just access it via its methods, the []-operator, and so on.
First of all, the idea that you access (i,j,k) in your vec^3 directly is somewhat flawed. What you have is a structure of pointers where you need to dereference three pointers along the way. Note that I have no idea whether that is faster or slower than computing the position within a one-dimensional array, though. You'd need to test that and it might depend on the size of your data (especially whether it fits in a chunk).
Second, the vector^3 requires pointers and vector sizes, which require more memory. In many cases this will be irrelevant (as the image grows cubically but the memory difference only quadratically), but if your algorithm is really going to fill out any memory available, that can matter.
Third, the raw array stores everything in consecutive memory, which is good for streaming and can be good for certain algorithms because of quick cache accesses. For example when you add one 3D image to another.
Note that all of this is about hyper-optimization that you might not need. The advantages of vectors that skratchi.at pointed out in his answer are quite strong, and I add the advantage that vectors usually increase readability. If you do not have very good reasons not to use vectors, then use them.
If you should decide for the raw array, in any case, make sure that you wrap it well and keep the class small and simple, in order to counter problems regarding leaks and such.
Welcome to SO.
If those two alternatives are everything you have, then the first one could be better.
Prefer using STL array or vector instead of a C array
You should avoid using plain C++ arrays, since you need to manage the memory yourself, allocating/deallocating with new/delete, plus other boilerplate code like keeping track of the size and checking bounds. In plain words, "C arrays are less safe, and have no advantages over array and vector."
However, there are some important drawbacks in the first alternative. Something I would like to highlight is that:
std::vector<std::vector<std::vector<T>>>
is not a 3-D matrix. In a matrix, all the rows must have the same size. On the other hand, in a "vector of vectors" there is no guarantee that all the nested vectors have the same length. The reason is that a vector is a linear 1-D structure, as pointed out in spectras' answer. Hence, to avoid all sorts of bad or unexpected behaviour, you must include guards in your code to keep the rectangular invariant guaranteed.
Luckily, the first alternative is not the only one you may have in hands.
For example, you can replace the C-style array with a std::array (note that the dimensions must then be compile-time constants):
constexpr int n = i_size * j_size * k_size;
std::array<double, n> myFlattenMatrix;
or use std::vector in case your matrix dimensions can change.
Accessing element by its 3 coordinates
Regarding your question
To access the specific location (i,j,k) easily, operator overloading is necessary (am I right?).
Not exactly. Since there isn't a 3-parameter subscript operator for either std::vector or std::array, you can't simply overload it on those types. But you can create a template class or function to wrap it for you. In any case, you will have to dereference the 3 vectors or compute the flattened index of the element in the linear storage.
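As a sketch of what such a wrapper could look like (one possible design, with an invented name Grid3D and row-major indexing assumed, not a standard facility):

#include <cstddef>
#include <vector>

// Minimal illustrative wrapper over flat storage.
class Grid3D {
public:
    Grid3D(std::size_t ni, std::size_t nj, std::size_t nk)
        : ni_(ni), nj_(nj), nk_(nk), data_(ni * nj * nk, 0.0) {}

    // operator() takes the three coordinates; operator[] cannot take three
    // arguments before C++23, which is why the call operator is used here.
    double& operator()(std::size_t i, std::size_t j, std::size_t k) {
        return data_[(i * nj_ + j) * nk_ + k];
    }
    double operator()(std::size_t i, std::size_t j, std::size_t k) const {
        return data_[(i * nj_ + j) * nk_ + k];
    }

private:
    std::size_t ni_, nj_, nk_;
    std::vector<double> data_;
};

Usage would then look like Grid3D a(isize, jsize, ksize); a(i, j, k) = 1.0; with the index arithmetic hidden inside the class.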
Consider not using a third-party matrix library like Eigen for your experiments
You aren't coding it for production, but for research purposes instead. In particular, your research is exactly about the performance of algorithms. In that case, I would not recommend using a third-party library like Eigen at all. Of course it depends a lot on what kind of "speed of an algorithm" metrics you are willing to gather, but Eigen, for instance, will do a lot of things under the hood (like vectorization) which will have a tremendous influence on your experiments. Since it will be hard for you to control those unseen optimizations, the library's features may lead you to wrong conclusions about your algorithms.
Algorithm's performance and big-o notation
Usually, the performance of algorithms is analysed using the big-O approach, where factors like the actual time spent, hardware speed, or programming language traits aren't taken into account. The book "Data Structures and Algorithms in C++" by Adam Drozdek can provide more details about it.
Last week I read about great concepts such as cache locality and pipelining in a CPU. Although these concepts are easy to understand, I have two questions. Suppose one can choose between a vector of objects or a vector of pointers to objects (as in this question).
Then an argument for using pointers is that shuffling larger objects may be expensive. However, I'm not able to find out when I should call an object large. Is an object of several bytes already large?
An argument against the pointers is the loss of cache locality. Will it help if one uses two vectors where the first one contains the objects and will not be reordered and the second one contains pointers to these objects? Say that we have a vector of 200 objects and create a vector with pointers to these objects and then randomly shuffle the last vector. Is the cache locality then lost if we loop over the vector with pointers?
This last scenario happens a lot in my programs where I have City objects and then have around 200 vectors of pointers to these Cities. To avoid having 200 instances of each City I use a vector of pointers instead of a vector of Cities.
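To make the scenario concrete, here is a minimal sketch of the layout described (the City fields are invented for illustration): one owning vector of objects that stays in place, plus a separate vector of pointers that gets shuffled.

#include <algorithm>
#include <random>
#include <string>
#include <vector>

struct City {
    std::string name;       // illustrative fields only
    double x = 0.0, y = 0.0;
};

int main() {
    // One owning container: the City objects themselves stay contiguous and in place.
    std::vector<City> cities(200);

    // A separate ordering expressed as pointers into the owning vector.
    // (Careful: these pointers are invalidated if `cities` is ever resized.)
    std::vector<City*> tour;
    tour.reserve(cities.size());
    for (City& c : cities) tour.push_back(&c);

    // Shuffling the pointers reorders the traversal, not the objects, so a loop
    // over `tour` now jumps around inside `cities`.
    std::mt19937 rng(42);
    std::shuffle(tour.begin(), tour.end(), rng);
}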
There is no simple answer to this question. You need to understand how your system interacts with regards to memory, what operations you do on the container, and which of those operations are "important". But by understanding the concepts and what affects what, you can get a better understanding of how things work. So here's some "discussion" on the subject.
"Cache locality" is largely about "keeping things in the cache". In other words, if you look at A, then B, and A is located close to B, they are probably getting loaded into the cache together.
If objects are large enough that they fill one or more cache lines (modern CPUs have cache lines of 64-128 bytes; mobile ones are sometimes smaller), the "next object in line" will not be in the cache anyway [1], so the cache locality of the "next element in the vector" is less important. The smaller the object is, the more of this effect you get - assuming you are accessing objects in the order they are stored. If you access elements at random, then other factors start to become important [2], and the cache locality is much less important.
On the other hand, as objects get larger, moving them within the vector (including growing, removing, inserting, as well as a "random shuffle") will be more time consuming, as more data has to be copied.
Of course, one further step is always needed to read from a pointer vs. reading an element directly in a vector, since the pointer itself needs to be "read" before we can get to the actual data in the pointee object. Again, this becomes more important when random-accessing things.
I always start with "whatever is simplest" (which depends on the overall construct of the code, e.g. sometimes it's easier to create a vector of pointers because you have to dynamically create the objects in the first place). Most of the code in a system is not performance critical anyway, so why worry about its performance - just get it working and leave it be if it doesn't turn up in your performance measurements.
Of course, also, if you are doing a lot of movement of objects in a container, maybe vector isn't the best container. That's why there are multiple container variants - vector, list, map, tree, deque - as they have different characteristics with regards to their access and insert/remove as well as characteristics for linearly walking the data.
Oh, and in your example, you talk of 200 city objects - well, they are probably going to all fit in the cache of any modern CPU anyways. So stick them in a vector. Unless a city contains a list of every individual living in the city... But that probably should be a vector (or other container object) in itself.
As an experiment, make a program that does the same operations on a std::vector<int> and a std::vector<int*> [such as filling with random numbers, then sorting the elements], then make an object that is large [stick some array of integers in there, or some such] but still has one integer member, so that you can do the very same operations on it. Vary the size of the object stored, and see how it behaves: where, on YOUR system, is the benefit of having pointers over having plain objects? Of course, also vary the number of elements, to see what effect that has.
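A rough sketch of that kind of experiment (the element count is arbitrary and this is not a rigorous benchmark; std::unique_ptr<int> stands in for int* so the example cleans up after itself):

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <memory>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t n = 1'000'000;   // arbitrary size; vary it on your system
    std::mt19937 rng(1234);
    std::uniform_int_distribution<int> dist(0, 1'000'000);

    // Plain values, stored contiguously.
    std::vector<int> values(n);
    for (int& v : values) v = dist(rng);

    // Pointers to individually allocated values, spread around the heap.
    std::vector<std::unique_ptr<int>> pointers;
    pointers.reserve(n);
    for (std::size_t i = 0; i < n; ++i) pointers.push_back(std::make_unique<int>(dist(rng)));

    auto t0 = std::chrono::steady_clock::now();
    std::sort(values.begin(), values.end());
    auto t1 = std::chrono::steady_clock::now();
    std::sort(pointers.begin(), pointers.end(),
              [](const auto& a, const auto& b) { return *a < *b; });
    auto t2 = std::chrono::steady_clock::now();

    using ms = std::chrono::duration<double, std::milli>;
    std::printf("sort values:   %.2f ms\n", ms(t1 - t0).count());
    std::printf("sort pointers: %.2f ms\n", ms(t2 - t1).count());
}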
[1] Well, modern processors use cache-prefetching, which MAY load "next data" into the cache speculatively, but we certainly can't rely on this.
[2] An extreme case of this is a telephone exchange with a large number of subscribers (millions). When placing a call, the caller and callee are looked up in a table. But the chance of either caller or callee being in the cache is nearly zero, because (assuming we're dealing with a large city, say London) the number of calls placed and received every second is quite large. So caches become useless, and it gets worse, because the processor also caches the page-table entries, and they are also, most likely, out of date. For these sort of applications, the CPU designers have "huge pages", which means that the memory is split into 1GB pages instead of the usual 4K or 2MB pages that have been around for a while. This reduces the amount of memory reading needed before "we get to the right place". Of course, the same applies to various other "large database, unpredictable pattern" - airlines, facebook, stackoverflow all have these sort of problems.
I would like to know what the best practice is for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays of variable length. The focus is on performance, but I also need to be able to handle changing the length of an individual data set during runtime without too much overhead.
Note: I know this is a somewhat lengthy question, but I have looked around quite a lot and could not find a solution or example which describes the problem at hand with sufficient accuracy.
Background
The context is a computational fluid dynamics (CFD) code that is based on the discontinuous Galerkin spectral element method (DGSEM) (cf. Kopriva (2009), Implementing Spectral Methods for Partial Differential Equations). For the sake of simplicity, let us assume a 2D data layout (it is in fact in three dimensions, but the extension from 2D to 3D should be straightforward).
I have a grid that consists of K square elements k (k = 0,...,K-1) that can be of different (physical) sizes. Within each grid element (or "cell") k, I have N_k^2 data points. N_k is the number of data points in each dimension, and can vary between different grid cells.
At each data point n_k,i (where i = 0,...,N_k^2-1) I have to store an array of solution values, which has the same length nVars in the whole domain (i.e. everywhere), and which does not change during runtime.
Dimensions and changes
The number of grid cells K is of O(10^5) to O(10^6) and can change during runtime.
The number of data points N_k in each grid cell is between 2 and 8 and can change during runtime (and may be different for different cells).
The number of variables nVars stored at each grid point is around 5 to 10 and cannot change during runtime (it is also the same for every grid cell).
Requirements
Performance is the key issue here. I need to be able to regularly iterate in an ordered fashion over all grid points of all cells in an efficient manner (i.e. without too many cache misses). Generally, K and N_k do not change very often during the simulation, so for example a large contiguous block of memory for all cells and data points could be an option.
However, I do need to be able to refine or coarsen the grid (i.e. delete cells and create new ones, the new ones may be appended to the end) during runtime. I also need to be able to change the approximation order N_k, so the number of data points I store for each cell can change during runtime as well.
Conclusion
Any input is appreciated. If you have experience yourself, or just know a few good resources that I could look at, please let me know. However, while the solution will be crucial to the performance of the final program, it is just one of many problems, so the solution needs to be of an applied nature and not purely academic.
Should this be the wrong venue to ask this question, please let me know what a more suitable place would be.
Often, these sorts of dynamic mesh structures can be very tricky to deal with efficiently, but in block-structured adaptive mesh refinement codes (common in astrophysics, where complex geometries aren't important) or your spectral element code where you have large block sizes, it is often much less of an issue. You have so much work to do per block/element (with at least 10^5 cells x 2 points/cell in your case) that the cost of switching between blocks is comparatively minor.
Keep in mind, too, that you generally can't do much of the hard work on each element or block until a substantial amount of that block's data is already in cache. You're already going to have had to flush most of block N's data out of cache before getting much work done on block N+1 anyway. (There might be some operations in your code which are exceptions to this, but those are probably not the ones where you're spending much time anyway, cache or no cache, because there's not a lot of data reuse - e.g., elementwise operations on cell values.) So keeping the blocks/elements beside each other isn't necessarily a huge deal; on the other hand, you definitely want the blocks/elements to be themselves contiguous.
Also notice that you can move blocks around to keep them contiguous as things get resized, but not only are all those memory copies going to wipe your cache, the copies themselves get very expensive. If your problem is filling a significant fraction of memory (and aren't we always?), say 1 GB, and you have to move 20% of that around after a refinement to make things contiguous again, that's 0.2 GB (read + write) / ~20 GB/s ~ 20 ms, compared to reloading (say) 16k cache lines at ~100 ns each ~ 1.5 ms. And your cache is trashed after the shuffle anyway. This might still be worth doing if you knew that you were going to do the refinement/derefinement very seldom.
But as a practical matter, most adaptive mesh codes in astrophysical fluid dynamics (where I know the codes well enough to say) simply maintain a list of blocks and their metadata and don't worry about their contiguity. YMMV of course. My suggestion would be - before spending too much time crafting the perfect data structure - to first just test the operation on two elements, twice: first with the elements in order, computing on them 1-2, and then doing the operation in the "wrong" order, 2-1, timing the two computations several times.
For each cell, store the offset in which to find the cell data in a contiguous array. This offset mapping is very efficient and widely used. You can reorder the cells for cache reuse in traversals. When the order or number of cells changes, create a new array and interpolate, then throw away the old arrays. This storage is much better for external analysis because operations like inner products in Krylov methods and stages in Runge-Kutta methods can be managed without reference to the mesh. It also requires minimal memory per vector (e.g. in Krylov bases and with time integration).
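A minimal sketch of the offset mapping described above (the struct name and field names are invented for illustration): per-cell offsets into one contiguous value array, recomputed whenever cells are added, removed or re-ordered.

#include <cstddef>
#include <vector>

// Illustrative only: each cell k stores nPoints[k] * nVars doubles, packed back
// to back in `values`; offset[k] says where cell k's data starts.
struct Mesh {
    std::size_t nVars = 5;                // fixed during the run
    std::vector<std::size_t> nPoints;     // data points per cell, may differ per cell
    std::vector<std::size_t> offset;      // start of each cell's data in `values`
    std::vector<double> values;           // one contiguous block for all cells

    // Recompute offsets and allocate storage after cells were added, removed or
    // re-ordered; interpolating the old data into the new layout (as described
    // above) is left out of this sketch.
    void rebuildOffsets() {
        offset.resize(nPoints.size());
        std::size_t pos = 0;
        for (std::size_t k = 0; k < nPoints.size(); ++k) {
            offset[k] = pos;
            pos += nPoints[k] * nVars;
        }
        values.assign(pos, 0.0);
    }

    // Variable v at data point p of cell k.
    double& at(std::size_t k, std::size_t p, std::size_t v) {
        return values[offset[k] + p * nVars + v];
    }
};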
I need to deal with large arrays of float numbers (>200,000 numbers) and perform some maths with these arrays.
What do you suggest to treat these arrays so that I do not get any stack overflow problems?
UPDATE: I want to do simple and complex maths (sums, products, sin, cos, arctan) operations.
Plain numerical data you need to sequentially operate on?
std::valarray<double>
If profiling shows this is slowing you down, look for ways to make it faster by
std::valarray<double>::resize()
(yes, there's no reserve(), unfortunately).
Why std::valarray<double> for numerical data? If you want to perform an operation on every element, just call
std::valarray<double>::apply(somefunction)
See a C++ reference for more information.
If you want to be able to reserve(), you'll need std::vector, which is also fine but doesn't have overloads for the math functions you may want to use.
EDIT: This is of course assuming you have enough memory to fit all your arrays into std::valarrays. If not, you should split the 200,000 numbers up so that only a portion of them is in memory at the same time.
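A small sketch of the kind of usage meant here (the element count comes from the question; the operations are arbitrary examples):

#include <cmath>
#include <cstddef>
#include <cstdio>
#include <valarray>

int main() {
    const std::size_t n = 200000;               // size from the question
    std::valarray<double> a(1.0, n);            // n elements, all 1.0
    std::valarray<double> b(2.0, n);

    // Whole-array maths without writing loops: these overloads ship with <valarray>.
    std::valarray<double> sums     = a + b;
    std::valarray<double> products = a * b;
    std::valarray<double> sines    = std::sin(a);

    // apply() maps a plain function (or captureless lambda) over every element.
    std::valarray<double> scaled = a.apply([](double x) { return 10.0 * x; });

    std::printf("%f %f %f %f\n", sums[0], products[0], sines[0], scaled[0]);
}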
If your data are sparse, you could probably use boost's sparse_matrix http://www.boost.org/doc/libs/1_41_0/libs/numeric/ublas/doc/matrix_sparse.htm to represent your data structure and significantly reduce memory requirements.
Otherwise I would suggest looking into ways that you can split the data into chunks and work on one chunk in memory, then store that state to file and repeat.
I suggest you process them 10,000 at a time and then sum everything together?
It depends on what operations you are doing.
Depends on what you want to do with them.
Also, as chris said in the comments, dynamically allocate memory for your array (to get memory from the heap) and avoid using it as a local variable (which would be allocated on the stack).
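A minimal sketch of that advice, using the 200,000 floats from the question; the stack-size figures in the comment are common defaults, not guarantees:

#include <vector>

void risky() {
    // ~800 KB of floats on the stack: may exceed the default stack size
    // (commonly about 1 MB on Windows, 8 MB on Linux) once a few frames deep.
    float data[200000] = {};
    (void)data;
}

void safer() {
    // The vector object itself is tiny; the 200,000 floats live on the heap.
    std::vector<float> data(200000, 0.0f);
    (void)data;
}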
If execution speed is important, should I use this,
struct myType {
float dim[3];
};
myType arr[size];
or should I use a 2D array, as in arr[size][index]?
It does not matter. The compiler will produce the exact same code regardless in almost all cases. The only difference would be if the struct induced some kind of padding, but given that you have floats that seems unlikely.
It depends on your use case. If you typically use the three dimensions together, the struct organization can be reasonable. Especially when using the dimensions individually, the array layout is most likely to give better performance: contemporary processors don't just load individual words but rather whole cache lines. If only part of the data is used, words are loaded which aren't used.
The array layout is also more amenable to parallel processing, e.g. using SIMD operations. This is unfortunate to some extent because the object layout is generally different. Actually, the two layouts you are using are probably similar, but if you change things to float array[3][size], things become different (see the sketch below).
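A brief sketch of the two layouts (the size and the loops are only for illustration): with float array[3][size] each dimension is one contiguous run, which is what per-dimension and SIMD-style processing prefer.

constexpr int size = 1024;

// "Array of structures": x, y, z of one element sit next to each other.
struct myType { float dim[3]; };
myType aos[size];

// "Structure of arrays": all dim[0] values are contiguous, then all dim[1], then all dim[2].
float soa[3][size];

// Summing a single dimension touches contiguous memory only in the SoA case.
float sumDim0_aos() {
    float s = 0.0f;
    for (int i = 0; i < size; ++i) s += aos[i].dim[0];   // stride of sizeof(myType)
    return s;
}
float sumDim0_soa() {
    float s = 0.0f;
    for (int i = 0; i < size; ++i) s += soa[0][i];       // stride of sizeof(float)
    return s;
}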
No difference at all. Pick what is more readable to you.
Unless you're working on some weird platform, the memory layout of those two will be the same -- and for the compiler this is what counts most.
The only difference is when you pass something to a function.
When you use the array solution, you never copy the array contents; you just pass the array's address.
With the struct solution, the struct will be copied unless you explicitly pass the struct's address.
One other thing to keep in mind, which another poster mentioned: if dim will always have a size of 3 in the struct, but the collection really represents something like "Red, Green, Blue" or "X, Y, Z" or "Car, Truck, Boat", from a maintenance standpoint you might be better off breaking them out. That is, use something like
typedef struct VEHICLES
{
float fCar;
float fTruck;
float fBoat;
} Vehicles;
That way when you come back in two years to debug it, or someone else has to look at it, they will not have to guess what dim[0], dim[1] and dim[2] refer to.
You might want to map the 2D array to 1D. It might be more cache-friendly.
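For example, a minimal row-major mapping might look like this (the names are illustrative):

#include <cstddef>
#include <vector>

// Row-major mapping: element (r, c) of a rows x cols grid lives at r * cols + c.
std::vector<float> makeGrid(std::size_t rows, std::size_t cols) {
    return std::vector<float>(rows * cols, 0.0f);
}

inline std::size_t flatIndex(std::size_t r, std::size_t c, std::size_t cols) {
    return r * cols + c;
}

// Usage: auto g = makeGrid(100, 200);  g[flatIndex(5, 7, 200)] = 1.0f;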