Best way to stock 2xx millions of points - c++

So I need to have an array of Terrain (ground) which contains 2xx little terrain (ground). Each little terrain(ground) contains 1 000 000 floats points.
So I thought about something like
QVector<QVector<float> > terrain;
QVector<float> littleTerrain;
But at a moment there is a problem with all the allocations I can do .
So of course, I thought about
QVector<float*>
but it's necessary to delete all the pointers in the vector
So, using the smart pointer is possible here ? Otherwise, what is the best choice to do ?

float is 4 bytes and pointer is also 4 bytes in 32bit config so there is no point to hold pointers for your floats. Vectors can be used for you purpose but you should first reserve memory for your vectors. Also try to determined the limits of your vector implementation. Implementation can be different from library to library.

If you only wish to support 64 bit architectures, then certainly a list of vectors would work. It'll work very crappily since there's likely not enough physical RAM to back it all up.
On 32 bit architectures you don't have enough address space to keep it all in a simple data structure anyway.
So, what you need to do is a caching scheme where only a subset of this data is resident in the in-memory data structure at any time. To optimize the caching behavior, at all levels (L1, L2, L3, paging), you need to - and this is important -- keep spatially neighboring data in neighboring memory addresses.
One way of doing it would be to have all terrain subdivided into 32x32 unit squares. Note that 32*32*sizeof(float) == 4096 is the small page size on most systems. This square should be further subdivided in 4x4 unit squares. Note that 4*4*sizeof(float) == 64 and is the cache line size on many systems.

The overhead will be proportional to the number of containers you have - a few hundred containers, a few kilobytes in total. Nothing at all, in other words.
The important thing to do is to reserve sufficient memory for the number of points per "littleTerrain". Otherwise Qt will make sure it fits, but probably overallocate by 50% or so.

Related

Why does the capacity of a vector increase by a factor of 1.5? [duplicate]

C++ has std::vector and Java has ArrayList, and many other languages have their own form of dynamically allocated array. When a dynamic array runs out of space, it gets reallocated into a larger area and the old values are copied into the new array. A question central to the performance of such an array is how fast the array grows in size. If you always only grow large enough to fit the current push, you'll end up reallocating every time. So it makes sense to double the array size, or multiply it by say 1.5x.
Is there an ideal growth factor? 2x? 1.5x? By ideal I mean mathematically justified, best balancing performance and wasted memory. I realize that theoretically, given that your application could have any potential distribution of pushes that this is somewhat application dependent. But I'm curious to know if there's a value that's "usually" best, or is considered best within some rigorous constraint.
I've heard there's a paper on this somewhere, but I've been unable to find it.
I remember reading many years ago why 1.5 is preferred over two, at least as applied to C++ (this probably doesn't apply to managed languages, where the runtime system can relocate objects at will).
The reasoning is this:
Say you start with a 16-byte allocation.
When you need more, you allocate 32 bytes, then free up 16 bytes. This leaves a 16-byte hole in memory.
When you need more, you allocate 64 bytes, freeing up the 32 bytes. This leaves a 48-byte hole (if the 16 and 32 were adjacent).
When you need more, you allocate 128 bytes, freeing up the 64 bytes. This leaves a 112-byte hole (assuming all previous allocations are adjacent).
And so and and so forth.
The idea is that, with a 2x expansion, there is no point in time that the resulting hole is ever going to be large enough to reuse for the next allocation. Using a 1.5x allocation, we have this instead:
Start with 16 bytes.
When you need more, allocate 24 bytes, then free up the 16, leaving a 16-byte hole.
When you need more, allocate 36 bytes, then free up the 24, leaving a 40-byte hole.
When you need more, allocate 54 bytes, then free up the 36, leaving a 76-byte hole.
When you need more, allocate 81 bytes, then free up the 54, leaving a 130-byte hole.
When you need more, use 122 bytes (rounding up) from the 130-byte hole.
In the limit as n → ∞, it would be the golden ratio: ϕ = 1.618...
For finite n, you want something close, like 1.5.
The reason is that you want to be able to reuse older memory blocks, to take advantage of caching and avoid constantly making the OS give you more memory pages. The equation you'd solve to ensure that a subsequent allocation can re-use all prior blocks reduces to xn − 1 − 1 = xn + 1 − xn, whose solution approaches x = ϕ for large n. In practice n is finite and you'll want to be able to reusing the last few blocks every few allocations, and so 1.5 is great for ensuring that.
(See the link for a more detailed explanation.)
It will entirely depend on the use case. Do you care more about the time wasted copying data around (and reallocating arrays) or the extra memory? How long is the array going to last? If it's not going to be around for long, using a bigger buffer may well be a good idea - the penalty is short-lived. If it's going to hang around (e.g. in Java, going into older and older generations) that's obviously more of a penalty.
There's no such thing as an "ideal growth factor." It's not just theoretically application dependent, it's definitely application dependent.
2 is a pretty common growth factor - I'm pretty sure that's what ArrayList and List<T> in .NET uses. ArrayList<T> in Java uses 1.5.
EDIT: As Erich points out, Dictionary<,> in .NET uses "double the size then increase to the next prime number" so that hash values can be distributed reasonably between buckets. (I'm sure I've recently seen documentation suggesting that primes aren't actually that great for distributing hash buckets, but that's an argument for another answer.)
One approach when answering questions like this is to just "cheat" and look at what popular libraries do, under the assumption that a widely used library is, at the very least, not doing something horrible.
So just checking very quickly, Ruby (1.9.1-p129) appears to use 1.5x when appending to an array, and Python (2.6.2) uses 1.125x plus a constant (in Objects/listobject.c):
/* This over-allocates proportional to the list size, making room
* for additional growth. The over-allocation is mild, but is
* enough to give linear-time amortized behavior over a long
* sequence of appends() in the presence of a poorly-performing
* system realloc().
* The growth pattern is: 0, 4, 8, 16, 25, 35, 46, 58, 72, 88, ...
*/
new_allocated = (newsize >> 3) + (newsize < 9 ? 3 : 6);
/* check for integer overflow */
if (new_allocated > PY_SIZE_MAX - newsize) {
PyErr_NoMemory();
return -1;
} else {
new_allocated += newsize;
}
newsize above is the number of elements in the array. Note well that newsize is added to new_allocated, so the expression with the bitshifts and ternary operator is really just calculating the over-allocation.
Let's say you grow the array size by x. So assume you start with size T. The next time you grow the array its size will be T*x. Then it will be T*x^2 and so on.
If your goal is to be able to reuse the memory that has been created before, then you want to make sure the new memory you allocate is less than the sum of previous memory you deallocated. Therefore, we have this inequality:
T*x^n <= T + T*x + T*x^2 + ... + T*x^(n-2)
We can remove T from both sides. So we get this:
x^n <= 1 + x + x^2 + ... + x^(n-2)
Informally, what we say is that at nth allocation, we want our all previously deallocated memory to be greater than or equal to the memory need at the nth allocation so that we can reuse the previously deallocated memory.
For instance, if we want to be able to do this at the 3rd step (i.e., n=3), then we have
x^3 <= 1 + x
This equation is true for all x such that 0 < x <= 1.3 (roughly)
See what x we get for different n's below:
n maximum-x (roughly)
3 1.3
4 1.4
5 1.53
6 1.57
7 1.59
22 1.61
Note that the growing factor has to be less than 2 since x^n > x^(n-2) + ... + x^2 + x + 1 for all x>=2.
Another two cents
Most computers have virtual memory! In the physical memory you can have random pages everywhere which are displayed as a single contiguous space in your program's virtual memory. The resolving of the indirection is done by the hardware. Virtual memory exhaustion was a problem on 32 bit systems, but it is really not a problem anymore. So filling the hole is not a concern anymore (except special environments). Since Windows 7 even Microsoft supports 64 bit without extra effort. # 2011
O(1) is reached with any r > 1 factor. Same mathematical proof works not only for 2 as parameter.
r = 1.5 can be calculated with old*3/2 so there is no need for floating point operations. (I say /2 because compilers will replace it with bit shifting in the generated assembly code if they see fit.)
MSVC went for r = 1.5, so there is at least one major compiler that does not use 2 as ratio.
As mentioned by someone 2 feels better than 8. And also 2 feels better than 1.1.
My feeling is that 1.5 is a good default. Other than that it depends on the specific case.
The top-voted and the accepted answer are both good, but neither answer the part of the question asking for a "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory". (The second-top-voted answer does try to answer this part of the question, but its reasoning is confused.)
The question perfectly identifies the 2 considerations that have to be balanced, performance and wasted memory. If you choose a growth rate too low, performance suffers because you'll run out of extra space too quickly and have to reallocate too frequently. If you choose a growth rate too high, like 2x, you'll waste memory because you'll never be able to reuse old memory blocks.
In particular, if you do the math1 you'll find that the upper limit on the growth rate is the golden ratio ϕ = 1.618… . Growth rate larger than ϕ (like 2x) mean that you'll never be able to reuse old memory blocks. Growth rates only slightly less than ϕ mean you won't be able to reuse old memory blocks until after many many reallocations, during which time you'll be wasting memory. So you want to be as far below ϕ as you can get without sacrificing too much performance.
Therefore I'd suggest these candidates for "mathematically justified" "ideal growth rate", "best balancing performance and wasted memory":
≈1.466x (the solution to x4=1+x+x2) allows memory reuse after just 3 reallocations, one sooner than 1.5x allows, while reallocating only slightly more frequently
≈1.534x (the solution to x5=1+x+x2+x3) allows memory reuse after 4 reallocations, same as 1.5x, while reallocating slightly less frequently for improved performance
≈1.570x (the solution to x6=1+x+x2+x3+x4) only allows memory reuse after 5 reallocations, but will reallocate even less infrequently for even further improved performance (barely)
Clearly there's some diminishing returns there, so I think the global optimum is probably among those. Also, note that 1.5x is a great approximation to whatever the global optimum actually is, and has the advantage being extremely simple.
1 Credits to #user541686 for this excellent source.
It really depends. Some people analyze common usage cases to find the optimal number.
I've seen 1.5x 2.0x phi x, and power of 2 used before.
If you have a distribution over array lengths, and you have a utility function that says how much you like wasting space vs. wasting time, then you can definitely choose an optimal resizing (and initial sizing) strategy.
The reason the simple constant multiple is used, is obviously so that each append has amortized constant time. But that doesn't mean you can't use a different (larger) ratio for small sizes.
In Scala, you can override loadFactor for the standard library hash tables with a function that looks at the current size. Oddly, the resizable arrays just double, which is what most people do in practice.
I don't know of any doubling (or 1.5*ing) arrays that actually catch out of memory errors and grow less in that case. It seems that if you had a huge single array, you'd want to do that.
I'd further add that if you're keeping the resizable arrays around long enough, and you favor space over time, it might make sense to dramatically overallocate (for most cases) initially and then reallocate to exactly the right size when you're done.
I recently was fascinated by the experimental data I've got on the wasted memory aspect of things. The chart below is showing the "overhead factor" calculated as the amount of overhead space divided by the useful space, the x-axis shows a growth factor. I'm yet to find a good explanation/model of what it reveals.
Simulation snippet: https://gist.github.com/gubenkoved/7cd3f0cb36da56c219ff049e4518a4bd.
Neither shape nor the absolute values that simulation reveals are something I've expected.
Higher-resolution chart showing dependency on the max useful data size is here: https://i.stack.imgur.com/Ld2yJ.png.
UPDATE. After pondering this more, I've finally come up with the correct model to explain the simulation data, and hopefully, it matches experimental data nicely. The formula is quite easy to infer simply by looking at the size of the array that we would need to have for a given amount of elements we need to contain.
Referenced earlier GitHub gist was updated to include calculations using scipy.integrate for numerical integration that allows creating the plot below which verifies the experimental data pretty nicely.
UPDATE 2. One should however keep in mind that what we model/emulate there mostly has to do with the Virtual Memory, meaning the over-allocation overheads can be left entirely on the Virtual Memory territory as physical memory footprint is only incurred when we first access a page of Virtual Memory, so it's possible to malloc a big chunk of memory, but until we first access the pages all we do is reserving virtual address space. I've updated the GitHub gist with CPP program that has a very basic dynamic array implementation that allows changing the growth factor and the Python snippet that runs it multiple times to gather the "real" data. Please see the final graph below.
The conclusion there could be that for x64 environments where virtual address space is not a limiting factor there could be really little to no difference in terms of the Physical Memory footprint between different growth factors. Additionally, as far as Virtual Memory is concerned the model above seems to make pretty good predictions!
Simulation snippet was built with g++.exe simulator.cpp -o simulator.exe on Windows 10 (build 19043), g++ version is below.
g++.exe (x86_64-posix-seh-rev0, Built by MinGW-W64 project) 8.1.0
PS. Note that the end result is implementation-specific. Depending on implementation details dynamic array might or might not access the memory outside the "useful" boundaries. Some implementations would use memset to zero-initialize POD elements for whole capacity -- this will cause virtual memory page translated into physical. However, std::vector implementation on a referenced above compiler does not seem to do that and so behaves as per mock dynamic array in the snippet -- meaning overhead is incurred on the Virtual Memory side, and negligible on the Physical Memory.
I agree with Jon Skeet, even my theorycrafter friend insists that this can be proven to be O(1) when setting the factor to 2x.
The ratio between cpu time and memory is different on each machine, and so the factor will vary just as much. If you have a machine with gigabytes of ram, and a slow CPU, copying the elements to a new array is a lot more expensive than on a fast machine, which might in turn have less memory. It's a question that can be answered in theory, for a uniform computer, which in real scenarios doesnt help you at all.
I know it is an old question, but there are several things that everyone seems to be missing.
First, this is multiplication by 2: size << 1. This is multiplication by anything between 1 and 2: int(float(size) * x), where x is the number, the * is floating point math, and the processor has to run additional instructions for casting between float and int. In other words, at the machine level, doubling takes a single, very fast instruction to find the new size. Multiplying by something between 1 and 2 requires at least one instruction to cast size to a float, one instruction to multiply (which is float multiplication, so it probably takes at least twice as many cycles, if not 4 or even 8 times as many), and one instruction to cast back to int, and that assumes that your platform can perform float math on the general purpose registers, instead of requiring the use of special registers. In short, you should expect the math for each allocation to take at least 10 times as long as a simple left shift. If you are copying a lot of data during the reallocation though, this might not make much of a difference.
Second, and probably the big kicker: Everyone seems to assume that the memory that is being freed is both contiguous with itself, as well as contiguous with the newly allocated memory. Unless you are pre-allocating all of the memory yourself and then using it as a pool, this is almost certainly not the case. The OS might occasionally end up doing this, but most of the time, there is going to be enough free space fragmentation that any half decent memory management system will be able to find a small hole where your memory will just fit. Once you get to really bit chunks, you are more likely to end up with contiguous pieces, but by then, your allocations are big enough that you are not doing them frequently enough for it to matter anymore. In short, it is fun to imagine that using some ideal number will allow the most efficient use of free memory space, but in reality, it is not going to happen unless your program is running on bare metal (as in, there is no OS underneath it making all of the decisions).
My answer to the question? Nope, there is no ideal number. It is so application specific that no one really even tries. If your goal is ideal memory usage, you are pretty much out of luck. For performance, less frequent allocations are better, but if we went just with that, we could multiply by 4 or even 8! Of course, when Firefox jumps from using 1GB to 8GB in one shot, people are going to complain, so that does not even make sense. Here are some rules of thumb I would go by though:
If you cannot optimize memory usage, at least don't waste processor cycles. Multiplying by 2 is at least an order of magnitude faster than doing floating point math. It might not make a huge difference, but it will make some difference at least (especially early on, during the more frequent and smaller allocations).
Don't overthink it. If you just spent 4 hours trying to figure out how to do something that has already been done, you just wasted your time. Totally honestly, if there was a better option than *2, it would have been done in the C++ vector class (and many other places) decades ago.
Lastly, if you really want to optimize, don't sweat the small stuff. Now days, no one cares about 4KB of memory being wasted, unless they are working on embedded systems. When you get to 1GB of objects that are between 1MB and 10MB each, doubling is probably way too much (I mean, that is between 100 and 1,000 objects). If you can estimate expected expansion rate, you can level it out to a linear growth rate at a certain point. If you expect around 10 objects per minute, then growing at 5 to 10 object sizes per step (once every 30 seconds to a minute) is probably fine.
What it all comes down to is, don't over think it, optimize what you can, and customize to your application (and platform) if you must.

How to avoid wasting memory on 64-bit pointers

I'm hoping for some high-level advice on how to approach a design I'm about to undertake.
The straightforward approach to my problem will result in millions and millions of pointers. On a 64-bit system these will presumably be 64-bit pointers. But as far as my application is concerned, I don't think I need more than a 32-bit address space. I would still like for the system to take advantage of 64-bit processor arithmetic, however (assuming that is what I get by running on a 64-bit system).
Further background
I'm implementing a tree-like data structure where each "node" contains an 8-byte payload, but also needs pointers to four neighboring nodes (parent, left-child, middle-child, right-child). On a 64-bit system using 64-bit pointers, this amounts to 32 bytes just for linking an 8-byte payload into the tree -- a "linking overhead" of 400%.
The data structure will contain millions of these nodes, but my application will not need much memory beyond that, so all these 64-bit pointers seem wasteful. What to do? Is there a way to use 32-bit pointers on a 64-bit system?
I've considered
Storing the payloads in an array in a way such that an index implies (and is implied by) a "tree address" and neighbors of a given index can be calculated with simple arithmetic on that index. Unfortunately this requires me to size the array according to the maximum depth of the tree, which I don't know beforehand, and it would probably incur even greater memory overhead due to empty node elements in the lower levels because not all branches of the tree go to the same depth.
Storing nodes in an array large enough to hold them all, and then using indices instead of pointers to link neighbors. AFAIK the main disadvantage here would be that each node would need the array's base address in order to find its neighbors. So they either need to store it (a million times over) or it needs to be passed around with every function call. I don't like this.
Assuming that the most-significant 32 bits of all these pointers are zero, throwing an exception if they aren't, and storing only the least-significant 32 bits. So the required pointer can be reconstructed on demand. The system is likely to use more than 4GB, but the process will never. I'm just assuming that pointers are offset from a process base-address and have no idea how safe (if at all) this would be across the common platforms (Windows, Linux, OSX).
Storing the difference between 64-bit this and the 64-bit pointer to the neighbor, assuming that this difference will be within the range of int32_t (and throwing if it isn't). Then any node can find it's neighbors by adding that offset to this.
Any advice? Regarding that last idea (which I currently feel is my best candidate) can I assume that in a process that uses less than 2GB, dynamically allocated objects will be within 2 GB of each other? Or not at all necessarily?
Combining ideas 2 and 4 from the question, put all the nodes into a big array, and store e.g. int32_t neighborOffset = neighborIndex - thisIndex. Then you can get the neighbor from *(this+neighborOffset). This gets rid of the disadvantages/assumptions of both 2 and 4.
If on Linux, you might consider using (and compiling for) the x32 ABI. IMHO, this is the preferred solution for your issues.
Alternatively, don't use pointers, but indexes into a huge array (or an std::vector in C++) which could be a global or static variable. Manage a single huge heap-allocated array of nodes, and use indexes of nodes instead of pointers to nodes. So like your §2, but since the array is a global or static data you won't need to pass it everywhere.
(I guess that an optimizing compiler would be able to generate clever code, which could be nearly as efficient as using pointers)
You can remove the disadvantage of (2) by exploiting the alignment of memory regions to find the base address of the the array "automatically". For example, if you want to support up to 4 GB of nodes, ensure your node array starts at a 4GB boundary.
Then within a node with address addr, you can determine the address of another at index as addr & -(1UL << 32) + index.
This is kind of the "absolute" variant of the accepted solution which is "relative". One advantage of this solution is that an index always has the same meaning within a tree, whereas in the relative solution you really need the (node_address, index) pair to interpret an index (of course, you can also use the absolute indexes in the relative scenarios where it is useful). It means that when you duplicate a node, you don't need to adjust any index values it contains.
The "relative" solution also loses 1 effective index bit relative to this solution in its index since it needs to store a signed offset, so with a 32-bit index, you could only support 2^31 nodes (assuming full compression of trailing zero bits, otherwise it is only 2^31 bytes of nodes).
You can also store the base tree structure (e.g,. the pointer to the root and whatever bookkeeping your have outside of the nodes themselves) right at the 4GB address which means that any node can jump to the associated base structure without traversing all the parent pointers or whatever.
Finally, you can also exploit this alignment idea within the tree itself to "implicitly" store other pointers. For example, perhaps the parent node is stored at an N-byte aligned boundary, and then all children are stored in the same N-byte block so they know their parent "implicitly". How feasible that is depends on how dynamic your tree is, how much the fan-out varies, etc.
You can accomplish this kind of thing by writing your own allocator that uses mmap to allocate suitably aligned blocks (usually just reserve a huge amount of virtual address space and then allocate blocks of it as needed) - ether via the hint parameter or just by reserving a big enough region that you are guaranteed to get the alignment you want somewhere in the region. The need to mess around with allocators is the primary disadvantage compared to the accepted solution, but if this is the main data structure in your program it might be worth it. When you control the allocator you have other advantages too: if you know all your nodes are allocated on an 2^N-byte boundary you can "compress" your indexes further since you know the low N bits will always be zero, so with a 32-bit index you could actually store 2^(32+5) = 2^37 nodes if you knew they were 32-byte aligned.
These kind of tricks are really only feasible in 64-bit programs, with the huge amount of virtual address space available, so in a way 64-bit giveth and also taketh away.
Your assertion that a 64 bit system necessarily has to have 64 bit pointers is not correct. The C++ standard makes no such assertion.
In fact, different pointer types can be different sizes: sizeof(double*) might not be the same as sizeof(int*).
Short answer: don't make any assumptions about the sizes of any C++ pointer.
Sounds like to me that you want to build you own memory management framework.

C++ vector out of memory

I have a very large vector (millions of entries 1024 bytes each). I am exceeding the maximum size of the vector (getting a bad memory alloc exception). I am doing a recursive operation over the vector of items which requires accessing other elements in the vector. The operations need to be done quickly. I am trying to avoid writing to disk for speed reasons. Is there any other way to store this data that would not require writing to disk? If I have to write the data to disk, what would be the most ideal way to do it>
edit for a few more details.
The operations that I am performing on the data set is generating a string recursively based on other data points in the vector. The data is sorted when it is read in. Data sets ranging from 50,000 to 50,000,0000.
The easiest way to solve this problem is to use STXXL. It's a reimplementation of the STL for large structures that transparently writes to disk when the data won't fit in memory.
Your problem cannot be solved as stated and clarified in the comments.
You have requested a way to have a contiguous in-memory buffer of 50,000,000 entries of size 1024 on a 32 bit system.
A 32 bit system has only 4294967296 bytes of addressable memory. You are asking for 51200000000 bytes of addressable memory, or 11.9 times the amount of memory address space on your system.
If you don't require that your data be contiguous and memory-addressable, if you don't require that your data all be in memory at once, or if you relax other requirements, there may be an answer to your problem. Ie, some OSs expose access to a non-memory space of values that corresponds to RAM (there where ways in 8 gig windows systems to use more than 4 gigs of total RAM) through some hacky interface or other.
But as stated, the answer is "no, you cannot do that".
Because your data must be contiguous, and you know how many elements you need to store, just create a std::vector and use the reserve() function to attempt to gain a contiguous block of memory of the required size.
There is very little overhead in storing a vector (just a few pointers to manage the beginning and end). This is as good as you'll be able to do.
If that fails:
add more memory to your machine (may not actually help, if you've run up against addressing or implementation constraints)
switch to a raw array
find a way to reduce the size of your elements
try to find a solution that can tackle the problem in small blocks
That is 1GB of memory (1024KB * 10^6 = 1MB * 10^3 = 1GB). Ideally for a 32 bit machine upto 4GB memory operations can be performed.
To answer your question, try first a normal malloc() call and allocate 1 GB of memory. This should be done without any error.
Also, please paste the exact error msg that you get while using the vector.

What is the most efficient (yet sufficiently flexible) way to store multi-dimensional variable-length data?

I would like to know what the best practice for efficiently storing (and subsequently accessing) sets of multi-dimensional data arrays with variable length. The focus is on performance, but I also need to be able to handle changing the length of an individual data set during runtime without too much overhead.
Note: I know this is a somewhat lengthy question, but I have looked around quite a lot and could not find a solution or example which describes the problem at hand with sufficient accuracy.
Background
The context is a computational fluid dynamics (CFD) code that is based on the discontinuous Galerkin spectral element method (DGSEM) (cf. Kopriva (2009), Implementing Spectral Methods for Partial Differential Equations). For the sake of simplicity, let us assume a 2D data layout (it is in fact in three dimensions, but the extension from 2D to 3D should be straightforward).
I have a grid that consists of K square elements k (k = 0,...,K-1) that can be of different (physical) sizes. Within each grid element (or "cell") k, I have N_k^2 data points. N_k is the number of data points in each dimension, and can vary between different grid cells.
At each data point n_k,i (where i = 0,...,N_k^2-1) I have to store an array of solution values, which has the same length nVars in the whole domain (i.e. everywhere), and which does not change during runtime.
Dimensions and changes
The number of grid cells K is of O(10^5) to O(10^6) and can change during runtime.
The number of data points N_k in each grid cell is between 2 and 8 and can change during runtime (and may be different for different cells).
The number of variables nVars stored at each grid point is around 5 to 10 and cannot change during runtime (it is also the same for every grid cell).
Requirements
Performance is the key issue here. I need to be able to regularly iterate in an ordered fashion over all grid points of all cells in an efficient manner (i.e. without too many cache misses). Generally, K and N_k do not change very often during the simulation, so for example a large contiguous block of memory for all cells and data points could be an option.
However, I do need to be able to refine or coarsen the grid (i.e. delete cells and create new ones, the new ones may be appended to the end) during runtime. I also need to be able to change the approximation order N_k, so the number of data points I store for each cell can change during runtime as well.
Conclusion
Any input is appreciated. If you have experience yourself, or just know a few good resources that I could look at, please let me know. However, while the solution will be crucial to the performance of the final program, it is just one of many problems, so the solution needs to be of an applied nature and not purely academic.
Should this be the wrong venue to ask this question, please let me know what a more suitable place would be.
Often, these sorts of dynamic mesh structures can be very tricky to deal with efficiently, but in block-structured adaptive mesh refinement codes (common in astrophysics, where complex geometries aren't important) or your spectral element code where you have large block sizes, it is often much less of an issue. You have so much work to do per block/element (with at least 10^5 cells x 2 points/cell in your case) that the cost of switching between blocks is comparitively minor.
Keep in mind, too, that you can't generally do too much of the hard work on each element or block until a substantial amount of that block's data is already in cache. You're already going to have to had flushed most of block N's data out of cache before getting much work done on block N+1's anyway. (There might be some operations in your code which are exceptions to this, but those are probably not the ones where you're spending much time anyway, cache or no cache, because there's not a lot of data reuse - eg, elementwise operations on cell values). So keeping each the blocks/elements beside each other isn't necessarily a huge deal; on the other hand, you definitely want the blocks/elements to be themselves contiguous.
Also notice that you can move blocks around to keep them contiguous as things get resized, but not only are all those memory copies also going to wipe your cache, but the memory copies themselves get very expensive. If your problem is filling a significant fraction of memory (and aren't we always?), say 1GB, and you have to move 20% of that around after a refinement to make things contiguous again, that's .2 GB (read + write) / ~20 GB/s ~ 20 ms compared to reloading (say) 16k cache lines at ~100ns each ~ 1.5 ms. And your cache is trashed after the shuffle anyway. This might still be worth doing if you knew that you were going to do the refinement/derefinement very seldom.
But as a practical matter, most adaptive mesh codes in astrophysical fluid dynamics (where I know the codes well enough to say) simply maintain a list of blocks and their metadata and don't worry about their contiguity. YMMV of course. My suggestion would be - before spending too much time crafting the perfect data structure - to first just test the operation on two elements, twice; the first, with the elements in order and computing on them 1-2, and the second, doing the operation in the "wrong" order, 2-1, and timing the two computations several times.
For each cell, store the offset in which to find the cell data in a contiguous array. This offset mapping is very efficient and widely used. You can reorder the cells for cache reuse in traversals. When the order or number of cells changes, create a new array and interpolate, then throw away the old arrays. This storage is much better for external analysis because operations like inner products in Krylov methods and stages in Runge-Kutta methods can be managed without reference to the mesh. It also requires minimal memory per vector (e.g. in Krylov bases and with time integration).

Layout of Pixel-data in Memory?

I'm writing a C++ library for an image format that is based on PNG. One stopping point for me is that I'm unsure as to how I ought to lay out the pixel data in memory; as far as I'm aware, there are two practical approaches:
An array of size (width * height); each pixel can be accessed by array[y*width + x].
An array of size (height), containing pointers to arrays of size (width).
The standard reference implementation for PNG (libpng) uses method 2 of the above, while I've seen others use method 1. Is one better than the other, or is each a method with its own pros and cons, to where a compromise must be made? Further, which format do most graphical display systems use (perhaps for ease of using the output of my library into other APIs)?
Off the top of my head:
The one thing that would make me choose #2 is the fact that your memory requirements are a little relaxed. If you were to go for #1, the system will need to be able to allocate height * width amount of contiguous memory. Whereas, in case of #2, it has the freedom to allocate smaller chunks of contiguous memory of size width (could as well be height) off of areas that are free. (When you factor in the channels per pixel, the #1 may fail for even moderately sized images.)
Further, it may be slightly better when swapping rows (or columns) if required for image manipulation purposes (pointer swap suffices).
The downside for #2 is of course an extra level of indirection that seeps in for every access and the array of pointers to be maintained. But this is hardly a matter given todays processor speed and memory.
The second downside for #2 is that the data isn't necessarily next to each other, which makes it harder for the processor the load the right memory pages into the cache.
The advantage of method 2 (cutting up the array in rows) is that you can perform memory operations in steps, e.g. resizing or shuffling the image without reallocating the entire chunk of memory at once. For really large images this may be an advantage.
The advantage of a single array is that you calculations are simpler, i.e. to go one row down you do
pos += width;
instead of having to reference pointers. For small to medium images this is probably faster. Unless you're dealing with images of hundreds of Mb, I would go with method 1.
I suspect that libpng does that (style 2) for a few possible reasons:
Avoid large allocations (as mentioned), and may ease handling VERY large PNGs, especially in systems without VM
(perhaps) allow for interleaved decode ala interleaved JPEG (if PNG supports that)
Ease of certain transformations (vertical flip) (unlikely)
Ease of scaling (insert or remove lines without needing a full second buffer, or widen/narrow lines) (unlikely but possible)
The problem with this approach (assuming each line is an allocation) is a LOT more allocation/free overhead, and perhaps encourage memory fragmentation.
Unless you have a good reason, use style 1 (single allocation), and perhaps round to a "good" boundary for the architecture you're using (could be 4, 8, 16 or perhaps even more bytes). Note that many library functions may look for style 1 with no padding - think about how you'll be using this and where you'll be passing them to.
Windows itself uses a variation of method 1. Each row of the image is padded to a multiple of 4 bytes, and the order of the colors is B,G,R instead of the more normal R,G,B. Also the first row of the buffer is the bottom row of the image.