I have a 2d integer array used for a tile map.
The size of the map is unknown and is read in from a file at runtime. Currently the biggest file is 2500 items (a 50x50 grid).
I have a working method of dynamic memory allocation from an earlier question, but people keep saying that it is a bad idea, so I have been wondering whether to just use a big array and not fill it all up when using a smaller map.
Do people know of any pros or cons to either solution? Any advice or personal opinions welcome.
C++, btw.
edit: all the maps are made by me so I can pick a max size.
Probably the easiest way is for example a std::vector<std::vector<int> > to allow it to be dynamically sized AND let the library do all the allocations for you. This will prevent accidentally leaking memory.
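To make that concrete, here is a minimal sketch of loading a map into a std::vector<std::vector<int> >. The file layout (two sizes followed by the tile ids) and the names loadTileMap and tileAt are assumptions made up for the example, not anything from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

using TileMap = std::vector<std::vector<int>>;

// Assumed (hypothetical) file layout: "rows cols" followed by rows*cols tile ids.
// Returns an empty map if the input is truncated or malformed.
TileMap loadTileMap(std::istream& in) {
    std::size_t rows = 0, cols = 0;
    if (!(in >> rows >> cols)) return {};
    TileMap map(rows, std::vector<int>(cols, 0));  // the library owns every allocation
    for (auto& row : map)
        for (int& tile : row)
            if (!(in >> tile)) return {};
    return map;
}

// Bounds-checked read; the check is only active in debug builds.
int tileAt(const TileMap& map, std::size_t row, std::size_t col) {
    assert(row < map.size() && col < map[row].size());
    return map[row][col];
}
```

Any std::istream works as the source, so the same function can read from a std::ifstream in the game and from a std::istringstream in a test. Nothing here needs a delete, which is the point of letting the library do the allocations.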
My preference would be to dynamically allocate. That way should you encounter a surprisingly large map you (hopefully) won't overflow if you've written it correctly, whereas with the fixed size your only option is to return an error and fail.
Presumably loading tile maps is a pretty infrequent operation. I'd also be willing to bet that you can't even measure a meaningful difference in speed between the two. Unless there is a measurable performance reduction, or you're actually hitting something else which is causing you problems, the statically sized one seems like a premature optimisation and is asking for trouble later on.
It depends entirely on requirements that you haven't stated :-)
If you want your app to be as blazingly fast as possible, with no ability to handle larger tile maps, then by all means just use a big array. For small PIC-based embedded systems this could be an ideal approach.
But, if you want your code to be robust, extensible, maintainable and generally suitable for a wider audience, use STL containers.
Or, if you just want to learn stuff, and have no concern about maintainability or performance, try and write your own dynamically allocating containers from scratch.
I believe the issue people refer to with dynamic allocation results from allocating randomly sized blocks of memory and not being able to effectively manage the random sized holes left when deallocated. If you're allocating fixed sized tiles then this may not be an issue.
I see quite a few people suggest allocating a large block of memory and managing it themselves. That might be an alternative solution.
Is allocating the memory dynamically a bottleneck in your program? Is it the cause of a performance issue? If not, then simply keep dynamic allocation, so that you can handle any map size. If yes, then maybe use some data structure that does not deallocate the memory it has allocated but rather reuses its old buffer and, if needed, reallocates more memory.
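One way such a buffer-reusing structure could look is sketched below. The class name ReusableTileBuffer and its interface are hypothetical. It relies on std::vector keeping its capacity when assign() is given a size that already fits, which is how the common standard library implementations behave.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder that keeps one tile buffer alive across map loads.
class ReusableTileBuffer {
public:
    // Prepares the buffer for a rows x cols map. Memory is only (re)allocated
    // when the new map is larger than anything loaded so far.
    void resetFor(std::size_t rows, std::size_t cols) {
        rows_ = rows;
        cols_ = cols;
        tiles_.assign(rows * cols, 0);  // reuses the existing capacity when it suffices
    }

    int& at(std::size_t row, std::size_t col) {
        assert(row < rows_ && col < cols_);
        return tiles_[row * cols_ + col];
    }

    std::size_t capacity() const { return tiles_.capacity(); }

private:
    std::size_t rows_ = 0, cols_ = 0;
    std::vector<int> tiles_;
};
```

Loading a 50x50 map and then a 10x10 map through the same object should leave the capacity at 2500 or more, so the second load does not go back to the allocator.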
Related
So I am currently trying to allocate dynamically a large array of elements in C++ (using "new"). Obviously, when "large" becomes too large (>4GB), my program crashes with a "bad_alloc" exception because it can't find such a large chunk of memory available.
I could allocate each element of my array separately and then store the pointers to these elements in a separate array. However, time is critical in my application so I would like to avoid as much cache misses as I can. I could also group some of these elements into blocks but what would be the best size for such a block?
My question is then: what is the best way (timewise) to dynamically allocate a large array of elements such that the elements do not have to be stored contiguously but must be accessible by index (using [])? This array is never going to be resized; no elements are going to be inserted into it or deleted from it.
I thought I could use std::deque for this purpose, knowing that the elements of an std::deque might or might not be stored contiguously in memory but I read there are concerns about the extra memory this container takes?
Thank you for your help on this!
If your problem is such that you actually run out of memory, allocating fairly small blocks (as is done by deque) is not going to help; the overhead of tracking the allocations will only make the situation worse. You need to rethink your implementation so that you can deal with the data in blocks that will still fit in memory. For such problems, if using x86 or x64 based hardware, I would suggest blocks of at least 2 megabytes (the large page size).
Obviously, when "large" becomes too large (>4GB), my program crashes
with a "bad_alloc" exception because it can't find such a large chunk
of memory available.
You should be using a 64-bit CPU and OS at this point; allocating a huge contiguous chunk of memory should not be a problem unless you are actually running out of memory. It is possible that you are building a 32-bit program. In that case you won't be able to allocate more than 4 GB (and in practice noticeably less, since the address space is shared with code, stacks and the OS). You should build a 64-bit application.
If you want something better than plain operator new, then your question is OS-specific. Look at API provided by your OS: on POSIX system you should look for mmap and for VirtualAlloc on Windows.
There are multiple problems with large allocations:
For security reasons OS kernel never gives you pages filled with garbage values, instead all new memory will be zero initialized. This means you don't have to initialize that memory as long as zeroes are exactly what you want.
OS gives you real memory lazily on first access. If you are processing large array, you might waste a lot of time taking page faults. To avoid this you can use MAP_POPULATE on Linux. On Windows you can try PrefetchVirtualMemory (but I am not sure if it can do the job). This should make init allocation slower, but should decrease total time spent in kernel.
Working with large chunks of memory wastes slots in the Translation Lookaside Buffer (TLB). Depending on your memory access pattern, this can cause a noticeable slowdown. To avoid this you can try using large pages (mmap with MAP_HUGETLB, MAP_HUGE_2MB, MAP_HUGE_1GB on Linux, VirtualAlloc with MEM_LARGE_PAGES on Windows). Using large pages is not easy, as they are usually not available by default. They also cannot be swapped out (always "locked in memory"), so using them requires privileges.
If you don't want to use OS-specific functions, the best you can find in C++ is std::calloc. Unlike std::malloc or operator new, it returns zero-initialized memory, so you can probably avoid wasting time initializing that memory. Other than that, there is nothing special about that function. But this is the closest you can get while staying within standard C++.
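A minimal sketch of the std::calloc route, wrapped so that the memory is released with std::free automatically. The names FreeDeleter, ZeroedDoubles and makeZeroed are made up for the example, and it assumes that all-bits-zero represents 0.0, which holds for IEEE 754 doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Deleter so that memory obtained from std::calloc is released with std::free.
struct FreeDeleter {
    void operator()(void* p) const noexcept { std::free(p); }
};

using ZeroedDoubles = std::unique_ptr<double[], FreeDeleter>;

// Returns `count` zero-initialized doubles, or an empty pointer on failure.
// Assumption: all-bits-zero is 0.0 (true for IEEE 754 doubles).
ZeroedDoubles makeZeroed(std::size_t count) {
    assert(count > 0);  // calloc with a zero size has implementation-defined results
    return ZeroedDoubles(
        static_cast<double*>(std::calloc(count, sizeof(double))));
}
```

Unlike operator new, std::calloc reports failure by returning a null pointer rather than throwing bad_alloc, so the caller has to check the result before using it.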
There are no standard containers designed to handle large allocations; moreover, all standard containers are really, really bad at handling those situations.
Some OSes (like Linux) overcommit memory, others (like Windows) do not. Windows might refuse to give you memory if it knows it won't be able to satisfy your request later. To avoid this you might want to increase your page file. Windows needs to reserve that space on disk beforehand, but it does not mean it will use it (start swapping). As actual memory is given to programs lazily, there might be a lot of memory reserved for applications that will never actually be given to them.
If increasing page file is too inconvenient, you can try creating large file and map it into memory. That file will serve as a "page file" for your memory. See CreateFileMapping and MapViewOfFile.
The answer to this question is extremely application and platform dependent. These days, if you just need a small integer factor greater than 4GB, you use a 64-bit machine, if possible. Sometimes reducing the size of the element in the array is possible as well. (E.g. using 16-bit fixed-point or half-float instead of 32-bit float.)
Beyond this, you are either looking at sparse arrays or out-of-core techniques. Sparse arrays are used when you are not actually storing elements at all locations in the array. There are many possible implementations and which is best depends on both the distribution of the data and the access pattern of the algorithm. See Eigen for example.
Out-of-core involves explicitly reading and writing parts of the array to/from disk. This used to be fairly common, but people work pretty hard to avoid doing this now. Applications that really require such are often built on top of a database or similar to handle the data management. In scientific computing, one ends up needing to distribute the compute as well as the data storage so there's a lot of complexity around that as well. For important problems the entire design may be driven by having good locality of reference.
Any sparse data structure will have overhead in how much space it takes. This can be fairly low, but it means you have to be careful if you actually have a dense array and are simply looking to avoid memory fragmentation.
If your problem can be broken into smaller pieces that only access part of the array at a time, and the main issue is memory fragmentation making it hard to allocate one large block, then breaking the array into pieces, effectively adding an outer vector of pointers, is a good bet. If you have random access to an array larger than 4 gigabytes and no way to localize the accesses, 64-bit is the way to go.
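The "outer vector of pointers" idea could be sketched roughly like this for an array whose size is fixed up front. BlockedArray and kBlockElems are hypothetical names. The default block of 2^18 elements is 2 MiB for doubles, picked only to line up with the large page size mentioned in another answer.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Fixed-size array stored as separately allocated blocks, indexable with [].
// kBlockElems is a hypothetical tuning knob; a power of two keeps the
// division and modulo in operator[] cheap.
template <typename T, std::size_t kBlockElems = (std::size_t{1} << 18)>
class BlockedArray {
public:
    explicit BlockedArray(std::size_t size) : size_(size) {
        const std::size_t blocks = (size + kBlockElems - 1) / kBlockElems;
        blocks_.reserve(blocks);
        for (std::size_t b = 0; b < blocks; ++b)
            blocks_.push_back(std::make_unique<T[]>(kBlockElems));  // value-initialized
    }

    T& operator[](std::size_t i) {
        assert(i < size_);
        return blocks_[i / kBlockElems][i % kBlockElems];
    }

    std::size_t size() const { return size_; }

private:
    std::size_t size_;
    std::vector<std::unique_ptr<T[]>> blocks_;  // the outer vector of pointers
};
```

Each access costs one extra pointer load compared with a flat array, and the last block is allocated in full even if only part of it is used. Whether that matters depends on the access pattern, so it is worth measuring.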
Depending on what you need the memory for and your speed concerns, and if you're using Linux, you can always try using mmap and simulate a sort of swap. It might be slower, but you can map very large sizes. See Mmap() an entire large file
Apologies if this has been asked before, I can't find a question that fully answers what I want to know. They mention ways to do this, but don't compare approaches.
I am writing a program in C++ to solve a PDE to steady state. I don't know how many time steps this will take. Therefore I don't know how long my time arrays will be. This will have a maximum time of 100,000s, but the time step could be as small as .001, so it could be as many as 1e8 doubles in length in the worst case (not necessarily a rare case either).
What is the most efficient way to implement this in terms of memory allocated and running time?
Options I've looked at:
Dynamically allocating an array with 1e8 elements, most of which won't ever be used.
Allocating a smaller array initially, creating a larger array when needed and copying elements over
Using std::vector and its size-increasing functionality
Are there any other options?
I'm primarily concerned with speed, but I want to know what memory considerations come into it as well
If you are concerned about speed just allocate 1e8 doubles and be done with it.
In most cases vector should work just fine. Remember that amortized it's O(1) for the append.
Unless you are running on something very weird, the OS memory allocator should take care of most fragmentation issues, including the difficulty of finding a free 800 MB block.
As noted in the comments, if you are careful using vector, you can actually reserve the capacity to store the maximum input size in advance (1e8 doubles) without paging in any memory.
For this you want to avoid the fill constructor and methods like resize (which would end up accessing all the memory) and use reserve and push_back to fill it and only touch memory as needed. That will allow most operating systems to simply page in chunks of your accessed vector at a time instead of the entire contents all at once.
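A small sketch of the reserve and push_back pattern described above. The function name runToSteadyState and the halving update are stand-ins invented for the example; a real solver step would go where the comment says so.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Records one value per time step until a steady-state test passes or
// maxSteps is reached. The halving update is a placeholder for a real PDE step.
std::vector<double> runToSteadyState(double initial, double tolerance,
                                     std::size_t maxSteps) {
    assert(tolerance > 0.0);
    std::vector<double> history;
    history.reserve(maxSteps);          // capacity only; no elements are constructed
    double value = initial;
    while (history.size() < maxSteps) {
        history.push_back(value);       // memory is first written here, step by step
        const double next = 0.5 * value;  // placeholder update
        if (std::abs(value - next) < tolerance) break;
        value = next;
    }
    return history;
}
```

With an initial value of 1.0 and a tolerance of 0.1 this stops after four recorded steps, while the capacity stays at whatever was reserved. Writing std::vector<double> history(maxSteps) instead would construct and touch every element immediately, which is exactly what the text warns against.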
Yet I tend to avoid this solution for the most part at these kinds of input scales, but for simple reasons:
A possibly-paranoid portability fear that I may encounter an operating system which doesn't have this kind of page-on-demand behavior.
A possibly-paranoid fear that the allocation may fail to find a contiguous set of unused pages and face out of memory errors (this is a grey zone -- I tend to worry about this for arrays which span gigabytes, hundreds of megabytes is borderline).
Just a totally subjective and possibly dumb/old bias towards not leaning too heavily on the operating system's behavior for paging in allocated memory, and preferring to have a data structure which simply allocates on demand.
Debugging.
Among the four, the first two could simply be paranoia. The third might just be plain dumb. Yet at least on operating systems like Windows, when using a debug build, the memory is initialized in its entirety early, and we end up mapping the allocated pages to DRAM immediately on reserving capacity for such a vector. Then we might end up leading to a slight startup delay and a task manager showing 800 megabytes of memory usage for a debug build even before we've done anything.
While generally the efficiency of a debug build should be a minor concern, when the potential discrepancy between release and debug is enormous, it can start to render production code almost incapable of being effectively debugged. So when the differences are potentially vast like this, my preference is to "chunk it up".
The strategy I like to apply here is to allocate smaller chunks -- smaller arrays of N elements, where N might be, say, 512 doubles (just snug enough to fit a common denominator page size of 4 kilobytes -- possibly minus a couple of doubles for chunk metadata). We fill them up with elements, and when they get full, create another chunk.
With these chunks, we can aggregate them together by either linking them (forming an unrolled list) or storing a vector of pointers to them in a separate aggregate depending on whether random-access is needed or merely sequential access will suffice. For the random-access case, this incurs a slight overhead, yet one I've tended to find relatively small at these input scales which often have times dominated by the upper levels of the memory hierarchy rather than register and instruction level.
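For the random-access flavour, a minimal sketch of the chunk idea might look like the following. ChunkedDoubles is a hypothetical name, and unlike the description above it keeps all bookkeeping outside the chunks, so a full 512 doubles fit in each one.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Growable sequence that allocates fixed-size chunks on demand.
// 512 doubles per chunk is 4 KiB; chunk metadata is kept outside the
// chunks here, which is a simplification.
class ChunkedDoubles {
public:
    static constexpr std::size_t kChunkElems = 512;

    void push_back(double v) {
        if (size_ == chunks_.size() * kChunkElems)  // all existing chunks are full
            chunks_.push_back(std::make_unique<double[]>(kChunkElems));
        chunks_[size_ / kChunkElems][size_ % kChunkElems] = v;
        ++size_;
    }

    double& operator[](std::size_t i) {
        assert(i < size_);
        return chunks_[i / kChunkElems][i % kChunkElems];
    }

    std::size_t size() const { return size_; }
    std::size_t chunkCount() const { return chunks_.size(); }

private:
    std::size_t size_ = 0;
    std::vector<std::unique_ptr<double[]>> chunks_;  // aggregate of chunk pointers
};
```

Growing never copies existing elements; only the small vector of pointers is ever reallocated. The price is the extra indirection on every indexed access.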
This might be overkill for your case and a careful use of vector may be the best bet. Yet if that doesn't suffice and you have similar concerns/needs as I do, this kind of chunky solution might help.
The only way to know which option is 'most efficient' on your machine is to try a few different options and profile. I'd probably start with the following:
std::vector constructed with the maximum possible size.
std::vector constructed with a conservative ballpark size and push_back.
std::deque and push_back.
The std::vector vs std::deque debate is ongoing. In my experience, when the number of elements is unknown and not too large, std::deque is almost never faster than std::vector (even if the std::vector needs multiple reallocations) but may end up using less memory. When the number of elements is unknown and very large, std::deque memory consumption seems to explode and std::vector is the clear winner.
If after profiling, none of these options offers satisfactory performance, then you may want to consider writing a custom allocator.
I have a baseclass named GameObject from which other classes derive.
I am wondering if handling the memory allocation by allocating all derived classes of GameObjects in contiguous memory will improve performance.
I will end up iterating over all of them each game engine frame.
My question is, does contiguous memory storage in this case give me faster iteration times than mallocing memory without contiguation? In both cases, I have to keep a vector of pointers to the Game Objects since they will vary in size.
Iterating through objects in contiguous memory likely works better because of caching and locality. However, I recommend that you build the two systems and actually profile them. Good luck!
I'm not sure I understand the question. Are you asking if it's better to pre-allocate all your objects in one giant block of memory and store pointers to subsections of that block? If so, please don't do that.
Instead of being faster, you're more likely to slow things down, since the system has to find contiguous memory for larger blocks instead of non-contiguous memory for smaller blocks. Keep in mind block allocation, paging, etc. You may request 100 megabytes of contiguous memory, but it's really not contiguous. A bunch of it is on disk and everything is broken up into pages anyway.
Then you're faced with the question of do you allocate all your GameObjects in one go to get contiguous memory or are you creating them on demand? Do you really want to pre-allocate for this one minor optimization? What happens if you need to create a new object and your contiguous memory block wasn't large enough? etc.
Really I'm just brainstorming potential problems here. Like the other comments said, it's a case of premature optimization.
Now, it would certainly be faster if you stored all your pointers in a contiguous array instead of a vector which grows and copies based on its current size, but even then unless you absolutely know the amount of game objects you're better off just allocating a sufficiently large vector so that it only grows once or twice.
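As a rough illustration of sizing the pointer vector up front: the classes Player and Enemy, the update() hook and the Scene wrapper below are all invented for the example, since the question only names GameObject.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

struct GameObject {
    virtual ~GameObject() = default;
    virtual int update() = 0;  // hypothetical per-frame hook
};
struct Player : GameObject { int update() override { return 1; } };
struct Enemy : GameObject { int update() override { return 2; } };

class Scene {
public:
    // Reserving once means the pointer array does not grow and copy
    // as long as the estimate holds.
    explicit Scene(std::size_t expectedObjects) { objects_.reserve(expectedObjects); }

    void add(std::unique_ptr<GameObject> obj) {
        assert(obj != nullptr);
        objects_.push_back(std::move(obj));
    }

    // Visits every object once per frame through the pointer array.
    int updateAll() {
        int total = 0;
        for (auto& obj : objects_) total += obj->update();
        return total;
    }

    std::size_t capacity() const { return objects_.capacity(); }

private:
    std::vector<std::unique_ptr<GameObject>> objects_;
};
```

The pointers themselves sit contiguously either way; where the objects they point to end up is a separate question, and that is the part worth profiling.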
I will be allocating and deallocating MANY dynamic, multidimensional arrays that represent matrices, every frame.
Priorities, even at the expense of error-checking and manual memory management:
Speed
Small memory footprint
Are C-style arrays the best choice, given these priorities? I know this is an oft-asked question, but I haven't been able to find a definitive answer for my circumstance.
If you can characterize the maximum amount of memory needed for a set of these arrays that will be used for any particular 'frame' (whatever that is), and if you will be dealing with only a single frame at a time (in other words, you'll be performing work on a single set of arrays, then dumping all of those arrays before performing another round of work on another set of arrays), then you'll likely get the best performance by allocating your arrays from a block of static memory that you've sized appropriately for your largest possible work set.
Then your array allocation can be a simple pool allocator that carves out memory for an array from the front of the block and adjusts the block pointer to just past that allocation to be ready for the next array allocation. When you're done with the work on that set of arrays, everything can be freed by 'cleaning the pool' - simply resetting the block pointer to the start of the static memory pool.
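A bare-bones sketch of such a pool, assuming the matrices hold doubles. FramePool, kPoolDoubles and the method names are hypothetical, and the pool size is an arbitrary placeholder for "largest possible work set".

```cpp
#include <cassert>
#include <cstddef>

// Bump ("pool") allocator over one fixed block, reset once per frame.
// kPoolDoubles is a hypothetical bound on the largest working set.
class FramePool {
public:
    static constexpr std::size_t kPoolDoubles = std::size_t{1} << 16;

    // Carves `count` doubles off the front of the block and moves the block
    // pointer past them. Returns nullptr when the pool is exhausted.
    double* allocDoubles(std::size_t count) {
        assert(count > 0);
        if (count > kPoolDoubles - offset_) return nullptr;
        double* p = storage_ + offset_;
        offset_ += count;
        return p;
    }

    // "Cleaning the pool": everything is freed by resetting the block pointer.
    void reset() { offset_ = 0; }

    std::size_t used() const { return offset_; }

private:
    double storage_[kPoolDoubles];
    std::size_t offset_ = 0;
};
```

Declaring the pool as a static or global object keeps the block in static memory, as the answer suggests. Individual arrays cannot be freed on their own, which is fine only if the whole set really is discarded together at the end of the frame.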
Of course, since you haven't given much in the way of details for how your work must be done, this technique might not fit at all (that's probably why you haven't found a definitive answer yet - such an answer depends on the specific characteristics of the work you're performing).
std::vectors are typically as good as C-arrays, but if you want ultimate bare-bone speed and you know what you are doing, nothing can match managing C-array yourself.
There are trade-offs you have to consider.
- How much time are you willing to spend debugging custom code?
- How much custom code are you willing to write?
Also, array/vector libraries are very well tested and optimized for speed and memory consumption; you might want to benchmark them using various compiler settings before deciding (if you do, please share the results).
You give us too little information about your problem to give you a good answer. Is your program supposed to run only on one platform, or should it be platform independent? Is time-efficiency critical for your project? If so, perhaps using 'new' and 'delete' will be too slow for you and you need to resort to some platform specific or third party allocator. Then, the choice between a dynamically allocated array and an std::vector shouldn't make a difference.
Or do you want to allocate the arrays on the stack? But there is a limit to the size of the array you can create on the stack. What are the sizes of matrices?
If you want to represent a matrix, a single one-dimensional array is already better than a multidimensional array. An array is the simplest structure you have at your disposal, so it is the more appropriate choice.
The purpose of Vector is to implement a dynamic array, and you may not need this feature as a matrix has a fixed size.
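To show what storing a matrix in one one-dimensional array means in practice, here is a small row-major sketch. The class name Matrix and its interface are made up. It uses std::vector purely for ownership of the buffer and never resizes it, though a raw array would be indexed the same way.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix stored in a single one-dimensional buffer.
class Matrix {
public:
    Matrix(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    double& operator()(std::size_t row, std::size_t col) {
        assert(row < rows_ && col < cols_);
        return data_[row * cols_ + col];  // element (row, col) lives at row*cols + col
    }

    const double* data() const { return data_.data(); }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

private:
    std::size_t rows_, cols_;
    std::vector<double> data_;  // one allocation for the whole matrix
};
```

Compared with an array of row pointers this needs one allocation instead of rows + 1, and walking along a row touches adjacent memory.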
When using malloc to allocate memory, is it generally quicker to do multiple mallocs of smaller chunks of data or fewer mallocs of larger chunks of data? For example, say you are working with an image file that has black pixels and white pixels. You are iterating through the pixels and want to save the x and y position of each black pixel in a new structure that also has a pointer to the next and previous pixels x and y values. Would it be generally faster to iterate through the pixels allocating a new structure for each black pixel's x and y values with the pointers, or would it be faster to get a count of the number of black pixels by iterating through once, then allocating a large chunk of memory using a structure containing just the x and y values, but no pointers, then iterating through again, saving the x and y values into that array? I'm assuming certain platforms might be different than others as to which is faster, but what does everyone think would generally be faster?
It depends:
Multiple small allocations mean multiple calls, which is slower.
There may be a special/fast implementation for small allocations.
If I cared, I'd measure it! If I really cared a lot, and couldn't guess, then I might implement both, and measure at run-time on the target machine, and adapt accordingly.
In general I'd assume that fewer is better, but there are sizes and run-time library implementations for which a (sufficiently) large allocation will be delegated to the (relatively slow) O/S, whereas a (sufficiently) small allocation will be served from a (relatively quick) already-allocated heap.
Allocating large blocks is more efficient; additionally, since you are using larger contiguous blocks, you have greater locality of reference, and traversing your in-memory structure once you've generated it should also be more efficient! Further, allocating large blocks should help to reduce memory fragmentation.
Generally speaking, allocating larger chunks of memory fewer times will be faster. There's overhead involved each time a call to malloc() is made.
Apart from speed issues, there is also the memory fragmentation problem.
Allocating memory is work. The amount of work done when allocating a block of memory is typically independent of the size of the block. You work it out from here.
It's faster not to allocate in performance-sensitive code at all. Allocate the memory you're going to need once in advance, and then use and reuse that as much as you like.
Memory allocation is a relatively slow operation in general, so don't do it more often than necessary.
In general malloc is expensive. It has to find an appropriate memory chunk from which to allocate memory and keep track of non-contiguous memory blocks. In several libraries you will find small memory allocators that try to minimize the impact by allocating a large block and managing the memory in the allocator.
Alexandrescu deals with the problem in 'Modern C++ Design' and in the Loki library, if you want to take a look at one such library.
This question is one of pragmatism, I'm afraid; that is to say, it depends.
If you have a LOT of pixels, only a few of which are black then counting them might be the highest cost.
If you're using C++, which your tags suggest you are, I would strongly suggest using the STL, something like std::vector.
The implementation of vector, if I remember correctly, uses a pragmatic approach to allocation. There are a few heuristics for allocation strategies, an informative one is this:
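#include <cstdlib>

// Illustrative only: no error checking, and copying is not handled.
class SampleVector {
    int N, used, *data;
public:
    // malloc and realloc take sizes in bytes, so scale by sizeof(int)
    SampleVector() { N = 1; used = 0; data = static_cast<int*>(std::malloc(N * sizeof(int))); }
    ~SampleVector() { std::free(data); }
    void push_back(int i)
    {
        if (used >= N)
        {
            // handle reallocation: double the capacity
            N *= 2;
            data = static_cast<int*>(std::realloc(data, N * sizeof(int)));
        }
        data[used++] = i;
    }
};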
In this case, you DOUBLE the amount of memory allocated every time you realloc.
This means that reallocations progressively halve in frequency.
Your STL implementation will have been well-tuned, so if you can use that, do!
Another point to consider is how this interacts with threading. Using malloc many times in a threaded concurrent application is a major drag on performance. In that environment you are better off with a scalable allocator like the one used in Intel's Threading Building Blocks or Hoard. The major limitation with malloc, in many traditional implementations, is that there is a single global lock that all the threads contend for. It can be so bad that adding another thread dramatically slows down your application.
As already mentioned, malloc is costly, so fewer calls will probably be faster.
Also, working with the pixels in one contiguous block will, on most platforms, cause fewer cache misses and be faster.
However, this is not guaranteed on every platform.
Next to the allocation overhead itself, allocating multiple small chunks may result in lots of cache misses, while if you can iterate through a contiguous block, chances are better.
The scenario you describe asks for preallocation of a large block, imho.
Although allocating large blocks is faster per byte of allocated memory, it will probably not be faster if you artificially increase the allocation size only to chop it up yourself. You're just duplicating the memory management.
Do an iteration over the pixels to count the number of them to be stored.
Then allocate an array for the exact number of items. This is the most efficient solution.
You can use std::vector for easier memory management (see the std::vector::reserve procedure). Note: reserve will probably allocate a little (probably up to 2 times) more memory than necessary.
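The count-then-allocate approach could be sketched as follows. The image representation (a row-major buffer of bytes in which 0 means black) and the names PixelPos and collectBlack are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct PixelPos {
    std::size_t x, y;
};

// `image` is a hypothetical row-major buffer in which 0 means black.
std::vector<PixelPos> collectBlack(const std::vector<unsigned char>& image,
                                   std::size_t width, std::size_t height) {
    assert(image.size() == width * height);

    // Pass 1: count the black pixels.
    std::size_t count = 0;
    for (unsigned char p : image)
        if (p == 0) ++count;

    // One allocation of at least the needed size, then pass 2 fills it.
    std::vector<PixelPos> out;
    out.reserve(count);
    for (std::size_t y = 0; y < height; ++y)
        for (std::size_t x = 0; x < width; ++x)
            if (image[y * width + x] == 0) out.push_back({x, y});
    return out;
}
```

The result holds just x and y per pixel, with no next or previous pointers, because neighbours in the array already give the ordering. Whether two passes over the image beat one pass with a growing vector depends on how large the image is and how many pixels are black, so it is worth timing both.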
"I can allocate-it-all" (really, I can!)
We can philosophize about special implementations that speed up small allocations considerably ... yes! But in general this holds:
malloc must be general. It must implement all the different kinds of allocations. That is the reason it is comparatively slow! You might be using some special super-duper library that speeds things up, but even those cannot work wonders, since they have to implement malloc in its full generality.
The rule is: when you have more specialized allocation code, you are generally faster than the broad "I can allocate-it-all" routine malloc.
So when you are able to allocate the memory in bigger blocks in your code (and it does not cost you too much), you can speed things up considerably. Also, as mentioned by others, you will get a lot less memory fragmentation, which also speeds things up and can cost less memory. Note, too, that malloc needs additional memory for every chunk of memory it returns to you (yes, special routines can reduce this, but you don't know what it really does unless you implemented it yourself or bought some wonder-library).