How best to quickly populate a vector? - c++

I have some simulation code I'm working on and I've just gotten rid of all the low hanging fruit so far as optimisation is concerned. The code now spends half its time pushing back vectors. (The size of the final vectors is known and I reserve appropriately)
Essentially I'm rearranging one vector into a permutation of another, or populating the vector with random elements.
Is there any faster means of pushing back into a vector? Or pushing back/copying multiple elements?
std::vector<unsigned int, std::allocator<unsigned int> >::push_back(unsigned int const&)
Thanks in advance.
EDIT: Extra info; I'm running a release build with -O3, also: the original vector needs to be preserved.

You can have a look at
C++0x (which enables a lot of optimizations in this area through move semantics)
EASTL (which boasts superior performance, mainly through the use of custom allocators; you can get it up and running in about an hour, and the only visible change will be std::vector --> eastl::vector plus some extra link objects)
You can drop in Google perftools' tcmalloc (although since you apparently already optimize by pre-allocating, this shouldn't really matter).
I'd personally not expect much gain if really the vector handling is the bottleneck. I'd really look at parallelizing with (in order of preference):
GNU libstdc++ parallel mode (CPPFLAGS += -D_GLIBCXX_PARALLEL -fopenmp)
plain OpenMP with a 'manual' #pragma omp parallel for
Intel TBB (most appropriate if using Intel compiler)
I must be forgetting stuff. Oh yeah, look here: http://www.agner.org/optimize/
Edit: I always forget the simplest things: Use memcpy/memmove for bulk appending POD elements to pre-allocated vectors.
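For illustration, here is a rough sketch of what bulk appending might look like (the function and variable names are made up for the example). The insert() range overload usually lowers to a single memmove for trivially copyable element types; an explicit memcpy is only safe into elements the vector already owns, hence the resize() first.

#include <cstring>
#include <vector>

void bulk_append(std::vector<unsigned int>& dst,
                 const unsigned int* src, std::size_t count)
{
    // Option 1: let the library do it -- a ranged insert() typically
    // becomes one memmove for trivially copyable types.
    dst.insert(dst.end(), src, src + count);
}

void bulk_append_memcpy(std::vector<unsigned int>& dst,
                        const unsigned int* src, std::size_t count)
{
    // Option 2: explicit memcpy. Note the resize(): memcpy may only write
    // into elements the vector already owns; reserve() alone is not enough.
    const std::size_t old_size = dst.size();
    dst.resize(old_size + count);
    std::memcpy(dst.data() + old_size, src, count * sizeof(unsigned int));
}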

If you're pre-reserving space then your vector is as fast as an array. You cannot mathematically make it faster; stop worrying and move on to something else!
You may be experiencing slow-down if you're running a "debug build" i.e. where your standard library implementation has optimisations turned off, and debug tracking info turned on.

push_back on int is extremely efficient. So I would look elsewhere for optimization opportunities.
Nemo's first rule of micro-optimization: Math is fast; memory is slow. Creating a huge vector is very cache-unfriendly.
For example, instead of creating a permutation of the original vector, can you just compute which element you need as you need it and then access that element directly from the original vector?
Similarly, do you really need a vector of random integers? Why not just generate a random number when it is needed? (If you have to remember it for later, then go ahead and push it onto the vector then... But not before.)
push_back on int is about as fast as it is going to get. I would bet you could barely notice the difference even if you got rid of the reserve (because the re-allocation does not happen often and is going to use a very fast bulk copy already). So you need to take a broader view to improve performance.
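As a hedged sketch of the first suggestion above (the names here are hypothetical, not from the question): if an index permutation is enough, the permuted vector never has to be materialised at all.

#include <cstddef>
#include <vector>

// original[perm[i]] is the i-th element of the permuted sequence;
// the permuted vector itself is never built or copied.
unsigned int permuted_at(const std::vector<unsigned int>& original,
                         const std::vector<std::size_t>& perm,
                         std::size_t i)
{
    return original[perm[i]];
}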

If you have multiple vectors, you may be able to improve speed by allocating them contiguously using a custom allocator. Improving memory locality may well improve the running time of the algorithm.
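One way to sketch this today is C++17's std::pmr, which post-dates this thread: several vectors can draw their storage from one monotonic buffer, so their elements end up near each other in memory. The sizes below are illustrative only.

#include <memory_resource>
#include <vector>

int main()
{
    // One upfront arena; all three vectors carve their storage out of it.
    std::pmr::monotonic_buffer_resource arena(1 << 20);   // 1 MiB, illustrative

    std::pmr::vector<unsigned int> a(&arena);
    std::pmr::vector<unsigned int> b(&arena);
    std::pmr::vector<unsigned int> c(&arena);

    a.reserve(10000);
    b.reserve(10000);
    c.reserve(10000);
    // ... fill as usual; the memory is released when `arena` is destroyed.
}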

If you are using a debug build of the STL there is debug overhead (especially in iterators) in all STL calls.
I would advise replacing the STL vector with a regular array. If you are using trivially copyable types you can easily copy multiple elements with a single memcpy call.

Related

Container Data Structure similar to std::vector and std::list

I am currently developing (at the design stage) a kinda small compiler which uses a custom IR. The problem I am having is choosing an efficient container data structure to contain all the instructions.
A basic block will contain ~10000 instructions and an instruction will be ~250 bytes big.
I thought about using a list because the compiler will perform lots of complex transformations (i.e. lots of random insertions/removals), so having a container data structure which does not invalidate iterators would be good, as it would keep the transformation algorithms simple and easy to follow.
On the other hand, it would be a loss of performance because of the known problems with cache misses and memory fragmentation. An std::vector would help here, but I imagine it would be a pain to implement transformations with a vector.
So the question is whether there is another data structure which has low memory fragmentation (to reduce cache misses) and does not invalidate iterators, or whether I should ignore this and keep using a list.
Start with using Container = std::vector<Instruction>. std::vector is a pretty good default container. Once performance becomes an issue profile the program with a couple of different containers. You should be able to swap out the Container without needing much change in the rest of the code. I imagine some kind of array-list would be best, but you should probably check.
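A minimal sketch of that single point of change (the Instruction layout, the alias name, and the pass below are placeholders, not from the question):

#include <list>
#include <vector>

struct Instruction { /* opcode, operands, ... roughly 250 bytes */ };

// The rest of the compiler only ever names `Container`.
using Container = std::vector<Instruction>;
// using Container = std::list<Instruction>;   // swap in when profiling alternatives

void run_pass(Container& block)
{
    for (auto it = block.begin(); it != block.end(); ++it) {
        // ... transformations written against the generic Container interface
    }
}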

Most efficient way to grow array C++

Apologies if this has been asked before, I can't find a question that fully answers what I want to know. They mention ways to do this, but don't compare approaches.
I am writing a program in C++ to solve a PDE to steady state. I don't know how many time steps this will take. Therefore I don't know how long my time arrays will be. This will have a maximum time of 100,000s, but the time step could be as small as .001, so it could be as many as 1e8 doubles in length in the worst case (not necessarily a rare case either).
What is the most efficient way to implement this in terms of memory allocated and running time?
Options I've looked at:
Dynamically allocating an array with 1e8 elements, most of which won't ever be used.
Allocating a smaller array initially, creating a larger array when needed and copying elements over
Using std::vector and its size-increasing functionality
Are there any other options?
I'm primarily concerned with speed, but I want to know what memory considerations come into it as well
If you are concerned about speed just allocate 1e8 doubles and be done with it.
In most cases vector should work just fine. Remember that amortized it's O(1) for the append.
Unless you are running on something very weird, the OS's memory management should take care of most fragmentation issues and of the difficulty of finding an 800 MB free memory block.
As noted in the comments, if you are careful using vector, you can actually reserve the capacity to store the maximum input size in advance (1e8 doubles) without paging in any memory.
For this you want to avoid the fill constructor and methods like resize (which would end up accessing all the memory) and use reserve and push_back to fill it and only touch memory as needed. That will allow most operating systems to simply page in chunks of your accessed vector at a time instead of the entire contents all at once.
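Here is a small illustrative sketch of that pattern, assuming the 1e8 worst case from the question; reserve() obtains address space, while the commented-out resize() would touch (and on most systems commit) every page up front.

#include <cstddef>
#include <vector>

int main()
{
    const std::size_t max_steps = 100000000;  // worst case: 1e8 doubles ~ 800 MB of address space

    std::vector<double> times;
    times.reserve(max_steps);     // obtains address space; on most OSes no pages are committed yet
    // times.resize(max_steps);   // by contrast, resize() zero-fills and therefore touches every page

    double t = 0.0;
    const double dt = 0.001;      // illustrative step size
    while (t < 100000.0 /* && !steady_state_reached */) {
        times.push_back(t);       // pages are faulted in gradually, only as elements are written
        t += dt;
    }
    return 0;
}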
Yet I tend to avoid this solution for the most part at these kinds of input scales, for a few simple reasons:
A possibly-paranoid portability fear that I may encounter an operating system which doesn't have this kind of page-on-demand behavior.
A possibly-paranoid fear that the allocation may fail to find a contiguous set of unused pages and face out of memory errors (this is a grey zone -- I tend to worry about this for arrays which span gigabytes, hundreds of megabytes is borderline).
Just a totally subjective and possibly dumb/old bias towards not leaning too heavily on the operating system's behavior for paging in allocated memory, and preferring to have a data structure which simply allocates on demand.
Debugging.
Among the four, the first two could simply be paranoia. The third might just be plain dumb. Yet at least on operating systems like Windows, when using a debug build, the memory is initialized in its entirety early, and we end up mapping the allocated pages to DRAM immediately on reserving capacity for such a vector. That can lead to a slight startup delay and a task manager showing 800 megabytes of memory usage for a debug build before we've even done anything.
While generally the efficiency of a debug build should be a minor concern, when the potential discrepancy between release and debug is enormous, it can start to render production code almost incapable of being effectively debugged. So when the differences are potentially vast like this, my preference is to "chunk it up".
The strategy I like to apply here is to allocate smaller chunks -- smaller arrays of N elements, where N might be, say, 512 doubles (just snug enough to fit a common denominator page size of 4 kilobytes -- possibly minus a couple of doubles for chunk metadata). We fill them up with elements, and when they get full, create another chunk.
With these chunks, we can aggregate them together by either linking them (forming an unrolled list) or storing a vector of pointers to them in a separate aggregate depending on whether random-access is needed or merely sequential access will suffice. For the random-access case, this incurs a slight overhead, yet one I've tended to find relatively small at these input scales which often have times dominated by the upper levels of the memory hierarchy rather than register and instruction level.
This might be overkill for your case and a careful use of vector may be the best bet. Yet if that doesn't suffice and you have similar concerns/needs as I do, this kind of chunky solution might help.
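A minimal sketch of such a chunked structure, under the stated assumption of 512-double chunks (the class name is illustrative; random access is kept via a vector of chunk pointers):

#include <cstddef>
#include <memory>
#include <vector>

class ChunkedDoubles {
    static constexpr std::size_t N = 512;            // 512 * 8 bytes = one 4 KB page per chunk
    std::vector<std::unique_ptr<double[]>> chunks_;  // aggregate of chunk pointers
    std::size_t size_ = 0;

public:
    void push_back(double x) {
        if (size_ % N == 0)                          // current chunk is full (or none exists yet)
            chunks_.emplace_back(new double[N]);
        chunks_[size_ / N][size_ % N] = x;
        ++size_;
    }
    double operator[](std::size_t i) const {         // one extra indirection versus a flat vector
        return chunks_[i / N][i % N];
    }
    std::size_t size() const { return size_; }
};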
The only way to know which option is 'most efficient' on your machine is to try a few different options and profile. I'd probably start with the following:
std::vector constructed with the maximum possible size.
std::vector constructed with a conservative ballpark size and push_back.
std::deque and push_back.
The std::vector vs std::deque debate is ongoing. In my experience, when the number of elements is unknown and not too large, std::deque is almost never faster than std::vector (even if the std::vector needs multiple reallocations) but may end up using less memory. When the number of elements is unknown and very large, std::deque memory consumption seems to explode and std::vector is the clear winner.
If after profiling, none of these options offers satisfactory performance, then you may want to consider writing a custom allocator.

Very fast object allocator for small objects of the same size

I have to write some code that needs the best possible performance.
Requirements:
I need a very fast object allocator for quick creation of objects. My object has only 3 doubles in it. Allocation and deallocation will occur one object at a time only.
I did a lot of research and came up with:
std::vector<MyClass, boost::fast_pool_allocator<MyClass>>
I wonder (in 2014-07):
Does the STL have something equivalent to boost::fast_pool_allocator?
Is there a better solution than what I have found?
There is additional information to answer some comments:
The code will be used to optimize my algorithm for: Code Project article on Convex Hull
I need to convert C# code to C or C++ to improve performance. I have to compete against another algorithm written in pure "C". I just discovered that the comparison chart in my article has errors, because I tested against code compiled in C for x86-Debug. In x64-Release the "C" code is a lot faster (a factor of 4 to 5 times faster than in x86-Debug).
According to this Boost documentation and this answer on Stack Overflow, boost::fast_pool_allocator seems to be the best allocator to use for small memory chunks of the same size, requested one at a time. But I would like to make sure nothing else exists that is either more standard (part of the STL) or faster.
My code will be developed on Visual Studio 2013 and target any windows platform (no phone or tablet).
My intent is not to have fast code, it is to have the fastest code. I prefer not having overly convoluted code if possible, and I am also looking for code that is maintainable (at least a minimum).
If possible, I also would like to know the impact of using std::vector vs a plain array (i.e. []).
For more info, you could see Wikipedia - Object pool pattern
The closest thing to what I was looking for was Paulo Zemek Code Project article: O(1) Object Pool in C++.
But in the end I allocated/reserved memory of size = the maximum possible number of objects * my object size.
Because I was not using any objects that needed to live longer than my algorithm loop, I cheated by treating locations in that reserved memory space as objects. After my algorithm loop, I flushed the reserved memory space. This appears to me to be the fastest approach: very inelegant, but extremely fast, and it requires only one allocation and one deallocation.
I was not totally satisfied with the answers, which is why I answered myself. I also added comments to the question replying to every comment, and added this answer to make things clear. I know that my decision/implementation was not totally in accordance with the question, but I think it should lead to something similar.
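A hedged sketch of that approach (Point3D and the class name are placeholders standing in for the 3-double object): one allocation up front, O(1) slot hand-out, and a single reset instead of per-object deallocation.

#include <cassert>
#include <cstddef>
#include <vector>

struct Point3D { double x, y, z; };

class BumpPool {
    std::vector<Point3D> storage_;
    std::size_t next_ = 0;
public:
    explicit BumpPool(std::size_t max_objects) : storage_(max_objects) {}

    Point3D* allocate() {                 // O(1): just hand out the next slot
        assert(next_ < storage_.size());
        return &storage_[next_++];
    }
    void reset() { next_ = 0; }           // "flush" everything after the algorithm loop
};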
Search for memory-pool heaps. Basically, you create a heap dedicated to objects of a single size (typically powers of 2: 4 bytes, 16 bytes, etc.) and allocate objects from the heap whose block size is the smallest that can fit your object. As each heap only contains fixed-size blocks, it's very easy to manage the blocks that are allocated in it; a bitmap can show you which blocks are free or in use, so insertion can be very fast (especially if you just allocate at the end and increment a pointer).
As an example, here's one, you may be able to take it and optimise it explicitly for your particular object size and requirements.
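For orientation only (this is an illustrative sketch, not the linked code): a fixed-size-block pool can thread a free list through the unused blocks, giving O(1) allocate and deallocate.

#include <cstddef>
#include <vector>

template <class T>
class FixedBlockPool {
    // Each slot is either raw storage for a T or a link in the free list.
    union Slot { alignas(T) unsigned char storage[sizeof(T)]; Slot* next; };
    std::vector<Slot> slots_;
    Slot* free_ = nullptr;
public:
    explicit FixedBlockPool(std::size_t capacity) : slots_(capacity) {
        for (std::size_t i = 0; i < capacity; ++i) {   // thread the free list
            slots_[i].next = free_;
            free_ = &slots_[i];
        }
    }
    void* allocate() {                                 // pop the head of the free list
        Slot* s = free_;
        if (s) free_ = s->next;
        return s;
    }
    void deallocate(void* p) {                         // push the block back
        Slot* s = static_cast<Slot*>(p);
        s->next = free_;
        free_ = s;
    }
};

The caller would placement-new the object into the block returned by allocate() and run its destructor before handing the block back.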
I found this solution very useful to me: Fast C++11 allocator for STL containers. It speeds up STL containers on VS2017 (~5x) as well as on GCC (~7x). Moreover, you can manually specify the growth size, or if you know the maximum number of elements in your list, you can preallocate them all.

How do I overload new/STL to make unknown objects faster?

In my question "profiling: deque is 23% of my runtime" I have a problem with 'new' being a large % of my runtime. The problems are:
I have to use the new keyword a lot and on many different classes/structs (I have >200 of them, and it's by design). I use lots of STL objects, iterators and strings. I use strdup and other allocation (or free) functions.
I have one function that is called >2 million times. All it did was create STL iterators and it took up >20% of the time (however, from what I remember the STL is optimized pretty nicely and debug builds make it orders of magnitude slower).
But keeping in mind that I need to allocate and free these iterators >2m times, along with other functions that are called often: how do I optimize the new and malloc keyword/function? Especially for all these classes/structs, including the classes/structs I didn't write (the STL and others).
Although profiling says I (and the STL?) use the new keyword more than anything else.
Look for opportunities to avoid the allocation/freeing, either by adding your own management layer to recycle memory and objects that have already been allocated, or modifying their allocators. There are plenty of articles on STL Allocators:
http://www.codeguru.com/cpp/cpp/cpp_mfc/stl/article.php/c4079
http://bmagic.sourceforge.net/memalloc.html
http://www.codeproject.com/KB/stl/blockallocator.aspx
I have seen large multimap code go from unusably slow to very fast simply by replacing the default allocator.
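As a hedged illustration of that kind of swap (the key and value types are placeholders), a Boost pool allocator can be dropped into the container's allocator parameter so node allocations come from a pool rather than individual operator new calls:

#include <map>
#include <boost/pool/pool_alloc.hpp>

// The multimap below takes its nodes from a Boost pool instead of
// calling operator new for every insertion.
using FastMultimap =
    std::multimap<int, int, std::less<int>,
                  boost::fast_pool_allocator<std::pair<const int, int>>>;

int main()
{
    FastMultimap m;
    for (int i = 0; i < 1000000; ++i)
        m.insert({i % 100, i});        // node allocations hit the pool, not the general heap
}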
You can't make malloc faster. You might be able to make new faster, but I bet you can find ways not to call them so much.
One way to find excess calls to anything is to peruse the code looking for them, but that's slow and error-prone, and they're not always visible.
A simple and foolproof way to find them is to pause the program a few times and look at the stack.
Notice you don't really need to measure anything. If something is taking a large fraction of the time, that fraction is the probability you will see it on each pause, and the goal is to find it.
You do get a rough measurement, but that's only a by-product of finding the problem.
Here's an example where this was done in a series of stages, resulting in large speedup factors.

Higher dimensional array vs 1-D array efficiency in C++

I'm curious about the efficiency of using a higher dimensional array vs a one dimensional array. Do you lose anything when defining and iterating through an array like this:
array[i][j][k];
or defining and iterating through an array like this:
array[k + j*jmax + i*imax];
My inclination is that there wouldn't be a difference, but I'm still learning about high efficiency programming (I've never had to care about this kind of thing before).
Thanks!
The only way to know for sure is to benchmark both ways (with optimization flags on in the compiler, of course). The one thing you lose for sure with the second method is readability.
The former and the latter way of accessing the array compile to identical code. Keep in mind that accessing memory locations that are close to one another does make a difference in performance, as they're going to be cached differently. Thus, if you're storing a high-dimensional matrix, ensure that you store rows one after the other if you're going to be accessing them that way.
In general, CPU caches optimize for temporal and spatial locality. That is, if you access memory address X, the odds of you accessing X+1 soon are higher. It's much more efficient to operate on values within the same cache line.
Check out this article on CPU caches for more information on how different storage policies affect performance: http://en.wikipedia.org/wiki/CPU_cache
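To make the equivalence concrete, here is a small sketch with illustrative sizes; the hand-written index uses the usual row-major strides (the question's jmax/imax would play the role of these strides), and both loops walk memory in the same contiguous order.

#include <cstddef>
#include <vector>

constexpr std::size_t IMAX = 64, JMAX = 64, KMAX = 64;   // illustrative sizes

double a3d[IMAX][JMAX][KMAX];   // the compiler computes k + KMAX*(j + JMAX*i) for a3d[i][j][k]

double sum_3d() {
    double s = 0.0;
    for (std::size_t i = 0; i < IMAX; ++i)
        for (std::size_t j = 0; j < JMAX; ++j)
            for (std::size_t k = 0; k < KMAX; ++k)       // innermost loop walks contiguous memory
                s += a3d[i][j][k];
    return s;
}

double sum_flat(const std::vector<double>& a) {          // requires a.size() == IMAX*JMAX*KMAX
    double s = 0.0;
    for (std::size_t i = 0; i < IMAX; ++i)
        for (std::size_t j = 0; j < JMAX; ++j)
            for (std::size_t k = 0; k < KMAX; ++k)
                s += a[k + KMAX * (j + JMAX * i)];       // the same address arithmetic, by hand
    return s;
}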
If you can rewrite the indexing, so can the compiler. I wouldn't worry about that.
Trust your compiler(tm)!
It probably depends on the implementation, but I'd say it more or less compiles down to the same code as your one-dimensional version.
Do yourself a favor and care about such things only after profiling the code. It is very unlikely that something like that will affect the performance of the application as a whole. Using the correct algorithms is much more important.
And even if it does matter, it is most certainly only a single inner loop that needs attention.