stl vector size on 64-bit machines - c++

I have an application that will use millions of vectors.
It appears that most implementations of std::vector use 4 pointers (_First, _Last, _End, and _Alloc), which consumes 32 bytes on 64-bit machines. For most "practical" use cases of vector, one could probably get away with a single pointer and two 'unsigned int' fields to store the current size & allocated size, respectively. Ignoring the potential challenge of supporting customized allocation (instead of assuming that allocations must go through the global new & delete operator), it seems that it is possible to build an STL compliant vector class that uses only 16 bytes (or at worst 24 bytes to support the _Alloc pointer).
Before I start coding this up, 1) are there any pitfalls I should be aware of and 2) does an open source implementation exist?

You could accomplish something like this -- but it's not likely you'd gain all that much.
First, there's the performance aspect. You are trading time for memory consumption. Whatever memory you save is offset by having to do an addition and a multiply on every call to end() (okay, if sizeof(vector<T>::value_type) == 1 the multiply can be optimized out). Note that most handwritten looping code over vectors calls end() on every loop iteration. On modern CPUs the smaller footprint can actually be a major win, because it allows the processor to keep more things in cache, unless those couple of extra instructions in an inner loop force the processor to swap things in and out of the instruction cache too often.
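As a rough illustration of that trade-off, here is a minimal sketch of the two layouts. The names CompactHeader and ThreePointerHeader are hypothetical and exist only for this example; allocation, growth and the allocator are left out entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical compact layout: one pointer plus two 32-bit counters.
template <typename T>
struct CompactHeader {
    T*            data     = nullptr;
    std::uint32_t size     = 0;   // number of live elements
    std::uint32_t capacity = 0;   // number of allocated slots

    T* begin() const { return data; }
    // end() is recomputed on every call: a pointer addition whose offset
    // is size scaled by sizeof(T).
    T* end() const { return data + size; }
};

// Conventional three-pointer layout, for comparison.
template <typename T>
struct ThreePointerHeader {
    T* first          = nullptr;
    T* last           = nullptr;
    T* end_of_storage = nullptr;

    T* begin() const { return first; }
    T* end() const { return last; }   // stored, not computed
};
```

On a typical 64-bit target the compact header is 16 bytes and the three-pointer one is 24, but the compact version caps size and capacity at 2^32 - 1 elements.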
Moreover, the memory savings is likely to be small in terms of the overall memory use in the vector, for the following reasons:
Memory manager overhead. Each allocation from the memory manager (which vector of course needs) is going to add 16-24 bytes of overhead on its own in most memory manager implementations. (Assuming something like dlmalloc (UNIX/Linux/etc.) or RtlHeap (Windows))
Overprovisioning load. In order to achieve amortized constant insertion and removal at the end, vector grows its capacity by a constant factor whenever it reallocates: 1.5 (MSVC++) or 2 (STLPort, libstdc++). This means that the capacity vector allocates is typically well above the number of elements actually stored in the vector.
Alignment restrictions. If you are putting those many vectors into an array (or another vector), then keep in mind the first member of that vector is still a pointer to the allocated memory block. This pointer generally needs to be 8 byte aligned anyway -- so the 4 bytes you save are lost to structure padding in arrays.
I'd use the plain implementation of vector for now. If you run your code through a memory profiler and find that a significant savings would be made by getting rid of these couple of pointers, then you're probably better off implementing your own optimized class which meets your performance characteristics rather than relying on the built-in vector implementation. (An example of one such optimized class is std::string on those platforms that implement the small string optimization.)
(Note: the only compiler of which I am aware that optimizes out the Alloc pointer is VC11, which isn't yet released. Though Nim says that the current prerelease version of libstdc++ does it as well...)

Unless these vectors are going to have contents that are extremely small, the difference between 16 and 32 bytes for the bookkeeping will be a small percentage of the total memory consumed by them. It will require lots of effort to reinvent this wheel, so be sure you're getting an adequate payoff for all that work.
BTW, there's value in education too, and you will learn a lot by doing this. If you choose to proceed, you might consider writing a test suite first, and exercise it on the current implementation and then on the one you invent.

To answer whether it is worth the effort, find or write a compatible implementation that fits your needs (maybe there are other things in std::vector that you do not need), and compare its performance with std::vector<your_type> on the relevant platforms. Your suggestion can at least improve the performance of the move constructor and the move assignment operator, since the whole 16-byte state can then be moved as a single 16-byte value:
union {
    // GCC/Clang vector extension: one 16-byte value overlaying the fields below
    typedef int32_t v4si __attribute__ ((vector_size (16)));
    v4si data;
    struct {
        T* pointer;
        uint32_t length;
        uint32_t capacity;
    } content;
} m_data;
This only covers "sane" T (noexcept move semantics).


Utilize memory past the end of a std::vector using a custom overallocating allocator

Let's say I have an allocator my_allocator that will always allocate memory for n+x (instead of n) elements when allocate(n) is called.
Can I safely assume that memory in the range [data()+n, data()+n+x) (for a std::vector<T, my_allocator<T>>) is accessible/valid for further use (i.e. placement new or simd loads/stores in the case of fundamental types), as long as there is no reallocation?
Note: I'm aware that everything past data()+n-1 is uninitialized storage. The use case would be a vector of fundamental types (which do not have a constructor anyway) using the custom allocator to avoid having special corner cases when throwing simd intrinsics at the vector. my_allocator shall allocate storage that is 1.) properly aligned and has 2.) a size that is a multiple of the used register size.
To make things a little bit more clear:
Let's say I have two vectors and I want to add them:
std::vector<double, my_allocator<double>> a(n), b(n);
// fill them ...
auto c = a + b;
assert(c.size() == n);
If the storage obtained from my_allocator now allocates aligned storage and if sizeof(double)*(n+x) is always a multiple of the used simd register size (and thus a multiple of the number of values per register) I assume that I can do something like
for (size_t i = 0u; i < (n + x); i += y)
{ // where y is the number of doubles per register and a divisor of (n+x)
    auto ma = _aligned_load(a.data() + i);
    auto mb = _aligned_load(b.data() + i);
    _aligned_store(c.data() + i, _simd_add(ma, mb));
}
where I don't have to care about any special case like unaligned loads or a remainder from some n that is not divisible by y.
But still the vectors only contain n values and can be handled like vectors of size n.
Stepping back a moment, if the problem you are trying to solve is to allow the underlying memory to be processed effectively by SIMD intrinsics or unrolled loops, or both, you don't necessarily need to allocate memory beyond the used amount just to "round off" the allocation size to a multiple of vector width.
There are various approaches used to handle this situation, and you mentioned a couple, such as special lead-in and lead-out code to handle the leading and trailing portions. There are actually two distinct problems here - handling the fact the data isn't a multiple of the vector width, and handling (possibly) unaligned starting addresses. Your over-allocation method is tackling the first issue - but there's probably a better way...
Most SIMD code in practice can simply read beyond the end of the processed region. Some might argue that this is technically UB - but when using SIMD intrinsics you are already venturing beyond the walls of Standard C++. In fact, this technique is already widely used in the standard library and so it is implicitly endorsed by compiler and library maintainers. It is also a standard method for handling SIMD codes in general, so you can be pretty sure it's not going to suddenly break.
The key to making it work is the observation that if you can validly read even a single byte at some location N, then any naturally aligned read of any size[1] that includes that byte won't trigger a fault. Of course, you still need to ignore or otherwise handle the data you read beyond the end of the officially allocated area - but you'll need to do that anyway with your "allocate extra" approach, right? Depending on the algorithm, you may mask away the invalid data, or exclude invalid data after the SIMD portion is done (i.e., if you are searching for a byte and you find it after the allocated area, that's the same as "not found").
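The reasoning behind that observation can be checked with a little address arithmetic. The sketch below assumes a 4096-byte page and power-of-two read widths no larger than a page; the function names are made up for this example and it performs no actual memory access.

```cpp
#include <cassert>
#include <cstdint>

const std::uintptr_t kPageSize = 4096;  // assumed page size

// Start of the naturally aligned block of `width` bytes that contains addr.
// width must be a power of two.
inline std::uintptr_t block_start(std::uintptr_t addr, std::uintptr_t width) {
    return addr & ~(width - 1);
}

inline std::uintptr_t page_of(std::uintptr_t addr) { return addr / kPageSize; }

// True if the whole aligned block containing addr lies in the same page as
// addr, i.e. an aligned read covering addr cannot touch an unmapped page.
inline bool block_stays_in_page(std::uintptr_t addr, std::uintptr_t width) {
    const std::uintptr_t first = block_start(addr, width);
    const std::uintptr_t last  = first + width - 1;
    return page_of(first) == page_of(addr) && page_of(last) == page_of(addr);
}
```

Because the width divides the page size, an aligned block never straddles a page boundary, which is why a valid byte anywhere in the block makes the whole block readable at the hardware level.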
To make this work, you need to be reading in an aligned fashion, but that's probably something you already want to do I think. You can either arrange to have your memory allocated aligned in the first place, or do an overlapping read at the start (i.e., one unaligned read first, then all aligned with the first aligned read overlapping the unaligned portion), or use the same trick as the tail to read before the array (with the same reasoning as to why this is safe). Furthermore, there are various tricks to request aligned memory without needing to write your own allocator.
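One way to get aligned memory without a custom allocator, assuming C++11's std::align is acceptable, is to over-allocate an ordinary byte buffer and carve an aligned window out of it. The helper names below are hypothetical, and this is only one of several possible approaches.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Resizes `storage` so that it can hold `bytes` bytes at the requested
// alignment, and returns a pointer to the aligned window inside it.
// alignment must be a power of two. The pointer is invalidated if
// `storage` is resized or destroyed afterwards.
void* aligned_window(std::vector<unsigned char>& storage,
                     std::size_t bytes, std::size_t alignment) {
    storage.resize(bytes + alignment - 1);   // slack for the worst-case shift
    void* p = storage.data();
    std::size_t space = storage.size();
    return std::align(alignment, bytes, p, space);
}

// Checks whether p is a multiple of alignment.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

The cost is up to alignment - 1 wasted bytes per buffer, which is usually negligible next to the data itself.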
Overall, my recommendation is to try to avoid writing a custom allocator. Unless the code is fairly tightly contained, you may run into various pitfalls, including other code making wrong assumptions about how your memory was allocated and the various other pitfalls Leon mentions in his answer. Furthermore, using a custom allocator disables a bunch of optimizations used by the standard container algorithms, unless you use it everywhere, since many of them apply only to containers using the same allocator.
Furthermore, when I was actually implementing custom allocators[2], I found that it was a nice idea in theory, but a bit too obscure to be well-supported in an identical fashion across all the compilers. Now the compilers have become a lot more compliant over time (I'm looking mostly at you, Visual Studio), and template support has also improved, so perhaps that's not an issue, but I feel it still falls into the category of "do it only if you must".
Keep in mind also that custom allocators don't compose well - you only get the one! If someone else on your project wants to use a custom allocator for your container for some other reason, they won't be able to do it (although you could coordinate and create a combined allocator).
This question I asked earlier - also motivated by SIMD - covers a lot of the ground about the safety of reading past the end (and, implicitly, before the beginning), and is probably a good place to start if you are considering this.
[1] Technically, the restriction is any aligned read up to the page size, which at 4K or larger is plenty for any of the current vector-oriented general purpose ISAs.
[2] In this case, I was doing it not for SIMD, but basically to avoid malloc() and to allow partially on-stack and contiguous fast allocations for containers with many small nodes.
For your use case you shouldn't have any doubts. However, if you decide to store anything useful in the extra space and will allow the size of your vector to change during its lifetime, you will probably run into problems dealing with the possibility of reallocation - how are you going to transfer the extra data from the old allocation to the new allocation given that reallocation happens as a result of separate calls to allocate() and deallocate() with no direct connection between them?
EDIT (addressing the code added to the question)
In my original answer I meant that you shouldn't have any problem accessing the extra bytes allocated by your allocator in excess of what was requested. However, writing data in the memory range, that is outside the range currently utilized by the vector object but belongs to the range that would be spanned by the unmodified allocation, asks for trouble. An implementation of std::vector is free to request from the allocator more memory than would be exposed through its size()/capacity() functions and store auxiliary data in the unused area. Though this is highly theoretical, not accounting for that possibility means opening a door into undefined behavior.
Consider the following possible layout of the vector's allocation:
=== - used capacity of the vector
+++ - unused capacity of the vector
--- - overallocated by the vector (but not shown as part of its capacity)
... - overallocated by your allocator
You MUST NOT write anything in the regions 2 (---) and 3 (+++). All your writes must be constrained to the region 4 (...), otherwise you may corrupt important bits.

What is the theoretical impact of direct index access with "high" memory usage vs. "shifted" index access with "low" memory usage?

Well I am really curious as to what practice is better to keep, I know it (probably?) does not make any performance difference at all (even in performance critical applications?) but I am more curious about the impact on the generated code with optimization in mind (and for the sake of completeness, also "performance", if it makes any difference).
So the problem is as follows:
element indexes range from A to B where A > 0 and B > A (e.g., A = 1000 and B = 2000).
To store information about each element there are a few possible solutions, two of those which use plain arrays include direct index access and access by manipulating the index:
example 1
//declare the array with less memory, "just" 1000 elements, all elements used
std::array<T, B-A> Foo;
//but make accessing by index slower?
//accessing index N where B > N >= A
Foo[N-A];
example 2
//or declare the array with more memory, 2000 elements, 50% of the elements not used, not very "efficient" for memory
std::array<T, B> Foo;
//but make accessing by index faster?
//accessing index N where B > N >= A
Foo[N];
I'd personally go for #2 because I really like performance, but I think in reality:
the compiler will take care of both situations?
What is the impact on optimizations?
What about performance?
does it matter at all?
Or is this just the next "micro optimization" thing that no human being should worry about?
Is there some Tradeoff ratio between memory usage : speed which is recommended?
Accessing an array element by index involves multiplying the index by the element size and adding the result to the base address of the array.
Since we are already adding one number to another, the adjustment for foo[N-A] can easily be made by moving the base address down by A * sizeof(T) before adding N * sizeof(T), rather than actually calculating (N-A)*sizeof(T).
In other words, any decent compiler should completely hide this subtraction, assuming A is a constant value.
If it's not a constant (say you are using std::vector instead of std::array), then you will indeed subtract A from N at some point in the code. It is still pretty cheap to do this. Most modern processors can do this in one cycle with no extra latency for the result, so at worst it adds a single clock cycle to the access.
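To make the constant-offset case concrete, here is a minimal sketch of a wrapper that stores only B - A elements but is indexed with the original range. OffsetArray is a hypothetical name for this example; since A is a template parameter, the subtraction is a compile-time constant the compiler is free to fold into the address computation.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical wrapper: stores B - A elements but is indexed with A..B-1.
template <typename T, std::size_t A, std::size_t B>
class OffsetArray {
    static_assert(A < B, "range must be non-empty");
    std::array<T, B - A> data_{};   // value-initialized storage
public:
    T& operator[](std::size_t n) {
        assert(n >= A && n < B);    // debug-only range check
        return data_[n - A];
    }
    const T& operator[](std::size_t n) const {
        assert(n >= A && n < B);
        return data_[n - A];
    }
    static constexpr std::size_t size() { return B - A; }
};
```

Whether this beats the oversized array in practice is something only a benchmark on the target compiler and processor can tell.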
Of course, if the numbers are 1000-2000, it probably makes really little difference in the whole scheme of things - either the total time to process that is nearly nothing, or it's a lot because you do complicated stuff. But if you were to make it a million elements, offset by half a million, it may make the difference between a simple or complex method of allocating them, or some such.
Also, as Hans Passant implies: on modern OSes with virtual memory handling, memory that isn't actually used doesn't get populated with "real memory". At work I was investigating a strange crash on a board that has 2GB of RAM, and when viewing the memory usage, it showed that this one application had allocated 3GB of virtual memory. This board does not have a swap disk (it's an embedded system). It turns out that some code was simply allocating large chunks of memory that weren't filled with anything, and it only stopped working when it reached 3GB (32-bit processor, 3+1GB memory split between user/kernel space). So even for LARGE lumps of memory, the half you never access won't actually take up any RAM.
As ALWAYS when it comes to performance, compilers and such, if it's important, do not trust "the internet" to tell you the answer. Set up a test with the code you actually intend to use, using the actual compiler(s) and processor type(s) that you plan to produce your code with/for, and run benchmarks. Some compiler may well have a misfeature (on processor type XYZ9278) that makes it produce horrible code for a case that most other compilers handle "with no overhead at all".

What's generally the size limit to switch from a vector to a deque?

I recently wrote this post:
How best to store VERY large 2D list of floats in c++? Error-handling?
Some suggested that I implement my 2D list-like structure of floats as a vector, others said a deque.
From what I gather, vector requires contiguous memory, but is hence more efficient. Obviously this would be desirable if possible.
Thus my question is what's a good rule of how long a basic structure can be in terms of...
1. float
2. int
...before you should switch from a vector to a deque to avoid memory problems?
e.g. I'm looking for answer like "At around 4 million floats or 8 million ints, you should switch..." ...if possible.
Well, here are two opinions. The C++ standard says (23.1.1/2):
vector is the type of sequence that should be used by default.
list should be used when there are frequent insertions and deletions from the middle of the sequence.
deque is the data structure of choice when most insertions and deletions take place at the beginning or at the end of the sequence.
Herb Sutter argues the following (the article contains his rationale and a performance analysis):
I'd like to present an amiably dissenting point of view: I recommend that you consider preferring deque by default instead of vector, especially when the contained type is a class or struct and not a builtin type, unless you really need the container's memory to be contiguous.
Again, there is no size limit above which deque is or is not better than vector. Memory fragmentation implications are pretty much the same in either case, except when you have already done a huge load of allocations/deallocations and there is not enough contiguous space left for a big vector. But this case is very rare. Remember that memory space is per process (google for virtual memory). And you can remedy it by allocating the memory for the vector (with the reserve method) before the cluttering takes place.
The tradeoff is in terms of what you want to do with it. If the structure is basically immutable and you only want to access it / overwrite it by index access, go for vector.
Deque is for when you need cheap insertions at the beginning as well as at the end, something vector only handles naturally at the end. (Insertions in the middle are linear-time for both containers.)
Herb Sutter's articles are in general of great quality, but you'll notice that when you do "number crunching" in C++, most of the stuff you're taught in "general C++" books must be taken with an extra word of caution. The poor indexing performance you experience with deques is perhaps important for your application. In this case, don't use deque.
If you need insertions at the beginning, then go with deque.
Otherwise, I always like to point to this article on vector vs. deque (in addition to those linked by James McNellis here). Assuming an implementation of deque that uses page-based allocation, this article has good comparisons of allocation time (& deallocation time) for vector with & without reserve() vs. deque. Basically, using reserve() makes vector allocation time very similar to deque. Informative, and useful if you can guess the right value to reserve ahead of time.
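The effect of reserve() on reallocation can be observed directly. The sketch below uses a hypothetical helper, count_reallocations, that counts capacity changes while appending. It measures the number of reallocations, not time, so it says nothing about the timings in the article.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how often the capacity changes (i.e. how often the vector
// reallocates) while n floats are appended one at a time.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<float> v;
    if (reserve_first) v.reserve(n);      // one allocation up front
    std::size_t reallocations = 0;
    std::size_t last = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(0.0f);
        if (v.capacity() != last) {       // the buffer was replaced
            ++reallocations;
            last = v.capacity();
        }
    }
    return reallocations;
}
```

With reserve_first set, the count is zero; without it, the exact number depends on the implementation's growth factor.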
There are so many factors to consider that it's impossible to give a clear answer. The amount of memory on the machine, how fragmented it is, how fragmented it may become, etc. My suggestion is to just choose one and use it. If it causes problems switch. Chances are you aren't going to hit those edge cases anyway.
If you are truly worried, then you could probably implement a sort of pseudo PIMPL:
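// needs <deque>, <vector>, <new> (std::bad_alloc) and <cstddef>
template<typename T>
class VectorDeque {
    enum TYPE { NONE, VECTOR, DEQUE };
    std::deque<T> m_d;
    std::vector<T> m_v;
    TYPE m_type = NONE;
public:
    void resize(size_t n) {
        switch (m_type) {
        case NONE:
            try { m_v.resize(n); m_type = VECTOR; }                      // prefer contiguous storage
            catch (std::bad_alloc &ba) { m_d.resize(n); m_type = DEQUE; } // fall back to deque
            break;
        case VECTOR: m_v.resize(n); break;
        case DEQUE:  m_d.resize(n); break;
        }
    }
};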
But this seems like total overkill. Have you done any benchmarking or tests? Do you have any reason to believe that memory allocations will fail or that deques will be too slow?
You switch after testing and profiling indicate that one is worse than the other for your application. There is no "around N floats or M ints" universal answer.
Well, regarding memory, I can share some experience that may help you decide when contiguous memory blocks (malloc or std::vector) may become too large:
The application I work with records measurement data, mostly 4-byte floats, and for this it allocates internal buffers to store the data. These buffers vary heavily in size, but the typical range may be, say, several dozen of 1-10 MB and a very few of >100 MB. The buffers are always allocated with calloc, i.e. as one big chunk of memory. If a buffer allocation fails, an error is logged and the user has the choice to try again.
Buffer sizes: Say you want to record 1000 channels at 100 Hz for 10 minutes: 4 bytes x 1000 x 100 x 60 x 10 == 228 MB (approx.) ... or 100 channels at 10 Hz for 12 hours: 4 bytes x 100 x 10 x 3600 x 12 == 165 MB (approx.)
We (nearly) never had any problems allocating 40 MB buffers (and that's about 10 million floats), and the 200-300 MB buffers fail from time to time -- all this on normal WinXP/32-bit boxes with 4 GB RAM.
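The buffer-size arithmetic is easy to get wrong by hand, so here is a small sketch for checking it. The function names are made up for this example, and it assumes 4-byte samples by default and reports sizes in MiB (2^20 bytes).

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to record `channels` channels at `hz` samples per second
// for `seconds` seconds, with `sample_bytes` bytes per sample.
std::uint64_t buffer_bytes(std::uint64_t channels, std::uint64_t hz,
                           std::uint64_t seconds,
                           std::uint64_t sample_bytes = 4) {
    return channels * hz * seconds * sample_bytes;
}

// Converts a byte count to MiB (2^20 bytes).
double to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

For example, buffer_bytes(1000, 100, 10 * 60) gives 240,000,000 bytes, which to_mib reports as roughly 229.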
Given that you don't insert after creation, you should probably either use plain old std::vector, or if fragmentation really does become an issue, a custom vector-like Sequence implemented as a vector or array of pointers to fixed-size arrays.
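A vector of pointers to fixed-size arrays could look roughly like the sketch below. ChunkedVector and its members are hypothetical names, T is assumed to be default-constructible and copy-assignable, and only append and index access are shown.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical chunked sequence: element storage is split into fixed-size
// blocks, so only the small table of block pointers has to be contiguous.
template <typename T, std::size_t ChunkSize = 4096>
class ChunkedVector {
    std::vector<std::unique_ptr<T[]>> chunks_;
    std::size_t size_ = 0;
public:
    void push_back(const T& value) {
        if (size_ == chunks_.size() * ChunkSize)   // all chunks are full
            chunks_.push_back(std::unique_ptr<T[]>(new T[ChunkSize]()));
        chunks_[size_ / ChunkSize][size_ % ChunkSize] = value;
        ++size_;
    }
    T& operator[](std::size_t i) {
        assert(i < size_);
        return chunks_[i / ChunkSize][i % ChunkSize];
    }
    std::size_t size() const { return size_; }
    std::size_t chunk_count() const { return chunks_.size(); }
};
```

Indexing costs one extra division and indirection compared with a plain vector, which is essentially the same trade-off a typical deque makes.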

Determining maximum possible alignment in C++

Is there any portable way to determine what the maximum possible alignment for any type is?
For example on x86, SSE instructions require 16-byte alignment, but as far as I'm aware, no instructions require more than that, so any type can be safely stored into a 16-byte aligned buffer.
I need to create a buffer (such as a char array) where I can write objects of arbitrary types, and so I need to be able to rely on the beginning of the buffer to be aligned.
If all else fails, I know that allocating a char array with new is guaranteed to have maximum alignment, but with the TR1/C++0x templates alignment_of and aligned_storage, I am wondering if it would be possible to create the buffer in-place in my buffer class, rather than requiring the extra pointer indirection of a dynamically allocated array.
I realize there are plenty of options for determining the max alignment for a bounded set of types: A union, or just alignment_of from TR1, but my problem is that the set of types is unbounded. I don't know in advance which objects must be stored into the buffer.
In C++11 std::max_align_t defined in header cstddef is a POD type whose alignment requirement is at least as strict (as large) as that of every scalar type.
Using the new alignof operator it would be as simple as alignof(std::max_align_t)
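For the in-place buffer described in the question, a minimal C++11 sketch might look like this. InPlaceBuffer is a hypothetical name, and the sketch only covers types whose alignment does not exceed alignof(std::max_align_t), so over-aligned SIMD types are rejected at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// In-place buffer aligned for every scalar type. Over-aligned types
// (e.g. SIMD vector types) may need more than this.
template <std::size_t N>
struct InPlaceBuffer {
    alignas(std::max_align_t) unsigned char bytes[N];

    // Constructs a copy of value at the start of the buffer.
    // The caller is responsible for destroying non-trivial objects.
    template <typename T>
    T* emplace(const T& value) {
        static_assert(sizeof(T) <= N, "object does not fit");
        static_assert(alignof(T) <= alignof(std::max_align_t),
                      "over-aligned type");
        return ::new (static_cast<void*>(bytes)) T(value);
    }

    // True if the buffer start is a multiple of alignof(std::max_align_t).
    bool aligned() const {
        return reinterpret_cast<std::uintptr_t>(bytes) %
                   alignof(std::max_align_t) == 0;
    }
};
```

This avoids the extra pointer indirection of a heap-allocated array, at the price of fixing the buffer size at compile time.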
In C++0x, the Align template parameter of std::aligned_storage<Len, Align> has a default argument of "default-alignment," which is defined as (N3225 §20.7.6.6 Table 56):
The value of default-alignment shall be the most stringent alignment requirement for any C++ object type whose size is no greater than Len.
It isn't clear whether SSE types would be considered "C++ object types."
The default argument wasn't part of the TR1 aligned_storage; it was added for C++0x.
Unfortunately ensuring max alignment is a lot tougher than it should be, and there are no guaranteed solutions AFAIK. From the GotW blog (Fast Pimpl article):
union max_align {
    short       dummy0;
    long        dummy1;
    double      dummy2;
    long double dummy3;
    void*       dummy4;
    /*...and pointers to functions, pointers to
         member functions, pointers to member data,
         pointers to classes, eye of newt, ...*/
};
union {
    max_align m;
    char x_[sizeofx];
};
This isn't guaranteed to be fully portable, but in practice it's close enough because there are few or no systems on which this won't work as expected.
That's about the closest 'hack' I know for this.
There is another approach that I've used personally for super fast allocation. Note that it is evil, but I work in raytracing fields where speed is one of the greatest measures of quality and we profile code on a daily basis. It involves using a heap allocator with pre-allocated memory that works like the local stack (just increments a pointer on allocation and decrements one on deallocation).
I use it for Pimpls particularly. However, just having the allocator is not enough; for such an allocator to work, we have to assume that memory for a class, Foo, is allocated in a constructor, the same memory is likewise deallocated only in the destructor, and that Foo itself is created on the stack. To make it safe, I needed a function to see if the 'this' pointer of a class is on the local stack to determine if we can use our super fast heap-based stack allocator. For that we had to research OS-specific solutions: I used TIBs and TEBs for Win32/Win64, and my co-workers found solutions for Linux and Mac OS X.
The result, after a week of researching OS-specific methods to detect stack range, alignment requirements, and doing a lot of testing and profiling, was an allocator that could allocate memory in 4 clock cycles according to our tick counter benchmarks as opposed to about 400 cycles for malloc/operator new (our test involved thread contention so malloc is likely to be a bit faster than this in single-threaded cases, perhaps a couple of hundred cycles). We added a per-thread heap stack and detected which thread was being used which increased the time to about 12 cycles, though the client can keep track of the thread allocator to get the 4 cycle allocations. It wiped out memory allocation based hotspots off the map.
While you don't have to go through all that trouble, writing a fast allocator might be easier and more generally applicable (ex: allowing the amount of memory to allocate/deallocate to be determined at runtime) than something like max_align here. max_align is easy enough to use, but if you're after speed for memory allocations (and assuming you've already profiled your code and found hotspots in malloc/free/operator new/delete with major contributors being in code you have control over), writing your own allocator can really make the difference.
Short of some maximally_aligned_t type that all compilers promised faithfully to support for all architectures everywhere, I don't see how this could be solved at compile time. As you say, the set of potential types is unbounded. Is the extra pointer indirection really that big a deal?
Allocating aligned memory is trickier than it looks - see for example Implementation of aligned memory allocation
This is what I'm using. In addition to this, if you're allocating memory then a new()'d array of char with length greater than or equal to max_alignment will be aligned to max_alignment so you can then use indexes into that array to get aligned addresses.
// assumes the elided wrapper was mpl::max_element over an mpl::vector
enum {
    max_alignment = boost::mpl::deref<
        boost::mpl::max_element<boost::mpl::vector<
            boost::mpl::int_<boost::alignment_of<signed char>::value>::type,
            boost::mpl::int_<boost::alignment_of<short int>::value>::type,
            boost::mpl::int_<boost::alignment_of<int>::value>::type,
            boost::mpl::int_<boost::alignment_of<long int>::value>::type,
            boost::mpl::int_<boost::alignment_of<long double>::value>::type
        > >::type
    >::type::value
};

Are there any practical limitations to only using std::string instead of char arrays and std::vector/list instead of arrays in c++?

I use vectors, lists, strings and wstrings obsessively in my code. Are there any catch 22s involved that should make me more interested in using arrays from time to time, chars and wchars instead?
Basically, if working in an environment which supports the standard template library is there any case using the primitive types is actually better?
For 99% of the time and for 99% of Standard Library implementations, you will find that std::vectors will be fast enough, and the convenience and safety you get from using them will more than outweigh any small performance cost.
For those very rare cases when you really need bare-metal code, you can treat a vector like a C-style array:
vector <int> v( 100 );
int * p = &v[0];
p[3] = 42;
The C++ standard guarantees that vectors are allocated contiguously, so this is guaranteed to work.
Regarding strings, the convenience factor becomes almost overwhelming, and the performance issues tend to go away. If you go back to C-style strings, you are also going back to the use of functions like strlen(), which are inherently very inefficient themselves.
As for lists, you should think twice, and probably thrice, before using them at all, whether your own implementation or the standard. The vast majority of computing problems are better solved using a vector/array. The reason lists appear so often in the literature is to a large part because they are a convenient data structure for textbook and training course writers to use to explain pointers and dynamic allocation in one go. I speak here as an ex training course writer.
I would stick to STL classes (vectors, strings, etc.). They are safer, easier to use, more productive, less likely to cause memory leaks and, AFAIK, they perform some additional run-time bounds checking, at least in DEBUG builds (Visual C++).
Then, measure the performance. If you identify that the bottleneck is in the STL classes, then move to C-style strings and arrays.
From my experience, the chances to have the bottleneck on vector or string usage are very low.
One problem is the overhead when accessing elements. Even with vector and string when you access an element by index you need to first retrieve the buffer address, then add the offset (you don't do it manually, but the compiler emits such code). With raw array you already have the buffer address. This extra indirection can lead to significant overhead in certain cases and is subject to profiling when you want to improve performance.
If you don't need real time responses, stick with your approach. They are safer than chars.
You can occasionally encounter scenarios where you'll get better performance or memory usage from doing some stuff yourself (example, std::string typically has about 24 bytes of overhead, 12 bytes for the pointers in the std::string itself, and a header block on its dynamically allocated piece).
I have worked on projects where converting from std::string to const char* saved noticeable memory (10's of MB). I don't believe these projects are what you would call typical.
Oh, using STL will hurt your compile times, and at some point that may be an issue. When your project results in over a GB of object files being passed to the linker, you might want to consider how much of that is template bloat.
I've worked on several projects where the memory overhead for strings has become problematic.
It's worth considering in advance how your application needs to scale. If you need to be storing an unbounded number of strings, using const char*s into a globally managed string table can save you huge amounts of memory.
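A string table of that kind can be sketched in a few lines. StringTable and intern are hypothetical names, and the sketch uses an ordinary object rather than a global. Thread safety is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>

// Hypothetical interning table: each distinct string is stored once and
// callers keep only a const char* into the table.
class StringTable {
    std::unordered_set<std::string> strings_;
public:
    // Returns a pointer that stays valid for the lifetime of the table:
    // unordered_set does not move its elements, even when it rehashes.
    const char* intern(const std::string& s) {
        return strings_.insert(s).first->c_str();
    }
    std::size_t size() const { return strings_.size(); }
};
```

Equal strings yield the same pointer, so duplicates cost one pointer each instead of a full std::string, and equality can be tested by comparing pointers.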
But generally, definitely use STL types unless there's a very good reason to do otherwise.
I believe the default allocation strategy for the buffers of vectors and strings is to double the amount of allocated memory each time the currently allocated memory gets used up. This can be wasteful. You can provide a custom allocator, of course...
The other thing to consider is stack vs. heap. Statically sized arrays and strings can sit on the stack, or at least the compiler handles the memory management for you. Newer compilers will handle dynamically sized arrays for you too if they provide the relevant C99/C++0x feature. Vectors and strings will always use the heap, and this can introduce performance issues if you have really tight constraints.
As a rule of thumb, use what's already there unless it hurts your project with its speed/memory overhead... you'll probably find that for 99% of stuff the STL-provided classes save you time and effort with little to no impact on your application's performance. (i.e. "avoid premature optimisation")