What's the real memory cost for std::set<MyClass>? - c++

Suppose the sizeof MyClass is N Bytes, and I put M objects of MyClass into a std::set<MyClass>. The theoretic memory cost is N*M Bytes, however as I know std::set is implemented using Tree structure so I reckon there must be extra memory cost. How can I estimate an approximate and reasonable memory cost for a set, because when I run my program, the memory cost is higher than what I expected?

As mentioned by @SirDarius, this is implementation dependent, so you would need to check on a per-implementation basis.
Usually, std::set is implemented as a red-black tree with lean iterators. This means that for each element in the set, there is one node, and this node roughly contains:
3 pointers: 1 for the parent node, 1 for the left-side child and one for the right-side child
a tag: red or black
the actual user type
The minimal footprint of such a node on a 64-bit platform, footprint(node), is therefore:
3*8 bytes
sizeof(MyClass), rounded up to a multiple of 8 bytes (for alignment reasons: look up "struct padding")
Note: it is easy to stash the red/black flag in an unused bit somewhere at no additional cost, for example by exploiting the fact that, given the node alignment, the least-significant bits of the pointers are always 0, or by tucking it into MyClass's padding, if there is any.
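To make that arithmetic concrete, here is a sketch of what such a node might look like for a 9-byte MyClass; the struct name and layout are illustrative, not the actual layout of any particular Standard Library:

#include <cstdio>

struct MyClass { char data[9]; };   // sizeof(MyClass) == 9 for the sake of the example

// Illustrative node layout, roughly what a red-black tree node holds;
// real implementations name and order the members differently.
struct RbNode {
    RbNode* parent;    // 8 bytes on a typical 64-bit platform
    RbNode* left;      // 8 bytes
    RbNode* right;     // 8 bytes
    bool    is_red;    // the colour tag; often folded into padding or a pointer bit
    MyClass value;     // the user type
};

int main() {
    std::printf("sizeof(MyClass) = %zu\n", sizeof(MyClass));  // 9
    std::printf("sizeof(RbNode)  = %zu\n", sizeof(RbNode));   // typically 40: 24 + 1 + 9, padded to 8
}

On top of that per-node size comes the allocator's own rounding and bookkeeping, discussed next.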
However, when you allocate an object T via new (which is what std::allocator does under the hood, and which itself uses malloc unless overloaded), you do not allocate just sizeof(T) bytes:
a typical implementation of malloc uses size buckets, meaning that if you have 8-byte buckets and 16-byte buckets, then when you allocate a 9-byte object it goes into a 16-byte bucket (wasting 7 bytes)
furthermore, the implementation also has some overhead to track those allocated blocks of memory (to know not to reuse that memory until it's freed); though minimal, the overhead can be non-negligible
finally, the OS itself has some overhead for each page of memory that the program requests; although that is probably negligible
Therefore, indeed, the use of a set will most likely consume more than just set.size() * sizeof(MyClass), though the overhead depends on your Standard Library implementation. This overhead will likely be large for small objects (such as int32_t: on a 64-bit platform you are looking at at least +500% overhead) because the cost of linking the nodes together is usually fixed, whereas for large objects it might not be noticeable.
In order to reduce the overhead, you traditionally use arrays. This is what a B-Tree does, for example, though it does so at the cost of stability (i.e., each element staying at a fixed memory address regardless of what happens to the others).

The ordered associative containers have constraints which pretty much require that they are implemented as a balanced tree. Most likely the nodes in each tree will have two pointers to the children and a pointer to the parent node. Depending on whether the implementation plays bit tricks to store additional information in the pointers or not, there may be an extra word in the allocated nodes. Further, the memory management system may store a word for each allocated chunk of memory and align it, e.g., to a 16-byte boundary.
What the implementation actually does is certainly not specified in the standard but I'd think the minimum memory consumption for M elements of size N is
M * (N + 2 * sizeof(void*) + sizeof(int))
It is probably somewhat more: I would probably use a pointer to the parent, too.
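As a rough back-of-the-envelope helper (assuming three node pointers, a colour word, and 16-byte allocator granularity, none of which the standard or any particular implementation guarantees):

#include <cstddef>

// Estimate the memory used by a node-based set of M elements of size N,
// under the assumptions stated above.
std::size_t estimate_set_bytes(std::size_t M, std::size_t N) {
    std::size_t node = 3 * sizeof(void*) + sizeof(int) + N;
    node = (node + 15) / 16 * 16;       // assumed allocator rounding
    return M * node;
}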

Related

How much overhead is there per single object memory allocation?

Say, if I call malloc(sizeof(int)), requesting 4 bytes, how much extra will be added by system (or std library?) to support memory management infrastructure? I believe there should be some. Otherwise, how the system would know how many bytes to dispose of when I call free(ptr).
UPDATE 1: It may sound like a 'too broad question' and, obviously, a C/C++ library specific, but what I am interested in, is a minimum extra memory needed to support a single allocation. Even not a system or implementation specific. For example, for binary tree, there is a must of 2 pointers - left and right children, and no way you can squeeze it.
UPDATE 2:
I decided to check it for myself on Windows 64.
#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <windows.h>
#include <psapi.h>   // link with psapi.lib

int main(int argc, char *argv[])
{
    int m = (argc > 1) ? atoi(argv[1]) : 1;   // bytes per allocation
    int n = (argc > 2) ? atoi(argv[2]) : 0;   // number of allocations
    for (int i = 0; i < n; i++)
        malloc(m);                            // deliberately leaked; we only measure

    size_t peakKb = 0;
    PROCESS_MEMORY_COUNTERS pmc;
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        peakKb = pmc.PeakWorkingSetSize >> 10;

    printf("requested : %d kb, total: %zu kb\n", (m * n) >> 10, peakKb);
    _getch();
    return 0;
}
requested : 0 kb, total: 2080 kb
1 byte:
requested : 976 kb, total: 17788 kb
extra: 17788 - 2080 - 976 = 14732 (+1410%)
2 bytes:
requested : 1953 kb, total: 17784 kb
extra: 17784 - 2080 - 1953 = (+605% over)
4 bytes:
requested : 3906 kb, total: 17796 kb
extra: 17796 - 2080 - 3906 = 10810 (+177%)
8 bytes:
requested : 7812 kb, total: 17784 kb
extra: 17784 - 2080 - 7812 = (0%)
UPDATE 3: THIS IS THE ANSWER TO MY QUESTION I’VE BEEN LOOKING FOR: In addition to being slow, the genericity of the default C++ allocator makes it very space inefficient for small objects. The default allocator manages a pool of memory, and such management often requires some extra memory. Usually, the bookkeeping memory amounts to a few extra bytes (4 to 32) for each block allocated with new. If you allocate 1024-byte blocks, the per-block space overhead is insignificant (0.4% to 3%). If you allocate 8-byte objects, the per-object overhead becomes 50% to 400%, a figure big enough to make you worry if you allocate many such small objects.
For allocated objects, no additional metadata is theoretically required. A conforming implementation of malloc could round up all allocation requests to a fixed maximum object size, for example. So for malloc (25), you would actually receive a 256-byte buffer, and malloc (257) would fail and return a null pointer.
More realistically, some malloc implementations encode the allocation size in the pointer itself, either directly using bit patterns corresponding to specific fixed size classes, or indirectly using a hash table or a multi-level trie. If I recall correctly, the internal malloc for Address Sanitizer is of this type. For such mallocs, at least part of the immediate allocation overhead does not come from the addition of metadata for heap management, but from rounding the allocation size up to a supported size class.
Other mallocs have a per-allocation header of a single word (dlmalloc and its derivatives are popular examples). The actual per-allocation overhead is usually slightly larger because, due to the header word, you end up with odd supported allocation sizes (such as 24, 40, 56, … bytes with 16-byte alignment on a 64-bit system).
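On glibc you can observe this rounding directly with malloc_usable_size (a glibc extension declared in <malloc.h>; on Windows the rough equivalent is _msize). A small probe, purely illustrative:

#include <malloc.h>   // glibc: malloc_usable_size
#include <cstdio>
#include <cstdlib>

int main() {
    for (std::size_t req = 1; req <= 64; req *= 2) {
        void* p = std::malloc(req);
        std::printf("requested %2zu bytes, usable %2zu bytes\n",
                    req, malloc_usable_size(p));   // shows the size class actually granted
        std::free(p);
    }
}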
One thing to keep in mind is that many malloc implementations store a lot of data inside deallocated objects (which have not yet been returned to the operating system kernel), so that malloc (the function) can quickly find an unused memory region of the appropriate size. Particularly for dlmalloc-style allocators, this also imposes a constraint on the minimum object size. The use of deallocated objects for heap management contributes to malloc overhead, too, but its impact on individual allocations is difficult to quantify.
Say, if I call malloc(sizeof(int)), requesting 4 bytes, how much extra will be added by system (or std library?) to support memory management infrastructure? I believe there should be some. Otherwise, how the system would know how many bytes to dispose of when I call free(ptr).
This is entirely library specific. The answer could be anything from zero to whatever. Your library could add data to the front of the block. Some add data to the front and back of the block to track overwrites. The amount of overhead added varies among libraries.
The length could be tracked within the library itself with a table. In that case, there may not be a hidden field added to the allocated memory.
The library might only allocate blocks in fixed sizes. The amount you ask for gets rounded up to the next block size.
The pointer itself is essentially overhead and can be a dominant driver of memory use in some programs.
The theoretical minimum overhead might be sizeof(void*) for some theoretical system and use, but that combination of CPU, memory and usage pattern is so unlikely to exist as to be absolutely worthless for consideration. The standard requires that memory returned by malloc be suitably aligned for any data type, therefore there will always be some overhead, in the form of unused memory between the end of one allocated block and the beginning of the next allocated block, except in the rare cases where all memory usage is sized to a multiple of the block size.
The minimum implementation of malloc/free/realloc assumes the heap manager has one contiguous block of memory at its disposal, located somewhere in system memory; the pointer that said heap manager uses to reference that original block is overhead (again sizeof(void*)). One could imagine a highly contrived application that requested that entire block of memory, thus avoiding the need for additional tracking data. At this point we have 2 * sizeof(void*) worth of overhead: one internal to the heap manager, plus the returned pointer to the one allocated block (the theoretical minimum). Such a conforming heap manager is unlikely to exist, as it must also have some means of allocating more than one block from its pool, and that implies, at a minimum, tracking which blocks within its pool are in use.
One scheme to avoid overhead involves using pointer sizes that are larger than the physical or logical memory available to the application. One can store some information in those unused bits, but they would also count as overhead if their number exceeds a processor word size. Generally, only a handful of bits are used, and those identify which of the memory manager's internal pools the memory comes from. The latter implies additional overhead of pointers to pools. This brings us to real-world systems, where the heap manager implementation is tuned to the OS, hardware architecture and typical usage patterns.
Most hosted implementations (hosted == runs on an OS) request one or more memory blocks from the operating system in the C-runtime initialization phase. OS memory management calls are expensive in time and space. The OS has its own memory manager, with its own overhead set, driven by its own design criteria and usage patterns. So C-runtime heap managers attempt to limit the number of calls into the OS memory manager, in order to reduce the latency of the average call to malloc() and free(). Most request the first block from the OS when malloc is first called, but this usually happens at some point in the C-runtime initialization code. This first block is usually a low multiple of the system page size, which can be one or more orders of magnitude larger than the size requested in the initial malloc() call.
At this point it is obvious that heap manager overhead is extremely fluid and difficult to quantify. On a typical modern system, the heap manager must track multiple blocks of memory allocated from the OS, how many bytes are currently allocated to the application in each of those blocks and potentially, how much time has passed since a block went to zero. Then there's the overhead of tracking allocations from within each of those blocks.
Generally malloc rounds up to a minimum alignment boundary, and often this is not special-cased for small allocations, as applications are expected to aggregate many of these into a single allocation. The minimum alignment is often based on the largest required alignment for a load instruction in the architecture the code is running on. So with 128-bit SIMD (e.g. SSE or NEON) the minimum is 16 bytes. In practice there is a header as well, which causes the minimum cost in size to be doubled. As SIMD register widths have increased, malloc hasn't increased its guaranteed alignment.
As was pointed out, the minimum possible overhead is 0, though the pointer itself should probably be counted in any reasonable analysis. In a garbage collector design, at least one pointer to the data has to be present. In a straight non-GC design, one has to have a pointer to call free, but there's no ironclad requirement that it has to be called. One could theoretically compress a bunch of pointers together into less space as well, but now we're into an analysis of the entropy of the bits in the pointers. The point being, you likely need to specify some more constraints to get a really solid answer.
By way of illustration, if one needs arbitrary allocation and deallocation of just int size, one can allocate a large block and create a linked list of indexes using each int to hold the index of the next. Allocation pulls an item off the list and deallocation adds one back. There is a constraint that each allocation is exactly an int. (And that the block is small enough that the maximum index fits in an int.) Multiple sizes can be handled by having different blocks and searching for which block the pointer is in when deallocation happens. Some malloc implementations do something like this for small fixed sizes such as 4, 8, and 16 bytes.
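A minimal sketch of that scheme, assuming fixed int-sized allocations from one pre-allocated block (the class and member names are invented for illustration):

#include <cstdint>
#include <vector>

// A pool of fixed-size int slots where each free slot stores the index of the
// next free slot, so the free list costs no memory beyond the slots themselves.
class IntPool {
    std::vector<int32_t> slots_;
    int32_t free_head_;                 // index of the first free slot, -1 if none
public:
    explicit IntPool(int32_t count) : slots_(count), free_head_(count > 0 ? 0 : -1) {
        for (int32_t i = 0; i < count; ++i)
            slots_[i] = (i + 1 < count) ? i + 1 : -1;   // chain the free list
    }
    int32_t* allocate() {               // returns nullptr when exhausted
        if (free_head_ < 0) return nullptr;
        int32_t idx = free_head_;
        free_head_ = slots_[idx];
        return &slots_[idx];
    }
    void deallocate(int32_t* p) {       // push the slot back onto the free list
        int32_t idx = static_cast<int32_t>(p - slots_.data());
        slots_[idx] = free_head_;
        free_head_ = idx;
    }
};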
This approach doesn't hit zero overhead as one needs to maintain some data structure to keep track of the blocks. This is illustrated by considering the case of one-byte allocations. A block can at most hold 256 allocations as that is the maximum index that can fit in the block. If we want to allow more allocations than this, we will need at least one pointer per block, which is e.g. 4 or 8 bytes overhead per 256 bytes.
One can also use bitmaps, which amortize to one bit per some granularity plus the quantization of that granularity. Whether this is low overhead or not depends on the specifics. E.g. one bit per byte has no quantization but eats one eighth the allocation size in the free map. Generally this will all require storing the size of the allocation.
In practice allocator design is hard because the trade-off space between size overhead, runtime cost, and fragmentation overhead is complicated, often has large cost differences, and is allocation pattern dependent.

How to avoid wasting memory on 64-bit pointers

I'm hoping for some high-level advice on how to approach a design I'm about to undertake.
The straightforward approach to my problem will result in millions and millions of pointers. On a 64-bit system these will presumably be 64-bit pointers. But as far as my application is concerned, I don't think I need more than a 32-bit address space. I would still like for the system to take advantage of 64-bit processor arithmetic, however (assuming that is what I get by running on a 64-bit system).
Further background
I'm implementing a tree-like data structure where each "node" contains an 8-byte payload, but also needs pointers to four neighboring nodes (parent, left-child, middle-child, right-child). On a 64-bit system using 64-bit pointers, this amounts to 32 bytes just for linking an 8-byte payload into the tree -- a "linking overhead" of 400%.
The data structure will contain millions of these nodes, but my application will not need much memory beyond that, so all these 64-bit pointers seem wasteful. What to do? Is there a way to use 32-bit pointers on a 64-bit system?
I've considered
1. Storing the payloads in an array in a way such that an index implies (and is implied by) a "tree address", and neighbors of a given index can be calculated with simple arithmetic on that index. Unfortunately this requires me to size the array according to the maximum depth of the tree, which I don't know beforehand, and it would probably incur even greater memory overhead due to empty node elements in the lower levels, because not all branches of the tree go to the same depth.
2. Storing nodes in an array large enough to hold them all, and then using indices instead of pointers to link neighbors. AFAIK the main disadvantage here would be that each node would need the array's base address in order to find its neighbors. So they either need to store it (a million times over) or it needs to be passed around with every function call. I don't like this.
3. Assuming that the most-significant 32 bits of all these pointers are zero, throwing an exception if they aren't, and storing only the least-significant 32 bits. So the required pointer can be reconstructed on demand. The system is likely to use more than 4 GB, but the process never will. I'm just assuming that pointers are offset from a process base address and have no idea how safe (if at all) this would be across the common platforms (Windows, Linux, OSX).
4. Storing the difference between the 64-bit this and the 64-bit pointer to the neighbor, assuming that this difference will be within the range of int32_t (and throwing if it isn't). Then any node can find its neighbors by adding that offset to this.
Any advice? Regarding that last idea (which I currently feel is my best candidate) can I assume that in a process that uses less than 2GB, dynamically allocated objects will be within 2 GB of each other? Or not at all necessarily?
Combining ideas 2 and 4 from the question, put all the nodes into a big array, and store e.g. int32_t neighborOffset = neighborIndex - thisIndex. Then you can get the neighbor from *(this+neighborOffset). This gets rid of the disadvantages/assumptions of both 2 and 4.
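A sketch of what such a node could look like, with 0 reserved to mean "no neighbour" (that sentinel choice is an assumption of this sketch, valid because a node never links to itself):

#include <cstdint>

// Nodes live in one contiguous array; neighbours are reached by adding a
// signed 32-bit offset (in elements) to the node's own address.
struct Node {
    std::uint64_t payload;                       // the 8-byte payload
    std::int32_t  parent, left, middle, right;   // offsets, 0 meaning "none"

    Node*       neighbor(std::int32_t off)       { return off ? this + off : nullptr; }
    const Node* neighbor(std::int32_t off) const { return off ? this + off : nullptr; }
};

static_assert(sizeof(Node) == 24, "8-byte payload plus four 4-byte offsets");

That is 24 bytes per node instead of the 40 you would get with four 64-bit pointers.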
If on Linux, you might consider using (and compiling for) the x32 ABI. IMHO, this is the preferred solution for your issues.
Alternatively, don't use pointers, but indexes into a huge array (or a std::vector in C++) which could be a global or static variable. Manage a single huge heap-allocated array of nodes, and use indexes of nodes instead of pointers to nodes. So this is like your idea 2, but since the array is global or static data you won't need to pass it around everywhere.
(I guess that an optimizing compiler would be able to generate clever code, which could be nearly as efficient as using pointers)
You can remove the disadvantage of (2) by exploiting the alignment of memory regions to find the base address of the array "automatically". For example, if you want to support up to 4 GB of nodes, ensure your node array starts at a 4 GB boundary.
Then, within a node at address addr, you can recover the array's base address as addr & -(1UL << 32) and reach the node at position index by indexing from that base.
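For example, a sketch of that lookup (assuming a 64-bit process where the node array really does start on a 4 GiB boundary and index is an element index):

#include <cstdint>

template <typename Node>
Node* node_at(Node* any_node_in_the_array, std::uint32_t index) {
    // Clear the low 32 bits to recover the array base, then index into it.
    std::uintptr_t base = reinterpret_cast<std::uintptr_t>(any_node_in_the_array)
                        & ~((std::uintptr_t(1) << 32) - 1);
    return reinterpret_cast<Node*>(base) + index;
}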
This is kind of the "absolute" variant of the accepted solution which is "relative". One advantage of this solution is that an index always has the same meaning within a tree, whereas in the relative solution you really need the (node_address, index) pair to interpret an index (of course, you can also use the absolute indexes in the relative scenarios where it is useful). It means that when you duplicate a node, you don't need to adjust any index values it contains.
The "relative" solution also loses 1 effective index bit relative to this solution in its index since it needs to store a signed offset, so with a 32-bit index, you could only support 2^31 nodes (assuming full compression of trailing zero bits, otherwise it is only 2^31 bytes of nodes).
You can also store the base tree structure (e.g., the pointer to the root and whatever bookkeeping you have outside of the nodes themselves) right at the 4 GB boundary, which means that any node can jump to the associated base structure without traversing all the parent pointers or whatever.
Finally, you can also exploit this alignment idea within the tree itself to "implicitly" store other pointers. For example, perhaps the parent node is stored at an N-byte aligned boundary, and then all children are stored in the same N-byte block so they know their parent "implicitly". How feasible that is depends on how dynamic your tree is, how much the fan-out varies, etc.
You can accomplish this kind of thing by writing your own allocator that uses mmap to allocate suitably aligned blocks (usually you just reserve a huge amount of virtual address space and then allocate blocks of it as needed), either via the hint parameter or just by reserving a big enough region that you are guaranteed to get the alignment you want somewhere in the region. The need to mess around with allocators is the primary disadvantage compared to the accepted solution, but if this is the main data structure in your program it might be worth it. When you control the allocator you have other advantages too: if you know all your nodes are allocated on a 2^N-byte boundary you can "compress" your indexes further, since you know the low N bits will always be zero, so with a 32-bit index you could actually address 2^(32+5) = 2^37 bytes of nodes if you knew they were 32-byte aligned.
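A Linux-oriented sketch of reserving such a 4 GiB-aligned region by over-reserving address space with mmap (the function name and the choice to leave the unused head and tail mapped are simplifications of this sketch):

#include <sys/mman.h>
#include <cstddef>
#include <cstdint>

// Reserve `bytes` of address space starting at a 4 GiB boundary.
void* reserve_4gib_aligned(std::size_t bytes) {
    const std::size_t align = std::size_t(1) << 32;
    void* raw = mmap(nullptr, bytes + align, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (raw == MAP_FAILED)
        return nullptr;
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(raw);
    std::uintptr_t aligned = (p + align - 1) & ~(align - 1);   // bump up to the boundary
    return reinterpret_cast<void*>(aligned);
}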
These kind of tricks are really only feasible in 64-bit programs, with the huge amount of virtual address space available, so in a way 64-bit giveth and also taketh away.
Your assertion that a 64 bit system necessarily has to have 64 bit pointers is not correct. The C++ standard makes no such assertion.
In fact, different pointer types can be different sizes: sizeof(double*) might not be the same as sizeof(int*).
Short answer: don't make any assumptions about the sizes of any C++ pointer.
It sounds to me like you want to build your own memory management framework.

Is there a memory overhead associated with heap memory allocations (eg markers in the heap)?

Thinking in particular of C++ on Windows using a recent Visual Studio C++ compiler, I am wondering about the heap implementation:
Assuming that I'm using the release compiler, and I'm not concerned with memory fragmentation/packing issues, is there a memory overhead associated with allocating memory on the heap? If so, roughly how many bytes per allocation might this be?
Would it be larger in 64-bit code than 32-bit?
I don't really know a lot about modern heap implementations, but am wondering whether there are markers written into the heap with each allocation, or whether some kind of table is maintained (like a file allocation table).
On a related point (because I'm primarily thinking about standard-library features like 'map'), does the Microsoft standard-library implementation ever use its own allocator (for things like tree nodes) in order to optimize heap usage?
Yes, absolutely.
Every block of memory allocated will have a constant overhead of a "header", as well as a small variable part (typically at the end). Exactly how much that is depends on the exact C runtime library used. In the past, I've experimentally found it to be around 32-64 bytes per allocation. The variable part is to cope with alignment - each block of memory will be aligned to some nice even 2^n base-address - typically 8 or 16 bytes.
I'm not familiar with how the internal design of std::map or similar works, but I very much doubt they have special optimisations there.
You can quite easily test the overhead by:
#include <iostream>
#include <cstddef>

char *a = new char;
char *b = new char;
std::ptrdiff_t diff = a - b;   // formally undefined behaviour, see the note below
// cast to void* so the addresses are printed, not interpreted as C strings
std::cout << "a=" << static_cast<void*>(a) << " b=" << static_cast<void*>(b) << " diff=" << diff;
[Note to the pedants, which is probably most of the regulars here: the above a-b expression invokes undefined behaviour, since subtracting the address of one piece of allocated memory from the address of another is undefined behaviour. This rule exists to cope with machines that don't have linear memory addresses, e.g. segmented memory or "different types of data are stored in locations based on their type". The above should definitely work on any x86-based OS that doesn't use a segmented memory model with multiple data segments for the heap, which means it works for Windows and Linux in 32- and 64-bit mode for sure.]
You may want to run it with varying types. Just bear in mind that the diff is in units of the pointed-to type, so if you make it int *a, *b the difference will be in "four-byte units". You could use reinterpret_cast<char*>(a) - reinterpret_cast<char*>(b) to get the difference in bytes.
[diff may be negative, and if you run this in a loop (without deleting a and b), you may find sudden jumps where one large section of memory is exhausted, and the runtime library allocated another large block]
Visual C++ embeds control information (links/sizes and possibly some checksums) near the boundaries of allocated buffers. That also helps to catch some buffer overflows during memory allocation and deallocation.
On top of that you should remember that malloc() needs to return pointers suitably aligned for all fundamental types (char, int, long long, double, void*, void(*)()) and that alignment is typically of the size of the largest type, so it could be 8 or even 16 bytes. If you allocate a single byte, 7 to 15 bytes can be lost to alignment only. I'm not sure if operator new has the same behavior, but it may very well be the case.
This should give you an idea. The precise memory waste can only be determined from the documentation (if any) or testing. The language standard does not define it in any terms.
Yes. All practical dynamic memory allocators have a minimal granularity [1]. For example, if the granularity is 16 bytes and you request only 1 byte, the whole 16 bytes is allocated nonetheless. If you ask for 17 bytes, a block whose size is 32 bytes is allocated, etc...
There is also a (related) issue of alignment [2].
Quite a few allocators seem to be a combination of a size map and free lists - they split potential allocation sizes to "buckets" and keep a separate free list for each of them. Take a look at Doug Lea's malloc. There are many other allocation techniques with various tradeoffs but that goes beyond the scope here...
[1] Typically 8 or 16 bytes. If the allocator uses a free list then it must encode two pointers inside every free slot, so a free slot cannot be smaller than 8 bytes (on 32-bit) or 16 bytes (on 64-bit). For example, if the allocator tried to split an 8-byte slot to satisfy a 4-byte request, the remaining 4 bytes would not have enough room to encode the free-list pointers.
[2] For example, if long long on your platform is 8 bytes, then even if the allocator's internal data structures can handle blocks smaller than that, actually allocating the smaller block might push the next 8-byte allocation to an unaligned memory address.

Smaller per allocation overhead in allocators library

I'm currently writing a memory management library for C++ that is based around the concept of allocators. It's relatively simple; for now, all allocators implement these 2 member functions:
virtual void * alloc( std::size_t size ) = 0;
virtual void dealloc( void * ptr ) = 0;
As you can see, I do not support alignment in the interface but that's actually my next step :) and the reason why I'm asking this question.
I want the allocators to be responsible for alignment because each one can be specialized. For example, the block allocator can only return block-sized aligned memory so it can handle failure and return NULL if a different alignment is asked for.
Some of my allocators are in fact sub-allocators. For example, one of them is a linear/sequential allocator that just pointer-bumps on allocation. This allocator is constructed by passing in a char * pBegin and char * pEnd and it allocates from within that region in memory. For now, it works great but I get stuff that is 1-byte aligned. It works on x86 but I heard it can be disastrous on other CPUs (consoles?). It's also somewhat slower for reads and writes on x86.
The only sane way I know of implementing aligned memory management is to allocate an extra sizeof( void * ) + (alignment - 1) bytes and do pointer bit-masking to return the aligned address, while keeping the original allocated address in the bytes just before the user data (the void * bytes, see above).
OK, my question...
That overhead, per allocation, seems big to me. For 4-byte alignment, I would have 7 bytes of overhead on a 32-bit CPU and 11 bytes on a 64-bit one. That seems like a lot.
First, is it a lot? Am I on par with other memory management libs you might have used in the past or are currently using? I've looked into malloc and it seems to have a minimum of 16-byte overhead, is that right?
Do you know of a better way, smaller overhead, of returning aligned memory to my lib's users?
You could store an offset, rather than a pointer, which would only need to be large enough to store the largest supported alignment. A byte might even be sufficient if you only support smallish alignments.
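A minimal sketch of that byte-offset idea (function names are made up; it assumes power-of-two alignments no larger than 128 so the shift always fits in one byte):

#include <cstdint>
#include <cstdlib>

// The byte stored just before the returned pointer records how far the raw
// allocation was bumped, so the free routine can recover the original address.
void* aligned_alloc_sketch(std::size_t size, std::size_t alignment) {
    unsigned char* raw = static_cast<unsigned char*>(std::malloc(size + alignment));
    if (!raw) return nullptr;
    std::size_t shift = alignment - reinterpret_cast<std::uintptr_t>(raw) % alignment;
    raw[shift - 1] = static_cast<unsigned char>(shift);   // shift is in [1, alignment]
    return raw + shift;
}

void aligned_free_sketch(void* p) {
    unsigned char* user = static_cast<unsigned char*>(p);
    std::free(user - user[-1]);
}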
How about implementing a buddy system, which can be x-byte aligned depending on your requirements?
General Idea:
When your lib is initialized, allocate a big chunk of memory. For our example let's assume 16B. (Only this block needs to be aligned; the algorithm will not require you to align any other block.)
Maintain lists for memory chunks of power-of-two sizes, i.e. 4B, 8B, 16B, ..., 64KB, ..., 1MB, 2MB, ..., 512MB.
If a user asks for 8B of data, check the 8B list; if nothing is available, check the 16B list and split a block into two 8B blocks. Give one back to the user and append the other to the 8B list.
If the user asks for 16B, check whether you have at least two free 8B blocks. If yes, combine them and give the result back to the user. If not, the system does not have enough memory.
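A compressed sketch of the splitting half of this idea (coalescing buddies on free is omitted, and the class and member names are invented for illustration):

#include <cstddef>
#include <vector>

// free_[order] holds byte offsets of free blocks of size 2^order inside `arena_`.
class BuddySketch {
    std::vector<std::vector<std::size_t>> free_;
    char* arena_;
    std::size_t top_order_;
public:
    BuddySketch(char* arena, std::size_t top_order)
        : free_(top_order + 1), arena_(arena), top_order_(top_order) {
        free_[top_order].push_back(0);            // the whole arena starts out free
    }
    void* alloc(std::size_t order) {              // request a 2^order-byte block
        std::size_t o = order;
        while (o <= top_order_ && free_[o].empty())
            ++o;                                  // find the smallest free block that fits
        if (o > top_order_)
            return nullptr;                       // out of memory
        std::size_t off = free_[o].back();
        free_[o].pop_back();
        while (o > order) {                       // split down, keeping each upper buddy free
            --o;
            free_[o].push_back(off + (std::size_t(1) << o));
        }
        return arena_ + off;
    }
};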
Pros:
No internal or external fragmentation.
No alignment required.
Fast access to memory chunks as they are pre-allocated.
If the list is an array, direct access to memory chunks of different size
Cons:
Overhead for list of memory.
If the list is a linked-list, traversal would be slow.

Casting size_t to allow more elements in a std::vector

I need to store a huge number of elements in a std::vector (more than the 2^32-1 allowed by unsigned int) in 32 bits. As far as I know this quantity is limited by the std::size_t unsigned int type. May I change this std::size_t by casting to an unsigned long? Would it resolve the problem?
If that's not possible, suppose I compile in 64 bits. Would that solve the problem without any modification?
size_t is a type that can hold size of any allocable chunk of memory. It follows that you can't allocate more memory than what fits in your size_t and thus can't store more elements in any way.
Compiling in 64 bits will allow it, but realize that the array still needs to fit in memory. 2^32 is 4 billion, so you are going to go over 4 * sizeof(element) GiB of memory. More than 8 GiB of RAM is still rare, so that does not look reasonable.
I suggest replacing the vector with the one from STXXL. It uses external storage, so your vector is not limited by amount of RAM. The library claims to handle terabytes of data easily.
(edit) Pedantic note: size_t needs to hold the size of the maximal single object, not necessarily the size of all available memory. In segmented memory models it only needs to accommodate the offset when each object has to live in a single segment, but with different segments more memory may be accessible. It is even possible to use this on x86 with PAE, the "long" memory model. However, I've not seen anybody actually use it.
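As a quick check of the actual limits on your own platform, a small snippet like this prints them:

#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v;
    std::printf("sizeof(std::size_t)     = %zu bits\n", sizeof(std::size_t) * 8);
    std::printf("vector<int>::max_size() = %zu\n", v.max_size());
}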
There are a number of things to say.
First, about the size of std::size_t on 32-bit systems and 64-bit systems, respectively. This is what the standard says about std::size_t (§18.2/6,7):
6 The type size_t is an implementation-defined unsigned integer type that is large enough to contain the size in bytes of any object.
7 [ Note: It is recommended that implementations choose types for ptrdiff_t and size_t whose integer conversion ranks (4.13) are no greater than that of signed long int unless a larger size is necessary to contain all the possible values. — end note ]
From this it follows that std::size_t will be at least 32 bits in size on a 32-bit system, and at least 64 bits on a 64-bit system. It could be larger, but that would obviously not make any sense.
Second, about the idea of type casting: For this to work, even in theory, you would have to cast (or rather: redefine) the type inside the implementation of std::vector itself, wherever it occurs.
Third, when you say you need this super-large vector "in 32 bits", does that mean you want to use it on a 32-bit system? In that case, as the others have pointed out already, what you want is impossible, because a 32-bit system simply doesn't have that much memory.
But, fourth, if what you want is to run your program on a 64-bit machine, and use only a 32-bit data type to refer to the number of elements, but possibly a 64-bit type to refer to the total size in bytes, then std::size_t is not relevant because that is used to refer to the total number of elements, and the index of individual elements, but not the size in bytes.
Finally, if you are on a 64-bit system and want to use something of extreme proportions that works like a std::vector, that is certainly possible. Systems with 32 GB, 64 GB, or even 1 TB of main memory are perhaps not extremely common, but definitely available.
However, to implement such a data type, it is generally not a good idea to simply allocate gigabytes of memory in one contiguous block (which is what a std::vector does), because of reasons like the following:
Unless the total size of the vector is determined once and for all at initialization time, the vector will be resized, and quite likely re-allocated, possibly many times as you add elements. Re-allocating an extremely large vector can be a time-consuming operation. [ I have added this item as an edit to my original answer. ]
The OS will have difficulties providing such a large portion of unfragmented memory, as other processes running in parallel require memory, too. [Edit: As correctly pointed out in the comments, this isn't really an issue on any standard OS in use today.]
On very large servers you also have tens of CPUs and typically NUMA-type memory architectures, where it is clearly preferable to work with relatively smaller chunks of memory, and have multiple threads (possibly each running on a different core) access various chunks of the vector in parallel.
Conclusions
A) If you are on a 32-bit system and want to use a vector that large, using disk-based methods such as the one suggested by @JanHudec is the only thing that is feasible.
B) If you have access to a large 64-bit system with tens or hundreds of GB, you should look into an implementation that divides the entire memory area into chunks. Essentially something that works like a std::vector<std::vector<T>>, where each nested vector represents one chunk. If all chunks are full, you append a new chunk, etc. It is straightforward to implement an iterator type for this, too. Of course, if you want to optimize this further to take advantage of multi-threading and NUMA features, it will get increasingly complex, but that is unavoidable.
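A bare-bones sketch of that chunked layout (the chunk size and names are arbitrary choices of this sketch, and iterators are omitted):

#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-size chunks; operator[] splits a 64-bit index into chunk and offset.
template <typename T, std::size_t ChunkElems = (1u << 20)>
class ChunkedVector {
    std::vector<std::vector<T>> chunks_;
public:
    void push_back(const T& value) {
        if (chunks_.empty() || chunks_.back().size() == ChunkElems) {
            chunks_.emplace_back();
            chunks_.back().reserve(ChunkElems);   // each chunk is allocated once
        }
        chunks_.back().push_back(value);
    }
    T& operator[](std::uint64_t i) {
        return chunks_[i / ChunkElems][i % ChunkElems];
    }
    std::uint64_t size() const {
        return chunks_.empty() ? 0
             : (chunks_.size() - 1) * std::uint64_t(ChunkElems) + chunks_.back().size();
    }
};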
A vector might be the wrong data structure for you. It requires storage in a single block of memory, which is limited by the size of size_t. This you can increase by compiling for 64 bit systems, but then you can't run on 32 bit systems which might be a requirement.
If you don't need vector's particular characteristics (particularly O(1) lookup and contiguous memory layout), another structure such as a std::list might suit you, which has no size limits except what the computer can physically handle as it's a linked list instead of a conveniently-wrapped array.