How to avoid wasting memory on 64-bit pointers - c++

I'm hoping for some high-level advice on how to approach a design I'm about to undertake.
The straightforward approach to my problem will result in millions and millions of pointers. On a 64-bit system these will presumably be 64-bit pointers. But as far as my application is concerned, I don't think I need more than a 32-bit address space. I would still like for the system to take advantage of 64-bit processor arithmetic, however (assuming that is what I get by running on a 64-bit system).
Further background
I'm implementing a tree-like data structure where each "node" contains an 8-byte payload, but also needs pointers to four neighboring nodes (parent, left-child, middle-child, right-child). On a 64-bit system using 64-bit pointers, this amounts to 32 bytes just for linking an 8-byte payload into the tree -- a "linking overhead" of 400%.
The data structure will contain millions of these nodes, but my application will not need much memory beyond that, so all these 64-bit pointers seem wasteful. What to do? Is there a way to use 32-bit pointers on a 64-bit system?
I've considered
Storing the payloads in an array in a way such that an index implies (and is implied by) a "tree address" and neighbors of a given index can be calculated with simple arithmetic on that index. Unfortunately this requires me to size the array according to the maximum depth of the tree, which I don't know beforehand, and it would probably incur even greater memory overhead due to empty node elements in the lower levels because not all branches of the tree go to the same depth.
Storing nodes in an array large enough to hold them all, and then using indices instead of pointers to link neighbors. AFAIK the main disadvantage here would be that each node would need the array's base address in order to find its neighbors. So they either need to store it (a million times over) or it needs to be passed around with every function call. I don't like this.
Assuming that the most-significant 32 bits of all these pointers are zero, throwing an exception if they aren't, and storing only the least-significant 32 bits. So the required pointer can be reconstructed on demand. The system is likely to use more than 4GB, but the process will never. I'm just assuming that pointers are offset from a process base-address and have no idea how safe (if at all) this would be across the common platforms (Windows, Linux, OSX).
Storing the difference between the 64-bit this pointer and the 64-bit pointer to the neighbor, assuming that this difference will be within the range of int32_t (and throwing if it isn't). Then any node can find its neighbors by adding that offset to this.
Any advice? Regarding that last idea (which I currently feel is my best candidate) can I assume that in a process that uses less than 2GB, dynamically allocated objects will be within 2 GB of each other? Or not at all necessarily?

Combining ideas 2 and 4 from the question, put all the nodes into a big array, and store e.g. int32_t neighborOffset = neighborIndex - thisIndex. Then you can get the neighbor from *(this+neighborOffset). This gets rid of the disadvantages/assumptions of both 2 and 4.
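A minimal sketch of that combined layout (the names Node and neighbor are purely illustrative; 0 is used as the "no link" sentinel, since a node never links to itself):
#include <cstdint>

// All nodes live in one contiguous array; each link is a signed 32-bit
// element offset relative to the node itself, so no base address is needed.
struct Node
{
    std::uint64_t payload;
    std::int32_t parent;   // offset in elements; 0 means "no link" here
    std::int32_t left;
    std::int32_t middle;
    std::int32_t right;

    Node* neighbor(std::int32_t offset)
    {
        // Pointer arithmetic is valid because all nodes share one array.
        return offset ? this + offset : nullptr;
    }
};
This brings the per-node linking overhead down from 32 bytes to 16 for an 8-byte payload.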

If on Linux, you might consider using (and compiling for) the x32 ABI. IMHO, this is the preferred solution for your issues.
Alternatively, don't use pointers, but indexes into a huge array (or a std::vector in C++) which could be a global or static variable. Manage a single huge heap-allocated array of nodes, and use indexes of nodes instead of pointers to nodes. So like your §2, but since the array is global or static data you won't need to pass it everywhere.
(I guess that an optimizing compiler would be able to generate clever code, which could be nearly as efficient as using pointers)
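A sketch of that arrangement, assuming a single global pool (g_nodes, kNull, and addNode are illustrative names, not anything standard):
#include <cstdint>
#include <vector>

using NodeId = std::uint32_t;
constexpr NodeId kNull = 0xFFFFFFFFu;   // sentinel for "no link"

struct Node
{
    std::uint64_t payload;
    NodeId parent, left, middle, right;
};

std::vector<Node> g_nodes;              // global, so it never has to be passed around

inline Node& at(NodeId id) { return g_nodes[id]; }

NodeId addNode(std::uint64_t payload, NodeId parent)
{
    g_nodes.push_back(Node{payload, parent, kNull, kNull, kNull});
    return static_cast<NodeId>(g_nodes.size() - 1);
}
Growing the vector may move the nodes in memory, but the 32-bit indices stay valid, which is exactly why indices beat raw pointers here.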

You can remove the disadvantage of (2) by exploiting the alignment of memory regions to find the base address of the array "automatically". For example, if you want to support up to 4 GB of nodes, ensure your node array starts at a 4 GB boundary.
Then within a node with address addr, you can determine the address of another node at index as (addr & -(1ULL << 32)) + index.
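For illustration, the same recovery written out with explicit casts, assuming the stored index is treated as a byte offset from the 4 GiB-aligned base (an untested sketch, not a drop-in API):
#include <cstdint>

// Recover the array base from any node's own address by clearing the low
// 32 bits, then add the stored byte offset.
inline void* nodeAt(const void* anyNode, std::uint32_t byteOffset)
{
    std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(anyNode);
    std::uintptr_t base = addr & ~((std::uintptr_t{1} << 32) - 1);  // clear low 32 bits
    return reinterpret_cast<char*>(base) + byteOffset;
}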
This is kind of the "absolute" variant of the accepted solution which is "relative". One advantage of this solution is that an index always has the same meaning within a tree, whereas in the relative solution you really need the (node_address, index) pair to interpret an index (of course, you can also use the absolute indexes in the relative scenarios where it is useful). It means that when you duplicate a node, you don't need to adjust any index values it contains.
The "relative" solution also loses 1 effective index bit relative to this solution in its index since it needs to store a signed offset, so with a 32-bit index, you could only support 2^31 nodes (assuming full compression of trailing zero bits, otherwise it is only 2^31 bytes of nodes).
You can also store the base tree structure (e.g., the pointer to the root and whatever bookkeeping you have outside of the nodes themselves) right at the 4GB address, which means that any node can jump to the associated base structure without traversing all the parent pointers or whatever.
Finally, you can also exploit this alignment idea within the tree itself to "implicitly" store other pointers. For example, perhaps the parent node is stored at an N-byte aligned boundary, and then all children are stored in the same N-byte block so they know their parent "implicitly". How feasible that is depends on how dynamic your tree is, how much the fan-out varies, etc.
You can accomplish this kind of thing by writing your own allocator that uses mmap to allocate suitably aligned blocks (usually just reserve a huge amount of virtual address space and then allocate blocks of it as needed) - either via the hint parameter or just by reserving a big enough region that you are guaranteed to get the alignment you want somewhere in the region. The need to mess around with allocators is the primary disadvantage compared to the accepted solution, but if this is the main data structure in your program it might be worth it. When you control the allocator you have other advantages too: if you know all your nodes are allocated on a 2^N-byte boundary you can "compress" your indexes further, since you know the low N bits will always be zero; so with a 32-bit index you could actually address 2^(32+5) = 2^37 bytes of nodes if you knew they were 32-byte aligned.
These kinds of tricks are really only feasible in 64-bit programs, with the huge amount of virtual address space available, so in a way 64-bit giveth and also taketh away.

Your assertion that a 64-bit system necessarily has to have 64-bit pointers is not correct. The C++ standard makes no such assertion.
In fact, different pointer types can be different sizes: sizeof(double*) might not be the same as sizeof(int*).
Short answer: don't make any assumptions about the sizes of any C++ pointer.
It sounds to me like you want to build your own memory management framework.

Related

Allocate small struct to 32 bit aligned in 64 bit system

The problem: I'm implementing a non-blocking data structure, where threads alter a shared pointer using a CAS operation. As pointers can be recycled, we have the ABA issue. To avoid this, I want to attach a version to each pointer. This is called a versioned pointer. A CAS128 is considered more expensive than a CAS64, so I'm trying to avoid going above 64 bits.
I'm trying to implement a versioned pointer. In a 32b system, the versioned pointer is a 64b struct, where the top 32 bits are the pointer and the bottom 32 is its version. This allows me to use CAS64 to atomically alter the pointer.
I'm having issues with a 64b system. In this case, I still want to use CAS64 instead of CAS128, so I'm trying to allocate a pointer aligned to 4GB (i.e., 32 zeros). I can then use masks to infer the pointer and version.
The solutions I've tried use aligned_malloc, padding, and std::align, but these involve allocating very large amounts of memory, e.g., aligned_malloc(1LL << 32, (1LL << 32) * sizeof(void*)) allocates 4GB of memory. Another solution is using a memory mapped file, but this involves synchronization that we're trying to avoid.
Is there a way to allocate 8B of memory aligned to 4GB that I'm missing?
First off, a non-portable solution that limits the code-complexity creep to the point of allocation (see below for another approach that makes the point of use more complicated, but should be portable). It only works on POSIX systems (not Windows), and your overhead becomes the size of a page rather than 8 bytes; in the context of a 64-bit system, wasting 4088 bytes isn't too bad if you're not doing it too often, and the nature of your problem means you can't possibly waste more than sysconf(_SC_PAGESIZE) - 8 bytes per 4 GB. The mechanism:
1) mmap 4 GB of memory anonymously (not file-backed; pass an fd of -1 and include the MAP_ANONYMOUS flag)
2) Compute the address of the 4 GB aligned pointer within that block
3) munmap the memory preceding that address, and the memory beginning sysconf(_SC_PAGE_SIZE) bytes after that address
This works because memory mappings aren't monolithic; they can be unmapped piecemeal, individual pages can be remapped without error, etc.
Note that if you're short on swap space, the brief request for 4 GB might cause problems (e.g. on a Linux system with heuristic overcommit disabled, it might fail to allocate the memory if it can't back it with swap, even though you never use most of it). You can experiment with passing MAP_NORESERVE to the original request, then performing the unmapping, then remapping that single page with MAP_FIXED (without MAP_NORESERVE) to ensure the allocation can be used without triggering a SIGSEGV when you write to it.
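A rough, untested sketch of those three steps (error handling trimmed to returning nullptr, and without the MAP_NORESERVE/MAP_FIXED refinement just mentioned):
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>
#include <cstdint>

// Map 4 GiB anonymously, find the 4 GiB-aligned page inside it, and unmap
// everything else, keeping a single page at the aligned address.
void* page_aligned_to_4gb()
{
    const std::size_t kAlign = std::size_t{1} << 32;   // 4 GiB
    const std::size_t kPage  = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));

    void* raw = mmap(nullptr, kAlign, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (raw == MAP_FAILED)
        return nullptr;

    const std::uintptr_t start   = reinterpret_cast<std::uintptr_t>(raw);
    const std::uintptr_t aligned = (start + kAlign - 1) & ~(kAlign - 1);

    if (aligned > start)                                // unmap the leading slack
        munmap(raw, aligned - start);
    const std::size_t used = (aligned - start) + kPage;
    if (used < kAlign)                                  // unmap the trailing slack
        munmap(reinterpret_cast<void*>(aligned + kPage), kAlign - used);

    return reinterpret_cast<void*>(aligned);            // one 4 GiB-aligned page
}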
If you can't use POSIX mmap, should it really be impossible to use CAS128, you may want to consider a segmented memory model like the old x86 scheme for these pointers. You block allocate 4 GB segments (they don't need any special alignment) up front, and have your "pointers" be 32 bit offsets from the base address of the segment; you can't use the whole 64 bit address space (unless you allow for multiple selectors, possibly by repurposing part of the version number field for example; you can probably make do with a few million versions rather than four billion after all), but if you don't need to do so, this lets you have a base address that never changes after allocation (so no atomics needed), with offsets that fit within your desired 32 bit field. So instead of getting your data via:
data = *mystruct.pointer;
you have a segment pointer like this initialized early:
char *const base_address = new char[1ULL << 32]; // Or use smart pointer of your choosing
wrap it in a suballocator to partition the space, and now lookup is instead:
data = *reinterpret_cast<REAL_TYPE_HERE*>(&base_address[mystruct.pointer]);
I'm sure there are nifty ways to wrap this up better with custom allocators, custom operator news, what have you, but I've never had to do this in C++ (I've done similar magic in C, where there are no facilities to make it "pretty"), and I'd probably get it wrong, so I'll leave that as an exercise.
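As one possible shape for that exercise, here is a rough, untested sketch of a segment-based versioned "pointer" that fits in 64 bits (the names are illustrative, not any established API):
#include <atomic>
#include <cstdint>

// One 4 GiB segment allocated once up front, as in the answer above.
// "Pointers" are 32-bit offsets from its base; the offset + version pair
// packs into a single 64-bit word so a plain 64-bit CAS is enough.
char* const g_segment = new char[std::uint64_t{1} << 32];

struct VersionedRef
{
    std::uint64_t bits;   // offset in the high 32 bits, version in the low 32

    static VersionedRef make(std::uint32_t offset, std::uint32_t version)
    {
        return VersionedRef{(std::uint64_t{offset} << 32) | version};
    }
    std::uint32_t offset()  const { return static_cast<std::uint32_t>(bits >> 32); }
    std::uint32_t version() const { return static_cast<std::uint32_t>(bits); }

    template <typename T>
    T* get() const { return reinterpret_cast<T*>(g_segment + offset()); }
};

// Bump the version on every swap so a recycled offset never matches a stale
// expectation -- the ABA guard the question is after.
bool try_replace(std::atomic<std::uint64_t>& slot, VersionedRef expected, std::uint32_t newOffset)
{
    VersionedRef desired = VersionedRef::make(newOffset, expected.version() + 1);
    std::uint64_t exp = expected.bits;
    return slot.compare_exchange_strong(exp, desired.bits);
}
Because the base address of the segment never changes, only the packed 64-bit word needs to be atomic, which is what keeps CAS64 sufficient.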

Emulate memory-mapping of a game console, access different locations based on the address provided

I am implementing an emulator for an old game console, mostly for learning purposes.
This console maps roms, and a lot of other things, to regions within its address space. Certain locations are also mirrored so that multiple addresses can correspond to the same physical location. I would like to emulate this, but I am not sure what would be a good approach to do so (and also have no idea what this process is called, hence this somewhat generic question).
One thing that does work is a simple, unordered map. Have it contain absolute addresses and the corresponding pointers to my data structures. This way, I can easily map everything I need into the system's address space. The problem with this approach is that it's obviously a memory hog. Even with small roms, I end up with close to ten million entries, thanks to the aforementioned mirroring. Surely, this can't be the right thing to do?
Any help is much appreciated.
Edit:
To provide some details as to how exactly I am doing this:
The system in question is, of course, the SNES. Using this wiki as my primary resource, I implemented what I mentioned above as follows:
Create a std::unordered_map<uint32_t, uint8_t*> mMemoryMap;
Check whether the rom is LoRom or HiRom
For each byte in the rom
Calculate the address where it should be mapped and emplace both the address and a pointer to said byte in the map
If the section needs to be mirrored somewhere else, repeat the above
This will be applied to anything else I need to make available, such as video- or system-memory
If I now want to access anything within the address space, I can simply use the address the system would use internally.
I'm assuming that for contiguous addresses the physical locations are also contiguous, within certain blocks of memory or "chunks". That is, if the address 0x0000 maps to 0xFF00, then 0x0004 maps to 0xFF04.
If they work like that, then you can make a list that contains the information of those chunks. Say:
struct Chunk
{
    int addressStart, memoryStart, size;
};
The chunks may be ordered by the addressStart, so you can find out the correct chunk you would need for any address. This requires you to iterate the list, but if you have only a few chunks this may be acceptable.
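A sketch of that lookup, using fixed-width types and with 'memory' standing in for whatever flat byte array backs the emulated system (illustrative, not a complete memory bus):
#include <cstdint>
#include <vector>

struct Chunk
{
    std::uint32_t addressStart;  // first emulated address covered
    std::uint32_t memoryStart;   // corresponding offset into the backing array
    std::uint32_t size;          // length of the contiguous region
};

// Linear scan over a small, addressStart-ordered chunk list.
std::uint8_t* translate(std::uint32_t address,
                        const std::vector<Chunk>& chunks,
                        std::vector<std::uint8_t>& memory)
{
    for (const Chunk& c : chunks)
    {
        if (address >= c.addressStart && address < c.addressStart + c.size)
            return &memory[c.memoryStart + (address - c.addressStart)];
    }
    return nullptr;  // unmapped address
}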
Rather than use simple maps (which even with ranges can grow to large sizes) you can instead use a more intelligent map.
For instance if the console maps 0x10XXXX through 0x1FXXXX all to the same 0x20XXXX, you can design a structure which holds that repetition (start 0x100000, end 0x1FFFFF, repeat 0x010000, although you may want to use a bitmask rather than a repeat count).
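For that example, the mask version might look like this (the constants are illustrative, taken from the mapping just described):
#include <cstdint>

// Fold any address in 0x10XXXX..0x1FXXXX onto the single 0x20XXXX bank before
// the real lookup, so the mirrors never need their own map entries.
std::uint32_t fold_mirror(std::uint32_t address)
{
    if (address >= 0x100000 && address <= 0x1FFFFF)
        return 0x200000 | (address & 0x00FFFF);  // keep the offset, swap the bank
    return address;
}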
I'm currently in the same boat as you and doing an NES emulator for learning purposes. The complete memory map is declared as an array of pointers to bytes. Each pointer can point into another array, which allows me to have pointers to the same data in the case of mirroring.
byte *memoryMap[cpuMemSize];
I iterate over the addresses that repeat and map them to point to the bytes in the following array. The ram array is the memory that gets mapped 4 times across the CPU's memory map.
byte ram[ramSize];
The following code goes through the RAM array and maps it across the CPU memory map 4 times.
// map RAM 4 times to emulate mirroring
mem_address memAddr = 0;
for (int x = 0; x < 4; ++x)
{
    for (mem_address ramAddr = 0; ramAddr < ramSize; ++ramAddr.value)
    {
        memoryMap[memAddr.value++] = &ram[ramAddr.value];
    }
}
You can then write a value to the 256th byte using something like this, which would of course be propagated to the other parts of the memory map since they point to the same byte in memory:
*memoryMap[0x00ff] = 10;
I haven't really tested this and want to do some more testing with regard to CPU cache use and performance. I was clearly out searching for other ways of doing this when I stumbled on your question and figured I'd put in my (unverified) two cents. Hope this makes sense.

What's the real memory cost for std::set<MyClass>?

Suppose sizeof(MyClass) is N bytes, and I put M objects of MyClass into a std::set<MyClass>. The theoretical memory cost is N*M bytes; however, as far as I know std::set is implemented using a tree structure, so I reckon there must be extra memory cost. How can I estimate an approximate and reasonable memory cost for a set? When I run my program, the memory cost is higher than what I expected.
As mentioned by @SirDarius, this is implementation dependent and therefore you would need to check on a per-implementation basis.
Usually, std::set is implemented as a red-black tree with lean iterators. This means that for each element in the set, there is one node, and this node roughly contains:
3 pointers: 1 for the parent node, 1 for the left-side child and one for the right-side child
a tag: red or black
the actual user type
The minimal footprint of such a node on a 64-bit platform is therefore footprint(node):
3*8 bytes for the pointers
sizeof(MyClass), rounded up to a multiple of 8 bytes (for alignment reasons: look up "struct padding")
Note: it is easy to use an unused bit somewhere to stash the red/black flag at no additional cost, for example taking into account the fact that given node alignments the least-significant bits of the pointers are always 0, or inserting in MyClass padding if any.
However, when you allocate an object T via new (which is what std::allocator does under the hood, and which itself uses malloc unless overloaded), you do not allocate just sizeof(T) bytes:
a typical implementation of malloc uses size buckets, meaning that if you have 8 bytes buckets and 16 bytes buckets when you allocate a 9 bytes object it goes into a 16 bytes bucket (wasting 7 bytes)
furthermore, the implementation also has some overhead to track those allocated blocks of memory (to know not to reuse that memory until it's freed); though minimal, the overhead can be non-negligible
finally, the OS itself has some overhead for each page of memory that the program requests; although that is probably negligible
Therefore, indeed, the use of a set will most likely consume more than just set.size() * sizeof(MyClass), though the overhead depends on your Standard Library implementation. This overhead will likely be large for small objects (such as int32_t: on a 64-bit platform you are looking at a +500% overhead at the very least) because the cost of linking the nodes together is usually fixed, whereas for large objects it might not be noticeable.
In order to reduce the overhead, you traditionally use arrays. This is what a B-Tree does for example, though it does so at the cost of stability (ie, each element staying at a fixed memory address regardless of what happens to the others).
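If you would rather measure than estimate, a pass-through counting allocator makes the per-node request size of your particular Standard Library visible. This is only an illustrative sketch (g_allocated is a plain global for the demo), and it does not capture the malloc-level overhead described above:
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <set>

std::size_t g_allocated = 0;   // bytes currently requested by the container

template <typename T>
struct CountingAlloc
{
    using value_type = T;
    CountingAlloc() = default;
    template <typename U> CountingAlloc(const CountingAlloc<U>&) {}

    T* allocate(std::size_t n)
    {
        g_allocated += n * sizeof(T);
        return static_cast<T*>(::operator new(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n)
    {
        g_allocated -= n * sizeof(T);
        ::operator delete(p);
    }
};
template <typename T, typename U>
bool operator==(const CountingAlloc<T>&, const CountingAlloc<U>&) { return true; }
template <typename T, typename U>
bool operator!=(const CountingAlloc<T>&, const CountingAlloc<U>&) { return false; }

int main()
{
    std::set<std::int32_t, std::less<std::int32_t>, CountingAlloc<std::int32_t>> s;
    for (int i = 0; i < 1000; ++i)
        s.insert(i);
    // Reveals the real node size (payload + pointers + color + padding).
    std::cout << "bytes requested per element: " << g_allocated / s.size() << '\n';
}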
The ordered associative containers have constraints which pretty much require that they are implemented as balanced trees. Most likely the nodes in each tree will have two pointers to the children and a pointer to the parent node. Depending on whether the implementation plays bit tricks to store additional information in the pointers or not, there may be an extra word in the allocated nodes. Further, the memory management system may store a word for each allocated chunk of memory and align it, e.g., to a 16 byte boundary.
What the implementation actually does is certainly not specified in the standard but I'd think the minimum memory consumption for M elements of size N is
M * (N + 2 * sizeof(void*) + sizeof(int))
It is probably somewhat more: I would probably use a pointer to the parent, too.
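As a quick worked instance of that formula (hedged: it assumes 8-byte pointers and a 4-byte int, and ignores the parent pointer and padding just mentioned):
#include <cstddef>

// Lower bound from the formula above. For M = 1,000,000 elements with an
// 8-byte payload this gives 1,000,000 * (8 + 16 + 4) = 28,000,000 bytes (~28 MB),
// before parent pointers, alignment padding, and allocator overhead.
constexpr std::size_t lower_bound_bytes(std::size_t M, std::size_t N)
{
    return M * (N + 2 * sizeof(void*) + sizeof(int));
}
static_assert(lower_bound_bytes(1000000, 8) == 28000000,
              "assumes 8-byte pointers and a 4-byte int");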

Casting size_t to allow more elements in a std::vector

I need to store a huge number of elements in a std::vector (more than the 2^32-1 allowed by unsigned int) in 32 bits. As far as I know this quantity is limited by the std::size_t unsigned int type. May I change this std::size_t by casting to an unsigned long? Would it resolve the problem?
If that's not possible, suppose I compile in 64 bits. Would that solve the problem without any modification?
size_t is a type that can hold size of any allocable chunk of memory. It follows that you can't allocate more memory than what fits in your size_t and thus can't store more elements in any way.
Compiling in 64-bits will allow it, but realize that the array still needs to fit in memory. 2^32 is 4 billion, so you are going to go over 4 * sizeof(element) GiB of memory. More than 8 GiB of RAM is still rare, so that does not look reasonable.
I suggest replacing the vector with the one from STXXL. It uses external storage, so your vector is not limited by amount of RAM. The library claims to handle terabytes of data easily.
(edit) Pedantic note: size_t needs to hold size of maximal single object, not necessarily size of all available memory. In segmented memory models it only needs to accommodate the offset when each object has to live in single segment, but with different segments more memory may be accessible. It is even possible to use it on x86 with PAE, the "long" memory model. However I've not seen anybody actually use it.
There are a number of things to say.
First, about the size of std::size_t on 32-bit systems and 64-bit systems, respectively. This is what the standard says about std::size_t (§18.2/6,7):
6 The type size_t is an implementation-defined unsigned integer type that is large enough to contain the size in bytes of any object.
7 [ Note: It is recommended that implementations choose types for ptrdiff_t and size_t whose integer conversion ranks (4.13) are no greater than that of signed long int unless a larger size is necessary to contain all the possible values. — end note ]
From this it follows that std::size_t will be at least 32 bits in size on a 32-bit system, and at least 64 bits on a 64-bit system. It could be larger, but that would obviously not make any sense.
Second, about the idea of type casting: For this to work, even in theory, you would have to cast (or rather: redefine) the type inside the implementation of std::vector itself, wherever it occurs.
Third, when you say you need this super-large vector "in 32 bits", does that mean you want to use it on a 32-bit system? In that case, as the others have pointed out already, what you want is impossible, because a 32-bit system simply doesn't have that much memory.
But, fourth, if what you want is to run your program on a 64-bit machine, and use only a 32-bit data type to refer to the number of elements, but possibly a 64-bit type to refer to the total size in bytes, then std::size_t is not relevant because that is used to refer to the total number of elements, and the index of individual elements, but not the size in bytes.
Finally, if you are on a 64-bit system and want to use something of extreme proportions that works like a std::vector, that is certainly possible. Systems with 32 GB, 64 GB, or even 1 TB of main memory are perhaps not extremely common, but definitely available.
However, to implement such a data type, it is generally not a good idea to simply allocate gigabytes of memory in one contiguous block (which is what a std::vector does), because of reasons like the following:
Unless the total size of the vector is determined once and for all at initialization time, the vector will be resized, and quite likely re-allocated, possibly many times as you add elements. Re-allocating an extremely large vector can be a time-consuming operation. [ I have added this item as an edit to my original answer. ]
The OS will have difficulties providing such a large portion of unfragmented memory, as other processes running in parallel require memory, too. [Edit: As correctly pointed out in the comments, this isn't really an issue on any standard OS in use today.]
On very large servers you also have tens of CPUs and typically NUMA-type memory architectures, where it is clearly preferable to work with relatively smaller chunks of memory, and have multiple threads (possibly each running on a different core) access various chunks of the vector in parallel.
Conclusions
A) If you are on a 32-bit system and want to use a vector that large, using disk-based methods such as the one suggested by @JanHudec is the only thing that is feasible.
B) If you have access to a large 64-bit system with tens or hundreds of GB, you should look into an implementation that divides the entire memory area into chunks. Essentially something that works like a std::vector<std::vector<T>>, where each nested vector represents one chunk. If all chunks are full, you append a new chunk, etc. It is straightforward to implement an iterator type for this, too. Of course, if you want to optimize this further to take advantage of multi-threading and NUMA features, it will get increasingly complex, but that is unavoidable.
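A bare-bones sketch of what such a chunked container could look like (ChunkedVector is an illustrative name, not an existing library type; iterators, erase, etc. are omitted):
#include <cstddef>
#include <cstdint>
#include <vector>

// A deque-like container built from fixed-size chunks, indexed with a 64-bit
// count so it can exceed 2^32 elements without one giant contiguous block.
template <typename T, std::size_t ChunkSize = (1u << 20)>
class ChunkedVector
{
public:
    void push_back(const T& value)
    {
        if (chunks_.empty() || chunks_.back().size() == ChunkSize)
        {
            chunks_.emplace_back();
            chunks_.back().reserve(ChunkSize);   // each chunk allocated exactly once
        }
        chunks_.back().push_back(value);
        ++size_;
    }
    T& operator[](std::uint64_t i)
    {
        return chunks_[i / ChunkSize][i % ChunkSize];
    }
    std::uint64_t size() const { return size_; }

private:
    std::vector<std::vector<T>> chunks_;
    std::uint64_t size_ = 0;
};
Because each chunk is allocated exactly once, growth never re-copies earlier elements the way a plain std::vector reallocation does.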
A vector might be the wrong data structure for you. It requires storage in a single block of memory, which is limited by the size of size_t. This you can increase by compiling for 64 bit systems, but then you can't run on 32 bit systems which might be a requirement.
If you don't need vector's particular characteristics (particularly O(1) lookup and contiguous memory layout), another structure such as a std::list might suit you, which has no size limits except what the computer can physically handle as it's a linked list instead of a conveniently-wrapped array.

Tagging/Encoding Pointers

I need a way to tag a pointer as being either part of set x or part of set y (i.e., the tag has only 2 'states'); that means one can assume untagged = x and tagged = y.
Currently I'm looking at using bitwise xor to do this:
ptr ^ magic = encoded_ptr
encoded_ptr ^ magic = ptr
but I'm stumped at how to determine if the pointer is tagged in the first place.
I'm using this to mark what pools nodes in a linked list come from, so that when they are delinked, they can go back to the correct parents.
Update
Just to make it clear to all those people suggesting to store the flag in extra data members, I'm limited to sizeof(void*), so I can't add new members, else I would have. Also the pools aren't contiguous, they consist of many pages, tracking the ranges would add too much overhead (I'm after a fast & simple solution, if one can call it that).
Most solutions will be platform-specific. Here are a few of them:
1) A pointer returned by malloc or new will be aligned (4, 8, 16, 32 bytes, you name it). So, on most architectures, several of the least-significant bits of the address will always be 0.
2) And a Win32 specific way: unless your program uses 3GB switch, values of all usermode pointers are less than 0x80000000, so the highest bit can be used as flag. As bonus, it will also crash when the flagged pointer is dereferenced without being repaired.
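A small sketch of option 1, using the least-significant bit as the tag (illustrative helper names; it assumes the stored value is only dereferenced after untag, and that the pointed-to type is at least 2-byte aligned):
#include <cstdint>

// Untagged (bit clear) means set x, tagged (bit set) means set y.
inline void* tag_as_y(void* p)
{
    return reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(p) | 1u);
}
inline bool is_in_y(const void* stored)
{
    return (reinterpret_cast<std::uintptr_t>(stored) & 1u) != 0;
}
inline void* untag(const void* stored)
{
    return reinterpret_cast<void*>(reinterpret_cast<std::uintptr_t>(stored) & ~std::uintptr_t{1});
}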
There is no safe and portable way to make that sort of thing work. You might be able to find some system-specific bits that are always a known value (say, the most significant n bits), but that's an extremely fragile and dangerous thing to rely on. You can't tell whether a pointer is "marked" or not unless some of the bits in the pointer have known values in the first place.
A much better way to do this is to store an identifier in the structure the pointer points to.
Surely if you only have two pools, when you allocate memory for each pool you know the possible address range - so why not check whether your given pointer occurs in one or the other address range with simple pointer arithmetic?
If performance is not a big issue, two std::set's can be used.
If it's important to get this information quickly, and it's acceptable to use only 2-byte aligned pointers, the lowest bit can be used to store this information. But having "hacked" pointers may appear to be quite error-prone...
You might have ptr1 ^ magic = ptr2 with ptr1 in set X and ptr2 in set Y (unless you prove otherwise). Since (I guess) you don't have control on the pointers addresses you are given, your technique seems to be inadequate.
An alternative to Vinay's solution is to store the tags as bits in a pre-allocated buffer (especially easy if the size of the list is bounded, since you don't have to grow or shrink the buffer). This is a very compact and efficient solution that does not require modifying the pointed-to data structure.
Cheers,
-stan