Random-access container that does not fit in memory? - c++

I have an array of objects (say, images), which is too large to fit into memory (e.g. 40GB). But my code needs to be able to randomly access these objects at runtime.
What is the best way to do this?
From my code's point of view, it shouldn't matter, of course, if some of the data is on disk or temporarily stored in memory; it should have transparent access:
container.getObject(1242)->process();
container.getObject(479431)->process();
But how should I implement this container? Should it just send the requests to a database? If so, which one would be the best option? (If a database, then it should be free and not too much administration hassle, maybe Berkeley DB or sqlite?)
Should I just implement it myself, memoizing objects after access and purging the memory when it's full? Or are there good C++ libraries for this out there?
The requirements for the container would be that it minimizes disk access (some elements might be accessed more frequently by my code, so they should be kept in memory) and allows fast access.
UPDATE: It turns out that STXXL does not work for my problem, because the objects I store in the container have dynamic size, i.e. my code may update them (increasing or decreasing the size of some objects) at runtime. But STXXL cannot handle that:
STXXL containers assume that the data types they store are plain old data types (POD).
http://algo2.iti.kit.edu/dementiev/stxxl/report/node8.html
Could you please comment on other solutions? What about using a database? And which one?

Consider using the STXXL:
The core of STXXL is an implementation of the C++ standard template library STL for external memory (out-of-core) computations, i.e., STXXL implements containers and algorithms that can process huge volumes of data that only fit on disks. While the compatibility to the STL supports ease of use and compatibility with existing applications, another design priority is high performance.

You could look into memory-mapped files, and then access the data through one of those.
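A minimal sketch of that idea on POSIX, assuming fixed-size objects packed into a hypothetical images.bin file; the OS pages data in and out on demand, so access looks like plain memory:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

struct Image { unsigned char pixels[4096]; };  // assumed fixed-size object

int main() {
    int fd = open("images.bin", O_RDONLY);     // hypothetical backing file
    if (fd < 0) return 1;

    struct stat st;
    fstat(fd, &st);

    // Map the whole file; the kernel loads pages lazily on first touch.
    void* base = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (base == MAP_FAILED) return 1;

    const Image* images = static_cast<const Image*>(base);
    const Image& img = images[1242];           // random access, no explicit I/O
    (void)img;

    munmap(base, st.st_size);
    close(fd);
}
```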

I would implement a basic cache. With this working-set size you will have the best results with a set-associative cache with x-byte cache lines (x == whatever best matches your access pattern). Just implement in software what every modern processor already has in hardware. This should, imho, give you the best results. You could then optimize it further if you can make the access pattern somehow linear.
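A rough sketch of such a software set-associative cache, with LRU replacement within each set; loadLineFromDisk() is a hypothetical helper, and write-back of dirty lines is omitted:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLineBytes = 4096;  // "x bytes" per line; tune to the access pattern
constexpr std::size_t kSets      = 1024;
constexpr std::size_t kWays      = 4;     // 4-way set-associative

// Hypothetical helper: reads one cache line's worth of bytes from the backing store.
void loadLineFromDisk(std::uint64_t lineIndex, unsigned char* out);

struct Line {
    std::uint64_t tag     = ~0ull;  // which disk line currently occupies this way
    std::uint64_t lastUse = 0;      // logical timestamp for LRU within the set
    std::array<unsigned char, kLineBytes> data{};
};

class SetAssociativeCache {
public:
    // Returns a pointer to the cached byte at 'offset' in the backing store,
    // loading and evicting lines as needed (read-only: no write-back shown).
    unsigned char* access(std::uint64_t offset) {
        std::uint64_t line = offset / kLineBytes;
        std::uint64_t set  = line % kSets;
        std::uint64_t tag  = line / kSets;
        Line* victim = &lines_[set * kWays];
        for (std::size_t w = 0; w < kWays; ++w) {
            Line& l = lines_[set * kWays + w];
            if (l.tag == tag) {                        // hit: refresh timestamp
                l.lastUse = ++clock_;
                return l.data.data() + offset % kLineBytes;
            }
            if (l.lastUse < victim->lastUse) victim = &l;
        }
        loadLineFromDisk(line, victim->data.data());   // miss: evict LRU way
        victim->tag = tag;
        victim->lastUse = ++clock_;
        return victim->data.data() + offset % kLineBytes;
    }

private:
    std::vector<Line> lines_ = std::vector<Line>(kSets * kWays);
    std::uint64_t clock_ = 0;
};
```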

One solution is to use a structure similar to a B-Tree, with indices and "pages" of arrays or vectors. The concept is that the index is used to determine which page to load into memory to access your variable.
If you make the page size smaller, you can store multiple pages in memory. A caching system based on frequency of use or another rule will reduce the number of page loads.
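A compact sketch of that page-index scheme, assuming a hypothetical readPage() that fetches one page of objects from disk; eviction is simple FIFO here for brevity:

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <vector>

struct Object { /* ... */ };
constexpr std::size_t kPageSize = 256;            // objects per page (tunable)

std::vector<Object> readPage(std::size_t pageNo); // assumed disk reader

class PagedContainer {
public:
    explicit PagedContainer(std::size_t maxPages) : maxPages_(maxPages) {}

    Object& get(std::size_t index) {
        std::size_t pageNo = index / kPageSize;   // index -> page mapping
        auto it = pages_.find(pageNo);
        if (it == pages_.end()) {
            if (pages_.size() == maxPages_) {     // evict oldest-loaded page (FIFO)
                pages_.erase(loaded_.back());
                loaded_.pop_back();
            }
            it = pages_.emplace(pageNo, readPage(pageNo)).first;
            loaded_.push_front(pageNo);
        }
        return it->second[index % kPageSize];
    }

private:
    std::size_t maxPages_;
    std::unordered_map<std::size_t, std::vector<Object>> pages_;
    std::list<std::size_t> loaded_;               // front = most recently loaded
};
```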

I've seen some very clever code that overloads operator[]() to perform disk access on the fly, loading the required data from disk/database transparently.
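In outline, such an overload can look like this; loadFromDisk() is a hypothetical helper standing in for the real disk/database read:

```cpp
#include <cstddef>
#include <memory>
#include <unordered_map>

struct Object { void process(); };

std::unique_ptr<Object> loadFromDisk(std::size_t index);  // assumed

class DiskBackedArray {
public:
    Object& operator[](std::size_t index) {
        auto& slot = cache_[index];
        if (!slot) slot = loadFromDisk(index);  // fault in on first access
        return *slot;
    }

private:
    std::unordered_map<std::size_t, std::unique_ptr<Object>> cache_;
};

// Usage looks like an ordinary in-memory array:
//   DiskBackedArray container;
//   container[1242].process();
```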

Related

Container Data Structure similar to std::vector and std::list

I am currently developing (at the design stage) a kinda small compiler which uses a custom IR. The problem I am having is choosing an efficient container data structure to hold all the instructions.
A basic block will contain ~10000 instructions, and an instruction will be ~250 bytes big.
I thought about using a list, because the compiler will perform lots of complex transformations (i.e. lots of random insertions/removals), so a container data structure which does not invalidate iterators would be good, as it would keep the transformation algorithms simple and easy to follow.
On the other hand, it would mean a loss of performance because of the known problems with cache misses and memory fragmentation. An std::vector would help here, but I imagine it would be a pain to implement transformations with a vector.
So the question is whether there is another data structure which has low memory fragmentation (to reduce cache misses) and does not invalidate iterators.
Or whether I should ignore this and keep using a list.
Start with using Container = std::vector<Instruction>. std::vector is a pretty good default container. Once performance becomes an issue, profile the program with a couple of different containers. You should be able to swap out the Container without needing much change in the rest of the code. I imagine some kind of array-list would be best, but you should probably check.
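For instance, a hypothetical alias along these lines keeps the transformation code independent of the container choice:

```cpp
#include <vector>

struct Instruction { /* ~250 bytes of IR data */ };

// Swap this one line to try std::list, std::deque, etc. during profiling.
using Container = std::vector<Instruction>;

void transform(Container& block) {
    for (auto it = block.begin(); it != block.end(); ++it) {
        // transformations written against the generic Container interface
    }
}
```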

Alternative to std::vector to store a sequence of objects

I am dealing with several million data elements that are to be accessed sequentially. The elements rarely grow and shrink but do so in known chunk sizes in a predictable manner.
I am looking for an efficient collection similar to std::vector which does not reallocate but holds the data in multiple chunks of memory. Whenever I push more objects into the collection and the last chunk is exhausted, a new chunk gets created and populated. I am not keen on having a random access operator. I cannot use std::list due to performance issues and a few other issues that are beyond the scope of the question at hand.
Is there a ready-made collection that fits my requirements in Boost or any other library? I want to make sure there is nothing available off the shelf before I try to cook something up myself.
It sounds to me like your best bet would be many std::vectors stored within a B-Tree. The B-Tree lets you refer to areas in memory without actually visiting them during tree traversal, allowing for minimal file access.
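As a starting point, here is a minimal sketch of the chunk idea (the class name and chunk size are made up; the B-Tree from the answer would sit on top of these chunks as an index). Because each chunk reserves its full capacity up front, existing elements never move on growth:

```cpp
#include <cstddef>
#include <memory>
#include <vector>

template <typename T, std::size_t ChunkSize = 4096>
class ChunkedSequence {
public:
    void push_back(const T& value) {
        if (chunks_.empty() || chunks_.back()->size() == ChunkSize) {
            chunks_.push_back(std::make_unique<std::vector<T>>());
            chunks_.back()->reserve(ChunkSize);  // one allocation per chunk
        }
        chunks_.back()->push_back(value);        // never reallocates a chunk
    }

    template <typename F>
    void for_each(F f) {                         // sequential access only
        for (auto& chunk : chunks_)
            for (auto& elem : *chunk) f(elem);
    }

private:
    std::vector<std::unique_ptr<std::vector<T>>> chunks_;
};
```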

list/map of key-value pairs backed up by file on disk

I need to make a list of key-value pairs (similar to std::map<std::string, std::string>) that is stored on disk and can be accessed by multiple threads at once. Keys can be added or removed, values can be changed, and keys are unique. Supposedly the whole thing might not fit into memory at once, so updates to the map must be saved to disk.
The problem is that I'm not sure how to approach this. I understand how to deal with the multithreading issues, but I'm not sure which data structure is suitable for storing the data on disk. Pretty much anything I can think of can dramatically change structure and cause a massive rewrite of the disk storage if I approach the problem head-on. On the other hand, relational databases and the Windows registry deal with this problem, so there must be a way to approach it.
Is there a data structure that is "made" for such scenario?
Or do I simply use any traditional data structure(trees or skip lists, for example) and make some kind of "memory manager" (disk-backed "heap") that allocates chunks of disk space, loads them into memory on request and unloads them onto disk, when necessary? I can imagine how to write such "disk-based heap", but that solution isn't very elegant, especially when you add multi-threading to the picture.
Ideas?
The data structure that is "made" for your scenario is the B-tree or one of its variants, like the B+ tree.
Long and short of it: once you write things to disk you are no longer dealing with "data structures" - you are dealing with "serialization" and "databases."
The C++ STL and its data structures do not really address these issues, but, fortunately, they have been addressed thousands of times by thousands of programmers already. Chances are 99.9% that they've already written something that will work well for you.
Based on your description, sqlite sounds like it would be a decent, balanced choice for your application.
If you only need to do lookups (and insertions, deletions) by key, and not more complex field-based queries, BDB may be a better choice for your application.
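As a sketch of the SQLite route (these are real sqlite3 C API calls, but the file, table, and key names are made up), a disk-backed key-value map reduces to a single table; SQLite handles its own file locking for concurrent access:

```cpp
#include <sqlite3.h>
#include <string>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("kv.db", &db) != SQLITE_OK) return 1;

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)",
        nullptr, nullptr, nullptr);

    // Upsert: insert or overwrite the value for a key.
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT OR REPLACE INTO kv (key, value) VALUES (?, ?)", -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "name", -1, SQLITE_TRANSIENT);
    sqlite3_bind_text(stmt, 2, "value42", -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);

    // Lookup by key.
    sqlite3_prepare_v2(db, "SELECT value FROM kv WHERE key = ?", -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, "name", -1, SQLITE_TRANSIENT);
    if (sqlite3_step(stmt) == SQLITE_ROW) {
        std::string value =
            reinterpret_cast<const char*>(sqlite3_column_text(stmt, 0));
        (void)value;
    }
    sqlite3_finalize(stmt);
    sqlite3_close(db);
}
```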

std::map vs. self-written std::vector based dictionary

I'm building a content storage system for my game engine and I'm looking at possible alternatives for storing the data. Since this is a game, it's obvious that performance is important. Especially considering various entities in the engine will be requesting resources from the data structures of the content manager upon their creation. I'd like to be able to search resources by a name instead of an index number, so a dictionary of some sort would be appropriate.
What are the pros and cons of using an std::map versus creating my own dictionary class based on std::vector? Are there any speed differences (if so, where will performance take a hit? I.e. appending vs. accessing), and is there any point in taking the time to write my own class?
For some background on what needs to happen:
Writing to the data structures occurs only at one time, when the engine loads. So no writing actually occurs during gameplay. When the engine exits, these data structures are to be cleaned up. Reading from them can occur at any time, whenever an entity is created or a map is swapped. There can be as little as one entity being created at a time, or as many as 20, each needing a variable number of resources. Resource size can also vary depending on the size of the file being read in at the start of the engine, images being the smallest and music being the largest depending on the format (.ogg or .midi).
Map: std::map has guaranteed logarithmic lookup complexity. It's usually implemented by experts and will be of high quality (e.g. exception safety). You can use custom allocators for custom memory requirements.
Your solution: It'll be written by you. A vector is for contiguous storage with random access by position, so how will you implement lookup by value? Can you do it with guaranteed logarithmic complexity or better? Do you have specific memory requirements? Are you sure you can implement the lookup algorithm correctly and efficiently?
3rd option: If your key type is a string (or something else that's expensive to compare), do also consider std::unordered_map, which has constant-time lookup by value in typical situations (but not quite guaranteed).
If you want the speed guarantee of std::map as well as the low memory usage of std::vector you could put your data in a std::vector, std::sort it and then use std::lower_bound to find the elements.
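A minimal sketch of that sorted-vector dictionary (the names and values are illustrative): sort once at load time, then every lookup is O(log n) with no per-node heap allocation:

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // (resource name, resource id)

int main() {
    std::vector<Entry> dict = {{"music", 2}, {"sprite", 0}, {"tileset", 1}};

    // Sort once when the engine loads; the game only reads afterwards.
    std::sort(dict.begin(), dict.end());

    // Binary search by key with std::lower_bound.
    auto it = std::lower_bound(
        dict.begin(), dict.end(), std::string("sprite"),
        [](const Entry& e, const std::string& key) { return e.first < key; });
    bool found = (it != dict.end() && it->first == "sprite");
    (void)found;
}
```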
std::map is written with performance in mind anyway; whilst it does have some overhead because it is generalized to all circumstances, it will probably end up more efficient than your own implementation. It is typically implemented as a red-black tree, giving all of its operations O(log n) complexity (aside from copying and iterating, for obvious reasons).
How often will you be reading/writing to the map, and how long will each element be in it? Also, you have to consider how often you will need to resize, etc. Each of these questions is crucial to choosing the correct data structure for your implementation.
Overall, one of the std containers will probably be what you want, unless you need functionality that is not in a single one of them, or you have an idea which could improve on their time complexities.
EDIT: Based on your update, I would agree with Kerrek SB that if you're using C++0x, then std::unordered_map would be a good data structure to use in this case. However, bear in mind that your performance can degrade to linear time complexity if you have conflicting hashes (this cannot happen with std::map), as it will store the two pairs in the same bucket. Whilst this is rare, the probability obviously increases with the number of elements. So if you're writing a huge game, it's possible that std::unordered_map could become less optimal than std::map. Just a consideration. :)

Is there a multi index container for hard-disk storage rather than memory?

I need a multi index container based on red-black trees (something like boost::multi_index::multi_index_container) for the case of hard-disk storage. All data must be stored on hard disk rather than in memory.
Is there an open source container that satisfies these conditions?
Note. I use C++.
If you have an in-memory solution, you can use a memory-mapped file and a custom allocator to achieve persistent storage.
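A minimal sketch of that combination using Boost.Interprocess (the file name, segment size, and key/value types are placeholders): the container's nodes are allocated inside a file-backed segment, so the map persists across runs:

```cpp
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/interprocess/containers/map.hpp>
#include <boost/interprocess/managed_mapped_file.hpp>
#include <functional>
#include <utility>

namespace bip = boost::interprocess;

using Value     = std::pair<const int, double>;
using FileAlloc = bip::allocator<Value, bip::managed_mapped_file::segment_manager>;
using DiskMap   = bip::map<int, double, std::less<int>, FileAlloc>;

int main() {
    // Create (or reopen) a 1 MiB file-backed memory segment.
    bip::managed_mapped_file file(bip::open_or_create, "index.bin", 1 << 20);

    // Construct the map inside the file on first run; find it afterwards.
    DiskMap* m = file.find_or_construct<DiskMap>("map")(
        std::less<int>(), file.get_segment_manager());
    (*m)[42] = 3.14;  // persists across runs
}
```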
I am afraid I don't know any.
For hard-disk storage I can only recommend a look at STXXL, which offers STL containers and algorithms adapted to data that can only fit on disk. They have implemented many things to allow for smoother manipulation, essentially by caching as much as possible in memory and delaying disk access when possible.
Now that won't get you a multi index, but at least you'll have an STL :)
Then, if you are determined, you can port multi-index to use the facilities provided by STXXL: they have decoupled the I/O access and memory caching from the containers themselves.
Or you can simply write what you need based on their STL-compliant containers.
How about SQLite? It can use the disk as a backing store, and it supports multiple indexes on the data, as Boost Multi Index does.
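For example (illustrative schema), each SQL index plays the role of one multi_index key, maintained as an ordered structure on disk:

```cpp
#include <sqlite3.h>

int main() {
    sqlite3* db = nullptr;
    if (sqlite3_open("records.db", &db) != SQLITE_OK) return 1;

    // One table, several on-disk ordered views, like multi_index's indices.
    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS records ("
        "    id INTEGER PRIMARY KEY, name TEXT, score REAL);"
        "CREATE INDEX IF NOT EXISTS by_name  ON records(name);"
        "CREATE INDEX IF NOT EXISTS by_score ON records(score);",
        nullptr, nullptr, nullptr);

    sqlite3_close(db);
}
```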