C++ file-backed, tree-like data structure

I'm currently using Google's Protocol Buffers library to store and load data from disk.
It's pretty convenient: it's fast, provides a nice way of defining my data structures, and lets me compress/decompress the data when writing to or reading from the file.
So far that has served me well. The problem is that I now have to deal with a data structure that is several hundred gigabytes large, and with protobuf I can only write and load the file as a whole.
The data structure looks something like this:
struct Layer {
    std::vector<float> weights_;
    std::vector<size_t> indices_;
};

struct Cell {
    std::vector<Layer> layers_;
};

struct Data {
    int some_header_fields;
    ...
    std::vector<Cell> cells_;
};
There are two parts to the algorithm.
In the first part, data is added (not in sequence, the access pattern is random, weights and indices might be added to any layer of any cell). No data is removed.
In the second part, the algorithm accesses one cell at a time and processes the data in it, but the access order of cells is random.
What I'd like would be something similar to protobuf, but backed by some file storage that doesn't need to be serialized/deserialized in one go.
Something that would allow me to do things like
Data.cells_[i].layers_[j].FlushToDisk();
at which point the weights_ and indices_ arrays/lists would write their current data to disk (and free the associated memory) but retain their indices, so that I can add more data to them as I go.
And later during the second part of the algorithm, I could do something like
Data.cells_[i].populate(); //now all data for cell i has been loaded into ram from the file
... process cell i...
Data.cells_[i].dispose(); //now all data for cell i is removed from memory but remains in the file
In addition to storing the data on disk, I'd like it to support compression of the data. It should also allow multithreaded access.
What library would enable me to do this? Or can I still use protobuf for this in some way? (I guess not, because I would not be writing the data to disk in one serialized piece.)
//edit:
Performance is very important. So when I populate a cell, I need the data to be in main memory and in contiguous arrays.
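To make that concrete, here is a rough sketch of the kind of Layer I'm imagining, with one append-only binary file per layer standing in for the real storage backend (the file naming and all other details here are hypothetical; no compression or locking):

#include <cstdio>
#include <string>
#include <vector>

// Hypothetical sketch of the desired interface: each layer appends its
// in-memory arrays to its own binary files and then releases the memory.
struct Layer {
    std::vector<float>  weights_;
    std::vector<size_t> indices_;
    std::string         path_; // e.g. "cell_0_layer_0" (illustrative)

    void FlushToDisk() {
        FILE* f = std::fopen((path_ + ".w").c_str(), "ab");
        std::fwrite(weights_.data(), sizeof(float), weights_.size(), f);
        std::fclose(f);
        f = std::fopen((path_ + ".i").c_str(), "ab");
        std::fwrite(indices_.data(), sizeof(size_t), indices_.size(), f);
        std::fclose(f);
        std::vector<float>().swap(weights_);   // actually frees the capacity;
        std::vector<size_t>().swap(indices_);  // clear() alone would keep it
    }
};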

Related

Training data structure and access

I'm writing an implementation of backpropagation for a feedforward neural network in C++, using the Armadillo library. Right now, I'm loading training data with the load method of Armadillo's matrix class. Two questions:
1) Is this a reasonable choice for storing pre-formatted (CSV), numeric data that fits into main memory (<2GB)? Certainly some ways of doing this are better than others, and it'd be nice to know if this is bad practice. Part of me feels like this isn't a good choice for holding the data, as there are likely more data-ish structures/frameworks (like I should be accessing some SQL database or something). Another part of me feels like numeric data is by definition just matrices, so this should be wonderful.
2) I need to sample without replacement from a data set in my implementation and I see two routes: either I could shuffle the rows of the data set or shuffle an array that indexes the data set. There is a shuffle method for the matrix class in the Armadillo library and I'm suspicious that what is shuffled is addresses and not the rows themselves. Wouldn't that be just as efficient as shuffling an indexing array?
1) Yes, this is fine and it's how I would do it, but note that Armadillo matrices are column-major and thus you may need to transpose the CSV that you load. If your data is sufficiently large that it won't fit in main memory, you could consider writing a custom CSV parser that looks at the data in a streaming sense (i.e. one point at a time), thus reducing your RAM footprint, or you could even use mmap() to map a file full of packed doubles as your matrix and let the kernel work out what needs to be swapped in when.
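For the mmap() idea, here is a rough, untested sketch of mapping a file of packed column-major doubles and wrapping it in a matrix without copying (the dimensions are assumed to be known; error handling omitted):

#include <armadillo>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

using namespace arma;

// Wrap a file of packed column-major doubles in a matrix without
// copying (copy_aux_mem = false). Treat the result as read-only and
// keep the mapping alive for as long as the matrix is used.
mat map_matrix(const char* path, uword n_rows, uword n_cols)
{
    int fd = open(path, O_RDONLY);
    size_t bytes = n_rows * n_cols * sizeof(double);
    void* mem = mmap(nullptr, bytes, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); // the mapping stays valid after closing the descriptor
    return mat(static_cast<double*>(mem), n_rows, n_cols,
               /*copy_aux_mem*/ false, /*strict*/ true);
}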
2) Because all matrix data is stored contiguously (i.e. double* not double**), shuffle() will be moving the elements in the matrix. What I generally do in this type of situation is create a vector of indices and shuffle it:
uvec indices = linspace<uvec>(0, n - 1, n);
indices = shuffle(indices);
// Now loop over each shuffled point...
for (uword i = 0; i < n; ++i)
{
// access the point with data.col(indices[i]) and do whatever
}
(The above code isn't tested, but it should work or easily be adapted into something that works.)
For what it's worth, mlpack (http://www.mlpack.org/) does have a not-yet-stable neural network infrastructure that uses Armadillo, and it may be worth your time to check it out; the link below is to the relevant source directly, but poking around on GitHub and the mlpack website should reveal better documentation.
https://github.com/mlpack/mlpack/tree/master/src/mlpack/methods/ann

Recover object lazy-loading the containing file

I'm recovering an object from a binary file using boost::archive::binary_iarchive, but the file is too heavy (18 GB) and reading the object loads the entire file into memory. Is there a way to read the file in parts (a lazy load) to avoid the memory use?
What I have:
std::ifstream ifs(filename);
boost::archive::binary_iarchive ia(ifs);
MyObject obj;
ia >> obj;
Upgrading my comment to an answer:
@cmaster got really close to an approach that can work, but he accidentally put the problem upside down.
The raw file was never the issue (it was streaming all along).
The problem is that deserialization tries to put all the data in memory (into that vector, e.g.). So there are only two real solutions.
The first is to put this data into a (shared?) memory map. You can use the allocators from Boost Interprocess to help you achieve this. This is a lot of effort, but relatively straightforward, conceptually.
The second is to modify the deserialization code to convert to a different on-disk format on the fly (instead of inserting into e.g. that vector), which would then allow mmap as cmaster suggested.
In other words, you'd "cannibalize" the Boost Serialization implementation to migrate the data away from Boost Serialization towards a raw binary format that affords using it directly in mapped memory.
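A minimal sketch of the Boost.Interprocess direction, assuming the payload is a vector of doubles (the file name and segment size are placeholders):

#include <boost/interprocess/managed_mapped_file.hpp>
#include <boost/interprocess/allocators/allocator.hpp>
#include <boost/interprocess/containers/vector.hpp>

namespace bip = boost::interprocess;

// A vector whose elements live inside a memory-mapped file rather than
// on the heap, via the Boost.Interprocess allocator.
using shm_alloc  = bip::allocator<double, bip::managed_mapped_file::segment_manager>;
using shm_vector = bip::vector<double, shm_alloc>;

int main()
{
    // Create (or reopen) a 1 GB file-backed segment.
    bip::managed_mapped_file file(bip::open_or_create, "data.bin", 1ULL << 30);

    // A named vector constructed inside the mapped file; on the next run,
    // find_or_construct retrieves the same vector instead of rebuilding it.
    shm_vector* v = file.find_or_construct<shm_vector>("values")(
        file.get_segment_manager());

    v->push_back(3.14); // backed by the file, paged in and out by the OS
}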
You can use mmap() to map the file into your address space. With that, it doesn't matter that the file is too large because the kernel knows that any data in the mapped region is just a copy of the file on the hard disk. Consequently, it does not even need to swap the data out when it needs the memory for something else. The kernel will just lazily load the parts of the file that you need as you touch them, which is especially good if you don't need everything in the file.
The nice thing about mmap() is that you have the entire file contents accessible as a huge char array, which is quite convenient for many use cases. The only precondition that must be met is that your process runs as a 64 bit process, otherwise your virtual address space will be too small to fit the file into it.
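A bare-bones sketch of that (error handling omitted; the file name is a placeholder):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("huge.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st); // the file size tells us how much to map

    // The whole file becomes one big read-only char array; pages are
    // only faulted in when actually touched.
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    char first = data[0];              // touches only the first page
    char last = data[st.st_size - 1];  // touches only the last page
    (void)first; (void)last;

    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}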

Best way to store, load and use an inverted index in C++ (~500 MB)

I'm developing a tiny search engine using TF-IDF and cosine similarity. When pages are added, I build an inverted index to keep words frequency in the different pages. I remove stopwords and more common words, and plural/verb/etc. are stemmed.
My inverted index looks like:
map< string, map<int, float> > index
[
    word_a => [ id_doc=>frequency, id_doc2=>frequency2, ... ],
    word_b => [ id_doc=>frequency, id_doc2=>frequency2, ... ],
    ...
]
With this data structure, I can get the document frequency for the idf weight from the size of a word's inner map (index["word_a"].size()). Given a query, the program loops over the keywords and scores the documents.
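Concretely, the scoring loop is something like this sketch (the names and the exact idf formula are illustrative):

#include <cmath>
#include <map>
#include <string>
#include <vector>

std::map<std::string, std::map<int, float>> inv_index; // word -> (doc id -> frequency)
int total_docs; // number of pages in the engine

// Accumulate tf-idf scores for every document matching a query word.
std::map<int, float> score(const std::vector<std::string>& query)
{
    std::map<int, float> scores;
    for (const std::string& word : query) {
        const std::map<int, float>& postings = inv_index[word];
        float idf = std::log(float(total_docs) / (1 + postings.size()));
        for (const auto& p : postings)         // p.first = doc id
            scores[p.first] += p.second * idf; // p.second = frequency
    }
    return scores;
}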
I don't know data structures well, and my questions are:
How do I store a 500 MB inverted index so that I can load it at search time? Currently, I use Boost to serialize the index:
ofstream ofs_index("index.sr", ios::binary);
boost::archive::binary_oarchive oa(ofs_index);
oa << index;
And then I load it at search time:
ifstream ifs_index("index.sr", ios::binary);
boost::archive::binary_iarchive ia(ifs_index);
ia >> index;
But it is very slow; sometimes it takes 10 seconds to load.
I don't know if map is efficient enough for an inverted index.
In order to cluster documents, I get all keywords from each document and loop over these keywords to score similar documents. I would like to avoid reading each document again and use only this inverted index, but I think this data structure could be costly for that.
Thank you in advance for any help!
The answer will pretty much depend on whether you need to support data comparable to or larger than your machine's RAM and whether in your typical use case you are likely to access all of the indexed data or rather only a small fraction of it.
If you are certain that your data will fit into your machine's memory, you can try to optimize the map-based structure you are using now. Storing your data in a map should give pretty fast access, but there will always be some initial overhead when you load the data from disk into memory. Also, if you only use a small fraction of the index, this approach may be quite wasteful as you always read and write the whole index, and keep all of it in memory.
Below I list some suggestions you could try out, but before you commit too much time to any of them, make sure that you actually measure what improves the load and run times and what does not. Without profiling the actual working code on actual data you use, these are just guesses which may be completely wrong.
map is implemented as a tree (usually a red-black tree). In many cases a hash map (e.g. std::unordered_map) may give you better performance as well as better memory usage (fewer allocations and less fragmentation, for example).
Try reducing the size of the data: less data means it will be faster to read from disk, potentially fewer memory allocations, and sometimes better in-memory performance due to better locality. For example, you currently use a float to store the frequency, but perhaps you could store only the number of occurrences as an unsigned short in the map values and, in a separate map, the total number of words for each document (also as a short). Using the two numbers, you can recalculate the frequency, but use less disk space when you save the data, which could result in faster load times (see the sketch after these suggestions).
Your map has quite a few entries, and sometimes using custom memory allocators helps improve performance in such a case.
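A sketch of the size-reduction idea from the second suggestion above (all names are illustrative):

#include <string>
#include <unordered_map>

// Store raw counts (unsigned short) instead of float frequencies, plus
// one total-word count per document; frequencies are recomputed on demand.
std::unordered_map<std::string, std::unordered_map<int, unsigned short>> counts;
std::unordered_map<int, unsigned short> doc_words; // total words per document

float frequency(const std::string& word, int doc)
{
    return float(counts[word][doc]) / doc_words[doc];
}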
If your data could potentially grow beyond the size of your machine's RAM, I would suggest you use memory-mapped files for storing the data. Such an approach may require re-modelling your data structures and either using custom STL allocators or using completely custom data structures instead of std::map, but it may improve your performance by an order of magnitude if done well. In particular, this approach frees you from having to load the whole structure into memory at once, so your startup times will improve dramatically, at the cost of slight delays related to disk accesses distributed over time as you touch different parts of the structure for the first time. The subject is quite broad and requires much deeper changes to your code than just tuning the map, but if you plan on handling huge data, you should certainly have a look at mmap and friends.

C++ spatialindex library: loading/storing main memory RTree from/to disk

I have created a main-memory R* index with the help of the spatialindex library in the following way (DBStream implements the interface for bulk loading):
// creating a main memory RTree
memStorage = StorageManager::createNewMemoryStorageManager();
size_t capacity = 1024;
bool bWriteThrough = false;
fileInMem = StorageManager::createNewRandomEvictionsBuffer(
    *memStorage, capacity, bWriteThrough);
DBStream dstream(streets);
tree = RTree::createAndBulkLoadNewRTree(
    SpatialIndex::RTree::BLM_STR, dstream, *fileInMem,
    fillFactor, indexCapacity, leafCapacity, dimension, rv, indexIdentifier);
My data is read-only, i.e., I want to build the tree only once, save it, and reload it from persistent storage every time I use my program. Clearly, I can save and load the memStorage myself, but how do I recreate the RTree from it?
Since you are bulk-loading the tree anyway, there is actually little to gain here. All that an STR bulk load does is sort the data. That is O(n log n) in theory, but if the data is already sorted appropriately it will effectively be O(n) with most sorting implementations.
So most likely, serializing the tree to a file and back is not much cheaper than bulk-loading it again each time. It does take away some flexibility, though.
R-trees in general are meant for dynamic data, IMHO. Sure, they do work for static data. But their key strength (as opposed to other structures) is that the tree supports balancing on insert.
After extensive research I must conclude that it is possible to save the MainMemoryStorage object, but it is impossible to load it. Saving the object is possible through a derived class that keeps track of all used page IDs (and later saves them to the file). Loading these pages is very problematic, as one needs direct access to
std::vector<Entry*> m_buffer;
std::stack<id_type> m_emptyPages;
from MemoryStorageManager.h. These members are private, and MemoryStorageManager.h is not available, as it is a private include internal to the spatialindex library.
What a depressing answer.

How to store a hash table in a file?

How can I store a hash table with separate chaining in a file on disk?
Generating the data stored in the hash table at runtime is expensive; it would be faster to just load the HT from disk... if only I could figure out how to do it.
Edit:
The lookups are done with the HT loaded in memory. I need to find a way to store the hashtable (as it is in memory) to a file in some binary format, so that the next time the program runs it can just load the HT off disk into RAM.
I am using C++.
What language are you using? The common method is to do some sort of binary serialization.
Ok, I see you have edited to add the language. For C++ there are a few options. I believe the Boost serialization mechanism is pretty good. In addition, the page for Boost's serialization library also describes alternatives. Here is the link:
http://www.boost.org/doc/libs/1_37_0/libs/serialization/doc/index.html
Assuming C/C++: use array indices and fixed-size structs instead of pointers and variable-length allocations. You should be able to directly write() the data structures to a file for later read()ing.
For anything higher-level: a lot of higher-level language APIs have serialization facilities. Java and Qt/C++ both have methods that spring immediately to mind, so I know others do as well.
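A rough sketch of the fixed-size-struct idea (the layout is hypothetical, and fwrite()/fread() stand in for write()/read() for brevity):

#include <cstdio>

// Chained hash table with array indices instead of pointers; because the
// whole table is one trivially-copyable block, it can be dumped and
// reloaded verbatim.
struct Entry {
    char key[32];
    int  value;
    int  next; // index of the next entry in the chain, -1 terminates
};

struct Table {
    int   buckets[1024]; // index of the first entry per bucket, -1 if empty
    Entry entries[4096];
    int   used;          // number of entries currently allocated
};

void save(const Table& t, const char* path)
{
    FILE* f = std::fopen(path, "wb");
    std::fwrite(&t, sizeof t, 1, f);
    std::fclose(f);
}

void load(Table& t, const char* path)
{
    FILE* f = std::fopen(path, "rb");
    std::fread(&t, sizeof t, 1, f);
    std::fclose(f);
}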
You could just write the entire data structure directly to disk by using serialization (e.g. in Java). However, you might be forced to read the entire object back into memory in order to access its elements. If this is not practical, then you could consider using a random access file to store the elements of the hash table. Instead of using a pointer to represent the next element in the chain, you would just use the byte position in the file.
Ditch the pointers for indices.
This is a bit similar to constructing an on-disk DAWG, which I did a while back. What made that so very sweet was that it could be loaded directly with mmap instead of reading the file. If the hash space is manageable, say 2^16 or 2^24 entries, then I think I would do something like this:
Keep a list of free indices. (If the table is empty, each chain index would point at the next index.)
When chaining is needed, use the free space in the table.
If you need to put something in an index that's occupied by a squatter (overflow from elsewhere):
record the index (let's call it N)
swap the new element and the squatter
put the squatter in a new free index (F)
follow the chain from the squatter's hash index to replace N with F
If you completely run out of free indices, you probably need a bigger table, but you can cope a little longer by using mremap to create extra room after the table.
This should allow you to mmap the file and use the table directly, without modification (scary fast if it's in the OS cache!), but you have to work with indices instead of pointers. It's pretty spooky to have megabytes available in syscall-round-trip time, and still have it take up less than that in physical memory, because of paging.
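A bare-bones sketch of mapping such a table and chasing an index-based chain (the Entry layout and file name are hypothetical):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Fixed-layout entry whose chain links are indices, so the mapped bytes
// are usable exactly as they sit in the file.
struct Entry {
    unsigned key;
    int      value;
    int      next; // index of the next entry in the chain, -1 terminates
};

int main()
{
    int fd = open("table.bin", O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    const Entry* table = static_cast<const Entry*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    // Walk the chain starting at some bucket's head index.
    for (int i = 0 /* bucket head */; i != -1; i = table[i].next) {
        // inspect table[i].key and table[i].value here
    }

    munmap(const_cast<Entry*>(table), st.st_size);
    close(fd);
}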
Perhaps DBM could be of use to you.
If your hash table implementation is any good, then just store the hash and each object's data: putting an object into the table shouldn't be expensive given the hash, and not serialising the table or chains directly lets you vary the exact implementation between save and load.