How to store a hash table in a file? - c++

How can I store a hash table with separate chaining in a file on disk?
Generating the data stored in the hash table at runtime is expensive; it would be faster to just load the HT from disk, if only I could figure out how to do it.
Edit:
The lookups are done with the HT loaded in memory. I need to find a way to store the hashtable (in memory) to a file in some binary format. So that next time when the program runs it can just load the HT off disk into RAM.
I am using C++.

What language are you using? The common method is to do some sort of binary serialization.
Ok, I see you have edited to add the language. For C++ there are a few options. I believe the Boost serialization mechanism is pretty good. In addition, the page for Boost's serialization library also describes alternatives. Here is the link:
http://www.boost.org/doc/libs/1_37_0/libs/serialization/doc/index.html

Assuming C/C++: Use array indexes and fixed size structs instead of pointers and variable length allocations. You should be able to directly write() the data structures to file for later read()ing.
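A minimal sketch of that idea, assuming a hypothetical fixed-size Entry struct whose chain link is an array index. The names Entry, save_entries and load_entries are made up for illustration, and the file format is only valid on the machine and compiler that wrote it (same endianness and padding).

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Fixed-size record: the chain link is an array index, not a pointer.
struct Entry {
    std::uint32_t key;
    std::uint32_t value;
    std::int32_t  next;   // index of the next entry in the chain, -1 = end
};

// Write the element count followed by the raw array.
bool save_entries(const std::string& path, const std::vector<Entry>& entries) {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    const std::uint64_t count = entries.size();
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    out.write(reinterpret_cast<const char*>(entries.data()),
              static_cast<std::streamsize>(count * sizeof(Entry)));
    out.flush();
    return static_cast<bool>(out);
}

// Read the count, then the raw array, straight into the vector.
bool load_entries(const std::string& path, std::vector<Entry>& entries) {
    std::ifstream in(path, std::ios::binary);
    std::uint64_t count = 0;
    if (!in.read(reinterpret_cast<char*>(&count), sizeof count)) return false;
    entries.resize(count);
    in.read(reinterpret_cast<char*>(entries.data()),
            static_cast<std::streamsize>(count * sizeof(Entry)));
    return static_cast<bool>(in);
}
```

Because the links are indices, nothing needs fixing up after loading; the chains are valid as soon as the array is back in memory.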
For anything higher-level: A lot of higher-level language APIs have serialization facilities. Java and Qt/C++ both have methods that spring immediately to mind, so I know others do as well.

You could just write the entire data structure directly to disk by using serialization (e.g. in Java). However, you might be forced to read the entire object back into memory in order to access its elements. If this is not practical, then you could consider using a random access file to store the elements of the hash table. Instead of using a pointer to represent the next element in the chain, you would just use the byte position in the file.
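In C++ the byte-position idea might look roughly like the sketch below. DiskNode, append_node and find_in_chain are hypothetical names, and the functions take any std::iostream so the same code works on a std::fstream opened in binary mode.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>

// A chain node as stored in the file. The link is a byte position.
struct DiskNode {
    std::uint32_t key;
    std::uint32_t value;
    std::int64_t  next_offset;  // byte position of the next node, -1 = end of chain
};

// Append a node at the end of the file and return its byte position.
std::int64_t append_node(std::iostream& file, const DiskNode& node) {
    file.seekp(0, std::ios::end);
    const std::int64_t pos = static_cast<std::int64_t>(file.tellp());
    file.write(reinterpret_cast<const char*>(&node), sizeof node);
    return pos;
}

// Walk one chain, doing one seek and one read per node.
bool find_in_chain(std::iostream& file, std::int64_t head,
                   std::uint32_t key, std::uint32_t& value_out) {
    std::int64_t pos = head;
    while (pos >= 0) {
        DiskNode node;
        file.seekg(static_cast<std::streamoff>(pos));
        if (!file.read(reinterpret_cast<char*>(&node), sizeof node)) return false;
        if (node.key == key) { value_out = node.value; return true; }
        pos = node.next_offset;
    }
    return false;
}
```

Only the nodes on the chain being searched are read, so the table never has to fit in memory, at the cost of a seek per node.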

Ditch the pointers for indices.
This is a bit similar to constructing an on-disk DAWG, which I did a while back. What made that so very sweet was that it could be loaded directly with mmap instead of reading the file. If the hash-space is manageable, say 2^16 or 2^24 entries, then I think I would do something like this:
Keep a list of free indices. (if the table is empty, each chain-index would point at the next index.)
When chaining is needed use the free space in the table.
If you need to put something in an index that's occupied by a squatter (overflow from elsewhere) :
record the index (let's call it N)
swap the new element and the squatter
put the squatter in a new free index, (F).
follow the chain on the squatter's hash index, to replace N with F.
If you completely run out of free indices, you probably need a bigger table, but you can cope a little longer by using mremap to create extra room after the table.
This should allow you to mmap and use the table directly, without modification. (scary fast if in the OS cache!) but you have to work with indices instead of pointers. It's pretty spooky to have megabytes available in syscall-round-trip-time, and still have it take up less than that in physical memory, because of paging.
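A rough sketch of the loading side on a POSIX system, assuming the table was written as a plain array of fixed-size slots and that the hash is simply key modulo slot count. Slot, write_table and MappedTable are hypothetical names; the insertion logic with squatter relocation described above is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// One table slot. `next` is a slot index (not a pointer); -1 ends the chain.
struct Slot {
    std::uint32_t key;
    std::uint32_t value;
    std::int32_t  next;
    std::uint32_t used;   // 0 = free slot
};

// Write the slot array verbatim; returns false on any I/O error.
bool write_table(const char* path, const Slot* slots, std::size_t count) {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    const std::size_t bytes = count * sizeof(Slot);
    const bool ok = ::write(fd, slots, bytes) == static_cast<ssize_t>(bytes);
    return (::close(fd) == 0) && ok;
}

// Maps the file read-only and looks keys up in place, with no parsing step.
class MappedTable {
public:
    explicit MappedTable(const char* path) {
        int fd = ::open(path, O_RDONLY);
        if (fd < 0) return;
        struct stat st;
        if (::fstat(fd, &st) == 0 && st.st_size > 0) {
            void* p = ::mmap(nullptr, static_cast<std::size_t>(st.st_size),
                             PROT_READ, MAP_PRIVATE, fd, 0);
            if (p != MAP_FAILED) {
                slots_ = static_cast<const Slot*>(p);
                bytes_ = static_cast<std::size_t>(st.st_size);
            }
        }
        ::close(fd);  // the mapping stays valid after the descriptor is closed
    }
    ~MappedTable() {
        if (slots_) ::munmap(const_cast<Slot*>(slots_), bytes_);
    }
    MappedTable(const MappedTable&) = delete;
    MappedTable& operator=(const MappedTable&) = delete;

    std::size_t size() const { return bytes_ / sizeof(Slot); }

    // Start at the key's home slot and follow the index chain.
    bool find(std::uint32_t key, std::uint32_t& value_out) const {
        const std::size_t n = size();
        if (n == 0) return false;
        std::int32_t i = static_cast<std::int32_t>(key % n);
        if (!slots_[i].used) return false;
        while (i >= 0 && static_cast<std::size_t>(i) < n) {
            if (slots_[i].key == key) { value_out = slots_[i].value; return true; }
            i = slots_[i].next;
        }
        return false;
    }

private:
    const Slot* slots_ = nullptr;
    std::size_t bytes_ = 0;
};
```

The constructor does no work proportional to the table size; pages are faulted in only when find touches them.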

Perhaps DBM could be of use to you.

If your hash table implementation is any good, then just store the hash and each object's data - putting an object into the table shouldn't be expensive given the hash, and not serialising the table or chain directly lets you vary the exact implementation between save and load.
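A sketch of that approach using std::unordered_map as a stand-in for the table. One deviation from the suggestion above: std::unordered_map offers no way to pass in a precomputed hash, so this version stores only keys and values and lets the container rehash on load. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <unordered_map>

using Table = std::unordered_map<std::uint32_t, std::uint32_t>;

// Store only the logical content: a count followed by (key, value) pairs.
void save_pairs(std::ostream& out, const Table& table) {
    const std::uint64_t count = table.size();
    out.write(reinterpret_cast<const char*>(&count), sizeof count);
    for (const auto& kv : table) {
        out.write(reinterpret_cast<const char*>(&kv.first), sizeof kv.first);
        out.write(reinterpret_cast<const char*>(&kv.second), sizeof kv.second);
    }
}

// Rebuild the table by re-inserting every pair; bucket layout is not preserved.
Table load_pairs(std::istream& in) {
    Table table;
    std::uint64_t count = 0;
    if (!in.read(reinterpret_cast<char*>(&count), sizeof count)) return table;
    for (std::uint64_t i = 0; i < count; ++i) {
        std::uint32_t key = 0, value = 0;
        in.read(reinterpret_cast<char*>(&key), sizeof key);
        in.read(reinterpret_cast<char*>(&value), sizeof value);
        if (!in) break;  // truncated file: keep what was read so far
        table.emplace(key, value);
    }
    return table;
}
```

Since the file holds no buckets or chains, the hash function or the whole table implementation can change between the save and the load.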

Related

C++ - Managing References in Disk Based Vector

I am developing a set of vector classes that all derived from an abstract vector. I am doing this so that in our software that makes use of these vectors, we can quickly switch between the vectors without any code breaking (or at least minimize failures, but my goal is full compatibility). All of the vectors match.
I am working on a Disk Based Vector that mostly conforms to match the STL Vector implementation. I am doing this because we need to handle large out of memory files that contain various formats of data. The Disk Vector handles data read/write to disk by using template specialization/polymorphism of serialization and deserialization classes. The data serialization and deserialization has been tested, and it works (up to now). My problem occurs when dealing with references to the data.
For example,
Given a DiskVector dv, a call to dv[10] would get a pointer to a spot on disk, then seek there and read out the char stream. This stream gets passed to a deserializer, which converts the byte stream into the appropriate data type. Once I have the value, I may return it.
This is where I run into a problem. In the STL, they return it as a reference, so in order to match their style, I need to return a reference. What I do is store the value in an unordered_map with the given index (in this example, 10). Then I return a reference to the value in the unordered_map.
If this continues without cleanup, then the purpose of the DiskVector is lost because all the data just gets loaded into memory, which is bad due to data size. So I clean up this map by deleting the indexes later on when other calls are made. Unfortunately, if a user decided to store this reference for a long time, and then it gets deleted in the DiskVector, we have a problem.
So my questions
Is there a way to see if any other references to a certain instance are in use?
Is there a better way to solve this while still maintaining the polymorphic style for reasons described at the beginning?
Is it possible to construct a special class that would behave as a reference, but handle the disk IO dynamically so I could just return that instead?
Any other ideas?
So a better solution at what I was trying to do is to use SQLite as the backend for the database. Use BLOBs as the column types for key and value columns. This is the approach I am taking now. That said, in order to get it to work well, I need to use what cdhowie posted in the comments to my question.
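For the third question, a proxy object is one common shape for a class that behaves as a reference. The sketch below is only an illustration under simplifying assumptions: DiskRef is a hypothetical name, T must be trivially copyable, elements are stored as raw bytes at index * sizeof(T), and the stream must outlive the proxy. It is not the SQLite approach described above.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <type_traits>

// Stands in for T&: reads go to the stream on conversion, writes on assignment.
template <typename T>
class DiskRef {
    static_assert(std::is_trivially_copyable<T>::value,
                  "raw byte I/O needs a trivially copyable T");
public:
    DiskRef(std::iostream& file, std::size_t index) : file_(file), index_(index) {}

    // Reading: fetch the bytes for this element and convert to T.
    operator T() const {
        T value{};
        file_.seekg(static_cast<std::streamoff>(index_ * sizeof(T)));
        file_.read(reinterpret_cast<char*>(&value), sizeof value);
        return value;
    }

    // Writing: store the bytes at this element's position.
    DiskRef& operator=(const T& value) {
        file_.seekp(static_cast<std::streamoff>(index_ * sizeof(T)));
        file_.write(reinterpret_cast<const char*>(&value), sizeof value);
        return *this;
    }

private:
    std::iostream& file_;
    std::size_t index_;
};
```

A disk-backed operator[] could return DiskRef<T> by value, so nothing has to stay cached in memory for the lifetime of a user-held reference. The cost is that every read or write through the proxy is a disk access, and code that needs a real T& will not accept it.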

best way to store addresses of pointer to an object of class

I have a class A whose objects are created dynamically:
A *object;
object = new A;
there will be many objects of A in an execution.
i have created a method in A to return the address of a particular object depending on the passed id.
A *A::get_obj(int id)
implementation of get_obj requires iteration so i chose vectors to store the addresses of the objects.
vector<A *> myvector
i think another way to do this is by creating a file & storing the address as a text on a particular line (this will be the id).
this will help me reduce memory usage as i will not create a vector then.
what i dont know is that will this method consume more time than the vector method?
any other option of doing the same is welcome.
Don't store pointers in files. Ever. The objects A are taking up more space than the pointers to them anyway. If you need more A's than you can hold onto at one time, then you need to create them as needed and serialize the instance data to disk if you need to get them back later before deleting, but not the address.
will this method consume more time than the vector method?
Yes, it will consume a lot more time - every lookup will take several thousand times longer. This does not hurt if lookups are rare, but if they are frequent, it could be bad.
this will help me reduce memory usage
How many objects are you expecting to manage this way? Are you certain that memory usage will be a problem?
any other option of doing the same is welcome
These are your two options, really. You can either manage the list in memory, or on disk. Depending on your usage scenario, you can combine both methods. You could, for instance, keep frequently used objects in memory, and write infrequently used ones out to disk (this is basically caching).
Storing your data in a file will be considerably slower than in RAM.
Regarding the data structure itself, if you usually use all the IDs, that is, if your vector usually has no empty cells, then std::vector is probably the most suitable approach. But if your vector would have many empty cells, std::map may give you a better solution. It will consume less memory and give O(log N) access complexity.
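A small sketch of the std::map variant for sparse ids. Registry and create are hypothetical names, and A is reduced to a stand-in with just an id; the map owns the objects, so no raw addresses need to be stored anywhere.

```cpp
#include <cassert>
#include <map>
#include <memory>

// Stand-in for the question's class A.
struct A {
    explicit A(int i) : id(i) {}
    int id;
};

// Owns the objects and finds them by id in O(log N), with no per-lookup scan.
class Registry {
public:
    A* create(int id) {
        std::unique_ptr<A>& slot = objects_[id];
        slot = std::make_unique<A>(id);   // replaces any previous object with this id
        return slot.get();
    }
    A* get_obj(int id) const {
        auto it = objects_.find(id);
        return it == objects_.end() ? nullptr : it->second.get();
    }
private:
    std::map<int, std::unique_ptr<A>> objects_;
};
```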
The most important thing here, imho, is the size of your data set and your platform. For a modern PC, handling an in-memory map of thousands of entries is very fast, but if you handle gigabytes of data, you'd better store it in a real on-disk database (e.g. MySQL).

Size/Resize of GHashTable

Here is my use case: I want to use glib's GHashTable and use IP-addresses as keys, and the volume of data sent/received by this IP-address as the value. So far I have succeeded in implementing the whole thing in user-space, using some kernel variables to look up the volume per IP-address.
Now the question is: Suppose I have a LOT of IP-addresses (i.e. 500,000 up to 1,000,000 uniques) => it is really not clear what is the space allocated and the first size that was given to a new hash table created when using (g_hash_table_new()/g_hash_table_new_full()), and how the whole thing works in the background. It is known that when resizing a hash table it can take a lot of time. So how can we play with these parameters?
Neither g_hash_table_new() nor g_hash_table_new_full() let you specify the size.
The size of a hash table is only available as the number of values stored in it, you don't have access to the actual array size that typically is used in the implementation.
However, the existence of g_spaced_primes_closest() kind of hints that glib's hash table uses a prime-sized internal array.
I would say that although a million keys is quite a lot, it's not extraordinary. Try it, and then measure the performance to determine if it's worth digging deeper.

C++ fstream Erase the file contents from a selected Point

I need to Erase the file contents from a selected Point (C++ fstream) which function should i use ?
i have written objects , i need to delete these objects in middle of the file
C++ has no standard mechanism to truncate a file at a given point. You either have to recreate the file (open with ios::trunc and write the contents you want to keep) or use OS-specific API calls (SetEndOfFile on Windows, truncate or ftruncate on Unix).
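Since C++17 the standard library does include std::filesystem::resize_file, which can cut a file off at a given size. It works on a path rather than on an open fstream, so the stream should be closed or flushed first. A minimal sketch, with truncate_at as a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <system_error>

// Cut the file off at new_size bytes; returns false if the OS call failed.
bool truncate_at(const std::filesystem::path& path, std::uintmax_t new_size) {
    std::error_code ec;
    std::filesystem::resize_file(path, new_size, ec);
    return !ec;
}
```

This only removes the tail of a file; it does not help with deleting a range from the middle.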
EDIT: Deleting stuff in the middle of a file is an exceedingly precarious business. Long before considering any other alternatives, I would try to use a server-less database engine like SQLite to store serialised objects. Better still, I would use SQLite as intended by storing the data needed by those objects in a proper schema.
EDIT 2: If the problem statement requires raw file access...
As a general rule, you don't delete data from the middle of a file. If the objects can be serialised to a fixed size on disk, you can work with them as records, and rather than trying to delete data, you use a table that indexes records within the file. E.g., if you write four records in sequence, the table will hold [0, 1, 2, 3]. In order to delete the second record, you simply remove its entry from the table: [0, 2, 3]. There are at least two ways to reuse the holes left behind by the table:
On each insertion, scan for the first unused index and write the object out at the corresponding record location. This will become more expensive, though, as the file grows.
Maintain a free list. Store, as a separate variable, the index of the most recently freed record. In the space occupied by that record encode the index of the record freed before it, and so on. This maintains a handy linked-list of free records while only requiring space for one additional number. It is more complicated to work with, however, and requires an extra disk I/O when deleting and inserting.
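The free-list variant could be sketched as follows. The layout is an assumption made for the example: an 8-byte header holding the index of the most recently freed record (-1 if none), followed by records of a hypothetical fixed size kRecordSize. RecordFile and its methods are made-up names, and the table of live records is left out.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>

constexpr std::int64_t kHeaderSize = sizeof(std::int64_t);
constexpr std::int64_t kRecordSize = 16;  // hypothetical fixed record size in bytes

class RecordFile {
public:
    explicit RecordFile(std::iostream& file) : file_(file) {}

    // Call once on a brand-new, empty file: writes an empty free list.
    void format() { write_i64(0, -1); }

    // Reuse the most recently freed record if there is one, else append.
    std::int64_t allocate() {
        const std::int64_t head = read_i64(0);
        if (head >= 0) {
            write_i64(0, read_i64(offset(head)));  // pop: header = next free index
            return head;
        }
        file_.seekp(0, std::ios::end);
        const std::int64_t end = static_cast<std::int64_t>(file_.tellp());
        const char zeros[kRecordSize] = {};
        file_.write(zeros, kRecordSize);
        return (end - kHeaderSize) / kRecordSize;
    }

    // Push the record onto the free list; its first 8 bytes become the link.
    void release(std::int64_t index) {
        write_i64(offset(index), read_i64(0));
        write_i64(0, index);
    }

    // Byte position of a record's payload.
    static std::int64_t offset(std::int64_t index) {
        return kHeaderSize + index * kRecordSize;
    }

private:
    std::int64_t read_i64(std::int64_t pos) {
        std::int64_t v = -1;
        file_.seekg(static_cast<std::streamoff>(pos));
        file_.read(reinterpret_cast<char*>(&v), sizeof v);
        return v;
    }
    void write_i64(std::int64_t pos, std::int64_t v) {
        file_.seekp(static_cast<std::streamoff>(pos));
        file_.write(reinterpret_cast<const char*>(&v), sizeof v);
    }

    std::iostream& file_;
};
```

Records are reused in last-freed, first-reused order, and each allocate or release touches the header plus one record, which is the extra disk I/O mentioned above.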
If the objects can't be serialised to a fixed-length, then this approach becomes much, much harder. Variable-length record management code is very complex.
Finally, if the problem statement requires keeping records in order on disk, then it's a stupid problem statement, because insertion/removal in the middle of a file is ridiculously expensive; no sane design would require this.
The general method is to open the file for read access, open a new file for write access, read the content of the first file and write the data you want retained to the second file. When complete, you delete the first file and rename the second to that of the first.
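A sketch of that copy-and-rename method using C++17's std::filesystem, assuming the part to drop is a byte range [begin, end). The helper name remove_range and the ".tmp" suffix are arbitrary choices for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <system_error>

// Copy everything except bytes [begin, end) to a temporary file,
// then replace the original with it.
bool remove_range(const std::filesystem::path& path,
                  std::uintmax_t begin, std::uintmax_t end) {
    std::filesystem::path tmp = path;
    tmp += ".tmp";
    {
        std::ifstream in(path, std::ios::binary);
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!in || !out) return false;
        char c;
        std::uintmax_t pos = 0;
        while (in.get(c)) {
            if (pos < begin || pos >= end) out.put(c);
            ++pos;
        }
        out.flush();
        if (!out) return false;
    }  // both streams are closed here, before the files are touched
    std::error_code ec;
    std::filesystem::remove(path, ec);
    if (ec) return false;
    std::filesystem::rename(tmp, path, ec);
    return !ec;
}
```

The byte-at-a-time loop keeps the sketch short; a real implementation would copy in larger blocks.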

mmap-loadable data structure library for C++ (or C)

I have some large data structure (N > 10,000) that usually only needs to be created once (at runtime), and can be reused many times afterwards, but it needs to be loaded very quickly. (It is used for user input processing on iPhoneOS.) mmap-ing a file seems to be the best choice.
Are there any data structure libraries for C++ (or C)? Something along the line
ReadOnlyHashTable<char, int> table ("filename.hash");
// mmap(...) inside the c'tor
...
int freq = table.get('a');
...
// munmap(...); inside the d'tor.
Thank you!
Details:
I've written a similar class for hash table myself but I find it pretty hard to maintain, so I would like to see if there's existing solutions already. The library should
Contain a creation routine that serializes the data structure into a file. This part doesn't need to be fast.
Contain a loading routine that mmaps a file into a read-only (or read-write) data structure that is usable within O(1) steps of processing.
Use O(N) amount of disk/memory space with a small constant factor. (The device has serious memory constraint.)
Small time overhead to accessors. (i.e. the complexity isn't modified.)
Assumptions:
Bit representation of data (e.g. endianness, encoding of float, etc.) does not matter since it is only used locally.
So far the possible types of data I need are integers, strings, and struct's of them. Pointers do not appear.
P.S. Can Boost.intrusive help?
You could try to create a memory mapped file and then create the STL map structure with a custom allocator. Your custom allocator then simply takes the beginning of the memory of the memory mapped file, and then increments its pointer according to the requested size.
In the end all the allocated memory should be within the memory of the memory mapped file and should be reloadable later.
You will have to check if memory is freed by the STL map. If it is, your custom allocator will lose some memory of the memory mapped file, but if this is limited you can probably live with it.
Sounds like maybe you could use one of the "perfect hash" utilities out there. These spend some time optimising the hash function for the particular data, so there are no hash collisions and (for minimal perfect hash functions) so that there are no (or at least few) empty gaps in the hash table. Obviously, this is intended to be generated rarely but used frequently.
CMPH claims to cope with large numbers of keys. However, I have never used it.
There's a good chance it only generates the hash function, leaving you to use that to generate the data structure. That shouldn't be especially hard, but it possibly still leaves you where you are now - maintaining at least some of the code yourself.
Just thought of another option - Datadraw. Again, I haven't used this, so no guarantees, but it does claim to be a fast persistent database code generator.
WRT boost.intrusive, I've just been having a look. It's interesting. And annoying, as it makes one of my own libraries look a bit pointless.
I thought this section looked particularly relevant.
If you can use "smart pointers" for links, presumably the smart pointer type can be implemented using a simple offset-from-base-address integer (and I think that's the point of the example). An array subscript might be equally valid.
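To make the offset idea concrete, here is a hand-rolled sketch that does not use Boost.Intrusive: list nodes live in a byte buffer and link to each other by offset from the buffer's base, so the links stay valid wherever the buffer ends up, for example at whatever address mmap returns. Node, kNull, push_front and sum_list are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::uint32_t kNull = 0xFFFFFFFFu;  // "null" offset

struct Node {
    std::int32_t  value;
    std::uint32_t next;   // offset of the next node from the buffer base
};

// Append a node that links to the current head; returns the new head offset.
std::uint32_t push_front(std::vector<unsigned char>& buf,
                         std::int32_t value, std::uint32_t head) {
    const Node node{value, head};
    const std::uint32_t off = static_cast<std::uint32_t>(buf.size());
    buf.resize(buf.size() + sizeof node);
    std::memcpy(buf.data() + off, &node, sizeof node);
    return off;
}

// Walk the list relative to whatever base address the bytes live at now.
long sum_list(const unsigned char* base, std::uint32_t head) {
    long sum = 0;
    std::uint32_t off = head;
    while (off != kNull) {
        Node node;
        std::memcpy(&node, base + off, sizeof node);
        sum += node.value;
        off = node.next;
    }
    return sum;
}
```

Copying the buffer to a different address, or writing it to a file and mapping it back, leaves every link intact because no absolute address is ever stored.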
There's certainly unordered set/multiset support (C++ code for hash tables).
Using cmph would work. It does have the serialization machinery for the hash function itself, but you still need to serialize the keys and the data, besides adding a layer of collision resolution on top of it if your query set universe is not known beforehand. If you know all keys beforehand, then it is the way to go, since you don't need to store the keys and will save a lot of space. If not, for such a small set, I would say it is overkill.
Probably the best option is to use google's sparse_hash_map. It has very low overhead and also has the serialization hooks that you need.
http://google-sparsehash.googlecode.com/svn/trunk/doc/sparse_hash_map.html#io
GVDB (GVariant Database), the core of Dconf is exactly this.
See git.gnome.org/browse/gvdb, dconf and bv
and developer.gnome.org/glib/2.30/glib-GVariant.html