Indexing a large file (32 GB) - C++

Apologies in advance, as I think I need to give some background on my problem.
We have a proprietary database engine written in native C++ and built for a 32-bit runtime. Database records are identified by their record number (essentially the offset in the file where the record is written) and by a "unique id" (which is nothing more than a value in the range -100 down to LONG_MIN).
Previously the engine limited a database to 2 GB (where a record block could range from a minimum of 512 bytes up to 512*(1 to 7) bytes). This effectively limited the number of records to about 4 million.
We index these 4 million records and store the index in a hashtable (we implemented extensible hashing for this), and it works brilliantly for a 2 GB database. Each index entry is 24 bytes. Every record is indexed both by its record number and by its "unique id" (the entries live on the heap, and both the record number and the "unique id" can point to the same entry). The index is kept in memory and also persisted to a file (although only the record-number-based entries are written to the file). In memory, the index for a 2 GB database consumes about 95 MB, which is still fine in a 32-bit runtime (though as a safety measure we limit the software to about 7 databases per database engine).
The problem began when we decided to increase the maximum database size from 2 GB to 32 GB. This raises the number of records to about 64 million, which means the hashtable would hold about 1.7 GB of index entries in heap memory for a single 32 GB database alone.
I ditched the in-memory hashtable and wrote the index straight to a file, but I failed to consider the time it would take to search the index in the file, given that I cannot sort the entries on demand (writes to the database happen all the time, so the index must be updated almost immediately). Basically I'm having problems with re-indexing: the software needs to check whether a record exists by looking it up in the current index, and since I changed that from in-memory to file I/O, it now takes forever to finish indexing a 32 GB database (by my calculation, indexing even a 2 GB database this way would take 3 days to complete).
I then decided to store the index entries ordered by record number so I don't have to search for them in the file, and structured each entry as follows:
struct node {
    long recNum; // Record Number
    long uId;    // Unique Id
    long prev;
    long next;
    long rtype;
    long parent;
};
This works perfectly when I use recNum to determine where in the file the index entry is stored and retrieve it with read(...), but the problem is searching by "unique id".
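That direct lookup is essentially just the following (a sketch, assuming recNum can be mapped to a zero-based slot in the index file):
#include <fstream>

// Fetch the index entry for a given record number with a single seek + read; no searching.
node readByRecNum(std::ifstream& idx, long recNum)
{
    node n = {};
    idx.seekg(static_cast<std::streamoff>(recNum) * sizeof(node));
    idx.read(reinterpret_cast<char*>(&n), sizeof(n));
    return n;
}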
When I search the index file by "unique id", what I essentially do is load chunks of the 1.7 GB index file and check each "unique id" until I get a hit, which proves to be a very slow process. I tried creating an index of the index so I could scan faster, but it is still slow. The software has a function that will eventually check every record in the database by first looking it up in the index via a "unique id" query, and if that function runs, finishing the 1.7 GB file-based index would take 4 weeks by my calculation.
So I guess what I'm trying to ask is: when dealing with large databases (30+ GB of data), keeping the whole index in memory in a 32-bit runtime probably isn't an option due to limited resources, so how does one implement a file-based index or hashtable without sacrificing so much time that it becomes impractical?

It's quite simple: Do not try to reinvent the wheel.
Any full SQL database out there is easily capable of storing and indexing tables with several million entries.
For a large table you would commonly use a B+Tree. You don't need to rebalance the tree on every insert, only when a node overflows its maximum size or underflows its minimum size. This gives a bad worst-case runtime for a single insert, but the cost is amortized.
There is also a lot of logic involved in efficiently, dynamically caching and evicting parts of the index in memory. I strongly advise against trying to re-implement all of that on your own.
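To make the B+Tree idea concrete (whether it comes from a library or from your own code): every node is a fixed-size page holding many sorted keys, so a lookup by "unique id" reads only a handful of pages instead of scanning the whole index file. The following is only a minimal sketch; the node layout, BTREE_ORDER and the child-pointer convention are illustrative assumptions, not an existing on-disk format:
#include <cstdint>
#include <fstream>

const int BTREE_ORDER = 255; // keys per node, chosen so one node fits in a 4 KB page

#pragma pack(push, 1)
struct BTreeNode {
    std::uint16_t keyCount;               // number of keys in use
    std::uint8_t  isLeaf;                 // 1 = leaf, 0 = internal node
    std::uint8_t  pad;
    std::int32_t  key[BTREE_ORDER];       // "unique id" values, kept sorted
    std::int64_t  child[BTREE_ORDER + 1]; // internal: file offsets of child nodes
                                          // leaf: child[i] is the record number for key[i]
};
#pragma pack(pop)

// Walk from the root to a leaf, reading one node per level.
// Returns the record number for uId, or -1 if it is not in the index.
std::int64_t lookupByUniqueId(std::ifstream& idx, std::int64_t rootOffset, std::int32_t uId)
{
    BTreeNode node;
    std::int64_t offset = rootOffset;
    for (;;) {
        idx.seekg(offset);
        idx.read(reinterpret_cast<char*>(&node), sizeof(node));
        if (!idx) return -1;

        if (node.isLeaf) {
            // binary search for uId inside the (now in-memory) leaf
            int lo = 0, hi = node.keyCount;
            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (node.key[mid] < uId) lo = mid + 1; else hi = mid;
            }
            return (lo < node.keyCount && node.key[lo] == uId) ? node.child[lo] : -1;
        }

        // internal node: descend into the child covering uId
        // (convention here: child[i] holds keys <= key[i], child[keyCount] holds the rest)
        int lo = 0, hi = node.keyCount;
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            if (node.key[mid] <= uId) lo = mid + 1; else hi = mid;
        }
        offset = node.child[lo];
    }
}
With roughly 255 keys per ~3 KB node, 64 million keys fit in a tree about four levels deep, so even a completely cold lookup costs about four page reads.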

Related

Searching in large memory mapped files

I have a large data structure stored in a memory-mapped file. The data structure is very simple:
#include <cstdint>

struct Header {
    // ...some metadata...
    uint32_t index_size;
    uint64_t index[];
};
This header is placed at the beginning of the file and uses the "struct hack": it is a variable-sized structure whose last member's length is not fixed and can change.
char* mmaped_region = ...; // This memory comes from memory mapped file!
Header* pheader = reinterpret_cast<Header*>(mmaped_region);
The memory-mapped region starts with the Header, and Header::index_size contains the correct length of the Header::index array. This array contains the offsets of the data elements, so we can do this:
uint64_t offset = pheader->index[x];
DataItem* item = reinterpret_cast<DataItem*>(mmaped_region + offset);
// At this point, variable item contains pointer to data element
// if variable x contains correct index value (less than pheader->index_size)
All the data elements are sorted (a less-than relation is defined for them). They are stored in the same memory-mapped region as the Header, but filled from the end of the region towards the beginning. The data elements can't be moved because they are of variable size; instead, the indexes in the header are moved during the sort procedure. This is very much like a B-tree page in modern databases, where the index array is usually called an indirection vector.
Searches
This data structure is searched with an interpolation search (with a limited number of steps) followed by a binary search. First, starting from the whole index array, I calculate where the searched element should be stored if the distribution is uniform. I look at the element at that calculated index; it usually doesn't match, so I narrow the search range and repeat. The number of interpolation steps is limited to some small number, after which the data structure is searched with binary search. This works very well on small data sets, because the distribution is usually close to uniform: a few iterations of interpolation search and we're done.
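For reference, a minimal sketch of that strategy over a plain sorted array of 64-bit keys might look like the following (the key array and MAX_INTERP_STEPS are assumptions for illustration; the poster's real code compares variable-sized elements through the indirection vector):
#include <cstddef>
#include <cstdint>

const int MAX_INTERP_STEPS = 4;

// Returns the position of 'key' in keys[0..n), or n if it is absent.
std::size_t find(const std::uint64_t* keys, std::size_t n, std::uint64_t key)
{
    std::size_t lo = 0, hi = n;                       // current search range [lo, hi)
    for (int step = 0; step < MAX_INTERP_STEPS && hi - lo > 1; ++step) {
        std::uint64_t loKey = keys[lo], hiKey = keys[hi - 1];
        if (key < loKey || key > hiKey) return n;     // cannot be in this range
        if (hiKey == loKey) break;
        // guess the position assuming a uniform key distribution
        std::size_t guess = lo +
            (std::size_t)((double)(key - loKey) / (double)(hiKey - loKey) * (double)(hi - 1 - lo));
        if (keys[guess] == key) return guess;
        if (keys[guess] < key) lo = guess + 1; else hi = guess;
    }
    // fall back to binary search on whatever range is left
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        if (keys[mid] < key) lo = mid + 1;
        else if (keys[mid] > key) hi = mid;
        else return mid;
    }
    return n;
}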
Problem definition.
In reality the memory-mapped region can be very large. For testing I use a 32 GB file-backed store and search for random keys. This is very slow, because the access pattern causes a lot of random disk reads (the data can't all be cached in memory).
What can be done here? I think that setting MADV_RANDOM with the madvise syscall can help, but probably not very much. I want to get on par with B-tree search speed. Maybe it is possible to use the mincore syscall to check which data elements can be examined cheaply (i.e. are already resident) during the interpolation search? Maybe I can use some sort of prefetching?
The interpolation search appears to be a good idea here. It usually has only a small benefit, but in this case even a small number of iterations saved helps a lot, since each one is so slow (disk I/O).
However, real databases duplicate the actual key values in their indices. The space overhead for that is fully justified by the performance improvement. B-trees are a further improvement because they pack many related keys into a single contiguous block, further reducing disk seeks.
This is probably the correct solution for you as well. You should duplicate the keys to avoid disk I/O. If you can't alter the existing header, you can probably get away with duplicating the keys in a separate structure and keeping that fully in memory.
A compromise is possible, where you just cache the top (2^N)-1 keys for the first N levels of binary search. That means you have to give up your interpolation for that part of the search, but as noted before interpolation is not a huge win anyway. The disk seeks saved will easily pay off. Even caching just the median key (N=1) will already save you one disk seek per lookup. And you can still use interpolation once you've run out of the cache.
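A rough sketch of that compromise is below, under the assumption that keyAt(i) reads the i-th element's key from the mapped file (the expensive operation we want to avoid) and that keys are unique. The (2^N)-1 keys that the first N levels of binary search would touch are essentially evenly spaced, so the cache simply samples one key every n/2^N elements and uses it to narrow the range before the file is touched at all; all names here are hypothetical:
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

std::uint64_t keyAt(std::size_t i);   // assumed: reads element i's key from the mmap'd file

struct TopKeyCache {
    std::size_t step = 0;
    std::vector<std::uint64_t> keys;  // keys[j] == keyAt((j + 1) * step)

    void build(std::size_t n, int N) {
        step = std::max<std::size_t>(1, n >> N);   // n / 2^N
        keys.clear();
        for (std::size_t i = step; i < n; i += step)
            keys.push_back(keyAt(i));              // about 2^N - 1 disk reads, done once
    }

    // Narrow [0, n) to a sub-range of about 'step' elements without touching the file.
    std::pair<std::size_t, std::size_t> narrow(std::uint64_t key, std::size_t n) const {
        std::size_t j = std::upper_bound(keys.begin(), keys.end(), key) - keys.begin();
        std::size_t lo = j * step;                        // first slot that can still hold 'key'
        std::size_t hi = std::min(n, (j + 1) * step);     // exclusive upper bound
        return {lo, hi};
    }
};
After narrow(), the remaining range of roughly n/2^N elements can be searched against the file with the interpolation-plus-binary approach as before.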
In comparison, any attempt to fiddle with memory-mapping parameters will give you a few percent speed improvement at best. "On par with B-trees" is not going to happen. If your algorithm needs those physical seeks, you lose. No magical pixie dust will fix a bad algorithm or a bad data structure.

Best way to store, load and use an inverted index in C++ (~500 MB)

I'm developing a tiny search engine using TF-IDF and cosine similarity. When pages are added, I build an inverted index to keep word frequencies across the different pages. I remove stopwords and the most common words, and plurals/verb forms/etc. are stemmed.
My inverted index looks like:
map< string, map<int, float> > index
[
    word_a => [ id_doc => frequency, id_doc2 => frequency2, ... ],
    word_b => [ id_doc => frequency, id_doc2 => frequency2, ... ],
    ...
]
With this data structure, I can get the idf weight with index["word_a"].size(). Given a query, the program loops over the keywords and scores the documents.
I don't know well data structures and my questions are:
How should I store a 500 MB inverted index so it can be loaded at search time? Currently, I use Boost to serialize the index:
ofstream ofs_index("index.sr", ios::binary);
boost::archive::binary_oarchive oa(ofs_index);
oa << index;
And then I load it at search time:
ifstream ifs_index("index.sr", ios::binary);
boost::archive::binary_iarchive ia(ifs_index);
ia >> index;
But it is very slow; it sometimes takes 10 seconds to load.
I also don't know whether std::map is efficient enough for an inverted index.
In order to cluster documents, I get all keywords from each document and loop over these keywords to score similar documents. I would like to avoid re-reading each document and use only the inverted index, but I think this data structure would be costly for that.
Thank you in advance for any help!
The answer will pretty much depend on whether you need to support data comparable to or larger than your machine's RAM and whether in your typical use case you are likely to access all of the indexed data or rather only a small fraction of it.
If you are certain that your data will fit into your machine's memory, you can try to optimize the map-based structure you are using now. Storing your data in a map should give pretty fast access, but there will always be some initial overhead when you load the data from disk into memory. Also, if you only use a small fraction of the index, this approach may be quite wasteful as you always read and write the whole index, and keep all of it in memory.
Below I list some suggestions you could try out, but before you commit too much time to any of them, make sure that you actually measure what improves the load and run times and what does not. Without profiling the actual working code on actual data you use, these are just guesses which may be completely wrong.
map is implemented as a tree (usually a red-black tree). In many cases a hash map (e.g. std::unordered_map) may give you better performance as well as better memory usage (fewer allocations and less fragmentation, for example).
Try reducing the size of the data: less data means faster reads from disk, potentially fewer memory allocations, and sometimes better in-memory performance due to better locality. For example, you currently use a float to store each frequency, but you could instead store only the number of occurrences as an unsigned short in the map values, and in a separate map store the total number of words for each document. From those two numbers you can recalculate the frequency on demand, while using less disk space when you save the data, which could result in faster load times (see the sketch after these suggestions).
Your map has quite a few entries, and sometimes using custom memory allocators helps improve performance in such a case.
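As a concrete illustration of the first two suggestions above (a sketch only: the container choice and the exact integer widths are one possible layout, not a measured recommendation):
#include <cstdint>
#include <string>
#include <unordered_map>

// word -> (doc id -> raw occurrence count); counts instead of precomputed floats
std::unordered_map<std::string, std::unordered_map<int, std::uint16_t>> occurrences;
// doc id -> total number of words in that document
std::unordered_map<int, std::uint32_t> doc_length;

// Recompute the term frequency on demand from the two smaller tables.
float termFrequency(const std::string& word, int doc)
{
    auto w = occurrences.find(word);
    if (w == occurrences.end()) return 0.0f;
    auto c = w->second.find(doc);
    auto len = doc_length.find(doc);
    if (c == w->second.end() || len == doc_length.end() || len->second == 0) return 0.0f;
    return static_cast<float>(c->second) / static_cast<float>(len->second);
}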
If your data could potentially grow beyond the size of your machine's RAM, I would suggest you use memory-mapped files for storing the data. Such an approach may require re-modelling your data structures and either using custom STL allocators or using completely custom data structures instead of std::map, but it may improve your performance by an order of magnitude if done well. In particular, this approach frees you from having to load the whole structure into memory at once, so your startup times will improve dramatically, at the cost of slight delays from disk accesses distributed over time as you touch different parts of the structure for the first time. The subject is quite broad and requires much deeper changes to your code than just tuning the map, but if you plan on handling huge data, you should certainly have a look at mmap and friends.

Efficiently and safely assigning unique IDs

I am writing a database and I wish to assign every item of a specific type a unique ID (for internal data management purposes). However, the database is expected to run for a long (theoretically infinite) time and with a high turnover of entries (that is, entries are deleted and added on a regular basis).
If we model our unique ID as an unsigned int, and assume that there will always be fewer than 2^32 - 1 entries in the database (we cannot use 0 as a unique ID), we could do something like the following:
void GenerateUniqueID( Object* pObj )
{
    static unsigned int iCurrUID = 1;
    pObj->SetUniqueID( iCurrUID++ );
}
However, this is only fine until entries start getting removed and new ones added in their place: there may still be fewer than 2^32 - 1 entries, but iCurrUID may overflow and we end up assigning "unique" IDs that are already in use.
One idea I had was to use a std::bitset<std::numeric_limits<unsigned int>::max() - 1> and traverse it to find the first free unique ID, but this has high memory consumption and takes linear time to find a free ID, so I'm looking for a better method if one exists.
Thanks in advance!
I'm aware that changing the datatype from a 32-bit to a 64-bit integer would resolve my problem; however, because I am working in the Win32 environment with lists (where DWORD_PTR is 32 bits), I am looking for an alternative solution. Moreover, the data is sent over a network, and I was trying to reduce bandwidth consumption by using a smaller unique ID.
With a uint64_t (64 bits), it would take you well over 100 years to run out, even if you insert close to 100k entries per second.
Over 100 years you would insert somewhere around 315,360,000,000,000 records (not taking leap years, leap seconds, etc. into account). That number fits into 49 bits.
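A quick compile-time check of that arithmetic (the constants are taken from the figures above):
#include <cstdint>

constexpr std::uint64_t perSecond = 100000ULL;                    // ~100k inserts per second
constexpr std::uint64_t seconds   = 60ULL * 60 * 24 * 365 * 100;  // 100 years, ignoring leap days
constexpr std::uint64_t total     = perSecond * seconds;          // 315,360,000,000,000 records
static_assert(total < (1ULL << 49), "the total fits in 49 bits");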
How long do you anticipate that application will run?
Over 100 years?
This is what database administrators commonly do when an autoincrement field approaches the 32-bit limit: they change it to the native 64-bit (or 128-bit) type of their DB system.
The real question is how many entries you can have before you are guaranteed that the first one has been deleted, and how often you create new entries. An unsigned long long is guaranteed to have a maximum value of at least 2^64 - 1, about 1.8x10^19. Even at one creation per microsecond, this will last for a couple of thousand centuries. Realistically, you're not going to be able to create entries that fast (since disk speed won't allow it), and your program isn't going to run for hundreds of centuries (because the hardware won't last that long). If the unique ids are for something disk-based, you're safe using unsigned long long for the id.
Otherwise, of course, generate as many bits as you think you might need. If you're really paranoid, it's trivial to use a 256-bit unsigned integer, or even longer. At some point, you'll be fine even if every atom in the universe creates a new entry every picosecond, until the end of the universe. (But realistically, unsigned long long should suffice.)

Size/Resize of GHashTable

Here is my use case: I want to use glib's GHashTable with IP addresses as keys and the volume of data sent/received by each IP address as the value. So far I have managed to implement the whole thing in user space, using some kernel variables to look up the volume per IP address.
Now the question: suppose I have a LOT of IP addresses (say 500,000 up to 1,000,000 unique ones). It is really not clear what space is allocated, what initial size a new hash table gets when created with g_hash_table_new()/g_hash_table_new_full(), and how the whole thing works in the background. It is known that resizing a hash table can take a lot of time. So how can we play with these parameters?
Neither g_hash_table_new() nor g_hash_table_new_full() let you specify the size.
The size of a hash table is only available as the number of values stored in it; you don't have access to the actual array size that is typically used in the implementation.
However, the existence of g_spaced_primes_closest() kind of hints that glib's hash table uses a prime-sized internal array.
I would say that although a million keys is quite a lot, it's not extraordinary. Try it, and then measure the performance to determine if it's worth digging deeper.
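For what it's worth, a minimal sketch of the use case (IP string to byte counter) with GLib 2.x looks like this; since neither constructor takes a capacity, the table simply grows as you insert:
#include <glib.h>

int main()
{
    // key: strdup'ed IP string, value: heap-allocated 64-bit byte counter
    GHashTable* volume = g_hash_table_new_full(g_str_hash, g_str_equal,
                                               g_free, g_free);

    const char* ip = "10.0.0.1";
    guint64 bytes = 1500;

    guint64* counter = static_cast<guint64*>(g_hash_table_lookup(volume, ip));
    if (counter == nullptr) {
        counter = g_new0(guint64, 1);
        g_hash_table_insert(volume, g_strdup(ip), counter);
    }
    *counter += bytes;

    g_print("%s -> %" G_GUINT64_FORMAT " bytes\n",
            ip, *static_cast<guint64*>(g_hash_table_lookup(volume, ip)));

    g_hash_table_destroy(volume);
    return 0;
}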

Fastest way to do many small, blind writes on a huge file (in C++)?

I have some very large (>4 GB) files containing (millions of) fixed-length binary records. I want to (efficiently) join them to records in other files by writing pointers (i.e. 64-bit record numbers) into those records at specific offsets.
To elaborate, I have a pair of lists of (key, record number) tuples sorted by key for each join I want to perform on a given pair of files, say, A and B. Iterating through a list pair and matching up the keys yields a list of (key, record number A, record number B) tuples representing the joined records (assuming a 1:1 mapping for simplicity). To complete the join, I conceptually need to seek to each A record in the list and write the corresponding B record number at the appropriate offset, and vice versa. My question is what is the fastest way to actually do this?
Since the list of joined records is sorted by key, the associated record numbers are essentially random. Assuming the file is much larger than the OS disk cache, doing a bunch of random seeks and writes seems extremely inefficient. I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array, and flushing the densest clusters of entries to disk whenever I run out of memory. This has the benefit of greatly increasing the chances that the appropriate records will be cached for a cluster after updating its first pointer. However, even at this point, is it generally better to do a bunch of seeks and blind writes, or to read chunks of the file manually, update the appropriate pointers, and write the chunks back out? While the former method is much simpler and could be optimized by the OS to do the bare minimum of sector reads (since it knows the sector size) and copies (it can avoid copies by reading directly into properly aligned buffers), it seems that it will incur extremely high syscall overhead.
While I'd love a portable solution (even if it involves a dependency on a widely used library, such as Boost), modern Windows and Linux are the only must-haves, so I can make use of OS-specific APIs (e.g. CreateFile hints or scatter/gather I/O). However, this can involve a lot of work to even try out, so I'm wondering if anyone can tell me if it's likely worth the effort.
It looks like you can solve this by use of data structures. You've got three constraints:
Access time must be reasonably quick
Data must be kept sorted
You are on a spinning disk
B+ Trees were created specifically to address the kind of workload you're dealing with here. There are several links to implementations in the linked Wikipedia article.
Essentially, a B+ tree is a search tree in which many keys are grouped together into a single node (a page). That way, instead of having to seek around for each comparison, the B+ tree loads one chunk at a time, and it keeps a bit of information around to know which chunk it's going to need next in a search.
EDIT: If you need to sort by more than one item, you can do something like:
+--------+-------------+-------------+---------+
| Header | B+Tree by A | B+Tree by B | Records |
+--------+-------------+-------------+---------+
I.e. you have a separate B+Tree for each key, and a separate list of records; pointers to the records are stored in the B+ trees.
I've tried partially sorting the record numbers by putting the A->B and B->A mappings in a sparse array, and flushing the densest clusters of entries to disk whenever I run out of memory.
it seems that it will incur extremely high syscall overhead.
You can use memory mapped access to the file to avoid syscall overhead. mmap() on *NIX, and CreateFileMapping() on Windows.
Split the file logically into blocks, e.g. 32 MB each. If something needs to be changed in a block, mmap() it, modify the data, optionally msync() if desired, munmap(), and then move on to the next block.
That is what I would have tried first. The OS will automatically read whatever needs to be read (on first access to the data), and it will queue the I/O any way it likes.
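A minimal POSIX sketch of that block-wise pattern (error handling trimmed; the 32 MB block size and the applyUpdates() callback are illustrative assumptions):
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

const std::size_t BLOCK = 32u * 1024 * 1024;   // 32 MB per mapping; a multiple of the page size

// Assumed to patch whatever offsets fall inside [blockStart, blockStart + len).
void applyUpdates(char* block, std::size_t len, off_t blockStart);

void patchFile(const char* path)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return; }

    for (off_t pos = 0; pos < st.st_size; pos += BLOCK) {
        std::size_t len = (st.st_size - pos) < (off_t)BLOCK
                              ? (std::size_t)(st.st_size - pos) : BLOCK;
        void* p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, pos);
        if (p == MAP_FAILED) break;

        applyUpdates(static_cast<char*>(p), len, pos);  // touch only the pages that need it
        msync(p, len, MS_ASYNC);                        // optional: schedule write-back
        munmap(p, len);                                 // release the mapping, move on
    }
    close(fd);
}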
An important thing to keep in mind is that real I/O isn't that fast. The performance-limiting factors for random access are (1) the number of I/O operations per second (IOPS) the storage can handle and (2) the number of disk seeks. (Typical IOPS are in the hundreds; typical seek latency is 3-5 ms.) Storage can, for example, read/write 50 MB/s: one continuous 50 MB block in one second. But if you try to patch a 50 MB file byte by byte, seek times will simply kill the performance. Up to some limit, it is OK to read more and write more, even if you only need to update a few bytes.
Another limit to observe is the OS's maximum size for a single I/O operation: it depends on the storage, but most OSs will split I/O requests larger than 128 K. The limit can be changed, and it is best if it is kept in sync with the corresponding limit of the storage.
Also keep the storage itself in mind. Many people forget that there is often only one storage device; what I'm trying to say is that starting a crapload of threads doesn't help I/O unless you have multiple storage devices. Even a single CPU/core can easily saturate a RAID10 array with its roughly 800 read IOPS and 400 write IOPS limits. (But a dedicated thread per storage device at least theoretically makes sense.)
Hope that helps. Other people here often mention Boost.Asio which I have no experience with - but it is worth checking.
P.S. Frankly, I would love to hear other (more informative) responses to your question. I've been in the same boat several times already, yet never had the chance to really get to the bottom of it. Books/links/etc. related to I/O optimization (regardless of platform) are welcome ;)
Random disk access tends to be orders of magnitude slower than sequential disk access. So much so that it can be useful to choose algorithms that might sound badly inefficient at first blush. For example, you might try this:
Create your join index, but instead of using it, just write out the list of pairs (A index, B index) to a disk file.
Sort this new file of pairs by the A index. Use a sort algorithm designed for external sorting (though I've not tried it myself, the STXXL library from stxxl.sourceforge.net looked promising when I was researching a similar problem)
Sequentially walk through the A record file and the sorted pair list. Read a huge chunk in, make all the relevant changes in memory, and write the chunk out. Never touch that portion of the A record file again, since the changes you planned to make come in sequential order (see the sketch after this list).
Go back, sort the pair file by the B index (again, using an external sort). Use this to update the B record file in the same manner.
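A rough sketch of the sequential patching in steps 3 and 4, assuming fixed-length records and a pair list of (target record number, value to write) entries already sorted by target; RECORD_SIZE, POINTER_OFFSET and CHUNK_RECORDS are illustrative assumptions:
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

const std::size_t RECORD_SIZE    = 1024;        // fixed record length (assumed)
const std::size_t POINTER_OFFSET = 16;          // where the 64-bit pointer lives in a record
const std::size_t CHUNK_RECORDS  = 64 * 1024;   // records read/written per chunk

struct Pair { std::uint64_t target; std::uint64_t value; };  // sorted by target

// Walk the record file in one forward pass, chunk by chunk, applying all pending
// updates that fall inside the current chunk before writing it back out.
// NOTE: plain fseek() takes a long; for the >4 GB files in the question use fseeko()/_fseeki64().
void patchSequentially(std::FILE* records, const std::vector<Pair>& pairs)
{
    std::vector<char> chunk(RECORD_SIZE * CHUNK_RECORDS);
    std::size_t p = 0;                          // next pair to apply

    while (p < pairs.size()) {
        // start the chunk at the chunk boundary containing the next pending update
        std::uint64_t firstRec = (pairs[p].target / CHUNK_RECORDS) * CHUNK_RECORDS;

        std::fseek(records, (long)(firstRec * RECORD_SIZE), SEEK_SET);
        std::size_t got = std::fread(chunk.data(), RECORD_SIZE, CHUNK_RECORDS, records);
        if (got == 0) break;

        // apply every update whose target record lies inside this chunk
        while (p < pairs.size() && pairs[p].target < firstRec + got) {
            std::size_t off = (std::size_t)(pairs[p].target - firstRec) * RECORD_SIZE + POINTER_OFFSET;
            std::memcpy(chunk.data() + off, &pairs[p].value, sizeof(std::uint64_t));
            ++p;
        }

        std::fseek(records, (long)(firstRec * RECORD_SIZE), SEEK_SET);
        std::fwrite(chunk.data(), RECORD_SIZE, got, records);
    }
}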
Instead of building a list of (key, record number A, record number B) I would leave out the key to save space and just build (record number A, record number B). I'd sort that table or file by the A's, sequentially seek to each A record, write the B number, then sort the list by the B's, sequentially seek to each B record, write the A number.
I'm doing very similar large file manipulations, and these newer machines are so damn fast it doesn't take long at all:
On a cheapo 2.4 GHz HP Pavilion with 3 GB of RAM and 32-bit Vista, writing 3 million sequential 1,008-byte records to a new file takes 56 seconds, using Delphi library routines (as opposed to the Win API).
Sequentially seeking to each record in the file and writing 8 bytes using the Win API FileSeek/FileWrite on a freshly booted machine takes 136 seconds. That's 3 million updates. Immediately rerunning the same code takes 108 seconds, since the OS has some things cached.
Sorting record offsets first, then sequentially updating the files, is the way to go.