Best Data structure to store and search phrases in C++ - c++

I use the trie data structure to store words. Now I have a requirement to find, given a paragraph, whether certain phrases are present in that paragraph.
What would be the most efficient way of doing this? The total number of phrases will not be more than 100.

If I were you, I would just throw something together using boost::multi_index_container first, because then if you get even more requirements later it will be quite easy to extend it further. If later you measure and find that it is not performing adequately, then you can replace it with an optimized data structure.
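For what it's worth, a minimal sketch of that "just throw something together" starting point could look like the following. PhraseSet and phrasesIn are names made up for this example; with at most 100 phrases, a brute-force std::string::find per stored phrase is usually an acceptable first cut, and the multi_index_container leaves room for extra indexes later.

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/identity.hpp>
#include <iostream>
#include <string>
#include <vector>

namespace bmi = boost::multi_index;

// The phrases, kept in a multi_index_container with a single hashed index.
using PhraseSet = bmi::multi_index_container<
    std::string,
    bmi::indexed_by<bmi::hashed_unique<bmi::identity<std::string>>>>;

// Return every stored phrase that occurs verbatim in the paragraph.
std::vector<std::string> phrasesIn(const PhraseSet& phrases, const std::string& paragraph)
{
    std::vector<std::string> found;
    for (const std::string& p : phrases)
        if (paragraph.find(p) != std::string::npos)
            found.push_back(p);
    return found;
}

int main()
{
    PhraseSet phrases;
    phrases.insert("data structure");
    phrases.insert("binary search");
    for (const auto& p : phrasesIn(phrases, "Pick a data structure before you tune anything."))
        std::cout << p << '\n';   // prints "data structure"
}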

The trie specified is suboptimal in numerous ways.
For a start, it constructs multiple nodes per item inserted. As the author writes, "Every character of input key is inserted as an individual trie node." That's a horrible, and unnecessary penalty! The use of an ALPHABET_SIZE greater than 2 adds insult to injury here; not only would a phrase of fifty bytes require fifty nodes, but each node would likely be over one hundred bytes in size... Each item or "phrase" of fifty bytes in length might require up to about 5KB of storage using that code! That's not even the worst of it.
The algorithm provided embeds malloc internally, making it quite difficult to optimise. Each node is its own allocation, making insert very malloc-heavy. Allocation details should be separated from data structure processing, if not for the purpose of optimisation then for simplicity of use. Programs that use this code heavily are likely to run into performance issues related to memory fragmentation and/or cache misses, with no easy or significant optimisation in sight except for substituting the trie for something else.
That's not the only thing wrong here... This code isn't too portable, either! If you end up running this on an old (not that old; they do still exist!) mainframe that uses EBCDIC rather than ASCII, this code will produce buffer overflows, and the programmer (you) will be called in to fix it. <sarcasm>That's so optimal, right?</sarcasm>
I've written a PATRICIA trie implementation that uses exactly one node per item, an alphabet size of two (it uses the bits of each character, rather than each character) and allows you to use whichever allocation you wish... alas, I haven't yet put a lot of effort into refactoring its interface, but it should be fairly close to optimal. You can find that implementation here. You can see examples of inserting (using patricia_add), retrieving (using patricia_get) and removing (using patricia_remove) in the patricia_test.c testcase file.

Related

Implementing binary search in a sorted text file?

Is there a way to search the file directly, instead of copying its data into memory first?
In theory: yes, but it will be quite inefficient.
I'd recommend putting the data in an SQLite database; that way you still have a single file, but can nicely query/search for entries.
tl;dr: Yes, but it's often not worth it
You neglected to mention how the text file is sorted, exactly, and whether there are escaped characters, quotation marks, multi-octet characters etc. - these would all impact the answer.
But let's make the following assumptions:
Plain printable ASCII text, with no newlines in each string.
Newlines (i.e. 0xA characters) separate strings.
This is still not enough for a set of assumptions, because - maybe some of the strings are much longer than the others? In fact, what about the not-that-extreme case of n strings overall, but a few of them take up most of the characters? If you start sampling characters in the file, you'll need to go back and forth, linearly, at least to both edges of a single string (or forward until you hit a newline twice).
So let's add more assumptions, although frankly - they're quite invalid:
You know the minimum Min and maximum Max string lengths.
The ratio R of maximum to minimum string length is not very high.
This makes it at least theoretically reasonable to start reading from some arbitrary point in the file, and look for a complete string. However, files are usually on disks; and disks are accessed by blocks. So for reading even a single character from the file you need to read a whole block of size B (think of B as, say, 1 KiB as a reasonable example). We'll assume Max < B, otherwise you're in the huge-strings case.
Another point to be made is that disk latencies are high. This is especially true for magnetic (or optical) disks, where you can wait as much as 10 msec for a single read! If you read sequentially, there's no need to "seek" or look up the position you're interested in, and you could make use of the disk's full bandwidth. This is less of a problem with SSDs, but it's still not negligible.
So, as you can see, there's quite a bit of overhead for your binary search. It may still be worth it if your file is really, really large relative to Min, Max, R and B. So in a file of several gigabytes, I'd certainly consider it. Otherwise - probably not worth the bother.
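If you do decide it's worth it, a rough sketch under the assumptions above (plain ASCII, '\n'-separated, sorted ascending, strings much shorter than the search range) might look like this. findLine and lineAfter are illustrative names, not from any library: the search bisects on byte offsets, skips the partial line it lands in, and finishes with a short linear scan.

#include <fstream>
#include <string>

// Return the first complete line that starts after offset `pos`
// (for pos == 0, the very first line of the file).
static bool lineAfter(std::ifstream& f, std::streamoff pos, std::string& line)
{
    f.clear();                                        // clear any EOF flag from a previous probe
    f.seekg(pos);
    if (pos != 0) {
        std::string partial;
        if (!std::getline(f, partial)) return false;  // skip the line we landed inside
    }
    return static_cast<bool>(std::getline(f, line));
}

// Binary search for `key` in a file whose lines are sorted ascending.
bool findLine(const std::string& path, const std::string& key)
{
    std::ifstream f(path, std::ios::binary);
    if (!f) return false;
    f.seekg(0, std::ios::end);
    std::streamoff lo = 0, hi = f.tellg();
    std::string line;
    // Invariant: every line starting at an offset <= lo is < key (or lo is still 0).
    while (hi - lo > 1) {
        std::streamoff mid = lo + (hi - lo) / 2;
        if (lineAfter(f, mid, line) && line < key)
            lo = mid;                                 // everything up to mid is still < key
        else
            hi = mid;                                 // the key, if present, starts at or before mid
    }
    // Finish with a linear scan over the few remaining candidate lines.
    if (!lineAfter(f, lo, line)) return false;
    do {
        if (line == key) return true;
    } while (line < key && std::getline(f, line));
    return false;
}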

Algorithm for ordering strings to and from disk efficiently using minimal internal memory resources

I have a very large amount (multiple terabytes) of strings stored on disk that I need to sort alphabetically and store in another file as quickly as possible (preferably in C/C++), using as little internal memory as possible. It is not an option to pre-index the strings beforehand, so I need to sort the strings whenever needed, in a close to real-time fashion.
What would be the best algorithm to use in my case? I would prefer a suggestion for a linear algorithm rather than just a link to an existing software library like Lucene.
You usually sort huge external data by chunking it into smaller pieces, operating on them and eventually merging them back. When choosing the sorting algorithm you usually take a look at your requirements:
If you need a time-complexity guarantee that is also stable, you can go for a merge sort (O(n log n) guaranteed), although it requires an additional O(n) space.
If you are severely memory-bound, you might want to try Smoothsort (constant memory, O(n log n) time).
Otherwise you might want to take a look at the research in the GPGPU accelerator field, like GPUTeraSort.
Google's servers usually have this sort of problem.
Simply construct a digital tree (trie).
Memory use will be much smaller than the input data, because many words share a common prefix. While adding data to the tree, mark (by incrementing a counter) the last node of each word as an end of word. Once all words have been added, do a DFS (visiting children in whatever order you want the output sorted, e.g. a->z) and write the data out to the file. The time complexity matches the memory size. It is hard to state the complexity exactly because it depends on the strings (many short strings give better results), but it is still much better than the input size, O(n*k), where n is the number of strings and k is the average string length.
PS. To deal with the memory limit, you can split the file into smaller parts, sort each of them with this method, and then, if you end up with e.g. 1000 files, remember the current first word of each (like queues), repeatedly output the smallest one and read in the next word from that file; this merge runs very quickly.
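A minimal sketch of the trie itself (restricted to lowercase a-z for brevity; a real version would cover the full byte range and stream the parts to disk as in the PS):

#include <array>
#include <iostream>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child{};
    int endCount = 0;                        // how many inserted words end at this node
};

void insert(TrieNode& root, const std::string& word)
{
    TrieNode* n = &root;
    for (char c : word) {
        auto& slot = n->child[c - 'a'];
        if (!slot) slot = std::make_unique<TrieNode>();
        n = slot.get();
    }
    ++n->endCount;
}

// DFS in a->z order emits the words in sorted order (duplicates included).
void dfs(const TrieNode& n, std::string& prefix, std::ostream& out)
{
    for (int i = 0; i < n.endCount; ++i) out << prefix << '\n';
    for (int c = 0; c < 26; ++c) {
        if (n.child[c]) {
            prefix.push_back(static_cast<char>('a' + c));
            dfs(*n.child[c], prefix, out);
            prefix.pop_back();
        }
    }
}

int main()
{
    TrieNode root;
    for (const char* w : {"banana", "apple", "apple", "cherry"})
        insert(root, w);
    std::string prefix;
    dfs(root, prefix, std::cout);            // apple, apple, banana, cherry
}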
I suggest you use the Unix "sort" command that can easily handle such files.
See How could the UNIX sort command sort a very large file? .
Before disk drives even existed, people wrote programs to sort lists that were far too large to hold in main memory.
Such programs are known as external sorting algorithms.
My understanding is that the Unix "sort" command uses the merge sort algorithm.
Perhaps the simplest version of the external sorting merge sort algorithm works like this (quoting from Wikipedia: merge sort):
Name four tape drives as A, B, C, D, with the original data on A:
Merge pairs of records from A; writing two-record sublists alternately to C and D.
Merge two-record sublists from C and D into four-record sublists; writing these alternately to A and B.
Merge four-record sublists from A and B into eight-record sublists; writing these alternately to C and D
Repeat until you have one list containing all the data, sorted --- in log2(n) passes.
Practical implementations typically have many tweaks:
Almost every practical implementation takes advantage of available RAM by reading many items into RAM at once, using some in-RAM sorting algorithm, rather than reading only one item at a time.
some implementations are able to sort lists even when some or every item in the list is too large to hold in the available RAM.
polyphase merge sort
As suggested by Kaslai, rather than only 4 intermediate files, it is usually quicker to use 26 or more intermediate files. However, as the external sorting article points out, if you divide up the data into too many intermediate files, the program spends a lot of time waiting for disk seeks; too many intermediate files make it run slower.
As Kaslai commented, using larger RAM buffers for each intermediate file can significantly decrease the sort time -- doubling the size of each buffer halves the number of seeks. Ideally each buffer should be sized so the seek time is a relatively small part of the total time to fill that buffer. Then the number of intermediate files should be picked so the total size of all those RAM buffers put together comes close to but does not exceed available RAM. (If you have very short seek times, as with an SSD, the optimal arrangement ends up with many small buffers and many intermediate files. If you have very long seek times, as with tape drives, the optimal arrangement ends up with a few large buffers and few intermediate files. Rotating disk drives are intermediate.)
etc. -- See the Knuth book "The Art of Computer Programming, Vol. 3: Sorting and Searching" for details.
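To make the merge phase concrete, here is a rough sketch of merging k sorted run files into one output with a min-heap (mergeRuns is an illustrative name; it relies on the default stream buffers, whereas a tuned version would use the large per-file RAM buffers discussed above):

#include <cstddef>
#include <fstream>
#include <queue>
#include <string>
#include <vector>

void mergeRuns(const std::vector<std::string>& runPaths, const std::string& outPath)
{
    struct Head { std::string line; std::size_t run; };
    auto byLine = [](const Head& a, const Head& b) { return a.line > b.line; };
    std::priority_queue<Head, std::vector<Head>, decltype(byLine)> heap(byLine);

    std::vector<std::ifstream> runs;
    for (const auto& p : runPaths) runs.emplace_back(p);

    // Prime the heap with the first line of every run.
    for (std::size_t i = 0; i < runs.size(); ++i) {
        std::string line;
        if (std::getline(runs[i], line)) heap.push({std::move(line), i});
    }

    std::ofstream out(outPath);
    while (!heap.empty()) {
        Head h = heap.top();                 // smallest current line across all runs
        heap.pop();
        out << h.line << '\n';
        std::string next;
        if (std::getline(runs[h.run], next)) heap.push({std::move(next), h.run});
    }
}

// Usage: mergeRuns({"run0.txt", "run1.txt", "run2.txt"}, "sorted.txt");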
Use as much memory as you can and chunk your data. Read one chunk at a time into memory.
Step 1) Sort entries inside chunks
For each chunk:
Use IntroSort to sort your chunk. But to avoid copying your strings around and having to deal with variable-sized strings and memory allocations (at this point it becomes interesting and relevant whether you actually have fixed- or max-size strings or not), preallocate a std::array or other fitting container of pointers to your strings, each pointing to a memory region inside the current data chunk. => So your IntroSort swaps the pointers to your strings, instead of swapping the actual strings.
Loop over each entry in your sort array and write the resulting (ordered) strings back to a corresponding sorted strings file for this chunk (a small sketch of this follows below).
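A small sketch of this "swap pointers, not strings" step, assuming the chunk sits in one contiguous buffer of '\n'-separated strings (sortChunk is a made-up helper; the returned pointers point into the chunk, so the chunk must stay alive while they are used):

#include <algorithm>
#include <cstring>
#include <string>
#include <vector>

std::vector<const char*> sortChunk(std::string& chunk)      // chunk: raw bytes of one data chunk
{
    std::vector<const char*> items;
    // Replace separators with NULs so each entry is a C string, and record where it starts.
    char* p = &chunk[0];
    char* end = p + chunk.size();
    while (p < end) {
        items.push_back(p);
        char* nl = std::find(p, end, '\n');
        if (nl == end) break;                               // last entry uses the string's terminating NUL
        *nl = '\0';
        p = nl + 1;
    }
    // IntroSort (std::sort) over the pointers only; the string bytes never move.
    std::sort(items.begin(), items.end(),
              [](const char* a, const char* b) { return std::strcmp(a, b) < 0; });
    return items;
}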
Step 2) Merge all strings from sorted chunks into resulting sorted strings file
Allocate a "sliding" window memory region for all sorted strings files at once. To give an example: If you have 4 sorted strings files, allocate 4 * 256MB (or whatever fits, the larger the less (sequential) disk IO reads required).
Fill each window by reading the strings into it (so, read as much strings at once as your window can store).
Use MergeSort to compare any of your chunks, using a comparator to your window (e.g. stringInsideHunkA = getStringFromWindow(1, pointerToCurrentWindow1String) - pointerToCurrentWindow1String is a reference that the function advances to the next string). Note that if the string pointer to your window is beyond the window size (or the last record didn't fit to the window read the next memory region of that chunk into the window.
Use mapped IO (or buffered writer) and write the resulting strings into a giant sorted strings final
I think this could be an IO efficient way. But I've never implemented such thing.
However, in regards to your file size and yet unknown to me "non-functional" requirements, I suggest you to also consider benchmarking a batch-import using LevelDB [1]. It's actually very fast, minimizes disk IO, and even compresses your resulting strings file to about half the size without impact on speed.
[1] http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html
Here is a general algorithm that will be able to do what you want with just a few gigs of memory. You could get away with much less, but the more you have, the less disk overhead you have to deal with. This assumes that all of the strings are in a single file, however could be applied to a multiple file setup.
1: Create some files to store loosely sorted strings in. For terabytes of data, you'd probably want 676 of them. One for strings starting with "aa", one for "ab", and so on until you get to "zy" and "zz" (a small sketch of this bucketing follows the list below).
2: For each file you created, create a corresponding buffer in memory. A std::vector<std::string> perhaps.
3: Determine a buffer size that you want to work with. This should not be much more than 1/2 of your available physical memory.
4: Load as many strings as you can into this buffer.
5: Truncate the file so that the strings in your buffer are no longer on disk. This step can be delayed for later or omitted entirely if you have the disk space to work with or the data is too sensitive to lose in the case of process failure. If truncating, make sure you load your strings from the end of the file, so that the truncation is almost a NOP.
6: Iterate over the strings and store them in their corresponding buffer.
7: Flush all of the buffers to their corresponding files. Clear all the buffers.
8: Go to step 4 and repeat until you have exhausted your source of strings.
9: Read each file to memory and sort it with whatever algorithm you fancy. On the off chance you end up with a file that is larger than your available physical memory, use a similar process from above to split it into smaller files.
10: Overwrite the unsorted file with this new sorted file, or append it to a monolithic file.
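For illustration, the two-letter bucket routing used in step 1 could be as simple as this (bucketIndex and bucketFile are made-up helpers; strings that don't begin with two letters go to a catch-all bucket):

#include <cctype>
#include <string>

// Map "apple" -> bucket index 0*26 + 15 (i.e. "ap"); non-letter prefixes -> 676.
int bucketIndex(const std::string& s)
{
    if (s.size() < 2 ||
        !std::isalpha(static_cast<unsigned char>(s[0])) ||
        !std::isalpha(static_cast<unsigned char>(s[1])))
        return 676;
    int a = std::tolower(static_cast<unsigned char>(s[0])) - 'a';
    int b = std::tolower(static_cast<unsigned char>(s[1])) - 'a';
    return a * 26 + b;
}

// File backing a given bucket, e.g. index 15 -> "bucket_ap.txt".
std::string bucketFile(int index)
{
    if (index == 676) return "bucket_other.txt";
    std::string name = "bucket_";
    name += static_cast<char>('a' + index / 26);
    name += static_cast<char>('a' + index % 26);
    return name + ".txt";
}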
If you keep the individual files rather than a monolithic file, you can make insertions and deletions relatively quickly. You would only have to load in, insert, and sort the value into a single file that can be read entirely into memory. Now and then you might have to split a file into smaller files, however this merely amounts to looking around the middle of the file for a good place to split it and then just moving everything after that point to another file.
Good luck with your project.

remove all duplicate records efficiently

I have a file which might be 30 GB or more. Each line in this file is called a record and is composed of 2 columns, like this:
id1 id2
Both of these ids are integers (32-bit). My job is to write a program to remove all duplicate records, make the records unique, and finally output the unique id2 values into a file.
There are some constraints: at most 30 GB of memory is allowed, and the job had better be done efficiently by a non-multithread/process program.
Initially I came up with an idea: because of the memory constraint, I decided to read the file n times, each time only keeping in memory those records with id1 % n = i (i = 0, 1, 2, ..., n-1). The data structure I use is a std::map<int, std::set<int> >: it takes id1 as the key, and puts id2 into id1's std::set.
This way, the memory constraint will not be violated, but it's quite slow. I think that's because as the std::map and std::set grow larger, the insertion speed goes down. Moreover, I need to read the file n times, and when each round is done I have to clear the std::map for the next round, which also costs some time.
I also tried hashing, but it doesn't satisfy me either; I thought there might be too many collisions even with 3 million buckets.
So, I post my problem here, hoping you guys can offer me a better data structure or algorithm.
Thanks a lot.
PS
Scripts (shell, Python) are welcome, if they can do it efficiently.
Unless I overlooked a requirement, it should be possible to do this on the Linux shell as
sort -u inputfile > outputfile
Many implementations enable you to use sort in a parallelised manner as well:
sort --parallel=4 -u inputfile > outputfile
for up to four parallel executions.
Note that sort might use a lot of space in /tmp temporarily. If you run out of disk space there, you may use the -T option to point it to an alternative place on disk to use as temporary directory.
(Edit:) A few remarks about efficiency:
A significant portion of the time spent during execution (of any solution to your problem) will be spent on IO, something that sort is highly optimised for.
Unless you have an extreme amount of RAM, your solution is likely to end up performing some of the work on disk (just like sort). Again, optimising this means a lot of work, while for sort all of that work has been done.
One disadvantage of sort is that it operates on string representations of the input lines. If you were to write your own code, one thing you could do (similar to what you suggested already) is to convert the input lines to 64-bit integers and hash them. If you have enough RAM, that may be a way to beat sort in terms of speed, if you get IO and integer conversions to be really fast. I suspect it may not be worth the effort as sort is easy to use and – I think – fast enough.
I just don't think you can do this efficiently without using a bunch of disk. Any form of data structure will introduce so much memory and/or storage overhead that your algorithm will suffer. So I would expect a sorting solution to be best here.
I reckon you can sort large chunks of the file at a time, and then merge (ie from merge-sort) those chunks after. After sorting a chunk, obviously it has to go back to disk. You could just replace the data in the input file (assuming it's binary), or write to a temporary file.
As far as the records, you just have a bunch of 64-bit values. With 30GB RAM, you can hold almost 4 billion records at a time. That's pretty sweet. You could sort that many in-place with quicksort, or half that many with mergesort. You probably won't get a contiguous block of memory that size. So you're going to have to break it up. That will make quicksort a little trickier, so you might want to use mergesort in RAM as well.
During the final merge it's trivial to discard duplicates. The merge might be entirely file-based, but at worst you'll use an amount of disk equivalent to twice the number of records in the input file (one file for scratch and one file for output). If you can use the input file as scratch, then you have not exceeded your RAM limits OR your disk limits (if any).
I think the key here is the requirement that it shouldn't be multithreaded. That lends itself well to disk-based storage. The bulk of your time is going to be spent on disk access. So you wanna make sure you do that as efficiently as possible. In particular, when you're merge-sorting you want to minimize the amount of seeking. You have large amounts of memory as buffer, so I'm sure you can make that very efficient.
So let's say your file is 60GB (and I assume it's binary) so there's around 8 billion records. If you're merge-sorting in RAM, you can process 15GB at a time. That amounts to reading and (over)writing the file once. Now there are four chunks. If you want to do pure merge-sort then you always deal with just two arrays. That means you read and write the file two more times: once to merge each 15GB chunk into 30GB, and one final merge on those (including discarding of duplicates).
I don't think that's too bad. Three times in and out. If you figure out a nice way to quicksort then you can probably do this with one fewer pass through the file. I imagine a data structure like deque would work well, as it can handle non-contiguous chunks of memory... But you'd probably wanna build your own and finely tune your sorting algorithm to exploit it.
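As a sketch of the in-RAM part of that approach: pack each (id1, id2) record into a single 64-bit value, sort, and drop duplicates with std::unique. The chunking and on-disk merging described above are omitted; pack, dedupChunk and id2Of are just names for the example.

#include <algorithm>
#include <cstdint>
#include <vector>

inline std::uint64_t pack(std::uint32_t id1, std::uint32_t id2)
{
    return (static_cast<std::uint64_t>(id1) << 32) | id2;
}

inline std::uint32_t id2Of(std::uint64_t packed)         // for the final id2 output step
{
    return static_cast<std::uint32_t>(packed & 0xffffffffu);
}

// Sort one in-memory chunk of packed records and remove duplicates in place.
void dedupChunk(std::vector<std::uint64_t>& records)
{
    std::sort(records.begin(), records.end());
    records.erase(std::unique(records.begin(), records.end()), records.end());
}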
Instead of std::map<int, std::set<int> > use std::unordered_multimap<int,int>. If you can not use C++11 - write your own.
The std::map is node based and it calls malloc on each insertion; this is probably why it is slow. With an unordered map (hash table), if you know the number of records, you can pre-allocate. Even if you don't, the number of mallocs will be O(log N) instead of O(N) with std::map.
I can bet this will be several times faster and more memory efficient than using external sort -u.
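A rough sketch of that suggestion, with a small helper (insertUnique, a name made up here) that inserts a pair only if it is not already present under its id1 key:

#include <cstdio>
#include <unordered_map>

using PairMap = std::unordered_multimap<int, int>;

// Returns true if (id1, id2) was new and has been inserted.
bool insertUnique(PairMap& m, int id1, int id2)
{
    auto range = m.equal_range(id1);
    for (auto it = range.first; it != range.second; ++it)
        if (it->second == id2)
            return false;                    // duplicate record, skip it
    m.emplace(id1, id2);
    return true;
}

int main()
{
    PairMap m;
    m.reserve(1u << 20);                     // pre-allocate buckets if the record count is roughly known
    int id1, id2;
    while (std::scanf("%d %d", &id1, &id2) == 2)
        if (insertUnique(m, id1, id2))
            std::printf("%d %d\n", id1, id2);   // each unique record exactly once
}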
This approach may help when there are not too many duplicate records in the file.
1st pass. Allocate most of the memory for a Bloom filter. Hash every pair from the input file and put the result into the Bloom filter. Write each duplicate found by the Bloom filter into a temporary file (this file will also contain some amount of false positives, which are not duplicates).
2nd pass. Load the temporary file and construct a map from its records. The key is std::pair<int,int>, the value is a boolean flag. This map may be implemented either as std::unordered_map/boost::unordered_map, or as std::map.
3rd pass. Read the input file again, search for each record in the map, and output its id2 if it is either not found or its flag is not yet set, then set this flag.
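A bare-bones sketch of the Bloom filter used in the 1st pass (PairBloom is an illustrative class, not a library type; a pair whose bits are all already set is only a candidate duplicate and may be a false positive, which is exactly what the later passes resolve):

#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

class PairBloom {
public:
    explicit PairBloom(std::size_t bitCount) : bits_(bitCount) {}

    // Returns true if the pair was *possibly* seen before (all k bits already set).
    bool testAndSet(std::uint32_t id1, std::uint32_t id2)
    {
        std::uint64_t key = (static_cast<std::uint64_t>(id1) << 32) | id2;
        bool maybeSeen = true;
        for (std::uint64_t seed : {0x9e3779b97f4a7c15ULL, 0xc2b2ae3d27d4eb4fULL, 0x165667b19e3779f9ULL}) {
            std::size_t idx = std::hash<std::uint64_t>{}(key ^ seed) % bits_.size();
            if (!bits_[idx]) { maybeSeen = false; bits_[idx] = true; }
        }
        return maybeSeen;
    }

private:
    std::vector<bool> bits_;
};

// 1st pass usage: if testAndSet(id1, id2) returns true, write the pair to the temporary file.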

C++ Qt WordCount and large data sets

I want to count word occurrences in a set of plain text files, just like here: http://doc.trolltech.com/4.5/qtconcurrent-wordcount-main-cpp.html
The problem is that I need to process a very big amount of plain text files, so my result stored in a QMap could not fit into memory.
I googled external-memory (file-based) merge sort algorithms, but I'm too lazy to implement one myself. So I want to divide the result set into portions so that each of them fits into memory, then store those portions in files on disk, then call a magic function mergeSort(QList, result_file) and have the final result in result_file.
Does anyone know a Qt-compatible implementation of this algorithm?
In short, I'm looking for an analog of Python's heapq.merge (http://docs.python.org/library/heapq.html#heapq.merge), but for Qt containers.
You might wanna check out this one:
http://stxxl.sourceforge.net/
It's not exactly what you are looking for (close enough, though), but I guess you will not find exactly what you want working with Qt lists. Since you are implementing the algorithm that creates this list, changing its type shouldn't be a problem. As far as I remember, you can use the standard STL sorting algorithms on those lists. The only remaining problem is performance.
I presume that the map contains the association between the word and its number of occurrences. In this case, why do you say you have such significant memory consumption? How many distinct words and forms could you have, and what is the average memory consumption for one word?
Considering 1,000,000 words, with 1 KB of memory consumption per word (which includes the word text and the QMap-specific storage), that would lead to (approx.) 1 GB of memory, which... doesn't seem like that much to me.

mmap-loadable data structure library for C++ (or C)

I have a fairly large data structure (N > 10,000) that usually only needs to be created once (at runtime) and can be reused many times afterwards, but it needs to be loaded very quickly. (It is used for user input processing on iPhoneOS.) mmap-ing a file seems to be the best choice.
Are there any data structure libraries for C++ (or C)? Something along the lines of
ReadOnlyHashTable<char, int> table ("filename.hash");
// mmap(...) inside the c'tor
...
int freq = table.get('a');
...
// munmap(...); inside the d'tor.
Thank you!
Details:
I've written a similar hash table class myself, but I find it pretty hard to maintain, so I would like to see if there are existing solutions already. The library should:
Contain a creation routine that serializes the data structure into a file. This part doesn't need to be fast.
Contain a loading routine that mmaps a file into a read-only (or read-write) data structure that is usable within O(1) steps of processing.
Use O(N) amount of disk/memory space with a small constant factor. (The device has serious memory constraint.)
Small time overhead to accessors. (i.e. the complexity isn't modified.)
Assumptions:
Bit representation of data (e.g. endianness, encoding of float, etc.) does not matter since it is only used locally.
So far the possible types of data I need are integers, strings, and struct's of them. Pointers do not appear.
P.S. Can Boost.intrusive help?
You could try to create a memory-mapped file and then create the STL map structure with a custom allocator. Your custom allocator then simply takes the beginning of the memory of the memory-mapped file, and then increments its pointer according to the requested size.
In the end all the allocated memory should be within the memory of the memory-mapped file and should be reloadable later.
You will have to check whether memory is freed by the STL map. If it is, your custom allocator will lose some memory of the memory-mapped file, but if this is limited you can probably live with it.
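A minimal sketch of such a bump allocator (Arena and ArenaAllocator are made-up names; the region would come from mmap in real code, and re-mapping the file at a different base address later, which is the hard part of making the structure truly reloadable, is not handled here):

#include <cstddef>
#include <functional>
#include <map>
#include <new>

struct Arena {
    char*       base;                       // start of the mapped region
    std::size_t size;                       // total bytes available
    std::size_t used = 0;                   // bump offset
};

template <class T>
struct ArenaAllocator {
    using value_type = T;
    Arena* arena;

    explicit ArenaAllocator(Arena* a) : arena(a) {}
    template <class U> ArenaAllocator(const ArenaAllocator<U>& o) : arena(o.arena) {}

    T* allocate(std::size_t n)
    {
        std::size_t bytes = n * sizeof(T);
        std::size_t aligned = (arena->used + alignof(T) - 1) & ~(alignof(T) - 1);
        if (aligned + bytes > arena->size) throw std::bad_alloc();
        arena->used = aligned + bytes;
        return reinterpret_cast<T*>(arena->base + aligned);
    }
    void deallocate(T*, std::size_t) {}     // freed memory is simply lost, as noted above
};

template <class T, class U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) { return a.arena == b.arena; }
template <class T, class U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) { return !(a == b); }

// Example: a std::map whose nodes live inside the arena.
using Alloc = ArenaAllocator<std::pair<const char, int>>;
using FreqMap = std::map<char, int, std::less<char>, Alloc>;
// alignas(std::max_align_t) char buffer[1 << 20];      // stand-in for the mmap'ed region
// Arena arena{buffer, sizeof(buffer)};
// FreqMap table{std::less<char>(), Alloc(&arena)};
// table['a'] = 42;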
Sounds like maybe you could use one of the "perfect hash" utilities out there. These spend some time optimising the hash function for the particular data, so that there are no hash collisions and (for minimal perfect hash functions) so that there are no (or at least few) empty gaps in the hash table. Obviously, this is intended to be generated rarely but used frequently.
CMPH claims to cope with large numbers of keys. However, I have never used it.
There's a good chance it only generates the hash function, leaving you to use that to generate the data structure. That shouldn't be especially hard, but it possibly still leaves you where you are now - maintaining at least some of the code yourself.
Just thought of another option - Datadraw. Again, I haven't used this, so no guarantees, but it does claim to be a fast persistent database code generator.
WRT boost.intrusive, I've just been having a look. It's interesting. And annoying, as it makes one of my own libraries look a bit pointless.
I thought this section looked particularly relevant.
If you can use "smart pointers" for links, presumably the smart pointer type can be implemented using a simple offset-from-base-address integer (and I think that's the point of the example). An array subscript might be equally valid.
There's certainly unordered set/multiset support (C++ code for hash tables).
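To illustrate the offset-from-base idea (this is just a sketch, not Boost code): a link stores its distance from the start of the mapped region instead of a raw address, so it stays meaningful even if the file is mapped at a different base next time.

#include <cstdint>

template <class T>
class OffsetPtr {
public:
    OffsetPtr() = default;
    OffsetPtr(const char* base, const T* target)
        : offset_(target ? static_cast<std::uint32_t>(
                               reinterpret_cast<const char*>(target) - base)
                         : kNull) {}

    // Resolve against the base address of the *current* mapping.
    T* get(char* base) const
    {
        return offset_ == kNull ? nullptr : reinterpret_cast<T*>(base + offset_);
    }

private:
    static constexpr std::uint32_t kNull = 0xffffffffu;
    std::uint32_t offset_ = kNull;
};

// A node in the mapped structure would then hold OffsetPtr<Node> links instead
// of Node* pointers, and resolve them against the mapping's base on access.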
Using cmph would work. It does have the serialization machinery for the hash function itself, but you still need to serialize the keys and the data, besides adding a layer of collision resolution on top of it if your query set universe is not known beforehand. If you know all keys beforehand, then it is the way to go, since you don't need to store the keys and will save a lot of space. If not, for such a small set, I would say it is overkill.
Probably the best option is to use google's sparse_hash_map. It has very low overhead and also has the serialization hooks that you need.
http://google-sparsehash.googlecode.com/svn/trunk/doc/sparse_hash_map.html#io
GVDB (GVariant Database), the core of Dconf, is exactly this.
See git.gnome.org/browse/gvdb, dconf and bv
and developer.gnome.org/glib/2.30/glib-GVariant.html