C++ Qt WordCount and large data sets

I want to count word occurrences in a set of plain text files, just like the Qt Concurrent word count example: http://doc.trolltech.com/4.5/qtconcurrent-wordcount-main-cpp.html
The problem is that I need to process a very large number of plain text files, so the result stored in a QMap cannot fit into memory.
I googled external-memory (file-based) merge sort algorithms, but I'm too lazy to implement one myself. So I want to divide the result set into portions that each fit into memory, store those portions in files on disk, then call a magic function mergeSort(QList, result_file) and have the final result in result_file.
Does anyone know of a Qt-compatible implementation of this algorithm?
In short, I'm looking for an analog of Python's heapq.merge (http://docs.python.org/library/heapq.html#heapq.merge), but for Qt containers.

You might want to check out this one:
http://stxxl.sourceforge.net/
It's not exactly what you are looking for (close enough, though), but I guess you will not find exactly what you want working with Qt lists. Since you are implementing the algorithm that creates this list, changing its type shouldn't be a problem. As far as I remember, you can use the standard STL sorting algorithms on those lists. The only remaining problem is performance.

I presume that the map contains the association between each word and its number of occurrences. In that case, why do you say you have such significant memory consumption? How many distinct words and forms could you have, and what is the average memory consumption per word?
Considering 1,000,000 words, with 1 KB of memory consumption per word (that includes the word text and the QMap-specific storage), that would lead to (approximately) 1 GB of memory, which... doesn't seem like that much to me.

Related

Best Data structure to store and search phrases in C++

I use the trie data structure to store words. Now I have a requirement to find, given a paragraph, whether certain phrases are present in that paragraph.
What would be the most efficient way of doing this? The total number of phrases will not be more than 100.
If I were you, I would just throw something together using boost::multi_index_container first, because then if you get even more requirements later it will be quite easy to extend it further. If later you measure and find that it is not performing adequately, then you can replace it with an optimized data structure.
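To make that concrete, here is a minimal sketch of what such a container could look like; the Phrase struct, the particular choice of indexes, and the brute-force containment check are my assumptions, not something from the question.

#include <boost/multi_index_container.hpp>
#include <boost/multi_index/hashed_index.hpp>
#include <boost/multi_index/ordered_index.hpp>
#include <boost/multi_index/member.hpp>
#include <string>

struct Phrase {
    std::string text;
};

namespace bmi = boost::multi_index;

// Phrases stored once, reachable through two indexes: a hashed one for exact
// lookup and an ordered one that could later serve prefix/range queries.
using PhraseSet = bmi::multi_index_container<
    Phrase,
    bmi::indexed_by<
        bmi::hashed_unique<bmi::member<Phrase, std::string, &Phrase::text>>,
        bmi::ordered_unique<bmi::member<Phrase, std::string, &Phrase::text>>
    >
>;

// With at most 100 phrases, scanning the paragraph once per phrase is usually
// fast enough; iterating the container walks its first (hashed) index.
bool anyPhraseIn(const PhraseSet& phrases, const std::string& paragraph) {
    for (const Phrase& p : phrases)
        if (paragraph.find(p.text) != std::string::npos)
            return true;
    return false;
}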
The trie specified is suboptimal in numerous ways.
For a start, it constructs multiple nodes per item inserted. As the author writes, "Every character of input key is inserted as an individual trie node." That's a horrible and unnecessary penalty! The use of an ALPHABET_SIZE greater than 2 adds insult to injury here; not only would a phrase of fifty bytes require fifty nodes, but each node would likely be over one hundred bytes in size... Each item or "phrase" of fifty bytes in length might require up to about 5KB of storage using that code! That's not even the worst of it.
The algorithm provided embeds malloc internally, making it quite difficult to optimise. Each node is its own allocation, making insert very malloc-heavy. Allocation details should be separated from data structure processing, if not for the purpose of optimisation then for simplicity of use. Programs that use this code heavily are likely to run into performance issues related to memory fragmentation and/or cache misses, with no easy or significant optimisation in sight except for substituting the trie for something else.
That's not the only thing wrong here... This code isn't too portable, either! If you end up running this on an old (not that old; they do still exist!) mainframe that uses EBCDIC rather than ASCII, this code will produce buffer overflows, and the programmer (you) will be called in to fix it. <sarcasm>That's so optimal, right?</sarcasm>
I've written a PATRICIA trie implementation that uses exactly one node per item, an alphabet size of two (it uses the bits of each character, rather than each character) and allows you to use whichever allocation you wish... alas, I haven't yet put a lot of effort into refactoring its interface, but it should be fairly close to optimal. You can find that implementation here. You can see examples of inserting (using patricia_add), retrieving (using patricia_get) and removing (using patricia_remove) in the patricia_test.c testcase file.

Algorithm for ordering strings to and from disk efficiently using minimal internal memory resources

I have a very large (multiple terabytes) amount of strings stored on disk that I need to sort alphabetically and store in another file as quickly as possible (preferably in C/C++), using as little internal memory as possible. It is not an option to pre-index the strings beforehand, so I need to sort the strings whenever needed, in a close to real-time fashion.
What would be the best algorithm to use in my case? I would prefer a suggestion for a linear algorithm rather than just a link to an existing software library like Lucene.
You usually sort huge external data by chunking it into smaller pieces, operating on them, and eventually merging them back together. When choosing the sorting algorithm you usually look at your requirements (a sketch of the chunking phase follows this list):
If you need a guaranteed time complexity and a stable sort, you can go for mergesort (O(n log n) guaranteed), although it requires an additional O(n) space.
If you are severely memory-bound, you might want to try smoothsort (constant memory, O(n log n) time).
Otherwise you might want to take a look at the research in the GPGPU accelerator field, like GPUTeraSort.
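As a rough illustration of the chunking half of that pattern, here is a minimal sketch of the run-generation phase; the input file name, the run-file naming scheme, and the RAM budget are placeholder assumptions.

#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

int main() {
    const std::size_t ramBudget = 512 * 1024 * 1024;   // assumed ~512 MB per run
    std::ifstream in("input.txt");                      // placeholder input file
    std::vector<std::string> run;
    std::size_t used = 0;
    int runIndex = 0;

    // Sort the current run in memory and write it to its own temporary file.
    auto flushRun = [&]() {
        if (run.empty()) return;
        std::sort(run.begin(), run.end());
        std::ofstream out("run_" + std::to_string(runIndex++) + ".txt");
        for (const std::string& s : run) out << s << '\n';
        run.clear();
        used = 0;
    };

    std::string line;
    while (std::getline(in, line)) {
        used += line.size();
        run.push_back(std::move(line));
        if (used >= ramBudget) flushRun();
    }
    flushRun();   // the sorted runs are then merged back together
}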
Google's servers deal with this sort of problem all the time.
Construct a simple digital tree (trie).
Memory usage will be much lower than the input data, because many words share a common prefix. While adding data to the tree, mark the last node of each word as an end of word (by incrementing a counter). Once all words have been added, do a DFS (visiting children in whatever order you want the sort, e.g. a->z) and output the data to a file. The time complexity is on the order of the memory size. It is hard to state the complexity exactly because it depends on the strings (many short strings give better complexity), but it is still much better than the input size O(n*k), where n is the number of strings and k is the average string length. Sorry for my English.
PS. To deal with the memory-size problem, you can split the file into smaller parts, sort each of them with this method, and then, if you end up with (for example) 1000 files, keep the first word of each in memory (like queues), repeatedly output the smallest one and read the next word from that file; this final merge takes very little time.
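A minimal sketch of that counting trie, under the assumption that the words consist of lowercase ASCII letters only (anything else would need a larger alphabet or a different key transformation):

#include <array>
#include <cstddef>
#include <iostream>
#include <memory>
#include <string>

struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;   // 'a'..'z'
    std::size_t count = 0;                             // how many words end here
};

void insert(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        int i = c - 'a';                               // assumes lowercase ASCII
        if (!node->child[i]) node->child[i] = std::make_unique<TrieNode>();
        node = node->child[i].get();
    }
    ++node->count;                                     // mark/increment end of word
}

// DFS in a..z order emits the words sorted, preserving duplicates.
void dump(const TrieNode& node, std::string& prefix, std::ostream& out) {
    for (std::size_t k = 0; k < node.count; ++k) out << prefix << '\n';
    for (int i = 0; i < 26; ++i) {
        if (node.child[i]) {
            prefix.push_back(static_cast<char>('a' + i));
            dump(*node.child[i], prefix, out);
            prefix.pop_back();
        }
    }
}

int main() {
    TrieNode root;
    std::string word;
    while (std::cin >> word) insert(root, word);
    std::string prefix;
    dump(root, prefix, std::cout);
}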
I suggest you use the Unix "sort" command, which can easily handle such files.
See How could the UNIX sort command sort a very large file?
Before disk drives even existed, people wrote programs to sort lists that were far too large to hold in main memory.
Such programs are known as external sorting algorithms.
My understanding is that the Unix "sort" command uses the merge sort algorithm.
Perhaps the simplest version of the external sorting merge sort algorithm works like this (quoting from Wikipedia: merge sort):
Name four tape drives as A, B, C, D, with the original data on A:
Merge pairs of records from A; writing two-record sublists alternately to C and D.
Merge two-record sublists from C and D into four-record sublists; writing these alternately to A and B.
Merge four-record sublists from A and B into eight-record sublists; writing these alternately to C and D.
Repeat until you have one list containing all the data, sorted --- in log2(n) passes.
Practical implementations typically have many tweaks:
Almost every practical implementation takes advantage of available RAM by reading many items into RAM at once, using some in-RAM sorting algorithm, rather than reading only one item at a time.
Some implementations are able to sort lists even when some or every item in the list is too large to hold in the available RAM.
Polyphase merge sort.
As suggested by Kaslai, rather than only 4 intermediate files, it is usually quicker to use 26 or more intermediate files. However, as the external sorting article points out, if you divide up the data into too many intermediate files, the program spends a lot of time waiting for disk seeks; too many intermediate files make it run slower.
As Kaslai commented, using larger RAM buffers for each intermediate file can significantly decrease the sort time -- doubling the size of each buffer halves the number of seeks. Ideally each buffer should be sized so the seek time is a relatively small part of the total time to fill that buffer. Then the number of intermediate files should be picked so the total size of all those RAM buffers put together comes close to but does not exceed available RAM. (If you have very short seek times, as with a SSD, the optimal arrangement ends up with many small buffers and many intermediate files. If you have very long seek times, as with tape drives, the optimal arrangement ends up with a few large buffers and few intermediate files. Rotating disk drives are intermediate).
etc. -- See the Knuth book "The Art of Computer Programming, Vol. 3: Sorting and Searching" for details.
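For the merge pass itself, a min-heap over the current line of each run is essentially what Python's heapq.merge does; here is a minimal sketch (run and output file names are placeholders):

#include <cstddef>
#include <fstream>
#include <functional>
#include <memory>
#include <queue>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::string> runNames = {"run_0.txt", "run_1.txt", "run_2.txt"};
    std::vector<std::unique_ptr<std::ifstream>> runs;

    // Each heap entry is (current line, index of the run it came from);
    // std::greater turns the priority_queue into a min-heap.
    using Entry = std::pair<std::string, std::size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (std::size_t i = 0; i < runNames.size(); ++i) {
        runs.push_back(std::make_unique<std::ifstream>(runNames[i]));
        std::string line;
        if (std::getline(*runs[i], line)) heap.emplace(std::move(line), i);
    }

    std::ofstream out("sorted.txt");
    while (!heap.empty()) {
        Entry smallest = heap.top();
        heap.pop();
        out << smallest.first << '\n';
        std::string next;                               // refill from the same run
        if (std::getline(*runs[smallest.second], next))
            heap.emplace(std::move(next), smallest.second);
    }
}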
Use as much memory as you can and chunk your data. Read one chunk at a time into memory.
Step 1) Sort entries inside chunks
For each chunk:
Use IntroSort to sort your chunk. But to avoid copying your strings around and having to deal with variable-sized strings and memory allocations (at this point it becomes interesting and relevant whether you actually have fixed- or max-size strings or not), preallocate a standard std::array or other fitting container of pointers to your strings, pointing into the memory region of the current data chunk. That way, your IntroSort swaps the pointers to your strings instead of swapping the actual strings.
Loop over each entry in your sort array and write the resulting (ordered) strings back to a corresponding sorted-strings file for this chunk.
Step 2) Merge all strings from sorted chunks into resulting sorted strings file
Allocate a "sliding" window memory region for all sorted strings files at once. To give an example: If you have 4 sorted strings files, allocate 4 * 256MB (or whatever fits, the larger the less (sequential) disk IO reads required).
Fill each window by reading the strings into it (so, read as much strings at once as your window can store).
Use MergeSort to compare any of your chunks, using a comparator to your window (e.g. stringInsideHunkA = getStringFromWindow(1, pointerToCurrentWindow1String) - pointerToCurrentWindow1String is a reference that the function advances to the next string). Note that if the string pointer to your window is beyond the window size (or the last record didn't fit to the window read the next memory region of that chunk into the window.
Use mapped IO (or buffered writer) and write the resulting strings into a giant sorted strings final
I think this could be an IO efficient way. But I've never implemented such thing.
However, with regard to your file size and your (as yet unknown to me) "non-functional" requirements, I suggest you also consider benchmarking a batch import using LevelDB [1]. It's actually very fast, minimizes disk IO, and even compresses your resulting strings file to about half the size without impact on speed.
[1] http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html
Here is a general algorithm that will be able to do what you want with just a few gigs of memory. You could get away with much less, but the more you have, the less disk overhead you have to deal with. This assumes that all of the strings are in a single file, but it could be applied to a multiple-file setup as well.
1: Create some files to store loosely sorted strings in. For terabytes of data, you'd probably want 676 of them: one for strings starting with "aa", one for "ab", and so on until you get to "zy" and "zz".
2: For each file you created, create a corresponding buffer in memory. A std::vector<std::string> perhaps.
3: Determine a buffer size that you want to work with. This should not exceed much beyond 1/2 of your available physical memory.
4: Load as many strings as you can into this buffer.
5: Truncate the file so that the strings in your buffer are no longer on disk. This step can be delayed for later or omitted entirely if you have the disk space to work with or the data is too sensitive to lose in the case of process failure. If truncating, make sure you load your strings from the end of the file, so that the truncation is almost a NOP.
6: Iterate over the strings and store them in their corresponding buffer.
7: Flush all of the buffers to their corresponding files. Clear all the buffers.
8: Go to step 4 and repeat until you have exhausted your source of strings.
9: Read each file to memory and sort it with whatever algorithm you fancy. On the off chance you end up with a file that is larger than your available physical memory, use a similar process from above to split it into smaller files.
10: Overwrite the unsorted file with this new sorted file, or append it to a monolithic file.
If you keep the individual files rather than a monolithic file, you can make insertions and deletions relatively quickly. You would only have to load in, insert, and sort the value into a single file that can be read entirely into memory. Now and then you might have to split a file into smaller files, however this merely amounts to looking around the middle of the file for a good place to split it and then just moving everything after that point to another file.
Good luck with your project.
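A small sketch of the two-letter bucketing used in steps 1 and 6; the lowercase-letter assumption and the file-naming scheme are mine, and strings shorter than two characters would need their own extra bucket:

#include <cctype>
#include <string>

// Map a string (assumed to have at least two characters) to one of the
// 676 buckets "aa".."zz".
int bucketIndex(const std::string& s) {
    int a = std::tolower(static_cast<unsigned char>(s[0])) - 'a';
    int b = std::tolower(static_cast<unsigned char>(s[1])) - 'a';
    return a * 26 + b;                        // 0 .. 675
}

// Placeholder naming scheme for the corresponding on-disk file.
std::string bucketFileName(const std::string& s) {
    int idx = bucketIndex(s);
    std::string name = "bucket_??.txt";
    name[7] = static_cast<char>('a' + idx / 26);
    name[8] = static_cast<char>('a' + idx % 26);
    return name;
}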

Search for multiple words in very large text files (10 GB) using C++ the fastest way

I have this program where I have to search for specific values and their line numbers in a very large text file, and there might be multiple occurrences of the same value.
I've tried a simple C++ program which reads the text file line by line and searches for the value using strstr, but it's taking a very long time.
I also tried a system command using grep, but it still takes a lot of time; not as long as before, but it's still too much.
I was searching for a library I can use to speed up the search.
Any help and suggestions? Thank you :)
There are two issues concerning the speed: the time it takes to actually read the data, and the time it takes to search.
Generally speaking, the fastest way to read a file is to mmap it (or the equivalent under Windows). This can get complicated if the entire file won't fit into the address space, but you mention 10 GB in the header; if searching is all you do in the program, this shouldn't create any problems.
More generally, if speed is a problem, avoid using getline on a string. Reading large blocks, and picking the lines up (as char[]) out of them without copying, is significantly faster. (As a simple compromise, you may want to copy when a line crosses a block boundary. If you're dealing with blocks of a MB or more, this shouldn't happen too often; I've used this technique on older, 16-bit machines, with blocks of 32KB, and still gotten a significant performance improvement.)
With regard to searching, if you're searching for a single, fixed string (not a regular expression or other pattern matching), you might want to try a BM (Boyer-Moore) search. If the string you're searching for is reasonably long, this can make a significant difference over other search algorithms. (I think that some implementations of grep will use this if the search pattern is in fact a fixed string, and is sufficiently long for it to make a difference.)
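As a concrete (POSIX, C++17) sketch of the two suggestions combined, you could mmap the file and scan it with std::boyer_moore_searcher; the file name and pattern below are placeholders:

#include <algorithm>
#include <fcntl.h>
#include <functional>     // std::boyer_moore_searcher (C++17)
#include <iostream>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const std::string pattern = "needle";              // placeholder pattern
    int fd = open("big.txt", O_RDONLY);                // placeholder file name
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    void* mapped = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                        PROT_READ, MAP_PRIVATE, fd, 0);
    if (mapped == MAP_FAILED) return 1;
    const char* data = static_cast<const char*>(mapped);
    const char* last = data + st.st_size;

    // Boyer-Moore preprocessing happens once; the scan itself can skip ahead
    // by up to pattern.size() bytes per mismatch.
    std::boyer_moore_searcher searcher(pattern.begin(), pattern.end());
    const char* first = data;
    while (true) {
        const char* hit = std::search(first, last, searcher);
        if (hit == last) break;
        std::cout << "match at offset " << (hit - data) << '\n';
        first = hit + 1;                               // continue after this match
    }

    munmap(mapped, static_cast<std::size_t>(st.st_size));
    close(fd);
}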
Use multiple threads. Each thread can be responsible for searching through a portion of the file. For example, on a 4-core machine, spawn 12 threads. The first thread looks through the first 8% of the file, the second thread the second 8% of the file, etc. You will want to tune the number of threads per core to keep the CPU maximally utilized. Since this is an I/O-bound operation, you may never reach 100% CPU utilization.
Feeding data to the threads will be a bottleneck in this design. Memory-mapping the file might help somewhat, but at the end of the day the disk can only read one sector at a time. This will be a bottleneck that you will be hard pressed to resolve. You might consider starting one thread that does nothing but read all the data into memory, and kicking off the search threads as the data loads up.
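A minimal sketch of that partitioning, assuming the whole file (or a mapped view of it) is available as one contiguous string and the pattern is non-empty; each slice is extended by pattern.size() - 1 bytes so matches straddling a boundary are not lost. The parallelFind helper is hypothetical, not from any library:

#include <algorithm>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

std::vector<std::size_t> parallelFind(const std::string& data,
                                      const std::string& pattern,
                                      unsigned workers) {
    std::vector<std::vector<std::size_t>> partial(workers);
    std::vector<std::thread> threads;
    const std::size_t chunk = data.size() / workers + 1;

    for (unsigned w = 0; w < workers; ++w) {
        threads.emplace_back([&, w] {
            const std::size_t begin = w * chunk;
            if (begin >= data.size()) return;
            // Overlap the slice end so boundary-straddling matches are seen.
            const std::size_t end =
                std::min(data.size(), begin + chunk + pattern.size() - 1);
            std::size_t pos = data.find(pattern, begin);
            while (pos != std::string::npos && pos + pattern.size() <= end) {
                if (pos < begin + chunk)               // owned by this slice
                    partial[w].push_back(pos);
                pos = data.find(pattern, pos + 1);
            }
        });
    }
    for (std::thread& t : threads) t.join();

    std::vector<std::size_t> all;                      // gather per-thread results
    for (const auto& p : partial)
        all.insert(all.end(), p.begin(), p.end());
    return all;
}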
Since files are sequential beasts, searching from start to end is something you may not get around; however, there are a couple of things you could do.
If the data is static, you could generate a smaller lookup file (alternatively, one with offsets into the main file). This works well if the same string is repeated multiple times, which makes the index file much smaller. If the file is dynamic, you may need to regenerate the index file occasionally (offline).
Instead of reading line by line, read larger chunks from the file, like several MB, to speed up I/O.
If you'd like to use a library, you could use Xapian.
You may also want to try tokenizing your text before doing the search, and I'd suggest trying regexes too, but that will take a lot of time if you don't have an index on the text, so I'd definitely suggest you try Xapian or some other search engine.
If your big text file does not change often then create a database (for example SQLite) with a table:
create table word_line_numbers
(word varchar(100), line_number integer);
Read your file and insert a record into the database for every word, with something like this:
insert into word_line_numbers(word, line_number) values ('foo', 13452);
insert into word_line_numbers(word, line_number) values ('foo', 13421);
insert into word_line_numbers(word, line_number) values ('bar', 1421);
Create an index of words:
create index word_line_numbers_idx on word_line_numbers(word);
And then you can find line numbers for words fast using this index:
select line_number from word_line_numbers where word='foo';
For added speed (because of smaller database size) and complexity you can use 2 tables: words(word_id integer primary key, word not null) and word_lines(word_id integer not null references words, line_number integer not null).
I'd first try loading as much of the file into RAM as possible (memory-mapping the file is a good option) and then searching it concurrently, in parts, on multiple processors. You'll need to take special care near the buffer boundaries to make sure you aren't missing any words. Also, you may want to try something more efficient than the typical strstr(); see these:
Boyer–Moore string search algorithm
Knuth–Morris–Pratt algorithm

Delete a line in an ofstream in C++

I want to erase lines within a file. I know you can store the content of the file (in a vector, for example), erase the line, and write it again. However, that feels very cumbersome, and not very efficient if the file gets bigger.
Anyone knows of a better, more efficient, more elegant way of doing it?
On most file-systems, this is the only option you have, short of switching to an actual database.
However, if you find yourself in this situation (i.e. very large files, with inserts/deletes in the middle), consider whether you can do something like maintaining a bitmap at the top of the file, where each bit represents one line of your file. To "delete" a line, simply flip the corresponding bit value.
There's nothing particularly magical about disk files. They still like to store their data in contiguous areas (typically called something like "blocks"). They don't have ways of leaving data-free holes in the middle of these areas. So if you want to "remove" three bytes from the middle of one of these areas, something somewhere is going to have to accomplish this by moving everything else in that area back by three bytes. No, it is not efficient.
This is why text editors (which have to do this kind of thing a lot) tend to load as much of the file as possible (if not all of it) into RAM, where moving data around is much faster. They typically only write changes back to disk when requested (or periodically). If you are going to have to make lots of changes like this, I'd suggest taking a page from their book and doing something similar.
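For completeness, here is a minimal sketch of the copy-and-skip rewrite the question calls cumbersome; it streams the file to a temporary copy, omits one line (0-based index here, an arbitrary choice), and then swaps the files:

#include <cstddef>
#include <cstdio>     // std::remove, std::rename
#include <fstream>
#include <string>

bool deleteLine(const std::string& path, std::size_t lineToDelete) {
    std::ifstream in(path);
    std::ofstream out(path + ".tmp");
    if (!in || !out) return false;

    std::string line;
    std::size_t current = 0;
    while (std::getline(in, line)) {
        if (current++ != lineToDelete)     // copy every line except the victim
            out << line << '\n';
    }
    in.close();
    out.close();

    // Replace the original file with the rewritten copy.
    return std::remove(path.c_str()) == 0 &&
           std::rename((path + ".tmp").c_str(), path.c_str()) == 0;
}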
The BerkeleyDB (dbopen(3)) has an access method called DB_RECNO. This allows one to manipulate files with arbitrary lengths using any sort of record delimiter. The default uses variable-length records with unix newlines as delimiters. You then access each "record" using an integer index. Using this, you can delete arbitrary lines from your text file. This isn't specific to C++, but if you are on most Unix/Linux systems, this API is already available to you.

How many characters can an STL string class hold?

I need to work with a series of characters. The number of characters can be up to 10^11.
In a usual array, that's not possible. What should I use?
I wanted to use the gets() function to read the string. But is this possible with STL containers?
If not, then what's the way?
Example:
input:
AMIRAHID
output: A.M.I.R.A.H.I.D
How could this be made possible if the number of characters were reduced to 10^10 on a 32-bit machine?
Thank you in advance.
Well, that's roughly 100 GB of data. No usual string class will be able to hold more than fits into your main memory. You might want to look at STXXL, which is an implementation of the STL that allows storing part of the data on disk.
If your machine has 10^11 bytes == 93 GB of memory, then it's probably a 64-bit machine, so string will work. Otherwise nothing will help you.
Edited answer for the edited question: In that case you don't really need to store the whole string in memory. You can store only small part of it that fits into the memory.
Just read every character from the input, write it to the output, and write a dot after it. Repeat until you get an EOF on the input. To increase performance, you can read and write large chunks of data, as long as they still fit into memory.
Such algorithms are called online algorithms.
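A minimal sketch of that online approach, reading from standard input in 1 MB chunks (an arbitrary buffer size); note that unlike the A.M.I.R.A.H.I.D example it also prints a dot after the last character, which would need one extra check:

#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::ios::sync_with_stdio(false);
    std::vector<char> buf(1 << 20);                    // 1 MB read buffer
    while (std::cin.read(buf.data(), buf.size()) || std::cin.gcount() > 0) {
        const std::streamsize n = std::cin.gcount();
        std::string out;
        out.reserve(static_cast<std::size_t>(n) * 2);
        for (std::streamsize i = 0; i < n; ++i) {
            out.push_back(buf[i]);                     // the character itself
            out.push_back('.');                        // followed by a dot
        }
        std::cout.write(out.data(), static_cast<std::streamsize>(out.size()));
    }
}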
It is possible for an array that large to be created. But not on a 32-bit machine. Switching to STL will likely not help, and is unnecessary.
You need to contemplate how much memory that is, and if you have any chance of doing it at all.
10^11 is roughly 100 gigabytes, which means you will need a 64-bit system (and compiler) to even be able to address it.
STL's strings support a max of max_size() characters, so the answer can change with the implementation.
A string suffers from the same problem as an array: it has to fit in memory.
10^11 characters would take up around 93 GB, far more than the 4 GB address space of a 32-bit machine. You either need to split up your data into smaller chunks and only load a bit at a time, or switch to 64-bit, in which case both arrays and strings should be able to hold the data (although it may still be preferable to split it up into multiple smaller strings/arrays).
The SGI version of the STL has a rope class (a rope is a big string, get it?).
I am not sure it is designed to handle that much data but you can have a look.
http://www.sgi.com/tech/stl/Rope.html
If all you're trying to do is read in some massive file and write to another file the same data with periods interspersed between each character, why bother reading the whole thing into memory at once? Pick some reasonable buffer size and do it in chunks.