Possible Duplicate:
How would you implement tail efficiently?
A friend of mine was asked how he'd implement tail -n.
To be clear, we are required to print the last n lines of the file specified.
I thought of using an array of n strings and overwriting them in a cyclic manner.
But if we are given, say, a 10 GB file, this approach doesn't scale: the memory stays bounded, but we still have to read the entire file from the beginning.
Is there a better way to do this?
Memory-map the file, iterate from the end looking for an end-of-line character n times, then write from that point to the end of the file to standard out.
You could potentially complicate the solution by not mapping the whole file, but just the last X KB (say a couple of memory pages) and seeking there. If there aren't enough lines, memory-map a larger region until you get what you want. You can use a heuristic to guess how much memory to map (say 1 KB per line as a rough estimate). I would not really do this, though.
"It depends", no doubt. Given the size of the file should be knowable, and given a sensible file-manipulation library which can 'seek' to the end of a very large file without literatally traversing each byte in turn or thrashing virtual memory, you could simply scan backwards from the end counting newlines.
When you're dealing with files that big though, what do you do about the degenerate case where n is close to the number of lines in the multi-gigabyte file? Storing stuff in temporary strings won't scale then, either.
Problem
There is a huge file (10 GB); one has to read the file and print out the number of words that repeat exactly k times in the file.
My Solution
Use ifstream to read the file word by word;
Insert each word into a map: std::map<std::string, long> mp; mp[word] += 1;
Once the file is read, find all the words in the map occurring k times.
Question
How can multithreading be used to read the file effectively [read by chunk]? Or
is there any method to improve the read speed?
Is there any data structure better than a map that can be employed to produce the output effectively?
File info
each line can be maximum of 500 words length
each word can be maximum of 100 char length
How can multithreading be used to read the file effectively [read by chunk]? Or is there any method to improve the read speed?
I've been trying out the actual results, and multithreading does help, contrary to my previous advice here. The un-threaded variant runs in 1m44.711s, the 4-thread one (on 4 cores) runs in 0m31.559s, and the 8-thread one (on 4 cores + HT) runs in 0m23.435s. A major improvement, then: almost a factor of 5 in speedup.
So, how do you split up the workload? Split it into N chunks (N == thread count) and have each thread except the first seek to the first non-word character; that is the start of its logical chunk. Each logical chunk ends at its nominal end boundary, rounded up to the first non-word character after that point.
Process these blocks in parallel, sync them all to one thread, and then make that thread do the merge of results.
Best next thing you can do to improve speed of reading is to ensure you don't copy data when possible. Read through a memory-mapped file and find strings by keeping pointers or indices to the start and end, instead of accumulating bytes.
Is there any data structure better than a map that can be employed to produce the output effectively?
Well, since I don't think you'll be using the order, an unordered_map is a better choice. I would also make it an unordered_map<std::string_view, size_t>; std::string_view avoids copying even more than std::string would.
On profiling I find that 53% of time is spent in finding the exact bucket that holds a given word.
If you have a 64-bit system then you can memory-map the file, and use e.g. this solution to read from memory.
Combine with the answer from dascandy regarding std::unordered_map and std::string_view (if you have it), and you should be as fast as you can get in a single thread. You could use std::unordered_multiset instead of std::unordered_map, which one is "faster" I don't know.
Using threads is simple: just do what you do now, but have each thread handle only part of the file, and merge the maps after all threads are done. But when you split the file into chunks for each thread, you risk splitting words in the middle. Handling this is not trivial.
I have a very large amount (multiple terabytes) of strings stored on disk that I need to sort alphabetically and store in another file as quickly as possible (preferably in C/C++), using as little internal memory as possible. It is not an option to pre-index the strings beforehand, so I need to sort the strings whenever needed, in a close-to-real-time fashion.
What would be the best algorithm to use in my case? I would prefer a suggestion for a linear algorithm rather than just a link to an existing software library like Lucene.
You usually sort huge external data by chunking it into smaller pieces, operating on them and eventually merging them back. When choosing the sorting algorithm you usually take a look at your requirements:
If you need a stable sort with a time-complexity guarantee, you can go for merge sort (O(n log n) guaranteed), although it requires an additional O(n) space.
If you are severely memory-bound you might want to try Smoothsort (constant memory, O(n log n) time).
Otherwise you might want to take a look at the research stuff in the gpgpu accelerators field like GPUTeraSort.
Google's servers usually have this sort of problem.
Construct a simple digital tree (trie).
Memory usage will be much lower than the input size, because many words share common prefixes. While adding a word to the tree, mark the last node as the end of a word (by incrementing a counter there). Once all words are added, do a DFS (visiting children in whatever order you want the sort, e.g. a->z) and output the words to a file. The time complexity is essentially proportional to the memory used. It is hard to state the complexity exactly because it depends on the strings (many short strings behave better), but it is bounded by the input size, O(n*k), where n is the number of strings and k is the average string length.
PS. To deal with the memory limit, you can split the file into smaller parts, sort each part with this method, and then, if you end up with (say) 1000 sorted files, keep the current first word of each (like queues), repeatedly output the smallest one and pull the next word from that file; this runs very quickly.
I suggest you use the Unix "sort" command that can easily handle such files.
See How could the UNIX sort command sort a very large file? .
Before disk drives even existed, people wrote programs to sort lists that were far too large to hold in main memory.
Such programs are known as external sorting algorithms.
My understanding is that the Unix "sort" command uses the merge sort algorithm.
Perhaps the simplest version of the external sorting merge sort algorithm works like this (quoting from Wikipedia: merge sort):
Name four tape drives as A, B, C, D, with the original data on A:
Merge pairs of records from A; writing two-record sublists alternately to C and D.
Merge two-record sublists from C and D into four-record sublists; writing these alternately to A and B.
Merge four-record sublists from A and B into eight-record sublists; writing these alternately to C and D.
Repeat until you have one list containing all the data, sorted --- in log2(n) passes.
Practical implementations typically have many tweaks:
Almost every practical implementation takes advantage of available RAM by reading many items into RAM at once, using some in-RAM sorting algorithm, rather than reading only one item at a time.
Some implementations are able to sort lists even when some or every item in the list is too large to hold in the available RAM.
polyphase merge sort
As suggested by Kaslai, rather than only 4 intermediate files, it is usually quicker to use 26 or more intermediate files. However, as the external sorting article points out, if you divide up the data into too many intermediate files, the program spends a lot of time waiting for disk seeks; too many intermediate files make it run slower.
As Kaslai commented, using larger RAM buffers for each intermediate file can significantly decrease the sort time -- doubling the size of each buffer halves the number of seeks. Ideally each buffer should be sized so the seek time is a relatively small part of the total time to fill that buffer. Then the number of intermediate files should be picked so the total size of all those RAM buffers put together comes close to but does not exceed available RAM. (If you have very short seek times, as with a SSD, the optimal arrangement ends up with many small buffers and many intermediate files. If you have very long seek times, as with tape drives, the optimal arrangement ends up with a few large buffers and few intermediate files. Rotating disk drives are intermediate).
etc. -- See the Knuth book "The Art of Computer Programming, Vol. 3: Sorting and Searching" for details.
Use as much memory as you can and chunk your data. Read one chunk at a time into memory.
Step 1) Sort entries inside chunks
For each chunk:
Use IntroSort to sort your chunk. To avoid copying your strings around and having to deal with variable-sized strings and memory allocations (at this point it matters whether you actually have fixed- or max-size strings or not), preallocate a std::array or another fitting container of pointers to your strings, pointing into the memory region of the current data chunk. Your IntroSort then swaps the pointers to your strings instead of swapping the actual strings.
Loop over each entry in your sort-array and write the resulting (ordered) strings back to a corresponding sorted strings file for this chunk
Step 2) Merge all strings from sorted chunks into resulting sorted strings file
Allocate a "sliding" window memory region for all sorted strings files at once. To give an example: If you have 4 sorted strings files, allocate 4 * 256MB (or whatever fits, the larger the less (sequential) disk IO reads required).
Fill each window by reading the strings into it (so, read as much strings at once as your window can store).
Use MergeSort to compare any of your chunks, using a comparator to your window (e.g. stringInsideHunkA = getStringFromWindow(1, pointerToCurrentWindow1String) - pointerToCurrentWindow1String is a reference that the function advances to the next string). Note that if the string pointer to your window is beyond the window size (or the last record didn't fit to the window read the next memory region of that chunk into the window.
Use mapped IO (or buffered writer) and write the resulting strings into a giant sorted strings final
I think this could be an IO efficient way. But I've never implemented such thing.
However, with regard to your file size and your "non-functional" requirements (yet unknown to me), I suggest you also consider benchmarking a batch import using LevelDB [1]. It's actually very fast, minimizes disk IO, and even compresses the resulting strings file to about half the size without impact on speed.
[1] http://leveldb.googlecode.com/svn/trunk/doc/benchmark.html
Here is a general algorithm that will be able to do what you want with just a few gigs of memory. You could get away with much less, but the more you have, the less disk overhead you have to deal with. This assumes that all of the strings are in a single file, however could be applied to a multiple file setup.
1: Create some files to store loosely sorted strings in. For terabytes of data, you'd probably want 676 of them. One for strings starting in "aa", one for "ab", and so on until you get to "zy" and "zz".
2: For each file you created, create a corresponding buffer in memory. A std::vector<std::string> perhaps.
3: Determine a buffer size that you want to work with. This should not exceed much beyond 1/2 of your available physical memory.
4: Load as many strings as you can into this buffer.
5: Truncate the file so that the strings in your buffer are no longer on disk. This step can be delayed for later or omitted entirely if you have the disk space to work with or the data is too sensitive to lose in the case of process failure. If truncating, make sure you load your strings from the end of the file, so that the truncation is almost a NOP.
6: Iterate over the strings and store them in their corresponding buffer.
7: Flush all of the buffers to their corresponding files. Clear all the buffers.
8: Go to step 4 and repeat until you have exhausted your source of strings.
9: Read each file to memory and sort it with whatever algorithm you fancy. On the off chance you end up with a file that is larger than your available physical memory, use a similar process from above to split it into smaller files.
10: Overwrite the unsorted file with this new sorted file, or append it to a monolithic file.
If you keep the individual files rather than a monolithic file, you can make insertions and deletions relatively quickly. You would only have to load in, insert, and sort the value into a single file that can be read entirely into memory. Now and then you might have to split a file into smaller files, however this merely amounts to looking around the middle of the file for a good place to split it and then just moving everything after that point to another file.
Good luck with your project.
I have a file which might be 30+GB or more. And each line in this file is called a record and is composed of 2 cols, which goes like this
id1 id2
Both of these ids are 32-bit integers. My job is to write a program to remove all the duplicate records, make the records unique, and finally output the unique id2 values into a file.
There are some constraints: at most 30 GB of memory is allowed, and the job had better be done efficiently by a single-threaded, single-process program.
Initially I came up with an idea: because of the memory constraints, I decided to read the file n times, each only keep in memory those record with id1 % n = i (i = 0,1,2,..,n-1). The data structure I use is a std::map<int, std::set<int> >, it takes id1 as key, and put id2 in id1's std::set.
This way, the memory constraint is not violated, but it's quite slow. I think it's because the insertion speed goes down as the std::map and std::set grow larger. Moreover, I need to read the file n times, and when each round is done I have to clear the std::map for the next round, which also costs some time.
I also tried a hash, but it doesn't satisfy me either; I thought there might be too many collisions even with 3 million buckets.
So, I post my problem here, hoping you guys can offer me a better data structure or algorithm.
Thanks a lot.
PS
Scripts (shell, Python) are also desired, if they can do it efficiently.
Unless I overlooked a requirement, it should be possible to do this on the Linux shell as
sort -u inputfile > outputfile
Many implementations enable you to use sort in a parallelised manner as well:
sort --parallel=4 -u inputfile > outputfile
for up to four parallel executions.
Note that sort might use a lot of space in /tmp temporarily. If you run out of disk space there, you may use the -T option to point it to an alternative place on disk to use as temporary directory.
(Edit:) A few remarks about efficiency:
A significant portion of the time spent during execution (of any solution to your problem) will be spent on IO, something that sort is highly optimised for.
Unless you have extremely much RAM, your solution is likely to end up performing some of the work on disk (just like sort). Again, optimising this means a lot of work, while for sort all of that work has been done.
One disadvantage of sort is that it operates on string representations of the input lines. If you were to write your own code, one thing you could do (similar to what you suggested already) is to convert the input lines to 64-bit integers and hash them. If you have enough RAM, that may be a way to beat sort in terms of speed, if you get IO and integer conversions to be really fast. I suspect it may not be worth the effort, as sort is easy to use and, I think, fast enough.
I just don't think you can do this efficiently without using a bunch of disk. Any form of data structure will introduce so much memory and/or storage overhead that your algorithm will suffer. So I would expect a sorting solution to be best here.
I reckon you can sort large chunks of the file at a time, and then merge (ie from merge-sort) those chunks after. After sorting a chunk, obviously it has to go back to disk. You could just replace the data in the input file (assuming it's binary), or write to a temporary file.
As far as the records, you just have a bunch of 64-bit values. With 30GB RAM, you can hold almost 4 billion records at a time. That's pretty sweet. You could sort that many in-place with quicksort, or half that many with mergesort. You probably won't get a contiguous block of memory that size. So you're going to have to break it up. That will make quicksort a little trickier, so you might want to use mergesort in RAM as well.
During the final merge it's trivial to discard duplicates. The merge might be entirely file-based, but at worst you'll use an amount of disk equivalent to twice the number of records in the input file (one file for scratch and one file for output). If you can use the input file as scratch, then you have not exceeded your RAM limits OR your disk limits (if any).
I think the key here is the requirement that it shouldn't be multithreaded. That lends itself well to disk-based storage. The bulk of your time is going to be spent on disk access. So you wanna make sure you do that as efficiently as possible. In particular, when you're merge-sorting you want to minimize the amount of seeking. You have large amounts of memory as buffer, so I'm sure you can make that very efficient.
So let's say your file is 60GB (and I assume it's binary) so there's around 8 billion records. If you're merge-sorting in RAM, you can process 15GB at a time. That amounts to reading and (over)writing the file once. Now there are four chunks. If you want to do pure merge-sort then you always deal with just two arrays. That means you read and write the file two more times: once to merge each 15GB chunk into 30GB, and one final merge on those (including discarding of duplicates).
I don't think that's too bad. Three times in and out. If you figure out a nice way to quicksort then you can probably do this with one fewer pass through the file. I imagine a data structure like deque would work well, as it can handle non-contiguous chunks of memory... But you'd probably wanna build your own and finely tune your sorting algorithm to exploit it.
Instead of std::map<int, std::set<int> >, use std::unordered_multimap<int,int>. If you cannot use C++11, write your own.
The std::map is node-based and calls malloc on each insertion, which is probably why it is slow. With an unordered map (hash table), if you know the number of records, you can pre-allocate. Even if you don't, the number of mallocs will be O(log N) instead of O(N) with std::map.
I can bet this will be several times faster and more memory efficient than using external sort -u.
This approach may help when there are not too many duplicate records in the file.
1st pass. Allocate most of the memory for a Bloom filter. Hash every pair from the input file and put the result into the Bloom filter. Write each duplicate found by the Bloom filter into a temporary file (this file will also contain some false positives, which are not actual duplicates).
2nd pass. Load temporary file and construct a map from its records. Key is std::pair<int,int>, value is a boolean flag. This map may be implemented either as std::unordered_map/boost::unordered_map, or as std::map.
3rd pass. Read input file again, search each record in the map, output its id2 if either not found or flag is not yet set, then set this flag.
I want to count words occurrences in a set of plain text files. Just like here http://doc.trolltech.com/4.5/qtconcurrent-wordcount-main-cpp.html
The problem is that I need to process a very large number of plain text files, so my result stored in a QMap cannot fit into memory.
I googled for an external-memory (file-based) merge sort algorithm, but I'm too lazy to implement it myself. So I want to divide the result set into portions that each fit into memory, store those portions in files on disk, and then call a magic function mergeSort(QList, result_file) to get the final result in result_file.
Does anyone know Qt compatible implementation of this algo?
In short, I'm looking for an analog of Python's heapq.merge (http://docs.python.org/library/heapq.html#heapq.merge), but for Qt containers.
You might wanna check out this one:
http://stxxl.sourceforge.net/
It's not exactly what you are looking for (close enough, though), but I guess you will not find exactly what you want working with Qt lists. Since you are implementing the algorithm that creates this list, changing its type shouldn't be a problem. As far as I remember, you can use standard STL sorting algorithms on those lists. The only remaining problem is performance.
I presume that the map contains the association between the word and the number of occurrences. In that case, why do you say you have such significant memory consumption? How many distinct words and forms could you have, and what is the average memory consumption per word?
Considering 1,000,000 words with 1 KB of memory consumption per word (including the word text and the QMap-specific storage), that would lead to approximately 1 GB of memory, which... doesn't seem like much to me.
What is the most memory efficient way to remove duplicate lines in a large text file using C++?
Let me clarify, I'm not asking for code, just the best method. The duplicate lines are not guaranteed to be adjacent. I realize that an approach optimized for minimal memory usage would result in slower speeds however this is my restriction as the files are far too large.
I would hash each line and then seek back to lines that have non-unique hashes and compare them individually (or in a buffered fashion). This would work well on files with a relatively low occurrence of duplicates.
When you use a hash, you can set the memory used to a constant amount (i.e., you could have a tiny hash table with just 256 slots, or something larger; in any case, the amount of memory can be restricted to any constant amount). The values in the table are the offsets of the lines with that hash, so you only need line_count*sizeof(int) plus a constant to maintain the hash table.
Even simpler (but much slower) would be to scan the entire file for each line, but I prefer the first option. It is the most memory-efficient option possible: you would only need to store 2 offsets and 2 bytes to do the comparison.
To minimize memory usage:
If you have unlimited (or very fast) disk I/O, you could write each line to its own file, with the filename being the hash plus some identifier indicating order (or no identifier, if order is irrelevant). In this manner, you use the file system as extended memory. This should be much faster than re-scanning the whole file for each line.
As an addition to what others have said: if you expect a high duplicate rate, you could maintain some threshold of the hashes in memory as well as in the file. This would give much better results for high duplicate rates. Since the file is so large, I doubt O(n^2) processing time is acceptable. My solution is O(n) in processing speed and O(1) in memory. It is O(n) in disk space used during runtime, however, which other solutions do not require.
It sounds like you might be running on limited hardware of varied specs, so you'll want to test a number of duplicate removing algorithms and profile before you decide which is best for long-term implementation.
You could use an I/O-efficient sort (like the Unix sort command) and then read the file line by line, comparing each line to the one previously read. If the two are equal, don't output anything; if they aren't, output the line.
This way the amount of memory used by the algorithm is constant.
Simple brute-force solution (very little memory consumption):
Do an O(n^2) pass through the file and remove duplicate lines. Speed: O(n^2), memory: constant.
Fast (but poor, memory consumption):
Stefan Kendall's solution: hash each line, store the hashes in a map of some sort, and remove a line whose hash already exists. Speed: O(n), memory: O(n).
If you're willing to sacrifice file order (I assume not, but I'll add it):
You can sort the lines, then pass through removing duplicates. speed: O(n*log(n)), Memory: constant
edit:
If you don't like the idea of sorting the file contents or trying to maintain unique hashes but can handle O(n) memory usage: you can identify each line with its 32-bit or 64-bit position marker (depending on the file's size) and sort the file positions instead of the file contents.
edit #2: caveat: in-memory sorting of lines of different lengths is harder than sorting, say, an array of ints... actually, thinking about how the memory would have to shift and move in a merge step, I'm second-guessing my ability to sort a file like that in n*log(n).
Why not just consult Knuth, Sorting and Searching? That will give you a great background for making a balanced decision.