Processing huge text files - C++

Problem:
I have a huge raw text file (assume 3 GB). I need to go through each word in the file
and find out how many times each word appears in the file.
My Proposed Solution:
Split the huge file into multiple files, with each split file holding words in a sorted manner. For example,
all the words starting with "a" will be stored in an "_a.dic" file. So, at any time we will not exceed more than 26 files.
The problem with this approach is:
I can use streams to read the file, but I wanted to use threads to read certain parts of the file. For example, read bytes 0-1024 with a separate thread (and have at least 4-8 threads, based on the number of processors in the box). Is this possible, or am I dreaming?
Any better approach?
Note: It should be a pure C++ or C based solution. No databases etc. are allowed.

You need to look at 'The Practice of Programming' by Kernighan and Pike, and specifically chapter 3.
In C++, use a map based on the strings and a count (std::map<string,size_t>, IIRC). Read the file (once - it's too big to read more than once), splitting it into words as you go (for some definition of 'word'), and incrementing the count in the map entry for each word you find.
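A minimal sketch of that approach (one pass, counting with std::map; the file name and operator>>'s whitespace-based definition of 'word' are assumptions):
#include <cstddef>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main()
{
    std::ifstream in("huge.txt");                 // hypothetical file name
    std::map<std::string, std::size_t> counts;
    std::string word;
    while (in >> word)                            // single pass over the file
        ++counts[word];                           // operator[] default-constructs the count to 0
    for (const auto& kv : counts)
        std::cout << kv.first << ' ' << kv.second << '\n';
}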
In C, you'll have to create the map yourself. (Or find David Hanson's "C Interfaces and Implementations".)
Or you can use Perl, or Python, or Awk (all of which have associative arrays, equivalent to a map).

I don't think using multiple threads that read parts of the file in parallel is going to help much. I would expect that this application is bound by the bandwidth and latency of your hard disk, not the actual word counting. Such a multi-threaded version might actually perform worse because "quasi-random" file access is typically slower than "linear file" access.
If the CPU is really busy in a single-threaded version, there might be a potential speed-up. One thread could read the data in big chunks and put them into a queue of limited capacity. A bunch of other worker threads could each operate on their own chunk and count the words. After the counting worker threads have finished, you have to merge the word counters.
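A rough sketch of that reader-plus-workers layout, assuming a 1 MB chunk size, a hypothetical file name, and (for brevity) ignoring the words that get split across chunk boundaries:
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <fstream>
#include <map>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

int main()
{
    const std::size_t kChunkSize = 1 << 20;        // 1 MB chunks (arbitrary)
    const std::size_t kMaxQueued = 8;              // limited queue capacity
    const unsigned kWorkers = std::max(1u, std::thread::hardware_concurrency());

    std::queue<std::string> chunks;
    std::mutex m;
    std::condition_variable not_full, not_empty;
    bool done = false;

    // One local counter per worker; merged at the end.
    std::vector<std::map<std::string, std::size_t>> local(kWorkers);

    auto worker = [&](unsigned id) {
        for (;;) {
            std::string chunk;
            {
                std::unique_lock<std::mutex> lk(m);
                not_empty.wait(lk, [&] { return !chunks.empty() || done; });
                if (chunks.empty()) return;        // reader finished and queue drained
                chunk = std::move(chunks.front());
                chunks.pop();
                not_full.notify_one();
            }
            std::istringstream ss(chunk);          // NOTE: words straddling a chunk
            std::string word;                      // boundary are miscounted here
            while (ss >> word)
                ++local[id][word];
        }
    };

    std::vector<std::thread> pool;
    for (unsigned i = 0; i < kWorkers; ++i)
        pool.emplace_back(worker, i);

    std::ifstream in("huge.txt", std::ios::binary);   // hypothetical file name
    std::string buf(kChunkSize, '\0');
    while (in.read(&buf[0], static_cast<std::streamsize>(buf.size())) || in.gcount() > 0) {
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [&] { return chunks.size() < kMaxQueued; });
        chunks.push(buf.substr(0, static_cast<std::size_t>(in.gcount())));
        not_empty.notify_one();
    }
    {
        std::lock_guard<std::mutex> lk(m);
        done = true;
    }
    not_empty.notify_all();
    for (auto& t : pool) t.join();

    std::map<std::string, std::size_t> total;      // merge the per-worker counters
    for (const auto& lm : local)
        for (const auto& kv : lm)
            total[kv.first] += kv.second;
}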

First - decide on the data structure for storing the words.
The obvious choice is the map, but perhaps a trie would serve you better. In each node, you save the count for the word; 0 means that it is only part of a word.
You can insert into the trie using a stream and reading your file character by character.
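A minimal sketch of such a trie, assuming only the letters a-z matter (anything else is skipped here; a real version would need a wider child array or a map of children):
#include <array>
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>

struct TrieNode {
    std::size_t count = 0;                              // 0 means "only part of a word"
    std::array<std::unique_ptr<TrieNode>, 26> child{};
};

// Insert one word, character by character, and bump its count.
void insert(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word) {
        int i = std::tolower(static_cast<unsigned char>(c)) - 'a';
        if (i < 0 || i >= 26) continue;                 // skip non-letters in this sketch
        if (!node->child[i])
            node->child[i] = std::make_unique<TrieNode>();
        node = node->child[i].get();
    }
    ++node->count;
}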
Second - multithreading yes or no?
This one is not easy to answer. Depending on how large the data structure grows and how you parallelize, the answer may differ.
Single-threaded - straightforward and easy to implement.
Multithreaded with multiple reader threads and one data structure. Then you have to synchronize access to the data structure. In a trie, you only need to lock the node you are actually in, so multiple readers can access the data structure without much interference. A self-balancing tree might be different, especially when rebalancing.
Multithreaded with multiple reader threads, each with their own data structure. Each thread builds its own data structure while reading a part of the file. After each one has finished, the results have to be combined (which should be easy).
One thing you have to think about - you have to find a word boundary for each thread to start at, but that should not pose a great problem (e.g. each thread walks from its start to the first word boundary and starts there; at the end, each thread finishes the word it is working on).

While you can use a second thread to analyze the data after reading it, you're probably not going to gain a huge amount by doing so. Trying to use more than one thread to read the data will almost certainly hurt speed rather than improving it. Using multiple threads to process the data is pointless -- processing will be many times faster than reading, so even with only one extra thread, the limit is going to be the disk speed.
One (possible) way to gain significant speed is to bypass the usual iostreams -- while some are nearly as fast as using C FILE*'s, I don't know of anything that's really faster, and some are substantially slower. If you're running this on a system (e.g. Windows) that has an I/O model that's noticeably different from C's, you can gain considerably more with a little care.
The problem is fairly simple: the file you're reading is (potentially) larger than the cache space you have available -- but you won't gain anything from caching, because you're not going to reread chunks of the file again (at least if you do things sensibly). As such, you want to tell the system to bypass any caching, and just transfer data as directly as possible from the disk drive to your memory where you can process it. In a Unix-like system, that's probably open() and read() (and won't gain you a whole lot). On Windows, that's CreateFile and ReadFile, passing the FILE_FLAG_NO_BUFFERING flag to CreateFile -- and it'll probably roughly double your speed if you do it right.
You've also gotten some answers advocating doing the processing using various parallel constructs. I think these are fundamentally mistaken. Unless you do something horribly stupid, the time to count the words in the file will be only a few milliseconds longer than it takes to simply read the file.
The structure I'd use would be to have two buffers of, say, a megabyte apiece. Read data into one buffer. Turn that buffer over to your counting thread to count the words in that buffer. While that's happening, read data into the second buffer. When those are done, basically swap buffers and continue. There is a little bit of extra processing you'll need to do in swapping buffers to deal with a word that may cross the boundary from one buffer to the next, but it's pretty trivial (basically, if the buffer doesn't end with white space, you're still in a word when you start operating on the next buffer of data).
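A condensed sketch of that two-buffer scheme, using std::async for the counting side. The buffer size, the file name, and the whitespace-only word scan are assumptions; the word that straddles two buffers is carried over as described above:
#include <cctype>
#include <cstddef>
#include <cstdio>
#include <future>
#include <map>
#include <string>
#include <vector>

int main()
{
    const std::size_t kBufSize = 1 << 20;               // 1 MB per buffer
    std::vector<char> buf[2] = { std::vector<char>(kBufSize),
                                 std::vector<char>(kBufSize) };
    std::map<std::string, std::size_t> counts;
    std::string carry;                                  // partial word from the previous buffer

    // Count words in one buffer; returns the trailing partial word (if any).
    auto count_chunk = [&counts](const char* p, std::size_t n, std::string prefix) {
        std::string word = std::move(prefix);
        for (std::size_t i = 0; i < n; ++i) {
            if (std::isspace(static_cast<unsigned char>(p[i]))) {
                if (!word.empty()) { ++counts[word]; word.clear(); }
            } else {
                word += p[i];
            }
        }
        return word;                                    // may straddle into the next buffer
    };

    std::FILE* f = std::fopen("huge.txt", "rb");        // hypothetical file name
    if (!f) return 1;

    int cur = 0;
    std::size_t n = std::fread(buf[cur].data(), 1, kBufSize, f);
    while (n > 0) {
        // Kick off counting of the current buffer...
        auto fut = std::async(std::launch::async, count_chunk,
                              buf[cur].data(), n, carry);
        // ...while this thread reads the next one.
        int nxt = 1 - cur;
        std::size_t n2 = std::fread(buf[nxt].data(), 1, kBufSize, f);
        carry = fut.get();                              // wait, then swap roles
        cur = nxt;
        n = n2;
    }
    if (!carry.empty()) ++counts[carry];                // last word at EOF
    std::fclose(f);
}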
As long as you're sure it'll only be used on a multi-processor (multi-core) machine, using real threads is fine. If there's a chance this might ever be done on a single-core machine, you'd be somewhat better off using a single thread with overlapped I/O instead.

As others have indicated, the bottleneck will be the disk I/O. I therefore suggest that you use overlapped I/O. This basically inverts the program logic. Instead of your code trying to determine when to do I/O, you simply tell the operating system to call your code whenever it has finished a bit of I/O. If you use I/O completion ports, you can even tell the OS to use multiple threads for processing the file chunks.
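For reference, a bare-bones overlapped read on Windows might look like the following; this is a sketch only (a single outstanding read, no completion port, barely any error handling, and the file name is a placeholder):
#include <windows.h>

int main()
{
    HANDLE h = CreateFileA("huge.txt", GENERIC_READ, FILE_SHARE_READ, NULL,
                           OPEN_EXISTING, FILE_FLAG_OVERLAPPED, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    char buf[1 << 16];
    OVERLAPPED ov = {};
    ov.Offset = 0;                               // byte offset of this chunk in the file

    // Issue the read; it may complete asynchronously (ERROR_IO_PENDING).
    if (!ReadFile(h, buf, sizeof buf, NULL, &ov) &&
        GetLastError() != ERROR_IO_PENDING)
        return 1;

    // ... do other work here (e.g. process the previously read chunk) ...

    DWORD bytes = 0;
    GetOverlappedResult(h, &ov, &bytes, TRUE);   // TRUE = wait for completion
    // 'buf' now holds 'bytes' bytes to hand to the word-counting code.

    CloseHandle(h);
}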

A C-based solution?
I think Perl was born for this exact purpose.

A stream has only one cursor. If you access the stream with more than one thread at a time, you cannot be sure of reading where you want, since reads are done from the cursor position.
What I would do is have only one thread (maybe the main one) that reads the stream and dispatches the bytes it reads to other threads.
For example:
Thread #i is ready and asks the main thread to give it the next part,
The main thread reads the next 1 MB and provides it to thread #i,
Thread #i reads the 1 MB and counts words as you want,
Thread #i finishes its work and asks again for the next 1 MB.
This way you can separate stream reading from stream analysis.

What you are looking for is a regex. This Stack Overflow thread on C++ regex engines should help:
C++: what regex library should I use?

First, I'm pretty sure that C/C++ isn't the best way to handle this. Ideally, you'd use some map/reduce for parallelism, too.
But, assuming your constraints, here's what I'd do.
1) Split the text file into smaller chunks. You don't have to do this by the first letter of the word. Just break them up into, say, 5000-word chunks. In pseudocode, you'd do something like this:
index = 0
numwords = 0
mysplitfile = openfile(index + "-split.txt")
while (bigfile >> word)
    mysplitfile << word
    numwords++
    if (numwords > 5000)
        mysplitfile.close()
        index++
        numwords = 0
        mysplitfile = openfile(index + "-split.txt")
2) Use a shared map data structure and pthreads to spawn new threads to read each of the subfiles. Again, pseudocode:
maplock = create_pthread_lock()
sharedmap = std::map()
for every index-split.txt file:
    spawn-new-thread(myfunction, filename, sharedmap, maplock)
wait-for-all-threads-to-finish()
dump_map(sharedmap)

void myfunction(filename, sharedmap, lock) {
    localmap = std::map<string, size_t>();
    file = openfile(filename)
    while (file >> word)
        if !localmap.contains(word)
            localmap[word] = 0
        localmap[word]++
    acquire(lock)
    for key,value in localmap
        if !sharedmap.contains(key)
            sharedmap[key] = 0
        sharedmap[key] += value
    release(lock)
}
Sorry for the syntax. I've been writing a lot of Python lately.

Not C, and a bit UGLY, but it took only 2 minutes to bang out:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq
Loop over each line with -n
Split each line into @F words with -a
Each $_ word increments hash %h
Once the END of file has been reached,
sort the hash by the frequency $h{$b}<=>$h{$a}
If two frequencies are identical, sort alphabetically $a cmp $b
Print the frequency $h{$w} and the word $w
Redirect the results to file 'freq'
I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in 173 seconds.
My input file already had punctuation stripped out, and uppercase converted to lowercase, using this bit of code:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file
(runtime of 144 seconds)
The word-counting script could alternately be written in awk:
awk '{for (i=1; i<=NF; i++){h[$i]++}} END{for (w in h){printf("%s\t%s\n", h[w], w)}}' file | sort -rn > freq

Related

Why is reading a big text file in parallel bad?

I have a big txt file with ~30 million rows, each row separated by a line separator \n, and I'd like to read all lines into an unordered list (e.g. std::list<std::string>).
std::list<std::string> list;
std::ifstream file(path);
while(file.good())
{
    std::string tmp;
    std::getline(file, tmp);
    list.emplace_back(tmp);
}
process_data(list);
The current implementation is very slow, so I'm learning how to read data by chunk.
But after seeing this comment:
parallelizing on a HDD will make things worse, with the impact depending on the distribution of the files on the HDD. On a SSD it might (!) improve things.
Is it bad to read a file in parallel? What's the algorithm to read all lines of a file to an unordered container (e.g. std::list, normal array,...) as fast as possible, without using any libraries, and the code must be cross-platform?
Is it bad to read a file in parallel? What's the algorithm to read all lines of a file to an unordered container (e.g. std::list, normal array,...) as fast as possible, without using any libraries, and the code must be cross-platform?
I guess I'll attempt to answer this one to avoid spamming the comments. I have, in multiple scenarios, sped up text file parsing substantially using multithreading. However, the keyword here is parsing, not disk I/O (though just about any text file read involves some level of parsing). Now first things first:
VTune here was telling me that my top hotspots were in parsing (sorry, this image was taken years ago and I didn't expand the call graph to show what inside obj_load was taking most of the time, but it was sscanf). This profiling session actually surprised me quite a bit. In spite of having been profiling for decades to the point where my hunches aren't too inaccurate (not accurate enough to avoid profiling, mind you, not even close, but I've tuned my sort of intuitive spider senses enough to where profiling sessions usually don't surprise me that much even without any glaring algorithmic inefficiencies -- though I might still be off about exactly why they exist since I'm not so good at assembly).
Yet this time I was really taken aback and shocked, so this example has always been one I used to show even the most skeptical colleagues who don't want to use profilers why profiling is so important. Some of them are actually good at guessing where hotspots exist and some were actually creating very competent-performing solutions in spite of never having used them, but none of them were good at guessing what isn't a hotspot, and none of them could draw a call graph based on their hunches. So I always liked to use this example to try to convert the skeptics and get them to spend a day just trying out VTune (and we had a boatload of free licenses from Intel who worked with us which were largely going to waste on our team which I thought was a tragedy since VTune is a really expensive piece of software).
And the reason I was taken aback this time was not because I was surprised by the sscanf hotspot. That's kind of a no-brainer that non-trivial parsing of epic text files is going to generally be bottlenecked by string parsing. I could have guessed that. My colleagues who never touched a profiler could have guessed that. What I couldn't have guessed was how much of a bottleneck it was. I thought given the fact that I was loading millions of polygons and vertices, texture coordinates, normals, creating edges and finding adjacency data, using index FOR compression, associating materials from the MTL file to the polygons, reverse engineering object normals stored in the OBJ file and consolidating them to form edge creasing, etc., I would at least have a good chunk of the time distributed in the mesh system as well (I would have guessed 25-33% of the time spent in the mesh engine).
Turned out the mesh system took barely any time to my most pleasant surprise, and there my hunches were completely off about it specifically. It was, by far, parsing that was the uber bottleneck (not disk I/O, not the mesh engine).
So that's when I applied this optimization to multithread the parsing, and there it helped a lot. I even initially started off with a very modest multithreaded implementation which barely did any parsing except scanning the character buffers for line endings in each thread just to end up parsing in the loading thread, and that already helped by a decent amount (reduced the operation from 16 seconds to about 14 IIRC, and I eventually got it down to ~8 seconds and that was on an i3 with just two cores and hyperthreading). So anyway, yeah, you can probably make things faster with multithreaded parsing of character buffers you read in from text files in a single thread. I wouldn't use threads as a way to make disk I/O any faster.
I'm reading the characters from the file in binary into big char buffers in a single thread, then, using a parallel loop, have the threads figure out integer ranges for the lines in that buffer.
// Stores all the characters read in from the file in big chunks.
// This is shared for read-only access across threads.
vector<char> buffer;
// Local to a thread:
// Stores the starting position of each line.
vector<size_t> line_start;
// Stores the assigned buffer range for the thread:
size_t buffer_start, buffer_end;
Basically like so:
LINE1 and LINE2 are considered to belong to THREAD 1, while LINE3 is considered to belong to THREAD 2. LINE6 is not considered to belong to any thread since it doesn't have an EOL. Instead the characters of LINE6 will be combined with the next chunky buffer read from the file.
Each thread begins by looking at the first character in its assigned character buffer range. Then it works backwards until it finds an EOL or reaches the beginning of the buffer. After that it works forward and parses each line, looking for EOLs and doing whatever else we want, until it reaches the end of its assigned character buffer range. The last "incomplete line" is not processed by the thread, but instead the next thread (or if the thread is the last thread, then it is processed on the next big chunky buffer read by the first thread). The diagram is teeny (couldn't fit much) but I read in the character buffers from the file in the loading thread in big chunks (megabytes) before the threads parse them in parallel loops, and each thread might then parse thousands of lines from its designated buffer range.
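A sketch of that boundary handling, assuming '\n' is the EOL character and that a line "belongs" to the thread whose range contains its terminating EOL (my reading of the scheme above); buffer, buffer_start, buffer_end and line_start are the variables from the earlier snippet:
#include <cstddef>
#include <vector>

// Record the start of every line whose EOL falls inside [buffer_start, buffer_end).
// The trailing line with no EOL at all is left for the next big chunky read.
void collect_lines(const std::vector<char>& buffer,
                   std::size_t buffer_start, std::size_t buffer_end,
                   std::vector<std::size_t>& line_start)
{
    // Work backwards until an EOL (or the start of the buffer) so we begin
    // on a real line boundary, even if buffer_start landed mid-line.
    std::size_t pos = buffer_start;
    while (pos > 0 && buffer[pos - 1] != '\n')
        --pos;

    // Work forwards, one line at a time.
    while (pos < buffer.size()) {
        std::size_t eol = pos;
        while (eol < buffer.size() && buffer[eol] != '\n')
            ++eol;
        if (eol >= buffer.size())
            break;                       // incomplete line: next chunk's problem
        if (eol >= buffer_end)
            break;                       // its EOL is in the next thread's range
        line_start.push_back(pos);       // this line is ours to parse
        pos = eol + 1;
    }
}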
std::list<std::string> list;
std::ifstream file(path);
while(file.good())
{
    std::string tmp;
    std::getline(file, tmp);
    list.emplace_back(tmp);
}
process_data(list);
Kind of echoing Veedrac's comments, storing your lines in std::list<std::string> is not a good idea if you really want to load an epic number of lines quickly. That would actually be a bigger priority to address than multithreading. I'd turn that into just std::vector<char> all_lines storing all the strings, and you can use std::vector<size_t> line_start to store the starting position of the nth line, which you can retrieve like so:
// note that 'line' will be EOL-terminated rather than null-terminated
// if it points to the original buffer.
const char* line = all_lines.data() + line_start[n];
The immediate problem with std::list without a custom allocator is a heap allocation per node. On top of that we're wasting memory storing two extra pointers per line. std::string is problematic here because SBO optimizations to avoid heap allocation would make it take too much memory for small strings (and thereby increase cache misses) or still end up invoking heap allocations for every non-small string. So you end up avoiding all these problems just storing everything in one giant char buffer, like in std::vector<char>. I/O streams, including stringstreams and functions like getline, are also horrible for performance, just awful, in ways that really disappointed me at first since my first OBJ loader used those and it was over 20 times slower than the second version where I ported all those I/O stream operators and functions and use of std::string to make use of C functions and my own hand-rolled stuff operating on char buffers. When it comes to parsing in performance-critical contexts, C functions like sscanf and memchr and plain old character buffers tend to be so much faster than the C++ ways of doing it, but you can at least still use std::vector<char> to store huge buffers, e.g., to avoid dealing with malloc/free and get some debug-build sanity checks when accessing the character buffer stored inside.
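A sketch of that layout: read the whole file into one big char buffer and index the line starts with memchr (reading the entire file in one go is an assumption; chunked reads work the same way, and the file name is a placeholder):
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    std::FILE* f = std::fopen("big.txt", "rb");          // hypothetical file name
    if (!f) return 1;
    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);

    std::vector<char> all_lines(static_cast<std::size_t>(size));
    std::fread(all_lines.data(), 1, all_lines.size(), f);
    std::fclose(f);

    // Index the start of every line instead of copying each line out.
    std::vector<std::size_t> line_start;
    const char* base = all_lines.data();
    const char* p = base;
    const char* end = base + all_lines.size();
    while (p < end) {
        line_start.push_back(static_cast<std::size_t>(p - base));
        const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (!nl) break;
        p = nl + 1;
    }

    // nth line (EOL-terminated rather than null-terminated):
    // const char* line = all_lines.data() + line_start[n];
}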

Read a big file to count the number of words that repeat K times

Problem
There is a huge file (10 GB); one has to read the file and print out the number of words that repeat exactly k times in the file
My Solution
Use ifstream to read the file word by word;
Insert each word into a map: std::map<std::string, long> mp; mp[word] += 1;
Once the file is read, find all the words in the map to get the words occurring k times
Question
How can multi-threading be used to read the file effectively [read by chunk]? OR
Any method to improve the read speed.
Is there any better data structure than a map that can be employed to produce the output effectively?
File info
each line can be a maximum of 500 words long
each word can be a maximum of 100 chars long
How can multi-threading be used to read the file effectively [read by chunk]? OR Any method to improve the read speed.
I've been trying out the actual results and it's a good thing to multithread, unlike my previous advice here. The un-threaded variant runs in 1m44,711s, the 4-thread one (on 4 cores) runs in 0m31,559s and the 8-thread one (on 4 cores + HT) runs in 0m23,435s. Major improvement then - almost a factor of 5 in speedup.
So, how do you split up the workload? Split it into N chunks (N == thread count) and have each thread except the first seek to the first non-word character. That is the start of its logical chunk. Its logical chunk ends at its end boundary, rounded up to the first non-word character after that point.
Process these blocks in parallel, sync them all to one thread, and then make that thread do the merge of results.
Best next thing you can do to improve speed of reading is to ensure you don't copy data when possible. Read through a memory-mapped file and find strings by keeping pointers or indices to the start and end, instead of accumulating bytes.
Is there any better data structure than a map that can be employed to produce the output effectively?
Well, because I don't think you'll be using the order, unordered_map is a better choice. I would also make it an unordered_map<std::string_view, size_t> - string_view copies even less than string would.
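A sketch of that counting step over a buffer that is already in memory (for example the memory-mapped file), assuming C++17 for std::string_view and whitespace-delimited words. The views point into the buffer, so the buffer must outlive the map:
#include <cctype>
#include <cstddef>
#include <string_view>
#include <unordered_map>

// Count how many distinct words occur exactly k times in 'data'.
std::size_t words_repeated_k_times(std::string_view data, std::size_t k)
{
    std::unordered_map<std::string_view, std::size_t> counts;
    std::size_t i = 0;
    while (i < data.size()) {
        while (i < data.size() && std::isspace(static_cast<unsigned char>(data[i])))
            ++i;                                   // skip whitespace
        std::size_t start = i;
        while (i < data.size() && !std::isspace(static_cast<unsigned char>(data[i])))
            ++i;                                   // scan one word
        if (i > start)
            ++counts[data.substr(start, i - start)];   // no string copy made
    }
    std::size_t result = 0;
    for (const auto& kv : counts)
        if (kv.second == k)
            ++result;
    return result;
}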
On profiling I find that 53% of time is spent in finding the exact bucket that holds a given word.
If you have a 64-bit system then you can memory-map the file, and use e.g. this solution to read from memory.
Combine with the answer from dascandy regarding std::unordered_map and std::string_view (if you have it), and you should be as fast as you can get in a single thread. You could use std::unordered_multiset instead of std::unordered_map, which one is "faster" I don't know.
Using threads is simple: just do what you do now, but have each thread handle only part of the file. Merge the maps after all threads are done. But when you split the file into chunks for each thread, you risk splitting words in the middle. Handling this is not trivial.

What is the appropriate madvise setting for reading a file backwards?

I am using gcc 4.7.2 on a 64-bit Linux box.
I have 20 large sorted binary POD files that I need to read as a part of the final merge in an external merge-sort.
Normally, I would mmap all the files for reading and use a multiset<T,LessThan> to manage the merge sort from small to large, before doing a mmap write out to disk.
However, I realised that if I keep a std::mutex on each of these files, I can create a second thread which reads the file backwards, and sort from large to small at the same time. If I decide, beforehand, that the first thread will take exactly n/2 elements and the second thread will take the rest, I will have no need for a mutex on the output end of things.
Reading lock contentions can be expected to occur, on average, maybe 1 time in 20 in this particular case, so that's acceptable.
Now, here's my question. In the first case, it is obvious that I should call madvise with MADV_SEQUENTIAL, but I have no idea what I should do for the second case, where I'm reading the file backwards.
I see no MADV_REVERSE in the man pages. Should I use MADV_NORMAL or maybe don't call madvise at all?
Recall that an external sort is needed when the volume of data is so large that it will not fit into memory. So we are left with a more complex algorithm to use disk as a temporary store. Divide-and-conquer algorithms will usually involve breaking up the data, doing partial sorts, and then merging the partial sorts.
My steps for an external merge-sort
Take n=1 billion random numbers and break them into 20 shards of equal sizes.
Sort each shard individually from small to large, and write each out into its own file.
Open 40 mmap's, 2 for each file, one for going forward, one for going backwards, associate a mutex with each file.
Instantiate a std::multiset<T,LessThan> buff_fwd; for the forward thread and a std::multiset<T,GreaterThan> buff_rev for the reverse thread. Some people prefer to use priority queues here, but essentially, any sort-on-insert container will work here.
I like to call the two buffers surface and rockbottom, representing the smallest and largest numbers not yet added to the final sort.
Add items from the shards until n/2 is used up, and flush the shards to one output file using mmap from the beginning towards the middle in one thread, and from the end towards the middle in the other thread. You can basically flush at will, but at least one flush should happen before either buffer uses up too much memory.
I would suggest:
MADV_RANDOM
To prevent useless read-ahead (which is in the wrong direction).
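For reference, a sketch of the calls involved, following the two-mappings-per-file setup described above (the shard file name is a placeholder):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("shard_00.bin", O_RDONLY);    // hypothetical shard file
    if (fd < 0) return 1;
    struct stat st;
    if (fstat(fd, &st) != 0) return 1;

    // Two read-only mappings of the same file, one per direction.
    void* fwd = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    void* rev = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (fwd == MAP_FAILED || rev == MAP_FAILED) return 1;

    madvise(fwd, st.st_size, MADV_SEQUENTIAL);  // forward pass: aggressive read-ahead
    madvise(rev, st.st_size, MADV_RANDOM);      // backward pass: no wrong-direction read-ahead

    // ... merge from both ends ...

    munmap(fwd, st.st_size);
    munmap(rev, st.st_size);
    close(fd);
}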

remove all duplicate records efficiently

I have a file which might be 30 GB or more. Each line in this file is called a record and is composed of 2 columns, which go like this
id1 id2
Both of these ids are 32-bit integers. My job is to write a program to remove all the duplicate records, make the records unique, and finally output the unique id2 values into a file.
There are some constraints: 30 GB of memory is allowed at most, and the job had better be done efficiently by a non-multithread/process program.
Initially I came up with an idea: because of the memory constraints, I decided to read the file n times, each time only keeping in memory those records with id1 % n = i (i = 0,1,2,...,n-1). The data structure I use is a std::map<int, std::set<int> >; it takes id1 as the key, and puts id2 in id1's std::set.
This way, the memory constraints will not be violated, but it's quite slow. I think it's because as the std::map and std::set grow larger, the insertion speed goes down. Moreover, I need to read the file n times, and when each round is done, I have to clear the std::map for the next round, which also costs some time.
I also tried hashing, but it doesn't satisfy me either; I thought there might be too many collisions even with 3 million buckets.
So I'm posting my problem here, hoping you guys can offer me a better data structure or algorithm.
Thanks a lot.
PS
Scripts (shell, Python) are welcome, if they can do it efficiently.
Unless I overlooked a requirement, it should be possible to do this on the Linux shell as
sort -u inputfile > outputfile
Many implementations enable you to use sort in a parallelised manner as well:
sort --parallel=4 -u inputfile > outputfile
for up to four parallel executions.
Note that sort might use a lot of space in /tmp temporarily. If you run out of disk space there, you may use the -T option to point it to an alternative place on disk to use as temporary directory.
(Edit:) A few remarks about efficiency:
A significant portion of the time spent during execution (of any solution to your problem) will be spent on IO, something that sort is highly optimised for.
Unless you have a very large amount of RAM, your solution is likely to end up performing some of the work on disk (just like sort). Again, optimising this means a lot of work, while for sort all of that work has been done.
One disadvantage of sort is that it operates on string representations of the input lines. If you were to write your own code, one thing you could do (similar to what you suggested already) is to convert the input lines to 64-bit integers and hash them. If you have enough RAM, that may be a way to beat sort in terms of speed, if you get IO and integer conversions to be really fast. I suspect it may not be worth the effort as sort is easy to use and – I think – fast enough.
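A sketch of that in-memory variant: pack each (id1, id2) pair into one 64-bit key and dedupe with a hash set. The file names are placeholders, it assumes the distinct keys fit within the 30 GB budget, and the second set is only there because the task asks for unique id2 values (drop it if per-record dedup is all you need):
#include <cstdint>
#include <cstdio>
#include <unordered_set>

int main()
{
    std::FILE* in = std::fopen("records.txt", "r");     // hypothetical file names
    std::FILE* out = std::fopen("unique_id2.txt", "w");
    if (!in || !out) return 1;

    std::unordered_set<std::uint64_t> seen;             // all distinct (id1,id2) pairs
    std::unordered_set<std::uint32_t> id2_out;          // id2 values already printed

    unsigned int id1, id2;
    while (std::fscanf(in, "%u %u", &id1, &id2) == 2) {
        std::uint64_t key = (static_cast<std::uint64_t>(id1) << 32) | id2;
        if (!seen.insert(key).second)
            continue;                                    // duplicate record, skip it
        if (id2_out.insert(id2).second)
            std::fprintf(out, "%u\n", id2);              // first time we see this id2
    }
    std::fclose(in);
    std::fclose(out);
}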
I just don't think you can do this efficiently without using a bunch of disk. Any form of data structure will introduce so much memory and/or storage overhead that your algorithm will suffer. So I would expect a sorting solution to be best here.
I reckon you can sort large chunks of the file at a time, and then merge (ie from merge-sort) those chunks after. After sorting a chunk, obviously it has to go back to disk. You could just replace the data in the input file (assuming it's binary), or write to a temporary file.
As far as the records, you just have a bunch of 64-bit values. With 30GB RAM, you can hold almost 4 billion records at a time. That's pretty sweet. You could sort that many in-place with quicksort, or half that many with mergesort. You probably won't get a contiguous block of memory that size. So you're going to have to break it up. That will make quicksort a little trickier, so you might want to use mergesort in RAM as well.
During the final merge it's trivial to discard duplicates. The merge might be entirely file-based, but at worst you'll use an amount of disk equivalent to twice the number of records in the input file (one file for scratch and one file for output). If you can use the input file as scratch, then you have not exceeded your RAM limits OR your disk limits (if any).
I think the key here is the requirement that it shouldn't be multithreaded. That lends itself well to disk-based storage. The bulk of your time is going to be spent on disk access. So you wanna make sure you do that as efficiently as possible. In particular, when you're merge-sorting you want to minimize the amount of seeking. You have large amounts of memory as buffer, so I'm sure you can make that very efficient.
So let's say your file is 60GB (and I assume it's binary) so there's around 8 billion records. If you're merge-sorting in RAM, you can process 15GB at a time. That amounts to reading and (over)writing the file once. Now there are four chunks. If you want to do pure merge-sort then you always deal with just two arrays. That means you read and write the file two more times: once to merge each 15GB chunk into 30GB, and one final merge on those (including discarding of duplicates).
I don't think that's too bad. Three times in and out. If you figure out a nice way to quicksort then you can probably do this with one fewer pass through the file. I imagine a data structure like deque would work well, as it can handle non-contiguous chunks of memory... But you'd probably wanna build your own and finely tune your sorting algorithm to exploit it.
Instead of std::map<int, std::set<int> > use std::unordered_multimap<int,int>. If you cannot use C++11 - write your own.
The std::map is node based and it calls malloc on each insertion; this is probably why it is slow. With an unordered map (hash table), if you know the number of records, you can pre-allocate. Even if you don't, the number of mallocs will be O(log N) instead of O(N) with std::map.
I can bet this will be several times faster and more memory efficient than using external sort -u.
This approach may help when there are not too many duplicate records in the file.
1st pass. Allocate most of the memory for a Bloom filter. Hash every pair from the input file and put the result into the Bloom filter. Write each duplicate found by the Bloom filter into a temporary file (this file will also contain some false positives, which are not actual duplicates).
2nd pass. Load temporary file and construct a map from its records. Key is std::pair<int,int>, value is a boolean flag. This map may be implemented either as std::unordered_map/boost::unordered_map, or as std::map.
3rd pass. Read input file again, search each record in the map, output its id2 if either not found or flag is not yet set, then set this flag.
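A minimal Bloom filter for the 1st pass might look like this; the two hash functions are derived from std::hash by double hashing, and the bit count and number of hashes are knobs you would tune for your target false-positive rate:
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

class BloomFilter {
public:
    explicit BloomFilter(std::size_t bits) : bits_(bits, false) {}

    // Returns true if the key was *possibly* seen before (may be a false
    // positive), false if it was definitely new. Inserts the key either way.
    bool test_and_insert(std::uint64_t key)
    {
        bool possibly_seen = true;
        std::uint64_t h1 = std::hash<std::uint64_t>{}(key);
        std::uint64_t h2 = std::hash<std::uint64_t>{}(key ^ 0x9e3779b97f4a7c15ULL);
        for (int i = 0; i < kHashes; ++i) {
            std::size_t idx = static_cast<std::size_t>((h1 + i * h2) % bits_.size());
            if (!bits_[idx]) {
                possibly_seen = false;
                bits_[idx] = true;
            }
        }
        return possibly_seen;
    }

private:
    static const int kHashes = 4;
    std::vector<bool> bits_;
};

// Usage in the 1st pass: pack (id1, id2) into one 64-bit key, and write the
// record to the temporary file whenever test_and_insert() returns true.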

Search for multiple words in very large text files (10 GB) using C++ the fastest way

I have this program where I have to search for specific values and their line numbers in a very large text file, and there might be multiple occurrences of the same value.
I've tried a simple C++ program which reads the text file line by line and searches for the value using strstr, but it's taking a very long time.
I also tried to use a system command using grep, but it's still taking a lot of time - not as long as before, but still too much.
I was searching for a library I can use to speed up the search.
Any help and suggestions? Thank you :)
There are two issues concerning the speed: the time it takes to actually read the data, and the time it takes to search.
Generally speaking, the fastest way to read a file is to mmap it (or the equivalent under Windows). This can get complicated if the entire file won't fit into the address space, but you mention 10GB in the header; if searching is all you do in the program, this shouldn't create any problems.
More generally, if speed is a problem, avoid using getline on a string. Reading large blocks, and picking the lines up (as char[]) out of them, without copying, is significantly faster. (As a simple compromise, you may want to copy when a line crosses a block boundary. If you're dealing with blocks of a MB or more, this shouldn't be too often; I've used this technique on older, 16 bit machines, with blocks of 32KB, and still gotten a significant performance improvement.)
With regards to searching, if you're searching for a single, fixed string (not a regular expression or other pattern matching), you might want to try a BM search. If the string you're searching for is reasonably long, this can make a significant difference over other search algorithms. (I think that some implementations of grep will use this if the search pattern is in fact a fixed string, and is sufficiently long for it to make a difference.)
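If C++17 is available, the standard library already ships a Boyer-Moore searcher, so a fixed-string scan over a big block (or a memory-mapped file) can be a sketch like this:
#include <algorithm>
#include <cstddef>
#include <functional>
#include <string>

// Count every occurrence of 'needle' in 'block' (e.g. one big chunk read from
// the file, or the whole memory-mapped file).
std::size_t count_occurrences(const std::string& block, const std::string& needle)
{
    std::boyer_moore_searcher searcher(needle.begin(), needle.end());
    std::size_t hits = 0;
    auto it = block.begin();
    while ((it = std::search(it, block.end(), searcher)) != block.end()) {
        ++hits;
        ++it;                     // continue just past this match
    }
    return hits;
}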
Use multiple threads. Each thread can be responsible for searching through a portion of the file. For example, on a 4 core machine spawn 12 threads. The first thread looks through the first 8% of the file, the second thread the second 8% of the file, etc. You will want to tune the number of threads per core to keep the CPU maximally utilized. Since this is an I/O bound operation you may never reach 100% CPU utilization.
Feeding data to the threads will be a bottleneck using this design. Memory mapping the file might help somewhat, but at the end of the day the disk can only read one sector at a time. This will be a bottleneck that you will be hard pressed to resolve. You might consider starting one thread that does nothing but read all the data into memory and kicks off search threads as the data loads up.
Since files are sequential beasts, searching from start to end is something you may not get around; however, there are a couple of things you could do.
If the data is static, you could generate a smaller lookup file (alternatively, one with offsets into the main file); this works well if the same string is repeated multiple times, making the index file much smaller. If the file is dynamic, you may need to regenerate the index file occasionally (offline).
Instead of reading line by line, read larger chunks from the file, like several MB, to speed up I/O.
If you'd like to use a library, you could use Xapian.
You may also want to try tokenizing your text before doing the search, and I'd also suggest you try regex too, but it will take a long time if you don't have an index on that text, so I'd definitely suggest you try Xapian or some other search engine.
If your big text file does not change often then create a database (for example SQLite) with a table:
create table word_line_numbers
(word varchar(100), line_number integer);
Read your file and insert a record in database for every word with something like this:
insert into word_line_numbers(word, line_number) values ('foo', 13452);
insert into word_line_numbers(word, line_number) values ('foo', 13421);
insert into word_line_numbers(word, line_number) values ('bar', 1421);
Create an index of words:
create index word_line_numbers_idx on word_line_numbers(word);
And then you can find line numbers for words fast using this index:
select line_number from word_line_numbers where word='foo';
For added speed (because of the smaller database size), at the cost of some extra complexity, you can use 2 tables: words(word_id integer primary key, word not null) and word_lines(word_id integer not null references words, line_number integer not null).
I'd try first loading as much of the file into the RAM as possible (memory mapping of the file is a good option) and then search concurrently in parts of it on multiple processors. You'll need to take special care near the buffer boundaries to make sure you aren't missing any words. Also, you may want to try something more efficient than the typical strstr(), see these:
Boyer–Moore string search algorithm
Knuth–Morris–Pratt algorithm