Reading text into strings vs. directly initializing strings with text - C++

I want to store many lines of text in an array of strings (the text is always the same). I can think of two ways to do so:
One way:
string s[100]={"first_line","second_line",...,"100th_line"};
The other way would be
string s[100];
fstream fin("text.txt");
for (int i = 0; i < 100; i++)
    getline(fin, s[i]);
text.txt:
first_line
second_line
...
100th_line
The actual number of lines will be around 500, and each line will be 50-60 characters long.
So my question is: which way is faster/better?
L.E.: How can I put the text from the first method in another file and still be able to use the string array s in my source.cpp? I want to do this because I don't want my source.cpp to get messy from all those lines of initialization.

Here are some latency numbers every programmer should know:
memory read from cache: 0.5-7 nanoseconds
memory read from main memory: 100 nanoseconds
SSD disk access: 150 000 nanoseconds (to reach the location to read)
hard disk access: 10 000 000 nanoseconds (to reach the location to read)
So what's fastest for you?
The first version will always be faster: the text is loaded together with your executable (no access overhead), and the string objects are constructed in memory (see assembly code online).
The second version will require several disk accesses (at least one to open the current directory and one to access the file), a couple of operating system actions (e.g. access control), not to forget the buffering of the input in memory. Only then would the string objects be created in memory as in the first version.
Fortunately, users don't notice nanoseconds and will probably not realize the difference: the human eye requires 13 ms to identify an image, and the reaction time from eye to mouse is around 215 ms (215 000 000 nanoseconds).
So, my advice: no premature optimization. Focus on functionality (easy customization of content) and maintainability (e.g. easy localization if the software is used in several languages) before going too deep into performance.

In the grand scheme of things, with only 500 relatively short strings, which approach is better is mostly an academic question, with little practical difference.
But, if one wants to be picky, reading it from a file requires a little bit more work at runtime than just immediately initializing the string array. Also, you have to prepare for the possibility that the initialization file is missing, and handle that possibility, in some way.
Compiling in the initial string values as part of the code avoids the need to do some error handling, and saves a little bit of time. The biggest win will be the lack of a need to handle the possibility that the initialization file is missing. There's a direct relationship between the likelihood that something might go wrong and the actual number of things that could potentially go wrong.

I'd go with the first one, since it's constructing the strings directly inside the array, which is practically emplace (or perhaps it's moving them; if so I might be wrong), without any more operations, so it's probably much better than reading from the hard disk and then doing the same procedure as the first method.

If the data does not change then hard code it into a source file.
If you ever need to change the data, for testing or maintenance, the data should be placed into a file.
Don't worry about execution speed until somebody complains. Concentrate your efforts on robust and easily readable code. Most of the time spent with applications is in maintenance. If you make your code easy to maintain, you will spend less time maintaining it.
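Regarding the L.E. about keeping source.cpp tidy while still compiling the text in: one option (a minimal sketch with hypothetical file names, not something taken from the answers above) is to put nothing but the initializer list into its own file and #include it between the braces, since the preprocessor pastes it in before compilation:
// lines.inc (hypothetical) -- contains only the comma-separated string literals
"first_line",
"second_line",
// ...
"100th_line"

// source.cpp
#include <string>

const std::string s[] = {
#include "lines.inc"    // the literals are pasted here at preprocessing time
};
The data still ends up compiled into the binary, exactly as in the first method, but the bulky initializer lives in its own file.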

Why is reading big text file in parallel bad?

I have a big txt file with ~30 million rows, each row separated by a line separator \n. And I'd like to read all lines into an unordered list (e.g. std::list<std::string>).
std::list<std::string> list;
std::ifstream file(path);
while (file.good())
{
    std::string tmp;
    std::getline(file, tmp);
    list.emplace_back(tmp);
}
process_data(list);
The current implementation is very slow, so I'm learning how to read the data in chunks.
But after seeing this comment:
parallelizing on a HDD will make things worse, with the impact depending on the distribution of the files on the HDD. On a SSD it might (!) improve things.
Is it bad to read a file in parallel? What's the algorithm to read all lines of a file to an unordered container (e.g. std::list, normal array,...) as fast as possible, without using any libraries, and the code must be cross-platform?
Is it bad to read a file in parallel? What's the algorithm to read all lines of a file to an unordered container (e.g. std::list, normal array, ...) as fast as possible, without using any libraries, and the code must be cross-platform?
I guess I'll attempt to answer this one to avoid spamming the comments. I have, in multiple scenarios, sped up text file parsing substantially using multithreading. However, the keyword here is parsing, not disk I/O (though just about any text file read involves some level of parsing). Now first things first:
VTune was telling me that my top hotspots were in parsing (sorry, the screenshot was taken years ago and I didn't expand the call graph to show what inside obj_load was taking most of the time, but it was sscanf). This profiling session actually surprised me quite a bit, in spite of my having been profiling for decades, to the point where my hunches aren't too inaccurate (not accurate enough to avoid profiling, mind you, not even close, but I've tuned my sort of intuitive spider senses enough that profiling sessions usually don't surprise me that much, even without any glaring algorithmic inefficiencies -- though I might still be off about exactly why they exist, since I'm not so good at assembly).
Yet this time I was really taken aback and shocked, so this example has always been one I used to show even the most skeptical colleagues, who didn't want to use profilers, why profiling is so important. Some of them were actually good at guessing where hotspots exist, and some were actually creating very competent-performing solutions in spite of never having used a profiler, but none of them were good at guessing what isn't a hotspot, and none of them could draw a call graph based on their hunches. So I always liked to use this example to try to convert the skeptics and get them to spend a day just trying out VTune (and we had a boatload of free licenses from Intel, who worked with us, which were largely going to waste on our team -- which I thought was a tragedy, since VTune is a really expensive piece of software).
And the reason I was taken aback this time was not because I was surprised by the sscanf hotspot. That's kind of a no-brainer: non-trivial parsing of epic text files is generally going to be bottlenecked by string parsing. I could have guessed that. My colleagues who never touched a profiler could have guessed that. What I couldn't have guessed was how much of a bottleneck it was. I thought that, given the fact that I was loading millions of polygons and vertices, texture coordinates, normals, creating edges and finding adjacency data, using index FOR compression, associating materials from the MTL file to the polygons, reverse engineering object normals stored in the OBJ file and consolidating them to form edge creasing, etc., I would at least have a good chunk of the time distributed in the mesh system as well (I would have guessed 25-33% of the time spent in the mesh engine).
Turned out the mesh system took barely any time to my most pleasant surprise, and there my hunches were completely off about it specifically. It was, by far, parsing that was the uber bottleneck (not disk I/O, not the mesh engine).
So that's when I applied this optimization to multithread the parsing, and there it helped a lot. I even initially started off with a very modest multithreaded implementation which barely did any parsing except scanning the character buffers for line endings in each thread just to end up parsing in the loading thread, and that already helped by a decent amount (reduced the operation from 16 seconds to about 14 IIRC, and I eventually got it down to ~8 seconds and that was on an i3 with just two cores and hyperthreading). So anyway, yeah, you can probably make things faster with multithreaded parsing of character buffers you read in from text files in a single thread. I wouldn't use threads as a way to make disk I/O any faster.
I'm reading the characters from the file in binary into big char buffers in a single thread, then, using a parallel loop, have the threads figure out integer ranges for the lines in that buffer.
// Stores all the characters read in from the file in big chunks.
// This is shared for read-only access across threads.
vector<char> buffer;
// Local to a thread:
// Stores the starting position of each line.
vector<size_t> line_start;
// Stores the assigned buffer range for the thread:
size_t buffer_start, buffer_end;
Basically, the buffer is divided up like so:
LINE1 and LINE2 are considered to belong to THREAD 1, while LINE3 is considered to belong to THREAD 2. LINE6 is not considered to belong to any thread since it doesn't have an EOL. Instead the characters of LINE6 will be combined with the next chunky buffer read from the file.
Each thread begins by looking at the first character in its assigned character buffer range. Then it works backwards until it finds an EOL or reaches the beginning of the buffer. After that it works forward and parses each line, looking for EOLs and doing whatever else we want, until it reaches the end of its assigned character buffer range. The last "incomplete line" is not processed by the thread, but instead the next thread (or if the thread is the last thread, then it is processed on the next big chunky buffer read by the first thread). The diagram is teeny (couldn't fit much) but I read in the character buffers from the file in the loading thread in big chunks (megabytes) before the threads parse them in parallel loops, and each thread might then parse thousands of lines from its designated buffer range.
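A stripped-down sketch of that scheme (the function and variable names are mine, not the original code; chunked reading and the actual per-line parsing are omitted) might look like this:
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread records the start offset of every line whose EOL falls inside
// its assigned range of the shared, read-only character buffer.
void find_line_starts(const std::vector<char>& buffer,
                      std::size_t num_threads,
                      std::vector<std::vector<std::size_t>>& per_thread_starts)
{
    per_thread_starts.assign(num_threads, {});
    std::vector<std::thread> threads;
    const std::size_t chunk = buffer.size() / num_threads + 1;

    for (std::size_t t = 0; t < num_threads; ++t)
    {
        threads.emplace_back([&buffer, &per_thread_starts, chunk, t]
        {
            const std::size_t begin = std::min(buffer.size(), t * chunk);
            const std::size_t end = std::min(buffer.size(), begin + chunk);
            std::vector<std::size_t>& starts = per_thread_starts[t];

            // Work backwards from the start of the range to the previous EOL
            // (or the beginning of the buffer) to find where the first line begins.
            std::size_t line_begin = begin;
            while (line_begin > 0 && buffer[line_begin - 1] != '\n')
                --line_begin;

            // Every line whose '\n' lies inside [begin, end) belongs to this thread.
            for (std::size_t i = begin; i < end; ++i)
            {
                if (buffer[i] == '\n')
                {
                    starts.push_back(line_begin);   // line spans [line_begin, i)
                    line_begin = i + 1;
                }
            }
            // Anything after the last EOL in the range is an incomplete line and
            // is left for the next thread, or the next chunk read from the file.
        });
    }
    for (std::thread& th : threads)
        th.join();
}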
std::list<std::string> list;
std::ifstream file(path);
while (file.good())
{
    std::string tmp;
    std::getline(file, tmp);
    list.emplace_back(tmp);
}
process_data(list);
Kind of echoing Veedrac's comments, storing your lines in std::list<std::string> is not a good idea if you want to load an epic number of lines really quickly. That would actually be a bigger priority to address than multithreading. I'd turn that into just a std::vector<char> all_lines storing all the strings, and a std::vector<size_t> line_start storing the starting position of the nth line, which you can retrieve like so:
// note that 'line' will be EOL-terminated rather than null-terminated
// if it points to the original buffer.
const char* line = all_lines.data() + line_start[n];
The immediate problem with std::list without a custom allocator is a heap allocation per node. On top of that we're wasting memory storing two extra pointers per line. std::string is problematic here because SBO optimizations to avoid heap allocation would either make it take too much memory for small strings (and thereby increase cache misses) or still end up invoking heap allocations for every non-small string. So you end up avoiding all these problems by just storing everything in one giant char buffer, like in std::vector<char>.
I/O streams, including stringstreams and functions like getline, are also horrible for performance, just awful, in ways that really disappointed me at first, since my first OBJ loader used those and it was over 20 times slower than the second version, where I ported all those I/O stream operators and functions and uses of std::string over to C functions and my own hand-rolled stuff operating on char buffers. When it comes to parsing in performance-critical contexts, C functions like sscanf and memchr and plain old character buffers tend to be so much faster than the C++ ways of doing it, but you can at least still use std::vector<char> to store huge buffers, e.g. to avoid dealing with malloc/free and to get some debug-build sanity checks when accessing the character buffer stored inside.
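For illustration, a minimal sketch of that flat layout (still using getline for brevity, even though the point above is that raw character-buffer parsing is much faster; the file name is made up). Here the lines are null-terminated in the new buffer, unlike the EOL-terminated case described above:
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

int main()
{
    std::vector<char> all_lines;          // all characters, one line after another
    std::vector<std::size_t> line_start;  // where the nth line begins in all_lines

    std::ifstream file("input.txt");      // hypothetical file name
    std::string tmp;
    while (std::getline(file, tmp))
    {
        line_start.push_back(all_lines.size());
        all_lines.insert(all_lines.end(), tmp.begin(), tmp.end());
        all_lines.push_back('\0');        // null-terminate so the line is usable as a C string
    }

    // Access the nth line without any per-line heap allocation:
    if (!line_start.empty())
    {
        const char* line = all_lines.data() + line_start[0];
        (void)line;
    }
}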

Reading large (~1GB) data file with C++ sometimes throws bad_alloc, even if I have more than 10GB of RAM available

I'm trying to read the data contained in a .dat file with size ~1.1GB.
Because I'm doing this on a 16GB RAM machine, I thought it would not be a problem to read the whole file into memory at once, and only process it afterwards.
To do this, I employed the slurp function found in this SO answer.
The problem is that the code sometimes, but not always, throws a bad_alloc exception.
Looking at the task manager I see that there are always at least 10GB of free memory available, so I don't see how memory would be an issue.
Here is the code that reproduces this error
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>

using namespace std;

int main()
{
    ifstream file;
    file.open("big_file.dat");
    if (!file.is_open())
        cerr << "The file was not found\n";

    stringstream sstr;
    sstr << file.rdbuf();
    string text = sstr.str();

    cout << "Successfully read file!\n";
    return 0;
}
What could be causing this problem?
And what are the best practices to avoid it?
The fact that your system has 16GB doesn't mean any program at any time can allocate a given amount of memory. In fact, this might work on a machine that has only 512MB of physical RAM, if enough swap is available, or it might fail on an HPC node with 128GB of RAM – it's totally up to your operating system to decide how much memory is available to you, here.
I'd also argue that std::string is never the data type of choice if actually dealing with a file, possibly binary, that large.
The point here is that there is absolutely no knowing how much memory stringstream tries to allocate. A pretty reasonable algorithm would double the amount of memory allocated every time the allocated internal buffer becomes too small to contain the incoming bytes. Also, libc++/libc will probably also have their own allocators that will have some allocation overhead, here.
Note that stringstream::str() returns a copy of the data contained in the stringstream's internal state, again leaving you with at least 2.2 GB of heap used up for this task.
Really, if you need to deal with data from a large binary file as something that you can access with the index operator [], look into memory mapping your file; that way, you get a pointer to the beginning of the file, and might work with it as if it was a plain array in memory, letting your OS take care of handling the underlying memory/buffer management. It's what OSes are for!
If you didn't know Boost before, it's kind of "the extended standard library for C++" by now, and of course, it has a class abstracting memory mapping a file: mapped_file.
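For example, one way of using the Boost.Iostreams flavor of it (a minimal sketch, assuming the file name from the question and that you link against boost_iostreams):
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>
#include <iostream>

int main()
{
    // Map the file read-only; the OS pages the data in on demand instead of
    // copying everything into a heap-allocated string up front.
    boost::iostreams::mapped_file_source file("big_file.dat");

    const char* data = file.data();   // pointer to the first byte of the file
    std::size_t size = file.size();   // length of the mapping in bytes

    std::cout << "Mapped " << size << " bytes; first byte: " << data[0] << '\n';
}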
The file I'm reading contains a series of data in ASCII tabular form, i.e. float1,float2\nfloat3,float4\n....
I'm browsing through the various possible solutions proposed on SO to deal with this kind of problem, but I was left wondering on this (to me) peculiar behaviour. What would you recommend in these kinds of circumstances?
Depends; I actually think the fastest way of dealing with this (since file IO is much, much slower than in-memory parsing of ASCII) is to parse the file incrementally, directly into an in-memory array of float variables, possibly taking advantage of your OS's prefetching and SMP capabilities, to the point that you don't even get that much of a speed advantage if you spawn separate threads for file reading and float conversion. std::copy, used to read from std::ifstream into a std::vector<float>, should work fine here.
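A minimal sketch of that incremental approach, assuming the comma-separated pairs from the question (the file name is made up, and operator>> is used directly rather than std::copy because of the commas):
#include <fstream>
#include <vector>

int main()
{
    std::ifstream in("data.dat");      // hypothetical file of "float1,float2\n" records
    std::vector<float> values;

    float a = 0.0f, b = 0.0f;
    char comma = ',';
    // Parse incrementally: two floats per record, separated by a comma;
    // formatted extraction skips the line breaks on its own.
    while (in >> a >> comma >> b)
    {
        values.push_back(a);
        values.push_back(b);
    }
}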
I'm still not getting something: you say that file IO is much slower than in-memory parsing, and this I understand (and is the reason why I wanted to read the whole file at once). Then you say that the best way is to parse the whole file incrementally into an in-memory array of float. What exactly do you mean by this? Doesn't this mean to read the file line-by-line, resulting in a large number of file IO operations?
Yes, and no: first, of course, you will have more context switches than you'd have if you just ordered the whole file to be read at once. But those aren't that expensive -- at least, they're going to be much less expensive when you realize that most OSes and libcs know quite well how to optimize reads, and thus will fetch a whole lot of the file at once if you don't use extremely randomized read lengths. Also, you don't incur the penalty of trying to allocate a block of RAM at least 1.1GB in size -- that calls for some serious page table lookups, which aren't that fast, either.
Now, the idea is that your occasional context switch and the fact that, if you're staying single-threaded, there will be times when you don't read the file because you're still busy converting text to float will still mean less of a performance hit, because most of the time, your read will pretty much immediately return, as your OS/runtime has already prefetched a significant part of your file.
Generally, to me, you seem to be worried about all the wrong kinds of things: performance seems to be important to you (is it really that important, here? You're using a brain-dead file format for interchanging floats, which is bloated, loses information, and on top of that is slow to parse), but you'd rather first read the whole file in at once and then start converting it to numbers. Frankly, if performance was of any criticality to your application, you would start to multi-thread/-process it, so that string parsing could already happen while data is still being read. Using buffers of a few kilobytes to megabytes, read up to \n boundaries and exchanged with a thread that creates the in-memory table of floats, sounds like it would basically reduce your read+parse time down to read+non-measurable, without sacrificing read performance and without the need for gigabytes of RAM just to parse a sequential file.
By the way, to give you an impression of how bad storing floats in ASCII is:
The typical 32-bit single-precision IEEE 754 floating point number has about 6-9 significant decimal digits. Hence, you will need at least 6 characters to represent these in ASCII, one ., typically one exponential divider, e.g. E, and on average 2.5 digits of decimal exponent, plus on average half a sign character (- or not), if your numbers are uniformly chosen from all possible IEEE 754 32-bit floats:
-1.23456E-10
That's an average of 11 characters.
Add one , or \n after every number.
Now, each character is 1 byte, meaning that you blow up your 4 bytes of actual data by a factor of 3, while still losing precision.
Now, people always come around telling me that plaintext is more usable, because if in doubt, the user can read it… I've yet to see one user that can skim through 1.1GB (according to my calculations above, that's around 90 million floating point numbers, or 45 million floating point pairs) and not go insane.
In a 32-bit executable, the total memory address space is 4 GB. Of that, sometimes 1-2 GB is reserved for system use.
To allocate 1 GB, you need 1 GB of contiguous address space. To copy it, you need two 1 GB blocks. This can easily fail, unpredictably.
There are two approaches. First, switch to a 64 bit executable. This will not run on a 32 bit system.
Second, stop allocating 1 GB contiguous blocks. Once you start dealing with that much data, segmenting it and/or streaming it starts making a lot of sense. Done right, you'll also be able to start processing it prior to finishing reading it.
There are many file I/O data structures, from STXXL to Boost, or you can roll your own.
The size of the heap (a pool of memory used for dynamic allocations) is limited independently of the amount of RAM your machine has. You should use some other memory allocation technique for such large allocations, which will probably force you to change the way you read from the file.
If you are running on a UNIX-based system you can look at the mmap function, or at the VirtualAlloc function if you are running on the Windows platform.

Improving/optimizing file write speed in C++

I've been running into some issues with writing to a file - namely, not being able to write fast enough.
To explain, my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file.
The raw data is coming in at a rate of 10MS/s, and it's then saved to a buffer and subsequently written to a file.
Below is the relevant section of code:
std::string path = "Stream/raw.dat";
ofstream outFile(path, ios::out | ios::app | ios::binary);

if (outFile.is_open())
    cout << "Yes" << endl;

while (1)
{
    rxSamples = rxStream->recv(&rxBuffer[0], rxBuffer.size(), metaData);
    switch (metaData.error_code)
    {
        //Irrelevant error checking...

        //Write data to a file
        std::copy(begin(rxBuffer), end(rxBuffer), std::ostream_iterator<complex<float>>(outFile));
    }
}
The issue I'm encountering is that it's taking too long to write the samples to a file. After a second or so, the device sending the samples reports its buffer has overflowed. After some quick profiling of the code, nearly all of the execution time is spent on std::copy(...) (99.96% of the time to be exact). If I remove this line, I can run the program for hours without encountering any overflow.
That said, I'm rather stumped as to how I can improve the write speed. I've looked through several posts on this site, and it seems like the most common suggestion (in regard to speed) is to implement file writes as I've already done - through the use of std::copy.
If it's helpful, I'm running this program on Ubuntu x86_64. Any suggestions would be appreciated.
So the main problem here is that you try to write in the same thread as you receive, which means that your recv() can only be called again after the copy is complete. A few observations:
Move the writing to a different thread (a minimal sketch of this decoupling follows these points). This is about a USRP, so GNU Radio might really be the tool of your choice -- it's inherently multithreaded.
Your output iterator is probably not the most performant solution. Simply calling write() on a file descriptor might be better, but that's a performance measurement that's up to you.
If your hard drive/file system/OS/CPU aren't up to the rates coming in from the USRP, even after decoupling receiving from writing thread-wise, then there's nothing you can do -- get a faster system.
Try writing to a RAM disk instead
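As a minimal sketch of the decoupling from the first point (the Buffer alias, buffer_queue, and writer_thread names are mine, not anything from UHD or GNU Radio), the receive loop pushes filled buffers onto a queue and a separate thread drains it and writes in binary:
#include <complex>
#include <condition_variable>
#include <fstream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

using Buffer = std::vector<std::complex<float>>;

std::queue<Buffer> buffer_queue;   // filled by the receive loop, drained by the writer
std::mutex m;
std::condition_variable cv;
bool done = false;                 // set (under the mutex) when receiving stops

void writer_thread(std::ofstream& out)
{
    for (;;)
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [] { return !buffer_queue.empty() || done; });
        if (buffer_queue.empty() && done)
            return;
        Buffer buf = std::move(buffer_queue.front());
        buffer_queue.pop();
        lock.unlock();             // do the slow disk write without holding the lock

        out.write(reinterpret_cast<const char*>(buf.data()),
                  static_cast<std::streamsize>(buf.size() * sizeof(buf[0])));
    }
}

// Producer side, inside the receive loop, after recv() has filled rxBuffer:
//     { std::lock_guard<std::mutex> lock(m); buffer_queue.push(rxBuffer); }
//     cv.notify_one();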
In fact, I don't know how you came up with the std::copy approach. The rx_samples_to_file example that comes with UHD does this with a simple write, and you should definitely favor that over copying; file I/O can, on good OSes, often be done with one copy less, and iterating over all elements is probably very slow.
Let's do a bit of math.
Your samples are (apparently) of type std::complex<float>. Given a (typical) 32-bit float, that means each sample is 64 bits. At 10 MS/s, that means the raw data is around 80 megabytes per second--that's within what you can expect to write to a desktop (7200 RPM) hard drive, but getting fairly close to the limit (which is typically around 100-150 megabytes per second or so).
Unfortunately, despite the std::ios::binary, you're actually writing the data in text format (because std::ostream_iterator basically does stream << data;).
This not only loses some precision, but increases the size of the data, at least as a rule. The exact amount of increase depends on the data--a small integer value can actually decrease the quantity of data, but for arbitrary input, a size increase close to 2:1 is fairly common. With a 2:1 increase, your outgoing data is now around 160 megabytes/second--which is faster than most hard drives can handle.
The obvious starting point for an improvement would be to write the data in binary format instead:
uint32_t nItems = std::end(rxBuffer)-std::begin(rxBuffer);
outFile.write((char *)&nItems, sizeof(nItems));
outFile.write((char *)&rxBuffer[0], sizeof(rxBuffer));
For the moment I've used sizeof(rxBuffer) on the assumption that it's a real array. If it's actually a pointer or vector, you'll have to compute the correct size (what you want is the total number of bytes to be written).
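For the vector case, a hedged restatement of the same idea, computing the byte count explicitly (rxBuffer is assumed to be a std::vector<std::complex<float>> here):
#include <complex>
#include <cstdint>
#include <fstream>
#include <vector>

void write_samples(std::ofstream& outFile, const std::vector<std::complex<float>>& rxBuffer)
{
    const std::uint32_t nItems = static_cast<std::uint32_t>(rxBuffer.size());
    outFile.write(reinterpret_cast<const char*>(&nItems), sizeof(nItems));
    // Total byte count = element count times the size of one element.
    outFile.write(reinterpret_cast<const char*>(rxBuffer.data()),
                  static_cast<std::streamsize>(nItems * sizeof(rxBuffer[0])));
}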
I'd also note that as it stands right now, your code has an even more serious problem: since it hasn't specified a separator between elements when it writes the data, the data will be written without anything to separate one item from the next. That means if you wrote two values of (for example) 1 and 0.2, what you'd read back in would not be 1 and 0.2, but a single value of 10.2. Adding separators to your text output will add yet more overhead (figure around 15% more data) to a process that's already failing because it generates too much data.
Writing in binary format means each float will consume precisely 4 bytes, so delimiters are not necessary to read the data back in correctly.
The next step after that would be to descend to a lower-level file I/O routine. Depending on the situation, this might or might not make much difference. On Windows, you can specify FILE_FLAG_NO_BUFFERING when you open a file with CreateFile. This means that reads and writes to that file will basically bypass the cache and go directly to the disk.
In your case, that's probably a win--at 10 MS/s, you're probably going to use up the cache space quite a while before you reread the same data. In such a case, letting the data go into the cache gains you virtually nothing, but costs you some time to copy the data to the cache, and then somewhat later to copy it out to the disk. Worse, it's likely to pollute the cache with all this data, so it's no longer storing other data that's a lot more likely to benefit from caching.

Memory Issues in C++

I am having run-time memory allocation errors with a C++ application. I have eliminated memory leaks, invalid pointer references and out-of-bounds vector assignments as the source of the issue - I am pretty sure it's something to do with memory fragmentation and I am hoping to get help with how to further diagnose and correct the problem.
My code is too large to post (about 15,000 lines - I know not huge but clearly too large to put online), so I am going to describe things with a few relevant snippets of code.
Basically, my program takes a bunch of string and numerical data sets as inputs (class objects with vector variables of type double, string, int and bool), performs a series of calculations, and then spits out the resulting numbers. I have tested and re-tested the calculations and outputs - everything is calculating as it should, and on smaller datasets things run perfectly.
However, when I scale things up, I start getting memory allocation errors, but I don't think I am even close to approaching the memory limits of my system - please see the two graphs below...my program cycles through a series of scenarios (performing identical calculations under a different set of parameters for each scenario) - in the first graph, I run 7 scenarios on a dataset of about 200 entries. As the graph shows, each "cycle" results in memory swinging up and back down to its baseline, and the overall memory usage is tiny (see the seven small blips on the right half of the bottom graph).
On the second graph, I am now running a dataset of about 10,000 entries (see notes on the dataset below). In this case, I only get through 2 full cycles before getting my error (as it is trying to resize a class object for the third scenario). You can see the first two scenarios in the bottom right-half graph; a lot more memory usage than before, but still only a small fraction of available memory. And as with the smaller dataset, usage increases while my scenario runs, and then decreases back to its initial level before reaching the next scenario.
This pattern, along with other tests I have done, leads me to believe it's some sort of fragmentation problem. The error always occurs when I am attempting to resize a vector, although the particular resize operation that causes the error varies based on the dataset size. Can anyone help me understand what's going on here and how I might fix it? I can describe things in much greater detail but already felt like my post was getting long...please ask questions if you need to and I will respond/edit promptly.
Clarification on the data set
The numbers 200 and 10,000 represent the number of unique records I am analyzing. Each record contains somewhere between 75 and 200 elements / variables, many of which are then being manipulated. Further, each variable is being manipulated over time and across multiple iterations (both dimensions variable). As a result, for an average "record" (the 200 to 10,000 referenced above), there could be easily as many as 200,000 values associated with it - a sample calculation:
1 Record * 75 Variables * 150 periods * 20 iterations = 225,000 unique values per record.
Offending Code (in this specific instance):
vector<LoanOverrides> LO;
LO.resize(NumOverrides + 1); // Error is occurring here. I am certain that NumOverrides is a valid numerical entry = 2985
// Sample class definition
class LoanOverrides {
public:
    string IntexDealName;
    string LoanID;
    string UniqueID;
    string PrepayRate;
    string PrepayUnits;
    double DefaultRate;
    string DefaultUnits;
    double SeverityRate;
    string SeverityUnits;
    double DefaultAdvP;
    double DefaultAdvI;
    double RecoveryLag;
    string RateModRate;
    string RateModUnits;
    string BalanceForgivenessRate;
    string BalanceForgivenessRateUnits;
    string ForbearanceRate;
    string ForbearanceRateUnits;
    double ForbearanceRecoveryRate;
    string ForbearanceRecoveryUnits;
    double BalloonExtension;
    double ExtendPctOfPrincipal;
    double CouponStepUp;
};
You have a 64-bit operating system capable of allocating large quantities of memory, but have built your application as a 32-bit application, which can only allocate a maximum of about 3GB of memory. You are trying to allocate more than that.
Try compiling as a 64-bit application. This may enable you to reach your goals. You may have to increase your pagefile size.
See if you can dispose of intermediate results earlier than you are currently doing so.
Try calculating how much memory is being used/would be used by your algorithm, and try reworking your algorithm to use less.
Try avoiding duplicating data by reworking your algorithm. I see you have a lot of reference data which, by the looks of it, isn't going to change during the application run. You could put all of that into a single vector, which you allocate once, then refer to the entries via integer indexes everywhere else, rather than copying them (just guessing that you are copying them); a sketch of this follows these suggestions.
Try avoiding loading all the data at once by reworking your algorithm to work on batches.
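A minimal sketch of that "store once, refer by index" idea (the StringTable name and layout are my own illustration, not the poster's code):
#include <string>
#include <unordered_map>
#include <vector>

struct StringTable
{
    std::vector<std::string> values;            // each distinct string stored exactly once
    std::unordered_map<std::string, int> index_of;

    // Returns the index of s, inserting it into the table if it is new.
    int intern(const std::string& s)
    {
        auto it = index_of.find(s);
        if (it != index_of.end())
            return it->second;
        values.push_back(s);
        int idx = static_cast<int>(values.size()) - 1;
        index_of.emplace(s, idx);
        return idx;
    }

    const std::string& get(int idx) const { return values[idx]; }
};

// A record would then hold e.g. `int PrepayUnits;` (an index into the shared table)
// instead of its own `string PrepayUnits;` copy.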
Without knowing more about your application, it is impossible to offer better advice. But basically you are running out of memory because you are allocating a huge, huge amount of it, and based on your application and the snippets you have posted I think you can probably avoid doing so with a little thought. Good luck.

Search for multiple words in very large text files (10 GB) using C++ the fastest way

I have this program where I have to search for specific values and their line numbers in a very large text file, and there might be multiple occurrences of the same value.
I've tried a simple C++ program which reads the text file line by line and searches for the value using strstr, but it's taking a very, very long time.
I also tried to use a system command using grep, but it's still taking a lot of time - not as long as before, but still too much.
I was searching for a library I can use to speed up the search.
Any help and suggestions? Thank you :)
There are two issues concerning the speed: the time it takes to actually read the data, and the time it takes to search.
Generally speaking, the fastest way to read a file is to mmap it (or the equivalent under Windows). This can get complicated if the entire file won't fit into the address space, but you mention 10GB in the header; if searching is all you do in the program, this shouldn't create any problems.
More generally, if speed is a problem, avoid using getline on a string. Reading large blocks, and picking the lines up (as char[]) out of them, without copying, is significantly faster. (As a simple compromise, you may want to copy when a line crosses a block boundary. If you're dealing with blocks of a MB or more, this shouldn't be too often; I've used this technique on older, 16-bit machines, with blocks of 32KB, and still gotten a significant performance improvement.)
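A rough sketch of that block-reading approach (names, block size, and the commented-out process_line placeholder are mine; lines that cross a block boundary are the only ones copied):
#include <cstring>
#include <fstream>
#include <vector>

void scan_file(const char* path)
{
    std::ifstream in(path, std::ios::binary);
    std::vector<char> block(1 << 20);          // read in 1 MB blocks
    std::vector<char> carry;                   // partial line carried over from the previous block

    while (in)
    {
        in.read(block.data(), static_cast<std::streamsize>(block.size()));
        const std::streamsize got = in.gcount();
        if (got <= 0)
            break;

        const char* p = block.data();
        const char* end = p + got;
        while (const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p)))
        {
            if (!carry.empty())
            {
                // A line that crossed a block boundary: copy just this one.
                carry.insert(carry.end(), p, nl);
                // process_line(carry.data(), carry.size());
                carry.clear();
            }
            else
            {
                // process_line(p, nl - p);     // lines inside the block need no copy
            }
            p = nl + 1;
        }
        carry.insert(carry.end(), p, end);     // stash the incomplete tail, if any
    }
    // If carry is non-empty here, the file's last line had no trailing '\n'.
}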
With regards to searching, if you're searching for a single, fixed string (not a regular expression or other pattern matching), you might want to try a BM (Boyer-Moore) search. If the string you're searching for is reasonably long, this can make a significant difference over other search algorithms. (I think that some implementations of grep will use this if the search pattern is in fact a fixed string, and is sufficiently long for it to make a difference.)
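Since C++17, the standard library even ships a ready-made Boyer-Moore searcher that plugs into std::search; a minimal sketch (the strings here are placeholders) over a buffer you have read or mapped in:
#include <algorithm>
#include <functional>
#include <iostream>
#include <string>

int main()
{
    std::string haystack = "contents of the mapped or block-read file would go here";
    std::string needle = "mapped";

    // Preprocesses the needle once, then searches the haystack with Boyer-Moore.
    std::boyer_moore_searcher searcher(needle.begin(), needle.end());
    auto it = std::search(haystack.begin(), haystack.end(), searcher);

    if (it != haystack.end())
        std::cout << "found at offset " << (it - haystack.begin()) << '\n';
}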
Use multiple threads. Each thread can be responsible for searching through a portion of the file. For example, on a 4-core machine spawn 12 threads. The first thread looks through the first 8% of the file, the second thread the second 8% of the file, etc. You will want to tune the number of threads per core to keep CPU utilization at its maximum. Since this is an I/O bound operation you may never reach 100% CPU utilization.
Feeding data to the threads will be a bottleneck using this design. Memory mapping the file might help somewhat but at the end of the day the disk can only read one sector at a time. This will be a bottleneck that you will be hard pressed to resolve. You might consider starting one thread that does nothing but read all the data in to memory and kick off search threads as the data loads up.
Since files are sequential beasts, searching from start to end is something you may not get around; however, there are a couple of things you could do.
If the data is static you could generate a smaller lookup file (alternatively with offsets into the main file); this works well if the same string is repeated multiple times, making the index file much smaller. If the file is dynamic you may need to regenerate the index file occasionally (offline).
Instead of reading line by line, read larger chunks from the file, like several MB, to speed up I/O.
If you'd like to use a library, you could use Xapian.
You may also want to try tokenizing your text before doing the search, and I'd also suggest you try regex too, but it will take a long time if you don't have an index on that text, so I'd definitely suggest you try Xapian or some other search engine.
If your big text file does not change often then create a database (for example SQLite) with a table:
create table word_line_numbers
(word varchar(100), line_number integer);
Read your file and insert a record in database for every word with something like this:
insert into word_line_numbers(word, line_number) values ('foo', 13452);
insert into word_line_numbers(word, line_number) values ('foo', 13421);
insert into word_line_numbers(word, line_number) values ('bar', 1421);
Create an index of words:
create index word_line_numbers_idx on word_line_numbers(word);
And then you can find line numbers for words fast using this index:
select line_number from word_line_numbers where word='foo';
For added speed (because of smaller database size) and complexity you can use 2 tables: words(word_id integer primary key, word not null) and word_lines(word_id integer not null references words, line_number integer not null).
I'd try first loading as much of the file into the RAM as possible (memory mapping of the file is a good option) and then search concurrently in parts of it on multiple processors. You'll need to take special care near the buffer boundaries to make sure you aren't missing any words. Also, you may want to try something more efficient than the typical strstr(), see these:
Boyer–Moore string search algorithm
Knuth–Morris–Pratt algorithm