Why is reading a big text file in parallel bad? - c++

I have a big txt file with ~30 million rows, each row separated by a line separator \n. And I'd like to read all lines into an unordered list (e.g. std::list<std::string>).
std::list<std::string> list;
std::ifstream file(path);
std::string tmp;
while (std::getline(file, tmp))
{
    list.emplace_back(tmp);
}
process_data(list);
The current implementation is very slow, so I'm learning how to read the data in chunks.
But after seeing this comment:
parallelizing on a HDD will make things worse, with the impact depending on the distribution of the files on the HDD. On a SSD it might (!) improve things.
Is it bad to read a file in parallel? What's the algorithm to read all lines of a file to an unordered container (e.g. std::list, normal array,...) as fast as possible, without using any libraries, and the code must be cross-platform?

Is it bad to read a file in parallel? What's the algorithm to read all
lines of a file to an unordered container (e.g. std::list, normal
array,...) as fast as possible, without using any libraries, and the
code must be cross-platform?
I guess I'll attempt to answer this one to avoid spamming the comments. I have, in multiple scenarios, sped up text file parsing substantially using multithreading. However, the keyword here is parsing, not disk I/O (though just about any text file read involves some level of parsing). Now first things first:
VTune here was telling me that my top hotspots were in parsing (sorry, this image was taken years ago and I didn't expand the call graph to show what inside obj_load was taking most of the time, but it was sscanf). This profiling session actually surprised me quite a bit. In spite of having been profiling for decades to the point where my hunches aren't too inaccurate (not accurate enough to avoid profiling, mind you, not even close, but I've tuned my sort of intuitive spider senses enough to where profiling sessions usually don't surprise me that much even without any glaring algorithmic inefficiencies -- though I might still be off about exactly why they exist since I'm not so good at assembly).
Yet this time I was really taken aback and shocked, so this example has always been one I used to show even the most skeptical colleagues who don't want to use profilers why profiling is so important. Some of them are actually good at guessing where hotspots exist, and some were creating very competent-performing solutions in spite of never having used a profiler, but none of them were good at guessing what isn't a hotspot, and none of them could draw a call graph based on their hunches. So I always liked to use this example to try to convert the skeptics and get them to spend a day just trying out VTune (we had a boatload of free licenses from Intel, who worked with us, which were largely going to waste on our team -- a tragedy, I thought, since VTune is a really expensive piece of software).
And the reason I was taken aback this time was not because I was surprised by the sscanf hotspot. That's kind of a no-brainer: non-trivial parsing of epic text files is generally going to be bottlenecked by string parsing. I could have guessed that. My colleagues who never touched a profiler could have guessed that. What I couldn't have guessed was how much of a bottleneck it was. Given that I was loading millions of polygons and vertices, texture coordinates, normals, creating edges and finding adjacency data, using index FOR compression, associating materials from the MTL file with the polygons, reverse engineering object normals stored in the OBJ file and consolidating them to form edge creasing, etc., I thought I would at least have a good chunk of the time distributed in the mesh system as well (I would have guessed 25-33% of the time spent in the mesh engine).
Turned out the mesh system took barely any time to my most pleasant surprise, and there my hunches were completely off about it specifically. It was, by far, parsing that was the uber bottleneck (not disk I/O, not the mesh engine).
So that's when I applied this optimization to multithread the parsing, and there it helped a lot. I even initially started off with a very modest multithreaded implementation which barely did any parsing except scanning the character buffers for line endings in each thread just to end up parsing in the loading thread, and that already helped by a decent amount (reduced the operation from 16 seconds to about 14 IIRC, and I eventually got it down to ~8 seconds and that was on an i3 with just two cores and hyperthreading). So anyway, yeah, you can probably make things faster with multithreaded parsing of character buffers you read in from text files in a single thread. I wouldn't use threads as a way to make disk I/O any faster.
I'm reading the characters from the file in binary into big char buffers in a single thread, then, using a parallel loop, have the threads figure out integer ranges for the lines in that buffer.
// Stores all the characters read in from the file in big chunks.
// This is shared for read-only access across threads.
vector<char> buffer;
// Local to a thread:
// Stores the starting position of each line.
vector<size_t> line_start;
// Stores the assigned buffer range for the thread:
size_t buffer_start, buffer_end;
Basically like so:
LINE1 and LINE2 are considered to belong to THREAD 1, while LINE3 is considered to belong to THREAD 2. LINE6 is not considered to belong to any thread since it doesn't have an EOL. Instead the characters of LINE6 will be combined with the next chunky buffer read from the file.
Each thread begins by looking at the first character in its assigned character buffer range. Then it works backwards until it finds an EOL or reaches the beginning of the buffer. After that it works forward and parses each line, looking for EOLs and doing whatever else we want, until it reaches the end of its assigned character buffer range. The last "incomplete line" is not processed by the thread, but instead the next thread (or if the thread is the last thread, then it is processed on the next big chunky buffer read by the first thread). The diagram is teeny (couldn't fit much) but I read in the character buffers from the file in the loading thread in big chunks (megabytes) before the threads parse them in parallel loops, and each thread might then parse thousands of lines from its designated buffer range.
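Here is a rough, simplified sketch of that per-thread scan (the names are mine, it uses std::thread directly instead of a parallel loop, it only records line-start offsets, and it silently skips the trailing partial line, which in the real thing gets carried over into the next chunk):
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Finds the starting offset of every complete line in 'buffer', with the work
// split across 'nthreads' threads over disjoint buffer ranges.
std::vector<std::size_t> find_line_starts(const std::vector<char>& buffer, unsigned nthreads)
{
    std::vector<std::vector<std::size_t>> per_thread(nthreads);
    std::vector<std::thread> threads;
    const std::size_t chunk = buffer.size() / nthreads + 1;

    for (unsigned t = 0; t < nthreads; ++t)
    {
        threads.emplace_back([&buffer, &per_thread, chunk, t]
        {
            std::size_t begin = std::min(t * chunk, buffer.size());
            const std::size_t end = std::min(begin + chunk, buffer.size());

            // Work backwards from the assigned start until just past the
            // previous EOL (or the start of the buffer), so a line straddling
            // two ranges is owned by exactly one thread.
            while (begin > 0 && buffer[begin - 1] != '\n')
                --begin;

            // Work forwards, recording the start of every line whose EOL falls
            // inside the assigned range; the last incomplete line is skipped.
            std::size_t line = begin;
            for (std::size_t i = begin; i < end; ++i)
            {
                if (buffer[i] == '\n')
                {
                    per_thread[t].push_back(line);
                    line = i + 1;
                }
            }
        });
    }
    for (std::thread& th : threads)
        th.join();

    // Concatenate the per-thread results; they are already in file order.
    std::vector<std::size_t> line_start;
    for (const std::vector<std::size_t>& v : per_thread)
        line_start.insert(line_start.end(), v.begin(), v.end());
    return line_start;
}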
Quoting the code from the question:
std::list<std::string> list;
std::ifstream file(path);
std::string tmp;
while (std::getline(file, tmp))
{
    list.emplace_back(tmp);
}
process_data(list);
Kind of echoing Veedrac's comments, storing your lines in std::list<std::string> is not a good idea if you really want to load an epic number of lines quickly. That would actually be a bigger priority to address than multithreading. I'd turn that into just a std::vector<char> all_lines storing all the strings, and you can use a std::vector<size_t> line_start to store the starting position of the nth line, which you can retrieve like so:
// note that 'line' will be EOL-terminated rather than null-terminated
// if it points to the original buffer.
const char* line = all_lines.data() + line_start[n];
The immediate problem with std::list without a custom allocator is a heap allocation per node. On top of that we're wasting memory storing two extra pointers per line. std::string is problematic here because SBO optimizations to avoid heap allocation would either make it take too much memory for small strings (and thereby increase cache misses) or still end up invoking heap allocations for every non-small string. So you end up avoiding all these problems by just storing everything in one giant char buffer, like in std::vector<char>.

I/O streams, including stringstreams and functions like getline, are also horrible for performance, just awful, in ways that really disappointed me at first, since my first OBJ loader used those and it was over 20 times slower than the second version where I ported all those I/O stream operators and functions and uses of std::string to C functions and my own hand-rolled stuff operating on char buffers. When it comes to parsing in performance-critical contexts, C functions like sscanf and memchr and plain old character buffers tend to be much faster than the C++ ways of doing it, but you can at least still use std::vector<char> to store huge buffers, e.g., to avoid dealing with malloc/free and get some debug-build sanity checks when accessing the character buffer stored inside.
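As a minimal single-threaded sketch of that layout (the function name and the choice to slurp the whole file at once are mine, not part of the answer above):
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Reads the whole file into one flat char buffer and records where each line
// begins, so line n spans [line_start[n], next line start or end of buffer).
void load_lines(const std::string& path,
                std::vector<char>& all_lines,
                std::vector<std::size_t>& line_start)
{
    std::ifstream file(path, std::ios::binary);
    all_lines.assign(std::istreambuf_iterator<char>(file),
                     std::istreambuf_iterator<char>());

    if (all_lines.empty())
        return;
    line_start.push_back(0);
    for (std::size_t i = 0; i + 1 < all_lines.size(); ++i)
        if (all_lines[i] == '\n')
            line_start.push_back(i + 1);
}
Each line then stays EOL-terminated inside all_lines, exactly as the snippet above notes.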

Related

Reading text into string vs. directly initialize string with text

I want to store many lines of text (the text is always the same) in an array of strings. I can think of 2 ways to do so:
One way:
string s[100]={"first_line","second_line",...,"100th_line"};
The other way would be
string s[100];
fstream fin("text.txt");
for (int i = 0; i < 100; i++)
    getline(fin, s[i]);
text.txt:
first_line
second_line
...
100th_line
The actual number of lines will be around 500 and the length of each line will be 50-60 characters long.
So my question is: which way is faster/better?
L.E.: How can I put the text from the first method in another file and still be able to use the string s in my source.cpp? I want to do so because I don't want my source.cpp to get messy from all those lines of initialization.
Here are some latency numbers every programmer should know:
memory read from cache: 0.5-7 nanoseconds
memory read from main memory: 100 nanoseconds
SSD disk access: 150 000 nanoseconds (reach location to read)
hard disk access: 10 000 000 nanoseconds (reach location to read)
So what's the fastest for you ?
The first version will always be faster: the text is loaded together with your executable (no access overhead), and the string objects are constructed in memory (see assembly code online).
The second version will require several disk accesses (at least to open current directory, and to access the file), a couple of operating system actions (e.g. access control), not to forget the buffering of input in memory. Only then would the string objects be created in memory as in first version.
Fortunately, users don't notice nanoseconds and will probably not realize the difference: the human eye requires about 13 ms to identify an image, and the reaction time from eye to mouse is around 215 ms (215 000 000 nanoseconds).
So, my advice: no premature optimization. Focus on functionality (easy customization of content) and maintainability (e.g. easy localization if the software is used in several languages) before going too deep on performance.
In the grand scheme of things, with only 500 relatively short strings, which approach is better is mostly an academic question, with little practical difference.
But, if one wants to be picky, reading it from a file requires a little bit more work at runtime than just immediately initializing the string array. Also, you have to prepare for the possibility that the initialization file is missing, and handle that possibility, in some way.
Compiling-in the initial string values as part of the code avoids the need to do some error handling, and saves a little bit of time. The biggest win will be the lack of a need to handle the possibility that the initialization file is missing. There's a direct relationship between the likelihood that something might go wrong and the actual number of things that could potentially go wrong.
I'd go with the first one, since it's constructing the strings directly inside the array, which is practically emplacement (or perhaps it's moving them; if so I might be wrong), without any extra operations, so it's probably much better than reading from the hard disk and then doing the same procedure as the first method.
If the data does not change then hard code it into a source file.
If you ever need to change the data, for testing or maintenance, the data should be placed into a file.
Don't worry about execution speed until somebody complains. Concentrate your efforts on robust and easily readable code. Most of the time spent with applications is in maintenance. If you make your code easy to maintain, you will spend less time maintaining it.

Reading large (~1GB) data file with C++ sometimes throws bad_alloc, even if I have more than 10GB of RAM available

I'm trying to read the data contained in a .dat file with size ~1.1GB.
Because I'm doing this on a 16GB RAM machine, I thought it would not be a problem to read the whole file into memory at once, and only process it afterwards.
To do this, I employed the slurp function found in this SO answer.
The problem is that the code sometimes, but not always, throws a bad_alloc exception.
Looking at the task manager I see that there are always at least 10GB of free memory available, so I don't see how memory would be an issue.
Here is the code that reproduces this error
#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
using namespace std;
int main()
{
    ifstream file;
    file.open("big_file.dat");
    if(!file.is_open())
        cerr << "The file was not found\n";
    stringstream sstr;
    sstr << file.rdbuf();
    string text = sstr.str();
    cout << "Successfully read file!\n";
    return 0;
}
What could be causing this problem?
And what are the best practices to avoid it?
The fact that your system has 16GB doesn't mean any program at any time can allocate a given amount of memory. In fact, this might work on a machine that has only 512MB of physical RAM, if enough swap is available, or it might fail on an HPC node with 128GB of RAM -- it's totally up to your operating system to decide how much memory is available to you here.
I'd also argue that std::string is never the data type of choice if actually dealing with a file, possibly binary, that large.
The point here is that there is absolutely no knowing how much memory stringstream tries to allocate. A pretty reasonable algorithm would double the amount of memory allocated every time the allocated internal buffer becomes too small to contain the incoming bytes. Also, libc++/libc will probably also have their own allocators that will have some allocation overhead, here.
Note that stringstream::str() returns a copy of the data contained in the stringstream's internal state, again leaving you with at least 2.2 GB of heap used up for this task.
Really, if you need to deal with data from a large binary file as something that you can access with the index operator [], look into memory mapping your file; that way, you get a pointer to the beginning of the file, and might work with it as if it was a plain array in memory, letting your OS take care of handling the underlying memory/buffer management. It's what OSes are for!
If you didn't know Boost before, it's kind of "the extended standard library for C++" by now, and of course, it has a class abstracting memory mapping a file: mapped_file.
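For reference, a minimal sketch using Boost.Iostreams' mapped_file_source (one of the mapped-file classes Boost offers; the file name is just a placeholder):
#include <boost/iostreams/device/mapped_file.hpp>
#include <cstddef>
#include <iostream>

int main()
{
    // Map the file into the address space; the OS pages it in on demand.
    boost::iostreams::mapped_file_source file("big_file.dat");

    const char* data = file.data();   // read-only view of the entire file
    std::size_t size = file.size();

    // data[0] .. data[size - 1] can now be indexed like a plain in-memory array.
    std::cout << "Mapped " << size << " bytes, first byte: " << data[0] << '\n';
}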
The file I'm reading contains a series of data in ASCII tabular form, i.e. float1,float2\nfloat3,float4\n....
I'm browsing through the various possible solutions proposed on SO to deal with this kind of problem, but I was left wondering on this (to me) peculiar behaviour. What would you recommend in these kinds of circumstances?
Depends; I actually think the fastest way of dealing with this (since file I/O is much, much slower than in-memory parsing of ASCII) is to parse the file incrementally, directly into an in-memory array of float variables; thanks to your OS's prefetching you probably wouldn't even get that much of a speed advantage if you spawned separate threads for file reading and float conversion. std::copy, used to read from a std::ifstream into a std::vector<float>, should work fine here.
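As a sketch of that incremental approach for the float1,float2\n layout (using operator>> rather than std::copy, since a plain std::istream_iterator<float> would stop at the commas; the file name is a placeholder):
#include <fstream>
#include <vector>

int main()
{
    std::ifstream file("big_file.dat");
    std::vector<float> values;

    // operator>> skips whitespace (including '\n'), so this consumes one
    // "float1,float2" record per iteration until extraction fails at EOF.
    float a, b;
    char comma;
    while (file >> a >> comma >> b)
    {
        values.push_back(a);
        values.push_back(b);
    }
}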
I'm still not getting something: you say that file IO is much slower than in-memory parsing, and this I understand (and is the reason why I wanted to read the whole file at once). Then you say that the best way is to parse the whole file incrementally into an in-memory array of float. What exactly do you mean by this? Doesn't this mean to read the file line-by-line, resulting in a large number of file IO operations?
Yes, and no: First, of course, you will have more context switches than you'd have if you just ordered the whole thing to be read at once. But those aren't that expensive -- at least, they're going to be much less expensive when you realize that most OSes and libcs know quite well how to optimize reads, and thus will fetch a whole lot of the file at once if you don't use extremely randomized read lengths. Also, you don't incur the penalty of trying to allocate a block of RAM at least 1.1GB in size -- that calls for some serious page table lookups, which aren't that fast, either.
Now, the idea is that your occasional context switch and the fact that, if you're staying single-threaded, there will be times when you don't read the file because you're still busy converting text to float will still mean less of a performance hit, because most of the time, your read will pretty much immediately return, as your OS/runtime has already prefetched a significant part of your file.
Generally, to me, you seem to be worried about all the wrong kinds of things: Performance seems to be important to you (is it really that important, here? You're using a brain-dead file format for interchanging floats, which is both bloaty, loses information, and on top of that is slow to parse), but you'd rather first read the whole file in at once and then start converting it to numbers. Frankly, if performance was of any criticality to your application, you would start to multi-thread/-process it, so that string parsing could already happen while data is still being read. Using buffers of a few kilo- to Megabytes to be read up to \n boundaries and exchanged with a thread that creates the in-memory table of floats sounds like it would basically reduce your read+parse time down to read+non-measurable without sacrificing read performance, and without the need for Gigabytes of RAM just to parse a sequential file.
By the way, to give you an impression of how bad storing floats in ASCII is:
The typical 32-bit single-precision IEEE754 floating point number has about 6-9 significant decimal digits. Hence, you will need at least 6 characters to represent these in ASCII, one ., typically one exponential divider, e.g. E, and on average 2.5 digits of decimal exponent, plus on average half a sign character (- or not), if your numbers are uniformly chosen from all possible IEEE754 32bit floats:
-1.23456E-10
That's an average of 11 characters.
Add one , or \n after every number.
Now, your character is 1B, meaning that you blow up your 4B of actual data by a factor of 3, still losing precision.
Now, people always come around telling me that plaintext is more usable, because if in doubt, the user can read it… I've yet to see one user that can skim through 1.1GB (according to my calculations above, that's around 90 million floating point numbers, or 45 million floating point pairs) and not go insane.
In a 32-bit executable, the total memory address space is 4 GB. Of that, sometimes 1-2 GB is reserved for system use.
To allocate 1 GB, you need 1 GB of contiguous address space. To copy it, you need two 1 GB blocks. This can easily fail, unpredictably.
There are two approaches. First, switch to a 64 bit executable. This will not run on a 32 bit system.
Second, stop allocating 1 GB contiguous blocks. Once you start dealing with that much data, segmenting it and or streaming it starts making a lot of sense. Done right you'll also be able to start to process it prior to finishing reading it.
There are many file I/O data structures, from STXXL to Boost, or you can roll your own.
The size of the heap (a pool of memory used for dynamic allocations) is limited independently of the amount of RAM your machine has. You should use some other memory allocation technique for such large allocations, which will probably force you to change the way you read from the file.
If you are running on a UNIX-based system you can look at mmap, or at VirtualAlloc / memory-mapped files if you are running on a Windows platform.
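On a UNIX-like system, mapping the file read-only would look roughly like this sketch (error handling omitted, file name a placeholder):
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("big_file.dat", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    // Map the whole file; the kernel pages it in as it is touched.
    const char* data = static_cast<const char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    // ... use data[0] .. data[st.st_size - 1] like an in-memory array ...

    munmap(const_cast<char*>(data), st.st_size);
    close(fd);
}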

Improving/optimizing file write speed in C++

I've been running into some issues with writing to a file - namely, not being able to write fast enough.
To explain, my goal is to capture a stream of data coming in over gigabit Ethernet and simply save it to a file.
The raw data is coming in at a rate of 10MS/s, and it's then saved to a buffer and subsequently written to a file.
Below is the relevant section of code:
std::string path = "Stream/raw.dat";
ofstream outFile(path, ios::out | ios::app | ios::binary);
if (outFile.is_open())
    cout << "Yes" << endl;

while (1)
{
    rxSamples = rxStream->recv(&rxBuffer[0], rxBuffer.size(), metaData);
    switch (metaData.error_code)
    {
        //Irrelevant error checking...

        //Write data to a file
        std::copy(begin(rxBuffer), end(rxBuffer), std::ostream_iterator<complex<float>>(outFile));
    }
}
The issue I'm encountering is that it's taking too long to write the samples to a file. After a second or so, the device sending the samples reports its buffer has overflowed. After some quick profiling of the code, nearly all of the execution time is spent on std::copy(...) (99.96% of the time to be exact). If I remove this line, I can run the program for hours without encountering any overflow.
That said, I'm rather stumped as to how I can improve the write speed. I've looked through several posts on this site, and it seems like the most common suggestion (in regard to speed) is to implement file writes as I've already done - through the use of std::copy.
If it's helpful, I'm running this program on Ubuntu x86_64. Any suggestions would be appreciated.
So the main problem here is that you try to write in the same thread as you receive, which means that your recv() can only be called again after copy is complete. A few observations:
Move the writing to a different thread. This is about a USRP, so GNU Radio might really be the tool of your choice -- it's inherently multithreaded.
Your output iterator is probably not the most performant solution. Simply write()-ing to a file descriptor might be better, but that's a performance measurement that's up to you.
If your hard drive/file system/OS/CPU aren't up to the rates coming in from the USRP, even if decoupling receiving from writing thread-wise, then there's nothing you can do -- get a faster system.
Try writing to a RAM disk instead
In fact, I don't know how you came up with the std::copy approach. The rx_samples_to_file example that comes with UHD does this with a simple write, and you should definitely favor that over copying; file I/O can, on good OSes, often be done with one copy less, and iterating over all elements is probably very slow.
Let's do a bit of math.
Your samples are (apparently) of type std::complex<float>. Given a (typical) 32-bit float, that means each sample is 64 bits. At 10 MS/s, that means the raw data is around 80 megabytes per second--that's within what you can expect to write to a desktop (7200 RPM) hard drive, but getting fairly close to the limit (which is typically around 100 megabytes per second or so).
Unfortunately, despite the std::ios::binary, you're actually writing the data in text format (because std::ostream_iterator basically does stream << data;).
This not only loses some precision, but increases the size of the data, at least as a rule. The exact amount of increase depends on the data--a small integer value can actually decrease the quantity of data, but for arbitrary input, a size increase close to 2:1 is fairly common. With a 2:1 increase, your outgoing data is now around 160 megabytes/second--which is faster than most hard drives can handle.
The obvious starting point for an improvement would be to write the data in binary format instead:
uint32_t nItems = std::end(rxBuffer)-std::begin(rxBuffer);
outFile.write((char *)&nItems, sizeof(nItems));
outFile.write((char *)&rxBuffer[0], sizeof(rxBuffer));
For the moment I've used sizeof(rxBuffer) on the assumption that it's a real array. If it's actually a pointer or vector, you'll have to compute the correct size (what you want is the total number of bytes to be written).
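If rxBuffer is in fact a std::vector<std::complex<float>> (which the rxBuffer.size() call in the question suggests), the write would look more like this sketch:
uint32_t nItems = static_cast<uint32_t>(rxBuffer.size());
outFile.write(reinterpret_cast<const char *>(&nItems), sizeof(nItems));
// Total byte count = element count times the size of one element.
outFile.write(reinterpret_cast<const char *>(rxBuffer.data()),
              rxBuffer.size() * sizeof(rxBuffer[0]));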
I'd also note that as it stands right now, your code has an even more serious problem: since it hasn't specified a separator between elements when it writes the data, the data will be written without anything to separate one item from the next. That means if you wrote two values of (for example) 1 and 0.2, what you'd read back in would not be 1 and 0.2, but a single value of 10.2. Adding separators to your text output will add yet more overhead (figure around 15% more data) to a process that's already failing because it generates too much data.
Writing in binary format means each float will consume precisely 4 bytes, so delimiters are not necessary to read the data back in correctly.
The next step after that would be to descend to a lower-level file I/O routine. Depending on the situation, this might or might not make much difference. On Windows, you can specify FILE_FLAG_NO_BUFFERING when you open a file with CreateFile. This means that reads and writes to that file will basically bypass the cache and go directly to the disk.
In your case, that's probably a win--at 10 MS/s, you're probably going to use up the cache space quite a while before you reread the same data. In such a case, letting the data go into the cache gains you virtually nothing, but costs you some data to copy data to the cache, then somewhat later copy it out to the disk. Worse, it's likely to pollute the cache with all this data, so it's no longer storing other data that's a lot more likely to benefit from caching.
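A fragment of what that open-and-write looks like on Windows (buffer and byteCount stand in for one block of received samples; note that FILE_FLAG_NO_BUFFERING requires the buffer address and the write size to be multiples of the disk's sector size):
#include <windows.h>

// Open the output with OS caching bypassed; data goes (almost) straight to disk.
HANDLE h = CreateFileA("Stream/raw.dat",
                       GENERIC_WRITE,
                       0,                 // no sharing
                       nullptr,
                       CREATE_ALWAYS,
                       FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH,
                       nullptr);

DWORD written = 0;
WriteFile(h, buffer, byteCount, &written, nullptr);   // sector-aligned buffer and size
CloseHandle(h);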

Reading large txt efficiently in c++

I have to read a large text file (> 10 GB) in C++. This is a CSV file with variable-length lines. When I try to read it line by line using ifstream it works, but it takes a long time; I guess this is because each time I read a line it goes to the disk and reads, which makes it very slow.
Is there a way to read in buffers, for example read 250 MB in one shot (using the read method of ifstream) and then get lines from this buffer? I see a lot of issues with such a solution, e.g. the buffer can have incomplete lines etc.
Is there a solution for this in C++ which handles all these cases? Are there any open source libraries that can do this, for example Boost?
Note: I would want to avoid C-style FILE* pointers etc.
Try using the Windows memory mapped file function. The calls are buffered and you get to treat a file as if it's just memory.
memory mapped files
IOstreams already use buffers much as you describe (though usually only a few kilobytes, not hundreds of megabytes). You can use pubsetbuf to get it to use a larger buffer, but I wouldn't expect any huge gains. Most of the overhead in IOstreams stems from other areas (like using virtual functions), not from lack of buffering.
If you're running this on Windows, you might be able to gain a little by writing your own stream buffer, and having it call CreateFile directly, passing (for example) FILE_FLAG_SEQUENTIAL_SCAN or FILE_FLAG_NO_BUFFERING. Under the circumstances, either of these may help your performance substantially.
If you want real speed, then you're going to have to stop reading lines into std::string, and start using char*s into the buffer. Whether you read that buffer using ifstream::read() or memory mapped files is less important, though read() has the disadvantage you note about potentially having N complete lines and an incomplete one in the buffer, and needing to recognise that (can easily do that by scanning the rest of the buffer for '\n' - perhaps by putting a NUL after the buffer and using strchr). You'll also need to copy the partial line to the start of the buffer, read the next chunk from file so it continues from that point, and change the maximum number of characters read such that it doesn't overflow the buffer. If you're nervous about FILE*, I hope you're comfortable with const char*....
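A sketch of that chunked read with a partial-line carry-over (the buffer size, function name, and the commented-out handle_line call are all mine; a single line longer than the whole buffer isn't handled):
#include <cstddef>
#include <cstring>
#include <fstream>
#include <vector>

void read_in_chunks(const char* path)
{
    std::ifstream file(path, std::ios::binary);
    std::vector<char> buffer(250 * 1024 * 1024);   // one big chunk
    std::size_t leftover = 0;                      // bytes of a partial line carried over

    while (true)
    {
        file.read(buffer.data() + leftover, buffer.size() - leftover);
        std::size_t got = static_cast<std::size_t>(file.gcount());
        if (got == 0 && leftover == 0)
            break;                                 // nothing left at all
        std::size_t have = leftover + got;

        // Hand out every complete line in buffer[0 .. have).
        std::size_t pos = 0;
        while (const void* eol = std::memchr(buffer.data() + pos, '\n', have - pos))
        {
            std::size_t len = static_cast<const char*>(eol) - (buffer.data() + pos);
            // handle_line(buffer.data() + pos, len);   // CSV field extraction goes here
            pos += len + 1;
        }

        // Move the trailing partial line to the front for the next read.
        leftover = have - pos;
        std::memmove(buffer.data(), buffer.data() + pos, leftover);

        if (got == 0)
            break;   // EOF: 'leftover' is a final line with no trailing '\n' (if any)
    }
}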
As you're proposing this for performance reasons, I do hope you've profiled to make sure that it's not your CSV field extraction etc. that's the real bottleneck.
I hope this helps -
http://www.cppprog.com/boost_doc/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file
BTW, you wrote that "the buffer can have incomplete lines" - in this situation, how about reading 250 MB and then reading char by char until you get the delimiter, to complete the line?

Processing huge text files

Problem:
I have a huge raw text file (assume ~3 GB), and I need to go through each word in the file and find out how many times each word appears in the file.
My Proposed Solution:
Split the huge file into multiple files, where each split file will have words in a sorted manner. For example, all the words starting with "a" will be stored in an "_a.dic" file. So, at any time we will not exceed more than 26 files.
The problem in this approach is,
I can use streams to read the file, but I wanted to use threads to read certain parts of the file. For example, read bytes 0-1024 with a separate thread (at least 4-8 threads, based on the number of processors in the box). Is this possible, or am I dreaming?
Any better approach?
Note: It should be a pure c++ or c based solution. No databases etc., are allowed.
You need to look at 'The Practice of Programming' by Kernighan and Pike, and specifically chapter 3.
In C++, use a map based on the strings and a count (std::map<string,size_t>, IIRC). Read the file (once - it's too big to read more than once), splitting it into words as you go (for some definition of 'word'), and incrementing the count in the map entry for each word you find.
In C, you'll have to create the map yourself. (Or find David Hanson's "C Interfaces and Implementations".)
Or you can use Perl, or Python, or Awk (all of which have associative arrays, equivalent to a map).
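A minimal C++ version of that single-pass counting loop (with whitespace splitting standing in for "some definition of 'word'", and the file name a placeholder):
#include <cstddef>
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main()
{
    std::ifstream file("huge.txt");
    std::map<std::string, std::size_t> counts;

    // operator>> splits on whitespace, which is one workable definition of 'word'.
    std::string word;
    while (file >> word)
        ++counts[word];

    for (const auto& entry : counts)
        std::cout << entry.second << '\t' << entry.first << '\n';
}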
I don't think using multiple threads that read parts of the file in parallel is going to help much. I would expect that this application is bound by the bandwidth and latency of your hard disk, not by the actual word counting. Such a multi-threaded version might actually perform worse, because "quasi-random" file access is typically slower than "linear file" access.
In case the CPU is really busy in a single-threaded version, there might be a potential speed-up. One thread could read the data in big chunks and put them into a queue of limited capacity. A bunch of other worker threads could each operate on their own chunk and count the words. After the counting worker threads have finished, you have to merge the word counters.
First - decide on the datastructure for saving the words.
The obvious choice is a map. But perhaps a trie would serve you better. In each node, you save the count for the word; a count of 0 means it's only part of a word.
You can insert into the trie using a stream, reading your file character by character.
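A bare-bones version of such a trie node and its insert, assuming words are lowercase ASCII (anything fancier, like case folding or punctuation handling, is left out):
#include <array>
#include <cstddef>
#include <memory>
#include <string>

struct TrieNode
{
    std::size_t count = 0;                                  // 0: only part of a word
    std::array<std::unique_ptr<TrieNode>, 26> children{};   // one slot per letter 'a'..'z'
};

void insert(TrieNode& root, const std::string& word)
{
    TrieNode* node = &root;
    for (char c : word)
    {
        std::unique_ptr<TrieNode>& child = node->children[c - 'a'];
        if (!child)
            child = std::make_unique<TrieNode>();
        node = child.get();
    }
    ++node->count;   // one more occurrence of this complete word
}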
Second - multithreading yes or no?
This one is not easy to answer. Depending on the size the datastructure grows and how you parallelize the answer may differ.
Single-threaded - straightforward and easy to implement.
Multithreaded with multiple reader threads and one data structure. Then you have to synchronize access to the data structure. In a trie, you only need to lock the node you are actually in, so multiple readers can access the data structure without much interference. A self-balancing tree might be different, especially when rebalancing.
Multithreaded with multiple reader threads, each with its own data structure. Each thread builds its own data structure while reading a part of the file. After each one is finished, the results have to be combined (which should be easy).
One thing you have to think about: you have to find a word boundary for each thread to start at, but that should not pose a great problem (e.g. each thread walks from its start until the first word boundary and starts there; at the end, each thread finishes the word it's working on).
While you can use a second thread to analyze the data after reading it, you're probably not going to gain a huge amount by doing so. Trying to use more than one thread to read the data will almost certainly hurt speed rather than improving it. Using multiple threads to process the data is pointless -- processing will be many times faster than reading, so even with only one extra thread, the limit is going to be the disk speed.
One (possible) way to gain significant speed is to bypass the usual iostreams -- while some are nearly as fast as using C FILE*'s, I don't know of anything that's really faster, and some are substantially slower. If you're running this on a system (e.g. Windows) that has an I/O model that's noticeably different from C's, you can gain considerably more with a little care.
The problem is fairly simple: the file you're reading is (potentially) larger than the cache space you have available -- but you won't gain anything from caching, because you're not going to reread chunks of the file again (at least if you do things sensibly). As such, you want to tell the system to bypass any caching, and just transfer data as directly as possible from the disk drive to your memory where you can process it. In a Unix-like system, that's probably open() and read() (and won't gain you a whole lot). On Windows, that's CreateFile and ReadFile, passing the FILE_FLAG_NO_BUFFERING flag to CreateFile -- and it'll probably roughly double your speed if you do it right.
You've also gotten some answers advocating doing the processing using various parallel constructs. I think these are fundamentally mistaken. Unless you do something horribly stupid, the time to count the words in the file will be only a few milliseconds longer than it takes to simply read the file.
The structure I'd use would be to have two buffers of, say, a megabyte apiece. Read data into one buffer. Turn that buffer over to your counting thread to count the words in that buffer. While that's happening, read data into the second buffer. When those are done, basically swap buffers and continue. There is a little bit of extra processing you'll need to do in swapping buffers to deal with a word that may cross the boundary from one buffer to the next, but it's pretty trivial (basically, if the buffer doesn't end with white space, you're still in a word when you start operating on the next buffer of data).
As long as you're sure it'll only be used on a multi-processor (multi-core) machine, using real threads is fine. If there's a chance this might ever be done on a single-core machine, you'd be somewhat better off using a single thread with overlapped I/O instead.
As others have indicated, the bottleneck will be the disk I/O. I therefore suggest that you use overlapped I/O. This basically inverts the program logic. Instead of your code trying to determine when to do I/O, you simply tell the Operating System to call your code whenever it has finished a bit of I/O. If you use I/O completion ports, you can even tell the OS to use multiple threads for processing the file chunks.
c based solution?
I think perl was born for this exact purpose.
A stream has only one cursor. If you access the stream from more than one thread at a time, you will not be sure to read where you want, since reads are done from the cursor position.
What I would do is have only one thread (maybe the main one) that reads the stream and dispatches the bytes read to other threads.
For example:
Thread #i is ready and asks the main thread to give it the next part,
The main thread reads the next 1 MB and provides it to thread #i,
Thread #i reads the 1 MB and counts words as you want,
Thread #i finishes its work and asks again for the next 1 MB.
This way you can separate stream reading from stream analysis.
What you are looking for is RegEx. This Stackoverflow thread on c++ regex engines should help:
C++: what regex library should I use?
First, I'm pretty sure that C/C++ isn't the best way to handle this. Ideally, you'd use some map/reduce for parallelism, too.
But, assuming your constraints, here's what I'd do.
1) Split the text file into smaller chunks. You don't have to do this by the first letter of the word. Just break them up into, say, 5000-word chunks. In pseudocode, you'd do something like this:
index = 0
numwords = 0
mysplitfile = openfile(index-split.txt)
while (bigfile >> word)
    mysplitfile << word
    numwords++
    if (numwords > 5000)
        mysplitfile.close()
        index++
        mysplitfile = openfile(index-split.txt)
2) Use a shared map data structure and pthreads to spawn new threads to read each of the subfiles. Again, pseudocode:
maplock = create_pthread_lock()
sharedmap = std::map()
for every index-split.txt file:
    spawn-new-thread(myfunction, filename, sharedmap, lock)
dump_map(sharedmap)

void myfunction(filename, sharedmap) {
    localmap = std::map<string, size_t>();
    file = openfile(filename)
    while (file >> word)
        if !localmap.contains(word)
            localmap[word] = 0
        localmap[word]++
    acquire(lock)
    for key,value in localmap
        if !sharedmap.contains(key)
            sharedmap[key] = 0
        sharedmap[key] += value
    release(lock)
}
Sorry for the syntax. I've been writing a lot of python lately.
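Roughly the same idea in C++, with std::thread and a mutex-protected merge (the split-file names are placeholders, and step 1's file splitting is omitted):
#include <cstddef>
#include <fstream>
#include <map>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::map<std::string, std::size_t> sharedmap;
std::mutex maplock;

void count_words_in(const std::string& filename)
{
    // Count into a private map first, so the shared lock is taken only once per file.
    std::map<std::string, std::size_t> localmap;
    std::ifstream file(filename);
    std::string word;
    while (file >> word)
        ++localmap[word];

    std::lock_guard<std::mutex> guard(maplock);
    for (const auto& entry : localmap)
        sharedmap[entry.first] += entry.second;
}

int main()
{
    std::vector<std::string> splits = {"0-split.txt", "1-split.txt"};   // etc.
    std::vector<std::thread> threads;
    for (const std::string& f : splits)
        threads.emplace_back(count_words_in, f);
    for (std::thread& t : threads)
        t.join();
}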
Not C, and a bit UGLY, but it took only 2 minutes to bang out:
perl -lane '$h{$_}++ for @F; END{for $w (sort {$h{$b}<=>$h{$a} || $a cmp $b} keys %h) {print "$h{$w}\t$w"}}' file > freq
Loop over each line with -n
Split each line into @F words with -a
Each $_ word increments hash %h
Once the END of file has been reached,
sort the hash by the frequency $h{$b}<=>$h{$a}
If two frequencies are identical, sort alphabetically $a cmp $b
Print the frequency $h{$w} and the word $w
Redirect the results to file 'freq'
I ran this code on a 3.3GB text file with 580,000,000 words.
Perl 5.22 completed in 173 seconds.
My input file already had punctuation stripped out, and uppercase converted to lowercase, using this bit of code:
perl -pe "s/[^a-zA-Z \t\n']/ /g; tr/A-Z/a-z/" file_raw > file
(runtime of 144 seconds)
The word-counting script could alternatively be written in awk:
awk '{for (i=1; i<=NF; i++){h[$i]++}} END{for (w in h){printf("%s\t%s\n", h[w], w)}}' file | sort -rn > freq