How does the compiler optimize getline() so effectively? - c++

I know a lot of a compiler's optimizations can be rather esoteric, but my example is so straightforward I'd like to see if I can understand, if anyone has an idea what it could be doing.
I have a 500 MB text file. I declare and initialize an fstream:
std::fstream file(path, std::ios::in);
I need to read the file sequentially. It's tab delimited, but the field lengths are not known and vary line to line. The actual parsing I need to do on each line added very little time to the total (which really surprised me, since I was doing string::find on each line from getline; I thought that would be slow).
In general I want to search each line for a string and abort the loop when I find it. I also have it incrementing and printing the line numbers for my own curiosity; I confirmed this adds little time (5 seconds or so) and lets me see how it blows past short lines and slows down on long lines.
I have the text to be found as a unique string tagging the EOF, so it needs to search every line. I'm doing this on my phone, so I apologize for formatting issues, but it's pretty straightforward. I have a function taking my fstream as a reference and the text to be found as a string and returning a std::size_t.
std::string line;
std::size_t pos;
long long int lineNum = 0;
while (std::getline(file, line))
{
    pos = line.find(text);
    lineNum += 1;
    std::cout << std::to_string(lineNum) << std::endl;
    if (pos != std::string::npos)
        return file.tellg();
}
return std::string::npos;
Edit: lingxi pointed out that the to_string isn't necessary here, thanks. As mentioned, entirely omitting the line number calculation and output saves a few seconds, which in my pre-optimization example is a small percentage of the total.
This successfully runs through every line and returns the end position in 408 seconds. I saw minimal improvement from putting the file in a stringstream, or from omitting everything in the loop (just getline until the end; no checks, searches, or displaying). Pre-reserving a huge space for the string didn't help either.
Seems like the getline is entirely the driver. However...if I compile with the /O2 flag (MSVC++) I get a comically faster 26 seconds. In addition, there is no apparent slow down on long lines vs short. Clearly the compiler is doing something very different. No complaints from me, but any thoughts as to how it's achieved? As an exercise I'd like to try and get my code to execute faster before compiler optimization.
I bet it has something to do with the way getline manipulates the string. Would it be faster (alas, I can't test for a while) to just reserve the whole file size for the string and read character by character, incrementing my line number when I pass a \n? Also, would the compiler employ things like mmap?
UPDATE: I'll post code when I get home this evening. It looks like just turning off runtime checks dropped the execution from 400 seconds to 50! I tried performing the same functionality using raw c style arrays. I'm not super experienced, but it was easy enough to dump the data into a character array, and loop through it looking for newlines or the first letter of my target string.
Even in full debug mode it gets to the end and correctly finds the string in 54 seconds. 26 seconds with checks off, and 20 seconds optimized. So from my informal, ad-hoc experiments it looks like the string and stream functions are victimized by the runtime checks? Again, I'll double check when I get home.
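For reference, here is a minimal sketch of the raw-buffer approach described in the update: slurp the file into one buffer, find the target with a single search, and count the newlines up to the hit. The function name and the 1-based line-number return are my own choices for illustration, not the code I ran:

#include <algorithm>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Slurp the whole file into one buffer, locate the target text with a single
// search, then count the newlines in front of the hit to get its line number.
// Returns -1 if the text is not found.
long long findLineRaw(const std::string& path, const std::string& text)
{
    std::ifstream file(path, std::ios::in | std::ios::binary);
    std::vector<char> buf((std::istreambuf_iterator<char>(file)),
                          std::istreambuf_iterator<char>());

    auto hit = std::search(buf.begin(), buf.end(), text.begin(), text.end());
    if (hit == buf.end())
        return -1;

    // Line numbers are 1-based: newlines before the hit, plus one.
    return std::count(buf.begin(), hit, '\n') + 1;
}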

The reason for this dramatic speedup is that the iostream class hierarchy is based on templates (std::ostream is actually a typedef for std::basic_ostream<char>), and a lot of its code is in headers. C++ iostreams take several function calls to process every byte in the stream, but most of those functions are fairly trivial. With optimization turned on, most of these calls are inlined, exposing to the compiler the fact that std::getline essentially copies characters from one buffer to another until it finds a newline - normally this is "hidden" under several layers of function calls. This can be further optimized, reducing the overhead per byte by orders of magnitude.
Buffering behavior actually doesn't change between the optimized and non-optimized version, otherwise the speedup would be even higher.
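To make the "several layers of function calls" point concrete, here is a rough sketch (not the actual library implementation) of the character-at-a-time loop that std::getline conceptually performs at the streambuf level; with optimization on, most of these per-character calls get inlined into a tight buffer copy:

#include <istream>
#include <streambuf>
#include <string>

// Rough sketch of what std::getline boils down to: pull one character at a
// time from the stream's buffer until a newline or EOF. Without inlining,
// every iteration pays for several function calls per byte.
std::istream& getline_sketch(std::istream& is, std::string& line)
{
    line.clear();
    std::streambuf* sb = is.rdbuf();
    for (int c = sb->sbumpc(); ; c = sb->sbumpc())
    {
        if (c == std::char_traits<char>::eof())
        {
            is.setstate(std::ios::eofbit);
            break;
        }
        if (c == '\n')
            break;
        line.push_back(static_cast<char>(c));
    }
    return is;
}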

Related

Is seekg or ignore more efficient?

I'm using C++ to write a very time-sensitive application, so efficiency is of the utmost importance.
I have a std::ifstream, and I want to jump a specific amount of characters (a.k.a. byte offset, I'm not using wchar_t) to get to a specific line instead of using std::getline to read every single line because it is too inefficient for me.
Is it better to use seekg or ignore to skip a specified number of characters and start reading from there?
size_t n = 100;
std::ifstream f("test");
f.seekg(n, std::ios_base::beg);
// vs.
f.ignore(n);
Looking at cplusplus.com for both functions, it seems that ignore will use sbumpc or sgetc to skip the requested number of characters. This means that it works even on streams that do not natively support skipping (which ifstream does), but it also processes every single byte.
seekg on the other hand uses pubseekpos or pubseekoff, which is implementation defined. For files, this should directly skip to the desired position without processing the bytes up to it.
I would expect seekg to be much more efficient, but as others said: doing your own tests with a big file would be the best way to go.
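If you want to measure it yourself, a quick-and-dirty harness along these lines should do. It assumes a sufficiently large file named test exists in the working directory; the 100 MB offset is arbitrary:

#include <chrono>
#include <fstream>
#include <iostream>

int main()
{
    const std::streamsize n = 100 * 1024 * 1024; // arbitrary 100 MB offset
    using clock = std::chrono::steady_clock;

    {
        std::ifstream f("test", std::ios::binary);
        auto t0 = clock::now();
        f.seekg(n, std::ios_base::beg);          // jump directly
        auto t1 = clock::now();
        std::cout << "seekg:  "
                  << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us\n";
    }
    {
        std::ifstream f("test", std::ios::binary);
        auto t0 = clock::now();
        f.ignore(n);                             // read and discard n bytes
        auto t1 = clock::now();
        std::cout << "ignore: "
                  << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us\n";
    }
}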

Using Getline on a Binary File

I have read that getline behaves as an unformatted input function, which I believe should allow it to be used on a binary file. Let's say, for example, that I've done this:
ofstream output("foo.txt", ios_base::binary);
const auto foo = "lorem ipsum";
output.write(foo, strlen(foo) + 1);
output.close();
ifstream input("foo.txt", ios_base::binary);
string bar;
getline(input, bar, '\0');
Is that breaking any rules? It seems to work fine, I think I've just traditionally seen arrays handled by writing the size and then writing the array.
No, it's not breaking any rules that I can see.
Yes, it's more common to write an array with a prefixed size, but using a delimiter to mark the end can work perfectly well also. The big difference is that (like with a text file) you have to read through data to find the next item. With a prefixed size, you can look at the size, and skip directly to the next item if you don't need the current one. Of course, you also need to ensure that if you're using something to mark the end of a field, that it can never occur inside the field (or come up with some way of detecting when it's inside a field, so you can read the rest of the field when it does).
Depending on the situation, that can mean (for example) using Unicode text. This gives you a lot of options for values that can't occur inside the text (because they aren't legal Unicode). That, on the other hand, would also mean that your "binary" file is really a text file, and has to follow some basic text-file rules to make sense.
Which is preferable depends on how likely it is that you'll want to read random pieces of the file rather than reading through it from beginning to end, as well as the difficulty (if any) of finding a unique delimiter and, if you don't have one, the complexity of making the delimiter recognizable from data inside a field. If the data is only meaningful when read in order, then having to read it in order doesn't really pose a problem. If you can read individual pieces meaningfully, then being able to do so is much more likely to be useful.
In the end, it comes down to a question of what you want out of your file being "binary". In the typical case, all "binary" really means is that end-of-line markers that might otherwise be translated from a newline character to (for example) a carriage-return/line-feed pair won't be. Depending on the OS you're using, it might not even mean that much; on Linux, for example, there's normally no difference between binary and text mode at all.
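For comparison, a small sketch of the size-prefixed alternative mentioned above; the 32-bit length field and the helper names are illustrative choices, not a prescribed format:

#include <cstdint>
#include <fstream>
#include <string>

// Write a 32-bit length, then the bytes. A reader can skip a record it
// doesn't care about by seeking past it instead of scanning for a delimiter.
void writeRecord(std::ofstream& out, const std::string& s)
{
    std::uint32_t len = static_cast<std::uint32_t>(s.size());
    out.write(reinterpret_cast<const char*>(&len), sizeof len);
    out.write(s.data(), len);
}

bool readRecord(std::ifstream& in, std::string& s)
{
    std::uint32_t len = 0;
    if (!in.read(reinterpret_cast<char*>(&len), sizeof len))
        return false;
    s.resize(len);
    return static_cast<bool>(in.read(&s[0], len));
}

void skipRecord(std::ifstream& in)
{
    std::uint32_t len = 0;
    if (in.read(reinterpret_cast<char*>(&len), sizeof len))
        in.seekg(len, std::ios_base::cur);       // jump over the payload
}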
Well, there are no rules broken and you'll get away with that just fine, except that you may lose some of the precision you normally get when reading binary data from a stream object.
With binary input, you usually want to know how many characters were read successfully, which you can obtain afterwards with gcount()... but using std::getline will not reflect the bytes read in gcount().
Of course, you can simply get that information from the size of the string you passed to std::getline, but the stream will no longer report the number of bytes you consumed in the last unformatted operation.
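A short sketch of the difference, continuing the foo.txt example from the question (the 64-byte buffer is arbitrary):

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream input("foo.txt", std::ios_base::binary);

    // Free-function getline: gcount() is not updated by it, so use the
    // string's size (plus one for the delimiter, when it was actually found).
    std::string bar;
    std::getline(input, bar, '\0');
    std::cout << "getline extracted " << bar.size() << " characters\n";

    // Unformatted read(): gcount() reports the bytes consumed by this call.
    char buf[64];
    input.read(buf, sizeof buf);
    std::cout << "read consumed " << input.gcount() << " bytes\n";
}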

Fast way to get two first and last characters of a string from the input

I need to read a string from the input.
Each string is between 2 and 1000 letters long.
I only need the first 2 letters, the last 2 letters, and the size of the entire string.
Here is my way of doing it. HOWEVER, I do believe there is a smarter way, which is why I am asking this question. Could you please tell me, an inexperienced and new C++ programmer, what are possible ways of doing this task better?
Thank you.
string word;
getline(cin, word);
// results - I need only those 5 numbers:
int l = word.length();
int c1 = word[0];
int c2 = word[1];
int c3 = word[l-2];
int c4 = word[l-1];
Why do I need this? I want to encode a huge number of really long strings, but I figured out that I really need only those 5 values I mentioned; the rest is redundant. How many words will be loaded? Enough to make this part of the code worth working on :)
I will take you at your word that this is something that is worth optimizing to an extreme. The method you've shown in the question is already the most straight-forward way to do it.
I'd start by using memory mapping to map chunks of the file into memory at a time. Then, loop through the buffer looking for newline characters. Take the first two characters after the previous newline and the last two characters before the one you just found. Subtract the address of the previous newline from the one you just found to get the length of the line. Rinse, lather, and repeat.
Obviously some care will need to be taken around boundaries, where one newline is in the previous mapped buffer and one is in the next.
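Here is a minimal sketch of that mapped-buffer scan, assuming POSIX mmap (on Windows the equivalent would be CreateFileMapping/MapViewOfFile); the file name is a placeholder and error handling is kept to a bare minimum:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

int main()
{
    int fd = open("words.txt", O_RDONLY);          // placeholder file name
    struct stat st;
    if (fd < 0 || fstat(fd, &st) != 0)
        return 1;

    char* base = static_cast<char*>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));
    if (base == MAP_FAILED)
        return 1;

    const char* p = base;
    const char* end = base + st.st_size;
    while (p < end)
    {
        // Find the end of the current line (or the end of the buffer).
        const char* nl = static_cast<const char*>(std::memchr(p, '\n', end - p));
        if (!nl)
            nl = end;
        std::size_t len = nl - p;
        if (len >= 2)
        {
            // First two and last two characters of this line, plus its length.
            std::printf("%zu %c%c...%c%c\n", len, p[0], p[1], nl[-2], nl[-1]);
        }
        p = nl + 1;                                // step past the newline
    }

    munmap(base, st.st_size);
    close(fd);
    return 0;
}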
The first two letters are easy to obtain and fast.
The issue is with the last two letters.
In order to read a text line, the input must be scanned until it finds an end-of-line character (usually a newline). Since your text lines are variable, there is no fast solution here.
You can mitigate the issue by reading in blocks of data from the file into memory and searching memory for the line endings. This avoids a call to getline, and it avoids a double search for the end of line (once by getline and the other by your program).
If you change the input to be fixed width, this issue can be sped up.
If you want to optimize this (although I can't imagine why you would want to do that, but surely you have your reasons), the first thing to do is to get rid of std::string and read the input directly. That will spare you one copy of the whole string.
If your input is stdin, you will be slowed down by the buffering too. As has already been said, the best speed would be achieved by reading big chunks from a file in binary mode and doing the end-of-line detection yourself.
At any rate, you will be limited by the I/O bandwidth (disk access speed) in the end.

Why is Scala's combinator parsing slow when parsing large files? What can I do?

I need to parse files that have millions of lines. I noticed that my combinator parser gets slower and slower as it parses more and more lines. The problem seems to be in scala's "rep" or regex parsers, because this behaviour occurs even for the simple example parser shown below:
def file: Parser[Int] = rep(line) ^^^ { 1 } // a file is a repetition of lines
def line: Parser[Int] = """(?m)^.*$""".r ^^^ { 0 } // reads a line and returns 0
When I try to parse a file with millions of lines of equal length with this simple parser, in the beginning it parses 46 lines/ms. After 370000 lines, the speed drops to 20 lines/ms. After 840000 lines, it drops to 10 lines/ms. After 1790000 lines, 5 lines/ms...
My questions are:
Why does this happen?
What can I do to prevent this?
This is probably a result of the change in Java 7u6 where substrings no longer share the backing array of the original string. So big strings get copied over and over, causing lots and lots of memory churn (among other things). As you increase the amount of stuff you've parsed (I'm assuming you're storing at least some of it), the garbage collector has more and more work to do, so creating all that extra garbage carries a steeper and steeper penalty.
There is a ticket to fix the memory usage, and code from Zach Moazeni there that lets you wrap your strings inside a construct that will make substrings properly (which you can pass into the parser in place of strings).
This won't necessarily change the overall result that parsing eventually slows down, but it should help reduce the time overall.
Also, I wouldn't advise making a file be a repetition of lines. You're making the parser keep track of the entire file when it really need not. I'd feed it in a line at a time. (And then if the lines are short, you may not need the above fix.)

Including huge string in our c++ programs?

I am trying to include a huge string in my C++ program. Its size is 20598617 characters, and I am using #define to achieve it. I have a header file which contains this statement:
#define HUGE_STRING "<huge string containing 20598617 characters>"
When I try to compile the program I get this error: fatal error C1060: compiler is out of heap space
I tried following command line options with no success
/Zm200
/Zm1000
/Zm2000
How can I make successful compilation of this program?
Platform: Windows 7
You can't, not reliably. Even if it will compile, it's liable to break the runtime library, or the OS assumptions, and so forth.
If you tell us why you're trying to do it, we can offer lots of alternatives. Deciding how to handle arbitrarily large data is a major part of programming.
Edited to add:
Rather than guess, I looked into MSDN:
Prior to adjacent strings being concatenated, a string cannot be longer than 16380 single-byte characters. A Unicode string of about one half this length would also generate this error.
The page concludes:
You may want to store exceptionally large string literals (32K or more) in a custom resource or an external file.
What do other compilers say?
Further edited to add:
I created a file like this:
char s[] = {'x','x','x','x'};
I kept doubling the occurrences of 'x', testing each one as an #include file.
An 8388608 byte string succeeded; 16777216 bytes failed, with the "out of heap space" error.
I suspect you are running into a design limit on the size of a character string.
Most people really think that a million characters is long enough :-}
To avoid such design limits, I'd try not to put the whole thing into a single literal string. On the suspicion that #define macro bodies likewise have similar limits, I'd try not to put the entire thing in a single #define, either.
Most C compilers will accept pretty big lists of individual characters as initializers. If you write
char c[]={ c1, c2, ... c20598617 };
with the c_i being your individual characters, you may succeed. I've seen GCC2 applications where there were 2 million elements like this (apparently they were loading some type of ROM image). You might even be able to group the c_i into blocks of K characters for K = 100, 1000, 10000 as suits your taste, and that might actually help the compiler.
You might also consider running your string through a compression algorithm, putting the compressed result into your C++ file by any of the above methods, and decompressing it after the program has loaded. I suspect you can get a decompression algorithm into a few thousand bytes.
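A sketch of how the decompression side could look, assuming zlib is available and the compressed bytes were produced offline with zlib's compress(); the symbol names and the generated-data array are placeholders, not part of the original answer:

#include <zlib.h>
#include <string>
#include <vector>

// Placeholders: in practice these come from a generated source file that
// holds the compressed bytes (e.g. emitted by a small build-time tool).
extern const unsigned char compressed_data[];
extern const unsigned long compressed_size;
const unsigned long uncompressed_size = 20598617;   // known original length

std::string loadHugeString()
{
    std::vector<unsigned char> out(uncompressed_size);
    uLongf outLen = uncompressed_size;
    if (uncompress(out.data(), &outLen, compressed_data, compressed_size) != Z_OK)
        return std::string();                       // decompression failed
    return std::string(reinterpret_cast<const char*>(out.data()), outLen);
}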
Store the string in a file and just open and read it...
It's much cleaner/organized that way [I'm assuming that right now you have a file named blargh.h which contains that one #define...]
Um, store the string in a separate resource of some sort and load it in? Seriously, in embedded land, you would have this as a separate resource and not hold it in RAM. On Windows, I believe you can use .dlls or other external resources to handle this for you. Compilers aren't designed to hold resources of this size for you, and they will fail.
Increase the compiler heap space.
If your string comes from a large text or binary file, you may have luck with either the xxd -i command (to get everything in an array, per Ira Baxter's answer) or a variant of the bin2obj command (to get everything into a .o file you can link into the program).
Note that the string may not be null terminated in this case.
See answers to the earlier question, "How can I get the contents of a file at build time into my C++ string?"
(Also, as an aside: note the existence of the .xbm format.)
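For example, xxd -i turns a file into a header you can #include directly; the file name below is a placeholder, and (as noted above) the generated array is not null terminated, so always pass the length explicitly:

// Generate the header once at build time:
//   xxd -i bigstring.txt > bigstring.h
// which emits something like:
//   unsigned char bigstring_txt[] = { 0x6c, 0x6f, ... };
//   unsigned int bigstring_txt_len = 20598617;
#include <string>
#include "bigstring.h"

// Wrap the generated array in a std::string when one is needed.
const std::string hugeString(reinterpret_cast<const char*>(bigstring_txt),
                             bigstring_txt_len);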
This is a very old question, but since there's no definitive answer yet: C++11's raw string literals seem to do the job.
This compiles nicely on GCC 4.8:
#include <string>
std::string data = R"(
... <1.4 MB of base85-encoded string> ...
)";
As said in other posts in this thread, this is definitely not the preferred way of handling large amounts of data.