Is seekg or ignore more efficient? - c++

I'm using C++ to write a very time-sensitive application, so efficiency is of the utmost importance.
I have a std::ifstream, and I want to jump a specific number of characters (i.e., a byte offset; I'm not using wchar_t) to get to a specific line, instead of using std::getline to read every single line, because that is too inefficient for me.
Is it better to use seekg or ignore to skip a specified number of characters and start reading from there?
size_t n = 100;
std::ifstream f("test");
f.seekg(n, std::ios_base::beg);
// vs.
f.ignore(n);

Looking at cplusplus.com for both functions, it seems like ignore will use sbumpc or sgetc to skip the requested number of characters. This means that it works even on streams that do not natively support skipping (which ifstream does), but it also processes every single byte.
seekg on the other hand uses pubseekpos or pubseekoff, which is implementation defined. For files, this should directly skip to the desired position without processing the bytes up to it.
I would expect seekg to be much more efficient, but as others have said, doing your own tests with a big file would be the best way to find out.
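If you do want to measure it, a minimal timing harness along these lines would do. This is only a sketch: the file name "test" and the offset are placeholders, and for trustworthy numbers you'd repeat each run and average, and be careful about OS file caching.

#include <chrono>
#include <fstream>
#include <iostream>

int main()
{
    const std::streamoff n = 100000000; // placeholder offset; pick something large for your file

    {
        std::ifstream f("test", std::ios::binary);
        auto t0 = std::chrono::steady_clock::now();
        f.seekg(n, std::ios_base::beg);      // should jump without touching the bytes
        auto t1 = std::chrono::steady_clock::now();
        std::cout << "seekg:  "
                  << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us\n";
    }
    {
        std::ifstream f("test", std::ios::binary);
        auto t0 = std::chrono::steady_clock::now();
        f.ignore(n);                         // processes every skipped byte
        auto t1 = std::chrono::steady_clock::now();
        std::cout << "ignore: "
                  << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
                  << " us\n";
    }
}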

Related

C++ how to check if the std::cin buffer is empty

The title is misleading because I'm more interested in finding an alternate solution. My gut feeling is that checking whether the buffer is empty is not the most ideal solution (at least in my case).
I'm new to C++ and have been following Bjarne Stroustrup's Programming Principles and Practices using C++. I'm currently on Chapter 7, where we are "refining" the calculator from Chapter 6. (I'll put the links for the source code at the end of the question.)
Basically, the calculator can take multiple inputs from the user, delimited by semi-colons.
> 5+2; 10*2; 5-1;
= 7
> = 20
> = 4
>
But I'd like to get rid of the prompt character ('>') for the last two answers, and display it again only when user input is asked for. My first instinct was to find a way to check if the buffer is empty: if so, print the prompt with cout, and if not, proceed with printing the answer. But after a bit of googling I realized the task is not as easy as I initially thought... and also that maybe it wasn't a good idea to begin with.
I guess essentially my question is how to get rid of the '>' characters for the last two answers when there are multiple inputs. But if checking the cin buffer is possible and is not a bad idea after all, I'd love to know how to do it.
Source code: https://gist.github.com/Spicy-Pumpkin/4187856492ccca1a24eaa741d7417675
Header file: http://www.stroustrup.com/Programming/PPP2code/std_lib_facilities.h
^ You need this header file. I assume it is written by the author himself.
Edit: I did look around the web for some solutions, but to be honest none of them made any sense to me. It's been like 4 days since I picked up C++ and I have a very thin background in programming, so sometimes even googling is a little tough...
As you've discovered, this is a deceptively complicated task. This is because there are multiple issues here at play, both the C++ library, and the actual underlying file.
C++ library
std::cin, and C++ input streams, use an intermediate buffer, a std::streambuf. Input from the underlying file, or an interactive terminal, is not read character by character, but rather in moderately sized chunks, where possible. Let's say:
int n;
std::cin >> n;
Let's say that when this is done and over with, n contains the number 42. Well, what actually happened is that std::cin, more than likely, did not read just two characters, '4' and '2', but whatever additional characters, beyond that, were available on the std::cin stream. The remaining characters were stored in the std::streambuf, and the next input operation will read them, before actually reading the underlying file.
And it is equally likely that the above >> did not actually read anything from the file, but rather fetched the '4' and the '2' characters from the std::streambuf, that were left there after the previous input operation.
It is possible to examine the underlying std::streambuf, and determine whether there's anything unread there. But this doesn't really help you.
Suppose that, just before executing the above >> operator, you looked at the underlying std::streambuf and discovered that it contains a single character, '4'. That still doesn't tell you much. You need to know what the next character is in std::cin. It could be a space or a newline, in which case all you'll get from the >> operator is 4. Or, the next character could be '2', in which case >> will swallow at least '42', and possibly more digits.
You can certainly implement all this logic yourself, look at the underlying std::streambuf, and determine whether it will satisfy your upcoming input operation. Congratulations: you've just reinvented the >> operator. You might as well just parse the input, a character at a time, yourself.
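For completeness, here is what peeking at the buffer looks like. This is a minimal sketch; note that in_avail() only reports characters already sitting in std::cin's own internal buffer, so a result of 0 does not mean no input will ever arrive.

#include <iostream>

int main()
{
    // in_avail() returns how many characters are known to be immediately
    // readable from the streambuf's internal buffer. It says nothing about
    // data the OS may still deliver.
    std::streamsize buffered = std::cin.rdbuf()->in_avail();
    std::cout << "characters already buffered: " << buffered << '\n';
}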
The underlying file
Suppose you've determined that std::cin does not have sufficient buffered input to satisfy your next input operation. Now, you need to know whether or not input is available on std::cin.
This now becomes an operating system-specific subject matter. This is no longer covered by the standard C++ library.
Conclusion
This is doable, but in all practical situations, the best solution here is to use an operating system-specific approach, instead of C++ input streams, and read and buffer your input yourself. On Linux, for example, the classical approach is to set fd 0 to non-blocking mode, so that read() does not block, and to determine whether or not there's available input, just try read() it. If you did read something, put it into a buffer that you can look at later. Once you've consumed all previously-read buffered input, and you truly need to wait for more input to be read, poll() the file descriptor, until it's there.
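A rough, Linux-only sketch of that approach follows; error handling is omitted, and the buffer handling is deliberately simplistic. It only demonstrates the non-blocking read plus poll() pattern described above.

#include <fcntl.h>
#include <poll.h>
#include <unistd.h>

#include <iostream>
#include <string>

int main()
{
    std::string buffer; // our own input buffer, replacing std::cin's

    // Put fd 0 (stdin) into non-blocking mode.
    int flags = fcntl(0, F_GETFL, 0);
    fcntl(0, F_SETFL, flags | O_NONBLOCK);

    char chunk[4096];
    ssize_t n = read(0, chunk, sizeof(chunk));
    if (n > 0)
    {
        buffer.append(chunk, n);  // input was already available; stash it
    }
    else
    {
        // Nothing pending: show the prompt, then wait for real input.
        std::cout << "> " << std::flush;
        pollfd pfd{0, POLLIN, 0};
        poll(&pfd, 1, -1);        // block until stdin becomes readable
    }
}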

Using Getline on a Binary File

I have read that getline behaves as an unformatted input function, which I believe should allow it to be used on a binary file. Let's say for example that I've done this:
ofstream output("foo.txt", ios_base::binary);
const auto foo = "lorem ipsum";
output.write(foo, strlen(foo) + 1);
output.close();
ifstream input("foo.txt", ios_base::binary);
string bar;
getline(input, bar, '\0');
Is that breaking any rules? It seems to work fine, I think I've just traditionally seen arrays handled by writing the size and then writing the array.
No, it's not breaking any rules that I can see.
Yes, it's more common to write an array with a prefixed size, but using a delimiter to mark the end can work perfectly well also. The big difference is that (like with a text file) you have to read through data to find the next item. With a prefixed size, you can look at the size, and skip directly to the next item if you don't need the current one. Of course, you also need to ensure that if you're using something to mark the end of a field, that it can never occur inside the field (or come up with some way of detecting when it's inside a field, so you can read the rest of the field when it does).
Depending on the situation, that can mean (for example) using Unicode text. This gives you a lot of options for values that can't occur inside the text (because they aren't legal Unicode). That, on the other hand, would also mean that your "binary" file is really a text file, and has to follow some basic text-file rules to make sense.
Which is preferable depends on how likely it is that you'll want to read random pieces of the file rather than reading through it from beginning to end, as well as the difficulty (if any) of finding a unique delimiter, and if you don't have one, the complexity of making the delimiter recognizable from data inside a field. If the data is only meaningful if written in order, then having to read it in order doesn't really pose a problem. If you can read individual pieces meaningfully, then being able to do so is much more likely to be useful.
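For comparison, here is a minimal sketch of the size-prefixed convention; the file name is illustrative, and a real format would pin down the integer's width and byte order.

#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>

int main()
{
    const char* foo = "lorem ipsum";

    // Write: a fixed-width length, then the raw bytes (no delimiter needed).
    std::ofstream out("foo.bin", std::ios_base::binary);
    std::uint32_t len = static_cast<std::uint32_t>(std::strlen(foo));
    out.write(reinterpret_cast<const char*>(&len), sizeof(len));
    out.write(foo, len);
    out.close();

    // Read: fetch the length first, then exactly that many bytes.
    std::ifstream in("foo.bin", std::ios_base::binary);
    std::uint32_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof(n));
    std::string bar(n, '\0');
    in.read(&bar[0], n);

    // To skip the field instead, we could simply in.seekg(n, std::ios_base::cur).
}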
In the end, it comes down to a question of what you want out of your file being "binary". In the typical case, all "binary" really means is that end-of-line markers, which might otherwise be translated from a newline character to (for example) a carriage-return/line-feed pair, won't be. Depending on the OS you're using, it might not even mean that much; on Linux, for example, there's normally no difference between binary and text mode at all.
Well, there are no rules broken and you'll get away with that just fine, except that you may miss some of the precision of reading binary data from a stream object.
With binary input, you usually want to know how many characters were read successfully, which you can obtain afterwards with gcount(); std::getline, however, will not reflect the bytes it read in gcount().
Of course, you can simply get that information from the size of the string you passed into std::getline. But the stream will no longer encapsulate the number of bytes you consumed in the last unformatted operation.
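To illustrate the difference: istream::read is a member unformatted function and updates gcount(), whereas the free std::getline does not, so you have to fall back on the string's size. A small sketch, assuming the delimiter is actually found:

#include <fstream>
#include <iostream>
#include <string>

int main()
{
    std::ifstream input("foo.txt", std::ios_base::binary);

    char buf[4];
    input.read(buf, sizeof(buf));    // member unformatted input: sets gcount()
    std::cout << "read() consumed:    " << input.gcount() << " bytes\n";

    std::string bar;
    std::getline(input, bar, '\0');  // free function: gcount() is unchanged
    std::cout << "getline() consumed: " << bar.size() + 1
              << " bytes (string size + delimiter), not reflected by gcount()\n";
}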

How does the compiler optimize getline() so effectively?

I know a lot of a compiler's optimizations can be rather esoteric, but my example is so straightforward I'd like to see if I can understand, if anyone has an idea what it could be doing.
I have a 500 mb text file. I declare and initialize an fstream:
std::fstream file(path, std::ios::in);
I need to read the file sequentially. It's tab delimited but the field lengths are not known and vary line to line. The actual parsing I need to do to each line added very little time to the total (which really surprised me since I was doing string::find on each line from getline. I thought that'd be slow).
In general I want to search each line for a string, and abort the loop when I find it. I also have it incrementing and spitting out the line numbers for my own curiosity, I confirmed this adds little time (5 seconds or so) and lets me see how it blows past short lines and slows down on long lines.
I have the text to be found as the unique string tagging the eof, so it needs to search every line. I'm doing this on my phone so I apologize for formatting issues but it's pretty straightforward. I have a function taking my fstream as a reference and the text to be found as a string and returning a std::size_t.
long long int lineNum = 0;
std::string line;
std::size_t pos = std::string::npos;
while (std::getline(file, line))
{
    pos = line.find(text);
    lineNum += 1;
    std::cout << std::to_string(lineNum) << std::endl;
    if (pos != std::string::npos)
        return file.tellg();
}
return std::string::npos;
Edit: lingxi pointed out the to_string isn't necessary here, thanks. As mentioned, entirely omitting the line number calculation and output saves a few seconds, which in my preoptimized example is a small percent of the total.
This successfully runs through every line, and returns the end position in 408 seconds. I saw minimal improvement from putting the file in a stringstream, or from omitting everything in the loop body (just getline until the end; no checks, searches, or displaying). Pre-reserving a huge capacity for the string didn't help either.
Seems like the getline is entirely the driver. However... if I compile with the /O2 flag (MSVC++) I get a comically faster 26 seconds. In addition, there is no apparent slowdown on long lines vs. short ones. Clearly the compiler is doing something very different. No complaints from me, but any thoughts as to how it's achieved? As an exercise I'd like to try to get my code to execute faster before compiler optimization.
I bet it has something to do with the way getline manipulates the string. Would it be faster (alas, I can't test for a while) to just reserve the whole file size for the string, and read character by character, incrementing my line number when I pass a '\n'? Also, would the compiler employ things like mmap?
UPDATE: I'll post code when I get home this evening. It looks like just turning off runtime checks dropped the execution from 400 seconds to 50! I tried performing the same functionality using raw c style arrays. I'm not super experienced, but it was easy enough to dump the data into a character array, and loop through it looking for newlines or the first letter of my target string.
Even in full debug mode it gets to the end and correctly finds the string in 54 seconds. 26 seconds with checks off, and 20 seconds optimized. So from my informal, ad-hoc experiments it looks like the string and stream functions are victimized by the runtime checks? Again, I'll double check when I get home.
The reason for this dramatic speedup is that the iostream class hierarchy is based on templates (std::ostream is actually a typedef of a template called std::basic_ostream), and a lot of its code is in headers. C++ iostreams take several function calls to process every byte in the stream. However, most of those functions are fairly trivial. By turning on optimization, most of these calls are inlined, exposing to the compiler the fact that std::getline essentially copies characters from one buffer to another until it finds a newline - normally this is "hidden" under several layers of function calls. This can be further optimized, reducing the overhead per byte by orders of magnitude.
Buffering behavior actually doesn't change between the optimized and non-optimized version, otherwise the speedup would be even higher.
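If you want to try to beat std::getline by hand, one sketch of the chunked approach is below: read the file in large blocks with istream::read and scan for newlines yourself with memchr. This is untested against your file, and it deliberately ignores lines that straddle a chunk boundary, which a real implementation would have to carry over between reads.

#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Returns the 0-based line number containing `text`, or -1 if not found.
// Sketch only: lines crossing a chunk boundary are not handled.
long long findLine(std::istream& file, const std::string& text)
{
    std::vector<char> chunk(1 << 20); // 1 MiB per read
    long long lineNum = 0;
    while (file.read(chunk.data(), chunk.size()) || file.gcount() > 0)
    {
        const char* p = chunk.data();
        const char* end = p + file.gcount();
        while (const char* nl = static_cast<const char*>(
                   std::memchr(p, '\n', end - p)))
        {
            if (std::string(p, nl).find(text) != std::string::npos)
                return lineNum;
            ++lineNum;
            p = nl + 1;
        }
    }
    return -1;
}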

Is there any way to read characters that satisfy certain conditions only from stdin in C++?

I am trying to read some characters that satisfy certain condition from stdin with iostream library while leave those not satisfying the condition in stdin so that those skipped characters can be read later. Is it possible?
For example, I want characters in a-c only, and the input stream is abdddcxa.
First read all the characters in a-c (abca); after this input is finished, start reading the remaining characters (dddx). (These two reads can't happen simultaneously; they might be in two different functions.)
Wouldn't it be simpler to read everything, then split the input into the two parts you need and finally send each part to the function that needs to process it?
Keeping the data in the stdin buffer is akin to using globals, it makes your program harder to understand and leaves the risk of other code (or the user) changing what is in the buffer while you process it.
On the other hand, dividing your program into "the part that reads the data", "the part that parses the data and divides the workload" and the "part that does the work" makes for a better structured program which is easy to understand and test.
You can probably use regex to do the actual split.
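A minimal sketch of that structure, using a plain range check instead of a regex, since the condition here is just "in a-c":

#include <iostream>
#include <string>

int main()
{
    std::string all;
    std::cin >> all;                  // 1. read everything, e.g. "abdddcxa"

    // 2. split the workload
    std::string wanted, rest;
    for (char c : all)
        (c >= 'a' && c <= 'c' ? wanted : rest) += c;

    // 3. hand each part to the code that processes it
    std::cout << "first pass:  " << wanted << '\n';  // "abca"
    std::cout << "second pass: " << rest << '\n';    // "dddx"
}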
What you're asking for is the putback method (for more details see: http://www.cplusplus.com/reference/istream/istream/putback/). You would have to read everything, filter out the characters you don't want to consume yet, and put them back into the stream. So for instance:
cin >> myString;
// Fill putbackBuf[] with the characters to be put back, in reverse order,
// terminated by a '\0'.
char* pPutbackBuf = &putbackBuf[0];
while (*pPutbackBuf)
    cin.putback(*(pPutbackBuf++));
Another solution (which is not exactly what you're asking for) would be to split the input into two strings and then feed the "non-inputted" string into a stringstream and pass that to whatever function needs to do something with the rest of the characters.
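That alternative might look like the following sketch: any function written against std::istream& can consume the leftover string through a std::istringstream, exactly as it would consume std::cin.

#include <iostream>
#include <sstream>
#include <string>

// Works identically whether `in` is std::cin or a stringstream.
void handleRemaining(std::istream& in)
{
    char c;
    while (in.get(c))
        std::cout << c;               // here: "dddx"
    std::cout << '\n';
}

int main()
{
    std::istringstream leftovers("dddx"); // the "non-inputted" part
    handleRemaining(leftovers);
}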
What you want to do is not possible in general; ungetc and putback exist, but they're not guaranteed to work for more than one character. They don't actually change stdin; they just push back on an input buffer.
What you could do instead is to explicitly keep a buffer of your own, by reading the input into a string and processing that string. Streams don't let you safely rewind in many cases, though.
No, random access is not possible for streams (except for fstream and stringstream). You will have to read in the whole line/input and process the resulting string (which you could, however, do using iostreams/std::stringstream if you think it is the best tool for that -- I don't think so, but iostreams gurus may differ).

How to use fgets if you don't know the number of characters to be read?

I need to read a file and send the text from it to a string so I can parse it. However, the program won't know exactly how long the file is, so what would I do if I wanted to use fgets(), or is there a better alternative?
Note:
char *fgets(char *str, int num, FILE *stream);
Don't forget that fgets() reads a line at a time, subject to having enough space.
Humans seldom write lines longer than ... 80, 256, pick a number ... characters. POSIX suggests a line length of 4096. So, I usually use:
char buffer[4096];
while (fgets(buffer, sizeof(buffer), fp))
{
...process line...
}
If you are worried that someone might provide more than 4K of data in a single line (and a machine generated file, such as HTML or JSON, might contain that), then you have to decide what to do next. You can do any of the following (and there are likely some other options I've not mentioned):
Process the over-long lines in bits without assuming that there was a newline in between.
Allocate memory for a longer line (say 8K to start with), copy the initial 4K into the allocated buffer, and read more data into the second half of the buffer, iterating until you find the end of line (see the sketch after this list).
Use the POSIX 2008 function getline() which is available on Linux. It does memory allocation for you.
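Here is a sketch of the second option, doubling the buffer until a complete line has been read. The function name is made up, and error handling is kept minimal.

#include <cstdio>
#include <cstdlib>
#include <cstring>

/* Read one line of any length from fp. Returns a malloc'd string the
   caller must free, or NULL on EOF/error. Sketch only. */
char* read_long_line(FILE* fp)
{
    size_t cap = 4096, len = 0;
    char* buf = (char*)malloc(cap);
    if (!buf) return NULL;

    while (fgets(buf + len, (int)(cap - len), fp))
    {
        len += strlen(buf + len);
        if (len > 0 && buf[len - 1] == '\n')
            return buf;               /* complete line */
        cap *= 2;                     /* no newline yet: double and refill */
        char* tmp = (char*)realloc(buf, cap);
        if (!tmp) { free(buf); return NULL; }
        buf = tmp;
    }
    if (len > 0) return buf;          /* last line without trailing newline */
    free(buf);
    return NULL;
}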
You can use fgets iteratively, but a simpler alternative is (stdio.h's) getline. It's in POSIX, but it's not standard C.
Since you're using C++ though, can you use std::string functions like iostream's getline?
If you're not on a POSIX system and don't have getline available, take a look at Chuck Falconer's public domain ggets/fggets functions which dynamically grow a buffer to consume an entire line. (That link seems to be down right now, but archive.org has a copy.)
Allocate a buffer (the one that str points to), and pass the size of the buffer for num. The actual space taken up will only be the length of the text read by fgets.
Something like:
char str[1000];
fgets(str, 1000, file); /* file is a FILE* */
If the next line only has 10 characters before the newline, then str will hold those 10 characters, the newline, and the null terminator.
Edit: just in case there is any confusion, I didn't intend the above to sound as if the extra space in the buffer isn't in use. I only meant to illustrate that you don't need to know ahead of time how long your string is going to be, as long as you can put a maximum length on it.