Why does the VS2008 std::string.erase() move its buffer? - c++

I want to read a file line by line and capture one particular line of input. For maximum performance I could do this at a low level by reading the entire file in and iterating over its contents with pointers, but this code is not performance critical, so I would rather use a more readable and type-safe standard library implementation.
So what I have is this:
std::string line;
line.reserve(1024);
std::ifstream file(filePath);
while (file)
{
    std::getline(file, line);
    if (line.substr(0, 8) == "Whatever")
    {
        // Do something ...
    }
}
While this isn't performance-critical code, I've called line.reserve(1024) before the parsing loop to preclude multiple reallocations of the string as longer lines are read in.
Inside std::getline the string is erased before the characters of each line are appended to it. I stepped through this code to satisfy myself that the memory wasn't being reallocated on each iteration, and what I found fried my brain.
Deep inside string::erase, rather than just resetting its size variable to zero, what it's actually doing is calling memmove_s with pointer values that would overwrite the used part of the buffer with the unused part immediately following it, except that memmove_s is being called with a count argument of zero, i.e. requesting a move of zero bytes.
Questions:
Why would I want the overhead of a library function call in the middle of my lovely loop, especially one that is being called to do nothing at all?
I haven't picked it apart myself yet but under what circumstances would this call not actually do nothing but would in fact start moving chunks of buffer around?
And why is it doing this at all?
Bonus question: What's the C++ standard library tag?

This is a known issue I reported a year ago; to take advantage of the fix you'll have to upgrade to a future version of the compiler.
Connect Bug: "std::string::erase is stupidly slow when erasing to the end, which impacts std::string::resize"
The standard doesn't say anything about the complexity of any std::string functions, except swap.

std::string::clear() is defined in terms of std::string::erase(),
and std::string::erase() does have to move all of the characters after
the block which was erased. So why shouldn't it call a standard
function to do so? If you've got some profiler output which proves that
this is a bottleneck, then perhaps you can complain about it, but
otherwise, frankly, I can't see it making a difference. (The logic
necessary to avoid the call could end up costing more than the call.)
Also, you're not checking the results of the call to getline before
using them. Your loop should be something like:
while ( std::getline( file, line ) ) {
    // ...
}
And if you're so worried about performance, creating a substring (a new
std::string) just in order to do a comparison is far more expensive
than a call to memmove_s. What's wrong with something like:
static std::string const target( "Whatever" );
if ( line.size() >= target.size()
     && std::equal( target.begin(), target.end(), line.begin() ) ) {
    // ...
}
I'd consider this the most idiomatic way of determining whether a
string starts with a specific value.
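An equivalent check that avoids iterators is std::string::compare with a position and length, which likewise allocates no temporary; a small sketch, wrapped in a helper whose name is my own:

```cpp
#include <string>

// Helper (name mine) equivalent to the std::equal version above.
bool startsWith(const std::string& line, const std::string& target) {
    // compare(pos, len, str) returns 0 when the first target.size()
    // characters of line equal target; no temporary string is created.
    return line.size() >= target.size()
        && line.compare(0, target.size(), target) == 0;
}
```

Either form does the prefix test in place, unlike the substr() version in the question.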
(I might add that from experience, the reserve here doesn't buy you
much either. After you've read a couple of lines in the file, your
string isn't going to grow much anyway, so there'll be very few
reallocations after the first couple of lines. Another case of
premature optimization?)

In this case, I think the idea you mention of reading the entire file and iterating over the result may actually give code about as simple. You're simply changing "read line, check for prefix, process" to "read file, scan for prefix, process":
const size_t not_found = std::string::npos;
std::ostringstream buffer;
buffer << file.rdbuf();
std::string data = buffer.str();
char const target[] = "\nWhatever";
size_t len = sizeof(target) - 1;
for (size_t pos = 0; not_found != (pos = data.find(target, pos)); pos += len)
{
    // process relevant line starting at data[pos + 1]
}

Related

Why on earth is my file reading function placing null-terminators where excess CR LF carriages should be?

Today I tried to put together a simple OpenGL shader class, one that loads text from a file, does a little bit of parsing to build a pair of vertex and fragment shaders according to some (pretty sweet) custom syntax (for example, writing ".varying [type] [name];" would allow you to define a varying variable in both shaders while only writing it once, same with ".version",) then compiles an OpenGL shader program using the two, then marks the shader class as 'ready' if and only if the shader code compiled correctly.
Now, I did all this, but then encountered the most bizarre (and frankly kinda scary) problems. I set everything up, declared a new 'tt::Shader' with some file containing valid shader code, only to have it tell me that the shader was invalid but then give me an empty string when I asked what the error was (which means OpenGL gave me an empty string as that's where it gets it from.)
I tried again, this time with obviously invalid shader code, and while it identified that the shader was invalid, it still gave me nothing in terms of what the error was, just an empty string (from which I assumed that obviously the error identification portion of it was also just the same as before.)
Confused, I re-wrote both shaders, the valid and invalid one, by hand as a string, compiling the classes again with the string directly, with no file access. Doing this, the error vanished, the first one compiled correctly, and the second one failed but correctly identified what the error was.
Even more confused, I started comparing the strings from the files to those I wrote myself. Turns out the former were a tad longer than the latter, despite printing the same. After doing a bit of counting, I realised that these characters must be Windows CR LF line-ending carriage characters that got cut off in the importing process.
To test this, I took the hand-written strings, inserted carriages where they would be cut off, and ran my string comparison tests again. This time, it evaluated their lengths to be the same, but also told me that the two were still not equal, which was quite puzzling.
So, I wrote a simple for-loop to iterate through the characters of the two strings and print them each next to one another, cast to integers so I could see their index values. I ran the program, looked through the (quite lengthy) list, and came to a very insightful though even less clarifying answer: the hidden characters were in the right places, but they weren't carriages ... they were null-terminators!
Here's the code for the file reading function I'm using. It's nothing fancy, just standard library stuff.
// Attempts to read the file with the given path, returning a string of its contents.
// If the file could not be found and read, an empty string will be returned.
// File strings are built by reading the file line by line and assembling a single string with newlines placed between them.
// Given this line-by-line method, take note that it will copy no more than 4096 bytes from a single line before moving on.
inline std::string fileRead(const std::string& path) {
    if (!tt::fileExists(path))
        return "";
    std::ifstream a;
    a.open(path);
    std::string r;
    const tt::uint32 _LIMIT = 4096;
    char r0[_LIMIT];
    tt::uint32 i = 0;
    while (a.good()) {
        a.getline(r0, _LIMIT);
        if (i > 0)
            r += "\n";
        i++;
        r += std::string(r0, static_cast<tt::uint32>(a.gcount()));
    }
    // TODO: Ask StackOverflow why on earth our file reading function is placing null characters where excess carriages should go.
    for (tt::uint32 i = 0; i < r.length(); i++)
        if (r[i] == '\0')
            r[i] = '\r';
    a.close();
    tt::printL("Reading file '" + path + "' ...");
    return r;
}
If y'all could take a read and tell me what the hell is going on with it, that'd be awesome, as I'm at a total loss for what it's doing to my string to cause this.
Lastly, I do get why the null-terminators didn't show up to me but did for OpenGL: the latter was using C-strings, while I was just doing everything with std::string objects, which store things based on length, given that they're pretty much just fancy std::vector objects.
Read the documentation for the std::string constructor. The constructor std::string(const char*, size_t n) creates a string of size n regardless of input. It may contain a null character inside, or even more than one. Note that the size of a std::string doesn't include the terminating null character (so str[str.size()] == '\0' always).
So clearly the code simply copies the null character from the output buffer of the getline function.
Why would it do that? Go to the gcount() function documentation: it returns the number of characters extracted by the last operation. That count includes the extracted delimiter '\n', which getline replaces in the output buffer with '\0'. Voila: exactly one more than the number of characters the constructor should be asked for.
So to fix it, simply replace that line with:
r += std::string(r0, static_cast<tt::uint32>(a.gcount()-1));
Or you could simply have used std::getline() with a std::string as input instead of a raw buffer, and none of this would have happened.
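A sketch of that simpler version, keeping the original's behaviour of returning an empty string for an unreadable file (tt::fileExists and the logging call are omitted here, and the CRLF stripping is my own addition):

```cpp
#include <fstream>
#include <string>

// Sketch only: an unopenable file is treated the same as a missing one.
inline std::string fileRead(const std::string& path) {
    std::ifstream a(path);
    if (!a.is_open())
        return "";
    std::string r, line;
    bool first = true;
    while (std::getline(a, line)) {
        // std::getline consumed the '\n'; drop a stray '\r' left by CRLF endings.
        if (!line.empty() && line.back() == '\r')
            line.pop_back();
        if (!first)
            r += '\n';
        first = false;
        r += line;
    }
    return r;
}
```

std::getline(std::string&) grows the string to fit the line, so the 4096-byte limit and the gcount() bookkeeping both disappear.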

Looking for a more compact syntax (simple code) - C++

I'm relearning C++ and I found myself often writing pieces of code like this:
vector<string> Text;
string Line;
while (getline(cin, Line))
{
    Text.push_back(Line);
}
I was wondering if there is a more compact way to write the loop using only basic features (no user-written functions, for example) - more or less, putting everything in the condition?
You can use a for loop. We can declare Line in the variable declaration part and use the condition and increment parts to read and place the line. That gives you
for(string Line; getline(cin, Line); Text.push_back(Line));
more or less, putting everything in the condition?
You can do this
while (getline(cin, Line) && (Text.push_back(Line),true)) {}
It works because && is short-circuited and because the comma-operator evaluates the first operand, discards the result and returns the result of the second operand.
So it is possible, but why would you want to do that? Making code as dense as possible is rarely good for readability (actually, your original code is more readable and uses fewer characters).
Well, at the expense of some obfuscation,
while (getline(cin, Line) && (Text.push_back(Line), 1));
would do it: note the use of the expression separator operator which, informally speaking, "converts" the void return type of push_back to an int so enabling its use with the short-circuiting &&.
But as a rule of thumb, work with the language, not against it. My answer is doing the latter. The way you present the code in your question is adequate.
At the risk of "but that is exactly OP's code!", I would personally favor this version if this is the entire body of a scope (e.g. function that parses the text):
vector<string> Text;
string Line;
while (getline(cin, Line))
Text.push_back(Line);
Alternatively, if part of a larger scope, I would probably group all four lines together and add empty lines before and after for visual coherence (and maybe add a short comment before it):
// [lots of other code]
// Gather input from cin.
vector<string> Text;
string Line;
while (getline(cin, Line))
Text.push_back(Line);
// [lots of other code]
I am aware that this introduces no clever tricks, but this is the most compact and readable form of the given code, at least to me.
If you wanted compactness above all else, you could choose throwaway variable names, omit all unnecessary whitespace, and even alias the types beforehand (since we are "often writing" this kind of code, this is not a one-off) to get, say, V<S> t;S l;while(getline(cin,l))t.push_back(l); but nobody wants to read that.
So clearly there is more than compactness at play. As for me, I'm looking to keep noise to a minimum while retaining intuitive readability, and I would suggest this is an agreeable goal.
I would never use the "throw everything into the loop condition" suggestions because that very much breaks how I expect code to be structured: The main purpose of your loop goes into the loop body. You may disagree/have different expectations, but in my eyes everything else is just an attempt to show off your minifying skills, it does not produce good code.
The above accomplishes just that: The braces are noise for this simple operation, and the important part stands out as the loop body. "But is getline not also important?" - It is, and I would honestly prefer a version where it is in the loop body, such as a hypothetical
vector<string> Text;
while (cin.hasLine())
Text.push_back(readLine(cin));
This would be an ideal loop to me: The condition only checks for termination and the loop body is only the operation we want to repeat.
Even better would be a standard algorithm, but I am unaware of any that would help here (ranges or boost might provide one, I don't know).
On a more abstract level, if OP frequently writes this exact code, it should obviously be a separate function. But even if not, the "lots of other code" example would benefit from that abstraction too.
A loop with a single instruction. You can write it on one line, but I don't recommend it:
while (getline(cin, Line)) Text.push_back(Line);

How does the compiler optimize getline() so effectively?

I know a lot of a compiler's optimizations can be rather esoteric, but my example is so straightforward I'd like to see if I can understand, if anyone has an idea what it could be doing.
I have a 500 mb text file. I declare and initialize an fstream:
std::fstream file(path, std::ios::in);
I need to read the file sequentially. It's tab delimited, but the field lengths are not known and vary line to line. The actual parsing I need to do on each line added very little time to the total (which really surprised me, since I was doing string::find on each line from getline; I thought that'd be slow).
In general I want to search each line for a string, and abort the loop when I find it. I also have it incrementing and spitting out the line numbers for my own curiosity, I confirmed this adds little time (5 seconds or so) and lets me see how it blows past short lines and slows down on long lines.
I have the text to be found as a unique string tagging the end of the file, so it needs to search every line. I'm doing this on my phone, so I apologize for formatting issues, but it's pretty straightforward. I have a function taking my fstream as a reference and the text to be found as a string, and returning a std::size_t.
long long int lineNum = 0;
while (std::getline(file, line))
{
    pos = line.find(text);
    lineNum += 1;
    std::cout << std::to_string(lineNum) << std::endl;
    if (pos != std::string::npos)
        return file.tellg();
}
return std::string::npos;
Edit: lingxi pointed out the to_string isn't necessary here, thanks. As mentioned, entirely omitting the line number calculation and output saves a few seconds, which in my preoptimized example is a small percent of the total.
This successfully runs through every line, and returns the end position in 408 seconds. I had minimal improvement trying to put the file in a stringstream, or omitting everything in the whole loop (just getline until the end, no checks, searches, or displaying). Also pre-reserving a huge space for the string didn't help.
Seems like the getline is entirely the driver. However...if I compile with the /O2 flag (MSVC++) I get a comically faster 26 seconds. In addition, there is no apparent slow down on long lines vs short. Clearly the compiler is doing something very different. No complaints from me, but any thoughts as to how it's achieved? As an exercise I'd like to try and get my code to execute faster before compiler optimization.
I bet it has something to do with the way getline manipulates the string. Would it be faster (alas, I can't test for a while) to just reserve the whole file size for the string and read character by character, incrementing my line number when I pass a '\n'? Also, would the compiler employ things like mmap?
UPDATE: I'll post code when I get home this evening. It looks like just turning off runtime checks dropped the execution from 400 seconds to 50! I tried performing the same functionality using raw c style arrays. I'm not super experienced, but it was easy enough to dump the data into a character array, and loop through it looking for newlines or the first letter of my target string.
Even in full debug mode it gets to the end and correctly finds the string in 54 seconds. 26 seconds with checks off, and 20 seconds optimized. So from my informal, ad-hoc experiments it looks like the string and stream functions are victimized by the runtime checks? Again, I'll double check when I get home.
The reason for this dramatic speedup is that the iostream class hierarchy is based on templates (std::ostream is actually a typedef of a template called std::basic_ostream), and a lot of its code is in headers. C++ iostreams take several function calls to process every byte in the stream. However, most of those functions are fairly trivial. By turning on optimization, most of these calls are inlined, exposing to the compiler the fact that std::getline essentially copies characters from one buffer to another until it finds a newline - normally this is "hidden" under several layers of function calls. This can be further optimized, reducing the overhead per byte by orders of magnitude.
Buffering behavior actually doesn't change between the optimized and non-optimized version, otherwise the speedup would be even higher.
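To make the "copy bytes until a newline" picture concrete, here is a rough sketch of the read-everything-then-scan idea from the question (names are mine, and this is an illustration rather than a drop-in replacement for the original function):

```cpp
#include <cstddef>
#include <string>

// One pass over a buffer holding the whole file: find the target, then
// count newlines before it to recover the 1-based line number. No
// per-line allocation and no stream calls inside the loop.
long long findLineContaining(const std::string& data, const std::string& text) {
    std::size_t pos = data.find(text);
    if (pos == std::string::npos)
        return -1;  // not found
    long long lineNum = 1;
    for (std::size_t i = 0; i < pos; ++i)
        if (data[i] == '\n')
            ++lineNum;
    return lineNum;
}
```

This is roughly what the optimized build reduces the getline loop to: straight-line scanning of memory, with the stream machinery inlined away.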

var arg list to tempfile, why is it needed?

I have this code inside a constructor of a class (not written by me) and it writes a variable arg list to a tmp file.
I wondered why this would be needed? The tmpfile is removed after this ctor goes out of scope and the var arg list sits inside the m_str vector.
Can someone suggest a better way of doing this without the use of a tmpfile?
DString(const char *fmt, ...)
{
    DLog::Instance()->Log("Inside DString with ellipses");
    va_list varptr;
    va_start(varptr, fmt);
    FILE *f = tmpfile();
    if (f != NULL)
    {
        int n = ::vfprintf(f, fmt, varptr) + 1;
        m_str.resize(n + 1);
        ::vsprintf(&m_str[0], fmt, varptr);
        va_end(varptr);
    }
    else
        DLog::Instance()->Log("[ERROR TMPFILE:] Unable to create TmpFile for request!");
}
This is C++ code: I think you may be trying to solve the wrong problem here.
The need for a temp file would go away completely if you consider using a C++-esque design instead of continuing to use the varargs. It may seem like a lot of work to convert all the calling sites to use a new mechanism, but varargs provide a wide variety of possibilities to mis-pass parameters leaving you open to insidious bugs, not to mention you can't pass non-POD types at all. I believe in the long (or even medium) term it will pay off in reliability, clarity, and ease of debugging.
Instead try to implement a C++-style streams interface that provides type safety and even the ability to disallow certain operations if needed.
It's just using the temporary file as a place that it can write the contents that won't overflow, so it can measure the length, then allocate sufficient space for the string, and finally deposit the real output in the string.
I'd at least consider how difficult it would be to replace the current printf-style interface that's leading to this with an iostreams-style interface, which will make it easy to avoid and give all the usual benefits of iostreams (type-safe, extensible, etc.)
Edit: if changing the function's signature is really too difficult to contemplate, then you probably want to replace vfprintf with vsnprintf. vsnprintf allows you to specify a buffer length (so it won't overrun the buffer) and it returns the number of characters that would have been generated if there had been sufficient space. As such, usage would be almost like you have now, but avoid generating the temporary file. You'd call it once specifying a buffer length of 0, use the return value (+1 for the NUL terminator) to resize your buffer, then call it again specifying the correct buffer size.
It appears to be using the temp file as an output place for the ::vfprintf() call. It does that to get the length of the formatted string (plus 1 for the NULL). Then resizes m_str big enough to hold the formatted string, which gets filled in from the ::vsprintf() call.
The var arg list is not in the file or in m_str. The formatted output from printf() (and its variants) is in the file and in m_str.
I have a queasy feeling showing this, but on Windows you could try:
FILE *fp = freopen("nul", "w", stderr);
int n = ::vfprintf(fp, fmt, varptr);
fclose(fp);

Overloading operator>> to a char buffer in C++ - can I tell the stream length?

I'm on a custom C++ crash course. I've known the basics for many years, but I'm currently trying to refresh my memory and learn more. To that end, as my second task (after writing a stack class based on linked lists), I'm writing my own string class.
It's gone pretty smoothly until now; I want to overload operator>> so that I can do stuff like cin >> my_string;.
The problem is that I don't know how to read the istream properly (or perhaps the problem is that I don't know streams...). I tried a while (!stream.eof()) loop that .read()s 128 bytes at a time, but as one might expect, it stops only at EOF. I want it to read up to a newline, like you get with cin >> into a std::string.
My string class has an alloc(size_t new_size) function that (re)allocates memory, and an append(const char *) function that does that part, but I obviously need to know the amount of memory to allocate before I can write to the buffer.
Any advice on how to implement this? I tried getting the istream length with seekg() and tellg(), to no avail (it returns -1), and as I said looping until EOF (doesn't stop reading at a newline) reading one chunk at a time.
To read characters from the stream until the end of line use a loop.
char c;
while (istr.get(c) && c != '\n')
{
    // Append 'c' to the end of your string.
}
// If you want to put the '\n' back onto the stream,
// use istr.unget() here.
// But I think it's safe to say that dropping the '\n' is fine.
If you run out of room reallocate your buffer with a bigger size.
Copy the data across and continue. No need to be fancy for a learning project.
You can use std::cin.getline(buffer, buffer_size);
then you will need to check the bad, eof and fail flags:
std::cin.bad(), std::cin.eof(), std::cin.fail()
Unless bad or eof were set, the fail flag being set usually indicates a buffer overflow, so you should reallocate your buffer and continue reading into the new buffer after calling std::cin.clear().
A side note: in the standard library, operator>> for an istream is either a member overload providing this kind of functionality or (as for char*) a global function. It might be wiser to provide such a custom global overload rather than a member of your class.
Check Jerry Coffin's answer to this question.
The first method he used is very simple (just a helper class) and allows you to write your input into a std::vector<std::string> where each element of the vector represents a line of the original input.
That really makes things easy when it comes to processing afterwards!