I would like to know what's the performance overhead of
string line, word;
while (std::getline(cin, line))
{
istringstream istream(line);
while (istream >> word)
// parse word here
}
I think this is the standard C++ way to tokenize input.
To be specific:
Is each line copied three times: first via getline, then via the istringstream constructor, and finally via operator>> for each word?
Would the frequent construction and destruction of the istringstream be an issue? What would the equivalent implementation be if I defined the stream before the outer while loop?
Thanks!
Update:
An equivalent implementation
string line, word;
stringstream stream;
while (std::getline(cin, line))
{
stream.clear();
stream << line;
while (stream >> word)
// parse word here
}
uses a stream as a local buffer that pushes lines in and pops words out.
This would get rid of the possibly frequent constructor and destructor calls of the previous version and make use of the stream's internal buffering (is this point correct?).
Alternative solutions might be to extend std::string to support operator<< and operator>>, or to extend iostream with something like locate_new_line. Just brainstorming here.
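For what it's worth, the string-based idea can be sketched without touching iostreams at all. Here is a minimal hand-rolled tokenizer (tokenize_line is an invented name, and splitting only on spaces and tabs is an assumption):

```cpp
#include <string>
#include <vector>

// Hypothetical helper: split a line on spaces/tabs without a stringstream.
std::vector<std::string> tokenize_line(const std::string& line)
{
    std::vector<std::string> words;
    const std::string ws = " \t";
    std::string::size_type begin = line.find_first_not_of(ws);
    while (begin != std::string::npos) {
        std::string::size_type end = line.find_first_of(ws, begin);
        words.push_back(line.substr(begin, end - begin)); // one copy per word
        begin = line.find_first_not_of(ws, end);
    }
    return words;
}
```

This still copies each word once (the substr call), but avoids any stream construction or virtual dispatch per character.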
Unfortunately, iostreams is not for performance-intensive work. The problem is not copying things in memory (copying strings is fast), it's virtual function dispatches, potentially to the tune of several indirect function calls per character.
As for your question about copying, yes, as written everything gets copied when you initialize a new stringstream. (Characters also get copied from the stream to the output string by getline or >>, but that obviously can't be prevented.)
Using C++11's move facility, you can eliminate the extraneous copies:
string line, word;
while (std::getline(cin, line)) // initialize line
{ // move data from line into istream (so it's no longer in line):
istringstream istream( std::move( line ) );
while (istream >> word)
// parse word here
}
All that said, performance is only an issue if a measurement tool tells you it is. Iostreams is flexible and robust, and filebuf is basically fast enough, so you can prototype the code so it works and then optimize the bottlenecks without rewriting everything.
When you define a variable inside a block, it is allocated on the stack, and when you leave the block it is popped off the stack again. With this code you have a lot of stack operations; the same goes for word. You can use pointers and operate on them instead of on the variables themselves. Pointers live on the stack too, but what they point to is in heap memory.
Such operations carry overhead for creating the variables and pushing them onto and popping them off the stack. With pointers you allocate the space once and then work with that heap space. Also, pointers can be much smaller than full objects, so their allocation is faster.
As you can see, getline() accepts a reference (a kind of pointer) to the line object, which lets it work with it without creating a new string object.
In your code, the line and word variables are created once and references to them are used. The only object you create in each iteration is the istringstream. If you don't want to create it on each iteration, you can define it before the loop and reassign it through its member functions instead of the constructor.
You can use this :
string line, word;
istringstream ss;
while (std::getline(cin, line))
{
ss.clear();
ss.str(line);
while (ss >> word) {
// parse word here
}
}
Also, you can consult the reference for istringstream.
EDIT: Thanks for the comment, @jrok. Yes, you should clear the error flags before assigning a new string. This is the reference for str(): istringstream::str
Related
I don't understand the design decisions behind the C++ getline function.
Why does it take a stream and a string by reference as arguments, only to return the same stream that was passed in? It seems more intuitive to only take the stream as an argument, then return the string that was read. Returning the same stream lets you chain the call, but would anyone really want to use getline(getline(stream, x), y)?
Additionally, why is the function not in the std namespace like the rest of the standard library?
If the function returned a string, there would be no way of indicating that the read failed, as all string values are valid values that could be returned by this (or any other) function. On the other hand, a stream has lots of error indicator flags that can be tested by the code that calls getline. So people can write code like:
while( std::getline( std::cin, somestring )) {
// do stuff with somestring
}
and it is hard to see how you could write similar code if getline returned a string.
why is the function not in the std namespace like the rest of the standard library?
It is in the std namespace - what makes you think otherwise?
Why does it take a stream and a string by reference as arguments, only to return the same stream that was passed in?
It is a common pattern in the stream library to do that. It means you can test the operation being performed as you perform it. For example:
std::string line;
while(std::getline(std::cin, line))
{
// use line here because we know the read succeeded
}
You can also make succinct parsers by "chaining" stream functions:
std::string key, value;
if(std::getline(std::getline(in, key, '='), value))
my_map[key] = value;
It seems more intuitive to only take the stream as an argument, then return the string that was read.
The problem with returning a new string every call is that you are constantly allocating new memory for them instead of reusing the memory already allocated to the string you passed in or that it gained while iterating through a loop.
// Here line will not need to allocate memory every time
// through the loop. Only when it finds a longer line than
// it has capacity for:
std::string line;
while(std::getline(std::cin, line))
{
// use line here because we know the read succeeded
}
Suppose we've read the content of a text file into a stringstream via
std::ifstream file(filepath);
std::stringstream ss;
ss << file.rdbuf();
and now want to process the file line-by-line. This can be done via
for (std::string line; std::getline(ss, line);)
{
}
However, bearing in mind that ss holds the whole content of the file in an internal buffer (and that we can obtain this content as a string via ss.str()), the code above is highly inefficient. For each line of the file, a memory allocation and a copy operation are performed.
Is it possible to come up with a solution that provides the lines in the form of a std::string_view? (Feel free to use another mechanism to load the whole file; I don't need it to be accessed via a stringstream.)
the code above is highly inefficient. For each line of the file, a memory allocation and a copy operation are performed.
Is it possible to come up with a solution that provides the lines in form of a std::string_view?
Not as far as I know.
But you can use the stream's getline() member function, which receives a pointer to char.
Something like (caution: code not tested):
constexpr std::size_t dim { 100U }; // choose the dim yourself
char buff[dim];
while ( ss.getline(buff, dim) )
{
// do something
}
The copy remains but, at least, this way you should avoid the per-line allocation.
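If C++17 is available, an alternative sketch is to keep the whole file contents in one string and hand out std::string_view slices, one per line, with no per-line copy (split_lines is an invented name):

```cpp
#include <string_view>
#include <vector>

// Hypothetical helper: split a buffer into per-line views (no copies).
std::vector<std::string_view> split_lines(std::string_view buffer)
{
    std::vector<std::string_view> lines;
    while (!buffer.empty()) {
        std::string_view::size_type nl = buffer.find('\n');
        lines.push_back(buffer.substr(0, nl)); // view, not a copy
        if (nl == std::string_view::npos)
            break;
        buffer.remove_prefix(nl + 1);
    }
    return lines;
}
```

The views stay valid only while the backing string (e.g. the result of ss.str() stored in a named variable, or the file contents read once into a std::string) is alive and unchanged.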
I've been using this:
ifstream in("file.txt");
string line;
getline(in,line);
istringstream iss(line);
...
for some simple parsing.
I would like to avoid unnecessary copying in order to improve performance so I tried:
ifstream in("huge_line.txt");
string line;
getline(in,line);
istringstream ss;
ss.rdbuf()->pubsetbuf(const_cast<char*>(line.c_str()), line.size());
...
and it seems to do the job (significantly improve performance, that is). My question is, is this safe given the const_cast?
I mean, as long as I'm working with an istringstream, the internal buffer should never get written to by the istringstream class, so the ss variable should remain in a valid state as long as the line variable is valid and unchanged, right?
The const_cast is safe, because the underlying buffer of std::string is not const. And yes, as long as line does not expire while ss is being read from, your program should be fine.
The effect of ss.rdbuf()->pubsetbuf is implementation-defined and hence doesn't necessarily do what you expect.
So the effect of your altered code need not be equivalent to that of the original.
I have an input stream with the following lines:
# <int> <int>
<some_data_type> <some_data_type> <some_data_type> ..... <some_data_type>
<some_data_type_1> <some_data_type_2> <some_data_type_3> <some_data_type_1> <some_data_type_2> <some_data_type_3> .... <some_data_type_1> <some_data_type_2> <some_data_type_3>
In the above stream, all three lines are different and have to be parsed differently. Currently, I am using a reading method as follows:
void reader( std::istream & is, DataStructure & d ){
std::string line;
getline(is,line);
std::stringstream s(line);
//parse line 1
getline(is,line);
std::stringstream line2(line);
//parse line 2
getline(is,line);
std::stringstream line3(line);
//parse line 3
}
Now the idea is not to make use of std::stringstream at all, as a line can be arbitrarily large and we do not want to load everything into memory twice. So, it would be better if it were possible to read from the input stream directly into the user-given data structure d.
An idea is to make use of std::istream_iterator but unfortunately the different lines have different parsing needs. For example, in the last line, three elements from the stream together constitute a single data element.
The only idea that seems plausible to me at this moment is to handle the stream buffer directly. It would be great if anyone could recommend a better way of doing this.
NOTE: We cannot make use of an intermediate data structure like std::stringstream. It is essential to read from the stream directly into the user-provided data structure.
EDIT: Please note we are only allowed a single pass over the file.
Now the idea is not to make use of std::stringstream at all, as a line
can be arbitrarily large and we do not want to load everything into memory
twice. So, it would be better if it were possible to read from the
input stream directly into the user-given data structure d.
Olaf explained the extraction operator above but then we have a new requirement:
This will only work for the first line, where it is known there is a
fixed number of elements.
and
(2) Unfortunately, I have no discriminator beyond my knowledge that each instance of the data
structure needs to be instantiated with information stored in three
different lines. All three lines have different lengths and different
data elements. Also, I cannot change the format.
plus
(3) All information is treated as unsigned integer.
Now the next issue is that we don't know what the data structure actually is, so given what has come before, it appears to be dynamic in some fashion. Because we can treat the data as unsigned int, we can possibly use the extraction operator, but read into a dynamic member:
vector<unsigned int> myUInts;
...
inFile >> currentUInt;
myUInts.push_back(currentUInt);
But then the issue of where to stop comes into play. Is it at the end of the first line, or the third? If you need to read an arbitrary number of unsigned ints while still checking for a newline, you will need to process whitespace as well:
inFile.unsetf(ios_base::skipws);
How you actually handle that is beyond what I can say at the moment without some clearer requirements. But I would guess it will be in the form:
inFile >> myMember;
char next = inFile.peek();
//skip whitespace and check for new line
//Repeat until data structure filled, and repeat for each data structure.
Then do not use std::getline() at all. Define a stream extraction operator for your types and use those directly:
std::istream &operator >>(std::istream &f, DataStructure &d)
{
f >> d.member1 >> d.member2 >> ...;
return f;
}
void reader(std::istream & is, DataStructure &d)
{
is >> d;
}
There's no need to fiddle with std::istream_iterator or to manipulate the stream buffer directly.
C# coder just wrote this simple C++ method to get text from a file:
static std::vector<std::string> readTextFile(const std::string &filePath) {
std::string line;
std::vector<std::string> lines;
std::ifstream theFile(filePath.c_str());
while (theFile.good()) {
getline (theFile, line);
lines.push_back(line);
}
theFile.close();
return lines;
}
I know this code is not efficient; the text lines are copied once as they are read and a second time when returned-by-value.
Two questions:
(1) can this code leak memory ?
(2) more generally can returning containers of objects by value ever leak memory ? (assuming the objects themselves don't leak)
while (theFile.good()) {
getline (theFile, line);
lines.push_back(line);
}
Forget about efficiency, this code is not correct. It will not read the file correctly. See the following topic to know why:
What's preferred pattern for reading lines from a file in C++?
So the loop should be written as:
while (getline (theFile, line)) {
lines.push_back(line);
}
Now this is correct. If you want to make it efficient, profile your application first and see which part takes the most CPU cycles.
(1) can this code leak memory ?
No.
(2) more generally can returning containers of objects by value ever leak memory ?
Depends on the type of the object in the containers. In your case, the type of the object in std::vector is std::string which makes sure that no memory will be leaked.
No and no. Returning by value never leaks memory (assuming the containers and the contained objects are well written). It would be fairly useless if it was any other way.
And I second what Nawaz says, your while loop is wrong. It's frankly incredible how many times we see that, there must be an awful lot of bad advice out there.
(1) can this code leak memory ?
No
(2) more generally can returning containers of objects by value ever leak memory ?
No. You might leak memory that is stored in a container by pointer or through objects that leak. But that would not be caused by returning by value.
I know this code is not efficient; the text lines are copied once as they are read and a second time when returned-by-value.
Most probably not. There are two copies of the string, but not the ones you are thinking about. The return copy will most likely be elided in C++03, and will either be elided or turned into a cheap move in C++11.
The two copies are rather:
getline (theFile, line);
lines.push_back(line);
The first line copies from the file to line, and the second copies from line to the container. If you are using a C++11 compiler you can change the second line to:
lines.push_back(std::move(line));
to move the contents of the string to the container. Alternatively (and also valid in C++03), you can change the two lines with:
lines.push_back(std::string()); // In most implementations this is *cheap*
// (i.e. no memory allocation)
getline(theFile, lines.back());
And you should test the result of the read (if the read fails, in the last alternative, make sure to resize to one element fewer to remove the last empty string).
In C++11, you can do:
std::vector<std::string>
read_text_file(const std::string& path)
{
std::string line;
std::vector<std::string> ans;
std::ifstream file(path.c_str());
while (std::getline(file, line))
ans.push_back(std::move(line));
return ans;
}
and no extra copies are made.
In C++03, you accept the extra copies, and painfully remove them only if profiling dictates so.
Note: you don't need to close the file manually, the destructor of std::ifstream does it for you.
Note2: You can template on the char type, this may be useful in some circumstances:
template <typename C, typename T>
std::vector<std::basic_string<C, T>>
read_text_file(const char* path)
{
std::basic_string<C, T> line;
std::vector<std::basic_string<C, T>> ans;
std::basic_ifstream<C, T> file(path);
// Rest as above
}
No, returning a container by value should not leak memory. The standard library is designed not to leak memory itself in any case; it can only leak if there is a bug in its implementation. (At least, there used to be such a bug in a vector of strings in an old version of MSVC.)