std::istringstream from std::string without copying - c++

I've been using this:
ifstream in("file.txt");
string line;
getline(in,line);
istringstream iss(line);
...
for some simple parsing.
I would like to avoid unnecessary copying in order to improve performance so I tried:
ifstream in("huge_line.txt");
string line;
getline(in,line);
istringstream ss;
ss.rdbuf()->pubsetbuf(const_cast<char*>(line.c_str()), line.size());
...
and it seems to do the job (significantly improve performance, that is). My question is, is this safe given the const_cast?
I mean, as long as I'm working with an istringstream, the internal buffer should never get written to by the istringstream class, so the ss variable should remain in a valid state as long as the line variable is valid and unchanged, right?

The const_cast is safe, because the underlying buffer of std::string is not const. And yes, as long as line does not expire while ss is being read from, your program should be fine.

The effect of ss.rdbuf()->pubsetbuf() is implementation-defined for a std::stringbuf, so it doesn't necessarily do what you expect.
The behaviour of your altered code therefore need not be equivalent to that of the original.
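If you want to avoid both the copy and the implementation-defined pubsetbuf, one portable alternative is a tiny custom streambuf whose get area points directly at the string's storage. This is only a minimal sketch, assuming line outlives the stream; the name view_buf is mine, not a standard type:

#include <cstddef>
#include <istream>
#include <streambuf>

struct view_buf : std::streambuf
{
    view_buf(char* data, std::size_t size)
    {
        // setg() is a protected member of std::streambuf, so we call it from
        // the derived class to define the get area over existing memory.
        setg(data, data, data + size);
    }
};

// usage (line.data() returns a non-const char* since C++17; use &line[0] before that):
// view_buf buf(line.data(), line.size());
// std::istream src(&buf);   // reads from line's buffer without copying it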

Related

Why is getline written so strangely?

I don't understand the design decisions behind the C++ getline function.
Why does it take a stream and a string by reference as arguments, only to return the same stream that was passed in? It seems more intuitive to only take the stream as an argument, then return the string that was read. Returning the same stream lets you chain the call, but would anyone really want to use getline(getline(stream, x), y)?
Additionally, why is the function not in the std namespace like the rest of the standard library?
If the function returned a string, there would be no way of indicating that the read failed, as all string values are valid values that could be returned by this (or any other) function. On the other hand, a stream has lots of error indicator flags that can be tested by the code that calls getline. So people can write code like:
while( std::getline( std::cin, somestring )) {
// do stuff with somestring
}
and it is hard to see how you could write similar code if getline returned a string.
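To make that concrete, here is a rough sketch (my own, not anything in the standard library) of what a string-returning getline would need in order to stay testable in a loop: an out-of-band failure channel such as std::optional (C++17), at the cost of allocating a fresh string on every call:

#include <iostream>
#include <optional>
#include <string>

// hypothetical helper, not a standard function
std::optional<std::string> getline_opt(std::istream& in)
{
    std::string s;
    if (std::getline(in, s))
        return s;
    return std::nullopt; // EOF or read error
}

// usage: the loop is still possible, but every successful call hands back a new string
// while (auto line = getline_opt(std::cin)) { /* use *line */ }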
why is the function not in the std namespace like the rest of the standard library?
It is in the std namespace - what makes you think otherwise?
Why does it take a stream and a string by reference as arguments, only to return the same stream that was passed in?
It is a common pattern in the stream library to do that. It means you can test the operation being performed as you perform it. For example:
std::string line;
while(std::getline(std::cin, line))
{
// use line here because we know the read succeeded
}
You can also make succinct parsers by "chaining" stream functions:
std::string key, value;
if(std::getline(std::getline(in, key, '='), value))
my_map[key] = value;
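For context, here is a self-contained version of that chained parser (the names in and my_map come from the fragment above; the sample input is made up):

#include <iostream>
#include <map>
#include <sstream>
#include <string>

int main()
{
    std::istringstream in("colour=blue\nsize=large\n");
    std::map<std::string, std::string> my_map;

    std::string key, value;
    // the inner getline reads up to '=', the outer one reads the rest of the line
    while (std::getline(std::getline(in, key, '='), value))
        my_map[key] = value;

    for (const auto& kv : my_map)
        std::cout << kv.first << " -> " << kv.second << '\n';
}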
It seems more intuitive to only take the stream as an argument, then return the string that was read.
The problem with returning a new string from every call is that you are constantly allocating new memory for it, instead of reusing the memory already allocated to the string you passed in, or that it gained while iterating through the loop.
// Here line will not need to allocate memory every time
// through the loop. Only when it finds a longer line than
// it has capacity for:
std::string line;
while(std::getline(std::cin, line))
{
// use line here because we know the read succeeded
}

Read whole text file into memory, then process it line-by-line without allocation/copy

Suppose we've read the content of a text file into a stringstream via
std::ifstream file(filepath);
std::stringstream ss;
ss << file.rdbuf();
and now want to process the file line-by-line. This can be done via
for (std::string line; std::getline(ss, line);)
{
}
However, bearing in mind that ss contains the whole content of the file in an internal buffer (and that we can obtain this content as a string via ss.str()), the code above is highly inefficient. For each line of the file, a memory allocation and a copy operation are performed.
Is it possible to come up with a solution that provides the lines in the form of a std::string_view? (Feel free to use another mechanism to load the whole file; I don't need it to be accessed via a stringstream.)
the code above is highly inefficient. For each line of the file, a memory allocation and a copy operation are performed.
Is it possible to come up with a solution that provides the lines in the form of a std::string_view?
Not as far as I know.
But you can use the getline() member function of the stream, which reads into a plain char buffer.
Something as (caution: code not tested)
constexpr std::size_t dim { 100U }; // choose the size you need
char buff[dim];
while ( ss.getline(buff, dim) )
{
// do something
}
The copy remains but, at least, this way you avoid the per-line allocation.
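For completeness, here is a self-contained version of that idea (I have made up the file name and buffer size; note that a line longer than the buffer sets failbit and ends the loop early):

#include <cstddef>
#include <fstream>
#include <sstream>

int main()
{
    std::ifstream file("input.txt");
    std::stringstream ss;
    ss << file.rdbuf();                  // whole file into the stream's buffer

    constexpr std::size_t dim = 4096;    // pick something larger than your longest line
    char buff[dim];
    while (ss.getline(buff, dim))
    {
        // do something with buff: a NUL-terminated line, no per-line allocation
    }
}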

How to use stringstream constructor in getline?

Following up https://stackoverflow.com/a/1120224/390066.
Why can't I use
getline(stringstream(line),cell,','){}
instead of
stringstream lineStream(line);
getline(lineStream,cell,','){}
?
update
I should have clarified that I want to use getline within a loop.
Furthermore, I should have also noted that my initial intention was to read a file line-by-line using getline and use the line from that in the new getline that would divide on ',', which is more intuitive imo.
From what I understood so far, getline is not designed for that, because it takes a non-const input stream and gives back the token by reference; therefore getline cannot be blindly nested.
As shown by @James Kanze, you can.
The question is: do you really want to?
The stream is destroyed at the end of the full expression, so you are only reading one cell from it.
If we look at this in the context of the original question, i.e. reading the cells in a loop:
std::string line = /* Init */;
std::stringstream lineStream(line);
std::string cell;
while(std::getline(lineStream, cell, ','))
{
// Do stuff with cell.
}
If you place your code into this context it will not work as expected:
std::string cell;
while(std::getline(std::istringstream(line).flush(), cell, ','))
{
// Do stuff with cell.
}
The expression inside the while() is fully evaluated each time, so you go into an infinite loop, reading the first cell over and over.
You can, but it's ugly:
std::getline( std::istringstream( line ).flush(), cell, ',' );
The problem is that std::getline takes a non-const reference (which is
logical, since it is going to modify the stream), and you cannot
initialize a non-const reference with a temporary. You can, however,
call member functions on it. std::istream::flush is a member
function, which returns a non-const reference to the stream on which it
was called (and if that stream is an std::istringstream, doesn't do
anything else).
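Note that C++11 also added an overload of std::getline that takes the stream by rvalue reference, so with a C++11 standard library the temporary can be passed directly and the flush() workaround is unnecessary:

#include <sstream>
#include <string>

int main()
{
    std::string line = "a,b,c";
    std::string cell;
    std::getline(std::istringstream(line), cell, ','); // binds the temporary via the && overload
    // cell == "a"
}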
FWIW: you'd probably find:
cell = std::string( line.cbegin(), std::find( line.cbegin(), line.cend(), ',' ) );
a bit more efficient. And, at least in my opinion, easier to read and
maintain.

Fastest way to input huge strings?

I have to read huge lines of strings from stdin, so time is a critical issue. The strings are on consecutive lines and have no spaces, so I can simply use while(cin>>str) { /* code */ }, but this is extremely slow. I have heard that scanf is much faster than cin, but if I use scanf("%s", str) I think str is treated as a char* and not a C++ string, so I can't use the STL. I could take input as char* and copy all the chars into a C++ string, but IMO that would also be slow.
Is there a way to get input using scanf or something but still get a C++ string as a result?
If you know the average or maximum size of the text, create the std::string with pre-allocated capacity. One area that consumes a lot of time is the repeated memory (re)allocation done by std::string.
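A minimal sketch of that suggestion (the reserve size below is an arbitrary guess; pick something that fits your data):

#include <iostream>
#include <string>

int main()
{
    std::string str;
    str.reserve(1 << 20);            // pre-allocate roughly 1 MB up front
    while (std::cin >> str)
    {
        // process str; its capacity is kept across iterations
    }
}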
cin >> str is the closest thing you'll find in the STL to scanf("%s", str). The only reason scanf would be faster than cin is that it gives you a char* instead of a string, and while you can create a new string from the char* by passing it to the string() constructor, that would be almost the same thing as using cin >> str.
You can use getline:
for (std::string line; getline(std::cin, line); ) {
do_something_with(line);
}
I don't know if it is any faster than cin >> line, but it might be, since it doesn't need to deal with whitespace other than newlines. But I don't believe this is as significant as the overhead of sentry construction.

performance overhead of c++ string tokenize via istringstream

I would like to know what's the performance overhead of
string line, word;
while (std::getline(cin, line))
{
istringstream istream(line);
while (istream >> word)
// parse word here
}
I think this is the standard c++ way to tokenize input.
To be specific:
Is each line copied three times: first via getline, then via the istringstream constructor, and finally via operator>> for each word?
Would the frequent construction and destruction of the istringstream be an issue? What would the equivalent implementation look like if I defined the istringstream before the outer while loop?
Thanks!
Update:
An equivalent implementation
string line, word;
stringstream stream;
while (std::getline(cin, line))
{
stream.clear();
stream << line;
while (stream >> word)
// parse word here
}
uses the stream as a local buffer: lines are pushed in and words are popped out.
This would get rid of the frequent constructor and destructor calls of the previous version, and make use of the stream's internal buffering (is this point correct?).
Alternative solutions might be to extend std::string to support operator<< and operator>>, or to extend iostream to support something like locate_new_line. Just brainstorming here.
Unfortunately, iostreams is not for performance-intensive work. The problem is not copying things in memory (copying strings is fast), it's virtual function dispatches, potentially to the tune of several indirect function calls per character.
As for your question about copying, yes, as written everything gets copied when you initialize a new stringstream. (Characters also get copied from the stream to the output string by getline or >>, but that obviously can't be prevented.)
Using move semantics, you can eliminate the extraneous copy. (Note that the istringstream constructor taking an rvalue std::string was only added in C++20; with a C++11/14/17 library, the std::move below still ends up copying.)
string line, word;
while (std::getline(cin, line)) // initialize line
{ // hand line's data to the stream (a true move in C++20, a copy before that):
istringstream istream( std::move( line ) );
while (istream >> word)
// parse word here
}
All that said, performance is only an issue if a measurement tool tells you it is. Iostreams is flexible and robust, and filebuf is basically fast enough, so you can prototype the code so it works and then optimize the bottlenecks without rewriting everything.
When you define a variable inside a block, it is constructed on entry and destroyed when you leave the block, so constructing an object inside the loop body means paying that cost on every iteration. This goes for 'word' too, except that it is declared outside the loop and reused.
std::getline accepts the line object by reference, which lets it refill the string's already-allocated storage instead of creating a new string each time.
In your code, the line and word variables are made once and then used through references. The only object you make on each iteration is the stream. If you don't want to construct it every iteration, you can declare it before the loop and reassign its contents through its member functions instead of through the constructor.
You can use this:
string line, word ;
istringstream ss ;
while (std::getline(cin, line))
{
ss.clear() ;
ss.str(line) ;
while (ss >> word) {
// parse word here
}
}
Also see the reference for istringstream.
EDIT: Thanks for the comment, @jrok. Yes, you should clear the error flags before assigning a new string. This is the reference for str(): istringstream::str