Fastest way to input huge strings? - c++

I have to read a huge number of strings from stdin, so time is a critical issue. The strings are on consecutive lines and have no spaces, so I can simply use while(cin>>str) { //code }, but this is extremely slow. I have heard that scanf is much faster than cin, but if I use scanf("%s", str) I think str is treated as a char* and not a C++ string, so I can't use the STL. I could take the input as a char* and copy all the chars into a C++ string, but IMO that will also be slow.
Is there a way to get input using scanf or something but still get a C++ string as a result?

If you know the average or maximum size of the text, you can create the std::string with a pre-allocated capacity. One thing that takes a lot of time is the memory (re)allocation done by std::string.
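For example, a minimal sketch of that idea (the one-megabyte reservation is just an assumed upper bound on line length, not something from the question):

#include <iostream>
#include <string>

int main() {
    std::string str;
    str.reserve(1000000);         // pre-allocate; 1 MB is only a guess at the maximum line size
    while (std::cin >> str) {     // extraction typically reuses the existing capacity
        // process str
    }
    return 0;
}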

cin >> str is the closest thing you'll find in the STL to scanf("%s", str). The only reason scanf would be faster than cin is because it would be giving you a char* instead of a string, and while you can create a new string from the char* by just passing it to the string() constructor, that would be almost the same thing as using cin >> str.
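If you do want to try the scanf route, a rough sketch of what that answer describes (the 4096-byte buffer is an assumed limit on token length, not part of the original question):

#include <cstdio>
#include <string>

int main() {
    char buf[4096];                          // assumes no single token exceeds 4095 characters
    while (std::scanf("%4095s", buf) == 1) {
        std::string str(buf);                // copy the chars into a std::string for STL use
        // process str
    }
    return 0;
}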

You can use getline:
for (std::string line; getline(std::cin, line); ) {
    do_something_with(line);
}
I don't know if it is any faster than cin >> line, but it might be, since it doesn't need to deal with whitespace other than newlines. But I don't believe this is as significant as the overhead of sentry construction.

Related

How do the different ways to read strings from the console actually differ? Operator >>, getline or cin.getline?

Let's suppose I'd like to read an integer from the console, and I would not like the program to break if it is fed non-integer characters. This is how I would do this:
#include <iostream>
#include <sstream>
#include <string>
using namespace std;
int main() {
    string input; int n;
    cin >> input;
    if (!(stringstream(input) >> n)) cout << "Bad input!\n";
    else cout << n;
    return 0;
}
However, I see that http://www.cplusplus.com/doc/tutorial/basic_io/ uses getline(cin,input) rather than cin >> input. Are there any relevant differences between the two methods?
Also I wonder, since string supposedly has no length limit... What would happen if someone passed a 10GB-long string to this program? Wouldn't it be safer to store the input in a fixed-length char array and use, for example, cin.getline(input, 256)?
std::getline gets a line (including spaces) and also reads (but discards) the ending newline. The input operator >> reads a whitespace-delimited "word".
For example, if your input is
123 456 789
Using std::getline will give you the string "123 456 789", but using the input operator >> you will get only "123".
And theoretically there's no limit to std::string, but in reality it's of course limited to the amount of memory it can allocate.
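A small self-contained sketch of that difference, using string streams so the input above can be embedded directly:

#include <iostream>
#include <sstream>
#include <string>

int main() {
    std::istringstream in1("123 456 789"), in2("123 456 789");
    std::string a, b;
    std::getline(in1, a);   // a == "123 456 789"
    in2 >> b;               // b == "123"
    std::cout << a << '\n' << b << '\n';
    return 0;
}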
The first gets a line; the second gets a word. If your input is "hello world":
getline(cin, str): str == "hello world"
cin >> str: str == "hello"
And don't worry about going out of range; like vector, string can grow.
operator>> reads a word (i.e. up to next whitespace-character)
std::getline() reads a "line" (by default up to next newline, but you can configure the delimiter) and stores the result in a std::string
istream::getline() reads a "line" in a fashion similar to std::getline(), but takes a char-array as its target. (This is the one you'll find as cin.getline())
If you get a 10 GB line passed to your program, then you'll either run out of memory on a 32-bit system, or take a while to parse it, potentially swapping a fair bit of memory to disk on a 64-bit system.
Whether arbitrary line-length size limitations make sense, really boils down to what kind of data your program expects to handle, as well as what kind of error you want to produce when you get too much data. (Presumably it is not acceptable behaviour to simply ignore the rest of the data, so you'll want to either read and perform work on it in parts, or error out)
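A rough sketch of the read-it-in-parts idea (the 64 KiB chunk size is an arbitrary choice, not something prescribed above):

#include <iostream>
#include <vector>

int main() {
    std::vector<char> chunk(64 * 1024);            // fixed-size buffer instead of one huge string
    while (std::cin) {
        std::cin.read(chunk.data(), chunk.size());
        std::streamsize got = std::cin.gcount();   // may be short on the final read
        if (got == 0)
            break;
        // process the first 'got' bytes of chunk here
    }
    return 0;
}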

Dynamically allocated strings in C

I was doing a relatively simple string problem in UVa's online judge to practice with strings since I've been having a hard time with them in C. The problem basically asks to check if a string B contains another string A if you remove the 'clutter' and concatenate the remaining characters, for example if "ABC" is contained in "AjdhfmajBsjhfhC" which in this case is true.
So, my question is: how can I efficiently allocate memory for a string whose length I don't know? What I did was to make a really big string, char Mstring[100000], read from input and then use strlen(Mstring) to copy the string to a properly sized char array. Something like:
char Mstring[100000];
scanf("%s", Mstring);
int length = strlen(Mstring);
char input[length+1] = {0};
for (int i = 0; i < length; i++) {
    input[i] = Mstring[i];
}
Is there a better/standard way to do this in C? I know that C does not have great support for strings; if there is no better way to do it in C, maybe there is in C++?
If you have the option of using C++ (as you mentioned), that is going to make your life a lot easier. You can then use an STL string (std::string), which manages dynamically sized strings for you. You can also drop the old scanf() beast and use std::cin.
Example:
#include <iostream>
#include <string>

int main()
{
    std::string sInput;
    std::getline(std::cin, sInput);
    // alternatively, you could execute this line instead:
    // std::cin >> sInput;
    // but that will tokenize input based on whitespace, so you
    // will only get one word at a time rather than an entire line
    return 0;
}
Describing how to manage strings that can grow dynamically in C will take considerably more explanation and care, and it sounds like you really don't need that. If so, however, here is a starting point: http://www.strchr.com/dynamic_arrays.

performance overhead of c++ string tokenize via istringstream

I would like to know what's the performance overhead of
string line, word;
while (std::getline(cin, line))
{
    istringstream istream(line);
    while (istream >> word)
        // parse word here
}
I think this is the standard c++ way to tokenize input.
To be specific:
Is each line copied three times, first via getline, then via the istringstream constructor, and last via operator>> for each word?
Would the frequent construction and destruction of the istringstream be an issue? What would the equivalent implementation be if I defined the istringstream before the outer while loop?
Thanks!
Update:
An equivalent implementation
string line, word;
stringstream stream;
while (std::getline(cin, line))
{
    stream.clear();
    stream << line;
    while (stream >> word)
        // parse word here
}
uses the stream as a local stack that pushes lines in and pops words out.
This would get rid of the possibly frequent constructor and destructor calls in the previous version, and take advantage of the stream's internal buffering (is this point correct?).
Alternative solutions might be to extend std::string to support operator<< and operator>>, or to extend iostream to support something like locate_new_line. Just brainstorming here.
Unfortunately, iostreams is not for performance-intensive work. The problem is not copying things in memory (copying strings is fast), it's virtual function dispatches, potentially to the tune of several indirect function calls per character.
As for your question about copying, yes, as written everything gets copied when you initialize a new stringstream. (Characters also get copied from the stream to the output string by getline or >>, but that obviously can't be prevented.)
Using C++11's move facility, you can eliminate the extraneous copies:
string line, word;
while (std::getline(cin, line))   // initialize line
{
    // move data from line into the istringstream (so it's no longer in line):
    istringstream istream(std::move(line));
    while (istream >> word)
        // parse word here
}
All that said, performance is only an issue if a measurement tool tells you it is. Iostreams is flexible and robust, and filebuf is basically fast enough, so you can prototype the code so it works and then optimize the bottlenecks without rewriting everything.
When you define a variable inside a block, it is allocated on the stack, and when you leave the block it is popped from the stack. With this code you have a lot of operations on the stack. This goes for 'word' too. You can use pointers and operate on pointers instead of variables. Pointers are stored on the stack too, but what they point to is a place inside the heap memory.
Such operations can have overhead for creating the variables, pushing them onto the stack and popping them off again. But using pointers you allocate the space once and work with the allocated space in the heap. Also, pointers can be much smaller than real objects, so their allocation will be faster.
As you can see, getline() accepts a reference (a kind of pointer) to the line object, which lets it work with it without creating a string object again.
In your code, the line and word variables are made once and their references are used. The only object you are making in each iteration is the stringstream (ss in the code below). If you don't want to make it in each iteration, you can create it before the loop and reinitialize it using its related methods. You can search to find a suitable method to reassign it without using the constructor.
You can use this :
string line, word;
istringstream ss;
while (std::getline(cin, line))
{
    ss.clear();
    ss.str(line);
    while (ss >> word) {
        // parse word here
    }
}
Also, you can use this reference: istringstream
EDIT: Thanks for the comment, @jrok. Yes, you should clear the error flags before assigning a new string. This is the reference for str(): istringstream::str

cin doesn't take input as ENTER

char ch[4];
char* ptr;
ptr = ch;
while (1)
{
    cin >> *ptr;
    if (*ptr == '\n')
        break;
    ptr++;
}
Here I just wrote a bit of sample code where I am trying to get out of the while loop when the user presses ENTER, but it's not working. Please help me. Thank you in advance.
To get a single character, use std::istream::get. This should work for getting newlines as well.
But instead of getting characters in a loop until you get a newline, why not just use something like std::getline:
std::string str;
std::getline(cin, str);
Or if you only want to get at most three characters, you can use std::istream::getline:
char ch[4];
cin.getline(ch, 4, '\n');
You are reading input into the value of a character. That's what *ptr means. I think you want just plain ptr, which is a pointer to an array of characters, which is something that is meant to receive data. What you wrote is basically this:
char c;
cin >> c;
I don't think that's what you meant, nor would it work even if it were, since as Joachim Pileborg points out above, the >> operator skips whitespace like newlines. In general, it is always best to be very robust when it comes to reading input. Provide adequate space, and either use a variable that can grow automatically (like std::string) or tell the system how much space you have (like fgets()).
The following will read a line:
istream& getline (char* s, streamsize n );
The extraction operator skips leading whitespace and stops extracting on encountering any subsequent whitespace. So, when you want to do something like this, use std::istream::get() or std::istream::getline().
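Putting those suggestions together, a minimal sketch of the get()-based loop with the same 4-char buffer (the bounds check is an addition, not in the original code):

#include <iostream>

int main() {
    char ch[4];
    char* ptr = ch;
    // std::istream::get() does not skip whitespace, so the '\n' from ENTER is actually seen
    while (ptr - ch < 3 && std::cin.get(*ptr) && *ptr != '\n')
        ++ptr;
    *ptr = '\0';   // overwrite the '\n' (or terminate a full buffer)
    std::cout << ch << std::endl;
    return 0;
}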

Reading a fixed number of chars with << on an istream

I was trying out a few file reading strategies in C++ and I came across this.
ifstream ifsw1("c:\\trys\\str3.txt");
char ifsw1w[3];
do {
    ifsw1 >> ifsw1w;
    if (ifsw1.eof())
        break;
    cout << ifsw1w << flush << endl;
} while (1);
ifsw1.close();
The content of the file were
firstfirst firstsecond
secondfirst secondsecond
When I see the output it is printed as
firstfirst
firstsecond
secondfirst
I expected the output to be something like:
fir
stf
irs
tfi
.....
Moreover I see that "secondsecond" has not been printed. I guess that the last read has met the eof and the cout might not have been executed. But the first behavior is not understandable.
The extraction operator has no concept of the size of the ifsw1w variable, and (by default) is going to extract characters until it hits whitespace, null, or eof. The extra characters are likely being stored in the memory locations after your ifsw1w variable, which would cause bad bugs if you had additional variables defined.
To get the desired behavior, you should be able to use
ifsw1.width(3);
to limit the number of characters to extract.
It's virtually impossible to use std::istream& operator>>(std::istream&, char *) safely -- it's like gets in this regard -- there's no way for you to specify the buffer size. The stream just writes to your buffer, going off the end. (Your example above invokes undefined behavior). Either use the overloads accepting a std::string, or use std::getline(std::istream&, std::string).
Checking eof() is incorrect. You want fail() instead. You really don't care if the stream is at the end of the file, you care only if you have failed to extract information.
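A sketch of what that looks like in practice, testing the extraction itself (which is equivalent to checking fail()) and using the std::string overload:

#include <fstream>
#include <iostream>
#include <string>

int main() {
    std::ifstream ifsw1("c:\\trys\\str3.txt");
    std::string word;
    while (ifsw1 >> word)   // loop while extraction succeeds; no eof() check needed
        std::cout << word << std::endl;
    return 0;
}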
For something like this you're probably better off just reading the whole file into a string and using string operations from that point. You can do that using a stringstream:
#include <string>   //For string
#include <sstream>  //For stringstream
#include <fstream>  //For ifstream
#include <iostream> //As before
std::ifstream myFile(...);
std::stringstream ss;
ss << myFile.rdbuf(); //Read the file into the stringstream.
std::string fileContents = ss.str(); //Now you have a string, no loops!
You're trashing the memory... it's reading past the 3 chars you defined (it's reading until a space or a newline is met...).
Read char by char to achieve the output you mentioned.
Edit: Irritate is right, this works too (with some fixes, and not getting the exact result, but that's the spirit):
char ifsw1w[4];
do {
    ifsw1.width(4);
    ifsw1 >> ifsw1w;
    if (ifsw1.eof())
        break;
    cout << ifsw1w << flush << endl;
} while (1);
ifsw1.close();
The code has undefined behavior. When you do something like this:
char ifsw1w[3];
ifsw1 >> ifsw1w;
The operator>> receives a pointer to the buffer, but has no idea of the buffer's actual size. As such, it has no way to know that it should stop reading after two characters (and note that it should be 2, not 3 -- it needs space for a '\0' to terminate the string).
Bottom line: in your exploration of ways to read data, this code is probably best ignored. About all you can learn from code like this is a few things you should avoid. It's generally easier, however, to just follow a few rules of thumb than try to study all the problems that can arise.
Use std::string to read strings.
Only use fixed-size buffers for fixed-size data.
When you do use fixed buffers, pass their size to limit how much is read.
When you want to read all the data in a file, std::copy can avoid a lot of errors:
std::vector<std::string> strings;   // also needs <vector>, <algorithm> and <iterator>
std::copy(std::istream_iterator<std::string>(myFile),
          std::istream_iterator<std::string>(),
          std::back_inserter(strings));
To read the whitespace, you could use "noskipws"; it will not skip whitespace.
ifsw1 >> noskipws >> ifsw1w;
But if you want to read only a few characters at a time, I suggest you use the get method (note that get(s, n) reads at most n-1 characters and appends a '\0', so this call reads 2 characters into the 3-char buffer):
ifsw1.get(ifsw1w, 3);
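Finally, for a genuinely fixed number of characters per read, unformatted input is an option. A rough sketch (note this keeps whitespace in the chunks and silently drops a trailing partial chunk, so the output differs slightly from the guess in the question):

#include <fstream>
#include <iostream>

int main() {
    std::ifstream ifsw1("c:\\trys\\str3.txt");
    char ifsw1w[4];                   // 3 characters plus the terminating '\0'
    while (ifsw1.read(ifsw1w, 3)) {   // stops when fewer than 3 characters remain
        ifsw1w[3] = '\0';
        std::cout << ifsw1w << std::endl;
    }
    return 0;
}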