Taking into account \r\n - c++

I am trying to solve a problem on SPOJ. According to the comments, the input lines end with \r\n. From previous questions I know that \r\n is a Windows line ending. What I want to know is how to take it into account. Currently I am using getline(cin, str) in C++. What do I need to do to handle the \r\n?

When you use std::getline(std::cin, str) the '\n' is already taken care of: std::getline() will read characters until it finds a '\n' and inserts these into str. It doesn't insert the '\n', however.
Thus, you may be stuck with a '\r' at the end of the string. If you are on Windows you can just open your file in text mode and the stream will extract them, too. If that's not the way to go, you can just determine if your str ends with a '\r' and remove it:
if (!str.empty() && str[str.size() - 1] == '\r') {
str.erase(str.end() - 1);
}
If you want to remove all carriage returns (there may, in theory, be some embedded in the string), you can use
str.erase(std::remove(str.begin(), str.end(), '\r'), str.end());
Finally, if you don't ever want to encounter the carriage returns, you can create a filtering stream buffer which just removes all '\r' (or just those from a "\r\n" sequence). Below is a quick example of how a simple filtering stream buffer can be implemented:
#include <algorithm>
#include <iostream>
#include <streambuf>
#include <string>
class crfilter
: std::streambuf
{
std::istream* stream;
std::streambuf* sbuf;
char buffer[8];
int underflow() {
std::streamsize n;
while (this->gptr() == this->egptr()
&& (n = this->sbuf->sgetn(buffer, 8))) {
char* end = std::remove(buffer, buffer + n, '\r');
this->setg(buffer, buffer, end);
}
return this->gptr() == this->egptr()
? std::char_traits<char>::eof()
: std::char_traits<char>::to_int_type(*this->gptr());
}
public:
crfilter(std::istream& in): stream(&in), sbuf(in.rdbuf(this)) {}
~crfilter() { stream->rdbuf(this->sbuf); }
};
int main()
{
crfilter filter(std::cin);
std::string str;
while (std::getline(std::cin, str)) {
std::cout << "str='" << str << "'\n";
}
}
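If a full filtering stream buffer is more than you need, the trailing '\r' check from this answer can be folded into a small wrapper. The following is only a sketch: the name getline_crlf is made up for illustration, and it assumes C++11 (for back() and pop_back()) and that at most one '\r' precedes the '\n'.

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper (not a standard function): behaves like std::getline,
// but also drops a single trailing '\r', so that "\r\n" and "\n" line
// endings produce the same string.
std::istream& getline_crlf(std::istream& in, std::string& line)
{
    if (std::getline(in, line) && !line.empty() && line.back() == '\r') {
        line.pop_back();
    }
    return in;
}
```

It is used exactly like std::getline, for example while (getline_crlf(std::cin, str)) { ... }, or on a std::istringstream when testing.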

\r\n is a carriage return followed by a line feed; together they mark the end of one line and the beginning of the next.

Related

How to determine how many characters `std::getline()` extracted?

Let's say I read a std::string from a std::istream using the std::getline() overload. How can I determine how many characters were extracted from the stream? std::istream::gcount() does not work, as discussed in "ifstream gcount returns 0 on getline string overload":
#include <iostream>
#include <sstream>
#include <string>
int main()
{
std::istringstream s( "hello world\n" );
std::string str;
std::getline( s, str );
std::cout << "extracted " << s.gcount() << " characters" << std::endl;
}
Note for downvoters: the length of the string is not the answer, as std::getline() may or may not extract an additional character from the stream.
It would seem the way to do this is not completely straightforward because std::getline may (or may not) read a terminating delimiter and in either case it will not put it in the string. So the length of the string is not enough to tell you exactly how many characters were read.
You can test eof() to see if the delimiter was read or not:
std::getline(is, line);
auto n = line.size() + !is.eof();
It would be nice to wrap it up in a function but how to pass back the extra information?
One way I suppose is to add the delimiter back if it was read and let the caller deal with it:
std::istream& getline(std::istream& is, std::string& line, char delim = '\n')
{
if(std::getline(is, line, delim) && !is.eof())
line.push_back(delim); // add the delimiter if it was in the stream
return is;
}
But I am not sure I would always want that.
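Another option is to leave the string alone and return the count instead. This is a minimal sketch of that idea, built on the same eof() test as above. The name getline_count is hypothetical, and it assumes the string never hits max_size(), so a failed std::getline means nothing was extracted.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads one line and reports how many characters were
// taken from the stream, counting the delimiter when it was consumed.
std::size_t getline_count(std::istream& is, std::string& line, char delim = '\n')
{
    if (!std::getline(is, line, delim)) {
        return 0; // the read failed, so nothing was extracted
    }
    // If eofbit is set, the read stopped at end of input and no delimiter
    // was extracted; otherwise the delimiter was consumed as well.
    return line.size() + (is.eof() ? 0 : 1);
}
```

The caller still gets the line through the reference parameter, and the stream state can be checked afterwards in the usual way.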

C++: Parsing a log with split but one entry can have several lines

I have started to learn C++, and my current project should extend my knowledge of working with files, splitting strings, and finally running a regexp on a varchar string.
The problem:
I have a logfile which contains data like
<date> <time> <username> (<ip:port>) <uuid> - #<id> "<varchar text>"
e.g:
10.03.2016 07:40:38: blacksheep (127.0.0.1:54444) #865 "(this can have text
over several lines
without ending marker"
10.03.2016 07:40:38: blacksheep (127.0.0.1:54444) #865 "A new line, just one without \n"
So I am starting with the following, but I am stuck on how to get the lines containing \n into the string. How can this be solved the right way, without unnecessary steps like splitting several times, and how can I define where a complete entry (even if it has some \n within) stops?
With fin.ignore(80, '\n'); the \ns are ignored, but this implies that I will only have one line: a short text before # and a very large string after :-|
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
std::vector<std::string> split(std::string str, char seperator) {
std::vector<std::string> result;
std::string::size_type token_offset = 0;
std::string::size_type seperator_offset = 0;
while (seperator_offset != std::string::npos) {
seperator_offset = str.find(seperator, seperator_offset);
std::string::size_type token_length;
if(seperator_offset == std::string::npos) {
token_length = seperator_offset;
} else {
token_length = seperator_offset - token_offset;
seperator_offset++;
}
std::string token = str.substr(token_offset, token_length);
if (!token.empty()) {
result.push_back(token);
}
token_offset = seperator_offset;
}
return result;
}
int main(int argc, char **argv) {
std::fstream fin("input.dat");
while(!fin.eof()) {
std::string line;
getline(fin, line, ';');
fin.ignore(80, '\n');
std::vector<std::string> strs = split(line, ',');
for(int i = 0; i < strs.size(); ++i) {
std::cout << strs[i] << std::endl;
}
}
fin.close();
return 0;
}
Regards
Blacksheep
There is no canned C++ library function for swallowing input like that. std::getline reads the next line of text, up until the next newline character (by default). That's it. std::getline does not do any further examination on the input, beyond that.
I will suggest the following approach for you.
1. Initialize a buffer representing the entire logical line just read.
2. Read the next line of input, using std::getline(), and append the line to the input buffer.
3. Count the number of quote characters in the buffer.
4. Is the number of quotes even? Stop. If the quote character count is odd, append a newline to the buffer, then go back and read another line of input.
Some obvious optimizations are possible here, of course, but this should be a good start.
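The approach above could be sketched roughly as follows. The function name read_record is invented for this example, and the sketch assumes that double quotes only ever appear as the opening and closing marker of the text field (no escaped quotes inside it). It keeps a running quote count rather than recounting the whole buffer each time.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: reads one logical log entry. Physical lines are
// appended (joined by '\n') until the number of '"' seen so far is even.
// Returns false when no further input could be read.
bool read_record(std::istream& in, std::string& record)
{
    record.clear();
    std::string line;
    std::size_t quotes = 0;
    bool got_any = false;
    while (std::getline(in, line)) {
        if (got_any) {
            record += '\n'; // restore the newline that getline removed
        }
        got_any = true;
        record += line;
        quotes += static_cast<std::size_t>(
            std::count(line.begin(), line.end(), '"'));
        if (quotes % 2 == 0) {
            break; // all quotes are balanced: the entry is complete
        }
    }
    return got_any;
}
```

Each complete entry can then be handed to whatever splitting or regex step comes next, for example in a loop of the form while (read_record(fin, entry)) { ... }.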

How to read the whole lines from a file (with spaces)?

I am using STL. I need to read lines from a text file. How to read lines till the first \n but not till the first ' ' (space)?
For example, my text file contains:
Hello world
Hey there
If I write like this:
ifstream file("FileWithGreetings.txt");
string str("");
file >> str;
then str will contain only "Hello" but I need "Hello world" (till the first \n).
I thought I could use the method getline() but it demands to specify the number of symbols to be read. In my case, I do not know how many symbols I should read.
You can use getline:
#include <string>
#include <iostream>
int main() {
std::string line;
if (getline(std::cin,line)) {
// line is the whole line
}
}
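Using the getline function is one option. Alternatively, read each character with getc in a do-while loop. If the file consists of numbers, this may be a better way to read it (here in is assumed to be an open FILE* and list a std::vector<int>):
int c; // int rather than char, so that EOF can be represented
do {
int item = 0, pos = 0;
c = getc(in);
// accumulate consecutive digits into one number
while ((c >= '0') && (c <= '9')) {
item *= 10;
item += c - '0';
c = getc(in);
pos++;
}
if (pos) list.push_back(item); // store only if at least one digit was read
} while (c != '\n' && c != EOF);
Try modifying this method if your file consists of strings.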
Thanks to all of the people who answered me. I made new code for my program, which works:
#include <iostream>
#include <fstream>
#include <string>
using namespace std;
int main(int argc, char** argv)
{
ifstream ifile(argv[1]);
// ...
while (!ifile.eof())
{
string line("");
if (getline(ifile, line))
{
// the line is a whole line
}
// ...
}
ifile.close();
return 0;
}
I suggest:
#include <fstream>
#include <string>
std::ifstream reader(filename, std::ios_base::in); // filename: path of the file to read
if (reader) { // confirm the stream is in a good state
std::string line;
while (std::getline(reader, line)) {
// Then process line as described below
}
}
Any variable name will do for the std::string. Note that istream::read() fills a char buffer of a given size rather than a std::string, so std::getline as above is the simpler choice when you do not know the line length in advance.
To process the line, just use the iterators returned by the std::string:
std::string::iterator begin() & std::string::iterator end()
and step through the string character by character until you reach the ' ' you are looking for (std::getline has already consumed the \n).

Something like istream::getline() but with alternative delim characters?

What's the cleanest way of getting the effect of istream::getline(string, 256, '\n' OR ';')?
I know it's quite straightforward to write a loop, but I feel that I might be missing something. Am I?
What I used:
while ((is.peek() != '\n') && (is.peek() != ';'))
stringstream.put(is.get());
Unfortunately there is no way to have multiple "line endings". What you can do is read the line with e.g. std::getline and put it in an std::istringstream and use std::getline (with the ';' separator) in a loop on the istringstream.
Although you could check the Boost iostreams library to see if it has functionality for it.
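The two-level std::getline idea from this answer might look like the sketch below. The function name split_lines_and_semicolons is made up for illustration. Note two behaviours that may or may not suit you: empty pieces between two ';' are kept, while an empty line or a trailing ';' produces no extra piece.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits the input on '\n' first, then splits every
// line on ';', collecting all pieces in order.
std::vector<std::string> split_lines_and_semicolons(std::istream& in)
{
    std::vector<std::string> pieces;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line); // re-read the line as its own stream
        std::string field;
        while (std::getline(fields, field, ';')) {
            pieces.push_back(field);
        }
    }
    return pieces;
}
```

The extra copy into a std::istringstream costs some performance compared with a hand-written loop, but it keeps the code short and relies only on standard facilities.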
There's std::getline.
For more complex scenarios one might try splitting istream_iterator or istreambuf_iterator with boost split or regex_iterator (here is an example of using stream iterators).
Here is a working implementation:
enum class cascade { yes, no };
std::istream& getline(std::istream& stream, std::string& line, const std::string& delim, cascade c = cascade::yes){
line.clear();
std::string::value_type ch;
bool stream_altered = false;
while(stream.get(ch) && (stream_altered = true)){
if(delim.find(ch) == std::string::npos)
line += ch;
else if(c == cascade::yes && line.empty())
continue;
else break;
}
if(stream.eof() && stream_altered) stream.clear(std::ios_base::eofbit);
return stream;
}
The cascade::yes option collapses consecutive delimiters. With cascade::no, it returns an empty string for each additional consecutive delimiter found.
Usage:
const std::string punctuation = ",.';:?";
std::string words;
while(getline(istream_object, words, punctuation))
std::cout << words << std::endl;

std::string manipulation: whitespace, "newline escapes '\'" and comments #

Kind of looking for affirmation here. I have some hand-written code, which I'm not shy to say I'm proud of, which reads a file, removes leading whitespace, processes newline escapes '\' and removes comments starting with #. It also removes all empty lines (also whitespace-only ones). Any thoughts/recommendations? I could probably replace some std::cout's with std::runtime_errors... but that's not a priority here :)
const int RecipeReader::readRecipe()
{
ifstream is_recipe(s_buffer.c_str());
if (!is_recipe)
cout << "unable to open file" << endl;
while (getline(is_recipe, s_buffer))
{
// whitespace+comment
removeLeadingWhitespace(s_buffer);
processComment(s_buffer);
// newline escapes + append all subsequent lines with '\'
processNewlineEscapes(s_buffer, is_recipe);
// store the real text line
if (!s_buffer.empty())
v_s_recipe.push_back(s_buffer);
s_buffer.clear();
}
is_recipe.close();
return 0;
}
void RecipeReader::processNewlineEscapes(string &s_string, ifstream &is_stream)
{
string s_temp;
size_t sz_index = s_string.find_first_of("\\");
while (sz_index <= s_string.length())
{
if (getline(is_stream,s_temp))
{
removeLeadingWhitespace(s_temp);
processComment(s_temp);
s_string = s_string.substr(0,sz_index-1) + " " + s_temp;
}
else
cout << "Error: newline escape '\' found at EOF" << endl;
sz_index = s_string.find_first_of("\\");
}
}
void RecipeReader::processComment(string &s_string)
{
size_t sz_index = s_string.find_first_of("#");
s_string = s_string.substr(0,sz_index);
}
void RecipeReader::removeLeadingWhitespace(string &s_string)
{
const size_t sz_length = s_string.size();
size_t sz_index = s_string.find_first_not_of(" \t");
if (sz_index <= sz_length)
s_string = s_string.substr(sz_index);
else if ((sz_index > sz_length) && (sz_length != 0)) // "empty" lines with only whitespace
s_string.clear();
}
Some extra info: the first s_buffer passed to the ifstream contains the filename, std::string s_buffer is a class data member, so is std::vector v_s_recipe. Any comment is welcome :)
UPDATE: for the sake of not being ungrateful, here is my replacement, all-in-one function that does what I want for now (future holds: parenthesis, maybe quotes...):
void readRecipe(const std::string &filename)
{
string buffer;
string line;
size_t index;
ifstream file(filename.c_str());
if (!file)
throw runtime_error("Unable to open file.");
while (getline(file, line))
{
// whitespace removal
line.erase(0, line.find_first_not_of(" \t\r\n\v\f"));
// comment removal TODO: store these for later output
index = line.find_first_of("#");
if (index != string::npos)
line.erase(index, string::npos);
// ignore empty buffer
if (line.empty())
continue;
// process newline escapes
index = line.find_first_of("\\");
if (index != string::npos)
{
line.erase(index,string::npos); // ignore everything after '\'
buffer += line;
continue; // read next line
}
else // no newline escapes found
{
buffer += line;
recipe.push_back(buffer);
buffer.clear();
}
}
}
Definitely ditch the hungarian notation.
It's not bad, but I think you're thinking of std::basic_string<T> too much as a string and not enough as an STL container. For example:
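void RecipeReader::removeLeadingWhitespace(string &s_string)
{
// Needs <algorithm> and <cctype>. A lambda is used because std::not1 cannot
// wrap the plain function isspace directly.
s_string.erase(s_string.begin(),
std::find_if(s_string.begin(), s_string.end(),
[](unsigned char c) { return !std::isspace(c); }));
}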
A few comments:
As another answer (+1 from me) said, ditch the Hungarian notation. It really doesn't do anything but add unimportant trash to every line. In addition, ifstream yielding an is_ prefix is ugly. is_ usually indicates a boolean.
Naming a function processXXX gives very little information on what it is actually doing. Use removeXXX, like you did with the removeLeadingWhitespace function.
The processComment function does an unnecessary copy and assignment. Use s.erase(index, string::npos); (npos is default, but this is more obvious).
It's not clear what your program does, but you might consider using a different file format (like HTML or XML) if you need to post-process your files like this. That would depend on the goal.
Using find_first_of("#") may give you some false positives. If the # is inside quotes, it does not necessarily indicate a comment. (But again, this depends on your file format.)
Using find_first_of(c) with one character can be simplified to find(c).
The processNewlineEscapes function duplicates some functionality from the readRecipe function. You may consider refactoring to something like this:
string s_buffer;
string s_line;
while (getline(is_recipe, s_line)) {
// Sanitize the raw line.
removeLeadingWhitespace(s_line);
removeComments(s_line);
// Skip empty lines.
if (s_line.empty()) continue;
// Add the raw line to the buffer.
s_buffer += s_line;
// Collect buffer across all escaped lines.
if (*s_line.rbegin() == '\\') continue;
// This line is not escaped, now I can process the buffer.
v_s_recipe.push_back(s_buffer);
s_buffer.clear();
}
I'm not big on methods that modify the parameters. Why not return strings rather than modifying the input arguments? For example:
string RecipeReader::processComment(const string &s)
{
size_t index = s.find_first_of("#");
return s.substr(0, index);
}
I personally feel this clarifies intent and makes it more obvious what the method is doing.
I'd consider replacing all your processing code (almost everything you've written) with boost::regex code.
A few comments:
If s_buffer contains the file name to be opened, it should have a better name like s_filename.
The s_buffer member variable should not be reused to store temporary data from reading the file. A local variable in the function would do as well, no need for the buffer to be a member variable.
If there is no need to have the filename stored as a member variable, it could just be passed as a parameter to readRecipe().
processNewlineEscapes() should check that the found backslash is at the end of the line before appending the next line. At the moment any backslash at any position triggers adding of the next line at the position of the backslash. Also, if there are several backslashes, find_last_of() would probably be easier to use than find_first_of().
When checking the result of find_first_of() in processNewlineEscapes() and removeLeadingWhitespace() it would be cleaner to compare against string::npos to check if anything was found.
The logic at the end of removeLeadingWhitespace() could be simplified:
size_t sz_index = s_string.find_first_not_of(" \t");
if (sz_index != s_string.npos)
s_string = s_string.substr(sz_index);
else // "empty" lines with only whitespace
s_string.clear();
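The earlier point about checking that the backslash really ends the line could be sketched as a small predicate. The name hasLineContinuation is hypothetical, and allowing trailing spaces or tabs after the backslash is an assumption about the file format, not something the original code specifies.

```cpp
#include <string>

// Hypothetical helper: true only when the last character that is not a
// space or tab is a backslash, i.e. the line continues on the next one.
bool hasLineContinuation(const std::string& line)
{
    const std::string::size_type pos = line.find_last_not_of(" \t");
    return pos != std::string::npos && line[pos] == '\\';
}
```

A backslash in the middle of a line then no longer triggers joining with the next line, and the comparison against npos makes the "nothing found" case explicit.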
You might wish to have a look at the Boost String Algorithms library. It's a simple collection of algorithms to work with strings, and notably features trim methods :)
Now, on to the review itself:
Don't bother to remove the hungarian notation, if it's your style then use it, however you should try and improve the names of methods and variables. processXXX is definitely not indicating anything useful...
Functionally, I am worried about your assumptions: the main issue here is that you do not care for escape sequences (\n uses a backslash, for example) and you do not account for the presence of string literals: std::cout << "Process #" << pid << std::endl; would yield an invalid line because of your "comment" preprocessing.
Furthermore, since you remove the comments before processing the newline escapes:
i = 3; # comment \
running comment
will be parsed as
i = 3; running comment
which is syntactically incorrect.
From an interface point of view: there is no benefit in having the methods be class members here; you don't really need an instance of RecipeReader...
And finally, I find it awkward that two methods would read from the stream.
Little peeve of mine: returning by const value does not serve any purpose.
Here is my own version, as I believe that showing is easier than discussing:
// header file
#include <fstream>
#include <string>
#include <utility>
#include <vector>
std::vector<std::string> readRecipe(const std::string& fileName);
std::string extractLine(std::ifstream& file);
std::pair<std::string,bool> removeNewlineEscape(const std::string& line);
std::string removeComment(const std::string& line);
// source file
#include <iostream>
#include <boost/algorithm/string.hpp>
std::vector<std::string> readRecipe(const std::string& fileName)
{
std::vector<std::string> result;
std::ifstream file(fileName.c_str());
if (!file) std::cout << "Could not open: " << fileName << std::endl;
while (file) // stop once the stream fails (end of file or open error)
{
std::string line = extractLine(file);
if (!line.empty()) result.push_back(line); // skip blank and comment-only lines
} // looping on the lines
return result;
} // readRecipe
std::string extractLine(std::ifstream& file)
{
std::string line, buffer;
while(getline(file, buffer))
{
std::pair<std::string,bool> r = removeNewlineEscape(buffer);
line += boost::trim_left_copy(r.first); // remove leading whitespace
// based on the current locale
if (!r.second) break;
line += " "; // as we append, we insert a whitespace
// in order to avoid unintended token concatenation
}
return removeComment(line);
} // extractLine
//< Returns the line, minus the '\' character
//< if it was the last significant one
//< Returns a boolean indicating whether or not the line continue
//< (true if it's necessary to concatenate with the next line)
std::pair<std::string,bool> removeNewlineEscape(const std::string& line)
{
std::pair<std::string,bool> result;
result.second = false;
size_t pos = line.find_last_not_of(" \t");
if (std::string::npos != pos && line[pos] == '\\')
{
result.second = true; // substr(0, pos) below stops just before the '\' character
}
else if (std::string::npos != pos)
{
++pos; // keep the last significant character
}
result.first = line.substr(0, pos);
return result;
} // removeNewlineEscape
//< The main difficulty here is NOT to confuse a # inside a string
//< with a # signalling a comment
//< assuming strings are contained within "", let's roll
std::string removeComment(const std::string& line)
{
size_t pos = line.find_first_of("\"#");
while(std::string::npos != pos)
{
if (line[pos] == '"')
{
// We have detected the beginning of a string, we move pos to its end
// beware of the tricky presence of a '\' right before '"'...
pos = line.find_first_of("\"", pos+1);
while (std::string::npos != pos && line[pos-1] == '\\')
pos = line.find_first_of("\"", pos+1);
if (std::string::npos == pos) break; // unterminated string: no comment to strip
}
else // line[pos] == '#'
{
// We have found the comment marker in a significant position
break;
}
pos = line.find_first_of("\"#", pos+1);
} // looking for comment marker
return line.substr(0, pos);
} // removeComment
It is fairly inefficient (but I trust the compiler for optimizations), but I believe it behaves correctly. It is untested, though, so take it with a grain of salt. I have focused mainly on solving the functional issues; the naming convention I follow is different from yours, but I don't think it should matter.
I want to point out a small and sweet version which lacks \ support but skips whitespace-only lines and comments. (Note the std::ws in the call to std::getline.)
#include <algorithm>
#include <iostream>
#include <sstream>
#include <string>
int main()
{
std::stringstream input(
" # blub\n"
"# foo bar\n"
" foo# foo bar\n"
"bar\n"
);
std::string line;
while (std::getline(input >> std::ws, line)) {
line.erase(std::find(line.begin(), line.end(), '#'), line.end());
if (line.empty()) {
continue;
}
std::cout << "line: \"" << line << "\"\n";
}
}
Output:
line: "foo"
line: "bar"