How to detect a CRLF in a stream - c++

I have a stringstream holding the content of an HTTP request. As you know, an HTTP request ends with a CRLF line break. But operator>> doesn't recognize the CRLF; it treats it as if it were a normal end-of-file.
How can I detect this CRLF break?
EDIT:
All right, actually I'm using boost.iostreams. But I don't think there should be any differences.
char head[] = "GET / HTTP1.1\r\nConnection: close\r\nUser-Agent: Wget/1.12 (linux-gnu)\r\nHost: www.baidu.com\r\n\r\n";
io::stream<My_InOut> in(head, sizeof head);
string s;
while(in >> s){
char c = in.peek(); // what I am doing here is to check if next character is a normal break so that 's' is a complete word.
switch( c ){
case -1:
// is it eof or an incomplete word?
break;
case 0x20: // a complete word
break;
case 0x0d:
case 0x0a: // also known as \r\n should indicate a complete word
break;
}
}
In this code, I assume the request may arrive split into parts because of how it is transmitted, so I want to tell whether -1 stands for the actual end of the request or just a word cut off midway, in which case I need to read more to complete the request.

First of all, peek returns an int, not a char (at least, std::istream::peek returns int--I don't know about boost). This distinction is important for recognizing -1 as the end of the file rather than a character with the value of 0xFF.
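A minimal sketch of why the int return type matters, using a std::istringstream as a stand-in for the boost stream. The helper name next_is_eof is made up for illustration:

```cpp
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: true only when the stream has no more characters.
// Keeping the result of peek() as an int_type lets the byte 0xFF (255)
// stay distinct from the end-of-file marker (typically -1).
bool next_is_eof(std::istream& in)
{
    typedef std::istream::traits_type traits;
    const traits::int_type c = in.peek();
    return traits::eq_int_type(c, traits::eof());
}
```

Had the result been stored in a char, a 0xFF byte and end-of-file could compare equal on platforms where char is signed.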
Also be aware that i/o streams in text mode will transform the platform's line separator into '\n' (which, in C and C++, usually has the same value as a line feed, but it might not). So if you're running this on Windows, where the native line separator is CR+LF, you'll never see the CR. But if you run the same code on a Linux box, where the native separator is simply LF, you will.
So given your question:
How can I detect this CRLF break?
The answer is to open the stream in binary mode and check for the character values 0x0D followed by 0x0A.
That said, it's not unheard of for HTTP code to overlook that the network protocol requires CR+LF. If you want to abide by the "be liberal in what you accept" maxim, just watch for either CR or LF and then skip the next character if it's the complement.
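The lenient approach could be sketched roughly as follows. The function name read_line_lenient is hypothetical, and the sketch assumes the stream delivers raw bytes (binary mode, or a string stream as in the question):

```cpp
#include <istream>
#include <sstream>
#include <string>

// Reads one line, accepting "\r\n", "\n\r", a lone '\r' or a lone '\n'
// as the terminator. Returns false once nothing more could be read.
bool read_line_lenient(std::istream& in, std::string& line)
{
    typedef std::istream::traits_type traits;
    line.clear();
    bool got_any = false;
    for (;;) {
        const traits::int_type c = in.get();
        if (traits::eq_int_type(c, traits::eof()))
            return got_any;              // the last line may lack a terminator
        got_any = true;
        if (c == '\r' || c == '\n') {
            const int complement = (c == '\r') ? '\n' : '\r';
            if (in.peek() == complement)
                in.get();                // swallow the second half of the pair
            return true;
        }
        line += traits::to_char_type(c);
    }
}
```

With this, the blank line that ends an HTTP header block shows up as a successful read that leaves the string empty.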

Related

How to detect the newline character with gzgetc in c/c++

I want to read an entire line of a file character by character using gzgetc and stop when the newline is encountered. I know there is a function to grab the entire line but I would like to try to do it this way first. I tried:
int c;
do {
c = gzgetc((gzFile) fp);
cout << c;
} while (c != '\n');
The result was an infinite loop. I tried adding (char) before c, with the same result. What am I doing wrong? The data file I am trying to read is encoded in base64, and I want to read in each token separated by a space. Some of the lines are variable length and have a mixture of encoded and unencoded data, which I have set up an algorithm for; I just need to know how to stop at the newline.
You need to also check for gzgetc() returning -1, which indicates an error or end of file, and exiting the loop in that case. Your infinite loop is likely due to one of those.

std::getline deal with \n, \r and \r\n [duplicate]

Specifically I'm interested in istream& getline ( istream& is, string& str );. Is there an option to the ifstream constructor to tell it to convert all newline encodings to '\n' under the hood? I want to be able to call getline and have it gracefully handle all line endings.
Update: To clarify, I want to be able to write code that compiles almost anywhere, and will take input from almost anywhere. Including the rare files that have '\r' without '\n'. Minimizing inconvenience for any users of the software.
It's easy to workaround the issue, but I'm still curious as to the right way, in the standard, to flexibly handle all text file formats.
getline reads in a full line, up to a '\n', into a string. The '\n' is consumed from the stream, but getline doesn't include it in the string. That's fine so far, but there might be a '\r' just before the '\n' that gets included into the string.
There are three types of line endings seen in text files:
'\n' is the conventional ending on Unix machines, '\r' was (I think) used on old Mac operating systems, and Windows uses a pair, '\r' followed by '\n'.
The problem is that getline leaves the '\r' on the end of the string.
ifstream f("a_text_file_of_unknown_origin");
string line;
getline(f, line);
if(!f.fail()) { // a non-empty line was read
// BUT, there might be an '\r' at the end now.
}
Edit Thanks to Neil for pointing out that f.good() isn't what I wanted. !f.fail() is what I want.
I can remove it manually myself (see edit of this question), which is easy for the Windows text files. But I'm worried that somebody will feed in a file containing only '\r'. In that case, I presume getline will consume the whole file, thinking that it is a single line!
.. and that's not even considering Unicode :-)
.. maybe Boost has a nice way to consume one line at a time from any text-file type?
Edit I'm using this to handle the Windows files, but I still feel I shouldn't have to! And this won't work for the '\r'-only files.
if(!line.empty() && *line.rbegin() == '\r') {
line.erase( line.length()-1, 1);
}
As Neil pointed out, "the C++ runtime should deal correctly with whatever the line ending convention is for your particular platform."
However, people do move text files between different platforms, so that is not good enough. Here is a function that handles all three line endings ("\r", "\n" and "\r\n"):
std::istream& safeGetline(std::istream& is, std::string& t)
{
t.clear();
// The characters in the stream are read one-by-one using a std::streambuf.
// That is faster than reading them one-by-one using the std::istream.
// Code that uses streambuf this way must be guarded by a sentry object.
// The sentry object performs various tasks,
// such as thread synchronization and updating the stream state.
std::istream::sentry se(is, true);
std::streambuf* sb = is.rdbuf();
for(;;) {
int c = sb->sbumpc();
switch (c) {
case '\n':
return is;
case '\r':
if(sb->sgetc() == '\n')
sb->sbumpc();
return is;
case std::streambuf::traits_type::eof():
// Also handle the case when the last line has no line ending
if(t.empty())
is.setstate(std::ios::eofbit);
return is;
default:
t += (char)c;
}
}
}
And here is a test program:
int main()
{
std::string path = ... // insert path to test file here
std::ifstream ifs(path.c_str());
if(!ifs) {
std::cout << "Failed to open the file." << std::endl;
return EXIT_FAILURE;
}
int n = 0;
std::string t;
while(!safeGetline(ifs, t).eof())
++n;
std::cout << "The file contains " << n << " lines." << std::endl;
return EXIT_SUCCESS;
}
Are you reading the file in BINARY or in TEXT mode? In TEXT mode the carriage return/line feed pair, CRLF, is interpreted as the TEXT end of line, or end-of-line character, but in BINARY mode you fetch only ONE byte at a time, which means that either character MUST be ignored or left in the buffer to be fetched as another byte!
Carriage return means, on a typewriter, that the carriage, where the printing arm sits, has reached the right edge of the paper and is returned to the left edge. This is a very mechanical model, that of the mechanical typewriter. Line feed then means that the paper roll is rotated up a little so the paper is in position to begin another line of typing. As far as I remember, one of the low codes in ASCII means move one character to the right without typing, the dead char, and of course \b means backspace: move the carriage one character back. That way you can add special effects, like underlining (type underscore), strikethrough (type minus), approximating different accents, or cancelling out (type X), without needing an extended keyboard, just by adjusting the position of the carriage along the line before entering the line feed. So you can use byte-sized ASCII voltages to control a typewriter automatically without a computer in between.
When the automatic typewriter was introduced, AUTOMATIC meant that once you reach the far edge of the paper, the carriage is returned to the left AND the line feed applied, that is, the carriage is assumed to return automatically as the roll moves up! So you do not need both control characters, only one: the \n, new line, or line feed.
This has nothing to do with programming, but ASCII is older and HEY! looks like some people were not thinking when they began doing text things! The UNIX platform assumes an electrical automatic typewriter; the Windows model is more complete and allows for control of mechanical machines, though some control characters become less and less useful in computers, like the bell character, 0x07 if I remember well... Some forgotten texts must have been originally captured with control characters for electrically controlled typewriters, and that perpetuated the model...
Actually the correct variation would be to split on the \r (carriage return) and treat a following \n (line feed) as optional, that is, automatic, hence:
char c;
ifstream is;
is.open("",ios::binary);
...
is.getline(buffer, bufsize, '\r');
//ignore following \n or restore the buffer data
if ((c=is.get())!='\n') is.rdbuf()->sputbackc(c);
...
would be the most correct way to handle all types of files. Note however that on Windows \n in TEXT mode is actually the byte pair 0x0d 0x0a, and 0x0d IS just \r: \n includes \r in TEXT mode but not in BINARY, so \n and \r\n are equivalent... or should be. This is a very basic industry confusion actually, typical industry inertia, as the convention is to speak of CRLF on ALL platforms and then fall into different binary interpretations. Strictly speaking, files that use ONLY 0x0d (carriage return) as the line ending are malformed in TEXT mode (typewriter machine: just return the carriage and strike through everything...), and are a non-line-oriented binary format, so you are not supposed to read them as text! The code ought to fail, maybe with some user message. This does not depend on the OS only, but also on the C library implementation, adding to the confusion and possible variations... (particularly for transparent UNICODE translation layers, which add another point of articulation for confusing variations).
The problem with the previous code snippet (mechanical typewriter) is that it is very inefficient if there are no \n characters after \r (automatic typewriter text). Then it also assumes BINARY mode where the C library is forced to ignore text interpretations (locale) and give away the sheer bytes. There should be no difference in the actual text characters between both modes, only in the control characters, so generally speaking reading BINARY is better than TEXT mode. This solution is efficient for BINARY mode typical Windows OS text files independently of C library variations, and inefficient for other platform text formats (including web translations into text). If you care about efficiency, the way to go is to use a function pointer, make a test for \r vs \r\n line controls however way you like, then select the best getline user-code into the pointer and invoke it from it.
Incidentally I remember I found some \r\r\n text files too... which translates into double line text just as is still required by some printed text consumers.
The C++ runtime should deal correctly with whatever the endline convention is for your particular platform. Specifically, this code should work on all platforms:
#include <string>
#include <iostream>
using namespace std;
int main() {
string line;
while( getline( cin, line ) ) {
cout << line << endl;
}
}
Of course, if you are dealing with files from another platform, all bets are off.
As the two most common platforms (Linux and Windows) both terminate lines with a newline character, with Windows preceding it with a carriage return, you can examine the last character of the line string in the above code to see if it is \r and if so remove it before doing your application-specific processing.
For example, you could provide yourself with a getline style function that looks something like this (not tested, use of indexes, substr etc for pedagogical purposes only):
istream & safegetline( istream & is, string & line ) {
    string myline;
    if ( getline( is, myline ) ) {
        // Drop a trailing '\r' left over from a Windows-style "\r\n" ending.
        if ( myline.size() && myline[myline.size()-1] == '\r' ) {
            line = myline.substr( 0, myline.size() - 1 );
        }
        else {
            line = myline;
        }
    }
    return is;
}
One solution would be to first search and replace all line endings to '\n' - just like e.g. Git does by default.
Other than writing your own custom handler or using an external library, you are out of luck. The easiest thing to do is to check whether line[line.length() - 1] is '\r' (after making sure the line is not empty). On Linux, this is superfluous for native files, since their lines end in a bare '\n', so you lose a little time if the check runs in a loop. On Windows in text mode, it is also superfluous, because the runtime has already translated '\r\n' to '\n'. However, what about classic Mac files, whose lines end in '\r'? std::getline would not work for those files on Linux or Windows, because it splits only on '\n' and would read the whole file as a single line. Of course, there are also the numerous EBCDIC systems, something that most libraries won't dare tackle.
Checking for '\r' is probably the best solution to your problem. Reading in binary mode would allow you to check for all three common line endings ('\r', '\r\n' and '\n'). If you only care about Linux and Windows as old-style Mac line endings shouldn't be around for much longer, check for '\n' only and remove the trailing '\r' character.
Unfortunately the accepted solution does not behave exactly like std::getline(). To obtain that behavior (according to my tests), the following change is necessary:
std::istream& safeGetline(std::istream& is, std::string& t)
{
t.clear();
// The characters in the stream are read one-by-one using a std::streambuf.
// That is faster than reading them one-by-one using the std::istream.
// Code that uses streambuf this way must be guarded by a sentry object.
// The sentry object performs various tasks,
// such as thread synchronization and updating the stream state.
std::istream::sentry se(is, true);
std::streambuf* sb = is.rdbuf();
for(;;) {
int c = sb->sbumpc();
switch (c) {
case '\n':
return is;
case '\r':
if(sb->sgetc() == '\n')
sb->sbumpc();
return is;
case std::streambuf::traits_type::eof():
is.setstate(std::ios::eofbit); //
if(t.empty()) // <== change here
is.setstate(std::ios::failbit); //
return is;
default:
t += (char)c;
}
}
}
According to https://en.cppreference.com/w/cpp/string/basic_string/getline:
Extracts characters from input and appends them to str until one of the following occurs (checked in the order listed)
end-of-file condition on input, in which case, getline sets eofbit.
the next available input character is delim, as tested by Traits::eq(c, delim), in which case the delimiter character is extracted from input, but is not appended to str.
str.max_size() characters have been stored, in which case getline sets failbit and returns.
If no characters were extracted for whatever reason (not even the discarded delimiter), getline sets failbit and returns.
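The rules quoted above can be observed directly with a small sketch. The struct GetlineResult and the helper getline_once are names invented here for illustration:

```cpp
#include <sstream>
#include <string>

struct GetlineResult {
    std::string text;  // what std::getline stored
    bool eof;          // eofbit after the call
    bool fail;         // failbit (or badbit) after the call
};

// Runs std::getline once on the given input and records the state bits.
GetlineResult getline_once(const std::string& input)
{
    std::istringstream in(input);
    GetlineResult r;
    std::getline(in, r.text);
    r.eof = in.eof();
    r.fail = in.fail();
    return r;
}
```

For "abc\n" neither bit is set, for "abc" (no trailing newline) only eofbit is set, and for an empty input both eofbit and failbit are set, which is the behavior the modified safeGetline reproduces.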
If it is known how many items/numbers each line has, one could read one line with e.g. 4 numbers as
string num;
is >> num >> num >> num >> num;
This also works with other line endings.
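A short sketch of why this works: operator>> skips leading whitespace, and in the default "C" locale both '\r' and '\n' count as whitespace. The helper read_tokens is a hypothetical name:

```cpp
#include <sstream>
#include <string>
#include <vector>

// Reads whitespace-separated tokens. The line-ending style does not matter,
// because '\r' and '\n' are both skipped as whitespace.
std::vector<std::string> read_tokens(const std::string& input)
{
    std::istringstream is(input);
    std::vector<std::string> tokens;
    std::string tok;
    while (is >> tok)
        tokens.push_back(tok);
    return tokens;
}
```

The trade-off is that the line structure is lost: the code has to know in advance how many tokens belong to each line.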

eof from string, not a stream

I have a secret "mission" to write a Vigenère cipher, together with its analysis, over the ASCII alphabet.
I have some trouble with encrypting text.
There are two kinds of problems:
1) If I use the whole ASCII table, I have trouble decrypting the text, because the control ("system") characters corrupt it (by the way, the text is "War and Peace" by Tolstoy). Should I use a truncated version of the table?
If yes, can I still do the operations from the next question with a truncated ASCII table?
2) I want to have my whole text in one string. I can do it like this:
string s;
string p = "";
ifstream in("text_for_encryption.txt");
while (getline(in, s))
{
p+=s;
p+="\n";
}
"s" is the temporary string, and "p" is the string that holds all the text from the file (with the line breaks and, of course, EOF).
I will then loop over "p", something like:
while (not eof in p)
{
take the first keyword.length() chars from "p", check each of them for EOF and encrypt them (they will be deleted from p)
write them to the file "encrypted_text.txt"
}
in pseudocode (sorry, it is rough :( ).
So the question is: how can I compare a string element with EOF?
Maybe I can't google well, but I couldn't find the answer to this question.
Thanks in advance for any advice!
Update:
If I encrypt line by line, it will be easy to get the length of the key by Friedman's method (if the key is quite small).
So I want to encrypt the text together with the line breaks for more security.
For encrypting, it depends largely on what you want to encrypt,
and what you want to do with the encrypted text. The usual
solution is to encrypt the bytes values (not the characters);
this means that you'll have to read and write the encrypted file
in binary mode, but since it's not meant to be readable anyway,
that's usually not an issue.
For the rest, strings do not have "EOF" characters. In fact,
there is no such thing as an EOF character[1]. (Nor an endl
character, either.) EOF is, in fact, an "event" which occurs
when reading from a stream; in C++, it is, in fact, treated as
a sort of an error. std::istream functions which can return
EOF (e.g. std::istream::get()) return int, and not char,
in order to be able to return an out of band value.
Strings do have a known length. To visit all of the characters
in a string:
for ( std::string::const_iterator current = s.begin();
current != s.end();
++ current ) {
// Do something with *current...
}
(If you have C++11, you can replace
std::string::const_iterator with auto. This is much simpler
to type, but until you master the iterator idioms, it's probably
better to write the type out, to ensure you understand what is
going on.)
[1] Historically, text files have had EOF characters on some
systems. This is not the end of file that you see with
std::istream::get(), but even today, if you open a file in
text mode under Windows, a 0x1A in the file will trigger the end
of file event in the input.
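Putting both points together, a byte-wise Vigenère pass over a string might look like the sketch below. The function name vigenere_bytes and the choice of addition modulo 256 are assumptions made for illustration; the loop ends because of the string's length, not because of any EOF marker:

```cpp
#include <cstddef>
#include <string>

// Adds the key bytes to the text bytes modulo 256, or subtracts them
// when decrypt is true. Every byte value is allowed, including '\n'.
std::string vigenere_bytes(const std::string& text, const std::string& key, bool decrypt)
{
    if (key.empty())
        return text;
    std::string out(text.size(), '\0');
    for (std::size_t i = 0; i < text.size(); ++i) {
        const unsigned t = static_cast<unsigned char>(text[i]);
        const unsigned k = static_cast<unsigned char>(key[i % key.size()]);
        const unsigned r = decrypt ? (t + 256u - k) % 256u : (t + k) % 256u;
        out[i] = static_cast<char>(static_cast<unsigned char>(r));
    }
    return out;
}
```

Because the output can contain any byte value, the encrypted file should be written and read in binary mode, as noted above.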

C++ cin fails when reading more than 127 ASCII values

I've created a text file that has 256 characters, the first character of the text file being ASCII value 0 and the last character of the text value being ASCII value 255. The characters in between increment from 0 to 255 evenly. So character #27 is ASCII value 27. Character #148 should be ASCII value 148.
My goal is to read every character of this text file.
I've tried reading this with cin. I tried cin.get() and cin.read(), both of which are supposed to read unformatted input. But both fail when reading the 26th character. I think when I used an unsigned char, cin said it was reading in 255, which simply isn't true. And when I used a normal signed char, cin said it was reading in -1. It should be reading in whatever the character equivalent of ASCII 26 is. Perhaps cin thinks it's hit EOF? But I've read in separate StackOverflow posts that EOF isn't an actual character that one can write. So I'm lost as to why cin is choking on character values that represent integer -1 or integer 255. Could someone please tell me what I'm doing wrong, why, and what the best solution is, and why?
There's not much concrete code to paste. I've tried a few different non-working combinations all involving either cin.get() or cin.read() with either char or unsigned char and call casts to char and int in between. I've had no luck with being able to read past the 26th character, except for this:
unsigned char character;
while ( (character = (unsigned char)cin.get()) != EOF) { ... }
Interestingly enough though, although this doesn't stop my while loop at the 26th character, it doesn't move on either. It seems like cin, whether it's cin.get() or cin.read(), just refuses to advance to the next character the moment it detects something it doesn't like. I'm also aware that something like cin.ignore() exists, but my input isn't predictable; that is, these 256 characters for my text file are just a test case, and the real input is rather random. This is part of a larger homework assignment, but this specific question is not related to the assignment; I'm just stuck on part of the process.
Note: I am reading from the standard input stream, not a specific text file. Still no straightforward solution it seems. I can't believe this hasn't been done on cin before.
Update:
On Windows, it stops after character 26 probably due to that Ctrl-Z thing. I don't care that much for this problem. It only needs to work on Linux.
On Linux, though, it reads all characters from 0 - 127. But it doesn't seem to be reading the extended ASCII characters from 128 to 255. There's a "solution" program that produces output we're supposed to imitate, and that program is able to read all 256 characters somehow.
Question: How, using cin, can I read all 256 character values?
Solved
Using:
int characterInt;
unsigned char character;
while ( (characterInt = getchar()) != EOF )
{
// 'character' now stores values from 0 - 255
character = (unsigned char)(characterInt);
}
I presume you are on Windows. On the Windows platform character 26 is Ctrl-Z, which is used in a console to represent end of file, so iostreams thinks your file ends at that character.
It only does this in text mode, which cin is using; if you open a stream in binary mode it won't do this.
std::cin reads text streams, not arbitrary binary data.
As to why the 26th character is interesting, you are probably using a CP/M derivative (such as MS-DOS or MS-Windows). In those operating systems, Control-Z is used as an EOF character in text files.
EDIT:
On Linux, using g++ 4.4.3, the following program behaves precisely as expected, printing the numbers 0 thru 255, inclusive:
#include <iostream>
#include <iomanip>
int main () {
int ch;
while( (ch=std::cin.get()) != std::istream::traits_type::eof() )
std::cout << ch << " ";
std::cout << "\n";
}
There are two problems here. The first is that in Windows the default mode for cin is text and not binary, resulting in certain characters being interpreted instead of being input into the program. In particular the 26th character, Ctrl-Z, is being interpreted as end-of-file due to backwards compatibility taken to an extreme.
The other problem is due to the way cin >> works - it skips whitespace. This includes space obviously, but also tab, newline, etc. To read every character from cin you need to use cin.get() or cin.read().
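A sketch of the unformatted approach, written against std::istream so that it can be passed std::cin or any other stream. The name read_all_bytes is hypothetical. Note that it does not change the stream's mode: on Windows, std::cin in text mode would still stop at Ctrl-Z.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Collects every byte from the stream as a value in the range 0..255.
std::vector<int> read_all_bytes(std::istream& in)
{
    typedef std::istream::traits_type traits;
    std::vector<int> values;
    for (;;) {
        const traits::int_type c = in.get();   // keep the int, do not cast to char yet
        if (traits::eq_int_type(c, traits::eof()))
            break;
        values.push_back(c);
    }
    return values;
}
```

Comparing against eof() before narrowing to char is what keeps byte 255 from being mistaken for end-of-file.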

Actual difference between end of line and end of file under windows?

I understand EOF and EOL, but when I was reading this question (second part of the answer), my understanding got shaken.
Especially this paragraph:
It won't stop taking input until it finds the end of file(cin uses
stdin, which is treated very much like a file)
So I want to know: when we do something like this in C++ under Windows:
std::cin>>int_var; and we press Enter, this ends the input, but according to the linked answer it should only stop taking input after Ctrl+Z is hit.
So I would love to know how std::*stream deals with EOF and EOL.
Second part:
please have a look at this example :
std::cin.getline(char_array_of_size_256 ,256);
cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
cout << "artist is " << artist << endl;
If I remove std::cin.ignore(), it simply stops taking input (which is a known case), but when I keep it, it waits for new input ended by '\n'. But it should simply clear the stream rather than wait for new input ending with '\n'.
Thanks for your time.
End-of-line and end-of-file are very different concepts.
End-of-line is really just another input character (or character sequence) that can appear anywhere in an input stream. If you're reading input one character at a time from a text stream, end-of-line simply means that you'll see a new-line ('\n') character. Some input routines treat this character specially; for example, it tells getline to stop reading. (Other routines treat ' ' specially; there's no fundamental difference.)
Different operating systems use different conventions for marking the end of a line. On Linux and other Unix-like systems, the end of a line in a file is marked with a single ASCII linefeed (LF, '\n') character. When reading from a keyboard, both LF and CR are typically mapped to '\n' (try typing either Enter, Control-J, or Control-M). On Windows, the end of a line in a file is marked with a CR-LF pair (\r\n). The C and C++ I/O systems (or the lower-level software they operate on top of) map all these markers to a single '\n' character, so your program doesn't have to worry about all the possible variations.
End-of-file is not a character, it's a condition that says there are no more characters available to be read. Different things can trigger this condition. When you're reading from a disk file, it's just the physical end of the file. When you're reading from a keyboard on Windows, control-Z denotes end-of-file; on Unix/Linux, it's typically control-D (though it can be configured differently).
(You'll usually have an end-of-line (character sequence) just before end-of-file, but not always; input can sometimes end in an unterminated line, on some systems.)
Different input routines have different ways of indicating that they've seen an end-of-file condition. Read the documentation for each one for the details.
As for EOF, that's a macro defined in <stdio.h> or <cstdio>. It expands to a negative integer constant (typically -1) that's returned by some functions to indicate that they've reached an end-of-file condition.
EDIT: For example, suppose you're reading from a text file containing two lines:
one
two
Let's say you're using C's getchar(), getc(), or fgetc() function to read one character at a time. The values returned on successive calls will be:
'o', 'n', 'e', '\n', 't', 'w', 'o', '\n', EOF
Or, in numeric form (on a typical system):
111, 110, 101, 10, 116, 119, 111, 10, -1
Each '\n', or 10 (0x0a) is a new-line character read from the file. The final -1 is the value of EOF; this isn't a character, but an indication that there are no more characters to be read.
Higher-level input routines, like C's fgets() and C++'s std::cin >> s or std::getline(std::cin, s), are built on top of this mechanism.
First "part"
so i want to know when we do some thing like in c++ under windows : std::cin>>int_var; , and we press enter , this end the input but according to reference link it should only stop taking input after hitting ctrl+z.
No, you're confusing formatted input operations with stream iterators. The following will use the formatted input operation (operator>>) repeatedly until the end of file is reached because the "end iterator" represents the end of the stream.
std::vector<int> integers;
std::copy(
std::istream_iterator<int>(std::cin),
std::istream_iterator<int>(),
std::back_inserter(integers));
If you use the following:
int i = 0;
std::cin >> i;
in an interactive shell (e.g. in console mode), std::cin will block on user input which is acquired line by line. So, if no data (or only white space) is available, this operation will actually force the user to type a line of input and press the enter key.
However,
int i = 0;
int j = 0;
std::cin >> i >> j;
may block on one or two lines of input, depending on what the user types. In particular, if the user types
1<space>2<enter>
then the two input operations will be applied using the same line of input.
Second "part"
Considering the code snippet:
std::cin.getline(char_array_of_size_256 ,256);
cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
cout << "artist is " << artist << endl;
If the line contains 255 or fewer characters, std::cin.getline() will consume the end-of-line character. Thus, the call to ignore() will consume all characters until the next line is completed. If you want to capture only the current line and ignore all characters past the first 255, note that getline() sets failbit when the buffer fills up before the end of the line is reached, so I suggest you use something like:
std::cin.getline(char_array_of_size_256 ,256);
if (std::cin.fail() && !std::cin.eof()) {
std::cin.clear(); // the buffer filled up before the newline was reached
cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
}
cout << "artist is " << artist << endl;
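As a self-contained sketch of capturing only the start of a line and discarding the rest: the helper name read_truncated_line is made up, and it relies on istream::getline setting failbit (without eofbit) when the buffer fills before the delimiter is found.

```cpp
#include <istream>
#include <limits>
#include <sstream>
#include <string>

// Reads at most size-1 characters of the current line into buf and
// discards whatever is left of that line. Returns false at end of input.
bool read_truncated_line(std::istream& in, char* buf, std::streamsize size)
{
    in.getline(buf, size);
    if (in.fail() && !in.eof()) {
        // The buffer filled up before the newline: recover, then skip the tail.
        in.clear();
        in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
    return !in.fail();
}
```

When the whole line fits, getline() has already consumed the newline, so nothing is ignored and the next call starts on the next line.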
On the second part:
When the linked answer said "read into a string", I guess they meant
std::string s;
std::getline(std::cin, s);
which always reads the entire line into the string s (while setting s to the proper size).
That way there is nothing left over from the input line to clean up.