C++ getline() behaves strangely when reading stringstream containing \0 [duplicate] - c++

I'm trying to read a large buffer from a socket which uses \0 to delimit pieces of data and \n to delimit lines.
I thought getline() would be an easy way to get each line but it's behaving strangely.
I'm using \n as the delimiter in getline().
string line;
string test1 = "aaa,123\nbbb\nccc,456\n";
stringstream ss1(test1);
while(std::getline(ss1, line, '\n')) {
cout << line << endl;
}
// outputs:
// aaa,123
// bbb
// ccc,456
string test2 = "aaa\0123\0\nbbb\0\nccc\0456\0\n";
stringstream ss2(test2);
while(std::getline(ss2, line, '\n')) {
cout << line << endl;
}
// outputs:
// aaa
// 3
Why is this happening in test2? Where is the 3 coming from? Must I remove the \0 to make this work? Is there an easier/better way to mark strings in my buffer when I do a socket recv()?

\0 is a special character: in a C string it marks where the string ends.
For example, if you write "a string", the compiler automatically appends a \0, which marks the end of the string. It is legal to have a \0 in the middle of a string literal, but any function that treats the data as a C string ignores everything after it.
So any operation that reads the data as a C string, including the std::string constructor used here and not just getline, would see only "aaa" and ignore everything after the first \0. But...
As @Fred Larson points out:
Oh, I see where the 3 comes from. The first \0 isn't a null, it's the start of \012, which is a newline (octal 12 is decimal 10, the line feed character). Then the 3 follows.
So actually, the string is being treated as "aaa\n3". Which is why you get the output you do.
Edit: And thanks to Galik, I will also add that the rules I mention apply to string literals and C strings. A std::string stores its length, so it can hold embedded \0 characters; here the data is cut short only because the std::string is constructed from a C string.

\0 is the standard terminator for C strings. As such, you may either read character by character or avoid \0 as a delimiter.

Related

Why is the length of a string off by one on the first read of a file?

I am perplexed with the way my program is performing. I am looping the following process:
1) take the name of a course from an input file
2) output the length of the name of the course
The problem is that the first value is always one less than the actual value of the string.
My first string contains 13 characters (including the colon), but nameOfClass.length() returns 12. The next string, the character count is 16 and indeed, nameOfClass.length() returns 16.
Every value after that also returns the expected value, it is only the first that returns the expected value minus 1.
Here's the (reduced) code:
std::ifstream inf("courseNames.txt");
int numberOfClasses = 10;
string nameOfClass;
for (int i = 0; i < numberOfClasses; i++) {
std::getline(inf, nameOfClass,':');
std::cout << nameOfClass.length() << "\n";
}
The file looks like this (courseNames.txt):
Pre-Calculus:
Public-Speaking:
English I:
Calculus I:
...etc. (6 more classes)
This is what I get:
12
16
10
11
Can anyone explain this behavior of the .length() function?
You have a problem, but you have drawn the wrong conclusion. std::getline extracts the delimiter but doesn't store it, so the first result is indeed 12.
It doesn't store the delimiter on any subsequent read either, so why is there always one character more? Well, look at what comes after that ':'. That's right, a newline!
Pre-Calculus:
^ a new line
So every value of nameOfClass except the first is stored with an extra newline in front of the other characters.
The fix is easy enough: just ignore the newline after reading each string.
inf.ignore(); // ignore one character
So the first result was not the wrong one; it was the only one that was right :)

Array conversion guidance

I'm stuck on an assignment which converts contents of an array (input from the user) to a pre-declared shorthand.
I want it to be as simple as strcpy(" and ", "+");
to change the word 'and' within a string, to a '+' sign.
Unfortunately, no matter how I structure the function; I get a deprecated conversion warning (variant loops, and direct applications, attempted).
Side note; this is assignment based, so my string shortcuts are severely limited, and no pointers (I've seen several versions of clearing the fault using them).
I'm not looking for someone to do my homework; just guidance on how strcpy can be applied without creating the dep. warning. Perhaps I shouldn't be using strcpy at all?
strcpy copies the contents of the second string into the memory of the first string. Since you're copying a string literal into a string literal it can't do it (you can't write to a string literal) and so it complains.
Instead you need to build your own search and replace system. You can use strstr() to search for a substring within a string, and it returns the pointer in memory to the start of that found string (if it's found).
Let's take the sample string Jack and Jill went up the hill.
char *andloc = strstr(buffer, " and ");
That would return the address of the start of the string (say 0x100) plus the offset of the word " and " (including spaces) within it (0x100 + 4) which would be 0x104.
Then, if found, you can replace it with the & symbol. However you can't use strcpy for that as it'll terminate the string. Instead you can set the bytes manually, or use memcpy:
if (andloc != NULL) { // it's been found
andloc[1] = '&';
andloc[2] = ' ';
}
or:
if (andloc != NULL) { // it's been found
memcpy(andloc, " & ", 3);
}
That would result in Jack & d Jill went up the hill. We're not quite there yet. Next you have to shuffle everything down to cover the "d " from the old " and ". For that you'd think you could now use strcpy or memcpy, however that's not possible - the strings (source and destination) overlap, and the manual pages for both specifically state that the strings must not overlap and to use memmove instead.
So you can move the contents of the string after the "d " to after the "& " instead:
memmove(andloc + 3, andloc + 5, strlen(andloc + 5) + 1);
Adding a number to a char pointer like that advances the address it points to. So we're copying the data that starts 5 characters after the old " and " location into the space that starts 3 characters after it. The amount to copy is the length of the string starting 5 characters after the " and " location, plus one so that the terminating null character at the end of the string is copied too.
Another manual way of doing it would be to iterate through each character until you find the end of the string:
char *to = andloc + 3;
char *from = andloc + 5;
while (*from) { // Until the end of the string
*to = *from; // Copy one character
to++; // Move to the ...
from++; // ... next character pair
}
*to = 0; // Add the end of string marker.
So now either way the string memory contains:
Jack & Jill went up the hill\0l\0
The \0 is the end of string marker, so the actual string "content" is only up as far as the first \0 and the l\0 is now ignored.
Note that this only works if the replacement is shorter than the text it replaces. If the replacement is longer, so that the string grows, you must use memmove (which behaves as if it first copied the content to a scratchpad) and make sure that your buffer has enough room to store the finished string (this kind of thing is a common source of "buffer overruns", which are a security headache and one of the biggest causes of systems being hacked). You also have to do the whole thing in the opposite order: move the latter part of the string first to make room, then write the replacement into the gap between the two halves.

difference between cin.get() and cin.getline()

I am new to programming, and I have some questions on get() and getline() functions in C++.
My understanding for the two functions:
The getline() function reads a whole line, and using the newline character transmitted by the Enter key to mark the end of input. The get() function is much like getline() but rather than read and discard the newline character, get() leaves that character in the input queue.
The book I am reading (C++ Primer Plus) suggests using get() over getline(). My confusion is this: isn't getline() safer than get(), since it makes sure the line is ended at the '\n'? get(), on the other hand, just leaves that character hanging in the input queue, potentially causing problems.
The advantages and drawbacks are roughly balanced, and it essentially all depends on what you are reading: get() leaves the delimiter in the queue, so you can treat it as part of the next input. getline() discards it, so the next input starts right after it.
If you are talking about the newline character from console input, it makes perfect sense to discard it, but if we consider input from a file, you can use the beginning of the next field as the "delimiter".
What is "good" or "safe" to do, depends on what you are doing.
cin.getline() reads input up to '\n', then extracts and discards the '\n'
cin.get() reads input up to '\n' and leaves the '\n' in the stream
For example :
char str1[100];
char str2[100];
cin.getline(str1 , 100);
cin.get(str2 , 100);
cout << str1 << " "<<str2;
input :
1 2
3 4
output: 1 2 3 4 // the expected output
When you reverse them:
For example :
char str1[100];
char str2[100];
cin.get(str2 , 100);
cin.getline(str1 , 100);
cout << str1 << " "<<str2;
input :
1 2
3 4
output: 1 2 // unexpected: cin.get() left the '\n' in the stream, so cin.getline() read only an empty line
get() with no arguments extracts a single character from the stream and returns its value (cast to an integer), whereas getline() reads the stream line by line. getline() is typically used on a flat file (with thousands of lines) where you want to extract the content record by record using a certain delimiter and then do some operation on each record.
The difference between the get() and getline() functions is that getline() extracts the delimiting character but does not place it in the string, whereas get() does not extract the delimiting character from the input buffer.
cin.get(buffer, n) reads up to the end of the line but leaves the newline in the stream, so calling it again reads nothing until that newline has been removed. getline() reads a line and discards the newline, so repeated calls read successive lines.

How does char extraction differ from string extraction?

When disabling whitespace skipping with chars and strings the behavior is different. It seems the only way to extract an entire string (including whitespace characters) is to use chars and noskipws. But this is not possible with strings because it won't extract after the first space.
std::string test = "a b c";
char c;
std::istringstream iss(test);
iss.unsetf(std::ios_base::skipws);
while (iss >> c)
std::cout << c;
This outputs a b c, but change c to a std::string and it outputs only a.
The >> operator for a string extracts words, and stops at the
first white space it sees. If it doesn't skip initial white
space, then it stops immediately, and returns an empty string.
You don't say how you want the string to be delimited. To read
until the end of line, just use std::getline. To read until
the end of file, you can use something like:
std::ostringstream collector; // an output stream, so that << is available
collector << iss.rdbuf();
std::string results = collector.str();
It's not the most efficient, but if the file is small, it will
do.

istream and cin.get()

I have a question about the difference between these two pieces of code:
char buffer5[5];
cin.get(buffer5, 5);
cout << buffer5;
cin.get(buffer5, 5);
cout << buffer5;
and
char buffer4;
while (cin.get(buffer4))
{
cout << buffer4;
}
In the first piece of code, the code gets 5 characters and puts it in buffer5. However, because you press enter, a newline character isn't put into the stream when calling get(), so the program will terminate and will not ask you for another round of 5 characters.
In the second piece of code, cin.get() waits for input to the input stream, so the loop doesn't just terminate (I think). Lets say I input "Apple" into the input stream. This will put 5 characters into the input stream, and the loop will print all characters to the output. However, unlike the first piece of code, it does not stop, even after two inputs as I can continuously keep inputting.
Why is it that I can continuously input character sequences into the terminal in the second piece of code and not the first?
First off, "pressing enter" has no special meaning to the IOStreams beyond entering a newline character (\n) into the input sequence (note, when using text streams the platform specific end of line sequences are transformed into a single newline character). When entering data on a console, the data is normally line buffered by the console and only forwarded to the program when pressing enter (typically this can be turned off but the details of this are platform specific and irrelevant to this question anyway).
With this out of the way, let's turn our attention to the behavior of s.get(buffer, n) for a std::istream s and a pointer buffer to an array of at least n characters. The description of what this does is quite trivial: it calls s.get(buffer, n, s.widen('\n')). Since we are talking about std::istream and you probably haven't changed the std::locale, we can assume that s.widen('\n') just returns '\n', i.e., the call is equivalent to s.get(buffer, n, '\n'), where '\n' is called a delimiter, and the question becomes what this function does.
Well, this function extracts up to m = (0 < n ? n - 1 : 0) characters, stopping when either m characters have been extracted or the next character is equal to the delimiter, which is left in the stream (you'd use std::istream::getline() if you wanted the delimiter to be extracted). Each extracted character is stored in the corresponding location of buffer, and if 0 < n a null character is stored right after the last extracted character. If no character is extracted, std::ios_base::failbit is set.
OK, with this we should have all ingredients to the riddle in place: When you entered at least one character but less than 5 characters the first call to get() succeeded and left the newline character as next character in the buffer. The next attempt to get() more characters immediately found the delimiter, stored no character, and indicated failure by setting std::ios_base::failbit. It is easy to verify this theory:
#include <iostream>
int main()
{
char buffer[5];
for (int count(0); std::cin; ++count) {
if (std::cin.get(buffer, 5)) {
std::cout << "get[" << count << "]='" << buffer << "'\n";
}
else {
std::cout << "get[" << count << "] failed\n";
}
}
}
If you enter no character, the first call to std::cin.get() fails. If you enter 1 to 4 characters, the first call succeeds but the second one fails. If you enter more than 4 characters, the second call also succeeds, etc. There are several ways to deal with the potentially stuck newline character:
1) Just use std::istream::getline() which behaves the same as std::istream::get() but also extracts the delimiter if this is why it stopped reading. This may chop one line into multiple reads, however, which may or may not be desired.
2) To avoid the limitation of a fixed line length, you could use std::getline() together with an std::string (i.e., std::getline(std::cin, string)).
3) After a successful get() you could check if the next character is a newline using std::istream::peek() and std::istream::ignore() it when necessary.
Which of these approaches meets your needs depends on what you are trying to achieve.