scanf() curious behaviour! - c++

I recently stumbled upon a curious case (at least for me, since I hadn't encountered this before). Consider the simple code below:
int x;
scanf("%d",&x);
printf("%d",x);
The above code takes a normal integer input and displays the result as expected.
Now, if I modify the above code to the following:
int x;
scanf("%d ",&x);//notice the extra space after %d
printf("%d",x);
This waits for one more input before it prints the result of the printf statement. I don't understand why a space changes the behaviour of scanf(). Can anyone explain this to me?

From http://beej.us/guide/bgc/output/html/multipage/scanf.html:
The scanf() family of functions reads data from the console or from a FILE stream, parses it, and stores the results away in variables you provide in the argument list.
The format string is very similar to that in printf() in that you can tell it to read a "%d", for instance for an int. But it also has additional capabilities, most notably that it can eat up other characters in the input that you specify in the format string.
What's happening is that scanf is pattern matching the format string (kind of like a regular expression). scanf keeps consuming text from standard input (e.g. the console) until the entire pattern is matched.
In your second example, scanf reads in a number and stores it in x. But it has not yet reached the end of the format string -- there is still a space character left. So scanf reads additional whitespace character(s) from standard input in order to (try to) match it.

From the man page:
The format string consists of a sequence of directives which describe
how to process the sequence of input characters. If processing of a
directive fails, no further input is read, and scanf() returns. A
"failure" can be either of the following: input failure, meaning that
input characters were unavailable, or matching failure, meaning that
the input was inappropriate (see below).
A directive is one of the following:
• A sequence of white-space characters (space, tab, newline, etc;
see isspace(3)). This directive matches any amount of white
space, including none, in the input.

man scanf
[...]
A sequence of white-space characters (space, tab, newline, etc.;
see isspace(3)). This directive matches any amount of white
space, including none, in the input.

Related

Does istream::ignore discard more than n characters?

(this is possibly a duplicate of Why does std::basic_istream::ignore() extract more characters than specified?, however my specific case doesn't deal with the delim)
From cppreference, the description of istream::ignore is the following:
Extracts and discards characters from the input stream until and including delim.
ignore behaves as an UnformattedInputFunction. After constructing and checking the sentry object, it extracts characters from the stream and discards them until any one of the following conditions occurs:
count characters were extracted. This test is disabled in the special case when count equals std::numeric_limits<std::streamsize>::max()
end of file conditions occurs in the input sequence, in which case the function calls setstate(eofbit)
the next available character c in the input sequence is delim, as determined by Traits::eq_int_type(Traits::to_int_type(c), delim). The delimiter character is extracted and discarded. This test is disabled if delim is Traits::eof()
However, let's say I've got the following program:
#include <iostream>
int main(void) {
int x;
char p;
if (std::cin >> x) {
std::cout << x;
} else {
std::cin.clear();
std::cin.ignore(2);
std::cout << "________________";
std::cin >> p;
std::cout << p;
}
}
Now, let's say I input something like p when my program starts. I expect cin to 'fail', then clear to be called and ignore to discard 2 characters from the buffer. So the 'p' and '\n' that are left in the buffer should be discarded. However, the program still expects input after ignore gets called, so in reality it only gets to the final std::cin>>p after I've given it more than 2 characters to discard.
My issue:
Entering something like 'b' and hitting Enter right after the first input (that is, after the two characters 'p' and '\n' have been discarded) keeps 'b' in the buffer and passes it straight to cin, without the message being printed first. How can I make the message print immediately after the two characters are discarded, and only then have >> called?
After a lot of back and forth in the comments (and reproducing the problem myself), it's clear the problem is that:
You enter p<Enter>, which isn't parsable
You try to discard exactly two characters with ignore
You output the underscores
You prompt for the next input
but in fact things seem to stop at step 2 until you give it more input, and the underscores only appear later. Well, bad news, you're right, the code is blocking at step 2 in ignore. ignore is blocking waiting for a third character to be entered (really, checking if it's EOF after those two characters), and by the spec, this is apparently the correct thing to do, I think?
The problem here is the same basic issue as in the question you linked, just a different manifestation. When ignore terminates because it has read the requested number of characters, it always attempts to read one more character, because it needs to know whether condition 2 might also be true (that is, whether it happened to read the last character), so it can take the appropriate action: putting cin in the EOF state, or otherwise leaving the next character in the buffer for the next read:
Effects: Behaves as an unformatted input function (as described above). After constructing a sentry object, extracts characters and discards them. Characters are extracted until any of the following occurs:
n != numeric_limits<streamsize>::max() (18.3.2) and n characters have been extracted so far
end-of-file occurs on the input sequence (in which case the function calls setstate(eofbit), which may throw ios_base::failure (27.5.5.4));
traits::eq_int_type(traits::to_int_type(c), delim) for the next available input character c (in which case c is extracted).
Since you didn't provide an end character for ignore, it's looking for EOF, and if it doesn't find it after two characters, it must read one more to see if it shows up after the ignored characters (if it does, it'll leave cin in EOF state, if not, the character it peeked at will be the next one you read).
Simplest solution here is to not try to specifically discard exactly two characters. You want to get rid of everything through the newline, so do that with:
std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\n'); // needs #include <limits>
instead of std::cin.ignore(2);; that will read any and all characters until the newline (or EOF), consume the newline, and it won't ever overread (in the sense that it continues forever until the delimiter or EOF is found, there is no condition under which it finishes reading a count of characters and needs to peek further).
If for some reason you want to specifically ignore exactly two characters (how do you know they entered p<Enter> and not pabc<Enter>?), just call .get() on it a couple times or .read(&two_byte_buffer, 2) or the like, so you read the raw characters without the possibility of trying to peek beyond them.
For the record, this seems a little different from the cppreference spec (which may be wrong); condition 2 in the spec doesn't specify it needs to verify if it is at EOF after reading count characters, and cppreference claims condition 3 (which would need to peek) is explicitly not checked if the "delimiter" is the default Traits::eof(). But the spec quote found in your other answer doesn't include that line about condition 3 not applying for Traits::eof(), and condition 2 might allow for checking if you're at EOF, which would end up with the observed behavior.
Your problem is related to your terminal. When you press ENTER, you are most likely getting two characters -- '\r' and '\n'. Consequently, there is still one character left in the input stream to read from. Change that line to:
std::cin.ignore(10, '\n'); // 10 is not magical. You may use any number > 2
to see the behavior you are expecting.
Passing the exact number of characters currently in the buffer will do the trick:
std::cin.ignore(std::cin.rdbuf()->in_avail());

What does '|' mean in scanf format string

char szA[256]={0};
scanf("%[^a]%s",&szA); //failed when trailing string
scanf("%[^a]|%s",&szA); //worked whatever the input
What does '|' mean in a format string. I cannot find official specification. Is there anyone who can give me some clue?
When I input something with several '|', the latter one still works (meaning only that the program doesn't crash). Doesn't it need two buffers after the format string? The former one crashes when the input string can be split into more than one string. So there is still some other difference between them. What is it?
So, I cannot understand why the latter one works when the number of buffers is less than the number of directives, while the former one fails. Or can someone give me an input string that makes the latter one crash?
What does '|' mean in a format string. I cannot find official specification. Is there anyone who can give me some clue?
It means that the code expects a literal | in the input stream.
Having said that, the format specifier is not going to work.
The %[^a] part will capture all characters that are not a. That means it will capture even a | from the input stream. It will stop capturing when the character a is encountered in the stream. Of course that does not match the literal | in the format string. Hence, nothing after that will be processed.
If I provide the input def|akdk to the following program
#include <stdio.h>
int main()
{
char szA[256] = {0};
char temp[100] = {0};
int n = scanf("%[^a]|%s", szA, temp);
printf("%d\n%s\n%s\n", n, szA, temp);
}
I get the following output
1
def|
which makes perfect sense. BTW, the last line in the output is an empty line. I'm not sure how to show that in an answer.
When I change the scanf line to
int n = scanf("%[^a]a%s", szA, temp);
I get the following output
2
def|
kdk
which makes perfect sense.
It's not one of the format specifiers so it's a literal | character, meaning it must be present in the input stream. The official specification is the section entitled The fscanf function found in the ISO standard (e.g., C11 7.21.6.2) and the relevant section states:
The format is composed of zero or more directives: one or more white-space characters, an ordinary multibyte character (neither % nor a white-space character), or a conversion specification.
A directive that is an ordinary multibyte character is executed by reading the next characters of the stream. If any of those characters differ from the ones composing the directive, the directive fails and the differing and subsequent characters remain unread.
You can see the effect in the following complete program which fails to scan "four|1" when you're looking for the literal _ but works fine when you're looking for |.
#include <stdio.h>
int main(void) {
char cjunk[100];
int ijunk;
char inStr[] = "four|1";
if (sscanf(inStr, "%4s_%d", cjunk, &ijunk) != 2)
printf ("Could not scan\n");
if (sscanf(inStr, "%4s|%d", cjunk, &ijunk) == 2)
printf ("Scanned okay\n");
return 0;
}
So, after some discussion, my understanding is this: the latter format requires the remaining stream to start with '|' when it reaches the '|%s' part. But the preceding %[^a] directive consumes everything except 'a', so the remaining stream always starts with 'a' (or is empty). The literal '|' therefore never matches, the trailing %s is never processed, and nothing is written to the missing buffer. That is why it never crashes even though the second buffer is not supplied.

C++ tokenization

I am writing a lexer in C++ and I am reading from a file character by character. How do you do tokenization in this case? I can't use strtok since I have characters, not a string. Do I somehow need to keep reading until I reach a delimiter?
The answer is Yes. You need to keep reading until you hit a delimiter.
There are multiple solutions.
The simplest thing to do is exactly that: keep a buffer (std::string) of the characters you already read until you reach a delimiter. At that point, you build a token from the accumulated characters in the buffer, clear the buffer, and push the delimiter (if necessary) in the buffer.
Another solution would be to read ahead of time: i.e., pick up the entire line with std::getline (for example), and then check what's on this line. In general the end-of-line is a natural token delimiter.
This works well... when delimiters are easy.
Unfortunately some languages, like C++, have awkward grammars. For example, in C++ >> can be either:
the operator >> (for right-shift and stream extraction)
the end of two nested templates (ie could be rewritten as > >)
In those cases... well, just don't bother with the difference in the tokenizer, and let your AST building pass disambiguate, it's got more information.
Based on the information you provided:
If you want to read up to a delimiter from a file, use the getline(char *, int, char) member function.
getline() reads up to n-1 characters or up to a delimiter, whichever comes first.
Example:
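#include <fstream>
#include <iostream>
using namespace std;
int main()
{
    ifstream f("test.cpp");   // open the file for reading
    char c[2];                // room for 1 character plus the terminating null
    f.getline(c, 2, ' ');     // stores at most 1 character, stops early at a space
    cout << c;                // up to 1 char or until a space
}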

How does `cin>>` take three values though they are separated by space?

When looking at some code online I found
cin>>arr[0][0]>>arr[0][1]>>arr[0][2]
where I put a line of three integer values separated by space. I see that those three integers separated by space become the value of arr[0][0], arr[0][1] and arr[0][2].
It doesn't cause any trouble if there are more than one space between them.
Can anyone please explain to me how this works?
Most overloads of operator>> consume and discard all whitespace characters first thing. They begin parsing the actual value (say, an int) starting from the first non-whitespace character in the stream.
Reading almost any type of input from a stream will skip any leading whitespace first, unless you explicitly turn that feature off. You should read the std::basic_istream documentation for more information:
Extracts an integer value potentially skipping preceding whitespace. The value is stored to a given reference value.
This function behaves as a FormattedInputFunction. After constructing and checking the sentry object, which may skip leading whitespace, extracts an integer value by calling std::num_get::get().
The same applies to other stream input functions, including the scanf family where most format specifiers will consume all whitespace characters before reading the value:
All conversion specifiers other than [, c, and n consume and discard all leading whitespace characters (determined as if by calling isspace) before attempting to parse the input. These consumed characters do not count towards the specified maximum field width.

incorrect results with gets () due to the stray \n in istream BUT not with scanf() or cin?

In the program printed below the problem with gets () is that it takes the data for the first time only, and every subsequent call results in a null, due to the stray \n in the istream left while entering the number.
#include <stdio.h>
#include <iostream>
using namespace std;
int main()
{
    char s[20];   // name
    int a;        // phone number
    for (int i = 0; i < 5; i++)
    {
        printf("enter name");
        gets(s);
        printf("enter phone number");
        cin >> a;
    }
}
Now my question is that why isn't the same happening for when I use scanf() or cin ? I mean whats the difference in the way cin and gets() takes their values which enables cin (and scanf ) to successfully leave that stray \n but not gets() ?
PS: I know about fgets(), and that gets() is deprecated, and about its ill effects, and I generally don't use it either.
You're mixing line-oriented input with field-oriented input. Functions like gets(), fgets(), etc. read lines. They don't necessarily care about the contents of the line, but they'll read the whole line (provided there's space for it in some cases).
Field oriented inputs like cin's >> operator and scanf() don't care about lines, they care about fields. A call like scanf("%d %d %d", &x, &y, &z); doesn't care if they're on the same line or 3 separate lines (or even if you leave blank lines).
These field-oriented input functions tend to leave behind newline characters that will confuse line-oriented input functions. In general, you should avoid mixing the two. If you want to do both, it's often useful to read the line and then use sscanf() or stringstream to do field-based input from it. This also makes recovering from bad inputs a bit easier, and you won't have to worry about whether or not there are extra '\n' chars waiting for your next input function.
scanf and cin >> both read fields delimited by whitespace, and ignore leading whitespace. Now whitespace is spaces, tabs, AND NEWLINES, so extra newlines in the input don't bother them. More importantly, after reading something, they DO NOT read any of the following whitespace, instead leaving it on the input for the next call to read.
Now with scanf, you can tell it explicitly to read (and throw away) as much whitespace as it can by using a space at the end of the format string, but that will work poorly for interactive input, as it will KEEP trying to read until it gets some non-whitespace. So if you change the end of your loop to:
printf("enter phone number");
scanf("%d ", &a);
it will seem to hang after entering the phone number, as it's waiting for you to enter some non-whitespace, which will ultimately be read by the next loop iteration as the next name.
You CAN use scanf to do what you want (read and consume the text following the phone number up to the newline), but it's not pretty:
scanf("%d%*[^\n]", &a); scanf("%*1[\n]");
You actually need two scanf calls to consume the newline as well as any space or other cruft that might be on the line.
gets(), unlike fgets(), should replace a terminating newline with a null character.
This is from gets() man page:
The gets() function shall read bytes from the standard input stream, stdin, into the array pointed to by s, until a <newline> is read or an end-of-file condition is encountered. Any <newline> shall be discarded and a null byte shall be placed immediately after the last byte read into the array.
Though fgets() will pass the newline unchanged:
The fgets() function shall read bytes from stream into the array pointed to by s, until n-1 bytes are read, or a <newline> is read and transferred to s, or an end-of-file condition is encountered. The string is then terminated with a null byte.
So it seems that your implementation is non-conforming. Mine works as stated.