Sample code at Coliru:
#include <iostream>
#include <sstream>
#include <string>
int main()
{
    double d;
    std::string s;
    std::istringstream iss("234cdefipxngh");
    iss >> d;
    iss.clear();
    iss >> s;
    std::cout << d << ", '" << s << "'\n";
}
I'm reading off N3337 here (presumably that is the same as C++11). In [istream.formatted.arithmetic] we have (paraphrased):
operator>>(double& val);
As in the case of the inserters, these extractors depend on the locale’s num_get<> (22.4.2.1) object
to perform parsing the input stream data. These extractors behave as formatted input functions (as
described in 27.7.2.2.1). After a sentry object is constructed, the conversion occurs as if performed by the following code fragment:
typedef num_get< charT,istreambuf_iterator<charT,traits> > numget;
iostate err = iostate::goodbit;
use_facet< numget >(loc).get(*this, 0, *this, err, val);
setstate(err);
Looking over to 22.4.2.1:
The details of this operation occur in three stages
— Stage 1: Determine a conversion specifier
— Stage 2: Extract characters from in and determine a corresponding char value for the format
expected by the conversion specification determined in stage 1.
— Stage 3: Store results
The description of Stage 2 is too long to paste in full here. However, it clearly says that all characters should be extracted before conversion is attempted, and further that exactly the following characters should be extracted:
any of 0123456789abcdefxABCDEFX+-
The locale's decimal_point()
The locale's thousands_sep()
Finally, the rules for Stage 3 include:
— For a floating-point value, the function strtold.
The numeric value to be stored can be one of:
— zero, if the conversion function fails to convert the entire field.
This all seems to clearly specify that the output of my code should be 0, 'ipxngh'. However, it actually outputs something else.
Is this a compiler/library bug? Is there any provision that I'm overlooking for a locale to change the behaviour of Stage 2? (In another question someone posted an example of a system that does actually extract the characters, but also extracts ipxn which are not in the list specified in N3337).
Update
As pointed out by perreal, this text from Stage 2 is relevant:
If discard is true, then if ’.’ has not yet been accumulated, then the position of the character
is remembered, but the character is otherwise ignored. Otherwise, if ’.’ has already been
accumulated, the character is discarded and Stage 2 terminates. If it is not discarded, then a
check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1. If so, it is accumulated.
If the character is either discarded or accumulated then in is advanced by ++in and processing
returns to the beginning of stage 2.
So Stage 2 can terminate if the character is in the list of allowed characters but is not a valid next character for %g. The text doesn't say so exactly, but presumably this refers to the definition of fscanf from C99, which allows:
a nonempty sequence of decimal digits optionally containing a decimal-point
character, then an optional exponent part as defined in 6.4.4.2;
a 0x or 0X, then a nonempty sequence of hexadecimal digits optionally containing a
decimal-point character, then an optional binary exponent part as defined in 6.4.4.2;
INF or INFINITY, ignoring case
NAN or NAN(n-char-sequence_opt), ignoring case in the NAN part, where:
and also
In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.
So, actually the Coliru output is correct; and in fact the processing must attempt to validate the sequence of characters extracted so far as a valid input to %g, while extracting each character.
Next question: is it permitted, as in the thread I linked to earlier, to accept i, n, p, etc. in Stage 2?
These are valid characters for %g; however, they are not in the list of atoms which Stage 2 is allowed to read (i.e. c == 0 for my latest quote, so the character is neither discarded nor accumulated).
This is a mess because it's likely that neither gcc/libstdc++'s nor clang/libc++'s implementation is conforming. It's unclear what "a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1" means, but I think that the use of the phrase "next character" indicates that the check should be context-sensitive (i.e., dependent on the characters that have already been accumulated), and so an attempt to parse, e.g., "21abc" should stop when 'a' is encountered. This is consistent with the discussion in LWG issue 2041, which added this sentence back to the standard after it had been deleted during the drafting of C++11. libc++'s failure to do so is bug 17782.
libstdc++, on the other hand, refuses to parse "0xABp-4" past the 0, which is actually clearly nonconforming based on the standard (it should parse "0xAB" as a hexfloat, as clearly allowed by the C99 fscanf specification for %g).
The accepting of i, p, and n is not allowed by the standard. See LWG issue 2381.
The standard describes the processing very precisely - it must be done "as if" by the specified code fragment, which does not accept those characters. Compare the resolution of LWG issue 221, in which they added x and X to the list of characters because num_get as then-described won't otherwise parse 0x for integer inputs.
Clang/libc++ accepts "inf" and "nan" along with hexfloats but not "infinity" as an extension. See bug 19611.
At the end of stage 2, it says:
If it is not discarded, then a check is made to determine if c is
allowed as the next character of an input field of the conversion
specifier returned by Stage 1. If so, it is accumulated.
If the character is either discarded or accumulated then in is advanced by ++in and processing returns to the beginning of stage 2.
So perhaps 'a' is not allowed as the next character for the %g specifier, and so it is neither accumulated nor discarded.
Related
char szA[256]={0};
scanf("%[^a]%s",&szA); //failed when trailing string
scanf("%[^a]|%s",&szA); //worked whatever the input
What does '|' mean in a format string? I cannot find an official specification. Can anyone give me a clue?
When I enter input containing several '|' characters, the latter call still works (meaning only that the program does not crash). Doesn't it need two buffers after the format string? The former call crashed when the input could be split into more than one string, so there must be some other difference between them. What is it?
So I cannot understand why the latter call works when there are fewer buffers than directives, while the former fails. Alternatively, can someone give me an input string that makes the latter one crash?
What does '|' mean in a format string? I cannot find an official specification. Can anyone give me a clue?
It means that the code expects a literal | in the input stream.
Having said that, the format specifier is not going to work.
The %[^a] part will capture all characters that are not a. That means it will capture even a | from the input stream. It will stop capturing when the character a is encountered in the stream. Of course that does not match the literal | in the format string. Hence, nothing after that will be processed.
If I provide the input def|akdk to the following program
#include <stdio.h>
int main()
{
    char szA[256] = {0};
    char temp[100] = {0};
    int n = scanf("%[^a]|%s", szA, temp);
    printf("%d\n%s\n%s\n", n, szA, temp);
}
I get the following output
1
def|
which makes perfect sense. BTW, the last line in the output is an empty line. I'm not sure how to show that in an answer.
When I change the scanf line to
int n = scanf("%[^a]a%s", szA, temp);
I get the following output
2
def|
kdk
which makes perfect sense.
It's not one of the format specifiers so it's a literal | character, meaning it must be present in the input stream. The official specification is the section entitled The fscanf function found in the ISO standard (e.g., C11 7.21.6.2) and the relevant section states:
The format is composed of zero or more directives: one or more white-space characters, an ordinary multibyte character (neither % nor a white-space character), or a conversion specification.
A directive that is an ordinary multibyte character is executed by reading the next characters of the stream. If any of those characters differ from the ones composing the directive, the directive fails and the differing and subsequent characters remain unread.
You can see the effect in the following complete program which fails to scan "four|1" when you're looking for the literal _ but works fine when you're looking for |.
#include <stdio.h>
int main(void) {
    char cjunk[100];
    int ijunk;
    char inStr[] = "four|1";
    if (sscanf(inStr, "%4s_%d", cjunk, &ijunk) != 2)
        printf ("Could not scan\n");
    if (sscanf(inStr, "%4s|%d", cjunk, &ijunk) == 2)
        printf ("Scanned okay\n");
    return 0;
}
So, after some discussion, my understanding is this: the latter format requires the remaining stream to start with '|' when it reaches the '|%s' part, but the preceding %[^a] directive stops only at 'a', so the remaining stream always starts with 'a'. The trailing directive therefore never matches and never needs to store anything into a buffer, which is why it never crashes even though the buffer is not supplied.
I was working on this answer. And I ran into a conundrum: scanf has an assignment suppressing '*':
If this option is present, the function does not assign the result of the conversion to any receiving argument
But when used in get_time the '*' gives a run-time error on Visual Studio, libc++, and libstdc++: str >> get_time(&tmbuf, "%T.%*Y") so I believe it's not supported.
As such I chose to ignore input by reading into tmbuf.tm_year twice:
str >> get_time(&tmbuf, "%H:%M:%S.%Y UTC %b %d %Y");
This works and seems to be my only option so far as get_time goes since the '*' isn't accepted. But as we all know, just because it works doesn't mean it's defined. Can someone confirm that:
It is defined to assign the same variable twice in get_time
The stream will always be read left-to-right, so the first occurrence of %Y will be stomped, not the second
The standard specifies the exact algorithm of processing the format string of get_time in 22.4.5.1.1 time_get members. (time_get::get is what eventually gets called when you do str>>get_time(...)). I quote the important parts:
The function starts by evaluating err = ios_base::goodbit. It then enters a loop, reading zero or more characters from s at each iteration. Unless otherwise specified below, the loop terminates when the first of the following conditions holds:
(8.1) — The expression fmt == fmtend evaluates to true.
skip boring error-handling parts
(8.4) — The next element of fmt is equal to ’%’, optionally followed by a modifier character, followed by a conversion specifier character, format, together forming a conversion specification valid for the ISO/IEC 9945 function strptime. skip boring error-handling parts the function evaluates s = do_get(s, end, f, err, t, format, modifier) skip more boring error-handling parts the function increments fmt to point just past the end of the conversion specification and continues looping.
As can be seen from the description, the format string is processed strictly sequentially, left to right. There's no provision to handle repeated conversion specifications specially. So the answer must be yes: what you have done is well defined and perfectly legal.
I'm trying to read a double, followed by a character, from cin using the snippet:
double d;
char c;
while (1) {
    cin >> d >> c;
    cout << d << c << endl;
}
The peculiar thing is that it works for some characters, but not for others. For example, it works for "2g", "2h", but fails for "2a", "2b", "2x" ...:
mwmbp:ppcpp mwisse$ ./a.out
2a
0
2b
0
2c
0
2g
2g
2h
2h
2i
0h
2x
0h
2z
2z
As pointed out by one of you, it does indeed work for integers. Do you know why it doesn't work for doubles? I have as yet been unable to find information on how cin interprets its input.
This is currently a bug on LLVM: https://llvm.org/bugs/show_bug.cgi?id=17782
Way back in 2014 it was reassigned from Howard Hinnant to Marshall Clow, and since then... well, don't hold your breath on this getting fixed any time soon.
EDIT:
The istream extraction operator internally uses num_get::do_get, which sequentially performs these tasks for a double:
1. Selects a conversion specifier; for double that's %lg
2. Tests for an empty input stream
3. Checks if the next character in the string is contained in the ctype or numpunct facets
4. Checks whether scanf would allow the character obtained in 3 to be appended to the input field, given the conversion specifier obtained in 1; if so, 3 is repeated, and if not, 5 is performed on the input field without this character
5. Reads the double from the accepted input field with:
   - scanf prior to C++11
   - strtold in C++11 and C++14
   - strtod from C++17 onward
6. If 5 fails, failbit is assigned to the istream's iostate; if 5 succeeds, the result is assigned to the double
7. If any thousands separators were allowed into the input field by the numpunct facet in 3, their positions are evaluated; if any of them violate the grouping rules of the facet, failbit is assigned to the istream's iostate
8. If the input field used in 5 was empty, eofbit is assigned to the istream's iostate
That's a lot to say that for a double you're really concerned with scanf's %lg conversion specifier's rules for extraction of a double (which internally will depend upon strtof's constraints):
1. An optional plus or minus character
2. One of the following:
   - "INF" or "INFINITY" (case insensitive)
   - "NAN" (case insensitive)
   - "0x" or "0X", then an input field of hexadecimal digits optionally containing a decimal point character, optionally followed by a "p" or "P", an optional plus or minus sign, and a decimal exponent
   - An input field of decimal digits optionally containing a decimal point character, optionally followed by an "e" or "E", an optional plus or minus sign, and a non-empty exponent
Note that if your locale defines any other expression as an acceptable floating point input field, this is also accepted. So if you've added some special sauce to the istream you're working with, that may be where the problem lies. Outside of that, none of a trailing "a", "b", or "x" is an accepted suffix for the %lg conversion specifier, so either your implementation is not compliant or there's something else you're not telling us.
Here is a live example of your inputs succeeding on gcc5.1 which is compliant: http://ideone.com/nGGW0L
Since the problem is caused by a bug (or feature, depending on your point of view), in libc++, it seems that the easiest way to avoid it is to use libstdc++ instead, until a fix is in place. If you're running on a mac, add -stdlib=libstdc++ to your compile flags. g++ -stdlib=libstdc++ test.cpp will correctly compile the code given in this post.
Libc++ appears to have other, similar, bugs, one of which I posted here: Trying to read lines from an ASCII file using C++ , Ubuntu vs Mac...?, before learning about these different libraries.
Is it perfectly ok (= well defined behaviour according to the standard) to call :
mystream.read(buffer, 0);
or
mystream.write(buffer, 0);
(and of course nothing will be read or written).
I would like to know if I have to test if the provided size is null before calling one of these two functions.
Yes, the behavior is well-defined: both functions will go through the motions for unformatted input/output functions (constructing the sentry, setting failbit if eofbit is set, flushing the tied stream if necessary), and then they will get to this clause:
§27.7.2.3[istream.unformatted]/30
Characters are extracted and stored until either of the following occurs:
— n characters are stored;
§27.7.3.7[ostream.unformatted]/5
Characters are inserted until either of the following occurs
— n characters are inserted;
"zero characters are stored/inserted" is true before anything is stored or inserted.
Looking at actual implementations, I see for (; gcount < n; ++gcount) in libc++, and sgetn(buffer, n); in libstdc++, which has the equivalent loop.
Another excerpt, from 27.7.2.3 Unformatted input functions/1, gives us a clue that zero-size input buffers are a valid case:
unformatted input functions taking a character array of non-zero size as an argument shall also store a null character (using charT()) in the first location of the array.
In the C++ standard (section 27.6.1.3/24), for the
istream ignore() function in the IOStreams library, it implies that if you supply an argument for 'n' of numeric_limits<streamsize>::max(), it will continue to ignore characters
forever up until the delimiter is found, even way beyond the actual
max value for streamsize (i.e. the 'n' argument is interpreted as infinite).
For the gcc implementation this does indeed appear to be how
ignore() is implemented, but I'm still unclear as to
whether this is implementation specific, or mandated by the standard.
Can someone who knows this well confirm that this is guaranteed by a
standard compliant iostreams library?
The standard says that numeric_limits<streamsize>::max() is a special value that doesn't affect the number of characters skipped.
Effects: Behaves as an unformatted input function (as described in 27.7.2.3, paragraph 1). After constructing a sentry object, extracts characters and discards them. Characters are extracted until any of the following occurs:
-- if n != numeric_limits<streamsize>::max() (18.3.2), n characters are extracted
-- end-of-file occurs on the input sequence (in which case the function calls setstate(eofbit), which may throw ios_base::failure (27.5.5.4));
-- traits::eq_int_type(traits::to_int_type(c), delim) for the next available input character c (in which case c is extracted).
According to here:
istream& istream::ignore ( streamsize n = 1, int delim = EOF );
Extract and discard characters
Extracts characters from the input sequence and discards them.
The extraction ends when n characters have been extracted and discarded or when the character delim is found, whichever comes first. In the latter case, the delim character itself is also extracted.
In your case, when numeric_limits<streamsize>::max() characters have been extracted, the first condition is met.
[Per Bo]
However, according to spec, the above case is applied only when n is not equal to numeric_limits<streamsize>::max().