C++ - Reading a double, followed by a character, from cin

I'm trying to read a double, followed by a character, from cin using the snippet:
#include <iostream>
using namespace std;

int main() {
    double d;
    char c;
    while (true) {
        cin >> d >> c;
        cout << d << c << endl;
    }
}
The peculiar thing is that it works for some characters, but not for others. For example, it works for "2g", "2h", but fails for "2a", "2b", "2x" ...:
mwmbp:ppcpp mwisse$ ./a.out
2a
0
2b
0
2c
0
2g
2g
2h
2h
2i
0h
2x
0h
2z
2z
As pointed out by one of you, it does indeed work for integers. Do you know why it doesn't work for doubles? I have as yet been unable to find information on how cin interprets its input.

This is currently a bug in libc++, tracked on the LLVM bug tracker: https://llvm.org/bugs/show_bug.cgi?id=17782
Way back in 2014 it was reassigned from Howard Hinnant to Marshall Clow, and since then... well, don't hold your breath on this getting fixed any time soon.
EDIT:
The istream extraction operator internally uses num_get::do_get, which sequentially performs these tasks for a double:
1. Selects a conversion specifier; for double that's %lg
2. Tests for an empty input stream
3. Checks whether the next character in the stream is contained in the ctype or numpunct facets
4. Checks whether scanf would allow the character obtained in 3 to be appended to the input field given the conversion specifier obtained in 1; if so, 3 is repeated, if not, 5 is performed on the input field without this character
5. The double from the accepted input field is read in with:
scanf prior to C++11
strtold in C++11 and C++14
strtod from C++17 onward
6. If 5 fails, failbit is assigned to the istream's iostate, but if 5 succeeded, the result is assigned to the double
7. If any thousands separators were allowed into the input field by facet numpunct in 3, their positions are evaluated; if any of them violate the grouping rules of the facet, failbit is assigned to the istream's iostate
8. If the input field used in 5 was empty, eofbit is assigned to the istream's iostate
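To see these steps in action, here's a minimal sketch; on a conforming implementation it prints "2 a false", while the buggy libc++ described above fails the extraction of d instead:

#include <iostream>
#include <sstream>

int main() {
    std::istringstream in("2a");
    double d;
    char c;
    in >> d;    // steps 1-5: "2" is the accepted input field; 'a' is rejected in step 4
    in >> c;    // 'a' was left in the stream, so it is read as the char
    std::cout << d << ' ' << c << ' ' << std::boolalpha << in.fail() << '\n';
}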
That's a lot to say that for a double you're really concerned with scanf's %lg conversion specifier's rules for extraction of a double (which internally depend upon strtod's constraints):
An optional plus or minus character
One of the following:
"INF" or "INFINITY" (case-insensitive)
"NAN" (case-insensitive)
"0x" or "0X", an input field of hexadecimal digits optionally containing a decimal-point character, optionally followed by a "p" or "P", an optional plus or minus sign, and a decimal exponent
An input field of decimal digits optionally containing a decimal-point character, optionally followed by an "e" or "E", an optional plus or minus sign, and a non-empty exponent
Note that if your locale defines any other expression as an acceptable floating-point input field, that is also accepted. So if you've added some special sauce to the istream you're working with, that may be where the problem lies. Outside of that, a trailing "a", "b", or "x" is not an accepted suffix for the %lg conversion specifier, so either your implementation is not compliant or there's something else you're not telling us.
Here is a live example of your inputs succeeding on gcc5.1 which is compliant: http://ideone.com/nGGW0L
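For comparison, the same field rule can be observed directly through scanf; a small sketch, assuming a conforming C library:

#include <cstdio>

int main() {
    double d;
    char c;
    // %lg accumulates "2", rejects 'a' as the next field character,
    // and leaves it for the %c that follows.
    if (std::sscanf("2a", "%lg%c", &d, &c) == 2)
        std::printf("%g %c\n", d, c);   // prints: 2 a
}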

Since the problem is caused by a bug (or feature, depending on your point of view) in libc++, it seems that the easiest way to avoid it is to use libstdc++ instead until a fix is in place. If you're running on a Mac, add -stdlib=libstdc++ to your compile flags: g++ -stdlib=libstdc++ test.cpp will correctly compile the code given in this post.
libc++ appears to have other, similar bugs, one of which I posted here before learning about these different libraries: Trying to read lines from an ASCII file using C++, Ubuntu vs Mac...?

Related

Error subtracting hex constant when it ends in an 'E' [duplicate]

This question already has an answer here:
Why doesn't "0xe+1" compile?
int main()
{
    0xD-0; // Fine
    0xE-0; // Fails
}
This second line fails to compile on both clang and gcc. Any other hex constant ending is ok (0-9, A-D, F).
Error:
<source>:4:5: error: unable to find numeric literal operator 'operator""-0'
4 | 0xE-0;
| ^~~~~
I have a fix (adding a space after the constant and before the subtraction), so I'd mainly like to know why this happens. Is it something to do with the compiler thinking there's an exponent here?
https://godbolt.org/z/MhGT33PYP
Actually, this behaviour is mandated by the C++ standard (and documented), as strange as it may seem. This is because of how C++ tokenizes source using preprocessing tokens (a.k.a. pp-tokens).
If we look closely at how the compiler generates a token for numbers:
A preprocessing number is made up of a digit, optionally preceded by a period, and may be followed by letters, underscores, digits, periods, and any one of: e+ e- E+ E-.
According to this, the compiler reads 0x, then E-, which it interprets as part of the number, since E- is allowed in a numeric pp-token and no space precedes it or comes between the E and the - (this is why adding a space is an easy fix).
This means that 0xE-0 is taken in as a single preprocessing token. In other words, the compiler interprets it as one number, instead of two numbers 0xE and 0 with an operator - between them. The compiler therefore expects the E to introduce an exponent of a floating-point literal.
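A short illustration of the tokenization (the failing line is commented out so the sketch compiles):

int main()
{
    0xE - 0;    // OK: whitespace ends the pp-number after 0xE
    (0xE)-0;    // OK: ')' cannot be part of a pp-number
    // 0xE-0;   // error: lexed as the single pp-number "0xE-0"
}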
Now let's take a look at how C++ interprets floating-point literals. Look at the section under "Examples". It gives this curious code sample:
<< "\n0x1p5 " << 0x1p5 // double
<< "\n0x1e5 " << 0x1e5 // integer literal, not floating-point
E is interpreted as part of the integer literal, and does not make the number a hexadecimal floating literal! Therefore, the compiler recognizes 0xE as a single hexadecimal integer. Then there is the -0, which is technically part of the same preprocessing token and therefore is not an operator followed by another integer. Uh oh. This is now invalid, as there is no -0 suffix.
And so the compiler reports an error.

Error in getting ASCII of character in C++

I saw this question : How to convert an ASCII char to its ASCII int value?
The most voted answer (https://stackoverflow.com/a/15999291/14911094) states the solution as :
Just do this:
int(k)
But I am having issues with this.
My code is :
std::cout << char(144) << std::endl;
std::cout << (int)(char(144)) << std::endl;
std::cout << int('É') << std::endl;
Now the output comes as :
É
-112
-55
Now I can understand the first line, but what is happening on the second and the third lines?
Firstly, how can an ASCII value be negative, and secondly, how can it be different for the same character?
Also, as far as I have tested, this is not some random garbage from memory, as it stays the same every time I run the program. Also:
If I change it to 145:
æ
-111
The output also changes by 1, so as far as I can guess this may be due to some kind of overflow.
But I cannot see how, as I am converting to int, and that should be enough (4 bytes) to store the result.
Can anyone suggest a solution?
If your platform is using ASCII for the character encoding (most do these days), then bear in mind that ASCII is only a 7-bit encoding.
It so happens that char is a signed type on your platform. (The signedness or otherwise of char doesn't matter for ASCII as only the first 7 bits are required.)
Hence char(144) gives you a char with a value of -112. (You have a 2's complement char type on your platform: from C++14 you can assume that, but you can't in C).
The third line implies that that character (which is not in the ASCII set) has a value of -55.
int(unsigned char('É'))
would force it to a positive value on all but the most exotic of platforms.
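A minimal sketch of the difference, assuming a platform where plain char is signed and 8 bits wide:

#include <iostream>

int main() {
    char c = char(144);     // implementation-defined: -112 where char is a signed 8-bit type
    std::cout << int(c) << '\n';                              // -112
    std::cout << int(static_cast<unsigned char>(c)) << '\n';  // 144
}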
The C++ standard only guarantees that characters in the basic execution character set¹ have non-negative encodings. Characters outside that basic set may have negative encodings - it depends on the locale.
¹ Upper- and lowercase Latin alphabet, decimal digits, most punctuation, and control characters like tab, newline, form feed, etc.

How can I force the user/OS to input an ASCII string

This is an extended question of this one: Is std::string suppose to have only Ascii characters
I want to build a simple console application that takes input from the user as a set of characters. Those characters include the digits 0-9 and the letters a-z.
I am dealing with the input supposing that it is ASCII. For example, I am using something like static_cast<unsigned int>(my_char - '0') to get the number as an unsigned int.
How can I make this code cross-platform? How can I tell that I want the input to always be ASCII? Or have I missed a lot of concepts, and is static_cast<unsigned int>(my_char - '0') just a bad way?
P.S. In ASCII (at least), the digits are in sequential order. However, in other encodings, I do not know if they are. (I am pretty sure they are, but there is no guarantee, right?)
How can I force the user/OS to input an ASCII string
You cannot, unless you let the user specify the numeric values of such ASCII input.
It all depends on how the terminal implementation used to serve std::cin translates keystrokes like 0 to a specific number, and what your toolchain expects to match that number with its intrinsic translation for '0'.
You simply shouldn't expect specific ASCII values (e.g. magic numbers), but use char literals to provide portable code. The assumption that my_char - '0' will result in the actual digit's value is true for all character sets. The C++ standard states in [lex.charset]/3 that
The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.[...]
emphasis mine
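So a sketch like the following is portable with no assumption of ASCII:

#include <iostream>

int main() {
    char my_char = '7';
    // Guaranteed by [lex.charset]/3: '0'..'9' have contiguous ascending
    // values in every conforming character set, so this yields 7 everywhere.
    unsigned int value = static_cast<unsigned int>(my_char - '0');
    std::cout << value << '\n';
}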
You can't force or even verify that beforehand. An "evil user" can always sneak a UTF-8 encoded string into your application with no characters above U+7F, and such a string happens to also be ASCII-encoded.
Also, whatever platform-specific measure you take, a user can pipe in a UTF-16LE encoded file. Or /dev/urandom.
You're mistaking string encoding for some magic property of an input stream - and it is not. It is, well, an encoding, like JPEG or AVI, and must be handled exactly the same way - read the input, match it against the format, and report errors on parsing failure.
For your case, if you want to accept only ASCII, read the input stream byte by byte and throw/exit with an error if you ever encounter a byte with a value outside the ASCII domain (a minimal sketch of this check follows below).
However, if you later encounter a terminal providing data with some incompatible encoding, like UTF-16LE, you have no choice but to write a detection (based on the byte order mark) and a conversion routine.
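Here is that minimal sketch; read_ascii_line is just a hypothetical helper name:

#include <iostream>
#include <stdexcept>
#include <string>

// Reject any input byte outside the 7-bit ASCII range.
std::string read_ascii_line(std::istream& in) {
    std::string line;
    std::getline(in, line);
    for (unsigned char byte : line) {
        if (byte > 0x7F)
            throw std::runtime_error("non-ASCII byte in input");
    }
    return line;
}

int main() {
    try {
        std::cout << "OK: " << read_ascii_line(std::cin) << '\n';
    } catch (const std::exception& e) {
        std::cerr << e.what() << '\n';
    }
}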

Characters extracted by istream >> double

Sample code at Coliru:
#include <iostream>
#include <sstream>
#include <string>
int main()
{
    double d; std::string s;
    std::istringstream iss("234cdefipxngh");
    iss >> d;
    iss.clear();
    iss >> s;
    std::cout << d << ", '" << s << "'\n";
}
I'm reading off N3337 here (presumably that is the same as C++11). In [istream.formatted.arithmetic] we have (paraphrased):
operator>>(double& val);
As in the case of the inserters, these extractors depend on the locale’s num_get<> (22.4.2.1) object
to perform parsing the input stream data. These extractors behave as formatted input functions (as
described in 27.7.2.2.1). After a sentry object is constructed, the conversion occurs as if performed by the following code fragment:
typedef num_get< charT,istreambuf_iterator<charT,traits> > numget;
iostate err = iostate::goodbit;
use_facet< numget >(loc).get(*this, 0, *this, err, val);
setstate(err);
Looking over to 22.4.2.1:
The details of this operation occur in three stages
— Stage 1: Determine a conversion specifier
— Stage 2: Extract characters from in and determine a corresponding char value for the format
expected by the conversion specification determined in stage 1.
— Stage 3: Store results
In the description of Stage 2, it's too long for me to paste the whole thing here. However it clearly says that all characters should be extracted before conversion is attempted; and further that exactly the following characters should be extracted:
any of 0123456789abcdefxABCDEFX+-
The locale's decimal_point()
The locale's thousands_sep()
Finally, the rules for Stage 3 include:
— For a floating-point value, the function strtold.
The numeric value to be stored can be one of:
— zero, if the conversion function fails to convert the entire field.
This all seems to clearly specify that the output of my code should be 0, 'ipxngh'. However, it actually outputs something else.
Is this a compiler/library bug? Is there any provision that I'm overlooking for a locale to change the behaviour of Stage 2? (In another question someone posted an example of a system that does actually extract the characters, but also extracts ipxn which are not in the list specified in N3337).
Update
As pointed out by perreal, this text from Stage 2 is relevant:
If discard is true, then if ’.’ has not yet been accumulated, then the position of the character
is remembered, but the character is otherwise ignored. Otherwise, if ’.’ has already been
accumulated, the character is discarded and Stage 2 terminates. If it is not discarded, then a
check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1. If so, it is accumulated.
If the character is either discarded or accumulated then in is advanced by ++in and processing
returns to the beginning of stage 2.
So, Stage 2 can terminate if the character is in the list of allowed characters but is not a valid character for %g. It doesn't say so exactly, but presumably this refers to the definition of fscanf from C99, which allows:
a nonempty sequence of decimal digits optionally containing a decimal-point
character, then an optional exponent part as defined in 6.4.4.2;
a 0x or 0X, then a nonempty sequence of hexadecimal digits optionally containing a
decimal-point character, then an optional binary exponent part as defined in 6.4.4.2;
INF or INFINITY, ignoring case
NAN or NAN(n-char-sequence_opt), ignoring case in the NAN part
and also
In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.
So, actually the Coliru output is correct; and in fact the processing must attempt to validate the sequence of characters extracted so far as a valid input to %g, while extracting each character.
Next question: is it permitted, as in the thread I linked to earlier, to accept i, n, p, etc. in Stage 2?
These are valid characters for %g; however, they are not in the list of atoms which Stage 2 is allowed to read (i.e. c == 0 in my latest quote, so the character is neither discarded nor accumulated).
This is a mess, because it's likely that neither gcc/libstdc++'s nor clang/libc++'s implementation is conforming. It's unclear what "a check is made to determine if c is allowed as the next character of an input field of the conversion specifier returned by Stage 1" means, but I think the use of the phrase "next character" indicates that the check should be context-sensitive (i.e., dependent on the characters that have already been accumulated), and so an attempt to parse, e.g., "21abc" should stop when 'a' is encountered. This is consistent with the discussion in LWG issue 2041, which added this sentence back to the standard after it had been deleted during the drafting of C++11. libc++'s failure to do so is bug 17782.
libstdc++, on the other hand, refuses to parse "0xABp-4" past the 0, which is actually clearly nonconforming based on the standard (it should parse "0xAB" as a hexfloat, as clearly allowed by the C99 fscanf specification for %g).
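A quick way to check your own library: on a conforming implementation this sketch prints 10.6875 (0xAB is 171, scaled by 2^-4), while the libstdc++ behaviour described above stops at the leading 0 and prints 0:

#include <iostream>
#include <sstream>

int main() {
    std::istringstream iss("0xABp-4");
    double d = 0;
    iss >> d;
    std::cout << d << '\n';
}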
The accepting of i, p, and n is not allowed by the standard. See LWG issue 2381.
The standard describes the processing very precisely - it must be done "as if" by the specified code fragment, which does not accept those characters. Compare the resolution of LWG issue 221, in which they added x and X to the list of characters because num_get as then-described won't otherwise parse 0x for integer inputs.
Clang/libc++ accepts "inf" and "nan" along with hexfloats but not "infinity" as an extension. See bug 19611.
At the end of stage 2, it says:
If it is not discarded, then a check is made to determine if c is
allowed as the next character of an input field of the conversion
specifier returned by Stage 1. If so, it is accumulated.
If the character is either discarded or accumulated then in is advanced by ++in and processing returns to the beginning of stage 2.
So perhaps a is not allowed as the next character of an input field for the %g specifier, so it is neither accumulated nor discarded, and Stage 2 terminates.

What information is used when parsing a (float) number?

What information does the Standard library of C++ use when parsing a (float) number?
Here are the possibilities I know of to parse a (single) float number with standard C++:
double atof( const char *str )
sscanf
double strtod( const char* str, char** str_end );
istringstream, via operator>> or
via num_get directly
It seems obvious, that at the very least, we have to know what character is used as decimal separator.
iostreams, in particular num_get::get, in addition talk about:
ios_base I/O format flags - Is there any information here that is used when parsing floating point?
the thousands_separator (* see below)
On the other hand, for std::strtod, which seems to be what sscanf is defined in terms of (and which in turn is referenced by num_get), the only variable information seems to be what is considered a space and what the decimal character is, although it doesn't seem to be specified where that is defined. (At least neither on cppref nor on MSDN.)
So, what information is actually used, and what comprises a valid parseable float representation for the C++ Standard lib?
From what I see, only the decimal separator from the global locale (the C one or the C++ one???) is needed and, in addition, if the number contains a thousands separator, I would expect it to be parsed correctly only by num_get, since strtod/sscanf do not support the thousands separator.
(*) The group (thousands) separator is an interesting case to me. As far as I can tell, the "C" functions do not make any reference to it, and last time I checked, the C and C++ standard printf functions will never write it. So is it really processed by the strtod/scanf functions? (I know that there is a POSIX printf extension for the group separator, but that's not really standard, and it's notably missing from Microsoft's implementation.)
The C11 spec for strtod() seems to have an opening big enough for any size truck to drive through. It appears so open-ended, I see no limitation.
§7.22.1.3 6 In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.
For non- "standard C" locales, the isspace(), decimal (radix) point, group separator, digits per group and sign seem to constitute the typical variants. But apparently there is no limit.
For fun experimented with 500+ locales using printf(), sscanf(), strftime() and isspace().
All tested locales had a radix (decimal) point of '.' or ',', the same +/- sign, no digit grouping, and the expected 0-9.
strftime(... "%Y" ...) did not use a digit separator over years 1000-99999.
sscanf("1,234.5", "%lf", .. and sscanf("1.234,5", "%lf", .. did not produce 1234.5 in any locale.
All int values in the range 0 to 255 produced the same isspace() results, with the occasional exception of 154 and 160.
Of course these tests do not prove a limit to what may occur, but they do represent a sample of the possibilities.