I'm teaching myself a little flex/bison for fun. I'm writing an interpreter for the 1975 version of MS Extended BASIC (Extended as in "has strings"). I'm slightly stumped by one issue though.
Floats can be identified by looking for a . or an E (etc.), falling back to an int otherwise. So I did this...
[0-9]*[0-9.][0-9]*([Ee][-+]?[0-9]+)? {
    yylval.d = atof(yytext);
    return FLOAT;
}
[0-9]+ {
    yylval.i = atoi(yytext);
    return INT;
}
Sub-fields in the yylval union are .d for double, .i for int, and .s for string.
But it is also possible that you need to use a float because the number is too large to store in an int - which in this case is a 16-bit signed.
Is there a way to do this in the regex? Or do I have to do it in the associated C-side code with an if?
If you want integer to take priority over float (so that a literal which looks like an integer is an integer), then you need to put the integer pattern first. (The pattern with the longest match always wins, but if two patterns both match the same longest prefix, the first one wins.) So your basic outline is:
integer-pattern { /* integer rule */ }
float-pattern { /* float rule */ }
Your float rule looks reasonable, but note that it will match a single ., possibly followed by an exponent. Very few languages consider a lone . as a floating point constant (that literal is conventionally written as 0 :-) ) So you might want to change it to something like
[0-9]*([0-9]\.?|\.[0-9])[0-9]*([Ee][-+]?[0-9]+)?
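If you want to sanity-check that change outside flex, the same pattern happens to be valid ECMAScript regex syntax, so a tiny C++ harness (just a throwaway sketch, not part of the lexer) can show what it accepts:

#include <iostream>
#include <regex>

int main() {
    // Same pattern as the proposed flex rule (also valid ECMAScript syntax).
    std::regex float_pat(R"([0-9]*([0-9]\.?|\.[0-9])[0-9]*([Ee][-+]?[0-9]+)?)");
    for (const char *s : {"3.14", ".5", "2.", "1e10", ".", "42"})
        std::cout << s << " -> "
                  << (std::regex_match(s, float_pat) ? "matches" : "no match") << '\n';
}

A bare . no longer matches, while a plain 42 still does - which is exactly why the integer rule has to come first.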
To use a regex to match a non-negative integer which fits into a 16-bit signed int, you can use the following ugly pattern:
0*([12]?[0-9]{1,4}|3(2(7(6[0-7]|[0-5][0-9])|[0-6][0-9]{2})|[0-1][0-9]{3}))
(F)lex will produce efficient code to implement this regex, but that doesn't necessarily make it a good idea.
Notes:
The pattern recognises integers with redundant leading zeros, like 09. Some languages (like C) consider that to be an invalid octal literal, but I don't think Basic has that restriction.
The pattern does not recognise 32768, since that's too big to be a positive integer. However, it is not too big to be a negative integer; -32768 would be perfectly fine. This is always a corner case in parsing integer literals. If you were just lexing integer literals, you could easily handle the difference between positive and negative limits by having a separate pattern for literals starting with a -, but including the sign in the integer literal is not appropriate for expression parsers, since it produces an incorrect lexical analysis of a-1. (It would also be a bit weird for -32768 to be a valid integer literal, while - 32768 is analysed as a floating point expression which evaluates to -32768.0.) There's really no good solution here, unless your language includes unsigned integer literals (like C), in which case you could analyse literals from 0 to 32767 as signed integers; from 32768 to 65535 as unsigned integers; and from 65536 and above as floating point.
The literals for integer and floating point numbers are the same in many programming languages. For example, the Java Language Specification (and several others) contains grammar rules for integer and floating-point literals. In those rules, 0 does not validate as a floating-point literal, yet your float pattern (listed first) happily matches it. That's the main problem I see with your current approach.
When parsing, you should not use atoi or atof since they don't check for errors. Use strtoul and strtod instead.
The action for the integer pattern should be something along these lines (note yylval rather than llval, that errors are actually checked, and that <errno.h> is needed):

errno = 0;
unsigned long n = strtoul(yytext, NULL, 10);
if (errno == 0 && n < 0x8000) {   /* fits in a 16-bit signed int */
    yylval.i = (int)n;
    return INT;
}
yylval.d = strtod(yytext, NULL);  /* must succeed: the pattern matched digits */
return FLOAT;
I have searched for this for quite some time before posting this question. The answer to it should be fairly easy though, since I am an ultra-beginner atm.
I have a char* into which I want a user to put some digits (more than 20), which I then want to access individually.
This is what I've tried:
char* digits = GetString();
int prime = digits[0];
When I verify whether this worked with printf I find prime to have become 0.
printf("prime:%d\ndigits:%c\n",prime, digits[0]);
Why would this be and what could I do to make this work?
Edit: Is it perhaps easier to make an int array and use GetLongLong?
Neither C nor C++ guarantees what value will be used to encode the character '0', but both guarantee that the digits will be contiguous and ordered, so (for example) digits[0] - 48 may or may not work, but digits[0] - '0' is guaranteed to work (presuming that digits[0] actually holds a digit, of course).
The precise requirement in the C++ standard (§2.3/3) is:
In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous.
At least as of C99, the C standard has identical wording, but at §5.2.1/3.
The character zero ('0') has the numeric value 48 in ASCII, '1' is 49, and so on.
You may find this a useful idiom for getting the numeric value from the ASCII value:
int prime = digits[0] - '0';
You may also find looking at man ascii informative (or similar man page if you use some other charset).
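Putting it together, a minimal sketch (the string literal stands in for whatever GetString() returned, and is assumed to contain only decimal digits):

#include <cstdio>
#include <cstring>

int main() {
    const char *digits = "20177018830980853271";  // stand-in for GetString()
    for (std::size_t i = 0; i < std::strlen(digits); ++i) {
        int value = digits[i] - '0';  // portable: decimal digits are contiguous and ordered
        std::printf("digits[%zu] = %d\n", i, value);
    }
}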
I was given the following assignment, but I do not understand exactly what the problem is...
Write a program that reads a string consisting of a positive integer or a positive decimal number and converts the numbers to the number format.
From my perspective, what it is saying is that I need to read a string consisting of an integer/decimal and convert it to an integer or a double... is that right? It seems so easy, because I can just use strtol() and strtod().
It depends on whether you are allowed to use the standard library functions or not. If so, then, yes, it is just as easy as you described. If not, you will have to parse the string looking for decimal points, minus signs, etc. and convert by your own algorithms.
I'm not writing an answer for you, but basically your assignment asks you to convert strings of the form, say:
"123"
"-321"
"+3219871230"
"0123"
to the corresponding numbers.
Also, since this is an assignment, your professor probably isn't interested in seeing solutions that use the library functions strtod, strtol, etc.
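As a rough illustration of the hand-rolled route, the core of such a conversion is just repeated multiply-by-ten. This sketch deliberately ignores overflow and malformed input, which a real solution would have to handle:

#include <iostream>

// Sketch only: optional sign, decimal digits, optional fractional part.
double parse_number(const char *s) {
    int sign = 1;
    if (*s == '+' || *s == '-') {
        if (*s == '-') sign = -1;
        ++s;
    }
    double value = 0.0;
    while (*s >= '0' && *s <= '9')
        value = value * 10.0 + (*s++ - '0');
    if (*s == '.') {
        ++s;
        for (double place = 0.1; *s >= '0' && *s <= '9'; place /= 10.0)
            value += (*s++ - '0') * place;
    }
    return sign * value;
}

int main() {
    std::cout << parse_number("-321") << ' ' << parse_number("12.75") << '\n';  // -321 12.75
}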
What information does the Standard library of C++ use when parsing a (float) number?
Here are the possibilities I know of for parsing a (single) float number with standard C++:
double atof( const char *str )
sscanf
double strtod( const char* str, char** str_end );
istringstream, via operator>> or
via num_get directly
It seems obvious, that at the very least, we have to know what character is used as decimal separator.
iostreams, in particular num_get::get, additionally talk about:
ios_base I/O format flags - Is there any information here that is used when parsing floating point?
the thousands_separator (* see below)
On the other hand, in std::strtod, which seems to be what sscanf is defined in terms of (and which in turn is referenced by num_get), the only variable information seems to be what is considered a space and what the decimal character is, although it doesn't seem to be specified where that is defined. (At least neither on cppref nor on MSDN.)
So, what information is actually used, and what comprises a valid parseable float representation for the C++ Standard lib?
From what I see, only the decimal separator from the global locale (C or C++?) is needed and, in addition, if the number contains a thousands separator, I would expect it to be parsed correctly only by num_get, since strtod/sscanf do not support the thousands separator.
(*) The group (thousands) separator is an interesting case to me. As far as I can tell, the "C" functions make no reference to it, and last time I checked, the standard C and C++ printf functions will never write it. So is it really processed by the strtod/scanf functions? (I know there is a POSIX printf extension for the group separator, but that's not really standard, and it is notably missing from Microsoft's implementation.)
The C11 spec for strtod() seems to have an opening big enough for any size truck to drive through. It appears so open-ended that I see no limitation.
§7.22.1.3 6 In other than the "C" locale, additional locale-specific subject sequence forms may be accepted.
For non-"C" locales, the isspace() set, the decimal (radix) point, the group separator, the digits per group and the sign seem to constitute the typical variants. But apparently there is no limit.
For fun, I experimented with 500+ locales using printf(), sscanf(), strftime() and isspace().
All tested locales had a radix (decimal) point of '.' or ',', the same +/- sign, no digit grouping, and the expected 0-9.
strftime(... "%Y" ...) did not use a digit separator over years 1000-99999.
sscanf("1,234.5", "%lf", .. and sscanf("1.234,5", "%lf", .. did not produce 1234.5 in any locale.
All int values in the range 0 to 255 produced the same isspace() results, with the occasional exception of 154 and 160.
Of course these tests do not prove a limit to what may occur, but they do represent a sample of the possibilities.
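To see the locale dependence in practice, here is a small sketch; the locale name "de_DE.UTF-8" is an assumption and must actually be installed for the second branch to run:

#include <clocale>
#include <cstdio>
#include <cstdlib>

int main() {
    const char *input = "1234,5";
    std::printf("\"C\" locale:    %g\n", std::strtod(input, nullptr));   // stops at ',': 1234

    // Assumption: this locale is installed; setlocale returns NULL otherwise.
    if (std::setlocale(LC_NUMERIC, "de_DE.UTF-8"))
        std::printf("German locale: %g\n", std::strtod(input, nullptr)); // radix is ',': 1234.5
}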
In all of the C++ style guides I have read, I have never seen any information about numeric literal suffixes (e.g. 3.14f, 0L, etc.).
Questions
Is there any style guide out there that talks about their usage, or is there a general convention?
I occasionally encounter the f suffix in graphics programming. Is there any trend in their usage by programming domain?
The only established convention (somewhat established, anyway) of which I'm aware is to always use L rather than l, to avoid its being mistaken for a 1. Beyond that, it's pretty much a matter of using what you need when you need it.
Also note that C++ 11 allows user-defined literals with user-defined suffixes.
There is no general style guide that I've found. I use capital letters and I'm picky about using F for float literals and L for long double. I also use the appropriate suffixes for integral literals.
I assume you know what these suffixes mean: 3.14F is a float literal, 12.345 is a double literal, 6.6666L is a long double literal.
For integers: U is unsigned, L is long, LL is long long. The order between U and the Ls doesn't matter, but I always put UL because, for example, I declare such variables unsigned long.
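You can let the compiler confirm those types; a quick sketch using static_assert:

#include <type_traits>

static_assert(std::is_same<decltype(3.14F), float>::value, "F means float");
static_assert(std::is_same<decltype(12.345), double>::value, "no suffix means double");
static_assert(std::is_same<decltype(6.6666L), long double>::value, "L means long double");
static_assert(std::is_same<decltype(1UL), unsigned long>::value, "UL means unsigned long");

int main() {}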
If you assign a literal of one type to a variable of another type, or supply a numeric literal of one type for a function argument of another type, a conversion must happen. Using the proper suffix avoids this, and is useful along the same lines as static_cast is useful for calling out casts. Consistent use of numeric literal suffixes is good style and avoids numeric surprises.
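A short sketch of the kind of surprise the suffixes prevent:

int main() {
    float narrowed = 3.14;   // double literal, silently converted to float
    float direct   = 3.14F;  // float literal, no conversion at all
    // Without ULL, 1 << 40 would be evaluated as int and overflow (undefined behaviour).
    unsigned long long big = 1ULL << 40;
    (void)narrowed; (void)direct; (void)big;
}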
People differ on whether lower or upper case is best. Pick a style that looks good to you and be consistent.
The CERT C Coding Standard recommends using uppercase letters:
DCL16-C. Use "L," not "l," to indicate a long value
Lowercase letter l (ell) can easily be confused with the digit 1 (one). This can be particularly confusing when indicating that an integer literal constant is a long value. This recommendation is similar to DCL02-C. Use visually distinct identifiers.
Likewise, you should use uppercase LL rather than lowercase ll when indicating that an integer literal constant is a long long value.
MISRA C++ 2008 for the C++03 language states in rule M2-13-3 (at least, as cited by this Autosar document) that
A “U” suffix shall be applied to all octal or hexadecimal integer literals of unsigned type.
The linked document also compares with JSF-AV 2005 and HIC++ v4.0; all four of these standards require the suffixes to be uppercase.
Nevertheless, I can't find a rule (though I don't have a hardcopy of MISRA C++ at hand) stating that the suffixes shall be used whenever needed. However, IIRC there is one in MISRA C++ (or maybe it was just my former company's coding guidelines…)
Web search for "c++ numeric suffixes" returns:
http://cpp.comsci.us/etymology/literals.html
http://www.cplusplus.com/forum/general/27226/
http://bytes.com/topic/c/answers/758563-numeric-constants
Are these what you're looking for?
How can I find out what the current charset is in C++?
In a console application (WinXP) I am getting negative values for some characters (like äöüé) with
(int)mystring[a]
and this surprises me. I was expecting the values to be between 127 and 256.
So is there something like GetCharset() or SetCharset() in c++?
It depends on how you look at the value you have at hand. char can be signed (e.g. on Windows) or unsigned, as on some other systems. So what you should do is print the value as unsigned to get what you are asking for.
C++ is, so far, character-set agnostic. For the Windows console specifically, you can use GetConsoleOutputCP.
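A minimal sketch (Windows only) that reports the console's current output code page:

#include <windows.h>
#include <iostream>

int main() {
    // E.g. 437 or 850 on many Western systems, 65001 for UTF-8.
    std::cout << "Console output code page: " << GetConsoleOutputCP() << '\n';
}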
Look at std::numeric_limits<char>::min() and max(). Or CHAR_MIN and CHAR_MAX if you don't like typing, or if you need an integer constant expression.
If CHAR_MAX == UCHAR_MAX and CHAR_MIN == 0 then chars are unsigned (as you expected). If CHAR_MAX != UCHAR_MAX and CHAR_MIN < 0 they are signed (as you're seeing).
The standard (3.9.1/1) ensures that there are no other possibilities: "... a plain char can take on either the same values as a signed char or an unsigned char; which one is implementation-defined."
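For instance, a quick sketch of that check:

#include <iostream>
#include <limits>

int main() {
    std::cout << "char is "
              << (std::numeric_limits<char>::is_signed ? "signed" : "unsigned")
              << ", range [" << int(std::numeric_limits<char>::min())
              << ", " << int(std::numeric_limits<char>::max()) << "]\n";
}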
This tells you whether char is signed or unsigned, and that's what's confusing you. You certainly can't call anything to modify it: from the POV of a program it's baked into the compiler even if the compiler has ways of changing it (GCC certainly does: -fsigned-char and -funsigned-char).
The usual way to deal with this is if you're going to cast a char to int, cast it through unsigned char first. So in your example, (int)(unsigned char)mystring[a]. This ensures you get a non-negative value.
It doesn't actually tell you what charset your implementation uses for char, but I don't think you need to know that. On Microsoft compilers, the answer is essentially that commonly-used character encoding "ISO-8859-mutter-mutter". This means that chars with 7-bit ASCII values are represented by that value, while values outside that range are ambiguous, and will be interpreted by a console or other recipient according to how that recipient is configured. ISO Latin 1 unless told otherwise.
Properly speaking, the way characters are interpreted is locale-specific, and the locale can be modified and interrogated using a whole bunch of stuff towards the end of the C++ standard that personally I've never gone through and can't advise on ;-)
Note that if there's a mismatch between the charset in effect, and the charset your console uses, then you could be in for trouble. But I think that's separate from your issue: whether chars can be negative or not is nothing to do with charsets, just whether char is signed.
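To make the symptom and the fix concrete, a small sketch (the byte value 0xE4 is 'ä' in ISO Latin 1 - an assumption about the encoding in play):

#include <iostream>

int main() {
    char c = '\xE4';                             // 'ä' in ISO Latin 1
    std::cout << (int)c << '\n';                 // -28 where char is signed
    std::cout << (int)(unsigned char)c << '\n';  // 228, the value the asker expected
}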
chars are signed by default on many common platforms (e.g. x86), though not all. Try this:
cout << (int)(unsigned char) mystring[a] << endl;
The only guarantee that the standard provides is for members of the basic character set:
2.2 Character sets
3 The basic execution character set and the basic execution wide-character set shall each contain all the members of the basic source character set, plus control characters representing alert, backspace, and carriage return, plus a null character (respectively, null wide character), whose representation has all zero bits. For each basic execution character set, the values of the members shall be non-negative and distinct from one another. In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous. The execution character set and the execution wide-character set are supersets of the basic execution character set and the basic execution wide-character set, respectively. The values of the members of the execution character sets are implementation-defined, and any additional members are locale-specific.
Further, the type char is supposed to hold:
3.9.1 Fundamental types
1 Objects declared as characters (char) shall be large enough to store any member of the implementation’s basic character set.
So, there is no guarantee that you will get the correct values for the characters you mentioned. However, try to use an unsigned int to hold this value (for all practical purposes, it never makes sense to use a signed type to hold char values if you are going to print them or pass them around).