How compilers determine the width of bit shift operands - c++

Consider the following line:
int mask = 1 << shift_amount;
We know that mask is 4 bytes because it was explicitly declared int, but the 1 that is being shifted has no declared type. If the compiler treated it as a char it would be 8 bits wide, or it could be an unsigned short of 16 bits, so the result of the shift really depends on how the compiler decides to treat that 1. How does the compiler decide here? And is it safe to leave the code this way, or should it instead be:
int flag = 1;
int mask = flag << shift_amount;

1 is an int (typically 4 bytes). If you wanted it to be a type other than int you'd use a suffix, like 1L for long. For more details see https://en.cppreference.com/w/cpp/language/integer_literal.
You can also use a cast like (long)1 or if you want a known fixed length, (int32_t)1.
As Eric Postpischil points out in a comment, values smaller than int like (short)1 are not useful because the left-hand argument to << is promoted to int anyway.
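To make that concrete, here is a short sketch (the variable names are mine and the shift counts are only illustrative): it is the literal's type, not the destination's, that sets the width in which the shift is performed.
#include <cstdint>

int mask32 = 1 << 20;                // 1 is an int: the shift happens in (typically 32-bit) int
// int bad = 1 << 40;                // would be undefined behaviour with a 32-bit int
long long     mask64a = 1LL << 40;              // 1LL is a long long, so 64-bit arithmetic
std::uint64_t mask64b = std::uint64_t{1} << 40; // cast to a fixed 64-bit width first

int main() { return 0; }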

The 2018 C standard says in 6.4.4 3:
Each constant has a type, determined by its form and value, as detailed later.
This means we can always tell what the type of a constant is just from the text of the constant itself, without regard to the expression it appears in. (Here, “constant” actually means a literal: A thing whose value is given by its text. For example 34 and 'A' literally represent the number 34 and the character A, in contrast to an identifier foo that refers to some object.)
(This answer addresses C specifically. The rules described below are different in C++.)
The subclauses of 6.4.4 detail the various kinds of constants (integers, floating-point, enumerations, and characters). An integer constant without a suffix that can be represented in an int is an int, so 1 is an int.
If an integer constant has a suffix or does not fit in an int, then its type is affected by its suffix, its value, and whether it is decimal, octal, or hexadecimal, according to a table in 6.4.4.1 5.
Floating-point constants are double if they have no suffix, float with f or F, and long double with l or L.
Enumeration constants (declared with enum) have type int. (And these are not directly literals as I describe above, because they are names for values, but the name does indicate the value by way of the enum declaration.)
Character constants without a prefix have type int. Constants with prefixes L, u, or U have type wchar_t, char16_t, or char32_t, respectively.

Related

How is char_traits<char>::eof() encoded in platforms where sizeof(int) == 1?

I found these excerpts in the C++ standard (quotations taken from N4687, but it will likely have been there since forever):
[char.traits.typedefs]
For a certain character container type char_type, a related container type INT_T shall be a
type or class which can represent all of the valid characters converted from the corresponding char_type values, as well as an end-of-file value, eof().
[char.traits.require]
Expression: X::eof()
Type: X::int_type
Returns: a value e such that X::eq_int_type(e,X::to_int_type(c)) is false for all values c.
Expression: X::eq_int_type(e,f)
Type: bool
Returns: for all c and d, X::eq(c,d) is equal to X::eq_int_type(X::to_int_type(c), X::to_int_type(d)) (...)
c and d denote values of type CharT; (...); e and f denote values of type X::int_type
[char.traits.specializations.char]
using char_type = char;
using int_type = int;
[basic.fundamental]
Plain char, signed char, and unsigned char are three distinct types, collectively called narrow character types. (...) A char, a signed char, and an unsigned char occupy the same amount of storage and have the same alignment requirements (...) For narrow character types, all bits of the object representation participate in the value representation. (...) For unsigned narrow character types, each possible bit pattern of the value representation represents a distinct number.
There are five standard signed integer types : “signed char”, “short int”, “int”, “long int”, and “long long int”. In this list, each type provides at least as much storage as those preceding it in the list.
I haven't found anything preventing sizeof(int) == 1 in the surrounding text. This is obviously not the case on most modern platforms, where sizeof(int) is 4 or 8, but the possibility is explicitly used as an example, e.g. on cppreference:
Note: this allows the extreme case in which bytes are sized 64 bits, all types (including char) are 64 bits wide, and sizeof returns 1 for every type.
The question
If int were only as large as char, the standard does not leave much room for any object representation of the former that would compare unequal (via to_int_type) to all values of the latter, leaving just some corner cases (like a negative zero existing in signed char but mapping to INT_MIN in int) that are unlikely to be implemented efficiently in hardware. Moreover, with P0907 it seems even signed char will not allow two different bit strings to represent equal values, thus forcing it to 2^(bitsize) distinct values, and int as well, closing every possible loophole.
How, on such a platform, would one conform to the requirements of std::char_traits<char>? Do we have a real-world example of such a platform and the corresponding implementation?
Suppose, for example, that we had a platform where char is signed and 32 bits long, and int is as well. It is possible to satisfy all the requirements using the following definitions, where X is std::char_traits<char>:
X::eq and X::eq_int_type are simple equality comparisons;
X::to_char_type returns the value of its argument;
X::eof returns -1, and
X::to_int_type(c) is c, unless c is -1, in which case it's -2.
The mapping of -1 onto -2 guarantees that X::eq_int_type(X::eof(), X::to_int_type(c)) is false for all c, which is the requirement on X::eof according to C++20 Table 69.
This would probably correspond to an implementation where -1 and -2 (and maybe even all negative numbers), are "invalid" character values, i.e., they're completely legal to store in a char, but reading from a file will never yield a byte with such a value. Of course, nothing would stop you from writing a custom stream buffer that yields such "invalid" values as long as you're willing to accept the fact that it will not be possible to distinguish between "the next character is -1" and "we are at the end of the stream".
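As a concrete sketch of that scheme (illustrative only, with an abbreviated member set, and deliberately not a real specialization of std::char_traits<char>):
#include <cassert>

struct hypothetical_traits
{
    using char_type = char;
    using int_type  = int;

    static bool eq(char_type a, char_type b)        { return a == b; }
    static bool eq_int_type(int_type a, int_type b) { return a == b; }

    static int_type  eof()                    { return -1; }
    static char_type to_char_type(int_type e) { return static_cast<char_type>(e); }

    // Fold the character value -1 onto -2 so that eof() can never collide
    // with a converted character value.
    static int_type to_int_type(char_type c)
    {
        return c == char_type(-1) ? -2 : int_type(c);
    }
};

int main()
{
    // The Table 69 requirement: eq_int_type(eof(), to_int_type(c)) is false for all c.
    for (int c = -128; c < 128; ++c)
        assert(!hypothetical_traits::eq_int_type(hypothetical_traits::eof(),
                                                 hypothetical_traits::to_int_type(static_cast<char>(c))));
}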
The only possible issue with this implementation is that the requirement on X::to_char_type(e) is that it equals
if for some c, X::eq_int_type(e,X::to_int_type(c)) is true, c; else some unspecified value.
This could be read as implying that if any such c exists, then it is unique. That would be violated when e is -2, because c here could be either -1 or -2.
If we assume that uniqueness is required, then I don't think there's any possible solution.

Using constants and their associated modifiers using gcc

I was not sure what to call these flags, but what I am referring to is:
#define TEST_DEF 50000U //<- the "U" here
Google searching when you are not familiar with the jargon used to describe your question is futile.
What I am trying to do is use these constant definitions and make sure the value is only of a certain length, namely 8 or 16 bits.
How can I do this and what is it referred to as?
For integers, the section of the standard (ISO/IEC 9899:2011 — aka C2011 or C11) defining these suffixes is:
§6.4.4.1 Integer constants
Where it defines the integer-suffixes:
integer-suffix:
    unsigned-suffix long-suffix(opt)
    unsigned-suffix long-long-suffix
    long-suffix unsigned-suffix(opt)
    long-long-suffix unsigned-suffix(opt)
unsigned-suffix: one of
    u U
long-suffix: one of
    l L
long-long-suffix: one of
    ll LL
The corresponding suffixes for floating point numbers are f, F, l and L (for float and long double).
Note that it would be perverse to use l because it is far too easily confused with 1, so the suffixes are most often written with upper-case letters.
If you want to create integer literals that are of a given size, then the facilities to do so are standardized by <stdint.h> (added in C99).
The header (conditionally) defines fixed-size types such as int8_t and uint16_t. It also (unconditionally) provides minimum-sized types such as int_least8_t and uint_least16_t. If it cannot provide exact types (perhaps because the word size is 36 bits, so sizes 9, 18 and 36 are handled), it can still provide the least types.
It also provides macros such as INT8_C which ensure that the argument is an int_least8_t value.
Hence, you could use:
#include <stdint.h>
#define TEST_DEF UINT16_C(50000)
and you are guaranteed that the value will be at least 16 bits of unsigned integer, and formatted/qualified correctly.
§7.20.4 Macros for integer constants
¶1 The following function-like macros expand to integer constants suitable for initializing
objects that have integer types corresponding to types defined in <stdint.h>. Each
macro name corresponds to a similar type name in 7.20.1.2 or 7.20.1.5.
¶2 The argument in any instance of these macros shall be an unsuffixed integer constant (as
defined in 6.4.4.1) with a value that does not exceed the limits for the corresponding type.
¶3 Each invocation of one of these macros shall expand to an integer constant expression
suitable for use in #if preprocessing directives. The type of the expression shall have
the same type as would an expression of the corresponding type converted according to
the integer promotions. The value of the expression shall be that of the argument.
7.20.4.1 Macros for minimum-width integer constants
¶1 The macro INTN_C(value) shall expand to an integer constant expression
corresponding to the type int_leastN_t. The macro UINTN_C(value) shall expand
to an integer constant expression corresponding to the type uint_leastN_t. For
example, if uint_least64_t is a name for the type unsigned long long int,
then UINT64_C(0x123) might expand to the integer constant 0x123ULL.
There are five integer literal suffixes in C: u, l, ul, ll, and ull. Unlike nearly everything else in C they are case insensitive; also, ul and ull can be written as lu and llu respectively (however, lul is not acceptable).
They control the type of the constant. They work approximately like this:
literal │ type
────────┼───────────────────────
500 │ int
500u │ unsigned int
500l │ long int
500ul │ unsigned long int
500ll │ long long int
500ull │ unsigned long long int
This is only an approximation, because if the constant is too large for the indicated type, it is "promoted" to a larger type. The rules for this are sufficiently complicated that I'm not going to try to describe them. The rules for "promoting" hexadecimal and octal literals are slightly different than the rules for "promoting" decimal literals, and they are also slightly different in C99 versus C90 and different again in C++.
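As one illustration of those wrinkles (a sketch that assumes the common case of a 32-bit int; the C++11 candidate-type lists used here match C99 on this point):
#include <type_traits>

// Unsuffixed hexadecimal literals may fall back to unsigned types,
// so 0x80000000 (2^31) becomes unsigned int when int is 32 bits...
static_assert(std::is_same<decltype(0x80000000), unsigned int>::value,
              "hex 2^31 is unsigned int, assuming 32-bit int");

// ...but unsuffixed decimal literals only move up through the signed types
// (long on LP64, long long on ILP32), never to unsigned int.
static_assert(!std::is_same<decltype(2147483648), unsigned int>::value,
              "decimal 2^31 is never unsigned int");

int main() { return 0; }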
Because of the promotion effect, it is not possible to use these suffixes to limit constants to any size. If you write 281474976710656 on a system where int and long are both 32 bits wide, the constant will be given type long long even though you didn't say to do that. Moreover, there are no suffixes to force a constant to have type short or char. You can indicate your intent with the [U]INT{8,16,32,64,MAX}_C macros from <stdint.h>, but those do not impose any upper limit either, and on all systems I can conveniently get at right now (OSX, Linux), *INT8_C and *INT16_C actually produce values with type (unsigned) int.
Your compiler may, but is not required to, warn if you write ((uint8_t) 512) or similar (where 512 is a compile-time constant value outside the range of the type). In C11 you can use static_assert (from <assert.h>) to force the issue, but it might be a bit tedious to write.
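For completeness, a sketch of that check, reusing the question's TEST_DEF (written as a C++11 static_assert; in C11 the spelling would be _Static_assert, or static_assert via <assert.h>):
#include <cstdint>

#define TEST_DEF UINT16_C(50000)

// The suffix macro alone imposes no upper limit, so enforce the range explicitly.
static_assert(TEST_DEF <= UINT16_MAX, "TEST_DEF does not fit in 16 unsigned bits");

int main() { return 0; }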
This is an unsigned literal (the U is a suffix). See: http://en.cppreference.com/w/cpp/language/integer_literal

Unsigned vs signed range guarantees

I've spent some time poring over the standard references, but I've not been able to find an answer to the following:
is it technically guaranteed by the C/C++ standard that, given a signed integral type S and its unsigned counterpart U, the absolute value of each possible S is always less than or equal to the maximum value of U?
The closest I've gotten is from section 6.2.6.2 of the C99 standard (the wording of the C++ is more arcane to me, I assume they are equivalent on this):
For signed integer types, the bits of the object representation shall be divided into three
groups: value bits, padding bits, and the sign bit. (...) Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed type and N in the unsigned type, then M ≤ N).
So, in hypothetical 4-bit signed/unsigned integer types, is anything preventing the unsigned type to have 1 padding bit and 3 value bits, and the signed type having 3 value bits and 1 sign bit? In such a case the range of unsigned would be [0,7] and for signed it would be [-8,7] (assuming two's complement).
In case anyone is curious, I'm relying at the moment on a technique for extracting the absolute value of a negative integer consisting of first a cast to the unsigned counterpart, and then the application of the unary minus operator (so that, for instance, -3 becomes UINT_MAX - 2 via the cast and then 3 via unary minus). This would break on the example above for -8, whose absolute value could not be represented in the unsigned type.
EDIT: thanks for the replies below, Keith and Potatoswatter. Now, my last point of doubt is on the meaning of "subrange" in the wording of the standard. If it means a strictly "less-than" inclusion, then my example above and Keith's below are not standard-compliant. If the subrange is allowed to be the whole range of the unsigned type, then they are.
For C, the answer is no, there is no such guarantee.
I'll discuss types int and unsigned int; this applies equally to any corresponding pair of signed and unsigned types (other than char and unsigned char, neither of which can have padding bits).
The standard, in the section you quoted, implicitly guarantees that UINT_MAX >= INT_MAX, which means that every non-negative int value can be represented as an unsigned int.
But the following would be perfectly legal (I'll use ** to denote exponentiation):
CHAR_BIT == 8
sizeof (int) == 4
sizeof (unsigned int) == 4
INT_MIN = -2**31
INT_MAX = +2**31-1
UINT_MAX = +2**31-1
This implies that int has 1 sign bit (as it must) and 31 value bits, an ordinary 2's-complement representation, and unsigned int has 31 value bits and one padding bit. unsigned int representations with that padding bit set might either be trap representations, or extra representations of values with the padding bit unset.
This might be appropriate for a machine with support for 2's-complement signed arithmetic, but poor support for unsigned arithmetic.
Given these characteristics, -INT_MIN (the mathematical value) is outside the range of unsigned int.
On the other hand, I seriously doubt that there are any modern systems like this. Padding bits are permitted by the standard, but are very rare, and I don't expect them to become any more common.
You might consider adding something like this:
#if -INT_MIN > UINT_MAX
#error "Nope"
#endif
to your source, so it will compile only if you can do what you want. (You should think of a better error message than "Nope", of course.)
You got it. In C++11 the wording is more clear. §3.9.1/3:
The range of non-negative values of a signed integer type is a subrange of the corresponding unsigned integer type, and the value representation of each corresponding signed/unsigned type shall be the same.
But, what really is the significance of the connection between the two corresponding types? They are the same size, but that doesn't matter if you just have local variables.
In case anyone is curious, I'm relying at the moment on a technique for extracting the absolute value of a negative integer consisting of first a cast to the unsigned counterpart, and then the application of the unary minus operator (so that, for instance, -3 becomes UINT_MAX - 2 via the cast and then 3 via unary minus). This would break on the example above for -8, whose absolute value could not be represented in the unsigned type.
You need to deal with whatever numeric ranges the machine supports. Instead of casting to the unsigned counterpart, cast to whatever unsigned type is sufficient: one larger than the counterpart if necessary. If no large enough type exists, then the machine may be incapable of doing what you want.
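A sketch of that advice (the function name is mine, and it assumes int is no wider than 64 bits so that unsigned long long can always hold the magnitude):
// Magnitude of an int, computed in a deliberately wider unsigned type.
// The conversion of a negative value to unsigned long long is well defined
// (modular), and the unsigned negation then recovers the absolute value,
// even for INT_MIN.
unsigned long long abs_to_unsigned(int v)
{
    unsigned long long u = static_cast<unsigned long long>(v);
    return v < 0 ? 0ULL - u : u;   // 0 - u wraps to 2^64 - u
}

int main()
{
    return abs_to_unsigned(-3) == 3ULL ? 0 : 1;
}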

Why int plus uint returns uint?

int plus unsigned int returns an unsigned int. Should it be so?
Consider this code:
#include <boost/static_assert.hpp>
#include <boost/typeof/typeof.hpp>
#include <boost/type_traits/is_same.hpp>
class test
{
    static const int si = 0;
    static const unsigned int ui = 0;
    typedef BOOST_TYPEOF(si + ui) type;
    BOOST_STATIC_ASSERT( ( boost::is_same<type, int>::value ) ); // fails
};

int main()
{
    return 0;
}
If by "should it be" you mean "does my compiler behave according to the standard": yes.
C++2003: Clause 5, paragraph 9:
Many binary operators that expect operands of arithmetic or enumeration type cause conversions and yield
result types in a similar way. The purpose is to yield a common type, which is also the type of the result.
This pattern is called the usual arithmetic conversions, which are defined as follows:
blah
Otherwise, blah,
Otherwise, blah, ...
Otherwise, if either operand is unsigned, the other shall be converted to unsigned.
If by "should it be" you mean "would the world be a better place if it didn't": I'm not competent to answer that.
Unsigned integer types mostly behave as members of a wrapping abstract algebraic ring of values which are equivalent mod 2^N; one might view an N-bit unsigned integer not as representing a particular integer, but rather the set of all integers with a particular value in the bottom N bits. For example, if one adds together two binary numbers whose last 4 digits are ...1001 and ...0101, the result will be ...1110. If one adds ...1111 and ...0001, the result will be ...0000; if one subtracts ...0001 from ...0000 the result will be ...1111. Note that concepts of overflow or underflow don't really mean anything, since the upper-bit values of the operands are unknown and the upper-bit values of the result are of no interest. Note also that adding a signed integer whose upper bits are known to one whose upper bits are "don't know/don't care" should yield a number whose upper bits are "don't know/don't care" (which is what unsigned integer types mostly behave as).
The only places where unsigned integer types fail to behave as members of a wrapping algebraic ring are when they participate in comparisons, are used in numerical division (which implies comparisons), or are promoted to other types. If the only way to convert an unsigned integer type to something larger were to use an operator or function for that purpose, the use of such an operator or function could make clear that it was making assumptions about the upper bits (e.g. turning "some number whose lower bits are ...00010110" into "the number whose lower bits are ...00010110 and whose upper bits are all zeroes"). Unfortunately, C doesn't do that. Adding a signed value to an unsigned value of equal size yields a like-size unsigned value (which makes sense with the interpretation of unsigned values above), but adding a larger signed integer to an unsigned type will cause the compiler to silently assume that all upper bits of the latter are zeroes. This behavior can be especially vexing in cases where, depending upon a compiler's promotion rules, some compilers may deem two expressions as having the same size while others may view them as different sizes.
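A minimal, fully standard-defined illustration of that wrap-around behaviour:
#include <climits>

int main()
{
    // Unsigned arithmetic is defined modulo 2^N: "overflow" simply wraps.
    unsigned int a = UINT_MAX;
    unsigned int b = a + 1u;    // ...1111 + ...0001 wraps to ...0000
    unsigned int c = 0u - 1u;   // ...0000 - ...0001 wraps to ...1111 (UINT_MAX)

    return (b == 0u && c == UINT_MAX) ? 0 : 1;
}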
It is likely that the behavior stems from the logic behind pointer arithmetic: a memory location (e.g. std::size_t) plus a memory location difference (std::ptrdiff_t) is also a memory location.
In other words, std::size_t = std::size_t + std::ptrdiff_t.
When this logic is translated to the underlying types, this means unsigned long = unsigned long + long, or unsigned = unsigned + int.
The "other" explanation from #supercat is also possibly correct.
What is clear is that unsigned integers were not designed to be, and should not be interpreted as, mathematical positive numbers, not even in principle. See https://www.youtube.com/watch?v=wvtFGa6XJDU

What does the "L" mean at the end of an integer literal?

I have this constant:
#define MAX_DATE 2958465L
What does the L mean in this sense?
It is a long integer literal.
Integer literals have a type of int by default; the L suffix gives it a type of long (Note that if the value cannot be represented by an int, then the literal will have a type of long even without the suffix).
In this scenario the L does nothing.
The L after a number gives the constant the long type, but because in this scenario the constant is immediately assigned to an int variable, nothing is changed.
L tells the compiler that the number is of type long, a signed type at least as large as an int. On many compilers this means the number will take 4 bytes of memory, which happens to be the same size as an int, so the suffix won't have an effect in this case.
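A quick sketch of what the suffix does and does not change here (it assumes the usual 32-bit int, so the unsuffixed literal is an int):
#include <type_traits>

static_assert(std::is_same<decltype(2958465 ), int >::value, "no suffix: int (given 32-bit int)");
static_assert(std::is_same<decltype(2958465L), long>::value, "L suffix: long");

int main()
{
    // Assigning the literal to an int converts it straight back, so the L
    // is harmless but redundant here: the value fits in an int anyway.
    int max_date = 2958465L;
    return max_date == 2958465 ? 0 : 1;
}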
see this link
it says:
Literal constants (often referred to as literals or constants) are invariants whose values are implied by their representations.
base: decimal
example: 1L
description: any decimal number (digits 0-9) not beginning with a 0 (zero) and followed by L or l