How is long long implemented in C++? - c++

In C++, as far as I know, all data types are implemented as classes. (Don't know if it is right, but I read it as a justification for statements such as int a(5);, which calls the parametric constructor of int.)
If so, how are long and short implemented? I just found out that long long and short short are valid types but short long and long short are not (Checked the latter ones just because it sounds funny!)
Similarly, how are signed and unsigned implemented?
PS. By implemented, what I mean is "Is written using C/C++ features or is it written at a lower level in the compiler itself".
So the equivalent parts of the declaration of a variable of a basic type and of a user-defined object or variable are (read downwards):
auto|register|static|extern <=> auto|register|static|extern
const <=> const
(signed|unsigned)(long|short) datatype <=> class name etc.
variable name <=> object/variable name
Is that assumption correct?

On the particular question, long long is implemented by the compiler in a compiler + platform specific way (usually more platform than compiler dependent).
As to the original misconception: no, not all types are classes in C++. The language tries to provide a uniform syntax for all types as much as possible, trying to have all types behave similarly and be used in a similar way. As a matter of fact, it is actually quite the other way around: C++ tries, as much as possible, to have classes behave like primitive types (value semantics).
The particular reason to be able to initialize an integer that way is actually quite related to classes, just in a different way. In a class constructor definition there are initializer lists that define how each one of the members is initialized before the constructor block is executed. The syntax for each initializer element in the list is basically (I would have to look up the exact definition) member_name( initializer ), so for example you would get:
class my_int_vector {
    int * p;
    int size;
public:
    my_int_vector() : p(0), size(0) {}
    //...
};
Both pointers and integers are fundamental types, but they can be initialized in a way similar to that of classes. If that kind of initialization was not allowed for fundamental types, and only the type name = value; syntax was allowed, the initializer-list syntax would have to be extended, and you would not be able to seamlessly change those types at a later time (say, change the int to an atomic_int).
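As a minimal sketch of that point (using std::atomic<int> to stand in for the atomic_int mentioned above, and a hypothetical counter class): the same direct-initialization syntax compiles for a fundamental type and for a class type, so a member's type can change without touching the initializer list.

#include <atomic>

class counter {                         // hypothetical example class
    std::atomic<int> hits;              // was: int hits;
    int limit;
public:
    counter() : hits(0), limit(100) {}  // unchanged whether hits is int or std::atomic<int>
};

int main() {
    int a(5);                 // direct-initialization of a fundamental type
    std::atomic<int> b(5);    // same syntax for a class type
    (void)a; (void)b;
    return 0;
}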

Built-in data types are not implemented as classes. They just have some syntactic similarities.
short, long, int and every other built-in type name are keywords which get treated specially by the compiler. So, in a nutshell, they're implemented as magic. They're not classes, they're just themselves.
So no, these types cannot be implemented in terms of other language features, the way std::string or other standard library components can. The built-in types are a fundamental part of the core language.

Not all data types are classes in C++. The primitive C data types are not (they're called scalars). They're not "implemented" at all, but rather they form a core feature of the language that has a direct translation to machine code. The syntax int i(5); is equivalent to C's int i = 5; and initializes the variable at declaration time.

You're laboring under a number of misconceptions:
Not all data types are implemented as classes in C++. In fact, only class types are implemented as classes. Enums, pointers and arrays, in addition to the fundamental types, are not implemented as classes.
short short is not a valid type. The signed integral types in C++03 are signed char, short, int, and long. Period. C++11 adds long long, and allows the implementation to add others.
I'm not sure what you mean by "implemented" here. On almost all modern machines, all of the integral types (signed or unsigned) are directly supported in hardware; the C++ compiler just generates the appropriate hardware instructions. (On older machines, long, and sometimes even int or short, often required function calls for the basic operations, and on some rare and exotic machines unsigned arithmetic required extra instructions.)
How you write a definition is a question of syntax, not of implementation. Both T v(i); and T v = i; are legal if the type supports copy; for all but class types, they are perfectly identical. (With a modern compiler; I've used some compilers in the past that had bugs in this regard. But that's a fairly distant past.)
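To illustrate where the two spellings do differ for class types, here is a small sketch using a hypothetical Meters wrapper with an explicit constructor; for a fundamental type the two forms are identical.

struct Meters {                       // hypothetical wrapper type
    explicit Meters(double v) : value(v) {}
    double value;
};

int main() {
    int a(5);        // direct-initialization
    int b = 5;       // copy-initialization; identical result for a fundamental type

    Meters c(5.0);   // fine: direct-initialization may call the explicit constructor
    // Meters d = 5.0;  // error: copy-initialization cannot use an explicit constructor
    (void)a; (void)b; (void)c;
    return 0;
}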

Related

Is it possible to auto cast (long *) to (int *)

I'm working on an existing code base, so the answer "do it right, just use one type" doesn't work because they already didn't do it right, I just have to live with it.
I know that coercion works so that this works:
int a = 1;
long b = a;
even though int and long are different base types. However, this doesn't work:
int a;
long *b = &a;
because there's no auto conversion between "pointer to int" and "pointer to long".
If, rather than base types, I was working with classes, I could get it to work by providing a conversion. My question is this. Is there a way to provide conversions for base types (or rather pointers to base types) so I could start the process of converting my code to use a single 32-bit integral type? As it stands, I either need to do an "all or nothing" edit OR provide a boatload of explicit casts.
My question is this. Is there a way to provide conversions for base types (or rather pointers to base types) so I could start the process of converting my code to use a single 32-bit integral type?
No, there is no way to influence the set of implicit conversions between pointer types aside from inheritance relations between classes.
Even if you add a (long*) cast or reinterpret_cast<long*> everywhere, as mentioned in the comments, accessing the value through that pointer will be an aliasing violation and therefore cause undefined behavior. This is not related to the size, alignment or representation of the int and long types. Rather, compilers are explicitly allowed to make optimizations that assume that a long pointer can never be pointing to an int object, and compilers will perform such optimizations, which can break code in possibly very subtle ways.
Note that this is different for casts between e.g. signed int* and unsigned int*. Signed and unsigned variants of the same integral type are allowed to alias one another, so that either pointer type can be used to access the object. The compiler is not allowed to perform optimizations in this case that assume that pointers of the two types don't point to the same address at the same time.
GCC and Clang offer the -fno-strict-aliasing option to disable optimizations based on the aliasing rules (still assuming that the types do actually have the same size, alignment and compatible representations), but I don't know whether MSVC has a similar option. Some compilers may also explicitly allow additional types to alias that the standard does not allow to alias, but I would only rely on that if the compiler documents it clearly. I don't know whether MSVC makes any such guarantees for int and long.
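A rough sketch of the safe alternative (the function name below is hypothetical): convert values instead of reinterpreting pointers, passing a temporary of the expected type and copying back if the callee writes through it.

void takes_long_ptr(long *p) { *p += 1; }   // hypothetical existing API that wants a long*

int main() {
    int a = 1;

    // Aliasing violation, even if sizeof(int) == sizeof(long):
    // takes_long_ptr(reinterpret_cast<long*>(&a));

    // Safe: convert the value, pass a real long, copy the result back.
    long tmp = a;
    takes_long_ptr(&tmp);
    a = static_cast<int>(tmp);
    return a == 2 ? 0 : 1;
}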

Unions, aliasing and type-punning in practice: what works and what does not?

I have a problem understanding what can and cannot be done using unions with GCC. I read the questions (in particular here and here) about it, but they focus on the C++ standard, and I feel there's a mismatch between the C++ standard and practice (the commonly used compilers).
In particular, I recently found confusing information in the GCC online doc while reading about the compilation flag -fstrict-aliasing. It says:
-fstrict-aliasing
Allow the compiler to assume the strictest aliasing rules applicable to the language being compiled. For C (and C++), this activates optimizations based on the type of expressions. In particular, an object of one type is assumed never to reside at the same address as an object of a different type, unless the types are almost the same.
For example, an unsigned int can alias an int, but not a void* or a double. A character type may alias any other type.
Pay special attention to code like this:
union a_union {
    int i;
    double d;
};

int f() {
    union a_union t;
    t.d = 3.0;
    return t.i;
}
The practice of reading from a different union member than the one most recently written to (called “type-punning”) is common.
Even with -fstrict-aliasing, type-punning is allowed, provided the memory is accessed through the union type. So, the code above works as expected.
This is what I think I understood from this example and my doubts:
1) aliasing only works between similar types, or char
Consequence of 1): aliasing - as the word suggests - is when you have one value and two members to access it (i.e. the same bytes);
Doubt: are two types similar when they have the same size in bytes? If not, what are similar types?
Consequence of 1): for non-similar types (whatever this means), aliasing does not work;
2) type punning is when we read a different member than the one we wrote to; it's common and it works as expected as long as the memory is accessed through the union type;
Doubt: is aliasing a specific case of type-punning where types are similar?
I get confused because it says unsigned int and double are not similar, so aliasing does not work; then in the example it's aliasing between int and double and it clearly says it works as expected, but calls it type-punning: not because types are or are not similar, but because it's reading from a member it did not write. But reading from a member it did not write is what I understood aliasing is for (as the word suggests). I'm lost.
The questions:
can someone clarify the difference between aliasing and type-punning and what uses of the two techniques are working as expected in GCC? And what does the compiler flag do?
Aliasing can be taken literally for what it means: it is when two different expressions refer to the same object. Type-punning is to "pun" a type, i.e. to use an object of some type as a different type.
Formally, type-punning is undefined behaviour with only a few exceptions. It happens commonly when you fiddle with bits carelessly:
int mantissa(float f)
{
    return (int&)f & 0x7FFFFF; // Accessing a float as if it's an int
}
The exceptions are (simplified)
Accessing integers as their unsigned/signed counterparts
Accessing anything as a char, unsigned char or std::byte
This is known as the strict-aliasing rule: the compiler can safely assume two expressions of different types never refer to the same object (except for the exceptions above) because they would otherwise have undefined behaviour. This facilitates optimizations such as
void transform(float* dst, const int* src, int n)
{
    for(int i = 0; i < n; i++)
        dst[i] = src[i]; // Can be unrolled and use vector instructions
                         // If dst and src alias the results would be wrong
}
What gcc says is that it relaxes the rules a bit and allows type-punning through unions even though the standard doesn't require it to:
union {
    int64_t num;
    struct {
        int32_t hi, lo;
    } parts;
} u = {42};

u.parts.hi = 420;
int64_t x = u.num;   // reads through a different member than the one last written
This is the type-pun gcc guarantees will work. Other cases may appear to work but may one day silently be broken.
Terminology is a great thing, I can use it however I want, and so can everyone else!
are two types similar when they have the same size in bytes? If not, what are similar types?
Roughly speaking, types are similar when they differ by constness or signedness. Size in bytes alone is definitely not sufficient.
is aliasing a specific case of type-punning where types are similar?
Type punning is any technique that circumvents the type system.
Aliasing is a specific case of that which involves placing objects of different types at the same address. Aliasing is generally allowed when types are similar, and forbidden otherwise. In addition, one may access an object of any type through a char (or similar-to-char) lvalue, but doing the opposite (i.e. accessing an object of type char through a dissimilar type lvalue) is not allowed. This is guaranteed by both the C and C++ standards; GCC simply implements what the standards mandate.
GCC documentation seems to use "type punning" in a narrow sense of reading a union member other than the one last written to. This kind of type punning is allowed by the C standard even when types are not similar. OTOH the C++ standard does not allow this. GCC may or may not extend the permission to C++; the documentation is not clear on this.
Without -fstrict-aliasing, GCC apparently relaxes these requirements, but it isn't clear to what exact extent. Note that -fstrict-aliasing is the default when performing an optimised build.
Bottom line, just program to the standard. If GCC relaxes the requirements of the standard, it isn't significant and isn't worth the trouble.
In ANSI C (AKA C89) you have (section 3.3.2.3 Structure and union members):
if a member of a union object is accessed after a value has been stored in a different member of the object, the behavior is implementation-defined
In C99 you have (section 6.5.2.3 Structure and union members):
If the member used to access the contents of a union object is not the same as the member last used to store a value in the object, the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type as described in 6.2.6 (a process sometimes called "type punning"). This might be a trap representation.
IOW, union-based type punning is allowed in C, although the actual semantics may be different, depending on the language standard supported (note that the C99 semantics is narrower than the C89's implementation-defined).
In C99 you also have (section 6.5 Expressions):
An object shall have its stored value accessed only by an lvalue expression that has one of the following types:
— a type compatible with the effective type of the object,
— a qualified version of a type compatible with the effective type of the object,
— a type that is the signed or unsigned type corresponding to the effective type of the object,
— a type that is the signed or unsigned type corresponding to a qualified version of the effective type of the object,
— an aggregate or union type that includes one of the aforementioned types among its members (including, recursively, a member of a subaggregate or contained union), or
— a character type.
And there's a section (6.2.7 Compatible type and composite type) in C99 that describes compatible types:
Two types have compatible type if their types are the same. Additional rules for determining whether two types are compatible are described in 6.7.2 for type specifiers, in 6.7.3 for type qualifiers, and in 6.7.5 for declarators. ...
And then (6.7.5.1 Pointer declarators):
For two pointer types to be compatible, both shall be identically qualified and both shall be pointers to compatible types.
Simplifying it a bit, this means that in C, by using a pointer, you can access signed ints as unsigned ints (and vice versa) and you can access individual chars in anything. Anything else would amount to an aliasing violation.
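A short sketch of the two permitted cases, assuming a typical platform; the reverse direction (treating a char buffer as an int, or an int as a float) is the part that is not allowed.

#include <cstdio>

int main() {
    int x = -1;

    // Allowed: signed and unsigned variants of the same type may alias.
    unsigned int *u = reinterpret_cast<unsigned int*>(&x);
    std::printf("as unsigned: %u\n", *u);

    // Allowed: the bytes of any object may be inspected through a character type.
    unsigned char *bytes = reinterpret_cast<unsigned char*>(&x);
    std::printf("first byte: %02x\n", static_cast<unsigned>(bytes[0]));

    // Not allowed (aliasing violation): unrelated types, e.g.
    // float *f = reinterpret_cast<float*>(&x); *f = 1.0f;
    return 0;
}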
You can find similar language in the various versions of the C++ standard. However, as far as I can see in C++03 and C++11 union-based type punning isn't explicitly allowed (unlike in C).
According to footnote 88 in the C11 draft N1570, the "strict aliasing rule" (6.5p7) is intended to specify the circumstances in which compilers must allow for the possibility that things may alias, but it makes no attempt to define what aliasing is. Somewhere along the line, a popular belief has emerged that accesses other than those defined by the rule represent "aliasing", and those allowed don't, but in fact the opposite is true.
Given a function like:
int foo(int *p, int *q)
{ *p = 1; *q = 2; return *p; }
Section 6.5p7 doesn't say that p and q won't alias if they identify the same storage. Rather, it specifies that they are allowed to alias.
Note that not all operations which involve accessing storage of one type as another represent aliasing. An operation on an lvalue which is freshly visibly derived from another object doesn't "alias" that other object. Instead, it is an operation upon that object. Aliasing occurs if, between the time a reference to some storage is created and the time it is used, the same storage is referenced in some way not derived from the first, or code enters a context wherein that occurs.
Although the ability to recognize when an lvalue is derived from another is a Quality of Implementation issue, the authors of the Standard must have expected implementations to recognize some constructs beyond those mandated. There is no general permission to access any of the storage associated with a struct or union by using an lvalue of member type, nor does anything in the Standard explicitly say that an operation involving someStruct.member must be recognized as an operation on a someStruct. Instead, the authors of the Standard expected that compiler writers who make a reasonable effort to support constructs their customers need should be better placed than the Committee to judge the needs of those customers and fulfill them. Since any compiler that makes an even-remotely-reasonable effort to recognize derived references would notice that someStruct.member is derived from someStruct, the authors of the Standard saw no need to explicitly mandate that.
Unfortunately, the treatment of constructs like:
actOnStruct(&someUnion.someStruct);
int q = *(someUnion.intArray + i);
has evolved from "It's sufficiently obvious that actOnStruct and the pointer dereference should be expected to act upon someUnion (and consequently all the members thereof) that there's no need to mandate such behavior" to "Since the Standard doesn't require that implementations recognize that the actions above might affect someUnion, any code relying upon such behavior is broken and need not be supported". Neither of the above constructs is reliably supported by gcc or clang except in -fno-strict-aliasing mode, even though most of the "optimizations" that would be blocked by supporting them would generate code that is "efficient" but useless.
If you're using -fno-strict-aliasing on any compiler having such an option, almost anything will work. If you're using -fstrict-aliasing on icc, it will try to support constructs that use type punning without aliasing, though I don't know if there's any documentation about exactly what constructs it does or does not handle. If you use -fstrict-aliasing on gcc or clang, anything at all that works is purely by happenstance.
I think it's good to add a complementary answer, simply because when I asked the question I did not know how to fulfill my needs without using UNION: I got stubborn on using it because it seemed to answer precisely my needs.
The good way to do type punning and avoid the possible consequences of undefined behavior (depending on the compiler and other environment settings) is to use std::memcpy and copy the memory bytes from one type to another. This is explained - for example - here and here.
I've also read that often, when a compiler produces valid code for type punning using unions, it produces the same binary code as if std::memcpy were used.
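As a minimal sketch of the memcpy approach (C++20 also offers std::bit_cast for the same job): copy the bytes of a float into an integer of the same size.

#include <cstdint>
#include <cstring>

std::uint32_t float_bits(float f) {
    static_assert(sizeof(std::uint32_t) == sizeof(float), "size mismatch");
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // well-defined; compilers usually emit a single move
    return bits;
}

int main() {
    return float_bits(1.0f) == 0x3F800000u ? 0 : 1;
}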
Finally, even if this information does not directly answer my original question it's so strictly related that I felt it was useful to add it here.

What is the difference between a proper defined union and a reinterpret_cast?

Can you propose at least 1 scenario where there is a substantial difference between
union {
    T var_1;
    U var_2;
};
and
var_2 = reinterpret_cast<U> (var_1)
?
The more I think about this, the more they look like the same thing to me, at least from a practical viewpoint.
One difference that I found is that while the union's size is as big as the biggest data type among its members, the reinterpret_cast, as described in this post, can lead to a truncation, so the plain old C-style union is even safer than the newer C++ casting.
Can you outline the differences between these two?
Contrary to what the other answers state, from a practical point of view there is a huge difference, although there might not be such a difference in the standard.
From the standard's point of view, reinterpret_cast is only guaranteed to work for round-trip conversions, and only if the alignment requirements of the intermediate pointer type are not stronger than those of the source type. You are not allowed (*) to write through a pointer of one type and read through a pointer of another type.
At the same time, the standard requires similar behavior from unions: it is undefined behavior to read from a union member other than the active one (the member that was last written to) (+).
Yet compilers often provide additional guarantees for the union case, and all compilers I know of (VS, g++, clang++, xlC_r, intel, Solaris CC) guarantee that you can read from a union through an inactive member, and that it will produce a value with exactly the same bits set as those that were written through the active member.
This is particularly important with high optimizations when reading from network:
double ntohdouble(const char *buffer) { // [1]
    union {
        int64_t i;
        double f;
    } data;
    memcpy(&data.i, buffer, sizeof(int64_t));
    data.i = ntohll(data.i);
    return data.f;
}

double ntohdouble(const char *buffer) { // [2]
    int64_t data;
    double dbl;
    memcpy(&data, buffer, sizeof(int64_t));
    data = ntohll(data);
    dbl = *reinterpret_cast<double*>(&data);
    return dbl;
}
The implementation in [1] is sanctioned by all compilers I know (gcc, clang, VS, sun, ibm, hp), while the implementation in [2] is not and will fail horribly on some of them when aggressive optimizations are used. In particular, I have seen gcc reorder the instructions and read into the dbl variable before evaluating ntohll, thus producing the wrong results.
(*) With the exception that you are always allowed to read from a [signed|unsigned] char* regardless of what the real object (original pointer type) was.
(+) Again with some exceptions: if the active member shares a common initial prefix with another member, you can read that prefix through the compatible member.
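For completeness, a sketch of a third variant that relies on neither a union nor a pointer cast, only on memcpy; the byte-swap helper below is a naive stand-in for the platform's ntohll and assumes a little-endian host.

#include <cstdint>
#include <cstring>

// Naive byte reversal standing in for ntohll (assumption: little-endian host).
static std::uint64_t swap_bytes(std::uint64_t v) {
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i)
        r = (r << 8) | ((v >> (8 * i)) & 0xFF);
    return r;
}

double ntohdouble_memcpy(const char *buffer) { // [3]
    std::uint64_t data;
    std::memcpy(&data, buffer, sizeof data);
    data = swap_bytes(data);
    double dbl;
    std::memcpy(&dbl, &data, sizeof dbl);      // defined behavior on every conforming compiler
    return dbl;
}

int main() { return 0; }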
There are some technical differences between a proper union and a (let's assume) proper and safe reinterpret_cast. However, I can't think of any of these differences that cannot be overcome.
The real reason to prefer a union over reinterpret_cast in my opinion isn't a technical one. It's for documentation.
Supposing you are designing a bunch of classes to represent a wire protocol (which I guess is the most common reason to use type-punning in the first place), and that wire protocol consists of many messages, submessages and fields. If some of those fields are common, such as msg type, seq#, etc, using a union simplifies tying these elements together and helps to document exactly how the protocol appears on the wire.
Using reinterpret_cast does the same thing, obviously, but in order to really know what's going on you have to examine the code that advances from one packet to the next. Using a union you can just take a look at the header and get an idea what's going on.
In C++11, a union is a class type and can hold members with non-trivial special member functions. You can't simply cast from one member to another.
§ 9.5.3
[ Example: Consider the following union:
union U {
    int i;
    float f;
    std::string s;
};
Since std::string (21.3) declares non-trivial versions of all of the special member functions, U will have an implicitly deleted default constructor, copy/move constructor, copy/move assignment operator, and destructor. To use U, some or all of these member functions must be user-provided. — end example ]
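A minimal sketch of what "user-provided" means in practice, assuming C++11; the union V below is a hypothetical variant of the example above that manages the string member's lifetime by hand with placement new and an explicit destructor call.

#include <new>
#include <string>

union V {
    int i;
    float f;
    std::string s;

    V() : i(0) {}                 // user-provided default constructor; i is the active member
    ~V() {}                       // user-provided destructor; deliberately does not destroy s
};

int main() {
    V v;
    new (&v.s) std::string("hello");   // start the lifetime of the string member
    v.s += " world";
    v.s.~basic_string();               // end its lifetime before reusing the union
    v.i = 42;
    return v.i == 42 ? 0 : 1;
}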
From a practical point of view, they're most probably 100% identical, at least on real, non-fictional computers. You take the binary representation of one type and stuff it into another type.
From a language lawyer point of view, using reinterpret_cast is well-defined for some occasions (e.g. pointer to integer conversions) and implementation-specific otherwise.
Union type punning, on the other hand, is very clearly undefined behaviour, always (though undefined does not necessarily mean "doesn't work"). The standard says that the value of at most one of the non-static data members can be stored in a union at any time. This means that if you set var1, then var1 is valid, but var2 is not.
However, since var1 and var2 are stored at the same memory location, you can of course still read and write any of the types as you like, and assuming they have the same storage size, no bits are "lost".

enum [tag] [: type] {enum-list} [declarator] in C++ - is this legal?

I am working with a custom enumerated type in C++, but it does not have many values. I want to try to reduce the size that they take up, and I've heard that enum types are always integers by default. I then came across the MSDN entry on C++ enumerations, and found the following syntax very interesting:
enum [: type] {enum-list};
Sure enough, it compiled with what I wanted (VS2008) when I did the following:
enum plane : unsigned char { xy, xz, yz };
Now, you can see from my enumeration constants that I don't need much in terms of space - an unsigned char type would be perfect for my uses.
However, I have to say, I've never seen this form used anywhere else on the internet - most don't even seem aware of it. I'm trying to make this code cross-platform (and possibly for use on embedded systems), so it left me wondering... Is this proper C++ syntax, or only supported by the MSVC compiler?
Edit: It seems that specifying the underlying type like this is now part of C++11 and above; C++11 also adds the related scoped enumerations (enum class).
This is non-standard, but it is expected to be a part of the C++0x standard. For me, when I compile in Visual Studio 2005 with warning level set to maximum I get the following warning:
warning C4480: nonstandard extension used: specifying underlying type for enum 'Test'
From the Wikipedia page for C++0x:
In standard C++, enumerations are not type-safe. They are effectively integers, even when the enumeration types are distinct. This allows the comparison between two enum values of different enumeration types. The only safety that C++03 provides is that an integer or a value of one enum type does not convert implicitly to another enum type. Additionally, the underlying integral type is implementation-defined; code that depends on the size of the enumeration is therefore non-portable. Lastly, enumeration values are scoped to the enclosing scope. Thus, it is not possible for two separate enumerations to have matching member names.
Additionally, C++0x will allow standard enumerations to provide explicit scoping as well as the definition of the underlying type:
enum Enum3 : unsigned long {Val1 = 1, Val2};
As 0A0D said, the notation you're using is non-standard in C++03 but has been adopted by C++11, which allows specifying the underlying type of an enum (and also adds scoped enums).
If that's not soon enough for you, then you can consider explicitly specifying the bit-field width used for the enum fields in the size-critical structures in which they're embedded. This approach is ugly - I mention it for completeness; if it were a good solution, the notation above wouldn't have been adopted for C++11. One problem is that you rely on an optional compiler warning to detect bit-fields too small to hold the possible values, and you may have to manually review them as the enumeration values change.
For example:
enum E
{
    A, B, C
};

struct X
{
    E e1 : 2;
    E e2 : 2;
    E e3 : 2;
    E e4 : 2;
};
Note: the enum may occupy more bits than requested - on GCC 4.5.2 with no explicit compiler options, sizeof(X) above is still 4....
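For comparison, a sketch assuming a C++11 compiler: fixing the underlying type pins down the enum's size, for both the unscoped form in the question and the scoped (enum class) form.

#include <cstdint>

enum plane : unsigned char { xy, xz, yz };      // unscoped enum with a fixed underlying type
enum class axis : std::uint8_t { x, y, z };     // scoped (enum class) variant

static_assert(sizeof(plane) == 1, "plane occupies one byte");
static_assert(sizeof(axis) == 1, "axis occupies one byte");

int main() { return 0; }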
It's perfectly legal (as of C++11). Embedding the type in an object to save space, however, will not be effective without specifically ensuring the alignment. If you leave the default alignment, padding bytes will be added (depending on the architecture), usually up to a quad-word boundary.

what is a type in C++?

Which constructs (class, struct, union) classify as types in C++? Can anyone explain the rationale behind calling and qualifying certain C++ constructs as a 'type'?
$3.9/1 - "There are two kinds of
types: fundamental types and compound
types. Types describe objects (1.8),
references (8.3.2), or functions
(8.3.5). ]"
Fundamental types are char, int, bool and so on.
Compound types are arrays, enums, classes, references, unions, etc.
A variable contains a value.
A type is a specification of the value. (eg, number, text, date, person, truck)
All variables must have a type, because they must hold strictly defined values.
Types can be built-in primitives (such as int), custom types (such as enums and classes), or some other things.
Other answers address the kinds of types C++ makes available, so I'll address the motivation part. Note that C++ didn't invent the notion of a data type. Quoting from the Wikipedia entry on Type system:
a type system may be defined as "a tractable syntactic framework for classifying phrases according to the kinds of values they compute"
Another interesting definition from the Data Type page:
a data type (or datatype) is a classification identifying one of various types of data, such as floating-point, integer, or Boolean, stating the possible values for that type, the operations that can be done on that type, and the way the values of that type are stored
Note that this last one is very close to what C++ means by "type". Perhaps obvious for the built-in (fundamental) types like bool:
possible values are true and false
operations - per definition of arithmetic operators that can accept bool as argument
the way it's stored - actually not mandated by the C++ standard, but one can guess that on some systems a type requiring only a single bit can be stored efficiently (although I think most C++ systems don't do this optimization).
For more complex, user-created types, the situation is more difficult. Consider enum types: you know exactly the range of values a variable of an enum type can take. What about struct and class? There, also, your type declaration tells the compiler what possible values the struct can have, what operations you can do on it (operator overloading and functions accepting objects of this type), and it will even infer how to store it.
Regarding the range of values: although huge, remember it's finite. Even a struct with N 32-bit integers has a finite range of possible values, namely 2^(32N).
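To make that concrete, a rough sketch with a hypothetical Weekday type: the declaration fixes the set of possible values, and the only operation defined on it here is next(), so the compiler rejects everything else.

enum class Weekday { Mon, Tue, Wed, Thu, Fri, Sat, Sun };   // exactly seven possible values

Weekday next(Weekday d) {
    return static_cast<Weekday>((static_cast<int>(d) + 1) % 7);
}

int main() {
    Weekday d = Weekday::Sun;
    d = next(d);                 // fine: an operation defined for the type
    // d = d + 1;                // error: no such operation on Weekday
    // int i = d;                // error: no implicit conversion out of the type
    return static_cast<int>(d);  // Mon == 0
}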
Quoting from the book "Bjarne Stroustrup - Programming Principles and Practice Using C++", page 77, chapter 3.8:
A type defines a set of possible values and a set of operations (for an object).
An object is some memory that holds a value of a given type.
A value is a set of bits in memory interpreted according to a type.
A variable is a named object.
A declaration is a statement that gives a name to an object.
A definition is a declaration that sets aside memory for an object.
Sounds like a matter of semantics to me... A type refers to something with a construct that can be used to describe it in a way that conforms to traditional object-oriented concepts (properties and methods). Anything that isn't called a type is probably created with a less robust construct.