Can you assign the value of one union member to another? - c++

Consider the following code snippet:
union
{
    int a;
    float b;
};
a = /* ... */;
b = a; // is this UB?
b = b + something;
Is the assignment of one union member to another valid?

Unfortunately, I believe the answer to this question is that this operation on unions is underspecified in C++, although self-assignment is perfectly OK.
Self-assignment is well-defined behavior: if we look at the draft C++ standard, section 1.9 Program execution paragraph 15 has the following examples:
void f(int, int);
void g(int i, int* v) {
    i = v[i++];        // the behavior is undefined
    i = 7, i++, i++;   // i becomes 9
    i = i++ + 1;       // the behavior is undefined
    i = i + 1;         // the value of i is incremented
    f(i = -1, i = -1); // the behavior is undefined
}
and self-assignment is covered by the i = i + 1 example.
The problem here is that, unlike C (which from C89 forward supports type-punning through unions), in C++ the situation is not clear. We only know that:
In a union, at most one of the non-static data members can be active at any time
but as this discussion on the WG21 UB study group mailing list shows, this concept is not well understood; we have the following comments:
While the standard uses the term "active field", it does not define it
and points out this non-normative note:
Note: In general, one must use explicit destructor calls and placement new operators to change the active member of a union. — end note
so we have to wonder whether:
b = a;
makes b the active member or not. I don't know, and I don't see a way to prove it with any of the current versions of the draft standard.
In all practicality, though, most modern compilers (for example gcc) support type-punning in C++, which means that the whole concept of the active member is bypassed.
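If you want to stay conservative regardless of how that question is resolved, one workaround is to split the read and the write into separate full-expressions. A minimal sketch, with illustrative values:
union U { int a; float b; };

int main()
{
    U u;
    u.a = 42;                      // a is the active member
    int tmp = u.a;                 // read a while it is still active
    u.b = static_cast<float>(tmp); // b becomes the active member here
    return 0;
}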

I would expect that unless the source and destination variables are the same type, such a thing would be Undefined Behavior in C, and I see no reason to expect C++ to handle it any differently. Given long long *x, *y;, some compilers might process a statement like *x = *y >> 8; by generating code to read all of *y, compute the result, and store it to *x, but a compiler might perfectly legitimately write code that copied parts of *y to *x individually. The standard makes clear that if *x and *y are pointers to the same object of the same type, the compiler must ensure that no part of the value gets overwritten while that part is still needed in the computation, but compilers are not required to deal with aliasing in other situations.
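To make that concrete, here is a small sketch of the defensive pattern this implies; the function name is mine, purely for illustration:
long long shift_into(long long *x, const long long *y)
{
    long long tmp = *y >> 8; // all of *y is read before anything is written
    *x = tmp;                // safe even if x and y refer to the same object
    return tmp;
}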

Related

Self-assignment of variable in its definition

The following C++ program compiles just fine (though g++ 5.4 at least gives a warning when invoked with -Wall):
int main(int argc, char *argv[])
{
    int i = i; // !
    return 0;
}
Even something like
int& p = p;
is swallowed by the compiler.
Now my question is: Why is such an initialization legal? Is there any actual use-case or is it just a consequence of the general design of the language?
This is a side effect of the rule that a name is in scope immediately after it is declared. There's no need to complicate this simple rule just to prevent writing code that's obvious nonsense.
Just because the compiler accepts it (syntactically valid code) does not mean that it has well-defined behaviour.
The compiler is not required to diagnose all cases of Undefined Behaviour or other classes of problems.
The standard gives it pretty free hands to accept and translate broken code, on the assumption that if the results were to be undefined or nonsensical, the programmer would not have written that code.
So, the absence of warnings or errors from your compiler does not in any way prove that your program has well-defined behaviour.
It is your responsibility to follow the rules of the language.
The compiler usually tries to help you by pointing out obvious flaws, but in the end it's on you to make sure your program makes sense.
And something like int i = i; does not make sense but is syntactically correct, so the compiler may or may not warn you, but in any case it is within its rights to just generate garbage (and not tell you about it), because you broke the rules and invoked Undefined Behaviour.
I guess the gist of your question is why the second identifier is recognized as identifying the same object as the first in int i = i; or int &p = p;.
This is defined in [basic.scope.pdecl]/1 of the C++14 standard:
The point of declaration for a name is immediately after its complete declarator and before its initializer (if any), except as noted below. [Example:
unsigned char x = 12;
{ unsigned char x = x; }
Here the second x is initialized with its own (indeterminate) value. —end example ]
The semantics of these statements are covered by other threads:
Is int x = x; UB?
Why can a Class& be initialized to itself?
Note - the quoted example differs in semantics from int i = i; because it is not UB to evaluate an uninitialized unsigned char, but it is UB to evaluate an uninitialized int.
As noted on the linked thread, g++ and clang can give warnings when they detect this.
Regarding rationale for the scope rule: I don't know for sure, but the scope rule existed in C so perhaps it just made its way into C++ and now it would be confusing to change it.
If we did say that the declared variable is not in scope for its initializer, then int i = i; might make the second i find an i from an outer scope, which would also be confusing.
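A tiny sketch of that hypothetical confusion, assuming the alternative scope rule purely for the sake of argument:
int i = 1; // file-scope i

void f()
{
    int i = i; // actual rule: self-initialization, value indeterminate;
               // alternative rule: this would copy the file-scope i == 1
}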

Does `volatile` permit type punning with unions?

We all know that type punning like this
union U {float a; int b;};
U u;
std::memset(&u, 0, sizeof u);
u.a = 1.0f;
std::cout << u.b;
is undefined behavior in C++.
It is undefined because after the u.a = 1.0f; assignment, .a becomes the active field and .b becomes an inactive field, and it's undefined behavior to read from an inactive field. We all know this.
Now, consider following code
union U {float a; int b;};
U u;
std::memset(&u, 0, sizeof u);
u.a = 1.0f;
char *ptr = new char[std::max(sizeof (int),sizeof (float))];
std::memcpy(ptr, &u.a, sizeof (float));
std::memcpy(&u.b, ptr, sizeof (int));
std::cout << u.b;
And now it becomes well-defined, because this kind of type punning is allowed.
Also, as you see, u's memory remains the same after the memcpy() calls.
Now let's add threads and the volatile keyword.
union U {float a; int b;};
volatile U u;
std::memset(&u, 0, sizeof u);
u.a = 1.0f;
std::thread th([&]
{
    char *ptr = new char[sizeof u];
    std::memcpy(ptr, &u.a, sizeof u);
    std::memcpy(&u.b, ptr, sizeof u);
});
th.join();
std::cout << u.b;
The logic remains the same, but we just have a second thread. Because of the volatile keyword, the code remains well-defined.
In real code this second thread could be implemented through any crappy threading library, and the compiler can be unaware of that second thread. But because of the volatile keyword it's still well-defined.
But what if there are no other threads?
union U {float a; int b;};
volatile U u;
std::memset(&u, 0, sizeof u);
u.a = 1.0f;
std::cout << u.b;
There are no other threads. But the compiler does not know that there are no other threads!
From the compiler's point of view, nothing has changed! And if the third example was well-defined, the last one must be well-defined too!
And we don't need that second thread, because it does not change u's memory anyway.
If volatile is used, the compiler assumes that u can be modified silently at any point. At such a modification, any field can become active.
And so, the compiler can never track which field of a volatile union is active.
It can't assume that a field remains active after it was assigned to (and that other fields remain inactive), even if nothing really modifies that union.
And so, in the last two examples the compiler must give me the exact bit representation of 1.0f converted to int.
The questions are: Is my reasoning correct? Are the 3rd and 4th examples really well-defined? What does the standard say about it?
In real code this second thread could be implemented through any crappy threading library, and the compiler can be unaware of that second thread. But because of the volatile keyword it's still well-defined.
That statement is false, and so the rest of the logic upon which you base your conclusion is unsound.
Suppose you have code like this:
int* currentBuf = bufferStart;
while(currentBuf < bufferEnd)
{
*currentBuf = foobar;
currentBuf++;
}
If foobar is not volatile then a compiler is permitted to reason as follows: "I know that foobar is never aliased by currentBuf and therefore does not change within the loop, therefore I may optimize the code as"
int* currentBuf = bufferStart;
int temp = foobar;
while (currentBuf < bufferEnd)
{
    *currentBuf = temp;
    currentBuf++;
}
If foobar is volatile then this and many other code generation optimizations are disabled. Notice I said code generation. The CPU is entirely within its rights however to move reads and writes around to its heart's content, provided that the memory model of the CPU is not violated.
In particular, the compiler is not required to force the CPU to go back to main memory on every read and write of foobar. All it is required to do is to eschew certain optimizations. (This is not strictly true; the compiler is also obliged to ensure that certain properties involving long jumps are preserved, and a few other minor details that have nothing to do with threading.) If there are two threads, and each is on a different processor, and each processor has a different cache, volatile introduces no requirement that the caches be made coherent if they both contain a copy of the memory for foobar.
Some compilers may choose to implement those semantics for your convenience, but they are not required to do so; consult your compiler documentation.
I note that C# and Java do require acquire and release semantics on volatiles, but those requirements can be surprisingly weak. In particular, the x86 will not reorder two volatile writes or two volatile reads, but is permitted to reorder a volatile read of one variable before a volatile write of another, and in fact the x86 processor can do so in rare situations. (See http://blog.coverity.com/2014/03/26/reordering-optimizations/ for a puzzle written in C# that illustrates how low-lock code can be wrong even if everything is volatile and has acquire-release semantics.)
The moral is: even if your compiler is helpful and does impose additional semantics on volatile variables like C# or Java, it still may be the case that there is no consistently observed sequence of reads and writes across all threads; many memory models do not impose this requirement. This can then cause weird runtime behaviour. Again, consult your compiler documentation if you want to know what volatile means for you.
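For contrast, a minimal sketch of what standard C++ (C++11 onward) actually provides for cross-thread visibility and ordering - std::atomic, not volatile; the names flag and payload are mine:
#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> flag{0};
int payload = 0;

int main()
{
    std::thread t([] {
        payload = 42;                             // ordinary write
        flag.store(1, std::memory_order_release); // publish it
    });
    while (flag.load(std::memory_order_acquire) != 1)
        ; // spin until the release store is visible
    std::cout << payload << '\n'; // guaranteed to print 42
    t.join();
}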
No - your reasoning is wrong. The volatile part is a general misunderstanding - volatile does not work the way you state.
The union part is wrong as well. Read this: Accessing inactive union member and undefined behavior?
With C++11 you can only expect correct/well-defined behaviour when the last write corresponds to the next read.
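For completeness, a minimal sketch of the well-defined way to inspect the bit pattern of 1.0f - no union, no volatile; the variable names are mine:
#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
    float f = 1.0f;
    std::uint32_t bits;
    static_assert(sizeof bits == sizeof f, "float must be 32 bits here");
    std::memcpy(&bits, &f, sizeof bits);   // sanctioned byte-wise punning
    std::cout << std::hex << bits << '\n'; // 3f800000 on IEEE-754 platforms
}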

return {0} from a function in C?

I got the following discrepancy between compiling as 'C' vs. compiling as 'C++':
struct A {
    int x;
    int y;
};

struct A get() {
    return {0};
}
When compiling as 'C++' everything goes fine.
However, when compiling as 'C', I am getting:
error: expected expression
which I can fix by doing:
return (struct A){0};
However, I wonder where the difference comes from. Can anyone point to the place in the language reference where this difference comes from?
The two use completely different mechanisms, one of which is C++11-specific, the other of which is C99-specific.
The first bit,
struct A get() {
    return {0};
}
depends on [stmt.return] (6.6.3 (2)) in C++11, which says
(...) A return statement with a braced-init-list initializes the object or reference to be returned from the function by copy-list-initialization from the specified initializer list. [ Example:
std::pair<std::string,int> f(const char *p, int x) {
    return {p,x};
}
-- end example ]
This passage does not exist in C (nor C++ before C++11), so the C compiler cannot handle it.
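A hedged illustration of what that buys you for this particular struct (the member without an initializer is value-initialized):
struct A { int x; int y; };

struct A get() {
    return {0}; // copy-list-initialization: x = 0, y value-initialized to 0
}

int main() {
    struct A a = get();
    return a.y; // 0
}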
On the other hand,
struct A get() {
    return (struct A){0};
}
uses a C99 feature called "compound literals" that does not exist in C++ (although some C++ compilers, notably gcc, provide it as a language extension; gcc warns about it with -pedantic). The semantics are described in detail in section 6.5.2.5 of the C99 standard; the money quote is
4 A postfix expression that consists of a parenthesized type name followed by a brace-enclosed list of initializers is a compound literal. It provides an unnamed object whose value is given by the initializer list. (footnote 80)
80) Note that this differs from a cast expression. For example, a cast specifies a conversion to scalar types or void only, and the result of a cast expression is not an lvalue.
So in this case (struct A){0} is an unnamed object that is copied into the return value and returned. (note that modern compilers will elide this copy, so you do not have to fear runtime overhead from it)
And there you have it, chapters and verses. Why these features exist the way they do in their respective languages may prove to be a fascinating discussion, but I fear that it is difficult for anyone outside the respective standardization committees to give an authoritative answer to the question. Both features were introduced after C and C++ split ways, and they're not developed side by side (nor would it make sense to do so). Divergence is inevitable even in small things.

Is initializing a garbage variable truly initializing or just an assignment?

I generally see examples of initialisation vs assignment like this:
int funct1(void)
{
    int a = 5; /* initialization */
    a = 6;     /* assignment */
}
Obviously something left as garbage or undefined somehow is uninitialized.
But could someone please define whether initialization is reserved for definition statements and/or whether assignments can be called initialisation?
int funct2(void)
{
    int b;
    b = 5; /* assignment, initialization or both?? */
}
Is there much of a technical reason why we can't say int b is initialised to garbage (from the compiler's point of view)?
Also, if possible, could this be compared with initializing and assigning on non-primitive data types?
I'll resurrect this thread to add an important point of view, since the puzzlement about terminology by the OP is understandable. As @OliCharlesworth pointed out (and he's perfectly right about that), as far as the C language standard is concerned initialization and assignment are two completely different things. For example (assuming local scope):
int n = 1; // definition, declaration and **initialization**
int k; // just definition + declaration, but no initialization
n = 12; // assignment of a previously initialized variable
k = 42; // assignment of a previously UNinitialized variable
The problem is that many books that teach programming aren't so picky about terminology, so they call "initialization" any "operation" that gives a variable its first meaningful value. So, in the example above, n = 12 wouldn't be an initialization, whereas k = 42 would. Of course this terminology is vague, imprecise and may be misleading (although it is used too often, especially by teachers when introducing programming to newbies). As a simple example of such an ambiguity let's recast the previous example taking global scope into account:
// global scope
int n = 1; // definition, declaration and **initialization**
int k; // definition, declaration and **implicit initialization to 0**
int main(void)
{
n = 12; // assignment of a previously initialized variable
k = 42; // assignment of a previously initialized variable
// ... other code ...
}
What would you say about the assignments in main? The first is clearly only an assignment, but is the second an initialization, according to the vague, generic terminology? Is the default value 0 given to k its first "meaningful" value or not?
Moreover a variable is commonly said to be uninitialized if no initialization or assignment has been applied to it. Given:
int x;
x = 42;
one would commonly say that x is uninitialized before the assignment, but not after it. The terms assignment and initializer are defined syntactically, but terms like "initialization" and "uninitialized" are often used to refer to the semantics (in somewhat informal usage). [Thanks to Keith Thompson for this last paragraph].
I dislike this vague terminology, but one should be aware that it is used and, alas, not too rare.
As far as the language standard is concerned, only statements of the form int a = 5; are initialisation. Everything of the form b = 5; is an assignment.
The same is true of non-primitive types.
And to "Is there much of a technical reason why we can't say int b is initialised to garbage", well, if you don't put any value into a memory location, it's not "initialisation". From the compiler's point of view, no machine language instruction is generated to write to the location, so nothing happens.

Accessing struct members directly

I have a testing struct definition as follows:
struct test {
    int a, b, c;
    bool d, e;
    int f;
    long g, h;
};
And somewhere I use it this way:
test* t = new test; // create the testing struct
int* ptr = (int*) t;
ptr[2] = 15; // directly manipulate the third word
cout << t->c; // look if it really affected the third integer
This works correctly on my Windows - it prints 15 as expected, but is it safe? Can I be really sure the variable is on the spot in memory I want it to be - especially in the case of such combined structs (for example, f is the fifth word on my compiler, but it is the sixth variable)?
If not, is there any other way to manipulate struct members directly without actually having struct->member construct in the code?
It looks like you are asking two questions:
Is it safe to treat &test as a 3-element int array?
It's probably best to avoid this. This may be a defined action in the C++ standard, but even if it is, it's unlikely that everyone you work with will understand what you are doing here. I believe this is not supported if you read the standard, because of the potential to pad structs, but I am not sure.
Is there a better way to access a member without its name?
Yes. Try using the offsetof macro/operator. This will provide the memory offset of a particular member within a structure and will allow you to correctly position a pointer to that member.
size_t offset = offsetof(test, c);                  // offsetof lives in <cstddef>
int* pointerToC = (int*)((char*)someTest + offset); // someTest is a test*
Another way though would be to just take the address of c directly
int* pointerToC = &(someTest->c);
No you can't be sure. The compiler is free to introduce padding between structure members.
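One hedged mitigation - not a portability guarantee, just a build-time check of the layout assumption - is to assert the offset you rely on (C++11 static_assert shown):
#include <cstddef>

struct test {
    int a, b, c;
    bool d, e;
    int f;
    long g, h;
};

// Fail the build if padding breaks the "c is the third word" assumption.
static_assert(offsetof(test, c) == 2 * sizeof(int),
              "unexpected padding before test::c");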
To add to JaredPar's answer, another option in C++ only (not in plain C) is to create a pointer-to-member object:
struct test
{
    int a, b, c;
    bool d, e;
    int f;
    long g, h;
};

int main(void)
{
    test t1, t2;
    int test::*p; // declare p as pointer to an int member of test
    p = &test::c; // p now points to 'c', but is not associated with any object
    t1.*p = 3;    // sets t1.c to 3 (.* because t1 is an object, not a pointer)
    t2.*p = 4;    // sets t2.c to 4
    p = &test::f;
    t1.*p = 5;    // sets t1.f to 5
    t2.*p = 6;    // sets t2.f to 6
}
You are probably looking for the offsetof macro. This will get you the byte offset of the member. You can then manipulate the member at that offset. Note though, this macro is implementation specific. Include stddef.h to get it to work.
It's probably not safe and is 100% unreadable, thus making that kind of code unacceptable in real-life production code.
Use set methods and boost::bind to create a functor which will change this variable.
Aside from padding/alignment issues other answers have brought up, your code violates strict aliasing rules, which means it may break for optimized builds (not sure how MSVC does this, but GCC -O3 will break on this type of behavior). Essentially, because test *t and int *ptr are of different types, the compiler may assume they point to different parts of memory, and it may reorder operations.
Consider this minor modification:
test* t = new test;
int* ptr = (int*) t;
t->c = 13;
ptr[2] = 15;
cout << t->c;
The output at the end could be either 13 or 15, depending on the order of operations the compiler uses.
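If you do need such access, a well-defined alternative is to compute the member's offset and copy the bytes out; a sketch with a reduced struct, where read_c is my illustrative name:
#include <cstddef>
#include <cstring>

struct test { int a, b, c; };

int read_c(const test& t)
{
    int value;
    std::memcpy(&value,
                reinterpret_cast<const char*>(&t) + offsetof(test, c),
                sizeof value); // bytes copied: no aliasing int* is formed
    return value;
}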
According to paragraph 9.2.17 of the standard, it is in fact legal to cast a pointer-to-struct to a pointer to its first member, provided the struct is POD:
A pointer to a POD-struct object, suitably converted using a reinterpret_cast, points to its initial member (or if that member is a bit-field, then to the unit in which it resides) and vice versa. [Note: There might therefore be unnamed padding within a POD-struct object, but not at its beginning, as necessary to achieve appropriate alignment. ]
However, the standard makes no guarantees about the layout of structs -- even POD structs -- other than that the address of a later member will be greater than the address of an earlier member, provided there are no access specifiers (private:, protected: or public:) between them. So, treating the initial part of your struct test as an array of 3 integers is technically undefined behaviour.