I was just going through all the possible Undefined Behaviours in this thread, and one of them is
The result of assigning to partially overlapping objects
I wondered if anyone could give me a definition of what "partially overlapping objects" are and an example in code of how that could possibly be created?
As pointed out in other answers, a union is the most obvious way to arrange this.
This is an even clearer example of how partially overlapping objects might arise with the built in assignment operator. This example would not otherwise exhibit UB if it were not for the partially overlapping object restrictions.
union Y {
int n;
short s;
};
void test() {
Y y;
y.s = 3; // s is the active member of the union
y.n = y.s; // Although it is valid to read .s and then write to .n,
// changing the active member of the union, .n and .s are
// not of the same type and partially overlap, hence UB.
}
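A minimal sketch of one way to sidestep the overlap, assuming the intent is simply to widen the stored short: copy the value into a separate local first, so the source of the assignment no longer shares storage with the destination. The names widen_via_temporary and self_check are hypothetical and not from the thread.

```cpp
#include <cassert>

union Y {
    int n;
    short s;
};

// Hypothetical helper: widens the short stored in the union without
// assigning between two members that share storage.
int widen_via_temporary(short value) {
    Y y;
    y.s = value;            // s is the active member
    const short tmp = y.s;  // tmp is a distinct object outside the union
    y.n = tmp;              // source and destination no longer overlap
    return y.n;             // n is now the active member
}

// Small usage check for the helper above.
void self_check() {
    assert(widen_via_temporary(3) == 3);
}
```

The extra local costs nothing in practice, since a compiler is free to keep it in a register.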
You can get potential partial overlap even with objects of the same type. Consider this example in the case where short is strictly larger than char on an implementation that adds no padding to X.
struct X {
char c;
short n;
};
union Y {
X x;
short s;
};
void test() {
Y y;
y.s = 3; // s is the active member of the union
y.x.n = y.s; // Although it is valid to read .s and then write to .x
// changing the active member of the union, it may be
// that .s and .x.n partially overlap, hence UB.
}
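Whether .s and .x.n really overlap partially depends on the layout the implementation picks for X. The following is a rough sketch of how that could be checked with offsetof; partially_overlap and s_and_xn_partially_overlap are hypothetical helpers, and "partial" is taken here to mean that the two byte ranges share some bytes without coinciding exactly.

```cpp
#include <cassert>
#include <cstddef>

struct X {
    char c;
    short n;
};

// Hypothetical helper: true when two byte ranges inside the same storage
// share at least one byte but are not the very same range.
constexpr bool partially_overlap(std::size_t off_a, std::size_t size_a,
                                 std::size_t off_b, std::size_t size_b) {
    return !(off_a + size_a <= off_b || off_b + size_b <= off_a) &&
           !(off_a == off_b && size_a == size_b);
}

// Inside union Y, .s starts at offset 0 and .x.n starts at offsetof(X, n).
constexpr bool s_and_xn_partially_overlap() {
    return partially_overlap(0, sizeof(short), offsetof(X, n), sizeof(short));
}

// Small usage check with made-up offsets and sizes.
void self_check() {
    assert(partially_overlap(0, 2, 1, 2));   // bytes [0,2) and [1,3)
    assert(!partially_overlap(0, 2, 2, 2));  // adjacent, nothing shared
    assert(!partially_overlap(0, 2, 0, 2));  // exactly the same bytes
}
```

On a common ABI that aligns short to two bytes, X gets a padding byte after c and the two ranges turn out to be disjoint, which is why the example above is stated for an implementation that adds no padding.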
A union is a good example for that.
You can create a structure of memory with overlapping members.
for example (from MSDN):
union DATATYPE // Declare union type
{
char ch;
int i;
long l;
float f;
double d;
} var1;
Now if you assign the char member, all the other members are undefined. That's because they share the same block of memory, and you've only set an actual value for part of it:
DATATYPE blah;
blah.ch = 4;
If you then try to access blah.i, blah.d or blah.f, they will have an undefined value (because only the first byte, which is the char, had its value set).
This refers to the problem of pointer aliasing, which is forbidden in C++ to give compilers an easier time optimizing. A good explanation of the problem can be found in this thread
Maybe he means the strict aliasing rule? An object in memory should not overlap with an object of another type.
"Strict aliasing is an assumption, made by the C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location (i.e. alias each other.)"
The canonical example is using memcpy:
char *s = malloc(100);
int i;
for(i=0; i != 100;++i) s[i] = i; /* just populate it with some data */
char *t = s + 10; /* s and t may overlap since s[10+i] and t[i] are the same byte */
memcpy(t, s, 20); /* if you copy more than 10 bytes, the regions overlap and the behavior is undefined */
The behavior of memcpy is undefined here because there is no required algorithm for performing the copy, so an implementation may copy the bytes in any order. memmove was introduced as a safe alternative for exactly this circumstance.
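As a minimal sketch of the memmove alternative, written in C++ rather than C: the helper below performs the same kind of overlapping copy as the example above. The names shift_right and self_check are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies the first `count` bytes of `buf` to
// `buf + shift`. The two ranges may overlap, so memmove is used
// instead of memcpy.
void shift_right(char* buf, std::size_t count, std::size_t shift) {
    std::memmove(buf + shift, buf, count);
}

// Small usage check mirroring the example above.
void self_check() {
    char data[100];
    for (int i = 0; i != 100; ++i) data[i] = static_cast<char>(i);
    shift_right(data, 20, 10);  // source [0, 20) overlaps destination [10, 30)
    assert(data[10] == 0);      // first copied byte
    assert(data[29] == 19);     // last copied byte
}
```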
Related
I know about the question "memcpy/memmove to a union member, does this set the 'active' member?", but I guess my question is different. So:
Suppose sizeof( int ) == sizeof( float ) and I have the following code snippet:
union U{
int i;
float f;
};
U u;
u.i = 1; //i is the active member of u
::std::memcpy( &u.f, &u.i, sizeof( u ) ); //copy memory content of u.i to u.f
My questions:
Does the code lead to an undefined behaviour (UB)? If yes why?
If the code does not lead to an UB, what is the active member of u after the memcpy call and why?
What would be the answer to previous two questions if sizeof( int ) != sizeof( float ) and why?
Regardless of the union, the behaviour of std::memcpy is undefined if the source and destination overlap. This is the case for every member of the union, and it would not be different if the sizes weren't the same.
If you were to use std::memmove instead, there is no longer an issue due to the overlap, and it also doesn't matter that you copy from a member of a union. Since both types are trivially copyable, the behaviour is defined and u.f becomes the active member of the union, but the union holds the same bytes as before in practice.
The only issue would arise if sizeof(U) was larger than sizeof(int), because you would be copying potentially uninitialized bytes. This is undefined behaviour.
You are not allowed to use memcpy to copy overlapping regions of memory:
If the objects overlap, the behavior is undefined.
Your code has undefined behavior because of this violation of memcpy's precondition, as u.f and u.i occupy the same address in memory.
You can use std::bit_cast to switch the active member of a union. Additionally, it's constexpr, so you can even use the union in a constant expression.
#include <memory>
#include <bit>
union U {
int i;
float f;
constexpr void switch_to_int() { this->i = std::bit_cast<int>(f); }
constexpr void switch_to_float() { this->f = std::bit_cast<float>(i); }
};
constexpr int foo() {
U u{};
u.f = 2.0f;
u.switch_to_int();
return u.i;
}
int main()
{
constexpr int i = foo();
}
std::bit_cast uses memcpy under the hood and compilers do a great job of optimizing the code. In this case memcpy is used to create an r-value that is then written to the new, active union element. This is what's left of the call to foo() at runtime.
mov eax,40000000h
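For toolchains without std::bit_cast, a rough equivalent can be sketched with std::memcpy between two distinct objects, which never overlap. This assumes a 32-bit IEEE 754 float; float_bits and self_check are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(sizeof(float) == sizeof(std::uint32_t),
              "sketch assumes a 32-bit float");
static_assert(std::numeric_limits<float>::is_iec559,
              "sketch assumes IEEE 754 floats");

// Hypothetical stand-in for std::bit_cast<std::uint32_t>(value).
std::uint32_t float_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // distinct objects, no overlap
    return bits;
}

// Small usage check: 2.0f has the same bit pattern as the constant
// in the disassembly above.
void self_check() {
    assert(float_bits(2.0f) == 0x40000000u);
}
```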
Yes, it's undefined behaviour (UB), because &u.f and &u.i point to the same start address. See the declaration of memcpy:
void * memcpy (void *__restrict, const void *__restrict, size_t);
The C99 keyword restrict is an indication to the compiler that different object pointer types and function parameter arrays do not point to overlapping regions of memory.
This enables the compiler to perform optimizations that might otherwise be prevented because of possible aliasing.
It is your responsibility to ensure that restrict-qualified pointers do not point to overlapping regions of memory.
__restrict, permitted in C90 and C++, is a synonym for restrict.
Because it is UB, the result is undefined.
Also UB, because &u.f and &u.i always point to the same start address, regardless of whether their sizes are equal. sizeof(U) is at least the size of the largest member of the union.
Suppose we have a standard-layout class, simplified to this:
struct X {
int num;
Object obj; // also standard layout
char buf[512];
};
As I understand it, if we have an instance of X, we can take its address and cast it to char* and look at the content of X as if it was an array of bytes.
However, it's a little less clear if we do the following, taking the address of a member and walking past it as if we were walking through X itself:
X x;
char* p = reinterpret_cast<char*>(&x.obj) + sizeof(Object);
*((int*)p) = 1234; // write an int into buf
Is this valid and well defined?
I was commenting to a colleague that if we take the address of obj we should limit ourselves to the memory of that object, and not assume that reading past its end is safely going into the next field of the containing struct (buf in this case). I have been asked to defend that statement. In looking for an answer I only managed to confuse myself a bit with all the similar questions that don't seem to address this precisely. :)
Standard citations appreciated. Also, I'd prefer to stick to the question "is this valid", and ignore the question of its sensibility.
As I understand it, if we have an instance of X, we can take its address and cast it to char* and look at the content of X as if it was an array of bytes.
The standard doesn't actually allow this currently. Casting to char* will not change the pointer value (per [expr.static.cast]/13) and as a result you will not be allowed to apply pointer arithmetic on it as it violates [expr.add]/4 and/or [expr.add]/6.
This is however often assumed to be allowed in practice and probably considered a defect in the standard. The paper P1839 by Timur Doumler and Krystian Stasiowski is trying to address that.
But even applying the proposed wording in this paper (revision P1839R5)
X x;
char* p = reinterpret_cast<char*>(&x.obj) + sizeof(Object);
*((int*)p) = 1234; // write an int into buf
will have undefined behavior, at least assuming I am interpreting its proposed wording and examples correctly. (I might not be though.)
First of all, there is no guarantee that buf will be correctly aligned for an int. If it isn't, then the cast (int*)p will produce an unspecified pointer value. But also, there is no guarantee in general that there is no padding between obj and buf.
Even if you assume correct alignment and no padding, because e.g. you have guarantees from your ABI or compiler, there are still problems.
First, the proposal would only allow unsigned char*, not char* or std::byte*, to access the object representation. See "Known issues" section.
Second, after fixing that, p would be a pointer one-past the object representation of obj, so it doesn't point to an object. As a consequence the cast (int*)p cannot point to any int object that might have been implicitly created in buf when X x;'s lifetime started. Instead [expr.static.cast]/13 will apply and the value of the pointer remains unchanged.
Trying to dereference the int* pointer pointing one-past-the-end of the object representation of obj will then cause undefined behavior (as it is not pointing to an object).
You also can't save this using std::launder on the pointer, because a pointer to an int nested inside buf would give you access to bytes which are not reachable through a pointer to the object representation of buf, violating std::launder's precondition, see [ptr.launder]/4.
In a broader picture, if you look at how e.g. std::launder is specified, it seems to me that the intention is definitively not to allow this. The way it is specified, it is impossible to use a pointer (in)to a member of a class (except the first if standard layout) to access memory of other (non-overlapping) members. This specifically seems to be intended to allow a compiler to do optimization by pointer analysis based on assuming that these other members are unreachable. (I don't know whether there is any compiler actually doing this though.)
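If the goal is only to store an int in buf, a hedged alternative that avoids walking past obj altogether is to name the member and copy the bytes with std::memcpy. This is a sketch, not the asker's code: the Object stand-in and the helpers store_int, load_int and self_check are hypothetical.

```cpp
#include <cassert>
#include <cstring>

struct Object {          // hypothetical stand-in for the real Object
    char values[17];
};

struct X {
    int num;
    Object obj;
    char buf[512];
};

// Writes the bytes of `value` at the start of x.buf. memcpy has no
// alignment requirement, so buf does not need to be aligned for int.
void store_int(X& x, int value) {
    std::memcpy(x.buf, &value, sizeof value);
}

// Reads the int back out of x.buf the same way.
int load_int(const X& x) {
    int value;
    std::memcpy(&value, x.buf, sizeof value);
    return value;
}

// Small usage check.
void self_check() {
    X x{};
    store_int(x, 1234);
    assert(load_int(x) == 1234);
}
```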
I was commenting to a colleague that if we take the address of obj we should limit ourselves to the memory of that object,
Well said!
char* p = reinterpret_cast<char*>(&x.obj) + sizeof(Object);
Nope, sizeof(Object) isn't necessarily a multiple of your platform's alignment, so that the next member (.buf) might not necessarily start immediately at .obj's end.
Generally, hm. If you really need to do this, write a union that's either a char[512] or an int or whatever you need it to be, and put it in place of char buf[512]; that's what they're for.
Taking your code and demonstrating this:
#include "fmt/format.h"
#include <algorithm>
#include <array>
#include <string>
#include <utility>
#include <vector>
struct Object {
std::array<char, 17> values;
};
struct X {
int num;
Object obj; // also standard layout
char buf[512];
};
int main() {
X x;
auto ptr_to_x = reinterpret_cast<char *>(&x);
auto ptr_distance = reinterpret_cast<char *>(&x.buf) - ptr_to_x;
std::vector<std::pair<std::string, unsigned int>> statements{
{"instance x", sizeof(x)},
{"class X", sizeof(X)},
{"distance between beginning of X and buf", ptr_distance},
{"Object", sizeof(Object)}};
std::size_t maxlenkey = 0;
std::size_t maxlenval = 0;
for (const auto &[key, val] : statements) {
maxlenkey = std::max(key.size(), maxlenkey);
maxlenval = std::max(fmt::formatted_size("{:d}", val), maxlenval);
}
for (const auto &[key, value] : statements) {
fmt::print("length of {: <{}s} {:{}d}\n", key, maxlenkey, value, maxlenval);
}
}
prints:
length of instance x 536
length of class X 536
length of distance between beginning of X and buf 21
length of Object 17
So, writing 4 bytes at ((char*)&x)+sizeof(Object) will definitely not do the same as actually writing to buf.
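To compare against the address the question actually computes (&x.obj plus sizeof(Object)), offsetof can report how much padding, if any, sits between obj and buf. This is only a sketch; the result is implementation-specific, and even a padding of zero would not make the access through that pointer valid. padding_before_buf, buf_starts_at_end_of_obj and self_check are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

struct Object {
    std::array<char, 17> values;
};

struct X {
    int num;
    Object obj;  // also standard layout
    char buf[512];
};

// Bytes of padding between the end of obj and the start of buf.
constexpr std::size_t padding_before_buf() {
    return offsetof(X, buf) - (offsetof(X, obj) + sizeof(Object));
}

// True only if &x.obj + sizeof(Object) has the same address as x.buf.
constexpr bool buf_starts_at_end_of_obj() {
    return padding_before_buf() == 0;
}

// Small usage check: members are laid out in declaration order.
void self_check() {
    assert(offsetof(X, buf) >= offsetof(X, obj) + sizeof(Object));
}
```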
Consider following code:
union U
{
int a;
float b;
};
int main()
{
U u;
int *p = &u.a;
*(float *)p = 1.0f; // <-- this line
}
We all know that the addresses of union fields are usually the same, but I'm not sure whether it is well-defined behavior to do something like this.
So, the question is: is it legal and well-defined behavior to cast and dereference a pointer to a union field like in the code above?
P.S. I know that it's more C than C++, but I'm trying to understand if it's legal in C++, not C.
All members of a union must reside at the same address, that is guaranteed by the standard. What you are doing is indeed well-defined behavior, but it shall be noted that you cannot read from an inactive member of a union using the same approach.
Accessing inactive union member - undefined behavior?
Note: Do not use c-style casts, prefer reinterpret_cast in this case.
As long as all you do is write to the other data member of the union, the behavior is well-defined; but as stated, this changes which member is considered the active member of the union, meaning that you can later only read from the member you just wrote to.
union U {
int a;
float b;
};
int main () {
U u;
int *p = &u.a;
*reinterpret_cast<float*> (p) = 1.0f; // ok, well-defined
}
Note: There is an exception to the above rule when it comes to layout-compatible types.
The question can be rephrased into the following snippet which is semantically equivalent to a boiled down version of the "problem".
#include <type_traits>
#include <algorithm>
#include <cassert>
int main () {
using union_storage_t = std::aligned_storage<
std::max ( sizeof(int), sizeof(float)),
std::max (alignof(int), alignof(float))
>::type;
union_storage_t u;
int * p1 = reinterpret_cast< int*> (&u);
float * p2 = reinterpret_cast<float*> (p1);
float * p3 = reinterpret_cast<float*> (&u);
assert (p2 == p3); // will never fire
}
What does the Standard (n3797) say?
9.5/1 Unions [class.union]
In a union, at most one of the non-static data members can be
active at any time, that is, the value of at most one of the
non-static data members can be stored in a union at any time.
[...] The size of a union is sufficient to contain the largest of
its non-static data members. Each non-static data member is
allocated as if it were the sole member of a struct. All non-static data members of a union object have the same address.
Note: The wording in C++11 (n3337) was underspecified, even though the intent has always been that of C++14.
Yes, it is legal. Using explicit casts, you can do almost anything.
As other comments have stated, all members in a union start at the same address / location so casting a pointer to a different member is pointless.
The assembly language will be the same. You want to make the code easy to read so I don't recommend the practice. It is confusing and there is no benefit.
Also, I recommend a "type" field so that you know when the data is in float format versus int format.
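The "type" field suggestion could look roughly like the sketch below. Everything here is hypothetical (Number, Kind, make_int, make_float, as_double, self_check); the point is only that the tag records which member was written last, so reads always go to the active member.

```cpp
#include <cassert>

struct Number {
    enum class Kind { Int, Float };
    Kind kind;       // records which union member is active
    union {
        int a;
        float b;
    };
};

Number make_int(int value) {
    Number n;
    n.kind = Number::Kind::Int;
    n.a = value;
    return n;
}

Number make_float(float value) {
    Number n;
    n.kind = Number::Kind::Float;
    n.b = value;
    return n;
}

// Reads only the member that the tag says is active.
double as_double(const Number& n) {
    if (n.kind == Number::Kind::Int) return n.a;
    return n.b;
}

// Small usage check.
void self_check() {
    assert(as_double(make_int(3)) == 3.0);
    assert(as_double(make_float(1.5f)) == 1.5);
}
```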
Was just reading about anonymous structures, how they aren't standard, and how some general use cases for them are undefined behaviour...
This is the basic case:
struct Point {
union {
struct {
float x, y;
};
float v[2];
};
};
So writing to x and then reading from v[0] would be undefined in that you would expect them to be the same but it may not be so.
Not sure if this is in the standard but unions of the same type...
union{ float a; float b; };
Is it undefined to write to a and then read from b?
That is to say, does the standard say anything about the binary representation of arrays and sequential variables of the same type?
The standard says that reading from any element in a union other
than the last one written is undefined behavior. In theory, the
compiler could generate code which somehow kept track of the
reads and writes, and triggered a signal if you violated the
rule (even if the two are the same type). A compiler could also
use the fact for some sort of optimization: if you write to a
(or x), it can assume that you do not read b (or v[0])
when optimizing.
In practice, every compiler I know supports this, if the union
is clearly visible, and there are cases in many (most?, all?)
where even legal use will fail if the union is not visible
(e.g.:
union U { int i; float f; };
int f( int* pi, float* pf ) { int r = *pi; *pf = 3.14159f; return r; }
// ...
U u;
u.i = 1;
std::cout << f( &u.i, &u.f );
I've actually seen this fail with g++, although according to the
standard, it is perfectly legal.)
Also, even if the compiler supports writing to Point::x and
reading from Point::v[0], there's no guarantee that Point::y
and Point::v[1] even have the same physical address.
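One hedged way to get both spellings without relying on the union at all is to store only the array and expose x and y as accessors. This is a sketch of an alternative design, not something the standard or the question prescribes; the accessor names and self_check are hypothetical.

```cpp
#include <cassert>

struct Point {
    float v[2];

    // Named views onto the array elements; no second object overlaps v.
    float& x() { return v[0]; }
    float& y() { return v[1]; }
    const float& x() const { return v[0]; }
    const float& y() const { return v[1]; }
};

// Small usage check: writes through x()/y() are visible through v.
void self_check() {
    Point p{};
    p.x() = 1.5f;
    p.y() = -2.0f;
    assert(p.v[0] == 1.5f);
    assert(p.v[1] == -2.0f);
}
```

The cost is writing p.x() instead of p.x, in exchange for code that does not depend on how a particular compiler treats the anonymous struct.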
The standard requires that in a union "[e]ach data member is allocated as if it were the sole member of a struct." (9.5)
It also requires that struct { float x, y; } and float v[2] must have the same internal representation (9.2) and thus you could safely reinterpret cast one as the other
Taken together these two rules guarantee that the union you describe will function provided that it is genuinely written to memory. However, because the standard only requires that the last data member written be valid it's theoretically possible to have an implementation that fails if the union is only used as a local variable. I'd be amazed if that ever actually happens, however.
I did not get why you have used float v[2];
The simple union for a point structure can be defined as:
union{
struct {
float a;
float b;
};
} Point;
You can access the values in the union as:
Point.a = 10.5;
Point.b = 12.2; //example
I have a testing struct definition as follows:
struct test{
int a, b, c;
bool d, e;
int f;
long g, h;
};
And somewhere I use it this way:
test* t = new test; // create the testing struct
int* ptr = (int*) t;
ptr[2] = 15; // directly manipulate the third word
cout << t->c; // look if it really affected the third integer
This works correctly on my Windows machine - it prints 15 as expected, but is it safe? Can I really be sure the variable is at the spot in memory I want it to be - especially in the case of such mixed structs (for example, on my compiler f is the fifth word, but it is the sixth variable)?
If not, is there any other way to manipulate struct members directly without actually having struct->member construct in the code?
It looks like you are asking two questions
Is it safe to treat &test as an int array of length 3?
It's probably best to avoid this. This may be a defined action in the C++ standard but even if it is, it's unlikely that everyone you work with will understand what you are doing here. I believe this is not supported if you read the standard because of the potential to pad structs but I am not sure.
Is there a better way to access a member without its name?
Yes. Try using the offsetof macro. It provides the byte offset of a particular member within a structure and lets you correctly position a pointer to that member.
size_t offset = offsetof(test, c);
int* pointerToC = (int*)((char*)&someTest + offset);
Another way though would be to just take the address of c directly
int* pointerToC = &someTest.c;
No you can't be sure. The compiler is free to introduce padding between structure members.
To add to JaredPar's answer, another option in C++ only (not in plain C) is to create a pointer-to-member object:
struct test
{
int a, b, c;
bool d, e;
int f;
long g, h;
};
int main(void)
{
test t1, t2;
int test::*p; // declare p as pointing to an int member of test
p = &test::c; // p now points to 'c', but it's not associating with an object
t1.*p = 3; // sets t1.c to 3
t2.*p = 4; // sets t2.c to 4
p = &test::f;
t1.*p = 5; // sets t1.f to 5
t2.*p = 6; // sets t2.f to 6
}
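Building on the pointer-to-member idea, a small table of member pointers gives index-based access to the int members without any assumption about padding. This is a sketch; int_members, int_member and self_check are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

struct test {
    int a, b, c;
    bool d, e;
    int f;
    long g, h;
};

// Hypothetical table of the int members, in declaration order.
constexpr int test::* int_members[] = {&test::a, &test::b, &test::c, &test::f};

// Returns a reference to the index-th int member of t (index must be < 4).
int& int_member(test& t, std::size_t index) {
    return t.*int_members[index];
}

// Small usage check: index 2 is c, index 3 is f.
void self_check() {
    test t{};
    int_member(t, 2) = 15;
    int_member(t, 3) = 7;
    assert(t.c == 15);
    assert(t.f == 7);
}
```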
You are probably looking for the offsetof macro. It gives you the byte offset of the member, and you can then manipulate the member at that offset. Note, though, that how the macro is defined is implementation-specific. Include stddef.h to use it.
It's probably not safe and is 100% un-readable; thus making that kind of code unacceptable in real life production code.
Use setter methods and boost::bind to create a functor that changes this variable.
Aside from padding/alignment issues other answers have brought up, your code violates strict aliasing rules, which means it may break for optimized builds (not sure how MSVC does this, but GCC -O3 will break on this type of behavior). Essentially, because test *t and int *ptr are of different types, the compiler may assume they point to different parts of memory, and it may reorder operations.
Consider this minor modification:
test* t = new test;
int* ptr = (int*) t;
t->c = 13;
ptr[2] = 15;
cout << t->c;
The output at the end could be either 13 or 15, depending on the order of operations the compiler uses.
According to paragraph 9.2/17 of the standard, it is in fact legal to cast a pointer-to-struct to a pointer to its first member, provided the struct is POD:
A pointer to a POD-struct object,
suitably converted using a
reinterpret_cast, points to its
initial member (or if that member is a
bit-field, then to the unit in which
it resides) and vice versa. [Note:
There might therefore be unnamed
padding within a POD-struct object,
but not at its beginning, as necessary
to achieve appropriate alignment. ]
However, the standard makes no guarantees about the layout of structs -- even POD structs -- other than that the address of a later member will be greater than the address of an earlier member, provided there are no access specifiers (private:, protected: or public:) between them. So, treating the initial part of your struct test as an array of 3 integers is technically undefined behaviour.