Accessing struct members directly - c++

I have a testing struct definition as follows:
struct test{
int a, b, c;
bool d, e;
int f;
long g, h;
};
And somewhere I use it this way:
test* t = new test; // create the testing struct
int* ptr = (int*) t;
ptr[2] = 15; // directly manipulate the third word
cout << t->c; // look if it really affected the third integer
This works correctly on my Windows machine: it prints 15 as expected. But is it safe? Can I be really sure the variable is at the spot in memory I want it to be, especially in the case of such mixed structs (for example, with my compiler f is the fifth word, but it is the sixth variable)?
If not, is there any other way to manipulate struct members directly without actually having struct->member construct in the code?

It looks like you are asking two questions
Is it safe to treat &test as a 3-element int array?
It's probably best to avoid this. It may be defined behavior in the C++ standard, but even if it is, it's unlikely that everyone you work with will understand what you are doing here. I believe the standard does not support it, because structs may contain padding, but I am not sure.
Is there a better way to access a member without its name?
Yes. Try using the offsetof macro. It gives the byte offset of a particular member within a structure, which lets you correctly position a pointer to that member.
size_t offset = offsetof(test, c); // someTest below is an object of type test
int* pointerToC = (int*)((char*)&someTest + offset);
Another way, though, would be to just take the address of c directly:
int* pointerToC = &someTest.c;
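As a rough sketch of the offsetof approach, assuming the struct from the question: store_int_at and load_int_at are hypothetical helper names, and the bytes are copied with std::memcpy rather than written through a cast int*, which sidesteps aliasing concerns. The caller is assumed to pass an offset that really belongs to an int member.

```cpp
#include <cstddef>   // offsetof, std::size_t
#include <cstring>   // std::memcpy

struct test {
    int a, b, c;
    bool d, e;
    int f;
    long g, h;
};

// Hypothetical helper: overwrite the int stored `offset` bytes into `t`.
// Assumes `offset` came from offsetof(test, <some int member>).
void store_int_at(test& t, std::size_t offset, int value) {
    unsigned char* base = reinterpret_cast<unsigned char*>(&t);
    std::memcpy(base + offset, &value, sizeof value);
}

// Hypothetical helper: read the int stored `offset` bytes into `t`.
int load_int_at(const test& t, std::size_t offset) {
    const unsigned char* base = reinterpret_cast<const unsigned char*>(&t);
    int value;
    std::memcpy(&value, base + offset, sizeof value);
    return value;
}
```

For instance, store_int_at(t, offsetof(test, c), 15) should leave t.c equal to 15 however the compiler pads the struct, because the offset is computed by the compiler rather than guessed.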

No, you can't be sure. The compiler is free to introduce padding between structure members.

To add to JaredPar's answer, another option in C++ only (not in plain C) is to create a pointer-to-member object:
struct test
{
int a, b, c;
bool d, e;
int f;
long g, h;
};
int main(void)
{
test t1, t2;
int test::*p; // declare p as a pointer to an int member of test
p = &test::c; // p now refers to 'c', but it is not associated with any object
t1.*p = 3; // sets t1.c to 3 (t1 is an object, so use .* rather than ->*)
t2.*p = 4; // sets t2.c to 4
p = &test::f;
t1.*p = 5; // sets t1.f to 5
t2.*p = 6; // sets t2.f to 6
}
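Building on the pointer-to-member idea, here is a small sketch of a generic setter. The name set_member is hypothetical, and the struct is the one from the question; the point is only that the member is chosen by a value passed in, not by a name written at the call site.

```cpp
struct test {
    int a, b, c;
    bool d, e;
    int f;
    long g, h;
};

// Hypothetical helper: assign `value` to whichever member `member` designates.
// T is the class type, M the member type, V the type of the value supplied.
template <class T, class M, class V>
void set_member(T& obj, M T::*member, V value) {
    obj.*member = value;
}
```

A call such as set_member(t, &test::c, 3) sets t.c, and the same function works for &test::g with a long value.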

You are probably looking for the offsetof macro. It gives you the byte offset of a member, and you can then manipulate the member at that offset. Note, though, that the macro's expansion is implementation-specific. Include stddef.h (cstddef in C++) to use it.

It's probably not safe, and it is 100% unreadable, which makes that kind of code unacceptable in real-life production code.

Use setter methods and boost::bind to create a functor that changes this variable.
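A minimal sketch of that suggestion, with two assumptions stated up front: std::bind and std::function (standard since C++11) are used in place of boost::bind, and set_c and make_c_setter are hypothetical names that do not appear in the question.

```cpp
#include <functional>  // std::bind, std::function, std::placeholders

struct test {
    int a, b, c;
    // Hypothetical setter added for illustration.
    void set_c(int value) { c = value; }
};

// Returns a functor that, when called with an int, stores it in t.c.
// The functor keeps a pointer to `t`, so `t` must outlive it.
std::function<void(int)> make_c_setter(test& t) {
    return std::bind(&test::set_c, &t, std::placeholders::_1);
}
```

Calling the returned functor with 15 sets t.c to 15 without the caller ever writing t.c or t->c.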

Aside from padding/alignment issues other answers have brought up, your code violates strict aliasing rules, which means it may break for optimized builds (not sure how MSVC does this, but GCC -O3 will break on this type of behavior). Essentially, because test *t and int *ptr are of different types, the compiler may assume they point to different parts of memory, and it may reorder operations.
Consider this minor modification:
test* t = new test;
int* ptr = (int*) t;
t->c = 13;
ptr[2] = 15;
cout << t->c;
The output at the end could be either 13 or 15, depending on the order of operations the compiler uses.

According to paragraph 9.2/17 of the standard, it is in fact legal to cast a pointer-to-struct to a pointer to its first member, provided the struct is POD:
"A pointer to a POD-struct object, suitably converted using a reinterpret_cast, points to its initial member (or if that member is a bit-field, then to the unit in which it resides) and vice versa. [Note: There might therefore be unnamed padding within a POD-struct object, but not at its beginning, as necessary to achieve appropriate alignment. ]"
However, the standard makes no guarantees about the layout of structs -- even POD structs -- other than that the address of a later member will be greater than the address of an earlier member, provided there are no access specifiers (private:, protected: or public:) between them. So, treating the initial part of your struct test as an array of 3 integers is technically undefined behaviour.
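If code does depend on a particular layout, one option is to state the assumption so the compiler can check it. The sketch below assumes the struct from the question; leading_ints_are_packed is a hypothetical name. Even when the check passes, indexing past a with pointer arithmetic is still not sanctioned by the standard; the check only documents what a given compiler and ABI happen to do.

```cpp
#include <cstddef>      // offsetof
#include <type_traits>  // std::is_standard_layout

struct test {
    int a, b, c;
    bool d, e;
    int f;
    long g, h;
};

// offsetof is only guaranteed to work for standard-layout types.
static_assert(std::is_standard_layout<test>::value,
              "test must be standard-layout for offsetof");
// The first member of a standard-layout struct sits at offset 0.
static_assert(offsetof(test, a) == 0, "first member must be at offset 0");

// Hypothetical check: are a, b and c laid out back to back with no padding?
constexpr bool leading_ints_are_packed() {
    return offsetof(test, b) == sizeof(int) &&
           offsetof(test, c) == 2 * sizeof(int);
}
```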

Related

Standard layout, taking address of member and indexing past it to next member

Suppose we have a standard-layout class, simplified to this:
struct X {
int num;
Object obj; // also standard layout
char buf[512];
};
As I understand it, if we have an instance of X, we can take its address and cast it to char* and look at the content of X as if it was an array of bytes.
However, it's a little less clear if we do the following, taking the address of a member and walking past it as if we were walking through X itself:
X x;
char* p = reinterpret_cast<char*>(&x.obj) + sizeof(Object);
*((int*)p) = 1234; // write an int into buf
Is this valid and well defined?
I was commenting to a colleague that if we take the address of obj we should limit ourselves to the memory of that object, and not assume that reading past its end is safely going into the next field of the containing struct (buf in this case). I have been asked to defend that statement. In looking for an answer I only managed to confuse myself a bit with all the similar questions that don't seem to address this precisely. :)
Standard citations appreciated. Also, I'd prefer to stick to the question "is this valid", and ignore the question of its sensibility.
As I understand it, if we have an instance of X, we can take its address and cast it to char* and look at the content of X as if it was an array of bytes.
The standard doesn't actually allow this currently. Casting to char* will not change the pointer value (per [expr.static.cast]/13) and as a result you will not be allowed to apply pointer arithmetic on it as it violates [expr.add]/4 and/or [expr.add]/6.
This is however often assumed to be allowed in practice and probably considered a defect in the standard. The paper P1839 by Timur Doumler and Krystian Stasiowski is trying to address that.
But even applying the proposed wording in this paper (revision P1839R5)
X x;
char* p = reinterpret_cast<char*>(&x.obj) + sizeof(Object);
*((int*)p) = 1234; // write an int into buf
will have undefined behavior, at least assuming I am interpreting its proposed wording and examples correctly. (I might not be though.)
First of all, there is no guarantee that buf will be correctly aligned for an int. If it isn't, then the cast (int*)p will produce an unspecified pointer value. But also, there is no guarantee in general that there is no padding between obj and buf.
Even if you assume correct alignment and no padding, because e.g. you have guarantees from your ABI or compiler, there are still problems.
First, the proposal would only allow unsigned char*, not char* or std::byte*, to access the object representation. See "Known issues" section.
Second, after fixing that, p would be a pointer one-past the object representation of obj, so it doesn't point to an object. As a consequence the cast (int*)p cannot point to any int object that might have been implicitly created in buf when X x;'s lifetime started. Instead [expr.static.cast]/13 will apply and the value of the pointer remains unchanged.
Trying to dereference the int* pointer pointing one-past-the-end of the object representation of obj will then cause undefined behavior (as it is not pointing to an object).
You also can't save this using std::launder on the pointer, because a pointer to an int nested inside buf would give you access to bytes which are not reachable through a pointer to the object representation of buf, violating std::launder's precondition, see [ptr.launder]/4.
In a broader picture, if you look at how e.g. std::launder is specified, it seems to me that the intention is definitely not to allow this. The way it is specified, it is impossible to use a pointer (in)to a member of a class (except the first if standard layout) to access memory of other (non-overlapping) members. This specifically seems to be intended to allow a compiler to do optimization by pointer analysis based on assuming that these other members are unreachable. (I don't know whether there is any compiler actually doing this though.)
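If the goal is simply to store an int in buf, a route that avoids the reachability questions discussed here is to name buf directly and copy the bytes with std::memcpy. A minimal sketch, assuming a stand-in Object (the question does not show its definition) and hypothetical helper names:

```cpp
#include <cstring>  // std::memcpy

// Stand-in for the question's Object; its real definition is not shown.
struct Object {
    char values[17];
};

struct X {
    int num;
    Object obj;  // also standard layout
    char buf[512];
};

// Hypothetical helper: copy the bytes of `value` into the start of x.buf.
// No assumption is made about padding between obj and buf, or about alignment.
void write_int_to_buf(X& x, int value) {
    std::memcpy(x.buf, &value, sizeof value);
}

// Hypothetical helper: read those bytes back into an int.
int read_int_from_buf(const X& x) {
    int value;
    std::memcpy(&value, x.buf, sizeof value);
    return value;
}
```

Because the pointer is derived from buf itself, it does not matter where the compiler placed buf relative to obj.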
I was commenting to a colleague that if we take the address of obj we should limit ourselves to the memory of that object,
Well said!
char* p = reinterpret_cast<char*>(&x.obj) + sizeof(Object);
Nope, sizeof(Object) isn't necessarily a multiple of your platform's alignment, so that the next member (.buf) might not necessarily start immediately at .obj's end.
Generally, hm. If you really need to do this, write a union that's either a char[512] or an int or whatever you need it to be, and put it in place of char buf[512]; that's what they're for.
Taking your code and demonstrating this:
#include "fmt/format.h"
#include <algorithm>
#include <array>
#include <string>
#include <utility>
#include <vector>
struct Object {
std::array<char, 17> values;
};
struct X {
int num;
Object obj; // also standard layout
char buf[512];
};
int main() {
X x;
auto ptr_to_x = reinterpret_cast<char *>(&x);
auto ptr_distance = reinterpret_cast<char *>(&x.buf) - ptr_to_x;
std::vector<std::pair<std::string, unsigned int>> statements{
{"instance x", sizeof(x)},
{"class X", sizeof(X)},
{"distance between beginning of X and buf", ptr_distance},
{"Object", sizeof(Object)}};
std::size_t maxlenkey = 0;
std::size_t maxlenval = 0;
for (const auto &[key, val] : statements) {
maxlenkey = std::max(key.size(), maxlenkey);
maxlenval = std::max(fmt::formatted_size("{:d}", val), maxlenval);
}
for (const auto &[key, value] : statements) {
fmt::print("length of {: <{}s} {:{}d}\n", key, maxlenkey, value, maxlenval);
}
}
prints:
length of instance x 536
length of class X 536
length of distance between beginning of X and buf 21
length of Object 17
So, writing 4 bytes at ((char*)&x)+sizeof(Object) will definitely not do the same as actually writing to buf.

Is using std::memcpy on a whole union guaranteed to preserve the active union member?

In C++, it's well-defined to read from the union member that was most recently written, aka the active union member.
My question is whether std::memcpying a whole union object, as opposed to copying a particular union member, to an uninitialized memory area will preserve the active union member.
union A {
int x;
char y[4];
};
A a;
a.y[0] = 'U';
a.y[1] = 'B';
a.y[2] = '?';
a.y[3] = '\0';
std::byte buf[sizeof(A)];
std::memcpy(buf, &a, sizeof(A));
A& a2 = *reinterpret_cast<A*>(buf);
std::cout << a2.y << '\n'; // is `A::y` the active member of `a2`?
The assignments you have are okay because the assignment to non-class member a.y "begins its lifetime". However, your std::memcpy does not do that, so any accesses to members of a2 are invalid. So, you are relying on the consequences of undefined behaviour. Technically. In practice most toolchains are rather lax about aliasing and lifetime of primitive-type union members.
Unfortunately, there's more UB here, in that you're violating aliasing for the union itself: you can pretend that a T is a bunch of bytes, but you can't pretend that a bunch of bytes is a T, no matter how much reinterpret_casting you do. You could instantiate an A a2 normally and std::copy/std::memcpy over it from a, and then you're just back to the union member lifetime problem, if you care about that. But, I guess, if this option were open to you, you'd just be writing A a2 = a in the first place…
My question is whether std::memcpying a whole union object, as opposed to copying a particular union member, to an uninitialized memory area will preserve the active union member.
It'll be copied as expected.
It's how you're reading the result that may, or may not, make your program have undefined behavior.
std::memcpy copies chars from a source to a destination. Raw memory copying is OK. Reading from memory as something it wasn't initialized to be is not OK.
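One way to stay on the safer side of that distinction is to copy the bytes back into a real A object instead of casting the buffer. This is only a sketch with a hypothetical helper name (copy_through_bytes). As far as I can tell from the rules for trivially copyable types, the destination then holds the same value as the source, but whether that formally carries the active member is exactly what is being debated in this thread.

```cpp
#include <cstring>  // std::memcpy

union A {
    int x;
    char y[4];
};

// Hypothetical helper: round-trip an A through a raw byte buffer.
A copy_through_bytes(const A& src) {
    unsigned char buf[sizeof(A)];
    std::memcpy(buf, &src, sizeof(A));  // object -> raw bytes
    A dst;                              // a real A object with its own storage
    std::memcpy(&dst, buf, sizeof(A));  // raw bytes -> existing object
    return dst;
}
```

The bytes are never read through a pointer obtained by reinterpret_cast on the buffer, so the alignment of buf does not matter here.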
So far as I can tell, the C++ Standard makes no distinction between the following two functions on platforms where the sizes of int and foo would happen to be identical [as would typically be the case]
struct s1 { int x; };
struct s2 { int x; };
union foo { s1 a; s2 b; } u1, u2;
void test1(void)
{
u1.a.x = 1;
u2.b.x = 2;
std::memcpy(&u1, &u2, sizeof u1);
}
void test2(void)
{
u1.a.x = 1;
u2.b.x = 2;
std::memcpy(&u1.a.x, &u2.b.x, sizeof u1.a.x);
}
If a union of trivially-copyable types is itself a trivially-copyable type, that would suggest that the active member of u1 after the memcpy in test1 should be b. In the equivalent function test2, however, copying all the bytes from an int object into one that's part of the active union member u1.a should leave the active union member as a.
IMHO, this issue could be easily resolved by recognizing that a union may have multiple "potentially active" members, and allowing certain actions to be performed on any member that is at least potentially active (rather than limiting them to one particular active member). That would, among other things, allow the Common Initial Sequence rule to be made much clearer and more useful, without unduly inhibiting optimizations, by specifying that the act of taking the address of a union member makes it "at least potentially" active until the next time the union is written via non-character-based access, and providing that common-initial-sequence inspection or bytewise writing of a potentially active union member is allowed, but does not change the active member.
Unfortunately, when the Standard was first written, there was no effort to explore all of the relevant corner cases, much less reach a consensus about how they should be handled. At the time, I don't think there would have been opposition to the idea of officially accommodating multiple potentially-active members, since most compiler designs would naturally accommodate that without difficulty. Unfortunately, some compilers have evolved in a way that would make support for such constructs more difficult than it would have been if accommodated from the start, and their maintainers would block any changes that would contradict their design decisions, even if the Standard was never intended to allow such decisions in the first place.
Before I answer your question, I think your code should add this:
static_assert(std::is_trivial<A>());
Because in order to keep compatibility with C, trivial types get extra guarantees. For example, the requirement of running an object's constructor before using it (See https://eel.is/c++draft/class.cdtor) applies only to an object whose constructor is not trivial.
Because your union is trivial, your code is fine up to and including the memcpy. Where you run into trouble is *reinterpret_cast<A*>(buf);
Specifically, you are using an A object before its lifetime has begun.
As stated in https://eel.is/c++draft/basic.life , lifetime begins when storage with proper alignment and size for the type has been obtained, and its initialization is complete. A trivial type has "vacuous" initialization, so no problem there, however the storage is a problem.
When your example gets storage for buf,
std::byte buf[sizeof(A)];
It does not obtain proper alignment for the type. You'd need to change that line to:
alignas(A) std::byte buf[sizeof(A)];

Is there a standard-compliant way to determine the alignment of a non-static member?

Suppose I have some structure S and a non-static member member, as in this example:
struct S { alignas(alignof(void *)) char member[sizeof(void *)]; };
How do you get the alignment of member?
The alignof operator can only be applied to complete types, not expressions [in 7.6.2.5.1], although GCC allows alignof(S::member), and Clang supports it too.
What is the "language-lawyerly" standard way to do it without this restriction?
Also, sizeof allows expression arguments; is there a reason for the asymmetry?
The practical concern is to be able to get the alignment of members of template structures. You can use decltype to get their type and sizeof to get their size, but then you also need the alignment.
The alignment of a type or variable is a description of what memory addresses the variable can inhabit—the address must be a multiple of the alignment*. However, for data-members, the address of the data-member can be any K * alignof(S) + offsetof(S, member). Let's define the alignment of a data-member to be the maximum possible integer E such that &some_s.member is always a multiple of E.
Given a type S with member member, let A = alignof(S), O = offsetof(S, member).
The valid addresses of S{}.member are V = K * A + O for some integer K.
V = K * A + O = gcd(A, O) * (K * A / gcd(A, O) + O / gcd(A, O)).
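Conversely, any E that divides V for every K must divide both O (the case K = 0) and A + O (the case K = 1), hence A as well, so E divides gcd(A, O).
Thus, gcd(A, O) is the largest factor that is valid for unknown K.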
In other words, "alignof(S.member)" == gcd(alignof(S), offsetof(S, member)).
Note that this alignment is always a power of two, as alignof(S) is always a power of two.
*: In my brief foray into the standard, I couldn't find this guarantee, meaning that the address of the variable could be K * alignment + some_integer. However, this doesn't affect the final result.
We can define a macro to compute the alignment of a data-member:
#include <cstddef> // for offsetof(...)
#include <numeric> // for std::gcd
// Must be a macro, as `offsetof` is a macro because the member name must be known
// at preprocessing time.
#define ALIGNOF_MEMBER(cls, member) (::std::gcd(alignof(cls), offsetof(cls, member)))
This is only guaranteed valid for standard layout types, as offsetof is only guaranteed valid for standard layout types. If the class is not standard layout, this operation is conditionally supported.
Example:
#include <cstddef>
#include <numeric>
struct S1 { char foo; alignas(alignof(void *)) char member[sizeof(void *)]; };
struct S2 { char foo; char member[sizeof(void *)]; };
#define ALIGNOF_MEMBER(cls, member) (::std::gcd(alignof(cls), offsetof(cls, member)))
int f1() { return ALIGNOF_MEMBER(S1, member); } // returns alignof(void *) == 8
int f2() { return ALIGNOF_MEMBER(S1, foo); } // returns 8*
int f3() { return ALIGNOF_MEMBER(S2, member); } // returns 1
// *: alignof(S1) == 8, so the `foo` member must always be at an alignment of 8
I don't think it's possible. In the general case, declaring a non-static data member with an alignment specifier might not change the layout of the class that contains it. In the below example, if (as is most common) int has a size and alignment of 4, the structs S1 and S2 are likely to have the same layout, with a total size of 8 bytes. Each is likely to have 3 bytes of padding at the end:
struct S1 {
int x;
char y;
};
struct S2 {
int x;
alignas(4) char y;
};
This prevents us from using any information about the layout of the struct to determine the alignment of y. And as the OP noted, alignof(S::member) isn't valid.
By the way, there also isn't any way to query the alignment specifier of a regular variable. You can use the std::align function to check whether the variable is allocated at an address that is appropriately aligned for an object with alignment X, but this doesn't imply that the variable was actually declared with an alignment of X or greater. It could have been declared with an alignment less than X and coincidentally ended up allocated at an address that could have supported an object with alignment X.
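The std::align check mentioned here can be sketched as follows. The name is_aligned_for is hypothetical, and alignment is assumed to be a power of two, as std::align requires. As the answer says, a true result only describes the address, not how the variable was declared.

```cpp
#include <cstddef>  // std::size_t
#include <memory>   // std::align

// Hypothetical helper: does address `p` satisfy `alignment`?
// Assumes `alignment` is a power of two.
bool is_aligned_for(const void* p, std::size_t alignment) {
    void* candidate = const_cast<void*>(p);
    // Enough room for the worst-case adjustment (alignment - 1) plus 1 byte,
    // so std::align always succeeds and returns the next aligned address.
    std::size_t space = alignment;
    void* aligned = std::align(alignment, 1, candidate, space);
    // If no adjustment was needed, the address was already aligned.
    return aligned == p;
}
```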
Since this functionality is unsupported not only for non-static data members but also regular variables, I'm inclined to think that it's not an oversight; it's deliberately not supported because it's not useful. The compiler needs to know the alignment specifier so that it can allocate the variable or data member appropriately. That is not the programmer's job. Sure, the programmer may need to know the alignment requirement of a type in order to appropriately allocate memory for instances of that type, but you cannot, as the programmer, create additional instances of a variable, other than by triggering some condition that makes it happen automatically (e.g., continuing to the next iteration of a loop will deallocate and reallocate automatic variables in the loop's body). Nor can you, as of now, create a second class at compile time that's guaranteed to be layout-compatible with a given class, which is the main application I can think of for the hypothetical "query alignment of non-static data member" feature. I expect that, once C++ provides enough other reflection functionality so that something like that is close to possible, someone will also put forth a realistic proposal to add a way to query the alignment of a non-static data member.

What are 'partially overlapping objects'?

I was just going through all the possible Undefined Behaviours in this thread, and one of them is
The result of assigning to partially overlapping objects
I wondered if anyone could give me a definition of what "partially overlapping objects" are and an example in code of how that could possibly be created?
As pointed out in other answers, a union is the most obvious way to arrange this.
This is an even clearer example of how partially overlapping objects might arise with the built in assignment operator. This example would not otherwise exhibit UB if it were not for the partially overlapping object restrictions.
union Y {
int n;
short s;
};
void test() {
Y y;
y.s = 3; // s is the active member of the union
y.n = y.s; // Although it is valid to read .s and then write to .n
// changing the active member of the union, .n and .s are
// not of the same type and partially overlap
}
You can get potential partial overlap even with objects of the same type. Consider this example in the case where short is strictly larger than char on an implementation that adds no padding to X.
struct X {
char c;
short n;
};
union Y {
X x;
short s;
};
void test() {
Y y;
y.s = 3; // s is the active member of the union
y.x.n = y.s; // Although it is valid to read .s and then write to .x
// changing the active member of the union, it may be
// that .s and .x.n partially overlap, hence UB.
}
A union is a good example for that.
You can create a structure of memory with overlapping members.
For example (from MSDN):
union DATATYPE // Declare union type
{
char ch;
int i;
long l;
float f;
double d;
} var1;
Now, if you assign the char member, all other members are undefined. That's because they occupy the same memory block, and you've only set an actual value in part of it:
DATATYPE blah;
blah.ch = 4;
If you then try to access blah.i, blah.d or blah.f, they will have an undefined value (because only the first byte, which is the char, had its value set).
This refers to the problem of pointer aliasing, which is forbidden in C++ to give compilers an easier time optimizing. A good explanation of the problem can be found in this thread.
Maybe he means the strict aliasing rule? An object in memory should not overlap with an object of another type.
"Strict aliasing is an assumption, made by the C (or C++) compiler, that dereferencing pointers to objects of different types will never refer to the same memory location (i.e. alias each other.)"
The canonical example is using memcpy:
char *s = malloc(100);
int i;
for(i=0; i != 100;++i) s[i] = i; /* just populate it with some data */
char *t = s + 10; /* s and t may overlap since s[10+i] = t[i] */
memcpy(t, s, 20); /* if you are copying more than 10 bytes, there is overlap and the behavior is undefined */
memcpy has undefined behavior here because no particular copy algorithm is required of it, so an implementation is free to copy in an order that corrupts overlapping data. memmove was introduced as a safe alternative for this circumstance.
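For comparison, here is the same overlapping copy done with std::memmove, which is specified to behave as if the source were first copied to a temporary buffer. The function name overlapping_copy_demo is hypothetical, and a std::vector stands in for the malloc'd block.

```cpp
#include <cstddef>  // std::size_t
#include <cstring>  // std::memmove
#include <vector>

// Hypothetical demo: fill 100 bytes with 0..99, then copy the first 20 bytes
// to offset 10. Source [0, 20) and destination [10, 30) overlap.
std::vector<unsigned char> overlapping_copy_demo() {
    std::vector<unsigned char> s(100);
    for (std::size_t i = 0; i != s.size(); ++i) {
        s[i] = static_cast<unsigned char>(i);
    }
    unsigned char* t = s.data() + 10;
    std::memmove(t, s.data(), 20);  // well-defined even though the ranges overlap
    return s;
}
```

After the call, bytes 10 through 29 hold the original values 0 through 19, and everything outside that range is unchanged.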

C++ variable alias - what's that exactly, and why is it better to turn it off?

I've read the essay Surviving the Release Version.
Under the "Aliasing bugs" clause it says:
"You can get tighter code if you tell the compiler that it can assume no aliasing...."
I've also read Aliasing (computing).
What exactly is a variable alias? I understand that a pointer to a variable is an alias for it, but how and why does aliasing hurt? In other words, why would telling the compiler that it can assume no aliasing get me "tighter code"?
Aliasing is when you have two different references to the same underlying memory. Consider this made up example:
int doit(int *n1, int *n2)
{
int x = 0;
if (*n1 == 1)
{
*n2 = 0;
x += *n1; // line of interest
}
return x;
}
int main()
{
int x = 1;
doit(&x, &x); // aliasing happening
}
If the compiler has to allow for aliasing, it needs to consider the possibility that n1 == n2. Therefore, when it needs to use the value of *n1 at "line of interest", it needs to allow for the possibility it was changed by the line *n2 = 0.
If the compiler can assume no aliasing, it can assume at "line of interest" that *n1 == 1 (because otherwise we would not be inside the if). The optimizer can then use this information to optimize the code (in this case, change "line of interest" from following the pointer and doing a general purpose addition to using a simple increment).
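To make the difference concrete, the function below is the doit from this answer, written out so that it compiles on its own. With distinct arguments it returns 1; with the same pointer passed twice, *n1 has already been zeroed by the time it is added, so it returns 0. A compiler that may not assume "no aliasing" has to preserve both outcomes.

```cpp
// Adds *n1 to x only if *n1 == 1, after first writing 0 through n2.
int doit(int* n1, int* n2) {
    int x = 0;
    if (*n1 == 1) {
        *n2 = 0;
        x += *n1;  // line of interest: must re-read *n1 if n1 and n2 may alias
    }
    return x;
}
```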
Disallowing aliasing means that if you have a pointer char* b, you can assume that b is the only pointer in the program that points to that particular memory location, which means the only time that memory location is going to change is when the programmer uses b to change it. The generated assembly thus doesn't need to reload the memory pointed to by b into a register as long as the compiler knows nothing has used b to modify it. If aliasing is allowed, it's possible there's another pointer char* c = b; that was used elsewhere to mess with that memory.