Is it possible for the sizeof operator to ever return 0 (zero) in C or C++? If it is possible, is it correct from a standards point of view?
In C++ an empty class or struct has a sizeof at least 1 by definition. From the C++ standard, 9/3 "Classes": "Complete objects and member subobjects of class type shall have nonzero size."
In C an empty struct is not permitted, except by extension (or a flaw in the compiler).
This is a consequence of the grammar (which requires that there be something inside the braces) along with this sentence from 6.7.2.1/7 "Structure and union specifiers": "If the struct-declaration-list contains no named members, the behavior is undefined".
If a zero-sized structure is permitted, then it's a language extension (or a flaw in the compiler). For example, in GCC the extension is documented in "Structures with No Members", which says:
GCC permits a C structure to have no members:
struct empty {
};
The structure will have size zero. In C++, empty structures are part of the language. G++ treats empty structures as if they had a single member of type char.
sizeof never returns 0 in C or in C++. Whenever you see sizeof evaluate to 0, it is a bug, glitch, or extension of a specific compiler and has nothing to do with the language.
Every object in C must have a unique address. Worded another way, an address must hold no more than one object of a given type (in order for pointer dereferencing to work). That being said, consider an 'empty' struct:
struct emptyStruct {};
and, more specifically, an array of them:
struct emptyStruct array[10];
struct emptyStruct* ptr = &array[0];
If the objects were indeed empty (that is, if sizeof(struct emptyStruct) == 0), then ptr++ ==> (void*)ptr + sizeof(struct emptyStruct) ==> ptr, which doesn't make sense. Which object would *ptr then refer to, ptr[0] or ptr[1]?
Even if a structure has no contents, the compiler should treat it as if it is one byte in length in order to maintain the "one address, one object" principle.
The C language specification (section A7.4.8) words this requirement as
when applied to a structure or union,
the result (of the sizeof operator)
is the number of bytes in the object,
including any padding required to make
the object tile an array
Since a padding byte must be added to an "empty" object in order for it to work in an array, sizeof() must therefore return a value of at least 1 for any valid input.
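As a rough illustration of the "tile an array" wording, here is a minimal sketch that measures the distance in bytes between adjacent array elements. The struct one_byte and both helper functions are made up for this example; a one-member struct is used because an empty struct is not valid standard C.

```c
#include <stddef.h>

/* Hypothetical struct for illustration; one member keeps it valid standard C. */
struct one_byte {
    char c;
};

/* Distance in bytes between two adjacent elements of an array. */
size_t element_stride(void)
{
    struct one_byte array[10];
    return (size_t)((char *)&array[1] - (char *)&array[0]);
}

/* Number of elements recovered from the array's total size. */
size_t element_count(void)
{
    struct one_byte array[10];
    return sizeof array / sizeof array[0];
}
```

The stride between elements equals sizeof(struct one_byte), which is why a size of zero would leave every element at the same address.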
Edit:
Section A8.3 of the C spec calls a struct without a list of members an incomplete type, and the definition of sizeof specifically states (with emphasis added):
The operator (sizeof) may not be
applied to an operand of function
type, or of incomplete type, or to a
bit-field.
That would imply that using sizeof on an empty struct would be equally as invalid as using it on a data type that has not been defined. If your compiler allows the use of empty structs, be aware that using sizeof on them is not allowed as per the C spec. If your compiler allows you to do this anyway, understand that this is non-standard behavior that will not work on all compilers; do not rely on this behavior.
Edit: See also this entry in Bjarne Stroustrup's FAQ.
Empty structs, as isbadawi mentions. Also gcc allows arrays of 0 size:
int a[0];
sizeof(a);
EDIT: After seeing the MSDN link, I tried the empty struct in VS2005 and sizeof did return 1. I'm not sure if that's a VS bug or if the spec is somehow flexible about that sort of thing
In my view, it would be better for sizeof to return 0 for a structure of size 0 (in the spirit of C).
But then the programmer has to be careful when taking the sizeof of an empty struct,
because it may cause a problem.
When an array of such structures is defined, then
&arr[1] == &arr[2] == &arr[0]
which makes them lose their identities.
I guess this doesn't directly answer your question of whether it is possible or not.
Well, that may be possible depending on the compiler (as said in Michael's answer above).
typedef struct {
int : 0;
} x;
x x1;
x x2;
Under MSVC 2010 (/Za /Wall):
sizeof(x) == 4
&x1 != &x2
Under GCC (-ansi -pedantic -Wall) :
sizeof(x) == 0
&x1 != &x2
i.e. Even though under GCC it has zero size, instances of the struct have distinct addresses.
ANSI C (C89 and C99 - I haven't looked at C++) says "It shall be possible to express the address of each individual byte of an object uniquely." This seems ambiguous in the case of a zero-sized object, since it arguably has no bytes.
Edit: "A bit-field declaration with no declarator, but only a colon and a width, indicates an unnamed bit-field. As a special case of this, a bit-field with a width of 0 indicates that no further bit-field is to be packed into the unit in which the previous bit-field, if any, was placed."
I think it never returns 0 in C, since empty structs are not allowed.
Here's a test, where sizeof yields 0
#include <stdio.h>
void func(int i)
{
int vla[i];
printf ("%u\n",(unsigned)sizeof vla);
}
int main(void)
{
func(0);
return 0;
}
If you have this :
struct Foo {};
struct Bar { Foo v[]; };
g++ -ansi reports sizeof(Bar) == 0, as do the Clang and Intel compilers.
However, this does not compile with gcc. I deduce it's a C++ extension.
struct Empty {
} em;
struct Zero {
Empty a[0];
} zr;
printf("em=%zu\n", sizeof(em)); /* sizeof yields size_t, so use %zu rather than %d */
printf("zr=%zu\n", sizeof(zr));
Result:
em=1
zr=0
Can I assume that a C/C++ struct pointer will always point to the first member?
Example 1:
typedef struct {
unsigned char array_a[2];
unsigned char array_b[5];
}test;
//..
test var;
//..
In the above example will &var always point to array_a?
Also in the above example is it possible to cast the pointer
to an unsigned char pointer and access each byte separately?
Example 2:
function((unsigned char *)&var,sizeof(test));
//...
//...
void function(unsigned char *array, int len){
int i;
for( i=0; i<len; i++){
array[i]++;
}
}
Will that work correctly?
Note: I know that chars are byte aligned in a struct therefore I assume the size of the above struct is 7 bytes.
For C structs, yes, you can rely on it. This is how almost all "object-oriented"-style APIs work in C (such as GObject and GTK).
For C++, you can rely on it only for "plain old data" (POD) types, which are guaranteed to be laid out in memory the same way as C structs. Exactly what constitutes a POD type is a little complicated and has changed between C++03 and C++11, but the crux of it is that if your type has any virtual functions then it's not a POD.
(In C++11 you can use std::is_pod to test at compile-time whether a struct is a POD type.)
EDIT: This tells you what constitutes a POD type in C++: http://en.cppreference.com/w/cpp/concept/PODType
EDIT2: Actually, in C++11, it doesn't need to be a POD, just "standard layout", which is a slightly weaker condition. Quoth section 9.2 [class.mem] paragraph 20 of the standard:
A pointer to a standard-layout struct object, suitably converted using a reinterpret_cast, points to its
initial member (or if that member is a bit-field, then to the unit in which it resides) and vice versa. [ Note:
There might therefore be unnamed padding within a standard-layout struct object, but not at its beginning,
as necessary to achieve appropriate alignment. — end note ]
From the C99 standard section 6.7.2.1 bullet point 13:
Within a structure object, the non-bit-field members and the units in
which bit-fields reside have addresses that increase in the order in
which they are declared. A pointer to a structure object, suitably
converted, points to its initial member (or if that member is a
bit-field, then to the unit in which it resides), and vice versa.
There may be unnamed padding within a structure object, but not at its
beginning.
The answer to your question is therefore yes.
Reference (see page 103)
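To make the quoted guarantee concrete, here is a small sketch using the struct from the question. The two helper functions are hypothetical names introduced for this example; the second mirrors the byte-wise loop from Example 2.

```c
#include <stddef.h>

/* Same layout as the struct in the question. */
typedef struct {
    unsigned char array_a[2];
    unsigned char array_b[5];
} test;

/* Returns 1 if a converted pointer to the struct equals a pointer to its first member. */
int points_to_first_member(const test *p)
{
    return (const void *)p == (const void *)p->array_a;
}

/* Increments every byte of an object through an unsigned char pointer. */
void increment_bytes(unsigned char *bytes, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        bytes[i]++;
    }
}
```

Passing sizeof(test) as the length, rather than a hard-coded 7, keeps the loop correct even if an implementation adds padding at the end of the struct.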
The compiler is free to add padding between members, and for C++ classes that are not standard-layout it has more freedom in how it arranges the object. In particular, in C++ you can add (virtual) functions, and then chances are that a pointer to the virtual table is hidden before the first member. But of course those are implementation details.
For C this assumption is valid.
For C, it's largely implementation-specific, but in practice the rule (in the absence of #pragma pack or something likewise) is:
Struct members are stored in the order they are declared. (This is required by the C99 standard, as mentioned here earlier.)
If necessary, padding is added before each struct member, to ensure correct alignment.
So a struct like
struct test{
char ch;
int i;
};
will typically have ch at offset 0, followed by three padding bytes so that i is aligned at offset 4 (assuming a 4-byte int with 4-byte alignment). Padding is then added at the end, if needed, to make the struct size a multiple of its alignment, which gives a size of 8 bytes here.
So at least in this case, for C, I think you can assume that the struct pointer will point to the first array.
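Rather than assuming particular offsets, you can ask the compiler. This is only a sketch: the three wrapper functions are hypothetical names, and the values they return are implementation-defined apart from the first member being at offset 0.

```c
#include <stddef.h>

struct test {
    char ch;
    int i;
};

/* Offset of the first member; the standard guarantees this is 0. */
size_t offset_of_ch(void) { return offsetof(struct test, ch); }

/* Offset of the second member; depends on the alignment of int. */
size_t offset_of_i(void) { return offsetof(struct test, i); }

/* Total size, including any trailing padding. */
size_t size_of_test(void) { return sizeof(struct test); }
```

On many common platforms with a 4-byte int these come out as 0, 4 and 8, but only the 0 is guaranteed.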
C++17 (expr.add/4) says:
When an expression that has integral type is added to or subtracted
from a pointer, the result has the type of the pointer operand. If the
expression P points to element x[i] of an array object x with n
elements, the expressions P + J and J + P (where J has the value j)
point to the (possibly-hypothetical) element x[i+j] if 0≤i+j≤n;
otherwise, the behavior is undefined. Likewise, the expression P - J
points to the (possibly-hypothetical) element x[i−j] if 0≤i−j≤n;
otherwise, the behavior is undefined.
struct Foo {
float x, y, z;
};
Foo f;
char *p = reinterpret_cast<char*>(&f) + offsetof(Foo, z); // (*)
*reinterpret_cast<float*>(p) = 42.0f;
Does the line marked with (*) have UB? reinterpret_cast<char*>(&f) doesn't point to a char array, but to a float, so it should be UB according to the cited paragraph. But, if it is UB, then offsetof's usefulness would be limited.
Is it UB? If not, why not?
The addition is intended to be valid, but I do not believe the standard manages to say so clearly enough. Quoting N4140 (roughly C++14):
3.9 Types [basic.types]
2 For any object (other than a base-class subobject) of trivially copyable type T, whether or not the object holds a valid value of type T, the underlying bytes (1.7) making up the object can be copied into an array
of char or unsigned char.42 [...]
42) By using, for example, the library functions (17.6.1.2) std::memcpy or std::memmove.
It says "for example" because std::memcpy and std::memmove are not the only ways in which the underlying bytes are intended to be allowed to be copied. A simple for loop which copies byte by byte manually is supposed to be valid as well.
In order for that to work, addition has to be defined for pointers to the raw bytes that make up an object, and the way definedness of expressions works, the addition's definedness cannot depend on whether the addition's result will subsequently be used to copy the bytes into an array.
Whether that means those bytes form an array already or whether this is a special exception to the general rules for the + operator that is somehow omitted in the operator description, is not clear to me (I suspect the former), but either way would make the addition you're performing in your code valid.
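The "simple for loop" mentioned above might look like the following sketch. It is written in C (the same question arises there), and copy_bytes is a hypothetical name, not a library function.

```c
#include <stddef.h>

struct Foo {
    float x, y, z;
};

/* Copies n bytes one at a time through unsigned char pointers. */
void copy_bytes(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++) {
        d[i] = s[i];   /* d + i and s + i are the additions under discussion */
    }
}
```

Calling copy_bytes(&b, &a, sizeof a) on two struct Foo objects relies on exactly the pointer additions whose definedness is being discussed here.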
Any interpretation that disallows the intended usage of offsetof must be wrong:
#include <assert.h>
#include <stddef.h>
struct S { float a, b, c; };
const size_t idx_S[] = {
offsetof(struct S, a),
offsetof(struct S, b),
offsetof(struct S, c),
};
float read_S(struct S *sp, unsigned int idx)
{
assert(idx < 3);
return *(float *)(((char *)sp) + idx_S[idx]); // intended to be valid
}
However, any interpretation that allows one to step past the end of an explicitly-declared array must also be wrong:
#include <assert.h>
#include <stddef.h>
struct S { float a[2]; float b[2]; };
static_assert(offsetof(struct S, b) == sizeof(float)*2,
"padding between S.a and S.b -- should be impossible");
float read_S(struct S *sp, unsigned int idx)
{
assert(idx < 4);
return sp->a[idx]; // undefined behavior if idx >= 2,
// reading past end of array
}
And we are now on the horns of a dilemma, because the wording in both the C and C++ standards, that was intended to disallow the second case, probably also disallows the first case.
This is commonly known as the "what is an object?" problem. People, including members of the C and C++ committees, have been arguing about this and related issues since the 1990s, and there have been multiple attempts to fix the wording, and to the best of my knowledge none has succeeded (in the sense that all existing "reasonable" code is rendered definitely conforming and all existing "reasonable" optimizations are still allowed).
(Note: All of the above code is written as it would be written in C to emphasize that the same problem exists in both languages, and can be encountered without the use of any C++ constructs.)
See CWG 1314
According to 6.9 [basic.types] paragraph 4,
The object representation of an object of type T is the sequence of N unsigned char objects taken up by the object of type T, where N equals sizeof(T).
and 4.5 [intro.object] paragraph 5,
An object of trivially copyable or standard-layout type (6.9 [basic.types]) shall occupy contiguous bytes of storage.
Do these passages make pointer arithmetic (8.7 [expr.add] paragraph 5) within a standard-layout object well-defined (e.g., for writing one's own version of memcpy)?
Rationale (August, 2011):
The current wording is sufficiently clear that this usage is permitted.
I strongly disagree with CWG's statement that "the current wording is sufficiently clear", but nevertheless, that's the ruling we have.
I interpret CWG's response as suggesting that a pointer to unsigned char into an object of trivially copyable or standard-layout type, for the purposes of pointer arithmetic, ought to be interpreted as a pointer to an array of unsigned char whose size equals the size of the object in question. I don't know whether they intended that it would also work using a char pointer or (as of C++17) a std::byte pointer. (Maybe if they had decided to actually clarify it instead of claiming the existing wording was clear enough, then I would know the answer.)
(A separate issue is whether std::launder is required to make the OP's code well-defined. I won't go into this here; I think it deserves a separate question.)
As far as I know, your code is valid. Aliasing an object as a char array is explicitly allowed as per § 3.10 ¶ 10.8:
If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined:
[…]
a char or unsigned char type.
The other question is whether casting the char* pointer back to float* and assigning through it is valid. Since your Foo is a POD type, this is okay. You are allowed to compute the address of a POD's member (given that the computation itself is not UB) and then access the member through that address. You must not abuse this to, for example, gain access to a private member of a non-POD object. Furthermore, it would be UB if you'd, say, cast to int* or write at an address where no object of type float exists. The reasoning behind this can be found in the section quoted above.
Yes, this is undefined. As you have stated in your question,
reinterpret_cast<char*>(&f) doesn't point to a char array, but to a float, ...
... reinterpret_cast<char*>(&f) does not even point to a char, so even if the object representation is a char array, the behavior is still undefined.
For offsetof, you can still use it like
struct Foo {
float x, y, z;
};
Foo f;
auto p = reinterpret_cast<std::uintptr_t>(&f) + offsetof(Foo, z);
// ^^^^^^^^^^^^^^
*reinterpret_cast<float*>(p) = 42.0f;
As an example, consider the following structure:
struct S {
int a[4];
int b[4];
} s;
Would it be legal to write s.a[6] and expect it to be equal to s.b[2]?
Personally, I feel that it must be UB in C++, whereas I'm not sure about C.
However, I failed to find anything relevant in the standards of C and C++ languages.
Update
There are several answers suggesting ways to make sure there is no padding
between fields in order to make the code work reliably. I'd like to emphasize
that if such code is UB, then absence of padding is not enough. If it is UB,
then the compiler is free to assume that accesses to S.a[i] and S.b[j] do not
overlap and the compiler is free to reorder such memory accesses. For example,
int x = s.b[2];
s.a[6] = 2;
return x;
can be transformed to
s.a[6] = 2;
int x = s.b[2];
return x;
which always returns 2.
Would it be legal to write s.a[6] and expect it to be equal to s.b[2]?
No, because accessing an array out of bounds invokes undefined behaviour in C and C++.
C11 J.2 Undefined behavior
Addition or subtraction of a pointer into, or just beyond, an array object and an integer type produces a result that points just beyond
the array object and is used as the operand of a unary * operator that
is evaluated (6.5.6).
An array subscript is out of range, even if an object is apparently accessible with the given subscript (as in the lvalue expression
a[1][7] given the declaration int a[4][5]) (6.5.6).
C++ standard draft section 5.7 Additive operators paragraph 5 says:
When an expression that has integral type is added to or subtracted
from a pointer, the result has the type of the pointer operand. If the
pointer operand points to an element of an array object, and the array
is large enough, the result points to an element offset from the
original element such that the difference of the subscripts of the
resulting and original array elements equals the integral expression.
[...] If both the pointer operand and the result point to elements
of the same array object, or one past the last element of the array
object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined.
Apart from the answer of @rsp (undefined behavior for an array subscript that is out of range), I can add that it is not legal to access b via a because the C language does not specify how much padding there can be between the end of the area allocated for a and the start of b, so even if it runs on a particular implementation, it is not portable.
instance of struct:
+-----------+----------------+-----------+---------------+
| array a | maybe padding | array b | maybe padding |
+-----------+----------------+-----------+---------------+
The second padding may also be absent, since the alignment of the struct object is the alignment of a, which is the same as the alignment of b, but the C language does not require the second padding to be absent either.
a and b are two different arrays, and a is defined as containing 4 elements. Hence, a[6] accesses the array out of bounds and is therefore undefined behaviour. Note that the array subscript a[6] is defined as *(a+6), so the proof of UB is actually given by the section "Additive operators" as it applies to pointers. See the following section of the C11 standard (e.g. this online draft version) describing this aspect:
6.5.6 Additive operators
When an expression that has integer type is added to or subtracted
from a pointer, the result has the type of the pointer operand. If the
pointer operand points to an element of an array object, and the array
is large enough, the result points to an element offset from the
original element such that the difference of the subscripts of the
resulting and original array elements equals the integer expression.
In other words, if the expression P points to the i-th element of an
array object, the expressions (P)+N (equivalently, N+(P)) and (P)-N
(where N has the value n) point to, respectively, the i+n-th and
i-n-th elements of the array object, provided they exist. Moreover, if
the expression P points to the last element of an array object, the
expression (P)+1 points one past the last element of the array object,
and if the expression Q points one past the last element of an array
object, the expression (Q)-1 points to the last element of the array
object. If both the pointer operand and the result point to elements
of the same array object, or one past the last element of the array
object, the evaluation shall not produce an overflow; otherwise, the
behavior is undefined. If the result points one past the last element
of the array object, it shall not be used as the operand of a unary *
operator that is evaluated.
The same argument applies to C++ (though not quoted here).
Further, though it is clearly undefined behaviour due to the fact of exceeding array bounds of a, note that the compiler might introduce padding between members a and b, such that - even if such pointer arithmetics were allowed - a+6 would not necessarily yield the same address as b+2.
Is it legal? No. As others mentioned, it invokes Undefined Behavior.
Will it work? That depends on your compiler. That's the thing about undefined behavior: it's undefined.
On many C and C++ compilers, the struct will be laid out such that b will immediately follow a in memory and there will be no bounds checking. So accessing a[6] will effectively be the same as b[2] and will not cause any sort of exception.
Given
struct S {
int a[4];
int b[4];
} s;
and assuming no extra padding, the structure is really just a way of looking at a block of memory containing 8 integers. You could cast its address to (int*) and ((int*)&s)[6] would point to the same memory as s.b[2].
Should you rely on this sort of behavior? Absolutely not. Undefined means that the compiler doesn't have to support this. The compiler is free to pad the structure which could render the assumption that &(s.b[2]) == &(s.a[6]) incorrect. The compiler could also add bounds checking on the array access (although enabling compiler optimizations would probably disable such a check).
I have experienced the effects of this in the past. It's quite common to have a struct like this
struct Bob {
char name[16];
char whatever[64];
} bob;
strcpy(bob.name, "some name longer than 16 characters");
Now bob.whatever will be " than 16 characters". (which is why you should always use strncpy, BTW)
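One way to avoid that overflow is a bounded copy. The sketch below uses snprintf instead of the strncpy mentioned above, because snprintf always writes the terminating null; set_name is a hypothetical helper, not part of the original code.

```c
#include <stdio.h>

struct Bob {
    char name[16];
    char whatever[64];
};

/* Copies src into bob->name, truncating so the terminator always fits.
   Returns 1 if the whole string fit, 0 if it was truncated or an error occurred. */
int set_name(struct Bob *bob, const char *src)
{
    int needed = snprintf(bob->name, sizeof bob->name, "%s", src);
    return needed >= 0 && (size_t)needed < sizeof bob->name;
}
```

With this helper the long name is cut to 15 characters plus the terminator, and bob.whatever is left alone.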
As @MartinJames mentioned in a comment, if you need to guarantee that a and b are in contiguous memory (or at least able to be treated as such, (edit) unless your architecture/compiler uses an unusual memory block size/offset and forced alignment that would require padding to be added), you need to use a union.
union overlap {
char all[8]; /* all the bytes in sequence */
struct { /* (anonymous struct so its members can be accessed directly) */
char a[4]; /* padding may be added after this if the alignment is not a sub-factor of 4 */
char b[4];
};
};
You can't directly access b from a (e.g. a[6], like you asked), but you can access the elements of both a and b by using all (e.g. all[6] refers to the same memory location as b[2]).
(Edit: You could replace 8 and 4 in the code above with 2*sizeof(int) and sizeof(int), respectively, to be more likely to match the architecture's alignment, especially if the code needs to be more portable, but then you have to be careful to avoid making any assumptions about how many bytes are in a, b, or all. However, this will work on what are probably the most common (1-, 2-, and 4-byte) memory alignments.)
Here is a simple example:
#include <stdio.h>
union overlap {
char all[2*sizeof(int)]; /* all the bytes in sequence */
struct { /* anonymous struct so its members can be accessed directly */
char a[sizeof(int)]; /* low word */
char b[sizeof(int)]; /* high word */
};
};
int main()
{
union overlap testing;
testing.a[0] = 'a';
testing.a[1] = 'b';
testing.a[2] = 'c';
testing.a[3] = '\0'; /* null terminator */
testing.b[0] = 'e';
testing.b[1] = 'f';
testing.b[2] = 'g';
testing.b[3] = '\0'; /* null terminator */
printf("a=%s\n",testing.a); /* output: a=abc */
printf("b=%s\n",testing.b); /* output: b=efg */
printf("all=%s\n",testing.all); /* output: all=abc */
testing.a[3] = 'd'; /* makes printf keep reading past the end of a */
printf("a=%s\n",testing.a); /* output: a=abcdefg */
printf("b=%s\n",testing.b); /* output: b=efg */
printf("all=%s\n",testing.all); /* output: all=abcdefg */
return 0;
}
No, since accessing an array out of bounds invokes Undefined Behavior, both in C and C++.
Short Answer: No. You're in the land of undefined behavior.
Long Answer: No. But that doesn't mean that you can't access the data in other sketchier ways... if you're using GCC you can do something like the following (elaboration of dwillis's answer):
struct __attribute__((packed,aligned(4))) Bad_Access {
int arr1[3];
int arr2[3];
};
and then you could access via (Godbolt source+asm):
int x = ((int*)ba_pointer)[4];
But that cast violates strict aliasing so is only safe with g++ -fno-strict-aliasing. You can cast a struct pointer to a pointer to the first member, but then you're back in the UB boat because you're accessing outside the first member.
Alternatively, just don't do that. Save a future programmer (probably yourself) the heartache of that mess.
Also, while we're at it, why not use std::vector? It's not fool-proof, but on the back-end it has guards to prevent such bad behavior.
Addendum:
If you're really concerned about performance:
Let's say you have two same-typed pointers that you're accessing. The compiler will more than likely assume that both pointers have the chance to interfere, and will instantiate additional logic to protect you from doing something dumb.
If you solemnly swear to the compiler that you're not trying to alias, the compiler will reward you handsomely:
Does the restrict keyword provide significant benefits in gcc / g++
Conclusion: Don't be evil; your future self, and the compiler will thank you.
Jed Schaff’s answer is on the right track, but not quite correct. If the compiler inserts padding between a and b, his solution will still fail. If, however, you declare:
typedef struct {
int a[4];
int b[4];
} s_t;
typedef union {
char bytes[sizeof(s_t)];
s_t s;
} u_t;
You may now access (int*)(bytes + offsetof(s_t, b)) to get the address of s.b, no matter how the compiler lays out the structure. The offsetof() macro is declared in <stddef.h>.
The expression sizeof(s_t) is a constant expression, legal in an array declaration in both C and C++. It will not give a variable-length array. (Apologies for misreading the C standard before. I thought that sounded wrong.)
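The union-plus-offsetof access described above could be sketched like this. b_via_bytes is a hypothetical helper name; the types are the ones declared in this answer.

```c
#include <stddef.h>

typedef struct {
    int a[4];
    int b[4];
} s_t;

typedef union {
    char bytes[sizeof(s_t)];
    s_t s;
} u_t;

/* Address of s.b computed from the byte view and offsetof,
   independent of any padding the compiler inserts after s.a. */
int *b_via_bytes(u_t *u)
{
    return (int *)(u->bytes + offsetof(s_t, b));
}
```

The returned pointer compares equal to u->s.b whatever the layout, because offsetof reports the offset the compiler actually chose.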
In the real world, though, two consecutive arrays of int in a structure are going to be laid out the way you expect. (You might be able to engineer a very contrived counterexample by setting the bound of a to 3 or 5 instead of 4 and then getting the compiler to align both a and b on a 16-byte boundary.) Rather than convoluted methods to try to get a program that makes no assumptions whatsoever beyond the strict wording of the standard, you want some kind of defensive coding, such as static_assert(&both_arrays[4] == &s.b[0], "");. These add no run-time overhead and will fail if your compiler is doing something that would break your program, so long as you don’t trigger UB in the assertion itself.
If you want a portable way to guarantee that both sub-arrays are packed into a contiguous memory range, or split a block of memory the other way, you can copy them with memcpy().
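A minimal sketch of that memcpy approach, assuming the struct S from the question; flatten is a hypothetical helper name.

```c
#include <string.h>

struct S {
    int a[4];
    int b[4];
};

/* Copies both member arrays into one contiguous array of 8 ints,
   regardless of any padding between a and b inside the struct. */
void flatten(const struct S *s, int out[8])
{
    memcpy(out, s->a, sizeof s->a);
    memcpy(out + 4, s->b, sizeof s->b);
}
```

After the call, out[6] holds the same value as s.b[2] without ever indexing s.a out of bounds.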
The Standard does not impose any restrictions upon what implementations must do when a program tries to use an out-of-bounds array subscript in one structure field to access a member of another. Out-of-bounds accesses are thus "illegal" in strictly conforming programs, and programs which make use of such accesses cannot simultaneously be 100% portable and free of errors. On the other hand, many implementations do define the behavior of such code, and programs which are targeted solely at such implementations may exploit such behavior.
There are three issues with such code:
While many implementations lay out structures in predictable fashion, the Standard allows implementations to add arbitrary padding before any structure member other than the first. Code could use sizeof or offsetof to ensure that structure members are placed as expected, but the other two issues would remain.
Given something like:
if (structPtr->array1[x])
structPtr->array2[y]++;
return structPtr->array1[x];
it would normally be useful for a compiler to assume that the use of structPtr->array1[x] will yield the same value as the preceding use in the "if" condition, even though it would change the behavior of code that relies upon aliasing between the two arrays.
If array1[] has e.g. 4 elements, a compiler given something like:
if (x < 4) foo(x);
structPtr->array1[x]=1;
might conclude that since there would be no defined cases where x isn't less than 4, it could call foo(x) unconditionally.
Unfortunately, while programs can use sizeof or offsetof to ensure that there aren't any surprises with struct layout, there's no way by which they can test whether compilers promise to refrain from the optimizations of types #2 or #3. Further, the Standard is a little vague about what would be meant in a case like:
struct foo {char array1[4],array2[4]; };
int test(struct foo *p, int i, int x, int y, int z)
{
if (p->array2[x])
{
((char*)p)[x]++;
((char*)(p->array1))[y]++;
p->array1[z]++;
}
return p->array2[x];
}
The Standard is pretty clear that behavior would only be defined if z is in the range 0..3, but since the type of p->array1 in that expression is char* (due to decay) it's not clear the cast in the access using y would have any effect. On the other hand, since converting a pointer to the first element of a struct to char* should yield the same result as converting a struct pointer to char*, and the converted struct pointer should be usable to access all bytes therein, it would seem the access using x should be defined for (at minimum) x=0..7 [if the offset of array2 is greater than 4, it would affect the value of x needed to hit members of array2, but some value of x could do so with defined behavior].
IMHO, a good remedy would be to define the subscript operator on array types in a fashion that does not involve pointer decay. In that case, the expressions p->array1[x] and &(p->array1[x]) could invite a compiler to assume that x is 0..3, but p->array1+x and *(p->array1+x) would require a compiler to allow for the possibility of other values. I don't know if any compilers do that, but the Standard doesn't require it.
Can I assume that a C/C++ struct pointer will always point to the first member?
Example 1:
typedef struct {
unsigned char array_a[2];
unsigned char array_b[5];
}test;
//..
test var;
//..
In the above example will &var always point to array_a?
Also in the above example is it possible to cast the pointer
to an unsigned char pointer and access each byte separately?
Example 2:
function((unsigned char *)&var,sizeof(test));
//...
//...
void function(unsigned char *array, int len){
int i;
for( i=0; i<len; i++){
array[i]++;
}
}
Will that work correctly?
Note: I know that chars are byte aligned in a struct therefore I assume the size of the above struct is 7 bytes.
For C structs, yes, you can rely on it. This is how almost all "object orientated"-style APIs work in C (such as GObject and GTK).
For C++, you can rely on it only for "plain old data" (POD) types, which are guaranteed to be laid out in memory the same way as C structs. Exactly what constitutes a POD type is a little complicated and has changed between C++03 and C++11, but the crux of it is that if your type has any virtual functions then it's not a POD.
(In C++11 you can use std::is_pod to test at compile-time whether a struct is a POD type.)
EDIT: This tells you what constitutes a POD type in C++: http://en.cppreference.com/w/cpp/concept/PODType
EDIT2: Actually, in C++11, it doesn't need to be a POD, just "standard layout", which is a lightly weaker condition. Quoth section 9.2 [class.mem] paragraph 20 of the standard:
A pointer to a standard-layout struct object, suitably converted using a reinterpret_cast, points to its
initial member (or if that member is a bit-field, then to the unit in which it resides) and vice versa. [ Note:
There might therefore be unnamed padding within a standard-layout struct object, but not at its beginning,
as necessary to achieve appropriate alignment. — end note ]
From the C99 standard section 6.7.2.1 bullet point 13:
Within a structure object, the non-bit-field members and the units in
which bit-fields reside have addresses that increase in the order in
which they are declared. A pointer to a structure object, suitably
converted, points to its initial member (or if that member is a
bit-field, then to the unit in which it resides), and vice versa.
There may be unnamed padding within a structure object, but not at its
beginning.
The answer to your question is therefore yes.
Reference (see page 103)
The compiler is free to add padding between members and after the last one, but within a plain C struct it may not reorder the members or put padding before the first one. In C++, once you add (virtual) functions, chances are that a hidden virtual-table pointer is placed before the first member. But of course those are implementation details.
For C this assumption is valid.
For C, it's largely implementation-specific, but in practice the rule (in the absence of #pragma pack or something likewise) is:
Struct members are stored in the order they are declared. (This is required by the C99 standard, as mentioned here earlier.)
If necessary, padding is added before each struct member, to ensure correct alignment.
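So a struct like
struct test{
char ch;
int i;
};
will, on a typical platform with a 4-byte int, have ch at offset 0, then three padding bytes so that i is aligned, and i at offset 4, for a total size of 8 bytes. Where needed, padding bytes are also added at the end so that the struct size is a multiple of the struct's alignment (4 here; a struct containing an 8-byte member would typically be padded to a multiple of 8 on a 64-bit machine, while some 32-bit machines permit 4-byte alignment for such members).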
So at least in this case, for C, I think you can assume that the struct pointer will point to the first array.
Once again, I'm questioning a longstanding belief.
Until today, I believed that the alignment of the following struct would normally be 4 and the size would normally be 5...
struct example
{
int m_Assume_32_Bits;
char m_Assume_8_Bit_Bytes;
};
Because of this assumption, I have data structure code that uses offsetof to determine the distance in bytes between two adjacent items in an array. Today, I spotted some old code that was using sizeof where it shouldn't, couldn't understand why I hadn't had bugs from it, coded up a unit test - and the test surprised me by passing.
A bit of investigation showed that the sizeof the type I used for the test (similar to the struct above) was an exact multiple of the alignment - ie 8 bytes. It had padding after the final member. Here is an example of why I never expected this...
struct example2
{
example m_Example;
char m_Why_Cant_This_Be_At_Offset_6_Bytes;
};
A bit of Googling showed examples that make it clear that this padding after the final member is allowed - for example http://en.wikipedia.org/wiki/Data_structure_alignment#Data_structure_padding (the "or at the end of the structure" bit).
This is a bit embarrassing, as I recently posted this comment - Use of struct padding (my first comment to that answer).
What I can't seem to determine is whether this padding to an exact multiple of the alignment is guaranteed by the C++ standard, or whether it is just something that is permitted and that some (but maybe not all) compilers do.
So - is the size of a struct required to be an exact multiple of the alignment of that struct according to the C++ standard?
If the C standard makes different guarantees, I'm interested in that too, but the focus is on C++.
5.3.3/2
When applied to a class, the result [of sizeof] is the number of bytes in an object of that class, including any padding required for placing objects of that type in an array.
So yes, object size is a multiple of its alignment.
One definition of alignment size:
The alignment size of a struct is the offset from one element to the next element when you have an array of that struct.
By its nature, if you have an array of a struct with two elements, then both need to have aligned members, so that means that yes, the size has to be a multiple of the alignment. (I'm not sure if any standard explicitly enforces this, but because the size and alignment of a struct don't depend on whether the struct is alone or inside an array, the same rules apply to both, so it can't really be any other way.)
The standard says (section [dcl.array]):
An object of array type contains a contiguously allocated non-empty set of N subobjects of type T.
Therefore there is no padding between array elements.
Padding inside structures is not required by the standard, but the standard doesn't permit any other way of aligning array elements.
I am unsure if this is in the actual C/C++ standard, and I am inclined to say that it is up to the compiler (just to be on the safe side). However, I had a "fun" time figuring that out a few months ago, where I had to send dynamically generated C structs as byte arrays across a network as part of a protocol, to communicate with a chip. The alignment and size of all the structs had to be consistent with the structs in the code running on the chip, which was compiled with a variant of GCC for the MIPS architecture. I'll attempt to give the algorithm, and it should apply to all variants of gcc (and hopefully most other compilers).
All base types, like char, short and int align to their size, and they align to the next available position, regardless of the alignment of the parent. And to answer the original question, yes the total size is a multiple of the alignment.
// size 8
struct {
char A; //byte 0
char B; //byte 1
int C; //byte 4
};
Even though the alignment of the struct is 4 bytes, the chars are still packed as close as possible.
The alignment of a struct is equal to the largest alignment of its members.
Example:
//size 4, but alignment is 2!
struct foo {
char A; //byte 0
char B; //byte 1
short C; //byte 2
};
//size 6
struct bar {
char A; //byte 0
struct foo B; //byte 2
};
This also applies to unions, and in a curious way. The size of a union can be larger than any of the sizes of its members, simply due to alignment:
//size 3, alignment 1
struct foo {
char A; //byte 0
char B; //byte 1
char C; //byte 2
};
//size 2, alignment 2
struct bar {
short A; //byte 0
};
//size 4! alignment 2
union foobar {
struct foo A;
struct bar B;
};
Using these simple rules, you should be able to figure out the alignment/size of any horribly nested union/struct you come across. This is all from memory, so if I have missed a corner case that can't be decided from these rules please let me know!
C++ doesn't explicitly say so, but it is a consequence of two other requirements:
First, all objects must be well-aligned.
3.8/1 says
The lifetime of an object of type T begins when [...] storage with the proper alignment and size for type T is obtained
and 3.9/5:
Object types have alignment requirements (3.9.1, 3.9.2). The alignment of a complete object type is an implementation-defined integer value representing a number of bytes; an object is allocated at an address that meets the alignment requirements of its object type.
So every object must be aligned according to its alignment requirements.
The other requirement is that objects in an array are allocated contiguously:
8.3.4/1:
An object of array type contains a contiguously allocated non-empty set of N subobjects of type T.
For the objects in an array to be contiguously allocated, there can be no padding between them. But for every object in the array to be properly aligned, each individual object must be padded so that the byte immediately after the end of the object is also well aligned. In other words, the size of the object must be a multiple of its alignment.
So to split your question up into two:
1. Is it legal?
[5.3.3.2] When applied to a class, the result [of the sizeof() operator] is the number of bytes in an object of that class including any padding required for placing objects of that type in an array.
So, no, it's not.
2. Well, why isn't it?
Here, I can only speculate.
2.1. Pointer arithmetic gets weirder
If alignment padding sat "between array elements" but did not affect the size, things would get needlessly complicated, e.g.
(char *)(X+1) != ((char *)X) + sizeof(*X)
(I have a hunch that this is required implicitly by the standard even without the above statement, but I can't prove it.)
2.2 Simplicity
If alignment affects size, alignment and size can be decided by looking at a single type. Consider this:
struct A { int x; char y; };
struct B { A left, right; };
With the current standard, I just need to know sizeof(A) to determine size and layout of B.
With the alternate you suggest I need to know the internals of A. Similar to your example2: for a "better packing", sizeof(example) is not enough, you need to consider the internals of example.
It is possible to produce a C or C++ typedef whose size is not a multiple of its alignment. This came up recently in this bindgen bug. Here's a minimal example, which I'll call test.c below:
#include <stdio.h>
#include <stdalign.h>
__attribute__ ((aligned(4))) typedef struct {
char x[3];
} WeirdType;
int main() {
printf("sizeof(WeirdType) = %zu\n", sizeof(WeirdType));
printf("alignof(WeirdType) = %zu\n", alignof(WeirdType));
return 0;
}
On my Arch Linux x86_64 machine, gcc -dumpversion && gcc test.c && ./a.out prints:
9.3.0
sizeof(WeirdType) = 3
alignof(WeirdType) = 4
Similarly clang -dumpversion && clang test.c && ./a.out prints:
9.0.1
sizeof(WeirdType) = 3
alignof(WeirdType) = 4
Saving the file as test.cc and using g++/clang++ gives the same result. (Update from a couple years later: I get the same results from GCC 11.1.0 and Clang 13.0.0.)
Notably however, MSVC on Windows does not seem to reproduce any behavior like this.
The standard says very little about padding and alignment. Very little is guaranteed. About the only thing you can bet on is that the first element is at the beginning of the structure. After that...alignment and padding can be anything.
It seems the C++03 standard doesn't say (or I didn't find where it says) whether the alignment padding bytes are included in the object representation.
The C99 standard says the "sizeof" a struct type or union type includes internal and trailing padding, but I'm not sure whether all alignment padding is included in that "trailing padding".
Now back to your example. There is really no confusion. sizeof(example) == 8 means the structure does take 8 bytes to represent itself, including the 3 trailing padding bytes. If the char in the second structure had an offset of 6, it would overlap the space used by m_Example. The layout of a given type is implementation-defined, and must stay the same throughout the implementation.
Still, whether p+1 equals (T*)((char*)p + sizeof(T)) is unclear to me. And I'm hoping to find the answer.