I'm working on some application that requires big chunk of memory. To decrease memory usage I've switched alignment for huge structure to 1 byte (#pragma pack(1)).
After this my struct size was around 10-15% smaller but some problem appeared.
When I try to use one of fields of my structure through pointer or reference application just crashes. If I change field directly it work ok.
In test application I found out that problem start to appear after using smaller then 4 bytes field in struct.
Test code:
#pragma pack(1)
struct TestStruct
{
struct
{
long long lLongLong;
long lLong;
//bool lBool; // << if uncommented than crash
//short lShort; // << if uncommented than crash
//char lChar; // << if uncommented than crash
//unsigned char lUChar; // << if uncommented than crash
//byte lByte; // << if uncommented than crash
__int64 lInt64;
unsigned int Int;
unsigned int Int2;
} General;
};
struct TestStruct1
{
TestStruct lT[5];
};
#pragma pack()
void TestFunct(unsigned int &pNewLength)
{
std::cout << pNewLength << std::endl;
std::cout << "pNL pointer: " << &pNewLength << std::endl;
pNewLength = 7; // << crash
char *lPointer = (char *)&pNewLength;
*lPointer = 0x32; // << or crash here
}
int _tmain(int argc, _TCHAR* argv[])
{
std::cout << sizeof(TestStruct1) << std::endl;
TestStruct1 *lTest = new TestStruct1();
TestFunct(lTest->lT[4].General.Int);
std::cout << lTest->lT[4].General.Int << std::endl;
char lChar;
std::cin >> lChar;
return 0;
}
Compiling this code on ARM (WinCE 6.0) result in crash. Same code on Windows x86 work ok. Changing pack(1) to pack(4) or just pack() resolve this problem but structure is larger.
Why this alignment causes problem ?
You can fix it (to run on WCE with ARM) by using __unaligned keyword, I was able to compile this code with VS2005 and successfully run on WM5 device, by changing:
void TestFunct(unsigned int &pNewLength)
to
void TestFunct(unsigned int __unaligned &pNewLength)
using this keyword will more than double instructions count but will allow to use any legacy structures.
more on this here:
http://msdn.microsoft.com/en-us/library/aa448596.aspx
ARM architectures only support aligned memory accesses. This means four-Byte types can only be read and written at addresses that are a multiple of 4. For two-Byte types, the address must be a multiple of two. Any attempt at unaligned memory access will normally reward you with a DATATYPE_MISALIGNMENT exception and a subsequent crash.
Now you might wonder why you only started seeing crashes when passing your unaligned structure members around as pointers and references; this has to do with the compiler. As long as you directly access the fields in the structure, it knows that you are accessing unaligned data and deals with it by transparently reading and writing the data in several aligned chunks that are split and reassembled. I have seen eVC++ do this to write a four-Byte structure member that was two-Byte-aligned: the generated assembly instructions split the integer into separate, two-Bytes pieces and writes them separately.
The compiler does not know whether a pointer or reference is aligned or not, so as soon as you pass unaligned structure fields around as pointers or references, there is no way for it to know these should be treated in a special manner. It will treat them as aligned data and will access them accordingly, which leads to crashes when the address is unaligned.
As marcin_j mentioned, it is possible to work around this by telling the compiler a particular pointer/reference is unaligned with the __unaligned keyword, or rather the UNALIGNED macro which does nothing on platforms that do not need it. It basically tells the compiler to be careful with the pointer in a way that I assume it similar to the way unaligned structure members are accessed.
A naïve approach would be to plaster UNALIGNED all over your code, but is not recommended because it can incur a performance penalty: any data access with __unaligned will need several memory read/writes whereas the aligned version needs only one. UNALIGNED is rather typically only used at places in the code where it is known that unaligned data will be passed around, and left out elsewhere.
On x86, unaligned access is slow. ARM flat out can't do it. Your small types break the alignment of the next element.
Not that it matters. The overhead is unlikely to be more than 3 bytes, if you sort your members by size.
Related
This question already has answers here:
Compile-time check to make sure that there is no padding anywhere in a struct
(4 answers)
Closed 3 years ago.
Lets consider the following task:
My C++ module as part of an embedded system receives 8 bytes of data, like: uint8_t data[8].
The value of the first byte determines the layout of the rest (20-30 different). In order to get the data effectively, I would create different structs for each layout and put each to a union and read the data directly from the address of my input through a pointer like this:
struct Interpretation_1 {
uint8_t multiplexer;
uint8_t timestamp;
uint32_t position;
uint16_t speed;
};
// and a lot of other struct like this (with bitfields, etc..., layout is not defined by me :( )
union DataInterpreter {
Interpretation_1 movement;
//Interpretation_2 temperatures;
//etc...
};
...
uint8_t exampleData[8] {1u, 10u, 20u,0u,0u,0u, 5u,0u};
DataInterpreter* interpreter = reinterpret_cast<DataInterpreter*>(&exampleData);
std::cout << "position: " << +interpreter->movement.position << "\n";
The problem I have is, the compiler can insert padding bytes to the interpretation structs and this kills my idea. I know I can use
with gcc: struct MyStruct{} __attribute__((__packed__));
with MSVC: I can use #pragma pack(push, 1) MyStruct{}; #pragma pack(pop)
with clang: ? (I could check it)
But is there any portable way to achieve this? I know c++11 has e.g. alignas for alignment control, but can I use it for this? I have to use c++11 but I would be just interested if there is a better solution with later version of c++.
But is there any portable way to achieve this?
No, there is no (standard) way to "make" a type that would have padding to not have padding in C++. All objects are aligned at least as much as their type requires and if that alignment doesn't match with the previous sub objects, then there will be padding and that is unavoidable.
Furthermore, there is another problem: You're accessing through a reinterpreted pointed that doesn't point to an object of compatible type. The behaviour of the program is undefined.
We can conclude that classes are not generally useful for representing arbitrary binary data. The packed structures are non-standard, and they also aren't compatible across different systems with different representations for integers (byte endianness).
There is a way to check whether a type contains padding: Compare the size of the sub objects to the size of the complete object, and do this recursively to each member. If the sizes don't match, then there is padding. This is quite tricky however because C++ has minimal reflection capabilities, so you need to resort either hard coding or meta programming.
Given such check, you can make the compilation fail on systems where the assumption doesn't hold.
Another handy tool is std::has_unique_object_representations (since C++17) which will always be false for all types that have padding. But note that it will also be false for types that contain floats for example. Only types that return true can be meaningfully compared for equality with std::memcmp.
Reading from unaligned memory is undefined behavior in C++. In other words, the compiler is allowed to assume that every uint32_t is located at a alignof(uint32_t)-byte boundary and every uint16_t is located at a alignof(uint16_t)-byte boundary. This means that if you somehow manage to pack your bytes portably, doing interpreter->movement.position will still trigger undefined behaviour.
(In practice, on most architectures, unaligned memory access will still work, but albeit incur a performance penalty.)
You could, however, write a wrapper, like how std::vector<bool>::operator[] works:
#include <cstdint>
#include <cstring>
#include <iostream>
#include <type_traits>
template <typename T>
struct unaligned_wrapper {
static_assert(std::is_trivial<T>::value);
std::aligned_storage_t<sizeof(T), 1> buf;
operator T() const noexcept {
T ret;
memcpy(&ret, &buf, sizeof(T));
return ret;
}
unaligned_wrapper& operator=(T t) noexcept {
memcpy(&buf, &t, sizeof(T));
return *this;
}
};
struct Interpretation_1 {
unaligned_wrapper<uint8_t> multiplexer;
unaligned_wrapper<uint8_t> timestamp;
unaligned_wrapper<uint32_t> position;
unaligned_wrapper<uint16_t> speed;
};
// and a lot of other struct like this (with bitfields, etc..., layout is not defined by me :( )
union DataInterpreter {
Interpretation_1 movement;
//Interpretation_2 temperatures;
//etc...
};
int main(){
uint8_t exampleData[8] {1u, 10u, 20u,0u,0u,0u, 5u,0u};
DataInterpreter* interpreter = reinterpret_cast<DataInterpreter*>(&exampleData);
std::cout << "position: " << interpreter->movement.position << "\n";
}
This would ensure that every read or write to the unaligned integer is transformed to a bytewise memcpy, which does not have any alignment requirement. There might be a performance penalty for this on architectures with the ability to access unaligned memory quickly, but it would work on any conforming compiler.
I'm trying to cast a struct into a char vector.
I wanna send my struct casted in std::vector throw a UDP socket and cast it back on the other side. Here is my struct whith the PACK attribute.
#define PACK( __Declaration__ ) __pragma( pack(push, 1) ) __Declaration__ __pragma( pack(pop) )
PACK(struct Inputs
{
uint8_t structureHeader;
int16_t x;
int16_t y;
Key inputs[8];
});
Here is test code:
auto const ptr = reinterpret_cast<char*>(&in);
std::vector<char> buffer(ptr, ptr + sizeof in);
//send and receive via udp
Inputs* my_struct = reinterpret_cast<Inputs*>(&buffer[0]);
The issue is:
All works fine except my uint8_t or int8_t.
I don't know why but whenever and wherever I put a 1Bytes value in the struct,
when I cast it back the value is not readable (but the others are)
I tried to put only 16bits values and it works just fine even with the
maximum values so all bits are ok.
I think this is something with the alignment of the bytes in the memory but i can't figure out how to make it work.
Thank you.
I'm trying to cast a struct into a char vector.
You cannot cast an arbitrary object to a vector. You can cast your object to an array of char and then copy that array into a vector (which is actually what your code is doing).
auto const ptr = reinterpret_cast<char*>(&in);
std::vector<char> buffer(ptr, ptr + sizeof in);
That second line defines a new vector and initializes it by copying the bytes that represent your object into it. This is reasonable, but it's distinct from what you said you were trying to do.
I think this is something with the alignment of the bytes in the memory
This is good intuition. If you hadn't told the compiler to pack the struct, it would have inserted padding bytes to ensure each field starts at its natural alignment. The fact that the operation isn't reversible suggests that somehow the receiving end isn't packed exactly the same way. Are you sure the receiving program has exactly the same packing directive and struct layout?
On x86, you can get by with unaligned data, but you may pay a large performance cost whenever you access an unaligned member variable. With the packing set to one, and the first field being odd-sized, you've guaranteed that the next fields will be unaligned. I'd urge you to reconsider this. Design the struct so that all the fields fall at their natural alignment boundaries and that you don't need to adjust the packing. This may make your struct a little bigger, but it will avoid all the alignment and performance problems.
If you want to omit the padding bytes in your wire format, you'll have to copy the relevant fields byte by byte into the wire format and then copy them back out on the receiving end.
An aside regarding:
#define PACK( __Declaration__ ) __pragma( pack(push, 1) ) __Declaration__ __pragma( pack(pop) )
Identifiers that begin with underscore and a capital letter or with two underscores are reserved for "the implementation," so you probably shouldn't use __Declaration__ as the macro's parameter name. ("The implementation" refers to the compiler, the standard library, and any other runtime bits the compiler requires.)
1
vector class has dynamically allocated memory and uses pointers inside. So you can't send the vector (but you can send the underlying array)
2
SFML has a great class for doing this called sf::packet. It's free, open source, and cross-platform.
I was recently working on a personal cross platform socket library for use in other personal projects and I eventually quit it for SFML. There's just TOO much to test, I was spending all my time testing to make sure stuff worked and not getting any work done on the actual projects I wanted to do.
3
memcpy is your best friend. It is designed to be portable, and you can use that to your advantage.
You can use it to debug. memcpy the thing you want to see into a char array and check that it matches what you expect.
4
To save yourself from having to do tons of robustness testing, limit yourself to only chars, 32-bit integers, and 64-bit doubles. You're using different compilers? struct packing is compiler and architecture dependent. If you have to use a packed struct, you need to guarantee that the packing is working as expected on all platforms you will be using, and that all platforms have the same endianness. Obviously, that's what you're having trouble with and I'm sorry I can't help you more with that. I would I would recommend regular serializing and would definitely avoid struct packing if I was trying to make portable sockets.
If you can make those guarantees that I mentioned, sending is really easy on LINUX.
// POSIX
void send(int fd, Inputs& input)
{
int error = sendto(fd, &input, sizeof(input), ..., ..., ...);
...
}
winsock2 uses a char* instead of a void* :(
void send(int fd, Inputs& input)
{
char buf[sizeof(input)];
memcpy(buf, &input, sizeof(input));
int error = sendto(fd, buf, sizeof(input), ..., ..., ...);
...
}
Did you tried the most simple approach of:
unsigned char *pBuff = (unsigned char*)∈
for (unsigned int i = 0; i < sizeof(Inputs); i++) {
vecBuffer.push_back(*pBuff);
pBuff++;
}
This would work for both, pack and non pack, since you will iterate the sizeof.
What would be the most efficient way to read a UInt32 value from an arbitrary memory address in C++? (Assuming Windows x86 or Windows x64 architecture.)
For example, consider having a byte pointer that points somewhere in memory to block that contains a combination of ints, string data, etc., all mixed together. The following sample shows reading the various fields from this block in a loop.
typedef unsigned char* BytePtr;
typedef unsigned int UInt32;
...
BytePtr pCurrent = ...;
while ( *pCurrent != 0 )
{
...
if ( *pCurrent == ... )
{
UInt32 nValue = *( (UInt32*) ( pCurrent + 1 ) ); // line A
...
}
pCurrent += ...;
}
If at line A, pPtr happens to contain a 4-byte-aligned address, reading the UInt32 should be a single memory read. If pPtr contains a non-aligned address, more than one memory cycles my be needed which slows the code down. Is there a faster way to read the value from non-aligned addresses?
I'd recommend memcpy into a temporary of type UInt32 within your loop.
This takes advantage of the fact that a four byte memcpy will be inlined by the compiler when building with optimization enabled, and has a few other benefits:
If you are on a platform where alignment matters (hpux, solaris sparc, ...) your code isn't going to trap.
On a platform where alignment matters there it may be worthwhile to do an address check for alignment then one of a regular aligned load or a set of 4 byte loads and bit ors. Your compiler's memcpy very likely will do this the optimal way.
If you are on a platform where an unaligned access is allowed and doesn't hurt performance (x86, x64, powerpc, ...), you are pretty much guarenteed that such a memcpy is then going to be the cheapest way to do this access.
If your memory was initially a pointer to some other data structure, your code may be undefined because of aliasing problems, because you are casting to another type and dereferencing that cast. Run time problems due to aliasing related optimization issues are very hard to track down! Presuming that you can figure them out, fixing can also be very hard in established code and you may have to use obscure compilation options like -fno-strict-aliasing or -qansialias, which can limit the compiler's optimization ability significantly.
Your code is undefined behaviour.
Pretty much the only "correct" solution is to only read something as a type T if it is a type T, as follows:
uint32_t n;
char * p = point_me_to_random_memory();
std::copy(p, p + 4, reinterpret_cast<char*>(&n));
std::cout << "The value is: " << n << std::endl;
In this example, you want to read an integer, and the only way to do that is to have an integer. If you want it to contain a certain binary representation, you need to copy that data to the address starting at the beginning of the variable.
Let the compiler do the optimizing!
UInt32 ReadU32(unsigned char *ptr)
{
return static_cast<UInt32>(ptr[0]) |
(static_cast<UInt32>(ptr[1])<<8) |
(static_cast<UInt32>(ptr[2])<<16) |
(static_cast<UInt32>(ptr[3])<<24);
}
Please have a look a the following code sample, executed on a Windows-32 system using Visual Studio 2010:
#include <iostream>
using namespace std;
class LogicallyClustered
{
bool _fA;
int _nA;
char _cA;
bool _fB;
int _nB;
char _cB;
};
class TypeClustered
{
bool _fA;
bool _fB;
char _cA;
char _cB;
int _nA;
int _nB;
};
int main(int argc, char* argv[])
{
cout << sizeof(LogicallyClustered) << endl; // 20
cout << sizeof(TypeClustered) << endl; // 12
return 0;
}
Question 1
The sizeof the two classes varies because the compiler is inserting padding bytes to achieve an optimized memory allignment of the variables. Is this correct?
Question 2
Why is the memory footprint smaller if I cluster the variables by type as in class TypeClustered?
Question 3
Is it a good rule of thumb to always cluster member variables according to their type?
Should I also sort them according to their size ascending (bool, char, int, double...)?
EDIT
Additional Question 4
A smaller memory footprint will improve data cache efficiency, since more objects can be cached and you avoid full memory accesses into "slow" RAM. So could the ordering and grouping of the member declaration can be considered as a (small) but easy to achieve performance optimization?
1) Absolutely correct.
2) It's not smaller because they are grouped, but because of the way they are ordered and grouped. For example, if you declare 4 chars one after the other, they can be packed into 4 byte. If you declare one char and immediately one int, 3 padding bytes will be inserted as the int will need to be aligned to 4 bytes.
3) No! You should group members in a class so that the class becomes more readable.
Important note: this is all platform/compiler specific. Don't take it ad-literam.
Another note - there also exist some small performance increase on some platforms for accessing members that reside in the first n (varies) bytes of a class instance. So declaring frequently accessed members at the beginning of a class can result in a small speed increase. However, this too shouldn't be a criteria. I'm just stating a fact, but in no way recommend you do this.
You are right, the size differs because the compiler inserts padding bytes in class LogicallyClustered. The compiler should use a memory layout like this:
class LogicallyClustered
{
// class starts well aligned
bool _fA;
// 3 bytes padding (int needs to be aligned)
int _nA;
char _cA;
bool _fB;
// 2 bytes padding (int needs to be aligned)
int _nB;
char _cB;
// 3 bytes padding (so next class object in an array would be aligned)
};
Your class TypeClustered does not need any padding bytes because all elements are aligned. bool and char do not need alignment, int needs to be aligned on 4 byte boundary.
Regarding question 3 I would say (as often :-)) "It depends.". If you are in an environment where memory footprint does not matter very much I would rather sort logically to make the code more readable. If you are in an environment where every byte counts you might consider moving around the members for optimal usage of space.
Unless there are no extreme memory footprint restrictions, cluster them logically, which improves code readability and ease of maintenance.
Unless you actually have problems of space (i.e. very, very large
vectors with such structures), don't worry about it. Otherwise: padding
is added for alignment: on most machines, for example, a double will
be aligned on an 8 byte boundary. Regrouping all members according to
type, with the types requiring the most alignment at the start will
result in the smallest memory footprint.
Q1: Yes
Q2: Depends on the size of bool (which is AFAIK compiler-dependent). Assuming it is 1 byte (like char), the first 4 members together use 4 bytes, which is as much as is used by one integer. Therefore, the compiler does not need to insert alignment padding in front of the integers.
Q3: If you want to order by type, size-descending is a better idea. However, that kind of clustering impedes readability. If you want to avoid padding under all circumstances, just make sure that every variable which needs more memory than 1 byte starts at an alignment boundary.
The alignment boundary, however, differs from architecture to architecture. That is (besides the possibly different sizes of int) why the same struct may have different sizes on different architectures. It is generally safe to start every member x at an offset of a
multiple of sizeof(x). I.e., in
struct {
char a;
char b;
char c;
int d;
}
The int d would start at an offset of 3, which is not a multiple of sizeof(int) (=4 on x86/64), so you should probably move it to the front. It is, however, not necessary to strictly cluster by type.
Some compilers also offer the possibility to completely omit padding, e.g. __attribute((packed))__ in g++. This, however, may slow down your program, because an int then might actually need two memory accesses.
I am initializing millions of classes that are of the following type
template<class T>
struct node
{
//some functions
private:
T m_data_1;
T m_data_2;
T m_data_3;
node* m_parent_1;
node* m_parent_2;
node* m_child;
}
The purpose of the template is to enable the user to choose float or double precision, with the idea being that by node<float> will occupy less memory (RAM).
However, when I switch from double to float the memory footprint of my program does not decrease as I expect it to. I have two questions,
Is it possible that the compiler/operating system is reserving more space than required for my floats (or even storing them as a double). If so, how do I stop this happening - I'm using linux on 64 bit machine with g++.
Is there a tool that lets me determine the amount of memory used by all the different classes? (i.e. some sort of memory profiling) - to make sure that the memory isn't being goobled up somewhere else that I haven't thought of.
If you are compiling for 64-bit, then each pointer will be 64-bits in size. This also means that they may need to be aligned to 64-bits. So if you store 3 floats, it may have to insert 4 bytes of padding. So instead of saving 12 bytes, you only save 8. The padding will still be there whether the pointers are at the beginning of the struct or the end. This is necessary in order to put consecutive structs in arrays to continue to maintain alignment.
Also, your structure is primarily composed of 3 pointers. The 8 bytes you save take you from a 48-byte object to a 40 byte object. That's not exactly a massive decrease. Again, if you're compiling for 64-bit.
If you're compiling for 32-bit, then you're saving 12 bytes from a 36-byte structure, which is better percentage-wise. Potentially more if doubles have to be aligned to 8 bytes.
The other answers are correct about the source of the discrepancy. However, pointers (and other types) on x86/x86-64 are not required to be aligned. It is just that performance is better when they are, which is why GCC keeps them aligned by default.
But GCC provides a "packed" attribute to let you exert control over this:
#include <iostream>
template<class T>
struct node
{
private:
T m_data_1;
T m_data_2;
T m_data_3;
node* m_parent_1;
node* m_parent_2;
node* m_child;
} ;
template<class T>
struct node2
{
private:
T m_data_1;
T m_data_2;
T m_data_3;
node2* m_parent_1;
node2* m_parent_2;
node2* m_child;
} __attribute__((packed));
int
main(int argc, char *argv[])
{
std::cout << "sizeof(node<double>) == " << sizeof(node<double>) << std::endl;
std::cout << "sizeof(node<float>) == " << sizeof(node<float>) << std::endl;
std::cout << "sizeof(node2<float>) == " << sizeof(node2<float>) << std::endl;
return 0;
}
On my system (x86-64, g++ 4.5.2), this program outputs:
sizeof(node<double>) == 48
sizeof(node<float>) == 40
sizeof(node2<float>) == 36
Of course, the "attribute" mechanism and the "packed" attribute itself are GCC-specific.
In addtion to the valid points that Nicol makes:
When you call new/malloc, it doesn't necessarily correspond 1 to 1 with a call the the OS to allocate memory. This is because in order to reduce the number of expensive syste, calls, the heap manager may allocate more than is requested, and then "suballocate" chunks of that when you call new/malloc. Also, memory can only be allocated 4kb at a time (typically - this is the minimum page size). Essentially, there may be chunks of memory allocated that are not currently actively used, in order to speed up future allocations.
To answer your questions directly:
1) Yes, the runtime will very likely allocate more memory then you asked for - but this memory is not wasted, it will be used for future news/mallocs, but will still show up in "task manager" or whatever tool you use. No, it will not promote floats to doubles. The more allocations you make, the less likely this edge condition will be the cause of the size difference, and the items in Nicol's will dominate. For a smaller number of allocations, this item is likely to dominate (where "large" and "small" depends entirely on your OS and Kernel).
2) The windows task manager will give you the total memory allocated. Something like WinDbg will actually give you the virtual memory range chunks (usually allocated in a tree) that were allocated by the run-time. For Linux, I expect this data will be available in one of the files in the /proc directory associated with your process.