In which cases can __declspec( align( # ) ) not work? - c++

I have a class:
class CMatrix4f
{
public:
CMatrix4f();
public:
__declspec(align(16)) float m[16];
};
This class implements matrix operations with SSE, so m must be aligned for it to work. And it works most of the time, but sometimes I get segfault when executing SSE instructions like _mm_load_ps because m is not 16-bytes aligned. So far I can't understand in which cases it happens.
When I do CMatrix4f * dynamicMatrix = new CMatrix4f();, is dynamicMatrix.m guaranteed to be aligned?
If I have a class:
class MatrixWrapper {
public:
MatrixWrapper();
CMatrix4f _matrix;
};
And then do:
MatrixWrapper * dynamicMatrixWrapper = new MatrixWrapper();
Is dynamicMatrixWrapper._matrix.m guaranteed to be aligned?
I've read MSDN article on alignment, but it is unclear whether it works for dynamic allocation.

since __declspec(align(#)) is a compilation directive, creating the MatrixWrapper object with the new operator can result in unaligned memory on the heap. You may consider using _aligned_malloc and allocate memory dynamically, for example in the constructor, and then free it using _aligned_free in the destructor, by the way mixing static and dynamic allocation of object makes things more difficult.

Related

How to define a memory aligned template pointer as a C++ class member? (intel/gcc)

I have a matrix class that has a pointer class member (data pointer to the class template variable) and I would like to declare it as a pointer to aligned memory just as a hint to intel compiler that it can use aligned access for auto-vectorized code. The short version of the code is as below:
template <class V>
class matrix{
public:
typedef __declspec(align(64)) V VA;
VA * data;
int nrows, ncols;
}
When I check optimization report I see that access to data is not aligned unless I use __assume_aligned in the code block where I use matrix.data.
Please let me know if there is an alternative to let the intel/gcc compiler know that data field is an aligned pointer so that I don't have to use __assume_aligned everywhere there is an access to the pointer.
This issue has been already resolved in the Intel community. Please refer the below link
https://community.intel.com/t5/Intel-C-Compiler/How-to-define-a-memory-aligned-template-pointer-as-a-C-class/m-p/1287633#M38894
_mm_malloc assures gives a hint to the compiler to align the memory access. Alignment assures there is no peeled loop present that is what is happening in the second example.
Using only malloc makes the chances of a loop getting aligned lower. In this case, the loop is not aligned resulting in peeled loops.

Does the alignas specifier work with 'new'?

My question is rather simple;
Does the alignas specifier work with 'new'? That is, if a struct is defined to be aligned, will it be aligned when allocated with new?
Before C++17, if your type's alignment is not over-aligned, then yes, the default new will work. "Over-aligned" means that the alignment you specify in alignas is greater than alignof(std::max_align_t). The default new will work with non-over-aligned types more or less by accident; the default memory allocator will always allocate memory with an alignment equal to alignof(std::max_align_t).
If your type's alignment is over-aligned however, your out of luck. Neither the default new, nor any global new operator you write, will be able to even know the alignment required of the type, let alone allocate memory appropriate to it. The only way to help this case is to overload the class's operator new, which will be able to query the class's alignment with alignof.
Of course, this won't be useful if that class is used as the member of another class. Not unless that other class also overloads operator new. So something as simple as new pair<over_aligned, int>() won't work.
C++17 adds a number of memory allocators which are given the alignment of the type being used. These allocators are used specifically for over-aligned types (or more specifically, new-extended over-aligned types). So new pair<over_aligned, int>() will work in C++17.
Of course, this only works to the extent that the allocator handles over-aligned types.
No it does not. The struct will be padded to the alignment requested, but it will not be aligned. There is a chance, however, that this will be allowed in C++17 (the fact that this C++17 proposal exists should be pretty good proof this can't work in C++11).
I have seen this appear to work with some memory allocators, but that was pure luck. For instance, some memory allocators will align their memory allocations to powers of 2 of the requested size (up to 4KB) as an optimization for the allocator (reduce memory fragmentation, possibly make it easier to reuse previously freed memory, etc...). However, the new/malloc implementations that are included in the OS X 10.7 and CentOS 6 systems that I tested do not do this and fail with the following code:
#include <stdlib.h>
#include <assert.h>
struct alignas(8) test_struct_8 { char data; };
struct alignas(16) test_struct_16 { char data; };
struct alignas(32) test_struct_32 { char data; };
struct alignas(64) test_struct_64 { char data; };
struct alignas(128) test_struct_128 { char data; };
struct alignas(256) test_struct_256 { char data; };
struct alignas(512) test_struct_512 { char data; };
int main() {
test_struct_8 *heap_8 = new test_struct_8;
test_struct_16 *heap_16 = new test_struct_16;
test_struct_32 *heap_32 = new test_struct_32;
test_struct_64 *heap_64 = new test_struct_64;
test_struct_128 *heap_128 = new test_struct_128;
test_struct_256 *heap_256 = new test_struct_256;
test_struct_512 *heap_512 = new test_struct_512;
#define IS_ALIGNED(addr,size) ((((size_t)(addr)) % (size)) == 0)
assert(IS_ALIGNED(heap_8, 8));
assert(IS_ALIGNED(heap_16, 16));
assert(IS_ALIGNED(heap_32, 32));
assert(IS_ALIGNED(heap_64, 64));
assert(IS_ALIGNED(heap_128, 128));
assert(IS_ALIGNED(heap_256, 256));
assert(IS_ALIGNED(heap_512, 512));
delete heap_8;
delete heap_16;
delete heap_32;
delete heap_64;
delete heap_128;
delete heap_256;
delete heap_512;
return 0;
}

Creating dynamically sized objects

Removed the C tag, seeing as that was causing some confusion (it shouldn't have been there to begin with; sorry for any inconvenience there. C answer as still welcome though :)
In a few things I've done, I've found the need to create objects that have a dynamic size and a static size, where the static part is your basic object members, the dynamic part is however an array/buffer appended directly onto the class, keeping the memory contiguous, thus decreasing the amount of needed allocations (these are non-reallocatable objects), and decreasing fragmentation (though as a down side, it may be harder to find a block of a big enough size, however that is a lot more rare - if it should even occur at all - than heap fragmenting. This is also helpful on embedded devices where memory is at a premium(however I don't do anything for embedded devices currently), and things like std::string need to be avoided, or can't be used like in the case of trivial unions.
Generally the way I'd go about this would be to (ab)use malloc(std::string is not used on purpose, and for various reasons):
struct TextCache
{
uint_32 fFlags;
uint_16 nXpos;
uint_16 nYpos;
TextCache* pNext;
char pBuffer[0];
};
TextCache* pCache = (TextCache*)malloc(sizeof(TextCache) + (sizeof(char) * nLength));
This however doesn't sit too well with me, as firstly I would like to do this using new, and thus in a C++ environment, and secondly, it looks horrible :P
So next step was a templated C++ varient:
template <const size_t nSize> struct TextCache
{
uint_32 fFlags;
uint_16 nXpos;
uint_16 nYpos;
TextCache<nSize>* pNext;
char pBuffer[nSize];
};
This however has the problem that storing a pointer to a variable sized object becomes 'impossible', so then the next work around:
class DynamicObject {};
template <const size_t nSize> struct TextCache : DynamicObject {...};
This however still requires casting, and having pointers to DynamicObject all over the place becomes ambiguous when more that one dynamically sized object derives from it (it also looks horrible and can suffer from a bug that forces empty classes to still have a size, although that's probably an archaic, extinct bug...).
Then there was this:
class DynamicObject
{
void* operator new(size_t nSize, size_t nLength)
{
return malloc(nSize + nLength);
}
};
struct TextCache : DynamicObject {...};
which looks a lot better, but would interfere with objects that already have overloads of new(it could even affect placement new...).
Finally I came up with placement new abusing:
inline TextCache* CreateTextCache(size_t nLength)
{
char* pNew = new char[sizeof(TextCache) + nLength];
return new(pNew) TextCache;
}
This however is probably the worst idea so far, for quite a few reasons.
So are there any better ways to do this? or would one of the above versions be better, or at least improvable? Is doing even considered safe and/or bad programming practice?
As I said above, I'm trying to avoid double allocations, because this shouldn't need 2 allocations, and cause this makes writing(serializing) these things to files a lot easier.
The only exception to the double allocation requirement I have is when its basically zero overhead. the only cause where I have encountered that is where I sequentially allocate memory from a fixed buffer(using this system, that I came up with), however its also a special exception to prevent superflous copying.
I'd go for a compromise with the DynamicObject concept. Everything that doesn't depend on the size goes into the base class.
struct TextBase
{
uint_32 fFlags;
uint_16 nXpos;
uint_16 nYpos;
TextBase* pNext;
};
template <const size_t nSize> struct TextCache : public TextBase
{
char pBuffer[nSize];
};
This should cut down on the casting required.
C99 blesses the 'struct hack' - aka flexible array member.
§6.7.2.1 Structure and union specifiers
¶16 As a special case, the last element of a structure with more than one named member may
have an incomplete array type; this is called a flexible array member. With two
exceptions, the flexible array member is ignored. First, the size of the structure shall be
equal to the offset of the last element of an otherwise identical structure that replaces the
flexible array member with an array of unspecified length.106) Second, when a . (or ->)
operator has a left operand that is (a pointer to) a structure with a flexible array member
and the right operand names that member, it behaves as if that member were replaced
with the longest array (with the same element type) that would not make the structure
larger than the object being accessed; the offset of the array shall remain that of the
flexible array member, even if this would differ from that of the replacement array. If this
array would have no elements, it behaves as if it had one element but the behavior is
undefined if any attempt is made to access that element or to generate a pointer one past
it.
¶17 EXAMPLE Assuming that all array members are aligned the same, after the declarations:
struct s { int n; double d[]; };
struct ss { int n; double d[1]; };
the three expressions:
sizeof (struct s)
offsetof(struct s, d)
offsetof(struct ss, d)
have the same value. The structure struct s has a flexible array member d.
106) The length is unspecified to allow for the fact that implementations may give array members different
alignments according to their lengths.
Otherwise, use two separate allocations - one for the core data in the structure and the second for the appended data.
That may look heretic regarding to the economic mindset of C or C++ programmers, but the last time I had a similar problem to solve I chosed to put a fixed size static buffer in my struct and access it through a pointer indirection. If the given struct grew larger than my static buffer the indirection pointer was then allocated dynamically (and the internal buffer unused). That was very simple to implement and solved the kind of issues you raised like fragmentation, as static buffer was used in more than 95% of actual use case, with the remaining 5% needing really large buffers, hence I didn't cared much of the small loss of internal buffer.
I believe that, in C++, technically this is Undefined Behavior (due to alignment issues), although I suspect it can be made to work for probably every existing implementation.
But why do that anyway?
You could use placement new in C++:
char *buff = new char[sizeof(TextCache) + (sizeof(char) * nLength)];
TextCache *pCache = new (buff) TextCache;
The only caveat being that you need to delete buff instead of pCache and if pCache has a destructor you'll have to call it manually.
If you are intending to access this extra area using pBuffer I'd recommend doing this:
struct TextCache
{
...
char *pBuffer;
};
...
char *buff = new char[sizeof(TextCache) + (sizeof(char) * nLength)];
TextCache *pCache = new (buff) TextCache;
pCache->pBuffer = new (buff + sizeof(TextCache)) char[nLength];
...
delete [] buff;
There's nothing wrong with managing your own memory.
template<typename DerivedType, typename ElemType> struct appended_array {
ElemType* buffer;
int length;
~appended_array() {
for(int i = 0; i < length; i++)
buffer->~ElemType();
char* ptr = (char*)this - sizeof(DerivedType);
delete[] ptr;
}
static inline DerivedType* Create(int extra) {
char* newbuf = new char[sizeof(DerivedType) + (extra * sizeof(ElemType))];
DerivedType* ptr = new (newbuf) DerivedType();
ElemType* extrabuf = (ElemType*)newbuf[sizeof(DerivedType)];
for(int i = 0; i < extra; i++)
new (&extrabuf[i]) ElemType();
ptr->lenghth = extra;
ptr->buffer = extrabuf;
return ptr;
}
};
struct TextCache : appended_array<TextCache, char>
{
uint_32 fFlags;
uint_16 nXpos;
uint_16 nYpos;
TextCache* pNext;
// IT'S A MIRACLE! We have a buffer of size length and pointed to by buffer of type char that automagically appears for us in the Create function.
};
You should consider, however, that this optimization is premature and there are way better ways of doing it, like having an object pool or managed heap. Also, I didn't count for any alignment, however it's my understanding that sizeof() returns the aligned size. Also, this will be a bitch to maintain for non-trivial construction. Also, this is totally untested. A managed heap is a way better idea. But you shouldn't be afraid of managing your own memory- if you have custom memory requirements, you need to manage your own memory.
Just occurred to me that I destructed but not deleted the "extra" memory.

preventing unaligned data on the heap

I am building a class hierarchy that uses SSE intrinsics functions and thus some of the members of the class need to be 16-byte aligned. For stack instances I can use __declspec(align(#)), like so:
typedef __declspec(align(16)) float Vector[4];
class MyClass{
...
private:
Vector v;
};
Now, since __declspec(align(#)) is a compilation directive, the following code may result in an unaligned instance of Vector on the heap:
MyClass *myclass = new MyClass;
This too, I know I can easily solve by overloading the new and delete operators to use _aligned_malloc and _aligned_free accordingly. Like so:
//inside MyClass:
public:
void* operator new (size_t size) throw (std::bad_alloc){
void * p = _aligned_malloc(size, 16);
if (p == 0) throw std::bad_alloc()
return p;
}
void operator delete (void *p){
MyClass* pc = static_cast<MyClass*>(p);
_aligned_free(p);
}
...
So far so good.. but here is my problem. Consider the following code:
class NotMyClass{ //Not my code, which I have little or no influence over
...
MyClass myclass;
...
};
int main(){
...
NotMyClass *nmc = new NotMyClass;
...
}
Since the myclass instance of MyClass is created statically on a dynamic instance of NotMyClass, myclass WILL be 16-byte aligned relatively to the beginning of nmc because of Vector's __declspec(align(16)) directive. But this is worthless, since nmc is dynamically allocated on the heap with NotMyClass's new operator, which doesn't nesessarily ensure (and definitely probably NOT) 16-byte alignment.
So far, I can only think of 2 approaches on how to deal with this problem:
Preventing MyClass users from being able to compile the following code:
MyClass myclass;
meaning, instances of MyClass can only be created dynamically, using the new operator, thus ensuring that all instances of MyClass are truly dynamically allocatted with MyClass's overloaded new. I have consulted on another thread on how to accomplish this and got a few great answers:
C++, preventing class instance from being created on the stack (during compiltaion)
Revert from having Vector members in my Class and only have pointers to Vector as members, which I will allocate and deallocate using _aligned_malloc and _aligned_free in the ctor and dtor respectively. This methos seems crude and prone to error, since I am not the only programmer writing these Classes (MyClass derives from a Base class and many of these classes use SSE).
However, since both solutions have been frowned upon in my team, I come to you for suggestions of a different solution.
If you're set against heap allocation, another idea is to over allocate on the stack and manually align (manual alignment is discussed in this SO post). The idea is to allocate byte data (unsigned char) with a size guaranteed to contain an aligned region of the necessary size (+15), then find the aligned position by rounding down from the most-shifted region (x+15 - (x+15) % 16, or x+15 & ~0x0F). I posted a working example of this approach with vector operations on codepad (for g++ -O2 -msse2). Here are the important bits:
class MyClass{
...
unsigned char dPtr[sizeof(float)*4+15]; //over-allocated data
float* vPtr; //float ptr to be aligned
public:
MyClass(void) :
vPtr( reinterpret_cast<float*>(
(reinterpret_cast<uintptr_t>(dPtr)+15) & ~ 0x0F
) )
{}
...
};
...
The constructor ensures that vPtr is aligned (note the order of members in the class declaration is important).
This approach works (heap/stack allocation of containing classes is irrelevant to alignment), is portabl-ish (I think most compilers provide a pointer sized uint uintptr_t), and will not leak memory. But its not particularly safe (being sure to keep the aligned pointer valid under copy, etc), wastes (nearly) as much memory as it uses, and some may find the reinterpret_casts distasteful.
The risks of aligned operation/unaligned data problems could be mostly eliminated by encapsulating this logic in a Vector object, thereby controlling access to the aligned pointer and ensuring that it gets aligned at construction and stays valid.
You can use "placement new."
void* operator new(size_t, void* p) { return p; }
int main() {
void* p = aligned_alloc(sizeof(NotMyClass));
NotMyClass* nmc = new (p) NotMyClass;
// ...
nmc->~NotMyClass();
aligned_free(p);
}
Of course you need to take care when destroying the object, by calling the destructor and then releasing the space. You can't just call delete. You could use shared_ptr<> with a different function to deal with that automatically; it depends if the overhead of dealing with a shared_ptr (or other wrapper of the pointer) is a problem to you.
The upcoming C++0x standard proposes facilities for dealing with raw memory. They are already incorporated in VC++2010 (within the tr1 namespace).
std::tr1::alignment_of // get the alignment
std::tr1::aligned_storage // get aligned storage of required dimension
Those are types, you can use them like so:
static const floatalign = std::tr1::alignment_of<float>::value; // demo only
typedef std::tr1::aligned_storage<sizeof(float)*4, 16>::type raw_vector;
// first parameter is size, second is desired alignment
Then you can declare your class:
class MyClass
{
public:
private:
raw_vector mVector; // alignment guaranteed
};
Finally, you need some cast to manipulate it (it's raw memory until now):
float* MyClass::AccessVector()
{
return reinterpret_cast<float*>((void*)&mVector));
}

What is the Performance, Safety, and Alignment of a Data member hidden in an embedded char array in a C++ Class?

I have seen a codebase recently that I fear is violating alignment constraints. I've scrubbed it to produce a minimal example, given below. Briefly, the players are:
Pool. This is a class which allocates memory efficiently, for some definition of 'efficient'. Pool is guaranteed to return a chunk of memory that is aligned for the requested size.
Obj_list. This class stores homogeneous collections of objects. Once the number of objects exceeds a certain threshold, it changes its internal representation from a list to a tree. The size of Obj_list is one pointer (8 bytes on a 64-bit platform). Its populated store will of course exceed that.
Aggregate. This class represents a very common object in the system. Its history goes back to the early 32-bit workstation era, and it was 'optimized' (in that same 32-bit era) to use as little space as possible as a result. Aggregates can be empty, or manage an arbitrary number of objects.
In this example, Aggregate items are always allocated from Pools, so they are always aligned. The only occurrences of Obj_list in this example are the 'hidden' members in Aggregate objects, and therefore they are always allocated using placement new. Here are the support classes:
class Pool
{
public:
Pool();
virtual ~Pool();
void *allocate(size_t size);
static Pool *default_pool(); // returns a global pool
};
class Obj_list
{
public:
inline void *operator new(size_t s, void * p) { return p; }
Obj_list(const Args *args);
// when constructed, Obj_list will allocate representation_p, which
// can take up much more space.
~Obj_list();
private:
Obj_list_store *representation_p;
};
And here is Aggregate. Note that member declaration member_list_store_d:
// Aggregate is derived from Lesser, which is twelve bytes in size
class Aggregate : public Lesser
{
public:
inline void *operator new(size_t s) {
return Pool::default_pool->allocate(s);
}
inline void *operator new(size_t s, Pool *h) {
return h->allocate(s);
}
public:
Aggregate(const Args *args = NULL);
virtual ~Aggregate() {};
inline const Obj_list *member_list_store_p() const;
protected:
char member_list_store_d[sizeof(Obj_list)];
};
It is that data member that I'm most concerned about. Here is the pseudocode for initialization and access:
Aggregate::Aggregate(const Args *args)
{
if (args) {
new (static_cast<void *>(member_list_store_d)) Obj_list(args);
}
else {
zero_out(member_list_store_d);
}
}
inline const Obj_list *Aggregate::member_list_store_p() const
{
return initialized(member_list_store_d) ? (Obj_list *) &member_list_store_d : 0;
}
You may be tempted to suggest that we replace the char array with a pointer to the Obj_list type, initialized to NULL or an instance of the class. This gives the proper semantics, but just shifts the memory cost around. If memory were still at a premium (and it might be, this is an EDA database representation), replacing the char array with a pointer to an Obj_list would cost one more pointer in the case when Aggregate objects do have members.
Besides that, I don't really want to get distracted from the main question here, which is alignment. I think the above construct is problematic, but can't really find more in the standard than some vague discussion of the alignment behavior of the 'system/library' new.
So, does the above construct do anything more than cause an occasional pipe stall?
Edit: I realize that there are ways to replace the approach using the embedded char array. So did the original architects. They discarded them because memory was at a premium. Now, if I have a reason to touch that code, I'll probably change it.
However, my question, about the alignment issues inherent in this approach, is what I hope people will address. Thanks!
Ok - had a chance to read it properly. You have an alignment problem, and invoke undefined behaviour when you access the char array as an Obj_list. Most likely your platform will do one of three things: let you get away with it, let you get away with it at a runtime penalty or occasionally crash with a bus error.
Your portable options to fix this are:
allocate the storage with malloc or
a global allocation function, but
you think this is too
expensive.
as Arkadiy says, make your buffer an Obj_list member:
Obj_list list;
but you now don't want to pay the cost of construction. You could mitigate this by providing an inline do-nothing constructor to be used only to create this instance - as posted the default constructor would do. If you follow this route, strongly consider invoking the dtor
list.~Obj_list();
before doing a placement new into this storage.
Otherwise, I think you are left with non portable options: either rely on your platform's tolerance of misaligned accesses, or else use any nonportable options your compiler gives you.
Disclaimer: It's entirely possible I'm missing a trick with unions or some such. It's an unusual problem.
The alignment will be picked by the compiler according to its defaults, this will probably end up as four-bytes under GCC / MSVC.
This should only be a problem if there is code (SIMD/DMA) that requires a specific alignment. In this case you should be able to use compiler directives to ensure that member_list_store_d is aligned, or increase the size by (alignment-1) and use an appropriate offset.
Can you simply have an instance of Obj_list inside Aggregate? IOW, something along the lines of
class Aggregate : public Lesser
{
...
protected:
Obj_list list;
};
I must be missing something, but I can't figure why this is bad.
As to your question - it's perfectly compiler-dependent. Most compilers, though, will align every member at word boundary by default, even if the member's type does not need to be aligned that way for correct access.
If you want to ensure alignment of your structures, just do a
// MSVC
#pragma pack(push,1)
// structure definitions
#pragma pack(pop)
// *nix
struct YourStruct
{
....
} __attribute__((packed));
To ensure 1 byte alignment of your char array in Aggregate
Allocate the char array member_list_store_d with malloc or global operator new[], either of which will give storage aligned for any type.
Edit: Just read the OP again - you don't want to pay for another pointer. Will read again in the morning.