16 byte alignment issue - c++

I am using DirectXMath, creating XMMATRIX and XMVECTOR members in classes.
When I call XMMatrixMultiply, it throws an unhandled exception.
I have found online that this is a byte alignment issue: DirectXMath uses SIMD instructions that require 16-byte alignment, and heap allocations can end up misaligned.
One of the proposed solutions was to use XMFLOAT4X4 variables and convert them to temporary XMMATRIX values whenever needed, but that isn't the nicest or fastest solution imo.
Another one was to use _aligned_malloc, yet I have no idea whatsoever how to use it. I have never had to do any memory allocation and it is black magic to me.
Another one was to overload the new operator, yet they did not provide any information on how to do it.
And regarding the overloading method, I'm not using new to create XMMATRIX objects since I don't use them as pointers.
It was all working nicely until I decided to split the code into classes.
I think the _aligned_malloc solution would be best here, but I have no idea how to use it, or where and when to call it.

Unlike XMFLOAT4X4 and XMFLOAT4, which are safe to store, XMMATRIX and XMVECTOR are aliases for hardware registers (SSE, NEON, etc.). Since the library is abstracting away the register type and alignment requirements, you shouldn't attempt to align the types yourself, since you can easily create a program that happens to work on your machine but fails on another. You should either use the safe types for storage (e.g. XMFLOAT4) or pull up the abstraction and use the vector instructions directly, with special storage and alignment code paths in your application for each vector extension you're trying to support.
Also, using these registers outside of the context of the library's vector instructions might cause unexpected failures for other reasons. For example, if you store an XMMATRIX in your own struct, some architectures might fail to create copies of the struct.

This doesn't pretend to be a complete answer.
There are some ways that you didn't mention:
#define _XM_NO_INTRINSICS_. Simple. Slow. Works right now, just one line of code. ;)
Don't store XMVECTOR and XMMATRIX on the heap. Store XMFLOAT4 or XMFLOAT4X4 and convert to SIMD types only when needed (so they will live on the stack). Slower. A lot of code to change (probably).
Don't store XMVECTOR and XMMATRIX on the heap, part 2. Just keep your classes on the stack. Fast. Pretty hard. A lot of code to change (probably).
Use an aligned allocator. Fast. Hard. Many hours of googling, a lot of code to write and debug.
Don't use the DirectXMath (previously XNAMath) library. Choose any other (there are plenty) or write your own. Fast. A lot of code to change (probably).
If you want an aligned allocator, it has nothing to do with DirectX or DirectXMath. It is an advanced topic. No one can give you a complete solution. But here are some results of googling:
returning aligned memory with new?
Harder to C++: Aligned Memory Allocation
many more
Be very careful. With a bad memory allocator you can introduce many more problems than you solve.
Hope it helps somehow. Happy Coding! :)
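For the aligned allocator option, here is a minimal sketch of what calling _aligned_malloc can look like. AlignedAlloc and AlignedFree are hypothetical helper names, not part of DirectXMath, and the non-Windows branch (posix_memalign) is an assumption added only so the sketch builds elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>      // posix_memalign, free
#if defined(_WIN32)
#include <malloc.h>      // _aligned_malloc, _aligned_free
#endif

// Hypothetical helper: returns `size` bytes whose address is a multiple of
// `alignment`, or nullptr on failure. `alignment` must be a power of two
// (and, for posix_memalign, a multiple of sizeof(void*)).
inline void* AlignedAlloc(std::size_t size, std::size_t alignment) {
#if defined(_WIN32)
    return _aligned_malloc(size, alignment);
#else
    void* p = nullptr;
    if (posix_memalign(&p, alignment, size) != 0) return nullptr;
    return p;
#endif
}

// Hypothetical helper: releases memory obtained from AlignedAlloc.
inline void AlignedFree(void* p) {
#if defined(_WIN32)
    _aligned_free(p);    // must pair with _aligned_malloc, never plain free()
#else
    free(p);
#endif
}
```

The important detail on Windows is the pairing: memory from _aligned_malloc has to be released with _aligned_free. Raw memory obtained this way still needs placement new (and an explicit destructor call) before it can hold an object, which is why the per-class operator new route is usually less error-prone.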

Related

Why convert to XMFLOAT instead of using XMVECTOR directly?

While studying DirectX 12, it says that I should use XMFLOAT instead of XMVECTOR for class data members.
I do not understand why.
Is it wrong to define XMVECTOR variables in my class, or to use XMVECTOR directly in my class?
This is covered in the DirectXMath Programmer's Guide on Docs.Microsoft which you should take the time to read. In particular, read the Getting Started section titled Type Usage Guidelines.
The XMVECTOR and XMMATRIX types are the work horses for the DirectXMath Library. Every operation consumes or produces data of these types. Working with them is key to using the library. However, since DirectXMath makes use of the SIMD instruction sets, these data types are subject to a number of restrictions. It is critical that you understand these restrictions if you want to make good use of the DirectXMath functions.
You should think of XMVECTOR as a proxy for a SIMD hardware register, and XMMATRIX as a proxy for a logical grouping of four SIMD hardware registers. These types are annotated to indicate they require 16-byte alignment to work correctly. The compiler will automatically place them correctly on the stack when they are used as a local variable, or place them in the data segment when they are used as a global variable. With proper conventions, they can also be passed safely as parameters to a function (see Calling Conventions for details).
Allocations from the heap, however, are more complicated. As such, you need to be careful whenever you use either XMVECTOR or XMMATRIX as a member of a class or structure to be allocated from the heap. On Windows x64, all heap allocations are 16-byte aligned, but for Windows x86, they are only 8-byte aligned. There are options for allocating structures from the heap with 16-byte alignment (see Properly Align Allocations). For C++ programs, you can use operator new/delete/new[]/delete[] overloads (either globally or class-specific) to enforce optimal alignment if desired.
Note: As an alternative to enforcing alignment in your C++ class directly by overloading new/delete, you can use the pImpl idiom. If you ensure your Impl class is aligned via _aligned_malloc internally, you can then freely use aligned types within the internal implementation. This is a good option when the 'public' class is a Windows Runtime ref class or intended for use with std::shared_ptr<>, which can otherwise disrupt careful alignment.
However, often it is easier and more compact to avoid using XMVECTOR or XMMATRIX directly in a class or structure. Instead, make use of the XMFLOAT3, XMFLOAT4, XMFLOAT4X3, XMFLOAT4X4, and so on, as members of your structure. Further, you can use the Vector Loading and Vector Storage functions to move the data efficiently into XMVECTOR or XMMATRIX local variables, perform computations, and store the results. There are also streaming functions (XMVector3TransformStream, XMVector4TransformStream, and so on) that efficiently operate directly on arrays of these data types.
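The load, compute, store pattern described above can be sketched without the library. Float4 and Vec4 below are hypothetical stand-ins for XMFLOAT4 and XMVECTOR, and Load, Store and Add stand in for XMLoadFloat4, XMStoreFloat4 and XMVectorAdd; the real types use SIMD registers, which this sketch does not.

```cpp
#include <cassert>

// Stand-in for XMFLOAT4: plain storage, no special alignment.
struct Float4 { float x, y, z, w; };

// Stand-in for XMVECTOR: a 16-byte aligned compute type.
struct alignas(16) Vec4 { float v[4]; };

inline Vec4 Load(const Float4& f) {
    Vec4 r = {{f.x, f.y, f.z, f.w}};
    return r;
}

inline void Store(Float4& out, const Vec4& v) {
    out.x = v.v[0]; out.y = v.v[1]; out.z = v.v[2]; out.w = v.v[3];
}

inline Vec4 Add(const Vec4& a, const Vec4& b) {
    Vec4 r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}

// Members are storage types only, so the class is safe on any heap.
class Particle {
public:
    Particle(Float4 pos, Float4 vel) : position_(pos), velocity_(vel) {}

    void Step() {
        // Aligned compute values exist only as locals on the stack.
        Vec4 p = Load(position_);
        Vec4 v = Load(velocity_);
        Store(position_, Add(p, v));
    }

    const Float4& position() const { return position_; }

private:
    Float4 position_;
    Float4 velocity_;
};

static_assert(alignof(Particle) < 16, "Particle has no 16-byte alignment requirement");
```

The class pays for one load and one store per operation, but it can be allocated with plain new, placed in std::vector, or held by std::shared_ptr without any alignment concerns.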
This strict alignment requirement and the verbosity are by design, as they make it clear to the programmer when load/store overhead is being incurred. If, however, you find it a bit tedious, consider making use of the SimpleMath wrapper in the DirectX Tool Kit for DirectX 11 / DirectX 12.
Keep in mind that DirectX has nothing particularly to do with DirectXMath. DirectXMath can work just as well with any version of Direct3D or even OpenGL as it just does CPU-side vector and matrix computations. DirectXMath doesn't really depend on the Windows OS at all; it's just a collection of C/C++ code using intrinsics so the compiler is all that really matters.
In fact, since you are apparently new enough to DirectX generally to not already know how to use DirectXMath, you should consider using DirectX 11 and not trying to jump into DirectX 12 cold. DirectX 12 is a very unforgiving API designed for graphics experts, and largely assumes you are already an expert in Direct3D 11 programming.
See DirectX Tool Kit for DirectX 12 tutorials and Getting Started with Direct3D 12.
You are within your rights to store one or more XMVECTOR values as object members.
When you do so, you need to be sure you respect the alignment constraint of XMVECTOR: 128 bits (16 bytes). This is why the library introduces the XMFLOATx types, to deal with storage without the alignment requirement.
Failure to do so may give you crashes at best and incorrect computation at worst. This is more likely to happen in a 32-bit executable, where new is not required to return memory aligned to at least 16 bytes.
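One way to see the constraint, and to catch violations early, is to check it at compile time and at run time. This is only a sketch: Vec4 is a hypothetical stand-in for XMVECTOR, and Camera and IsAligned are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for XMVECTOR: 16 bytes, 16-byte aligned.
struct alignas(16) Vec4 { float v[4]; };

struct Camera {
    int id;
    Vec4 eye;   // this member raises the alignment of the whole struct
};

// The member's requirement propagates to the enclosing type...
static_assert(alignof(Camera) == 16, "Camera inherits the 16-byte requirement");
// ...and the compiler pads so that the member itself lands on a 16-byte offset.
static_assert(offsetof(Camera, eye) == 16, "padding is inserted before eye");

// Run-time check for any pointer, e.g. one returned by a heap allocation.
inline bool IsAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

The compiler guarantees the layout inside the struct; what it cannot guarantee is that an allocator hands back a suitably aligned block. Something like assert(IsAligned(this, 16)) in a constructor would flag a misaligned heap instance in a debug build instead of crashing later inside a SIMD instruction.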

Overriding new and delete for DirectX structures

I follow some common DirectX tutorial on the web which features classes and structuring.
I need to allocate memory for XMVECTOR and XMMATRIX because of the specific memory allocation issue.
Now it all works, but I wish to make the code cleaner. The question is:
Is there a way to override new and delete for those structures (so that the malloc and pointer conversion details are hidden behind the word "new", and similarly for delete), and if so, how?
Edit 2014-07-11:
The comments so far suggested two ways to work around the problem:
1) Using a wrapper class for the structures and overloading/overriding delete and new for the wrapper class.
The problem with this is the obvious performance hit and the need to access the member structure every single time (less clean code, which defeats the whole purpose).
2) Using XMFLOAT4 and similar structures.
The problem with this is that it makes memory allocation easier but adds complications in conversions between types (as XMMATRIX and XMVECTOR are the ones returned by DirectXMath functions). Those conversions also make the code less clean, so it's like replacing a pile of dog poop with cat poop: it's still poop in the end (yeah, the best comparison I could come up with to convey the meaning).
The general recommendation is to use the various memory structures (XMFLOAT4, etc.) and Load/Stores. If you were targeting only x64 native, you could use XMVECTOR/XMMATRIX directly since that platform uses 16-byte aligned memory by default.
The overloading new/delete recommendation is not for XMVECTOR or XMMATRIX themselves. Rather, you can overload new/delete for your classes that contain these types to use _aligned_malloc( x, 16 ). Global overriding of new/delete is possible, but doing it per class is actually the recommended solution. See the Scott Meyers "Effective C++" books for a detailed discussion of overriding new/delete.
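A per-class overload might look roughly like the following. This is a sketch, not code from the DirectXMath documentation: Vec4 is a hypothetical stand-in for XMVECTOR, Transform is an illustrative class, and the posix_memalign branch is an assumption for non-Windows builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>      // posix_memalign, free
#if defined(_WIN32)
#include <malloc.h>      // _aligned_malloc, _aligned_free
#endif

// Hypothetical stand-in for XMVECTOR.
struct alignas(16) Vec4 { float v[4]; };

class Transform {
public:
    // Used by every `new Transform`; callers do not see the details.
    static void* operator new(std::size_t size) {
        void* p = nullptr;
#if defined(_WIN32)
        p = _aligned_malloc(size, 16);
#else
        if (posix_memalign(&p, 16, size) != 0) p = nullptr;
#endif
        if (!p) throw std::bad_alloc();
        return p;
    }

    // Used by every `delete ptr` where ptr is a Transform*.
    static void operator delete(void* p) noexcept {
#if defined(_WIN32)
        _aligned_free(p);
#else
        free(p);
#endif
    }

    Vec4 rows[4];   // aligned members are now safe on the heap
};
```

Client code keeps writing new Transform and delete t as usual. The array forms (operator new[] and operator delete[]) can be overloaded the same way if arrays of the class are heap-allocated. Note that std::make_shared does not go through a class-specific operator new, which is one reason the pImpl alternative exists.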
Another approach is to use the pImpl idiom like the DirectX Tool Kit does. The public class is unaligned but the internal class uses _aligned_malloc( x, 16 ). This actually works really well, and neither the implementation nor the client code ends up looking like "poop".
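As a rough illustration of that idiom (not the DirectX Tool Kit's actual code), the public class below holds only a pointer, while the hidden Impl carries the aligned member and its own aligned operator new. Vec4, Camera and the member names are hypothetical, and the posix_memalign branch is an assumption for non-Windows builds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <stdlib.h>      // posix_memalign, free
#if defined(_WIN32)
#include <malloc.h>      // _aligned_malloc, _aligned_free
#endif

// Hypothetical stand-in for XMVECTOR.
struct alignas(16) Vec4 { float v[4]; };

// Public class: just a pointer inside, so it has no alignment demands.
class Camera {
public:
    Camera();
    ~Camera();                       // defined once Impl is complete
    void MoveX(float dx);
    float X() const;
    bool ImplIsAligned() const;
private:
    struct Impl;
    std::unique_ptr<Impl> impl_;
};

// Hidden implementation: aligned members plus aligned allocation.
struct Camera::Impl {
    Vec4 position;
    Impl() : position() { position.v[3] = 1.0f; }

    static void* operator new(std::size_t size) {
        void* p = nullptr;
#if defined(_WIN32)
        p = _aligned_malloc(size, 16);
#else
        if (posix_memalign(&p, 16, size) != 0) p = nullptr;
#endif
        if (!p) throw std::bad_alloc();
        return p;
    }
    static void operator delete(void* p) noexcept {
#if defined(_WIN32)
        _aligned_free(p);
#else
        free(p);
#endif
    }
};

Camera::Camera() : impl_(new Impl) {}
Camera::~Camera() = default;
void Camera::MoveX(float dx) { impl_->position.v[0] += dx; }
float Camera::X() const { return impl_->position.v[0]; }
bool Camera::ImplIsAligned() const {
    return reinterpret_cast<std::uintptr_t>(impl_.get()) % 16 == 0;
}
```

Because Camera itself has ordinary alignment, it can live in a std::shared_ptr or any container without extra care; only the single new Impl call needs the aligned path.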
Finally, you could make use of the SimpleMath wrapper in the DirectX Tool Kit which provides classes that derive from XMFLOAT4, etc. with implicit conversions. It is not as efficient, but it does look clean without worrying about the alignment issues.
BTW, this topic is covered in the DirectXMath Programmer's Guide on MSDN.

Memory management interface

I am writing a small particle system in C++ and am yet unsure about how I should manage the particle related data -- should it be stored in a static or dynamic array, in a linked list, some mixture of both, or whatever else one might think of?
At the moment I don't want to make a choice but would rather use an abstract class for memory management that on the one hand provides me with allocation and deallocation routines and on the other hand takes care of deallocating the supplied resources in its destructor. I hope that in this way I can switch between and test different particle management strategies quickly and transparently.
1) Is this a reasonable thing to do?
2) If yes: Are there any libraries that provide such functionality?
Thank you for your help!
For a particle system you may wish to consider using one std::vector per attribute (each coordinate, velocity component, colour channel, etc.) rather than one entry per particle. E.g.
std::vector<float> x(100);
std::vector<float> vx(100);
etc
Instead of
std::vector<Particle> p(100);
This is known as SoA (structure of arrays) rather than AoS (array of structures). The former is more amenable to vectorization.
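A slightly fuller sketch of the SoA layout follows. ParticlesSoA and Integrate are illustrative names, and the choice of a 2D position and velocity is an assumption made to keep it short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Structure of arrays: one contiguous vector per attribute.
struct ParticlesSoA {
    std::vector<float> x, y;     // positions
    std::vector<float> vx, vy;   // velocities

    explicit ParticlesSoA(std::size_t n) : x(n), y(n), vx(n), vy(n) {}
    std::size_t size() const { return x.size(); }
};

// Each statement in the loop walks a few flat float arrays with unit
// stride, which is the access pattern auto-vectorizers handle best.
inline void Integrate(ParticlesSoA& p, float dt) {
    const std::size_t n = p.size();
    for (std::size_t i = 0; i < n; ++i) {
        p.x[i] += p.vx[i] * dt;
        p.y[i] += p.vy[i] * dt;
    }
}
```

With the AoS layout the same loop would stride over whole Particle objects and pull unrelated fields (colour, lifetime, and so on) into the cache along with the positions.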
The rule of thumb is to use std::vector unless you really have a reason to choose something else. For the moment you can stick with it. To control memory management at a low level, you can supply the vector with your own allocator in case std::allocator (which obtains memory through operator new) should be replaced. If your main concern is frequent allocation and deallocation of single objects, then you might consider writing your own allocator that allocates from a pool of fixed-size elements organized into a linked list, because the conventional, more general operator new is not efficient when there are many calls to allocate or deallocate objects one at a time.
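The pool of fixed-size elements organized into a linked list could be sketched like this. FixedPool is a hypothetical class, not a standard or Boost component; it is single-threaded and hands out raw blocks only.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-size block pool. Free blocks are chained through an
// intrusive singly linked list stored inside the unused blocks themselves.
class FixedPool {
    struct Node { Node* next; };

public:
    FixedPool(std::size_t blockSize, std::size_t blockCount)
        : blockSize_(RoundUp(blockSize)),
          storage_(blockSize_ * blockCount),
          head_(nullptr) {
        // Thread every block onto the free list.
        for (std::size_t i = 0; i < blockCount; ++i) {
            head_ = new (&storage_[i * blockSize_]) Node{head_};
        }
    }

    FixedPool(const FixedPool&) = delete;
    FixedPool& operator=(const FixedPool&) = delete;

    // O(1): pop the first free block.
    void* Allocate() {
        if (!head_) throw std::bad_alloc();
        Node* n = head_;
        head_ = n->next;
        return n;
    }

    // O(1): push the block back onto the free list.
    void Deallocate(void* p) {
        head_ = new (p) Node{head_};
    }

private:
    // Blocks must be able to hold a Node and stay maximally aligned.
    static std::size_t RoundUp(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        if (n < sizeof(Node)) n = sizeof(Node);
        return (n + a - 1) / a * a;
    }

    std::size_t blockSize_;
    std::vector<unsigned char> storage_;
    Node* head_;
};
```

Objects would be constructed in a returned block with placement new and destroyed with an explicit destructor call before Deallocate. Whether this actually beats the general-purpose allocator for a given particle system is something to measure rather than assume.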
Testing different containers is a reasonable thing to do IMO, although vector should suffice.
1) Is this a reasonable thing to do?
To decide that, and whether such tests are worth running at all, you have to think about the operations you are going to use extensively.
2) If yes: Are there any libraries that provide such functionality?
I don't know of any such library.

Are classes guaranteed to have the same organization in memory between program runs?

I'm attempting to implement a Save/Load feature into my small game. To accomplish this I have a central class that stores all the important variables of the game such as position, etc. I then save this class as binary data to a file. Then simply load it back for the loading function. This seems to work MOST of the time, but if I change certain things then try to do a save/load the program will crash with memory access violations. So, are classes guaranteed to have the same structure in memory on every run of the program or can the data be arranged at random like a struct?
Response to Jesus - I mean the data inside the class, so that if I save the class to disk, when I load it back, will everything fit nicely back.
Save
fout.write(reinterpret_cast<const char*>(&game), sizeof(Game));
Load
fin.read(reinterpret_cast<char*>(&game), sizeof(Game));
Your approach is extremely fragile. With many restrictions, it can work. These restrictions are not worth subjecting your users (or yourself!) to in typical cases.
Some Restrictions:
Never refer to external memory (e.g. a pointer or reference)
Forbid ABI changes/differences. Common case: memory layout and natural alignment on 32 vs 64 will vary. The user will need a new 'game' for each ABI.
Not endian compatible.
Altering your type's layouts will break your game. Changing your compiler options can do this.
You're basically limited to POD data.
Use offsets instead of pointers to refer to internal data (This reference would be in contiguous memory).
Therefore, you can safely use this approach in extremely limited situations -- that typically applies only to components of a system, rather than the entire state of the game.
Since this is tagged C++, Boost.Serialization would be a good starting point. It's well tested and abstracts many of the complexities for you.
Even if this would work, just don't do it. Define a file format at the byte-level and write sensible 'convert to file format' and 'convert from file format' functions. You'll actually know the format of the file. You'll be able to extend it. Newer versions of the program will be able to read files from older versions. And you'll be able to update your platform, build tools, and classes without fear of causing your program to crash.
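A byte-level format along those lines might be sketched as follows. The GameState fields, the magic tag, the version number and the little-endian choice are all assumptions made for illustration, and the float encoding assumes a 32-bit IEEE-754 float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <istream>
#include <ostream>
#include <sstream>

// Hypothetical game state; only plain values are saved, never pointers.
struct GameState { std::uint32_t level; float playerX; float playerY; };

const std::uint32_t kMagic = 0x314D4147u;   // bytes "GAM1" on disk (made-up tag)
const std::uint32_t kVersion = 1;

static_assert(sizeof(float) == 4, "sketch assumes a 32-bit float");

// Every integer is written as 4 little-endian bytes, whatever the host is.
inline void WriteU32(std::ostream& out, std::uint32_t v) {
    const unsigned char b[4] = {
        static_cast<unsigned char>(v & 0xFFu),
        static_cast<unsigned char>((v >> 8) & 0xFFu),
        static_cast<unsigned char>((v >> 16) & 0xFFu),
        static_cast<unsigned char>((v >> 24) & 0xFFu)};
    out.write(reinterpret_cast<const char*>(b), 4);
}

inline bool ReadU32(std::istream& in, std::uint32_t& v) {
    unsigned char b[4];
    if (!in.read(reinterpret_cast<char*>(b), 4)) return false;
    v = static_cast<std::uint32_t>(b[0]) |
        (static_cast<std::uint32_t>(b[1]) << 8) |
        (static_cast<std::uint32_t>(b[2]) << 16) |
        (static_cast<std::uint32_t>(b[3]) << 24);
    return true;
}

// Floats travel as their 32-bit pattern.
inline void WriteF32(std::ostream& out, float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    WriteU32(out, bits);
}

inline bool ReadF32(std::istream& in, float& f) {
    std::uint32_t bits;
    if (!ReadU32(in, bits)) return false;
    std::memcpy(&f, &bits, sizeof f);
    return true;
}

inline void Save(std::ostream& out, const GameState& s) {
    WriteU32(out, kMagic);
    WriteU32(out, kVersion);
    WriteU32(out, s.level);
    WriteF32(out, s.playerX);
    WriteF32(out, s.playerY);
}

// Returns false on a truncated file, wrong tag, or unknown version.
inline bool Load(std::istream& in, GameState& s) {
    std::uint32_t magic = 0, version = 0;
    if (!ReadU32(in, magic) || magic != kMagic) return false;
    if (!ReadU32(in, version) || version != kVersion) return false;
    GameState tmp = {0u, 0.0f, 0.0f};
    if (!ReadU32(in, tmp.level) || !ReadF32(in, tmp.playerX) ||
        !ReadF32(in, tmp.playerY)) return false;
    s = tmp;
    return true;
}
```

The version field is what lets a later build recognise older saves: a newer Load could branch on it and fill in defaults for fields that did not exist yet. Files should be opened with std::ios::binary so that no newline translation touches the bytes.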
Yes, classes and structures will have the same layout in memory every time your program runs, although I can't say if the standard enforces this. The machine code generated by C++ compilers uses "hard-coded" offsets to access type fields, so they are fixed. Realistically, the layout will only change if you modify the C++ class definition (field sizes, order, virtual methods, etc.), compile with a different compiler, or change compiler options.
As long as the type is POD and has no pointer fields, it should be safe to simply dump it to a file and read it back with the exact same program. However, because of the above-mentioned concerns, this approach is quite inflexible with regard to versioning and interoperability.
[edit]
To respond to your own edit, do not do this with your "Game" object! It certainly has pointers to other objects, and those objects will not exist anymore in memory or will be elsewhere when you'll reload your file.
You might want to take a look at this.
Classes are not guaranteed to have the same structure in memory as pointers can point to different locations in memory each time a class is created.
However, without posting code it is difficult to say with certainty where the problem is.

Creating a scoped custom memory pool/allocator?

Would it be possible in C++ to create a custom allocator that works simply like this:
{
// Limit memory to 1024 KB
ScopedMemoryPool memoryPool(1024 * 1024);
// From here on all heap allocations ('new', 'malloc', ...) take memory from the pool.
// If the pool is depleted these calls result in an exception being thrown.
// Examples:
std::vector<int> integers(10);
int* a = new int[10];
}
I couldn't find something like this in the boost libraries, or anywhere else.
Is there a fundamental problem that makes this impossible?
You would need to create a custom allocator that you pass in as a template param to vector. This custom allocator would essentially wrap the access to your pool and do whatever size validations that it wants.
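A minimal sketch of such an allocator, assuming a simple bump-pointer pool with a fixed byte budget. MemoryPool and PoolAllocator are hypothetical names; memory is only reclaimed when the pool itself is destroyed, and the pool is not thread-safe.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-budget pool: hands out slices of one buffer and
// throws std::bad_alloc once the budget is exhausted.
class MemoryPool {
public:
    explicit MemoryPool(std::size_t bytes) : buffer_(bytes), used_(0) {}

    void* Allocate(std::size_t bytes, std::size_t alignment) {
        // Offsets are aligned relative to the buffer start, which comes
        // from operator new and is suitably aligned for ordinary types.
        const std::size_t start = (used_ + alignment - 1) / alignment * alignment;
        if (start > buffer_.size() || bytes > buffer_.size() - start) {
            throw std::bad_alloc();
        }
        used_ = start + bytes;
        return buffer_.data() + start;
    }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
};

// Minimal C++11 allocator that forwards to a MemoryPool.
template <class T>
class PoolAllocator {
public:
    using value_type = T;

    explicit PoolAllocator(MemoryPool& pool) noexcept : pool_(&pool) {}
    template <class U>
    PoolAllocator(const PoolAllocator<U>& other) noexcept : pool_(other.pool()) {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(pool_->Allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}   // freed with the pool

    MemoryPool* pool() const noexcept { return pool_; }

private:
    MemoryPool* pool_;
};

template <class T, class U>
bool operator==(const PoolAllocator<T>& a, const PoolAllocator<U>& b) noexcept {
    return a.pool() == b.pool();
}
template <class T, class U>
bool operator!=(const PoolAllocator<T>& a, const PoolAllocator<U>& b) noexcept {
    return !(a == b);
}
```

A container then names the allocator in its type, e.g. std::vector<int, PoolAllocator<int>> v(alloc), and must be destroyed before the pool it draws from. Plain new and malloc calls are not affected by this; only containers that were given the allocator use the pool.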
Yes you can make such a construct, it's used in many games, but you'll basically need to implement your own containers and call memory allocation methods of that pool that you've created.
You could also experiment with writing a custom allocator for the STL containers, although it seems that that sort of work is generally advised against. (I've done it before and it was tedious, but I don't remember any specific problems.)
Mind you, writing your own memory allocator is not for the faint of heart. You could take a look at Doug Lea's malloc, which provides "memory spaces" that you could use in your scoping construct somehow.
I will answer a different question. Look at the 'Efficient C++' book. One of the things they discuss is implementing this kind of thing. That was for a web server.
For this particular thing you can either work at the C++ layer by overriding new and supplying custom allocators to the STL,
or you can work at the malloc level: start with a custom malloc and go from there (like dmalloc).
Is there a fundamental problem that makes this impossible?
Reasoning about program behavior would become fundamentally impossible. All sorts of weird issues will come up. Certain sections of the code may or may not execute, though this will seemingly have no effect on the next sections, which may work unhindered. Certain sections may always fail. Dealing with the standard library or any other third-party library will become extremely difficult. There may be fragmentation at run time at some times and not at others.
If the intent is that all allocations within that scope occur with that allocator object, then it's essentially a thread-local variable.
So, there will be multithreading issues if you use a static or global variable to implement it. Otherwise, it's not a bad workaround for the statelessness of allocators.
(Of course, you'll need to pass a second template argument, e.g. vector< int, UseScopedPool >.)
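The thread-local idea might be sketched like this. ScopedMemoryPool and UseScopedPool are the names from the question and this answer, but their implementation here is entirely hypothetical; the sketch assumes a bump-pointer pool that never frees individual blocks and chooses to throw when no pool is active.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical scope guard: while an instance is alive, it is the
// "current" pool for the thread that created it.
class ScopedMemoryPool {
public:
    explicit ScopedMemoryPool(std::size_t bytes)
        : buffer_(bytes), used_(0), previous_(current_) {
        current_ = this;
    }
    ~ScopedMemoryPool() { current_ = previous_; }   // restore outer scope

    ScopedMemoryPool(const ScopedMemoryPool&) = delete;
    ScopedMemoryPool& operator=(const ScopedMemoryPool&) = delete;

    static ScopedMemoryPool* Current() { return current_; }

    void* Allocate(std::size_t bytes, std::size_t alignment) {
        const std::size_t start = (used_ + alignment - 1) / alignment * alignment;
        if (start > buffer_.size() || bytes > buffer_.size() - start) {
            throw std::bad_alloc();                  // pool depleted
        }
        used_ = start + bytes;
        return buffer_.data() + start;
    }

    std::size_t used() const { return used_; }

private:
    std::vector<unsigned char> buffer_;
    std::size_t used_;
    ScopedMemoryPool* previous_;
    static thread_local ScopedMemoryPool* current_;
};

thread_local ScopedMemoryPool* ScopedMemoryPool::current_ = nullptr;

// Stateless allocator: looks up the current thread's pool on each call.
template <class T>
struct UseScopedPool {
    using value_type = T;

    UseScopedPool() noexcept {}
    template <class U>
    UseScopedPool(const UseScopedPool<U>&) noexcept {}

    T* allocate(std::size_t n) {
        ScopedMemoryPool* pool = ScopedMemoryPool::Current();
        if (!pool) throw std::bad_alloc();           // no pool in scope
        return static_cast<T*>(pool->Allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}     // released with the pool
};

template <class T, class U>
bool operator==(const UseScopedPool<T>&, const UseScopedPool<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const UseScopedPool<T>&, const UseScopedPool<U>&) noexcept { return false; }
```

Because the pointer is thread_local, each thread sees only the pools it created itself, which sidesteps the static/global problem mentioned above. A container must not outlive the pool that was current when it allocated, and plain new or malloc calls still bypass the pool entirely.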