Why does loki::flex_string's SmallStringOpt need alignment? - c++

I'm reading the source code of flex_string and don't understand very well why the alignment is necessary. Is it just for performance reasons?
union
{
mutable value_type buf_[maxSmallString + 1];
Align align_;
};
Here is a link to the design document of flex_string:
http://www.drdobbs.com/generic-a-policy-based-basicstring-imple/184403784#4
The author said:
But what's that Align business? Well, when dealing with such "seated allocation," you must be careful with alignment issues.

Quoting from the linked article:
But what's that Align business? Well, when dealing with such "seated
allocation," you must be careful with alignment issues. Because there
is no portable way of figuring out what alignment requirements Storage
has, SmallStringOpt accepts a type that specifies the alignment and
stores it in the dummy align_ variable.
I believe this has to do with the Storage template parameter. In order to be as generic as possible, the class tries to work with any storage type, even one that has particular alignment requirements. Those requirements could exist for performance reasons, or for compatibility with a certain architecture. The point is that, at the time the code was written, there was no reliable, portable way to ascertain the alignment requirements of whatever "Storage" ends up being.
Hence the Align parameter is intended to be some type whose alignment requirement is at least as strict as that of Storage. It is a dummy member of the union: it is never written to or read. Its only job is to force the union, and therefore buf_, to be aligned suitably for a Storage object constructed in place inside the buffer.
It can be seen from the code that the small string size is the larger of the configured maximum and the size of Align, which makes the latter the minimum configurable small string size.
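A minimal sketch of the idea, not the actual Loki code: the names HeapStorage, SmallBuffer and seat are hypothetical, and C++11 alignof is used only to check the result. The dummy Align member raises the alignment of the whole union, so a Storage object can be constructed in place inside the character buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical stand-in for a Storage policy that holds pointers.
struct HeapStorage {
    char* begin_;
    char* end_;
    HeapStorage() : begin_(nullptr), end_(nullptr) {}
};

template <class Storage, std::size_t threshold, typename Align = void*>
class SmallBuffer {
    // The buffer must be able to hold either the small string or a Storage.
    enum { maxSmall = threshold > sizeof(Storage) ? threshold : sizeof(Storage) };

    union {
        char buf_[maxSmall + 1];  // raw bytes: alignment 1 on its own
        Align align_;             // never read or written; raises the union's alignment
    };

public:
    SmallBuffer() { buf_[0] = '\0'; }

    // "Seated allocation": construct a Storage inside the raw buffer.
    Storage* seat() {
        assert(reinterpret_cast<std::uintptr_t>(buf_) % alignof(Storage) == 0);
        return new (buf_) Storage();
    }
};
```

Without the align_ member the union would only be guaranteed char alignment, and seating a pointer-holding Storage in it could produce a misaligned object on stricter architectures.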
Hope this helps!

Related

How exactly structure packing and padding work?

How exactly are structs packed and padded in C++? The standard does not say anything about how it should be done (as far as I know), and compilers can do whatever they want. But there are tutorials showing how to pack structs efficiently using known rules (for example, every variable must sit at an address that is a multiple of its size, and padding is inserted if the end of the previous variable is not such a multiple), and with these rules we can pack structs by hand in source. So which is it? Do we know how structs will be packed on modern machines (for example PCs), or is this just an idea that may be right but should not be taken for granted?
How exactly are structs packed and padded in C++?
Short answer: in such a way that alignment requirements are satisfied.
The standard does not say anything about how it should be done (as far as I know), and compilers can do whatever they want.
Within bounds of the alignment requirements, this is indeed correct. This is also an answer to your question.
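As a small illustration (the struct names Loose and Tight are made up for this sketch), the only things the checks below rely on are the alignment rules themselves: each member sits at a multiple of its alignment, and the struct size is a multiple of the struct alignment. The exact sizes are left to the implementation.

```cpp
#include <cstddef>

// Same three members, declared in two different orders.
struct Loose { char a; int b; char c; };   // padding typically needed before b and after c
struct Tight { int b; char a; char c; };   // the two chars can share one padded tail

// These hold on any conforming implementation:
static_assert(offsetof(Loose, b) % alignof(int) == 0,
              "b is placed on an int boundary");
static_assert(sizeof(Loose) % alignof(Loose) == 0,
              "size is a multiple of the alignment, so arrays stay aligned");
```

On common ABIs with a 4-byte int, Loose typically ends up larger than Tight, but that is a property of the platform ABI rather than a promise made by the standard.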

Do Google's Protocol Buffers automatically align data efficiently?

In a typical C or C++ struct the developer must explicitly order data members in a way that provides efficient memory alignment and padding, if that is an issue.
Google's Protocol Buffers behave a lot like structs and it is not clear how the compilation of these affects memory layout. Does anyone know if this tendency to organize data in a specific order for the sake of efficient memory layout is automatically handled by the protocol buffer compiler? I have been unable to find any information on this.
I.E. the buffer might actually internally order the data differently than it is specified in the message object of the protobuf.
In a typical C or C++ struct the developer must explicitly order data members in a way that provides efficient memory alignment and padding, if that is an issue.
Actually this is not entirely true.
It's true that most compilers (actually all I know of) tend to align struct elements to machine word addresses. They do this for performance reasons: it's usually cheaper to read from a word address and just mask away some bits than to read from the word address, shift the word so that the value you are looking for is right-aligned, and then mask away the bits not needed. (Of course, this depends on the architecture you are compiling for.)
So why is the statement I quoted above not true? Because compilers arrange elements as described above, they also offer the programmer the opportunity to influence this behavior. Usually this is done using a compiler-specific pragma.
For example, GCC and the MS C compilers provide a pragma called "pack", which allows the programmer to change the alignment behavior of the compiler for specific structs. Of course, if you choose to set pack to '1', memory usage improves, but this may impact your runtime behavior.
What never happens to my knowledge is a reordering of the members in a struct by the compiler.
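A hedged sketch of the pack pragma mentioned above. The push/pop form is a compiler extension accepted by GCC, Clang and MSVC, not standard C++, and the struct names are hypothetical.

```cpp
#include <cstddef>

// Default layout: the compiler may insert padding after tag.
struct Natural { char tag; int value; };

#pragma pack(push, 1)   // compiler extension: member alignment capped at 1 byte
struct Packed { char tag; int value; };
#pragma pack(pop)       // restore the previous packing for later declarations
```

With packing set to 1 there is no padding between tag and value, so value may sit at an odd address. Reading it through the struct is handled by the compiler, but taking a pointer or reference to it can lead to misaligned access on some architectures.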

How do I check a typed pointer is properly aligned for that type?

Suppose I have a templated function that deals with pointers to yet unknown type T. Now if type T happens to be void* on 64-bit platform then it must be 8-bytes aligned, but if T happens to be char it must be 1-byte aligned and if T happens to be a class then its alignment requirements will depend on its member variables.
This all can be computed on paper, but how do I make the compiler yield the alignment requirements for a given type T?
Is there a way to find during compile time the alignment requirements for a given type?
In C++11 you can use alignof and alignas to make assertions about alignment and to state alignment requirements. Also look at std::align to control alignment at run time.
In the absence of C++11, it's easiest to use the next power of two greater than or equal to sizeof(T). You might also want to cap it to the alignment of the largest primitive. 8 is a pretty safe bet on a 64-bit architecture (though you might need to keep an eye on things like SSE data types).
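A minimal sketch of such a check, assuming C++11; the helper name is_aligned is hypothetical. alignof(T) yields the requirement at compile time, and the pointer value is tested against it at run time.

```cpp
#include <cassert>
#include <cstdint>

// True when p is a valid address for an object of type T.
template <typename T>
bool is_aligned(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % alignof(T) == 0;
}

// Example use inside a templated function that receives a T*.
template <typename T>
void process(T* p) {
    assert(is_aligned<T>(p) && "pointer is not suitably aligned for T");
    (void)p;  // real work on *p would go here
}
```

The conversion of a pointer to std::uintptr_t is implementation-defined, but on mainstream platforms it yields the numeric address, which is what the modulo test needs.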

C++ Memory alignment in custom stack allocator

Usually data is aligned at power of two addresses depending on its size.
How should I align a struct or class with size of 20 bytes or another non-power-of-two size?
I'm creating a custom stack allocator, so I guess that the compiler won't align data for me since I'm working with a contiguous block of memory.
Some more context:
I have an Allocator class that uses malloc() to allocate a large amount of data.
Then I use a void* allocate(U32 size_of_object) method to return a pointer to where I can store whatever objects I need to store.
This way all objects are stored in the same region of memory and it will hopefully fit in the cache reducing cache misses.
C++11 has the alignof operator specifically for this purpose. Don't use any of the tricks mentioned in other posts, as they all have edge cases or may fail for certain compiler optimisations. The alignof operator is implemented by the compiler and knows the exact alignment being used.
See this description of c++11's new alignof operator
Although the compiler (or interpreter) normally allocates individual data items on aligned boundaries, data structures often have members with different alignment requirements. To maintain proper alignment the translator normally inserts additional unnamed data members so that each member is properly aligned. In addition the data structure as a whole may be padded with a final unnamed member. This allows each member of an array of structures to be properly aligned. http://en.wikipedia.org/wiki/Data_structure_alignment#Typical_alignment_of_C_structs_on_x86
This says that the compiler takes care of it for you, 99.9% of the time. As for how to force an object to align a specific way, that is compiler specific, and only works in certain circumstances.
MSVC: http://msdn.microsoft.com/en-us/library/83ythb65.aspx
__declspec(align(32))
struct S{ int a, b, c, d; };
// the alignment must be a power of two, so 20 is not accepted; 32 is used here
GCC: http://gcc.gnu.org/onlinedocs/gcc-3.4.0/gcc/Type-Attributes.html
struct S{ int a, b, c, d; }
__attribute__ ((aligned (32)));
Before C++11 I don't know of a cross-platform way (including macros!) to do this, but there's probably a neat macro somewhere. In C++11 and later, alignas(32) is the standard spelling.
Unless you want to access memory directly, or squeeze the maximum amount of data into a block of memory, you don't need to worry about alignment: the compiler takes care of that for you.
Due to the way processor data buses work, what you want to avoid is 'mis-aligned' access. Usually you can read a 32 bit value in a single access from addresses which are multiples of four; if you try to read it from an address that's not such a multiple, the CPU may have to grab it in two or more pieces. So if you're really worrying about things at this level of detail, what you need to be concerned about is not so much the overall struct, as the pieces within it. You'll find that compilers will frequently pad out structures with dummy bytes to ensure aligned access, unless you specifically force them not to with a pragma.
Since you've now added that you actually want to write your own allocator, the answer is straight-forward: Simply ensure that your allocator returns a pointer whose value is a multiple of the requested size. The object's size itself will already come suitably adjusted (via internal padding) so that all member objects themselves are properly aligned, so if you request sizeof(T) bytes, all your allocator needs to do is to return a pointer whose value is divisible by sizeof(T).
If your object does indeed have size 20 (as reported by sizeof), then you have nothing further to worry about. (On a 64-bit platform, the object would probably be padded to 24 bytes.)
Update: In fact, as I only now came to realize, strictly speaking you only need to ensure that the pointer is aligned, recursively, for the largest member of your type. That may be more efficient, but aligning to the size of the entire type is definitely not getting it wrong.
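A rough sketch of a bump ("stack") allocator along these lines, assuming C++11. The class name StackAllocator, its members and the 20-byte example type Twenty are hypothetical. It rounds the current position up to the requested alignment, which is alignof(T) rather than sizeof(T), so a 20-byte struct of ints only needs int alignment.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

class StackAllocator {
public:
    explicit StackAllocator(std::size_t capacity)
        : base_(static_cast<char*>(std::malloc(capacity))),
          offset_(0),
          capacity_(base_ ? capacity : 0) {}
    ~StackAllocator() { std::free(base_); }
    StackAllocator(const StackAllocator&) = delete;
    StackAllocator& operator=(const StackAllocator&) = delete;

    // Returns size bytes aligned to alignment (a power of two), or nullptr if full.
    void* allocate(std::size_t size, std::size_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::uintptr_t current = reinterpret_cast<std::uintptr_t>(base_) + offset_;
        const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
        const std::uintptr_t aligned = (current + mask) & ~mask;  // round up
        const std::size_t padding = static_cast<std::size_t>(aligned - current);
        const std::size_t remaining = capacity_ - offset_;
        if (padding > remaining || size > remaining - padding) return nullptr;
        offset_ += padding;
        void* result = base_ + offset_;
        offset_ += size;
        return result;
    }

    // Convenience wrapper: size and alignment come from the type itself.
    template <typename T>
    void* allocate_for() { return allocate(sizeof(T), alignof(T)); }

private:
    char* base_;
    std::size_t offset_;
    std::size_t capacity_;
};

// Example type: 20 bytes where int is 4 bytes, but it only needs int alignment.
struct Twenty { int v[5]; };
```

Objects would then be constructed with placement new on the returned pointer. Deallocation (moving the offset back) is left out to keep the sketch short.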
How should I align a struct or class with size of 20 bytes or another non-power-of-two size?
Alignment is CPU-specific, so there is no answer to this question without, at least, knowing the target CPU.
Generally speaking, alignment isn't something that you have to worry about; your compiler will have the rules implemented for you. It does come up once in a while, like when writing an allocator. The classic solution is discussed in The C Programming Language (K&R): use the worst possible alignment. malloc does this, although it's phrased as, "the pointer returned if the allocation succeeds shall be suitably aligned so that it may be assigned to a pointer to any type of object."
The way to do that is to use a union (the elements of a union are all allocated at the union's base address, and the union must therefore be aligned in such a way that each element could exist at that address; i.e., the union's alignment will be the same as the alignment of the element with the strictest rules):
typedef long Align;
union header {
    // the inner struct holds the important bookkeeping info
    struct {
        unsigned size;
        header* next;
    } s;
    // the align member only exists to make sure headers are always allocated
    // using the alignment of a long, which is probably the worst alignment
    // for the target architecture ("worst" == "strictest"; something that meets
    // the worst alignment will also meet all better alignment requirements)
    Align align;
};
Memory is allocated by creating an array (using something like sbrk()) of headers large enough to satisfy the request, plus one additional header element that actually contains the bookkeeping information. If the array is called arry, the bookkeeping information is at arry[0], while the pointer returned points at arry[1] (the next element is meant for walking the free list).
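The arithmetic behind that scheme can be sketched as follows. This is an illustration in the spirit of the K&R allocator, not a copy of it: the names Header, units_for, carve and header_of are hypothetical, and obtaining the array itself (sbrk, malloc, a static arena) is left out.

```cpp
#include <cassert>
#include <cstddef>

union Header {
    struct {
        std::size_t size;   // block size, counted in Header units
        Header* next;       // next block on the free list
    } s;
    long align;             // never used; forces the strictest assumed alignment
};

// Header-sized units needed for nbytes of payload, plus one for bookkeeping.
inline std::size_t units_for(std::size_t nbytes) {
    return (nbytes + sizeof(Header) - 1) / sizeof(Header) + 1;
}

// Record the bookkeeping in arry[0] and hand out the memory starting at arry[1].
inline void* carve(Header* arry, std::size_t nunits) {
    arry[0].s.size = nunits;
    arry[0].s.next = nullptr;
    return arry + 1;
}

// Recover the bookkeeping header from a pointer previously returned by carve.
inline Header* header_of(void* p) {
    assert(p != nullptr);
    return static_cast<Header*>(p) - 1;
}
```

Because every block is a whole number of Header units, every pointer handed out inherits the alignment of Header.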
This works, but can lead to wasted space ("In Sun's HotSpot JVM, object storage is aligned to the nearest 64-bit boundary"). I'm aware of a better approach that tries to get a type-specific alignment instead of "the alignment that will work for anything."
Compilers also often have compiler-specific commands. They aren't standard, and they require that you know the correct alignment requirements for the types in question. I would avoid them.

Determining maximum possible alignment in C++

Is there any portable way to determine what the maximum possible alignment for any type is?
For example on x86, SSE instructions require 16-byte alignment, but as far as I'm aware, no instructions require more than that, so any type can be safely stored into a 16-byte aligned buffer.
I need to create a buffer (such as a char array) where I can write objects of arbitrary types, and so I need to be able to rely on the beginning of the buffer to be aligned.
If all else fails, I know that allocating a char array with new is guaranteed to have maximum alignment, but with the TR1/C++0x templates alignment_of and aligned_storage, I am wondering if it would be possible to create the buffer in-place in my buffer class, rather than requiring the extra pointer indirection of a dynamically allocated array.
Ideas?
I realize there are plenty of options for determining the max alignment for a bounded set of types: A union, or just alignment_of from TR1, but my problem is that the set of types is unbounded. I don't know in advance which objects must be stored into the buffer.
In C++11 std::max_align_t defined in header cstddef is a POD type whose alignment requirement is at least as strict (as large) as that of every scalar type.
Using the new alignof operator it would be as simple as alignof(std::max_align_t)
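A minimal sketch of an in-place buffer built on this, assuming C++11; InPlaceBuffer and construct are hypothetical names. Note that alignof(std::max_align_t) only covers the scalar types, so over-aligned types (for example some SIMD vector types) are rejected at compile time here rather than silently misplaced.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

template <std::size_t N>
struct InPlaceBuffer {
    // Raw storage aligned for any scalar type; no heap allocation involved.
    alignas(std::max_align_t) unsigned char bytes[N];

    // Copy-construct a T at the start of the buffer.
    template <typename T>
    T* construct(const T& value) {
        static_assert(sizeof(T) <= N, "object does not fit in the buffer");
        static_assert(alignof(T) <= alignof(std::max_align_t),
                      "over-aligned types need a stricter buffer");
        assert(reinterpret_cast<std::uintptr_t>(bytes) % alignof(T) == 0);
        return new (bytes) T(value);
    }
};
```

The caller stays responsible for invoking the destructor of whatever was constructed in the buffer.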
In C++0x, the Align template parameter of std::aligned_storage<Len, Align> has a default argument of "default-alignment," which is defined as (N3225 §20.7.6.6 Table 56):
The value of default-alignment shall be the most stringent alignment requirement for any C++ object type whose size is no greater than Len.
It isn't clear whether SSE types would be considered "C++ object types."
The default argument wasn't part of the TR1 aligned_storage; it was added for C++0x.
Unfortunately ensuring max alignment is a lot tougher than it should be, and there are no guaranteed solutions AFAIK. From the GotW blog (Fast Pimpl article):
union max_align {
short dummy0;
long dummy1;
double dummy2;
long double dummy3;
void* dummy4;
/*...and pointers to functions, pointers to
member functions, pointers to member data,
pointers to classes, eye of newt, ...*/
};
union {
max_align m;
char x_[sizeofx];
};
This isn't guaranteed to be fully
portable, but in practice it's close
enough because there are few or no
systems on which this won't work as
expected.
That's about the closest 'hack' I know for this.
There is another approach that I've used personally for super fast allocation. Note that it is evil, but I work in raytracing fields where speed is one of the greatest measures of quality and we profile code on a daily basis. It involves using a heap allocator with pre-allocated memory that works like the local stack (just increments a pointer on allocation and decrements one on deallocation).
I use it for Pimpls particularly. However, just having the allocator is not enough; for such an allocator to work, we have to assume that memory for a class, Foo, is allocated in a constructor, the same memory is likewise deallocated only in the destructor, and that Foo itself is created on the stack. To make it safe, I needed a function to see if the 'this' pointer of a class is on the local stack to determine if we can use our super fast heap-based stack allocator. For that we had to research OS-specific solutions: I used TIBs and TEBs for Win32/Win64, and my co-workers found solutions for Linux and Mac OS X.
The result, after a week of researching OS-specific methods to detect stack range, alignment requirements, and doing a lot of testing and profiling, was an allocator that could allocate memory in 4 clock cycles according to our tick counter benchmarks as opposed to about 400 cycles for malloc/operator new (our test involved thread contention so malloc is likely to be a bit faster than this in single-threaded cases, perhaps a couple of hundred cycles). We added a per-thread heap stack and detected which thread was being used which increased the time to about 12 cycles, though the client can keep track of the thread allocator to get the 4 cycle allocations. It wiped out memory allocation based hotspots off the map.
While you don't have to go through all that trouble, writing a fast allocator might be easier and more generally applicable (ex: allowing the amount of memory to allocate/deallocate to be determined at runtime) than something like max_align here. max_align is easy enough to use, but if you're after speed for memory allocations (and assuming you've already profiled your code and found hotspots in malloc/free/operator new/delete with major contributors being in code you have control over), writing your own allocator can really make the difference.
Short of some maximally_aligned_t type that all compilers promised faithfully to support for all architectures everywhere, I don't see how this could be solved at compile time. As you say, the set of potential types is unbounded. Is the extra pointer indirection really that big a deal?
Allocating aligned memory is trickier than it looks - see for example Implementation of aligned memory allocation
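One common way to do it is to over-allocate with malloc, round the address up, and stash the original pointer just in front of the aligned block so it can be freed later. The sketch below assumes C++11, and the names aligned_malloc and aligned_free are hypothetical (they are not the platform functions of similar names).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Returns size bytes aligned to alignment (a power of two), or nullptr on failure.
void* aligned_malloc(std::size_t size, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    // The stashed pointer itself must be aligned, so never go below alignof(void*).
    if (alignment < alignof(void*)) alignment = alignof(void*);

    const std::size_t extra = alignment - 1 + sizeof(void*);
    if (size > std::numeric_limits<std::size_t>::max() - extra) return nullptr;

    void* raw = std::malloc(size + extra);
    if (raw == nullptr) return nullptr;

    // Leave room for one void* in front, then round up to the alignment.
    const std::uintptr_t start = reinterpret_cast<std::uintptr_t>(raw) + sizeof(void*);
    const std::uintptr_t mask = static_cast<std::uintptr_t>(alignment) - 1;
    const std::uintptr_t aligned = (start + mask) & ~mask;

    reinterpret_cast<void**>(aligned)[-1] = raw;  // remember what to free
    return reinterpret_cast<void*>(aligned);
}

void aligned_free(void* p) {
    if (p != nullptr) std::free(static_cast<void**>(p)[-1]);
}
```

Memory obtained this way must be released with aligned_free, never with free directly, because the pointer handed out is not the one malloc returned.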
This is what I'm using. In addition to this, if you're allocating memory then a new()'d array of char with length greater than or equal to max_alignment will be aligned to max_alignment so you can then use indexes into that array to get aligned addresses.
#include <boost/mpl/deref.hpp>
#include <boost/mpl/int.hpp>
#include <boost/mpl/max_element.hpp>
#include <boost/mpl/vector.hpp>
#include <boost/type_traits/alignment_of.hpp>

// max_alignment is the largest alignment among the listed fundamental types
enum {
    max_alignment = boost::mpl::deref<
        boost::mpl::max_element<
            boost::mpl::vector<
                boost::mpl::int_<boost::alignment_of<signed char>::value>::type,
                boost::mpl::int_<boost::alignment_of<short int>::value>::type,
                boost::mpl::int_<boost::alignment_of<int>::value>::type,
                boost::mpl::int_<boost::alignment_of<long int>::value>::type,
                boost::mpl::int_<boost::alignment_of<float>::value>::type,
                boost::mpl::int_<boost::alignment_of<double>::value>::type,
                boost::mpl::int_<boost::alignment_of<long double>::value>::type,
                boost::mpl::int_<boost::alignment_of<void*>::value>::type
            >::type
        >::type
    >::type::value
};