C++ replacement for C99 VLAs (goal: preserve performance)


I am porting some C99 code that makes heavy use of variable length arrays (VLA) to C++.
I replaced the VLAs (stack allocation) with an array class that allocates memory on the heap. The performance hit was huge, a slowdown of a factor of 3.2 (see benchmarks below). What fast VLA replacement can I use in C++? My goal is to minimize performance hit when rewriting the code for C++.
One idea that was suggested to me was to write an array class that contains a fixed-size storage within the class (i.e. can be stack-allocated) and uses it for small arrays, and automatically switches to heap allocation for larger arrays. My implementation of this is at the end of the post. It works fairly well, but I still cannot reach the performance of the original C99 code. To come close to it, I must increase this fixed-size storage (MSL below) to sizes which I am not comfortable with. I don't want to allocate too-huge arrays on the stack even for the many small arrays that don't need it because I worry that it will trigger a stack overflow. A C99 VLA is actually less prone to this because it will never use more storage than needed.
I came upon std::dynarray, but my understanding is that it was not accepted into the standard (yet?).
I know that clang and gcc support VLAs in C++, but I need it to work with MSVC too. In fact better portability is one of the main goals of rewriting as C++ (the other goal being making the program, which was originally a command line tool, into a reusable library).
Benchmark
MSL refers to the array size above which I switch to heap-allocation. I use different values for 1D and 2D arrays.
Original C99 code: 115 seconds.
MSL = 0 (i.e. heap allocation): 367 seconds (3.2x).
1D-MSL = 50, 2D-MSL = 1000: 187 seconds (1.63x).
1D-MSL = 200, 2D-MSL = 4000: 143 seconds (1.24x).
1D-MSL = 1000, 2D-MSL = 20000: 131 seconds (1.14x).
Increasing MSL further improves performance more, but eventually the program will start returning wrong results (I assume due to stack overflow).
These benchmarks are with clang 3.7 on OS X, but gcc 5 shows very similar results.
Code
This is the current "smallvector" implementation I use. I need 1D and 2D vectors. I switch to heap-allocation above size MSL.
template<typename T, size_t MSL=50>
class lad_vector {
    const size_t len;
    T sdata[MSL];
    T *data;
public:
    explicit lad_vector(size_t len_) : len(len_) {
        if (len <= MSL)
            data = &sdata[0];
        else
            data = new T[len];
    }
    ~lad_vector() {
        if (len > MSL)
            delete [] data;
    }
    const T &operator [] (size_t i) const { return data[i]; }
    T &operator [] (size_t i) { return data[i]; }
    operator T * () { return data; }
};

template<typename T, size_t MSL=1000>
class lad_matrix {
    const size_t rows, cols;
    T sdata[MSL];
    T *data;
public:
    explicit lad_matrix(size_t rows_, size_t cols_) : rows(rows_), cols(cols_) {
        if (rows*cols <= MSL)
            data = &sdata[0];
        else
            data = new T[rows*cols];
    }
    ~lad_matrix() {
        if (rows*cols > MSL)
            delete [] data;
    }
    T const * operator[] (size_t i) const { return &data[cols*i]; }
    T * operator[] (size_t i) { return &data[cols*i]; }
};

Create a large buffer (MB+) in thread-local storage. (Actual memory on heap, management in TLS).
Allow clients to request memory from it in FILO manner (stack-like). (this mimics how it works in C VLAs; and it is efficient, as each request/return is just an integer addition/subtraction).
Get your VLA storage from it.
Wrap it pretty, so you can say stack_array<T> x(1024);, and have that stack_array deal with construction/destruction (note that ->~T() where T is int is a legal noop, and construction can similarly be a noop), or make stack_array<T> wrap a std::vector<T, TLS_stack_allocator>.
Data will not be as local as the C VLA data is because it will effectively be on a separate stack. You can use SBO (small buffer optimization) for the cases where locality really matters.
A SBO stack_array<T> can be implemented with an allocator and a std vector unioned with a std array, or with a unique ptr and custom destroyer, or a myriad of other ways. You can probably retrofit your solution, replacing your new/malloc/free/delete with calls to the above TLS storage.
I say go with TLS as that removes need for synchronization overhead while allowing multi-threaded use, and mirrors the fact that the stack itself is implicitly TLS.
Stack-buffer based STL allocator? is a SO Q&A with at least two "stack" allocators in the answers. They will need some adaptation to automatically get their buffer from TLS.
Note that the TLS being one large buffer is in a sense an implementation detail. You could do large allocations, and when you run out of space do another large allocation. You just need to keep track of each "stack page"'s current capacity and a list of stack pages, so when you empty one you can move on to an earlier one. That lets you be a bit more conservative in your TLS initial allocation without worrying about running OOM; the important part is that you are FILO and allocate rarely, not that the entire FILO buffer is one contiguous one.
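The steps above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the names tls_stack and the layout of stack_array are hypothetical, the 1 MiB buffer size is an arbitrary assumption, and only trivial element types are handled so that construction and destruction are no-ops.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <type_traits>
#include <vector>

// Hypothetical per-thread bump ("stack") arena. The buffer itself lives on
// the heap; only the bookkeeping object is thread-local.
class tls_stack {
    std::vector<unsigned char> buf_;
    std::size_t top_ = 0;
public:
    explicit tls_stack(std::size_t bytes) : buf_(bytes) {}

    // Request: round the current top up to `align`, then bump it.
    void* push(std::size_t bytes, std::size_t align) {
        std::size_t start = (top_ + align - 1) / align * align;
        if (start > buf_.size() || bytes > buf_.size() - start)
            throw std::bad_alloc();
        top_ = start + bytes;
        return buf_.data() + start;
    }
    // Return: restore a previously recorded top (FILO order only).
    void pop_to(std::size_t mark) { assert(mark <= top_); top_ = mark; }
    std::size_t mark() const { return top_; }

    static tls_stack& instance() {
        thread_local tls_stack s(1 << 20);  // assumed size: 1 MiB per thread
        return s;
    }
};

// Hypothetical VLA-like wrapper: takes storage in the constructor and gives
// it back in the destructor, so lifetimes nest like automatic variables.
template <typename T>
class stack_array {
    static_assert(std::is_trivial<T>::value, "sketch handles trivial types only");
    static_assert(alignof(T) <= alignof(std::max_align_t),
                  "over-aligned types are not supported by this sketch");
    tls_stack& stack_;
    std::size_t mark_;
    std::size_t len_;
    T* data_;
public:
    explicit stack_array(std::size_t n)
        : stack_(tls_stack::instance()), mark_(stack_.mark()), len_(n),
          data_(static_cast<T*>(stack_.push(n * sizeof(T), alignof(T)))) {}
    ~stack_array() { stack_.pop_to(mark_); }
    stack_array(const stack_array&) = delete;
    stack_array& operator=(const stack_array&) = delete;

    T& operator[](std::size_t i) { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }
    std::size_t size() const { return len_; }
    T* data() { return data_; }
};
```

Usage would look like stack_array<double> x(1024); inside a function body. Because release happens in the destructor, the objects must be destroyed in reverse order of creation, which automatic variables guarantee.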

I think you have already enumerated most options in your question and the comments.
Use std::vector. This is the most obvious, most hassle-free but maybe also the slowest solution.
Use platform-specific extensions on those platforms that provide them. For example, GCC supports variable-length arrays in C++ as an extension. POSIX specifies alloca which is widely supported to allocate memory on the stack. Even Microsoft Windows provides _malloca, as a quick web search told me.
In order to avoid maintenance nightmares, you'll really want to encapsulate these platform dependencies into an abstract interface that automatically and transparently chooses the appropriate mechanism for the current platform. Implementing this for all platforms will be a bit of work, but if this single feature accounts for a 3× speed difference as you're reporting, it might be worth it. As a fallback for unknown platforms, I'd keep std::vector in reserve as a last resort. It is better to run slowly but correctly than to behave erratically or not run at all.
Build your own variable-sized array type that implements a “small array” optimization embedded as a buffer inside the object itself as you have shown in your question. I'll just note that I'd rather try using a union of a std::array and a std::vector instead of rolling my own container.
Once you have a custom type in place, you can do interesting profiling such as maintaining a global hash table of all occurrences of this type (by source-code location) and recording each allocation size during a stress test of your program. You can then dump the hash table at program exit and plot the distributions in allocation sizes for the individual arrays. This might help you to fine-tune the amount of storage to reserve for each array individually on the stack.
Use a std::vector with a custom allocator. At program startup, allocate a few megabytes of memory and give it to a simple stack allocator. For a stack allocator, allocation is just comparing and adding two integers and deallocation is simply a subtraction. I doubt that the compiler-generated stack allocation can be much faster. Your “array stack” would then pulsate correlated to your “program stack”. This design would also have the advantage that accidental buffer overruns – while still invoking undefined behavior, trashing random data and all that bad stuff – wouldn't as easily corrupt the program stack (return addresses) as they would with native VLAs.
Custom allocators in C++ are a somewhat dirty business but some people do report they're using them successfully. (I don't have much experience with using them myself.) You might want to start looking at cppreference. Alisdair Meredith, who is one of those people who promote the usage of custom allocators, gave a double-session talk at CppCon'14 titled “Making Allocators Work” (part 1, part 2) that you might find interesting as well. If the std::allocator interface is too awkward to use for you, implementing your own variable (as opposed to dynamically) sized array class with your own allocator should be doable as well.
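As a rough sketch of the last option, here is a std::vector backed by a simple stack allocator. The names arena and arena_allocator are hypothetical, and the policy of only reclaiming the most recently allocated block is an assumption made to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical fixed-capacity arena, allocated once (e.g. at program startup).
class arena {
    std::vector<unsigned char> buf_;
    std::size_t top_ = 0;
    static std::size_t round_up(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        return (n + a - 1) / a * a;
    }
public:
    explicit arena(std::size_t bytes) : buf_(bytes) {}
    arena(const arena&) = delete;
    arena& operator=(const arena&) = delete;

    void* allocate(std::size_t n) {
        n = round_up(n);
        if (n > buf_.size() - top_) throw std::bad_alloc();  // one comparison
        void* p = buf_.data() + top_;
        top_ += n;                                           // one addition
        return p;
    }
    void deallocate(void* p, std::size_t n) {
        n = round_up(n);
        assert(n <= top_);
        // Only the most recent block is reclaimed (one subtraction). Blocks
        // freed out of order stay reserved in this sketch.
        if (static_cast<unsigned char*>(p) + n == buf_.data() + top_) top_ -= n;
    }
    std::size_t used() const { return top_; }
};

// Minimal C++11 allocator that forwards to an arena.
template <typename T>
struct arena_allocator {
    using value_type = T;
    arena* a;
    explicit arena_allocator(arena& ar) noexcept : a(&ar) {}
    template <typename U>
    arena_allocator(const arena_allocator<U>& o) noexcept : a(o.a) {}
    T* allocate(std::size_t n) {
        return static_cast<T*>(a->allocate(n * sizeof(T)));
    }
    void deallocate(T* p, std::size_t n) noexcept {
        a->deallocate(p, n * sizeof(T));
    }
};
template <typename T, typename U>
bool operator==(const arena_allocator<T>& x, const arena_allocator<U>& y) {
    return x.a == y.a;
}
template <typename T, typename U>
bool operator!=(const arena_allocator<T>& x, const arena_allocator<U>& y) {
    return !(x == y);
}
```

A vector would then be created as std::vector<double, arena_allocator<double>> v(n, 0.0, arena_allocator<double>(my_arena));. Note that a vector that grows allocates its new block before releasing the old one, so with this simple policy the old block is not reclaimed; the scheme suits arrays whose size is fixed at construction.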

Regarding support for MSVC:
MSVC has _alloca which allocates stack space. It also has _malloca which allocates stack space if there is enough free stack space, otherwise falls back to dynamic allocation.
You cannot take advantage of the VLA type system, so you would have to change your code to work with a pointer to the first element of such an array.
You may end up needing to use a macro which has different definitions depending on the platform. E.g. it invokes _alloca or _malloca on MSVC, and on g++ or other compilers either calls alloca (if they support it) or makes a VLA and a pointer.
Consider investigating ways to rewrite the code without needing to allocate an unknown amount of stack. One option is to allocate a fixed-size buffer that is the maximum you will need. (If that would cause stack overflow it means your code is bugged anyway).
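The platform-dependent macro mentioned above might look something like the following sketch. The macro names STACK_ALLOC and STACK_FREE and the function sum_of_squares are made up for illustration, and the non-MSVC branch assumes a POSIX-like platform that ships <alloca.h>. It has to be a macro rather than a helper function because the memory belongs to the frame of the function that calls alloca.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_MSC_VER)
  #include <malloc.h>
  // _malloca may fall back to the heap, so each block must go through _freea.
  #define STACK_ALLOC(bytes) _malloca(bytes)
  #define STACK_FREE(ptr)    _freea(ptr)
#else
  #include <alloca.h>  // assumption: available on this platform
  // alloca memory is released automatically when the calling function returns.
  #define STACK_ALLOC(bytes) alloca(bytes)
  #define STACK_FREE(ptr)    ((void)(ptr))
#endif

// Example client: uses a temporary array of n doubles, as a VLA would.
double sum_of_squares(std::size_t n) {
    // Keep requests small: alloca has no way to report failure.
    assert(n > 0 && n <= 4096);
    double* tmp = static_cast<double*>(STACK_ALLOC(n * sizeof(double)));
    for (std::size_t i = 0; i < n; ++i)
        tmp[i] = static_cast<double>(i) * static_cast<double>(i);
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += tmp[i];
    STACK_FREE(tmp);
    return sum;
}
```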

Related

Do 2d+ vectors cause a performance hit? [duplicate]

In our C++ course they suggest not to use C++ arrays on new projects anymore. As far as I know Stroustrup himself suggests not to use arrays. But are there significant performance differences?

Using C++ arrays with new (that is, using dynamic arrays) should be avoided. There is the problem that you have to keep track of the size, and you need to delete them manually and do all sorts of housekeeping.
Using arrays on the stack is also discouraged because you don't have range checking, and passing the array around will lose any information about its size (array to pointer conversion). You should use std::array in that case, which wraps a C++ array in a small class and provides a size function and iterators to iterate over it.
Now, std::vector vs. native C++ arrays (taken from the internet):
// Comparison of assembly code generated for basic indexing, dereferencing,
// and increment operations on vectors and arrays/pointers.
// Assembly code was generated by gcc 4.1.0 invoked with g++ -O3 -S on a
// x86_64-suse-linux machine.
#include <vector>
struct S
{
int padding;
std::vector<int> v;
int * p;
std::vector<int>::iterator i;
};
int pointer_index (S & s) { return s.p[3]; }
// movq 32(%rdi), %rax
// movl 12(%rax), %eax
// ret
int vector_index (S & s) { return s.v[3]; }
// movq 8(%rdi), %rax
// movl 12(%rax), %eax
// ret
// Conclusion: Indexing a vector is the same damn thing as indexing a pointer.
int pointer_deref (S & s) { return *s.p; }
// movq 32(%rdi), %rax
// movl (%rax), %eax
// ret
int iterator_deref (S & s) { return *s.i; }
// movq 40(%rdi), %rax
// movl (%rax), %eax
// ret
// Conclusion: Dereferencing a vector iterator is the same damn thing
// as dereferencing a pointer.
void pointer_increment (S & s) { ++s.p; }
// addq $4, 32(%rdi)
// ret
void iterator_increment (S & s) { ++s.i; }
// addq $4, 40(%rdi)
// ret
// Conclusion: Incrementing a vector iterator is the same damn thing as
// incrementing a pointer.
Note: If you allocate arrays with new and allocate non-class objects (like plain int) or classes without a user-defined constructor, and you don't want your elements to be initialized, using new-allocated arrays can have performance advantages because std::vector initializes all elements to default values (0 for int, for example) on construction (credits to @bernie for reminding me).

Preamble for micro-optimizer people
Remember:
"Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%".
(Thanks to metamorphosis for the full quote)
Don't use a C array instead of a vector (or whatever) just because you believe it's faster as it is supposed to be lower-level. You would be wrong.
Use vector by default (or the safe container adapted to your need), and then if your profiler says it is a problem, see if you can optimize it, either by using a better algorithm or by changing the container.
This said, we can go back to the original question.
Static/Dynamic Array?
The C++ array classes are better behaved than the low-level C array because they know a lot about themselves, and can answer questions C arrays can't. They are able to clean up after themselves. And more importantly, they are usually written using templates and/or inlining, which means that what appears to be a lot of code in debug resolves to little or no code in a release build, meaning no difference from their built-in, less safe competition.
All in all, it falls into two categories:
Dynamic arrays
Using a pointer to a malloc-ed/new-ed array will be at best as fast as the std::vector version, and a lot less safe (see litb's post).
So use a std::vector.
Static arrays
Using a static array will be at best:
as fast as the std::array version
and a lot less safe.
So use a std::array.
Uninitialized memory
Sometimes, using a vector instead of a raw buffer incurs a visible cost because the vector will initialize the buffer at construction, while the code it replaces didn't, as bernie remarked in his answer.
If this is the case, then you can handle it by using a unique_ptr instead of a vector or, if the case is not exceptional in your codeline, actually write a class buffer_owner that will own that memory, and give you easy and safe access to it, including bonuses like resizing it (using realloc?), or whatever you need.
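A minimal sketch of the unique_ptr variant, assuming the buffer is about to be overwritten anyway (the function name copy_to_buffer is invented for the example):

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>

// Copies n ints from src into a freshly allocated, owned buffer.
std::unique_ptr<int[]> copy_to_buffer(const int* src, std::size_t n) {
    assert(src != nullptr || n == 0);
    // `new int[n]` default-initializes, which leaves the ints indeterminate.
    // std::vector<int>(n) or `new int[n]()` would zero them first.
    std::unique_ptr<int[]> buf(new int[n]);
    if (n != 0)
        std::memcpy(buf.get(), src, n * sizeof(int));
    return buf;  // the array is released with delete[] automatically
}
```

The trade-off is that unique_ptr<int[]> does not remember its size, so the caller has to carry n alongside it.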

Vectors are arrays under the hood.
The performance is the same.
One place where you can run into a performance issue, is not sizing the vector correctly to begin with.
As a vector fills, it will resize itself, and that can imply, a new array allocation, followed by n copy constructors, followed by about n destructor calls, followed by an array delete.
If your construct/destruct is expensive, you are much better off making the vector the correct size to begin with.
There is a simple way to demonstrate this. Create a simple class that shows when it is constructed/destroyed/copied/assigned. Create a vector of these things, and start pushing them on the back end of the vector. When the vector fills, there will be a cascade of activity as the vector resizes. Then try it again with the vector sized to the expected number of elements. You will see the difference.
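The experiment described above could be sketched like this. Tracer and copies_for_push_backs are hypothetical names; the class declares only a copy constructor (no move constructor), so every relocation during growth shows up as a copy.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how many times an element is copy-constructed.
struct Tracer {
    static std::size_t copies;
    Tracer() {}
    Tracer(const Tracer&) { ++copies; }
};
std::size_t Tracer::copies = 0;

// Pushes n elements and returns the number of copy constructions observed.
std::size_t copies_for_push_backs(std::size_t n, bool reserve_first) {
    Tracer::copies = 0;
    std::vector<Tracer> v;
    if (reserve_first)
        v.reserve(n);       // one allocation up front, no later regrowth
    Tracer t;
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(t);     // one copy per push, plus copies on each regrowth
    assert(v.size() == n);
    return Tracer::copies;
}
```

With reserve, the count equals n. Without it, the count is larger because each regrowth copies all existing elements into the new block; the exact number depends on the implementation's growth factor.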

To respond to something Mehrdad said:
However, there might be cases where
you still need arrays. When
interfacing with low level code (i.e.
assembly) or old libraries that
require arrays, you might not be able
to use vectors.
Not true at all. Vectors degrade nicely into arrays/pointers if you use:
std::vector<double> v;
v.push_back(42);
double *array = &(*v.begin()); // v.data() in C++11 and later
// pass the array to whatever low-level code you have
This works for all major STL implementations. In the next standard, it will be required to work (even though it does just fine today).

You have even fewer reasons to use plain arrays in C++11.
There are 3 kinds of arrays in nature, from fastest to slowest, depending on the features they have (of course the quality of implementation can make things really fast even for case 3 in the list):
Static with size known at compile time. --- std::array<T, N>
Dynamic with size known at runtime and never resized. The typical optimization here is to allocate the array directly on the stack when possible. -- Not available. Maybe dynarray in a C++ TS after C++14. In C there are VLAs.
Dynamic and resizable at runtime. --- std::vector<T>
For 1. plain static arrays with fixed number of elements, use std::array<T, N> in C++11.
For 2. fixed-size arrays whose size is specified at runtime but never changes afterwards, there was discussion for C++14, but the proposal was moved to a technical specification and ultimately left out of C++14.
For 3. std::vector<T> will usually ask for memory in the heap. This could have performance consequences, though you could use std::vector<T, MyAlloc<T>> to improve the situation with a custom allocator. The advantage compared to T* myarray = new T[n]; is that you can resize it and that it will not decay to a pointer, as plain arrays do.
Use the standard library types mentioned to avoid arrays decaying to pointers. You will save debugging time and the performance is exactly the same as with plain arrays if you use the same set of features.

There is definitely a performance impact to using an std::vector vs a raw array when you want an uninitialized buffer (e.g. to use as destination for memcpy()). An std::vector will initialize all its elements using the default constructor. A raw array will not.
The C++ spec for the std::vector constructor taking a count argument (it's the third form) states:
Constructs a new container from a variety of data sources, optionally using a user supplied allocator alloc.
Constructs the container with count default-inserted instances of T. No copies are made.
Complexity
2-3) Linear in count
A raw array does not incur this initialization cost.
Note that with a custom allocator, it is possible to avoid "initialization" of the vector's elements (i.e. to use default initialization instead of value initialization). See these questions for more details:
Is this behavior of vector::resize(size_type n) under C++11 and Boost.Container correct?
How can I avoid std::vector<> to initialize all its elements?
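One commonly cited way to do this is an allocator adaptor whose construct() performs default initialization when called with no arguments. The sketch below follows that idea; default_init_allocator and make_buffer are names chosen for the example and are not part of the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <new>
#include <type_traits>
#include <utility>
#include <vector>

// Allocator adaptor: identical to A, except that value-initialization
// requests (no constructor arguments) become default-initialization.
template <typename T, typename A = std::allocator<T>>
class default_init_allocator : public A {
    using traits = std::allocator_traits<A>;
public:
    template <typename U>
    struct rebind {
        using other =
            default_init_allocator<U, typename traits::template rebind_alloc<U>>;
    };
    using A::A;

    // Used by vector(n) and resize(n): for int and similar types this
    // leaves the storage untouched instead of zeroing it.
    template <typename U>
    void construct(U* p) noexcept(std::is_nothrow_default_constructible<U>::value) {
        ::new (static_cast<void*>(p)) U;
    }
    // Every other form of construction is forwarded unchanged.
    template <typename U, typename... Args>
    void construct(U* p, Args&&... args) {
        traits::construct(static_cast<A&>(*this), p, std::forward<Args>(args)...);
    }
};

// Returns a vector of n ints whose values are indeterminate until written.
std::vector<int, default_init_allocator<int>> make_buffer(std::size_t n) {
    std::vector<int, default_init_allocator<int>> v(n);
    assert(v.size() == n);
    return v;
}
```

Reading an element of such a buffer before writing to it is undefined behavior, exactly as with a raw new int[n] array.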

Go with STL. There's no performance penalty. The algorithms are very efficient and they do a good job of handling the kinds of details that most of us would not think about.

STL is a heavily optimized library. In fact, it's even suggested to use STL in games where high performance might be needed. Arrays are too error prone to be used in day to day tasks. Today's compilers are also very smart and can really produce excellent code with STL. If you know what you are doing, STL can usually provide the necessary performance. For example by initializing vectors to required size (if you know from start), you can basically achieve the array performance. However, there might be cases where you still need arrays. When interfacing with low level code (i.e. assembly) or old libraries that require arrays, you might not be able to use vectors.

About duli's contribution with my own measurements.
The conclusion is that arrays of integers are faster than vectors of integers (5 times in my example). However, arrays and vectors are around the same speed for more complex / not aligned data.

If you compile the software in debug mode, many compilers will not inline the accessor functions of the vector. This will make the stl vector implementation much slower in circumstances where performance is an issue. It will also make the code easier to debug since you can see in the debugger how much memory was allocated.
In optimized mode, I would expect the stl vector to approach the efficiency of an array. This is since many of the vector methods are now inlined.

The performance difference between the two is very much implementation dependent - if you compare a badly implemented std::vector to an optimal array implementation, the array would win, but turn it around and the vector would win...
As long as you compare apples with apples (either both the array and the vector have a fixed number of elements, or both get resized dynamically) I would think that the performance difference is negligible as long as you follow good STL coding practice. Don't forget that using standard C++ containers also allows you to make use of the pre-rolled algorithms that are part of the standard C++ library and most of them are likely to be better performing than the average implementation of the same algorithm you build yourself.
That said, IMHO the vector wins in a debug scenario with a debug STL as most STL implementations with a proper debug mode can at least highlight/catch the typical mistakes made by people when working with standard containers.
Oh, and don't forget that the array and the vector share the same memory layout so you can use vectors to pass data to legacy C or C++ code that expects basic arrays. Keep in mind that most bets are off in that scenario, though, and you're dealing with raw memory again.

If you're using vectors to represent multi-dimensional behavior, there is a performance hit.
Do 2d+ vectors cause a performance hit?
The gist is that there's a small amount of overhead with each sub-vector having size information, and there will not necessarily be serialization of data (as there is with multi-dimensional c arrays). This lack of serialization can offer greater than micro optimization opportunities. If you're doing multi-dimensional arrays, it may be best to just extend std::vector and roll your own get/set/resize bits function.
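As a sketch of the "roll your own" suggestion, the class below stores all elements in one contiguous std::vector and maps (row, column) to a flat index. It wraps the vector rather than inheriting from it, which is a deliberate substitution for "extend std::vector"; the name matrix2d is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major 2D array over a single contiguous allocation.
template <typename T>
class matrix2d {
    std::size_t rows_, cols_;
    std::vector<T> data_;
public:
    matrix2d(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols) {}

    T& operator()(std::size_t r, std::size_t c) {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];  // element (r, c) lives at r*cols + c
    }
    const T& operator()(std::size_t r, std::size_t c) const {
        assert(r < rows_ && c < cols_);
        return data_[r * cols_ + c];
    }
    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }
    T* data() { return data_.data(); }  // contiguous, like a C 2D array
};
```

Compared with std::vector<std::vector<T>>, there is one allocation instead of rows + 1, and no per-row size bookkeeping.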

If you do not need to dynamically adjust the size, you have the memory overhead of saving the capacity (one pointer/size_t). That's it.

There might be some edge case where you have a vector access inside an inline function inside an inline function, where you've gone beyond what the compiler will inline and it will force a function call. That would be so rare as to not be worth worrying about - in general I would agree with litb.
I'm surprised nobody has mentioned this yet - don't worry about performance until it has been proven to be a problem, then benchmark.

I'd argue that the primary concern isn't performance, but safety. You can make a lot of mistakes with arrays (consider resizing, for example), where a vector would save you a lot of pain.

Vectors use a tiny bit more memory than arrays since they contain the size of the array. They also increase the hard disk size of programs and probably the memory footprint of programs. These increases are tiny, but may matter if you're working with an embedded system. Though most places where these differences matter are places where you would use C rather than C++.

The following simple test:
C++ Array vs Vector performance test explanation
contradicts the conclusions from "Comparison of assembly code generated for basic indexing, dereferencing, and increment operations on vectors and arrays/pointers."
There must be a difference between the arrays and vectors. The test says so... just try it, the code is there...

Sometimes arrays are indeed better than vectors. If you are always manipulating
a fixed length set of objects, arrays are better. Consider the following code snippets:
int main() {
int v[3];
v[0]=1; v[1]=2;v[2]=3;
int sum = 0;
int starttime=time(NULL);
cout << starttime << endl;
for (int i=0;i<50000;i++)
for (int j=0;j<10000;j++) {
X x(v);
sum+=x.first();
}
int endtime=time(NULL);
cout << endtime << endl;
cout << endtime - starttime << endl;
}
where the vector version of X is
class X {
vector<int> vec;
public:
X(const vector<int>& v) {vec = v;}
int first() { return vec[0];}
};
and the array version of X is:
class X {
int f[3];
public:
X(int a[]) {f[0]=a[0]; f[1]=a[1];f[2]=a[2];}
int first() { return f[0];}
};
The array version of main() will be faster because we are avoiding the
overhead of "new" every time in the inner loop.
(This code was posted to comp.lang.c++ by me).

For fixed-length arrays the performance is the same (vs. vector<>) in release build, but in debug build low-level arrays win by a factor of 20 in my experience (MS Visual Studio 2015, C++ 11).
So the "save time debugging" argument in favor of STL might be valid if you (or your coworkers) tend to introduce bugs in your array usage, but maybe not if your debugging time is mostly waiting on your code to run to the point you are currently working on so that you can step through it.
Experienced developers working on numerically intensive code sometimes fall into the second group (especially if they use vector :) ).

Assuming a fixed-length array (e.g. int* v = new int[1000]; vs std::vector<int> v(1000);, with the size of v being kept fixed at 1000), the only performance consideration that really matters (or at least mattered to me when I was in a similar dilemma) is the speed of access to an element. I looked up the STL's vector code, and here is what I found:
const_reference
operator[](size_type __n) const
{ return *(this->_M_impl._M_start + __n); }
This function will most certainly be inlined by the compiler. So, as long as the only thing that you plan to do with v is access its elements with operator[], it seems like there shouldn't really be any difference in performance.

There is no single answer as to which of them is best. Both have their own use cases, and both have their pros and cons. One of the main difficulties with arrays is that they are fixed in size: once an array is defined, its size cannot be changed. Vectors, on the other hand, are flexible: because a vector allocates its memory dynamically on the heap rather than statically, you can push and pop elements and change its size whenever you want. Bjarne Stroustrup, the creator of C++, has said that vectors are more flexible to use than arrays.
Using C++ arrays with new (that is, using dynamic arrays) should be avoided. There is the problem that you have to keep track of the size, and you need to delete them manually and do all sorts of housekeeping.
We can also insert, push and pop values easily in vectors, which is not easily possible with arrays.
Performance-wise, if you are working with small data sets you should use arrays, and if you are working with large-scale code you should go with vectors (vectors are better at handling large data sets than arrays).

Utilize memory past the end of a std::vector using a custom overallocating allocator

Let's say I have an allocator my_allocator that will always allocate memory for n+x (instead of n) elements when allocate(n) is called.
Can I safely assume that memory in the range [data()+n, data()+n+x) (for a std::vector<T, my_allocator<T>>) is accessible/valid for further use (i.e. placement new, or simd loads/stores in the case of fundamentals), as long as there is no reallocation?
Note: I'm aware that everything past data()+n-1 is uninitialized storage. The use case would be a vector of fundamental types (which do not have a constructor anyway) using the custom allocator to avoid having special corner cases when throwing simd intrinsics at the vector. my_allocator shall allocate storage that is 1.) properly aligned and has 2.) a size that is a multiple of the used register size.
To make things a little bit more clear:
Let's say I have two vectors and I want to add them:
std::vector<double, my_allocator<double>> a(n), b(n);
// fill them ...
auto c = a + b;
assert(c.size() == n);
If the storage obtained from my_allocator now allocates aligned storage and if sizeof(double)*(n+x) is always a multiple of the used simd register size (and thus a multiple of the number of values per register) I assume that I can do something like
for(size_t i=0u; i<(n+x); i+=y)
{ // where y is the number of doubles per register and a divisor of (n+x)
auto ma = _aligned_load(a.data() + i);
auto mb = _aligned_load(b.data() + i);
_aligned_store(c.data() + i, _simd_add(ma, mb));
}
where I don't have to care about any special case like unaligned loads or backlog from some n that is not divisible by y.
But still the vectors only contain n values and can be handled like vectors of size n.

Stepping back a moment, if the problem you are trying to solve is to allow the underlying memory to be processed effectively by SIMD intrinsics or unrolled loops, or both, you don't necessarily need to allocate memory beyond the used amount just to "round off" the allocation size to a multiple of vector width.
There are various approaches used to handle this situation, and you mentioned a couple, such as special lead-in and lead-out code to handle the leading and trailing portions. There are actually two distinct problems here - handling the fact the data isn't a multiple of the vector width, and handling (possibly) unaligned starting addresses. Your over-allocation method is tackling the first issue - but there's probably a better way...
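For comparison, the conventional "lead-out" approach for a length that is not a multiple of the vector width can be sketched in plain C++ as below. No real intrinsics are used: the inner loop over kLanes merely stands in for one SIMD load/add/store, and kLanes = 4 is an assumed register width in doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Element-wise c = a + b with a blocked main loop and a scalar tail.
void add_blocked(const double* a, const double* b, double* c, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && c != nullptr));
    const std::size_t kLanes = 4;                 // assumed doubles per register
    const std::size_t main_end = n - n % kLanes;  // largest multiple of kLanes <= n
    std::size_t i = 0;
    for (; i < main_end; i += kLanes) {
        // Placeholder for one SIMD load/add/store over kLanes elements.
        for (std::size_t k = 0; k < kLanes; ++k)
            c[i + k] = a[i + k] + b[i + k];
    }
    // Lead-out: the remaining n % kLanes elements, one at a time.
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```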
Most SIMD code in practice can simply read beyond the end of the processed region. Some might argue that this is technically UB - but when using SIMD intrinsics you are already venturing beyond the walls of Standard C++. In fact, this technique is already widely used in the standard library and so it is implicitly endorsed by compiler and library maintainers. It is also a standard method for handling SIMD codes in general, so you can be pretty sure it's not going to suddenly break.
The key to making it work is the observation that if you can validly read even a single byte at some location N, then any naturally aligned read of any size (see footnote 1) that contains that byte won't trigger a fault. Of course, you still need to ignore or otherwise handle the data you read beyond the end of the officially allocated area - but you'll need to do that anyway with your "allocate extra" approach, right? Depending on the algorithm, you may mask away the invalid data, or exclude invalid data after the SIMD portion is done (i.e., if you are searching for a byte, if you find a byte after the allocated area, it's the same as "not found").
To make this work, you need to be reading in an aligned fashion, but that's probably something you already want to do I think. You can either arrange to have your memory allocated aligned in the first place, or do an overlapping read at the start (i.e., one unaligned read first, then all aligned with the first aligned read overlapping the unaligned portion), or use the same trick as the tail to read before the array (with the same reasoning as to why this is safe). Furthermore, there are various tricks to request aligned memory without needing to write your own allocator.
Overall, my recommendation is to try to avoid writing a custom allocator. Unless the code is fairly tightly contained, you may run into various pitfalls, including other code making wrong assumptions about how your memory was allocated and the various other pitfalls Leon mentions in his answer. Furthermore, using a custom allocator disables a bunch of optimizations used by the standard container algorithms, unless you use it everywhere, since many of them apply only to containers using the same allocator.
Furthermore, when I was actually implementing custom allocators (see footnote 2), I found that it was a nice idea in theory, but a bit too obscure to be well-supported in an identical fashion across all the compilers. Now the compilers have become a lot more compliant over time (I'm looking mostly at you, Visual Studio), and template support has also improved, so perhaps that's not an issue, but I feel it still falls into the category of "do it only if you must".
Keep in mind also that custom allocators don't compose well - you only get the one! If someone else on your project wants to use a custom allocator for your container for some other reason, they won't be able to do it (although you could coordinate and create a combined allocator).
This question I asked earlier - also motivated by SIMD - covers a lot of the ground about the safety of reading past the end (and, implicitly, before the beginning), and is probably a good place to start if you are considering this.
1 Technically, the restriction is any aligned read up to the page size, which at 4K or larger is plenty for any of the current vector-oriented general purpose ISAs.
2 In this case, I was doing it not for SIMD, but basically to avoid malloc() and to allow partially on-stack and contiguous fast allocations for containers with many small nodes.

For your use case you shouldn't have any doubts. However, if you decide to store anything useful in the extra space and will allow the size of your vector to change during its lifetime, you will probably run into problems dealing with the possibility of reallocation - how are you going to transfer the extra data from the old allocation to the new allocation given that reallocation happens as a result of separate calls to allocate() and deallocate() with no direct connection between them?
EDIT (addressing the code added to the question)
In my original answer I meant that you shouldn't have any problem accessing the extra bytes allocated by your allocator in excess of what was requested. However, writing to memory that lies outside the range currently used by the vector object, but inside the range that the unmodified allocation would span, is asking for trouble. An implementation of std::vector is free to request more memory from the allocator than it exposes through its size()/capacity() functions and to store auxiliary data in the unused area. Though this is highly theoretical, not accounting for that possibility opens the door to undefined behavior.
Consider the following possible layout of the vector's allocation:
---====================++++++++++------.........
1. === - used capacity of the vector
2. +++ - unused capacity of the vector
3. --- - overallocated by the vector (but not shown as part of its capacity)
4. ... - overallocated by your allocator
You MUST NOT write anything in regions 2 (+++) and 3 (---). All your writes must be confined to region 4 (...), otherwise you may corrupt important bits.
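As a rough illustration of where region 4 comes from, here is a minimal sketch of an allocator that asks for a fixed number of extra bytes after whatever the vector requests. The names padded_allocator, PadBytes and padded_int_vector are hypothetical, and the sketch ignores overflow of the size computation and over-aligned element types.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Hypothetical allocator: each allocation gets PadBytes of extra space
// after the bytes the container asked for (region 4 in the diagram).
template <class T, std::size_t PadBytes = 32>
struct padded_allocator {
    using value_type = T;

    // Needed because PadBytes is a non-type parameter, so the default
    // std::allocator_traits rebind cannot be used.
    template <class U>
    struct rebind { using other = padded_allocator<U, PadBytes>; };

    padded_allocator() = default;
    template <class U>
    padded_allocator(const padded_allocator<U, PadBytes>&) noexcept {}

    T* allocate(std::size_t n) {
        // The container's bytes plus the trailing padding.
        return static_cast<T*>(::operator new(n * sizeof(T) + PadBytes));
    }
    void deallocate(T* p, std::size_t) noexcept { ::operator delete(p); }
};

// The allocator is stateless, so all instances compare equal.
template <class T, class U, std::size_t P>
bool operator==(const padded_allocator<T, P>&, const padded_allocator<U, P>&) noexcept { return true; }
template <class T, class U, std::size_t P>
bool operator!=(const padded_allocator<T, P>&, const padded_allocator<U, P>&) noexcept { return false; }

using padded_int_vector = std::vector<int, padded_allocator<int>>;
```

Note that the padding sits after the end of the block the vector requested, not after size() or capacity(), which is exactly why regions 2 and 3 remain off limits.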

Why is it impossible to allocate an array of an arbitrary size on the stack?

Why can't I write the following?
char acBuf[nSize];
Only to prevent the stack from overgrowing?
Or is there a possibility to do something similar, if I can ensure that I always take just a few hundred kilobytes?
As far as I know, the std::string uses the memory of its members to store the assigned strings, as long as they are 15 characters or less. Only if the strings are longer, it uses this memory to store the address of some heap-allocated memory, which then takes the data.
It seems like it has to be fully determined at compile time how the stack will be laid out at runtime. Is that true? Why is that?

It has nothing to do with preventing stack overflow; you can overflow the stack just fine with char a[SOME_LARGE_CONSTANT]. In C++ the array size has to be known at compile time. Among other things, this is needed to compute the size of structures containing arrays.
C, on the other hand, has had variable length arrays since C99, which adds an exception and allows runtime-dependent sizes for arrays within function scope. As to why C++ does not have this: it was never adopted by a C++ standard.
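The point about structure sizes can be made concrete with a small sketch. The Packet type below is made up for illustration; the assertions hold on common ABIs, where a struct of char arrays has no padding.

```cpp
#include <cstddef>

// Hypothetical struct: both array bounds must be constant expressions.
struct Packet {
    char header[4];
    char payload[60];
};

// The compiler can evaluate these only because the bounds are constants.
static_assert(sizeof(Packet) == 64, "size is fixed at compile time");
static_assert(offsetof(Packet, payload) == 4, "member offsets are constants too");
```

If payload could have a runtime bound, neither sizeof(Packet) nor the offset of any member declared after it would be a compile-time constant.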

Why can't I write the following?
char acBuf[nSize];
Those are called variable length arrays (VLAs) and aren't supported by C++. The reason is that the stack is very fast but tiny compared to the free store (the heap, in your words). This means that whenever a VLA ends up with a lot of elements, your stack might just overflow, and you get a vague runtime error. This can also happen with compile-time sized stack arrays, but those are much easier to catch because the behaviour of the program doesn't influence their size: x doesn't have to happen after y to create a stack overflow, it's just there right off the bat. This covers it in more detail and rage.
Containers like std::vector use the free store which is way bigger and has a way to deal with over-allocation (throws bad_alloc).

Unlike C, C++ doesn't support variable length arrays. If you want them, you can use non-standard extensions such as alloca or GNU extensions (supported by clang and GCC). They have their caveats, so be sure to read the manual to make sure you use them safely.
The reason the stack layout is mostly determined statically is so that the generated code has to perform fewer computations (additions, multiplications, and pointer dereferencing) to figure out where the data is on the stack. The offsets can instead be hardcoded into the generated machine code.

My advice is to take a look at alloca.h
void *alloca(size_t size);
The alloca() function allocates size bytes of space in the stack
frame of the caller. This temporary space is automatically freed
when the function that called alloca() returns to its caller.
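A minimal sketch of how this might be used from C++ follows. alloca is not part of the C++ standard; the header and spelling differ between platforms (the STACK_ALLOC macro and the sum_of_squares function are made up for this example), and the 1024-element cap is an arbitrary safety limit, not a recommendation.

```cpp
#include <cassert>
#include <cstddef>
#if defined(_MSC_VER)
#include <malloc.h>                      // MSVC spells it _alloca
#define STACK_ALLOC(bytes) _alloca(bytes)
#else
#include <alloca.h>                      // glibc and most Unix-like systems
#define STACK_ALLOC(bytes) alloca(bytes)
#endif

// Computes 0^2 + 1^2 + ... + (n-1)^2 using a scratch buffer on the stack.
long sum_of_squares(std::size_t n) {
    assert(n <= 1024);                   // keep the stack usage bounded
    long* buf = static_cast<long*>(STACK_ALLOC(n * sizeof(long)));
    for (std::size_t i = 0; i < n; ++i) buf[i] = static_cast<long>(i * i);
    long total = 0;
    for (std::size_t i = 0; i < n; ++i) total += buf[i];
    return total;                        // buf is released when the function returns
}
```

Because the space lives in the caller's frame, the buffer must never be returned or stored anywhere that outlives the function.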

One possible problem I see with VLA in C++ is the type.
What is the type of acBuf in char acBuf[nSize] or even worse in char acBuf[nSize][nSize] ?
template <typename T> void foo(const T&);
void foo(int n)
{
    char mat[n][n];
    foo(mat);
}
You cannot pass that array by reference to
template <typename T, std::size_t N>
void foo_on_array(const T (&a)[N]);
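To see why the type matters, here is a small sketch of what such a template does with an ordinary fixed-size array. The helper name extent_of is made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// N is deduced from the array's type, so it must be a compile-time constant.
template <typename T, std::size_t N>
std::size_t extent_of(const T (&)[N]) { return N; }

// Usage with fixed-size arrays: the bound is part of the type.
inline std::size_t demo_fixed() {
    int fixed[5] = {};
    return extent_of(fixed);
}
```

For char mat[n][n] there is no constant N to deduce, so even on compilers that accept VLAs as an extension, a call like extent_of(mat) has nothing to match against.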

You should be happy that the C++ standard discourages a dangerous practice (variable-length arrays on the stack) and instead encourages a less dangerous practice (dynamically allocated arrays and std::vector with heap allocation).
Variable-length arrays on the stack are more dangerous because:
The available stack space is typically only a few megabytes (8 MB by default on Linux, 1 MB on Windows), much smaller than the 2 GB (or more) of available heap space.
When the stack space is exhausted, the program crashes with SIGSEGV, and it requires special software such as GNU libsigsegv to recover from such a situation.
In typical C++ programs, the programmer does not know whether the array length will definitely stay under a limit such as 4 MB.

Why can't I write the following?
char acBuf[nSize];
You can't do that because in C++ the length of the array has to be known at compile time: the compiler reserves the specified memory for the array, and it cannot be modified at runtime. It's not about preventing a stack overflow, it's about memory layout.
If you want a dynamic array, you should use the new operator so that it is stored on the heap (and release it with delete[] when you are done).
char *acBuf = new char[nSize];

Is there ever a valid reason to use C-style arrays in C++?

Between std::vector and std::array in TR1 and C++11, there are safe alternatives for both dynamic and fixed-size arrays which know their own length and don't exhibit horrible pointer/array duality.
So my question is, are there any circumstances in C++ when C arrays must be used (other than calling C library code), or is it reasonable to "ban" them altogether?
EDIT:
Thanks for the responses everybody, but it turns out this question is a duplicate of
Now that we have std::array what uses are left for C-style arrays?
so I'll direct everybody to look there instead.
[I'm not sure how to close my own question, but if a moderator (or a few more people with votes) wander past, please feel free to mark this as a dup and delete this sentence.]

I didn't want to answer this at first, but I'm already getting worried that this question is going to be swamped with C programmers, or people who write C++ as object-oriented C.
The real answer is that in idiomatic C++ there is almost never a reason to use a C-style array. Even when using a C-style code base, I usually use vectors. How is that possible, you say? Well, if you have a vector v and a C-style function requires a pointer to be passed in, you can pass &v[0] (or better yet, v.data(), which is the same thing).
Even for performance, it's very rare that you can make a case for a C-style array. A std::vector does involve a double indirection, but I believe this is generally optimized away. If you don't trust the compiler (which is almost always a terrible move), then you can always use the same technique as above with v.data() to grab a pointer for your tight loop. For std::array, I believe the wrapper is even thinner.
You should only use one if you are an awesome programmer and you know exactly why you are doing it, or if an awesome programmer looks at your problem and tells you to. If you aren't awesome and you are using C-style arrays, the chances are high (but not 100%) that you are making a mistake.
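A minimal sketch of the v.data() technique, assuming a C-style function that takes a pointer and a length. Both fill_squares and squares are invented here as stand-ins.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for a C-style API that expects a raw pointer and an element count.
void fill_squares(int* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = static_cast<int>(i * i);
}

// The vector owns the memory; the C-style function only borrows it.
std::vector<int> squares(std::size_t n) {
    std::vector<int> v(n);
    fill_squares(v.data(), v.size());
    return v;
}
```

The pointer from v.data() stays valid only until the vector reallocates or is destroyed, so it should not be kept across calls that may resize v.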

Foo data[] = {
is a pretty common pattern. Elements can be added to it easily, and the size of the data array grows based on the elements added.
With C++11 you can replicate this with a std::array:
template<class T, class... Args>
auto make_array( Args&&... args )
-> std::array< T, sizeof...(Args) >
{
return { std::forward<Args>(args)... };
}
but even this isn't as good as one might like, as it does not support nested brackets like a C array does.
Suppose Foo was struct Foo { int x; double y; };. Then with C style arrays we can:
Foo arr[] = {
{1,2.2},
{3,4.5},
};
meanwhile
auto arr = make_array<Foo>(
    {1,2.2},
    {3,4.5}
);
does not compile. You'd have to repeat Foo for each line:
auto arr = make_array<Foo>(
    Foo{1,2.2},
    Foo{3,4.5}
);
which is copy-paste noise that can get in the way of the code being expressive.
Finally, note that "hello" is a const array of size 6. Code needs to know how to consume C-style arrays.
My typical response to this situation is to convert C-style arrays and C++ std::arrays into array_views, a range that consists of two pointers, and operate on them. This means I do not care if I was fed an array based on C or C++ syntax: I just care I was fed a packed sequence of data elements. These can also consume std::dynarrays and std::vectors with little work.
It did require writing an array_view, or stealing one from boost, or waiting for it to be added to the standard.
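For readers who have not seen one, here is a minimal sketch of what such an array_view could look like. This is not the Boost or any proposed standard interface; the member names and the view_of and total helpers are made up, and only mutable std::array and std::vector objects are handled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical two-pointer view over a packed sequence of T.
template <class T>
struct array_view {
    T* first;
    T* last;
    T* begin() const { return first; }
    T* end() const { return last; }
    std::size_t size() const { return static_cast<std::size_t>(last - first); }
    T& operator[](std::size_t i) const { return first[i]; }
};

// Adapters: the caller no longer cares which kind of array it was given.
template <class T, std::size_t N>
array_view<T> view_of(T (&a)[N]) { return {a, a + N}; }
template <class T, std::size_t N>
array_view<T> view_of(std::array<T, N>& a) { return {a.data(), a.data() + N}; }
template <class T>
array_view<T> view_of(std::vector<T>& v) { return {v.data(), v.data() + v.size()}; }

// A consumer written once against the view.
template <class T>
long total(array_view<T> v) {
    long sum = 0;
    for (T x : v) sum += x;
    return sum;
}
```

A string literal works too: view_of("hello") deduces T as const char and N as 6, matching the remark above about its size.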

Sometimes an existing code base can force you to use them

The last time I needed to use them in new code was when I was doing embedded work and the standard library just didn't have an implementation of std::vector or std::array. In some older code bases you have to use arrays because of design decisions made by the previous developers.
In most cases if you are starting a new project with C++11 the old C style arrays are a fairly poor choice. This is because relative to std::array they are difficult to get correct and this difficulty is a direct expense when developing. This C++ FAQ entry sums up my thoughts on the matter fairly well: http://www.parashift.com/c++-faq/arrays-are-evil.html

Pre-C++14: in some (rare) cases, the fact that C-style arrays of types like int are left uninitialized can improve execution speed noticeably, especially if an algorithm needs many short-lived arrays during its execution and the machine does not have enough memory for pre-allocation to make sense, and/or the sizes cannot be known in advance.

C-style arrays are very useful in embedded system where memory is constrained (and severely limited).
The arrays allow for programming without dynamic memory allocation. Dynamic memory allocation generates fragmented memory and at some point in run-time, the memory has to be defragmented. In safety critical systems, defragmentation cannot occur during the periods that have critical timing.
The const arrays allow for data to be put into Read Only Memory or Flash memory, out of the precious RAM area. The data can be directly accessed and does not require any additional initialization time, as with std::vector or std::array.
The C-style array is a convenient tool to place raw data into a program, for example bitmap data for images or fonts. In smaller embedded systems with no hard drives or flash drives, the data must be directly accessed. C-style arrays allow for this.
Edit 1:
Also, std::array cannot be used with compilers that don't support C++11 or later.
Many companies do not want to switch compilers once a project has started. Also, they may need to keep the compiler version around for maintenance fixes, and when Agencies require the company to reproduce an issue with a specified software version of the product.

I found just one reason today: when you want to know precisely the size of a data block and control its alignment within a giant data block.
This is useful when you are dealing with stream processors or streaming extensions like AVX or SSE.
Controlling the allocation so that the data lives in one huge, aligned block in memory is useful. Your objects can manipulate the segments they are responsible for and, when they have finished, you can move and/or process the huge vector in an aligned way.
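A minimal sketch of that idea, assuming 32-byte alignment (as used by AVX) and 4-byte floats. The FloatBlock type and the segment and is_aligned_32 helpers are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical block: a plain C-style array whose size and alignment
// are both known exactly at compile time.
struct alignas(32) FloatBlock {
    float data[1024];
};
static_assert(sizeof(FloatBlock) == 1024 * sizeof(float), "no hidden bookkeeping");
static_assert(alignof(FloatBlock) == 32, "block starts on a 32-byte boundary");

// Hands out the i-th segment of 8 floats (32 bytes when float is 4 bytes).
inline float* segment(FloatBlock& block, std::size_t i) {
    assert(i < 1024 / 8);
    return block.data + i * 8;
}

inline bool is_aligned_32(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}
```

Each object can then be given its own segment, while the block as a whole can still be processed in one aligned pass.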

Why do I need std::get_temporary_buffer?

For what purpose should I use std::get_temporary_buffer? The standard says the following:
Obtains a pointer to storage sufficient to store up to n adjacent T objects.
I thought that the buffer would be allocated on the stack, but that is not true. According to the C++ Standard this buffer is actually not temporary. What advantages does this function have over the global function ::operator new, which doesn't construct the objects either? Am I right that the following statements are equivalent?
int* x;
x = std::get_temporary_buffer<int>( 10 ).first;
x = static_cast<int*>( ::operator new( 10*sizeof(int) ) );
Does this function only exist for syntax sugar? Why is there temporary in its name?
One use case was suggested in the Dr. Dobb's Journal, July 01, 1996 for implementing algorithms:
If no buffer can be allocated, or if it is smaller than requested, the algorithm still works correctly. It merely slows down.

Stroustrup says in "The C++ Programming Language" (§19.4.4, SE):
The idea is that a system may keep a number of fixed-sized buffers ready for fast allocation so that requesting space for n objects may yield space for more than n. It may also yield less, however, so one way of using get_temporary_buffer() is to optimistically ask for a lot and then use what happens to be available.
[...] Because get_temporary_buffer() is low-level and likely to be optimized for managing temporary buffers, it should not be used as an alternative to new or allocator::allocate() for obtaining longer-term storage.
He also starts the introduction to the two functions with:
Algorithms often require temporary space to perform acceptably.
... but doesn't seem to provide a definition of temporary or longer-term anywhere.
An anecdote in "From Mathematics to Generic Programming" mentions that Stepanov provided a bogus placeholder implementation in the original STL design, however:
To his surprise, he discovered years later that all the major vendors that provide STL implementations are still using this terrible implementation [...]

Microsoft's standard library guy says the following (here):
Could you perhaps explain when to use 'get_temporary_buffer'
It has a very specialized purpose. Note that it doesn't throw exceptions, like new (nothrow), but it also doesn't construct objects, unlike new (nothrow).
It's used internally by the STL in algorithms like stable_partition(). This happens when there are magic words like N3126 25.3.13 [alg.partitions]/11: stable_partition() has complexity "At most (last - first) * log(last - first) swaps, but only linear number of swaps if there is enough extra memory." When the magic words "if there is enough extra memory" appear, the STL uses get_temporary_buffer() to attempt to acquire working space. If it can, then it can implement the algorithm more efficiently. If it can't, because the system is running dangerously close to out-of-memory (or the ranges involved are huge), the algorithm can fall back to a slower technique.
99.9% of STL users will never need to know about get_temporary_buffer().
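The "try to get scratch space, otherwise fall back" shape described in that quote can be sketched as follows. This is not how any standard library implements stable_partition(); it uses new (std::nothrow) in place of get_temporary_buffer() and a simple rotation as the algorithm. The function name and the try_buffer flag are invented, and the flag exists only so that the fallback path can be exercised on demand.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>

// Rotates v left by k positions. Returns true if the buffered path ran.
bool rotate_left_adaptive(std::vector<int>& v, std::size_t k, bool try_buffer = true) {
    assert(k <= v.size());
    const std::ptrdiff_t dk = static_cast<std::ptrdiff_t>(k);
    int* buf = nullptr;
    if (try_buffer && k > 0) buf = new (std::nothrow) int[k];  // may yield nullptr
    if (buf != nullptr) {
        std::copy(v.begin(), v.begin() + dk, buf);       // save the prefix
        std::copy(v.begin() + dk, v.end(), v.begin());   // shift the rest down
        std::copy(buf, buf + dk, v.end() - dk);          // put the prefix at the end
        delete[] buf;
        return true;
    }
    std::rotate(v.begin(), v.begin() + dk, v.end());     // in-place fallback
    return false;
}
```

Either path leaves v in the same state; only the route taken differs, which is the property the quoted complexity wording relies on.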

The standard says it allocates storage for up to n elements.
In other words, your example might return a buffer big enough for 5 objects only.
It does seem pretty difficult to imagine a good use case for this though. Perhaps if you're working on a very memory-constrained platform, it's a convenient way to get "as much memory as possible".
But on such a constrained platform, I'd imagine you'd bypass the memory allocator as much as possible, and use a memory pool or something you have full control over.

For what purpose should I use std::get_temporary_buffer?
The function is deprecated in C++17, so the correct answer is now "for no purpose, do not use it".

ptrdiff_t request = 12;
pair<int*,ptrdiff_t> p = get_temporary_buffer<int>(request);
int* base = p.first;
ptrdiff_t respond = p.second;
assert( is_valid( base, base + respond ) );
respond may be less than request. (is_valid stands for a hypothetical check that the range is usable; it is not a library function.)
size_t require = 12;
int* base = static_cast<int*>( ::operator new( require*sizeof(int) ) );
assert( is_valid( base, base + require ) );
The actual size of the block at base must be greater than or equal to require.

Perhaps (just a guess) it has something to do with memory fragmentation. If you keep allocating and deallocating temporary memory heavily, but each time you allocate some memory intended for long-term use after allocating the temporary block and before deallocating it, you may end up with a fragmented heap (I guess).
So the get_temporary_buffer could be intended to be a bigger-than-you-would-need chunk of memory that is allocated once (perhaps there are many chunks ready for accepting multiple requests), and each time you need memory you just get one of the chunks. So the memory doesn't get fragmented.
