Apparent contradiction between Stroustrup's book and the C++ Standard - c++

I'm trying to understand the following paragraph from Stroustrup's "The C++ Programming Language" on page 282 (emphasis is mine):
To deallocate space allocated by new, delete and delete[] must be able
to determine the size of the object allocated. This implies that an
object allocated using the standard implementation of new will occupy
slightly more space than a static object. At a minimum, space is
needed to hold the object’s size. Usually two or more words per
allocation are used for free-store management. Most modern machines
use 8-byte words. This overhead is not significant when we allocate
many objects or large objects, but it can matter if we allocate lots
of small objects (e.g., ints or Points) on the free store.
Note that the author doesn't differentiate whether the object is an array, or not, in the sentence highlighted above.
But according to paragraph §5.3.4/11 in C++14, we have (my emphasis):
When a new-expression calls an allocation function and that allocation
has not been extended, the new-expression passes the amount of space
requested to the allocation function as the first argument of type
std::size_t. That argument shall be no less than the size of the
object being created; it may be greater than the size of the object
being created only if the object is an array.
I may be missing something, but it seems to me, we have a contradiction in those two statements. It was my understanding that the additional space required was only for array objects, and that this additional space would hold the number of elements in the array, not the array size in bytes.

If you call new on a type T, the overloaded operator new that may be invoked will be passed exactly sizeof(T).
If you implement a new of your own (or an allocator) that uses some different memory store (i.e., not just forwarding to another call to new or malloc etc.), you'll find yourself wanting to store information to clean up the allocation later, when the delete occurs. A typical way to do this is to get a slightly larger block of memory, and store the amount of memory requested at the start of it, then return a pointer to later in the memory you acquired.
This is roughly what most standard implementations of new (and malloc) do.
So while you only need sizeof(T) bytes to store a T, the amount of bytes consumed by new/malloc is more than sizeof(T). This is what Stroustrup is talking about: every dynamic allocation has actual overhead, and that overhead can be substantial if you make lots of small allocations.
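A minimal sketch of the header technique described above, assuming malloc as the underlying store. The names sized_alloc, sized_size and sized_free are made up for illustration, and the header is padded to alignof(std::max_align_t) so the returned pointer stays suitably aligned. Real allocators are more elaborate than this.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Header big enough for a size_t, padded so the payload keeps maximum alignment.
constexpr std::size_t kHeader = alignof(std::max_align_t);
static_assert(sizeof(std::size_t) <= kHeader, "header must hold a size_t");

// Hypothetical helper: allocate `bytes` plus a hidden header recording `bytes`.
void* sized_alloc(std::size_t bytes) {
    void* raw = std::malloc(kHeader + bytes);
    if (raw == nullptr) return nullptr;
    std::memcpy(raw, &bytes, sizeof bytes);     // store the request size up front
    return static_cast<char*>(raw) + kHeader;   // hand out the memory after the header
}

// Recover the recorded size by stepping back over the header.
std::size_t sized_size(const void* p) {
    std::size_t bytes;
    std::memcpy(&bytes, static_cast<const char*>(p) - kHeader, sizeof bytes);
    return bytes;
}

// Release the whole block, header included.
void sized_free(void* p) {
    if (p != nullptr) std::free(static_cast<char*>(p) - kHeader);
}
```

A request for 10 bytes therefore consumes at least 10 + kHeader bytes here, which is the kind of per-allocation overhead Stroustrup describes.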
There are some allocators that don't need that extra room "before" the allocation. For example, a stack-scoped allocator that doesn't delete anything until it goes out of scope. Or one that allocates from stores of fixed-sized blocks and uses a bitfield to describe which are in use.
Here, the accounting information isn't stored adjacent to the data, or we make the accounting information implicit in the code state (scoped allocators).
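As an illustration of the fixed-size-block idea, here is a small sketch in which the bookkeeping lives in one bitmask rather than next to each block. BitmapPool, its block size and its block count are all invented for this example; it is not thread-safe and does no error checking on deallocate.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical pool of 64 blocks of 32 bytes; one bit per block marks it in use.
class BitmapPool {
public:
    static constexpr std::size_t kBlockSize = 32;
    static constexpr std::size_t kBlocks = 64;

    // First-fit: hand out the lowest-numbered free block, or nullptr if full.
    void* allocate() {
        for (std::size_t i = 0; i < kBlocks; ++i) {
            const std::uint64_t bit = std::uint64_t(1) << i;
            if ((in_use_ & bit) == 0) {
                in_use_ |= bit;
                return storage_ + i * kBlockSize;
            }
        }
        return nullptr;
    }

    // The block index is recomputed from the address, so no per-block header is needed.
    void deallocate(void* p) {
        const std::size_t i =
            static_cast<std::size_t>(static_cast<char*>(p) - storage_) / kBlockSize;
        in_use_ &= ~(std::uint64_t(1) << i);
    }

    std::size_t blocks_in_use() const {
        std::size_t count = 0;
        for (std::size_t i = 0; i < kBlocks; ++i)
            if (in_use_ & (std::uint64_t(1) << i)) ++count;
        return count;
    }

private:
    alignas(std::max_align_t) char storage_[kBlockSize * kBlocks];
    std::uint64_t in_use_ = 0;
};
```

The cost is one bit per block instead of a size word per allocation, in exchange for only supporting one block size.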
Now, in the case of arrays, the C++ compiler is free to call operator new[] with an amount of memory requested larger than sizeof(T)*n when T[n] is allocated. This is done by new (not operator new) code generated by the compiler when it asks your overload for memory.
This is traditionally done on types with non-trivial destructors so that the C++ runtime can, when delete[] is called, iterate over each of the items and call .~T() on them. It pulls off a similar trick, where it stuffs n into memory before the array it is using, then does pointer arithmetic to extract it at delete time.
This is not required by the standard, but it is a common technique (clang and gcc both do it at least on some platforms, and I believe MSVC does as well). Some method of calculating the size of the array is needed; this is just one of them.
For something without a destructor (like char) or a trivial one (like struct foo{ ~foo()=default; }), n isn't needed by the runtime, so it doesn't have to store it. So it can say "naw, I won't store it".
Here is a live example.
#include <cstddef>
#include <cstdlib>
#include <iostream>
#include <iterator>

struct foo {
  static void* operator new[](std::size_t sz) {
    std::cout << sz << '/' << sizeof(foo) << '=' << sz/sizeof(foo) << "+ R(" << sz%sizeof(foo) << ")" << '\n';
    return std::malloc(sz);
  }
  static void operator delete[](void* ptr) {
    std::free(ptr);
  }
  virtual ~foo() {}
};
foo* test(std::size_t n) {
  std::cout << n << '\n';
  return new foo[n];
}
int main(int argc, char**argv) {
  foo* f = test( argc+10 );
  // Undefined behavior: peeks at the bytes just before the array to expose the stored count.
  std::cout << *std::prev(reinterpret_cast<std::size_t*>(f)) << '\n';
}
If run with 0 arguments, it prints out 11, 96/8=12+ R(0) and 11.
The first is the number of elements allocated, the second is how much memory is allocated (which adds up to 11 elements' worth, plus 8 bytes -- sizeof(size_t) I suspect), the last is what we happen to find right before the start of the array of 11 elements (a size_t with the value 11).
Accessing memory before the start of the array is naturally undefined behavior, but I did it in order to expose some implementation details in gcc/clang. The point is that they did ask for an extra 8 bytes (as predicted), and they did happen to store the value 11 there (the size of the array).
If you change that 11 to 2, a call to delete[] will actually delete the wrong number of elements.
Other solutions (to store how big the array is) are naturally possible. As an example, if you know you aren't calling an overload of new and you know details of your underlying memory allocation, you could reuse the data it uses to know your block size to determine the number of elements, thus saving an extra size_t of memory. This requires knowing that your underlying allocator won't over-allocate on you, and that it stores the bytes used at a known offset to the data-pointer.
Or, in theory, a compiler could build a separate pointer->size map.
I am unaware of compilers that do either of these, but would be surprised by neither.
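For illustration only, the separate pointer-to-size map could look roughly like the sketch below. The function names are hypothetical, the table is a plain std::unordered_map, and nothing here is thread-safe or reflects what any real compiler does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <unordered_map>

// Hypothetical side table: pointer -> number of bytes requested.
static std::unordered_map<const void*, std::size_t>& size_table() {
    static std::unordered_map<const void*, std::size_t> table;
    return table;
}

// Allocate and remember the size outside the block itself.
void* tracked_alloc(std::size_t bytes) {
    void* p = std::malloc(bytes);
    if (p != nullptr) size_table()[p] = bytes;
    return p;
}

// Look the size up again; returns 0 for pointers the table does not know.
std::size_t tracked_size(const void* p) {
    const auto it = size_table().find(p);
    return it == size_table().end() ? 0 : it->second;
}

// Forget the entry, then release the memory.
void tracked_free(void* p) {
    size_table().erase(p);
    std::free(p);
}
```

The block itself carries no extra bytes in this scheme, but every allocation and deallocation pays for a hash-table operation instead.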
Allowing this technique is what the C++ standard is talking about. For array allocation, the compiler's new (not operator new) code is permitted to ask operator new for extra memory. For non-array allocation, the compiler's new is not permitted to ask operator new for extra memory, it must ask for the exact amount. (I believe there may be exceptions for memory-allocation merging?)
As you can see, the two situations are different.
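The difference can be observed with class-specific allocation functions that record the size they are handed. This is a sketch with an invented type, Tracked. The single-object request equals sizeof(Tracked), while the array request is only guaranteed to be at least n * sizeof(Tracked).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical type whose allocation functions record the requested size.
struct Tracked {
    static std::size_t last_request;
    double payload[2];
    ~Tracked() {}  // non-trivial destructor, so delete[] needs the element count

    static void* operator new(std::size_t sz) { return record_and_allocate(sz); }
    static void* operator new[](std::size_t sz) { return record_and_allocate(sz); }
    static void operator delete(void* p) { std::free(p); }
    static void operator delete[](void* p) { std::free(p); }

private:
    static void* record_and_allocate(std::size_t sz) {
        last_request = sz;
        void* p = std::malloc(sz);
        if (p == nullptr) throw std::bad_alloc();
        return p;
    }
};
std::size_t Tracked::last_request = 0;

// Size passed to operator new for a single object.
std::size_t request_for_single() {
    Tracked* t = new Tracked;
    const std::size_t request = Tracked::last_request;
    delete t;
    return request;
}

// Size passed to operator new[] for an array of n objects.
std::size_t request_for_array(std::size_t n) {
    Tracked* a = new Tracked[n];
    const std::size_t request = Tracked::last_request;
    delete[] a;
    return request;
}
```

On implementations that store a count in front of the array, request_for_array(n) comes out larger than n * sizeof(Tracked) by the size of that count.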

There's no contradiction between these two things. The allocation function gets the size, and almost certainly has to allocate a bit more than that so it knows the size again if the deallocation function is called.
When an array of objects that have a non-trivial destructor is allocated, the implementation needs some way to know how many times to call the destructor when delete[] is called. Implementations are permitted to allocate some extra space along with the array to store this additional information, though not every implementation works this way.

There is no contradiction between the two paragraphs.
The paragraph from the Standard discusses the rules of the first argument being passed to the allocation function.
The paragraph out of Stroustrup doesn't focus on the first argument having type std::size_t but explains the allocation itself, which is "two or more words" bigger than what new indicates, and which every programmer should know.
Stroustrup's explanation is more low level, that's the difference. But there is no contradiction.

The quote from the standard is talking about the value passed to operator new; the quote from Stroustrup is talking about what operator new does with the value. The two are pretty much independent; the requirement is only that the allocator allocate at least as much storage as was requested. Allocators often allocate more space than was requested. What they do with that extra space is up to the implementation; often it's just padding. Note that even if you read the requirements narrowly, that the allocator must allocate the exact number of bytes requested, allocating more is allowed under the "as if" rule, because no portable program can detect how much memory was in fact allocated.

I'm not sure that both talk about the same thing...
It seems that Stroustrup is talking about a more general memory allocation, that inherently uses extra data to manage free/allocated chunks. I think he is not talking about the value of the size passed to new but what really happens at some lower level. He probably would say: when you ask for 10 bytes, the machine will probably use slightly more than 10 bytes. The phrase "using the standard implementation" seems to be important here.
While the standard talks about the value passed to the function.
One talks about implementation while the other does not.

There is no contradiction, because "precisely the object's size" is one possible implementation of "at a minimum, the size of the object".
The number 42 is at least 42.

Related

C++ doesn't tell you the size of a dynamic array. But why?

I know that there is no way in C++ to obtain the size of a dynamically created array, such as:
int* a;
a = new int[n];
What I would like to know is: Why? Did people just forget this in the specification of C++, or is there a technical reason for this?
Isn't the information stored somewhere? After all, the command
delete[] a;
seems to know how much memory it has to release, so it seems to me that delete[] has some way of knowing the size of a.
It's a follow on from the fundamental rule of "don't pay for what you don't need". In your example delete[] a; doesn't need to know the size of the array, because int doesn't have a destructor. If you had written:
std::string* a;
a = new std::string[n];
...
delete [] a;
Then the delete has to call destructors (and needs to know how many to call) - in which case the new has to save that count. However, given it doesn't need to be saved on all occasions, Bjarne decided not to give access to it.
(In hindsight, I think this was a mistake ...)
Even with int of course, something has to know about the size of the allocated memory, but:
Many allocators round up the size to some convenient multiple (say 64 bytes) for alignment and convenience reasons. The allocator knows that a block is 64 bytes long - but it doesn't know whether that is because n was 1 ... or 16.
The C++ run-time library may not have access to the size of the allocated block. If for example, new and delete are using malloc and free under the hood, then the C++ library has no way to know the size of a block returned by malloc. (Usually of course, new and malloc are both part of the same library - but not always.)
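The "only pay when a destructor has to run" decision described above can be expressed with a type trait. This is a sketch of the idea, not the logic of any specific compiler, and needs_element_count is a made-up name.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical helper: true when delete[] would have to run a destructor per
// element and therefore needs to know how many elements there are.
template <class T>
constexpr bool needs_element_count() {
    return !std::is_trivially_destructible<T>::value;
}
```

With this, needs_element_count<int>() is false and needs_element_count<std::string>() is true, matching the int and std::string cases in the answer.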
One fundamental reason is that there is no difference between a pointer to the first element of a dynamically allocated array of T and a pointer to any other T.
Consider a fictitious function that returns the number of elements a pointer points to.
Let's call it "size".
Sounds really nice, right?
If it weren't for the fact that all pointers are created equal:
char* p = new char[10];
size_t ps = size(p+1); // What?
char a[10] = {0};
size_t as = size(a); // Hmm...
size_t bs = size(a + 1); // Wut?
char i = 0;
size_t is = size(&i); // OK?
You could argue that the first should be 9, the second 10, the third 9, and the last 1, but to accomplish this you need to add a "size tag" on every single object.
A char will require 128 bits of storage (because of alignment) on a 64-bit machine. This is sixteen times more than what is necessary.
(Above, the ten-character array a would require at least 168 bytes.)
This may be convenient, but it's also unacceptably expensive.
You could of course envision a version that is only well-defined if the argument really is a pointer to the first element of a dynamic allocation by the default operator new, but this isn't nearly as useful as one might think.
You are right that some part of the system will have to know something about the size. But getting that information is probably not covered by the API of the memory management system (think malloc/free), and the exact size that you requested may not be known, because it may have been rounded up.
You will often find that memory managers will only allocate space in a certain multiple, 64 bytes for example.
So, you may ask for new int[4], i.e. 16 bytes, but the memory manager will allocate 64 bytes for your request. To free this memory it doesn't need to know how much memory you asked for, only that it has allocated you one block of 64 bytes.
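The rounding can be written as a one-line function. This is a sketch assuming a 64-byte granule as in the example above; round_up is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Round `bytes` up to the next multiple of `granule` (granule must be non-zero).
constexpr std::size_t round_up(std::size_t bytes, std::size_t granule) {
    return (bytes + granule - 1) / granule * granule;
}
```

Both round_up(4, 64) and round_up(64, 64) give 64, so once only the block size is kept, the original request can no longer be told apart.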
The next question may be, can it not store the requested size? This is an added overhead which not everybody is prepared to pay for. An Arduino Uno for example only has 2k of RAM, and in that context 4 bytes for each allocation suddenly becomes significant.
If you need that functionality then you have std::vector (or equivalent), or you have higher-level languages. C/C++ was designed to enable you to work with as little overhead as you choose to make use of, this being one example.
There is a curious case of overloading the operator delete that I found in the form of:
void operator delete[](void *p, size_t size);
The parameter size seems to default to the size (in bytes) of the block of memory to which void *p points. If this is true, it is reasonable to at least hope that it has a value passed by the invocation of operator new and, therefore, would merely need to be divided by sizeof(type) to deliver the number of elements stored in the array.
As for the "why" part of your question, Martin's rule of "don't pay for what you don't need" seems the most logical.
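To see what the size parameter actually carries, a class can record the value in both its operator new[] and its two-parameter operator delete[]. This is a sketch with an invented type, Sample. The value handed to delete[] is the size of the whole block, so it may include bookkeeping overhead and is not guaranteed to be exactly n * sizeof(Sample).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical type with a sized, class-specific operator delete[].
struct Sample {
    static std::size_t requested;  // bytes passed to operator new[]
    static std::size_t released;   // bytes passed to operator delete[]
    int value;

    static void* operator new[](std::size_t sz) {
        requested = sz;
        void* p = std::malloc(sz);
        if (p == nullptr) throw std::bad_alloc();
        return p;
    }
    static void operator delete[](void* p, std::size_t sz) {
        released = sz;
        std::free(p);
    }
};
std::size_t Sample::requested = 0;
std::size_t Sample::released = 0;

// Allocate and immediately release an array so both counters get filled in.
void allocate_and_release(std::size_t n) {
    Sample* a = new Sample[n];
    delete[] a;
}
```

Dividing Sample::released by sizeof(Sample) can therefore overestimate the element count on implementations that add a count in front of the array.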
There's no way to know how you are going to use that array.
The allocation size does not necessarily match the element number so you cannot just use the allocation size (even if it was available).
This is a deep flaw in other languages not in C++.
You achieve the functionality you desire with std::vector yet still retain raw access to arrays. Retaining that raw access is critical for any code that actually has to do some work.
Many times you will perform operations on subsets of the array and when you have extra book-keeping built into the language you have to reallocate the sub-arrays and copy the data out to manipulate them with an API that expects a managed array.
Just consider the trite case of sorting the data elements.
If you have managed arrays then you can't use recursion without copying data to create new sub-arrays to pass recursively.
Another example is an FFT which recursively manipulates the data starting with 2x2 "butterflies" and works its way back to the whole array.
To fix the managed array you now need "something else" to patch over this defect and that "something else" is called 'iterators'. (You now have managed arrays but almost never pass them to any functions because you need iterators 90+% of the time.)
The size of an array allocated with new[] is not visibly stored anywhere, so you can't access it. And the new[] operator doesn't return an array, just a pointer to the array's first element. If you want to know the size of a dynamic array, you must store it manually or use classes from libraries such as std::vector.
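Both options can be sketched in a few lines. IntArray is a hypothetical wrapper that stores the count by hand next to the pointer from new[], and std::vector does the same bookkeeping for you.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical wrapper: keeps the element count next to the pointer from new[].
struct IntArray {
    std::unique_ptr<int[]> data;
    std::size_t size;
    explicit IntArray(std::size_t n) : data(new int[n]()), size(n) {}
};

// The count comes from the manually stored member...
std::size_t element_count(const IntArray& a) { return a.size; }

// ...or from the container, which tracks it internally.
std::size_t element_count(const std::vector<int>& v) { return v.size(); }
```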

`std::string` allocations are my current bottleneck - how can I optimize with a custom allocator?

I'm writing a C++14 JSON library as an exercise and to use it in my personal projects.
By using callgrind I've discovered that the current bottleneck during a continuous value creation from string stress test is an std::string dynamic memory allocation. Precisely, the bottleneck is the call to malloc(...) made from std::string::reserve.
I've read that many existing JSON libraries such as rapidjson use custom allocators to avoid malloc(...) calls during string memory allocations.
I tried to analyze rapidjson's source code but the large amount of additional code and comments, plus the fact that I'm not really sure what I'm looking for, didn't help me much.
How do custom allocators help in this situation?
Is a memory buffer preallocated somewhere (where? statically?) and std::strings take available memory from it?
Are strings using custom allocators "compatible" with normal strings?
They have different types. Do they have to be "converted"? (And does that result in a performance hit?)
Code notes:
Str is an alias for std::string.
By default, std::string allocates memory as needed from the same heap as anything that you allocate with malloc or new. To get a performance gain from providing your own custom allocator, you will need to be managing your own "chunk" of memory in such a way that your allocator can deal out the amounts of memory that your strings ask for faster than malloc does. Your memory manager will make relatively few calls to malloc, (or new, depending on your approach) under the hood, requesting "large" amounts of memory at once, then deal out sections of this (these) memory block(s) through the custom allocator. To actually achieve better performance than malloc, your memory manager will usually have to be tuned based on known allocation patterns of your use cases.
This kind of thing often comes down to the age-old trade off of memory use versus execution speed. For example: if you have a known upper bound on your string sizes in practice, you can pull tricks with over-allocating to always accommodate the largest case. While this is wasteful of your memory resources, it can alleviate the performance overhead that more generalized allocation runs into with memory fragmentation, and it makes any calls to realloc essentially constant time for your purposes.
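As a rough sketch of dealing out pieces of one preallocated chunk, the allocator below bumps a pointer through a caller-supplied buffer and never frees individual strings. Arena, ArenaAllocator, ArenaString and make_arena_string are all invented names, the buffer is assumed to be aligned for std::max_align_t, and a real design would need a growth or fallback strategy.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <string>
#include <type_traits>

// Hypothetical bump arena over a fixed buffer; memory is reclaimed all at once.
class Arena {
public:
    Arena(char* buffer, std::size_t capacity) : begin_(buffer), cap_(capacity), used_(0) {}
    void* allocate(std::size_t bytes, std::size_t align) {
        const std::size_t start = (used_ + align - 1) / align * align;
        if (start + bytes > cap_) throw std::bad_alloc();
        used_ = start + bytes;
        return begin_ + start;
    }
    std::size_t used() const { return used_; }
private:
    char* begin_;
    std::size_t cap_;
    std::size_t used_;
};

// Minimal C++11-style allocator that forwards to an Arena.
template <class T>
struct ArenaAllocator {
    using value_type = T;
    Arena* arena;
    explicit ArenaAllocator(Arena& a) noexcept : arena(&a) {}
    template <class U>
    ArenaAllocator(const ArenaAllocator<U>& other) noexcept : arena(other.arena) {}
    T* allocate(std::size_t n) {
        return static_cast<T*>(arena->allocate(n * sizeof(T), alignof(T)));
    }
    void deallocate(T*, std::size_t) noexcept {}  // nothing to do until the arena goes away
};
template <class T, class U>
bool operator==(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return a.arena == b.arena;
}
template <class T, class U>
bool operator!=(const ArenaAllocator<T>& a, const ArenaAllocator<U>& b) noexcept {
    return !(a == b);
}

using ArenaString = std::basic_string<char, std::char_traits<char>, ArenaAllocator<char>>;

// Build a string whose heap storage comes from the arena instead of malloc.
ArenaString make_arena_string(Arena& arena, const char* text) {
    return ArenaString(text, ArenaAllocator<char>(arena));
}
```

ArenaString is a different type from std::string because the allocator is part of the type, so moving data between the two means copying the characters, for example with std::string(s.data(), s.size()).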
@sehe is exactly right. There are many ways.
EDIT:
To finally address your second question, strings using different allocators can play nicely together, and usage should be transparent.
For example:
#include <iostream>
#include <memory>
#include <string>

class myalloc : public std::allocator<char>{};
myalloc customAllocator;
int main(void)
{
    std::string mystring(customAllocator);
    std::string regularString = "test string";
    mystring = regularString;
    std::cout << mystring;
    return 0;
}
This is a fairly silly example and, of course, uses the same workhorse code under the hood. Note that it only works because myalloc derives from std::allocator<char>: the constructor argument is converted (sliced) to std::allocator<char>, so mystring is an ordinary std::string. Implementing a useful allocator that supplies the full interface required by the STL without just disguising the default std::allocator is not as trivial. This seems to be a decent write up covering the concepts involved. In general the allocator is a template parameter of std::basic_string, so a string that really uses a different allocator type is a different type from std::string; the two can still exchange contents by copying characters, for example through iterators or data() and size(). The STL still does fun things with templates (such as rebind and traits) to homogenize allocator interfaces and tracking.
What often helps is the creation of a GlobalStringTable.
See if you can find portions of the old NiMain library from the now defunct NetImmerse software stack. It contains an example implementation.
Lifetime
What is important to note is that this string table needs to be accessible between different DLL spaces, and that it is not a static object. R. Martinho Fernandes already warned that the object needs to be created when the application or DLL thread is created / attached, and disposed of when the thread is destroyed or the DLL is detached, and preferably before any string object is actually used. This sounds easier than it actually is.
Memory allocation
Once you have a single point of access that exports correctly, you can have it allocate a memory buffer up-front. If the memory is not enough, you have to resize it and move the existing strings over. Strings essentially become handles to regions of memory in this buffer.
Placement new
Something that often works well is called the placement new() operator, where you can actually specify where in memory your new string object needs to be allocated. However, instead of allocating, the operator can simply grab the memory location that is passed in as an argument, zero the memory at that location, and return it. You can also keep track of the allocation, the actual size of the string, etc. in the GlobalStringTable object.
SOA
Handling the actual memory scheduling is something that is up to you, but there are many possible ways to approach this. Often, the allocated space is partitioned in several regions so that you have several blocks per possible string size. A block for strings <= 4 bytes, one for <= 8 bytes, and so on. This is called a Small Object Allocator, and can be implemented for any type and buffer.
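The mapping from a requested size to its block class can be sketched as below, assuming power-of-two classes starting at 4 bytes as in the description. The function names are hypothetical and the sketch assumes small request sizes.

```cpp
#include <cassert>
#include <cstddef>

// Smallest power-of-two bucket (minimum 4 bytes) that can hold `bytes`.
inline std::size_t bucket_size(std::size_t bytes) {
    std::size_t bucket = 4;
    while (bucket < bytes) bucket *= 2;
    return bucket;
}

// Index of that bucket: 0 for <= 4 bytes, 1 for <= 8 bytes, 2 for <= 16 bytes, ...
inline std::size_t bucket_index(std::size_t bytes) {
    std::size_t index = 0;
    for (std::size_t bucket = 4; bucket < bytes; bucket *= 2) ++index;
    return index;
}
```

An allocator built on this would keep one free list per index, so a 5-byte string and an 8-byte string share the same pool of 8-byte blocks.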
If you expect many string operations where small strings are incremented repeatedly, you may change your strategy and allocate larger buffers from the start, so that the number of memmove operations is reduced. Or you can opt for a different approach and use string streams for those.
String operations
It is not a bad idea to derive from std::basic_string, so that most of the operations still work but the internal storage is actually in the GlobalStringTable, so that you can keep using the same STL conventions. This way, you also make sure that all the allocations are within a single DLL, so that there can be no heap corruption by linking different kinds of strings between different libraries, since all the allocation operations are essentially in your DLL (and are rerouted to the GlobalStringTable object).
Custom allocators can help because most malloc()/new implementations are designed for maximum flexibility, thread-safety and bullet-proof workings. For instance, they must gracefully handle the case that one thread keeps allocating memory, sending the pointers to another thread that deallocates them. Things like these are difficult to handle in a performant way and drive the cost of malloc() calls.
However, if you know that some things cannot happen in your application (like one thread deallocating stuff another thread allocated, etc.), you can optimize your allocator further than the standard implementation. This can yield significant results, especially when you don't need thread safety.
Also, the standard implementation is not necessarily well optimized: Implementing void* operator new(size_t size) and void operator delete(void* pointer) by simply calling through to malloc() and free() gives an average performance gain of 100 CPU cycles on my machine, which suggests that the default implementation is suboptimal.
I think you'd be best served by reading up on the EASTL
It has a section on allocators and you might find fixed_string useful.
The best way to avoid a memory allocation is don't do it!
BUT if I remember JSON correctly, all the readStr values either get used as keys or as identifiers, so you will have to allocate them eventually. std::string's move semantics should ensure that the allocated arrays are not copied around but reused until their final use. The default NRVO/RVO/move should reduce any copying of the data, if not of the string header itself.
Method 1:
Pass result as a ref from the caller, which has reserved SomeReasonableLargeValue chars, then clear it at the start of readStr. This is only usable if the caller can actually reuse the string.
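A sketch of that calling convention, with read_into as a hypothetical stand-in for readStr and the escape handling left out. In practice clear() keeps the capacity, so repeated calls reuse the same buffer as long as the data fits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the parser's readStr: fills a caller-owned string.
void read_into(std::string& out, const char* begin, const char* end) {
    out.clear();             // drops the contents; implementations keep the capacity
    out.append(begin, end);  // no new allocation while the data fits the capacity
}
```

The caller reserves once, for example scratch.reserve(256), and then passes the same scratch string to every call.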
Method 2:
Use the stack.
// Reserve memory for the string (BOTTLENECK)
if (end - idx < SomeReasonableValue) { // 32?
    char result[SomeReasonableValue] = {0}; // feel free to use std::array if you want bounds checking, but the preceding "if" should ensure it's not a problem.
    int ridx = 0;
    for(; idx < end; ++idx) {
        // Not an escape sequence
        if(!isC('\\')) { result[ridx++] = getC(); continue; }
        // Escape sequence: skip '\'
        ++idx;
        // Convert escape sequence
        result[ridx++] = getEscapeSequence(getC());
    }
    // Skip closing '"'
    ++idx;
    result[ridx] = 0; // 0-terminated.
    // optional assert here to ensure nothing went wrong.
    return result; // the bottleneck might now move here as the data is copied to the receiving string.
}
// fallback code only if the string is long.
// Your original code here
Method 3:
If your string by default can allocate some size to fill its 32/64 byte boundary, you might want to try to use that, construct result like this instead in case the constructor can optimize it.
Str result(end - idx, 0);
Method 4:
Most systems already have some optimized allocator that likes specific block sizes: 16, 32, 64, etc.
siz = ((end - idx)&~0xf)+16; // if the allocator has chunks of 16 bytes already.
Str result(siz, 0); // std::string has no size-only constructor, so pass a fill character as in Method 3
Method 5:
Use either the allocator made by Google or the one made by Facebook as a global new/delete replacement.
To understand how a custom allocator can help you, you need to understand what malloc and the heap do and why they are quite slow in comparison to the stack.
The Stack
The stack is a large block of memory allocated for your current scope. You can think of it as this
([] means a byte of memory)
[P][][][][][][][][][][][][][][][]
(P is a pointer that points to a specific byte of memory, in this case it's pointing at the first byte)
So the stack is a block with only 1 pointer. When you allocate memory, what it does is it performs a pointer arithmetic on P, which takes constant time.
So declaring int i = 0; would mean this,
P + sizeof(int).
[i][i][i][i][P][][][][][][][][][][][],
(i in [] is a block of memory occupied by an integer)
This is blazing fast and as soon as you go out of scope, the entire chunk of memory is emptied simply by moving P back to the first position.
The Heap
The heap allocates memory from a pool of bytes reserved by the C++ runtime. When you call malloc, the heap finds a length of contiguous memory that fits your request, marks it as used so nothing else can use it, and returns it to you as a void*.
So, a theoretical heap with little optimization calling new(sizeof(int)), would do this.
Heap chunk
At first : [][][][][][][][][][][][][][][][][][][][][][][][][]
Allocate 4 bytes (sizeof(int)):
A pointer goes through every byte of memory, finds a run that is of the correct length, and returns a pointer to you.
After : [i][i][i][i][][][][][][][][][][][][][][][][][][][][][]
This is not an accurate representation of the heap, but from this you can already see numerous reasons for being slow relative to the stack.
The heap is required to keep track of all already allocated memory and their respective lengths. In our test case above, the heap was already empty and did not require much, but in worst case scenarios, the heap will be populated with multiple objects with gaps in between (heap fragmentation), and this will be much slower.
The heap is required to cycle through all the bytes to find a run that fits your length.
The heap can suffer from fragmentation since it will never completely clean itself unless you specify it. So if you allocated an int, a char, and another int, your heap would look like this
[i][i][i][i][c][i2][i2][i2][i2]
(i stands for bytes occupied by int and c stands for bytes occupied by a char.) When you de-allocate the char, it will look like this.
[i][i][i][i][empty][i2][i2][i2][i2]
So when you want to allocate another object into the heap,
[i][i][i][i][empty][i2][i2][i2][i2][i3][i3][i3][i3]
unless an object is the size of 1 char, the overall heap size for that allocation is reduced by 1 byte. In more complex programs with millions of allocations and deallocations, the fragmentation issue becomes severe and the program will become unstable.
The heap also has to worry about cases like thread safety (someone else said this already).
Custom Heap/Allocator
So, a custom allocator usually needs to address these problems while providing the benefits of the heap, such as personalized memory management and object permanence.
These are usually accomplished with specialized allocators. If you know you don't need to worry about thread safety, or you know exactly how long your strings will be, or you have a predictable usage pattern, you can make your allocator faster than malloc and new by quite a lot.
For example, if your program requires a lot of allocations as fast as possible without lots of deallocations, you could implement a stack allocator, in which you allocate a huge chunk of memory with malloc at startup,
e.g
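#include <cstddef>

// Super simple example: no alignment handling and no bounds checking.
struct StackAllocator {
    char* stack;    // start of the buffer obtained from the heap
    char* pointer;  // next free byte
    explicit StackAllocator(std::size_t expectedSize)
        : stack(new char[expectedSize]), pointer(stack) {}
    StackAllocator(const StackAllocator&) = delete;
    ~StackAllocator() { delete[] stack; }
    void* allocate(std::size_t size) {
        char* returnedPointer = pointer;
        pointer += size;
        return returnedPointer;
    }
    void empty() { pointer = stack; }  // releases everything at once
};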
Get expected size, get a chunk of memory from the heap.
Assign a pointer to the beginning.
[P][][][][][][][][][] ..... [].
then have one pointer that moves for each allocation. When you no longer need the memory, you simply move the pointer to the beginning of your buffer. This gives you the advantage of O(1) allocations and deallocations as well as object permanence, at the cost of flexible deallocation and a large initial memory requirement.
For strings, you could try a chunk allocator. For every allocation, the allocator gives a set chunk of memory.
Compatibility
Compatibility with other strings is almost guaranteed. As long as you are allocating a contiguous chunk of memory and preventing anything else from using that block of memory, it will work.

What is the "proper" way to allocate variable-sized buffers in C++?

This is very similar to this question, but the answers don't really answer this, so I thought I'd ask again:
Sometimes I interact with functions that return variable-length structures; for example, FSCTL_GET_RETRIEVAL_POINTERS in Windows returns a variably-sized RETRIEVAL_POINTERS_BUFFER structure.
Using malloc/free is discouraged in C++, and so I was wondering:
What is the "proper" way to allocate variable-length buffers in standard C++ (i.e. no Boost, etc.)?
vector<char> is type-unsafe (and doesn't guarantee anything about alignment, if I understand correctly), new doesn't work with custom-sized allocations, and I can't think of a good substitute. Any ideas?
I would use std::vector<char> buffer(n). There's really no such thing as a variably sized structure in C++, so you have to fake it; throw type safety out the window.
If you like malloc()/free(), you can use
char* raw = new char [...appropriate size...];
RETRIEVAL_POINTERS_BUFFER* ptr = reinterpret_cast<RETRIEVAL_POINTERS_BUFFER*>(raw);
... do stuff ...
delete[] raw;
Quotation from the standard regarding alignment (expr.new/10):
For arrays of char and unsigned char, the difference between the
result of the new-expression and the address returned by the
allocation function shall be an integral multiple of the strictest
fundamental alignment requirement (3.11) of any object type whose size
is no greater than the size of the array being created. [ Note:
Because allocation functions are assumed to return pointers to storage
that is appropriately aligned for objects of any type with fundamental
alignment, this constraint on array allocation overhead permits the
common idiom of allocating character arrays into which objects of
other types will later be placed. — end note ]
I don't see any reason why you can't use std::vector<char>:
{
std::vector<char> raii(memory_size);
char* memory = &raii[0];
//Now use `memory` wherever you want
//Maybe, you want to use placement new as:
A *pA = new (memory) A(/*...*/); //assume memory_size >= sizeof(A);
pA->fun();
pA->~A(); //call the destructor, once done!
}//<--- just remember, memory is deallocated here, automatically!
Alright, I understand your alignment problem. It's not that complicated. You can do this:
A *pA = new (&memory[i]) A();
//choose `i` such that `&memory[i]` is multiple of four, or whatever alignment requires
//read the comments..
You may consider using a memory pool and, in the specific case of the RETRIEVAL_POINTERS_BUFFER structure, allocate pool memory amounts in accordance with its definition:
sizeof(DWORD) + sizeof(LARGE_INTEGER)
plus
ExtentCount * sizeof(Extents)
(I am sure you are more familiar with this data structure than I am -- the above is mostly for future readers of your question).
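When computing such a size in code, offsetof accounts for any padding between the header fields and the first record, which a plain sum of sizeof values would miss. The types below are hypothetical stand-ins that only mimic the shape of a header followed by a variable number of records; they are not the Windows definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical record type standing in for one extent entry.
struct Extent {
    std::int64_t next;
    std::int64_t location;
};

// Hypothetical header followed by a variable number of Extent records.
struct VariableBuffer {
    std::uint32_t count;
    std::int64_t start;
    Extent extents[1];  // really `count` entries
};

// Bytes needed for a buffer holding `count` records.
inline std::size_t bytes_needed(std::size_t count) {
    return offsetof(VariableBuffer, extents) + count * sizeof(Extent);
}
```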
A memory pool boils down to "allocate a bunch of memory, then allocate that memory in small pieces using your own fast allocator".
You can build your own memory pool, but it may be worth looking at Boost's memory pool, which is a pure header (no DLLs!) library. Please note that I have not used the Boost memory pool library, but you did ask about Boost so I thought I'd mention it.
std::vector<char> is just fine. Typically you can call your low-level c-function with a zero-size argument, so you know how much is needed. Then you solve your alignment problem: just allocate more than you need, and offset the start pointer:
Say you want the buffer aligned to 4 bytes: allocate the needed size + 4 and advance the start pointer by (4 - (reinterpret_cast<std::uintptr_t>(&my_vect[0]) & 0x3)) & 0x3 bytes.
Then call your C function with the requested size and the offset pointer.
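The over-allocate-and-offset approach can also be written with std::align from <memory>, which does the pointer arithmetic for you. This is only a sketch: aligned_start is a hypothetical helper name, and it assumes the alignment is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical helper: grows `storage` so that it can hold `needed` bytes
// starting at an address aligned to `alignment` (a power of two), and
// returns that aligned address inside the vector's buffer.
inline char* aligned_start(std::vector<char>& storage,
                           std::size_t needed,
                           std::size_t alignment) {
    storage.resize(needed + alignment);  // over-allocate by one alignment unit
    void* p = storage.data();
    std::size_t space = storage.size();
    // std::align moves `p` forward to the next aligned address, if it fits.
    void* result = std::align(alignment, needed, p, space);
    assert(result != nullptr);
    return static_cast<char*>(result);
}
```

The returned pointer stays valid only as long as the vector is not resized or destroyed.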
OK, let's start from the beginning. The ideal way to return a variable-length buffer would be:
MyStruct my_func(int a) { MyStruct s; /* magic here */ return s; }
Unfortunately, this does not work, since sizeof(MyStruct) is calculated at compile time. Anything variable-length simply does not fit inside a buffer whose size is fixed at compile time. Note that this applies to every variable and type in C++, since they all support sizeof. C++ has just one thing that can handle runtime-sized buffers:
MyStruct *ptr = new MyStruct[count];
So anything that solves this problem will necessarily use the array version of new. This includes std::vector and the other solutions proposed earlier. Notice that tricks like placement new into a char array have exactly the same problem with sizeof. Variable-length buffers simply need the heap and arrays. There is no way around that restriction if you want to stay within C++. Further, it requires more than one object! This is important. You cannot make a variable-length object in C++. It's just impossible.
The nearest thing to a variable-length object that C++ provides is "jumping from type to type". Objects do not all need to be of the same type, and you can manipulate objects of different types at runtime. But each part and each complete object still supports sizeof, and their sizes are determined at compile time. The only thing left for the programmer is to choose which type to use.
So what's our solution to the problem? How do you create variable-length objects? std::string provides the answer. It needs to hold more than one character, so it uses the array alternative for heap allocation. But this is all handled by the standard library, and the programmer does not need to care. You then have a class that manipulates those std::strings. std::string can do it because it actually consists of two separate memory areas. sizeof(std::string) describes a memory block whose size can be calculated at compile time, but the actual variable-length data lives in a separate memory block allocated by the array version of new.
The array version of new has some restrictions of its own: sizeof(a[0])==sizeof(a[1]), etc. First allocating an array, and then doing placement new for several objects of different types, gets around this limitation.
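A short sketch of the "fixed-size handle plus separate variable-length data" idea. Packet and make_packet are hypothetical names. Note that many standard library implementations keep very short strings inside the std::string object itself, so the separate block is only certain to exist for longer contents.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical record type: its sizeof is fixed at compile time, while the
// length of `body` is chosen at runtime.
struct Packet {
    int id;
    std::string body;  // fixed-size handle; long contents live elsewhere
};

inline Packet make_packet(int id, std::size_t body_length) {
    Packet p{id, std::string(body_length, 'x')};
    assert(p.body.size() == body_length);
    return p;
}
```

Two packets with bodies of 3 and 100000 characters have the same sizeof; only the storage owned by body differs.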

Is this nested array using stack or heap memory?

Say I have this declaration and use of array nested in a vector
const int MAX_LEN = 1024;
typedef std::tr1::array<char, MAX_LEN> Sentence;
typedef std::vector<Sentence> Paragraph;
Paragraph para(256);
std::vector<Paragraph> book(2000);
I assume that the memory for Sentence is on the stack. Is that right?
What about the memory for vector para? Is that on the stack i.e. should I worry if my para gets too large?
And finally what about the memory for book? That has to be on the heap I guess but the nested arrays are on the stack, aren't they?
Additional questions
Is the memory for Paragraph contiguous?
Is the memory for book contiguous?
There is no stack. Don't think about a stack. What matters is whether a given container class performs any dynamic allocation or not.
std::array<T,N> doesn't use any dynamic allocation; it is a very thin wrapper around an automatically allocated T[N].
Anything you put in a vector will however be allocated by the vector's own allocator, which in the default case (usually) performs dynamic allocation with ::operator new().
So in short, vector<array<char,N>> is very similar to vector<int>: The allocator simply allocates memory for as many units of array<char,N> (or int) as it needs to hold and constructs the elements in that memory. Rinse and repeat for nested vectors.
For your "additional questions": vector<vector<T>> is definitely not contiguous for T at all. It is merely contiguous for vector<T>, but that only contains the small book-keeping part of the inner vector. The actual content of the inner vector is allocated by the inner vector's allocator, and separately for each inner vector. In general, vector<S> is contiguous for the type S, and nothing else.
I'm not actually sure about vector<array<U,N>> -- it might be contiguous for U, because the array has no reason to contain any data besides the contained U[N], but I'm not sure if that's mandatory.
You might want to ask that as a separate question, it's a good question!
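The difference between the two nestings can be checked with a small sketch. inside_outer_buffer is a hypothetical helper; it only tests whether an address falls within the outer vector's own buffer and says nothing about padding inside std::array.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical helper: true when `p` points into the buffer that `outer`
// itself allocated for its elements.
template <typename Outer>
bool inside_outer_buffer(const Outer& outer, const void* p) {
    assert(p != nullptr);
    const char* first = reinterpret_cast<const char*>(outer.data());
    const char* last =
        first + outer.size() * sizeof(typename Outer::value_type);
    const char* q = static_cast<const char*>(p);
    std::less<const char*> lt;  // usable even for unrelated pointers
    return !lt(q, first) && lt(q, last);
}
```

For a vector of std::array the characters of each element sit inside the outer buffer, whereas for a vector of vectors the outer buffer holds only the inner vectors' book-keeping and each inner vector's characters are allocated elsewhere.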
As a side note, it might be helpful to use gdb. It lets you manually examine your memory, including the locations of your variables. You can check yourself precisely what memory you are using.
Your code example:
const int MAX_LEN = 1024;
typedef std::tr1::array<char, MAX_LEN> Sentence;
typedef std::vector<Sentence> Paragraph;
Paragraph para(256);
std::vector<Paragraph> book(2000);
"I assume that the memory for Sentence is on the stack. Is that right?"
No. Whether something is allocated on the stack depends on the declaration context. You have omitted the context, hence nothing can be said. If an object is local and non-static, then you get stack allocation for the object itself, but not necessarily for parts that it refers to internally.
By the way, since another answer here claimed "there is no stack", just disregard that urban legend about what kinds of systems C++ must support. It originally came from a misunderstanding of how a rather unsuccessful, hardware-level optimized computer worked: some people erroneously thought that it didn't have a simple hardware-supported array-like stack implementation. It is quite a stretch from "not simple" to "not there", and even the "not simple" was utterly wrong, not just factually but logically (ultimately a self-contradiction). I.e. it was a not-too-smart beginner's mistake, even though the myth has been propagated by at least one person with some experience. Anyway, C++ guarantees an abstract stack, and on all extant computers that guaranteed abstract stack is implemented in terms of a hardware-assisted array-like simple stack.
"What about the memory for vector para? Is that on the stack"
Again, that depends on the declaration context, which you don't show. And again, even if the object itself is allocated on the stack, parts that it refers to internally will not (in general) be allocated on the stack.
"i.e. should I worry if my para gets too large?"
No, there's no need to worry. A std::vector allocates its buffer dynamically. It's not limited by available stack space.
"And finally what about the memory for book? That has to be on the heap I guess but the nested arrays are on the stack, aren't they?"
No and no.
"Is the memory for Paragraph contiguous?"
No. But the vector's buffer is contiguous. That's because std::array is guaranteed contiguous, and a std::vector's buffer is guaranteed contiguous.
"Is the memory for book contiguous?"
No.

Why is it not possible to access the size of a new[]'d array?

When you allocate an array using new [], why can't you find out the size of that array from the pointer? It must be known at run time, otherwise delete [] wouldn't know how much memory to free.
Unless I'm missing something?
In a typical implementation the size of dynamic memory block is somehow stored in the block itself - this is true. But there's no standard way to access this information. (Implementations may provide implementation-specific ways to access it). This is how it is with malloc/free, this is how it is with new[]/delete[].
In fact, in a typical implementation raw memory allocations for new[]/delete[] calls are eventually processed by some implementation-specific malloc/free-like pair, which means that delete[] doesn't really have to care about how much memory to deallocate: it simply calls that internal free (or whatever it is named), which takes care of that.
What delete[] does need to know, though, is how many elements to destruct when the array element type has a non-trivial destructor. And this is what your question is about: the number of array elements, not the size of the block (these two are not the same; the block could be larger than really required for the array itself). For this reason, the number of elements in the array is normally also stored inside the block by new[] and later retrieved by delete[] to perform the proper array element destruction. There are no standard ways to access this number either.
(This means that in the general case, a typical memory block allocated by new[] will independently and simultaneously store both the physical block size in bytes and the array element count. These values are stored by different levels of the C++ memory allocation mechanism, the raw memory allocator and new[] itself respectively, and don't interact with each other in any way.)
However, note that for the above reasons the array element count is normally only stored when the array element type has a non-trivial destructor. I.e. this count is not always present. This is one of the reasons why providing a standard way to access that data is not feasible: you'd either have to store it always (which wastes memory) or restrict its availability by destructor type (which is confusing).
To illustrate the above, when you create an array of ints
int *array = new int[100];
the size of the array (i.e. 100) is not normally stored by new[] since delete[] does not care about it (int has no destructor). The physical size of the block in bytes (like, 400 bytes or more) is normally stored in the block by the raw memory allocator (and used by raw memory deallocator invoked by delete[]), but it can easily turn out to be 420 for some implementation-specific reason. So, this size is basically useless for you, since you won't be able to derive the exact original array size from it.
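You can observe the extra space that new[] requests for the element count with a class-specific operator new[]. This is only a probe sketch: Probe and array_overhead are hypothetical names, and the amount of overhead (possibly zero) depends on the implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical probe type: records how many bytes new[] asks for.
struct Probe {
    static std::size_t last_request;
    int value;
    ~Probe() {}  // non-trivial destructor, so delete[] needs the element count

    static void* operator new[](std::size_t bytes) {
        last_request = bytes;
        return ::operator new[](bytes);
    }
    static void operator delete[](void* p) noexcept {
        ::operator delete[](p);
    }
};
std::size_t Probe::last_request = 0;

// Returns how many bytes new[] requested beyond count * sizeof(Probe).
inline std::size_t array_overhead(std::size_t count) {
    Probe* p = new Probe[count];
    // The standard guarantees the request is at least the array's own size.
    assert(Probe::last_request >= count * sizeof(Probe));
    const std::size_t extra = Probe::last_request - count * sizeof(Probe);
    delete[] p;
    return extra;
}
```

On many implementations array_overhead(10) returns the size of one std::size_t here, and removing the user-provided destructor makes it drop to zero, but neither result is required by the standard.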
You most likely can access it, but it would require intimate knowledge of your allocator and would not be portable. The C++ standard doesn't specify how implementations store this data, so there's no consistent method for obtaining it. I believe it's left unspecified because different allocators may wish to store it in different ways for efficiency purposes.
It makes sense, as for example the size of the allocated block may not necessarily be the same size as the array. While it is true that new[] may store the number of elements (for calling each element's destructor), it doesn't have to, as it wouldn't be required for an empty destructor. There is also no standard way (C++ FAQ Lite 1, C++ FAQ Lite 2) of implementing where new[] stores the array length, as each method has its pros and cons.
In other words, it allows allocations to be as fast and cheap as possible by not specifying anything about the implementation. (If the implementation has to store the size of the array as well as the size of the allocated block every time, it wastes memory that you may not need).
Simply put, the C++ standard does not require support for this. It is possible that if you know enough about the internals of your compiler, you can figure out how to access this information, but that would generally be considered bad practice. Note that there may be a difference in memory layout for heap-allocated arrays and stack-allocated arrays.
Remember that essentially what you are talking about here are C-style arrays, too -- even though new and delete are C++ operators -- and the behavior is inherited from C. If you want a C++ "array" that is sized, you should be using the STL (e.g. std::vector, std::deque).
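For comparison, a sized container carries its element count with it, so no separate length has to be passed around. A minimal sketch (make_ints and sum are hypothetical helper names):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: creates `count` zero-initialized ints.
inline std::vector<int> make_ints(std::size_t count) {
    std::vector<int> v(count);
    assert(v.size() == count);  // the container knows its own size
    return v;
}

// Sums the elements without needing a separate length argument.
inline long sum(const std::vector<int>& values) {
    long total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) total += values[i];
    return total;
}
```

With a raw pointer from new int[100], the caller would have to remember the 100 separately.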