Accessing overallocated memory in C++ - c++

I have a huge tree that can take up to several gigabytes. The node structure is as below. You'll notice that I made the last member an array of size 1. The reason for this is that I can over-allocate a Node with flexible sizes, similar to what C natively supports as a flexible array member. I could use std::unique_ptr<T[]> or std::vector<T> instead, but the problem is that this means double dynamic allocation, double indirection, and extra cache misses per tree node. In my last test, this made my program take about 50% more time, which is simply not acceptable for my application.
template<typename T>
class Node
{
public:
    Node<T> *parent;
    Node<T> *child;
    /* ... */
    T &operator[](int);
private:
    int size;
    T array[1];
};
The simplest way to implement operator[] would be this.
template<typename T>
T &Node<T>::operator[](int n)
{
    return array[n];
}
It should work fine in most sane C++ implementations. But since the C++ standard allows insane implementations that do array bounds checking, as far as I know this technically invokes undefined behaviour. Then can I do this?
template<typename T>
T &Node<T>::operator[](int n)
{
    return (&array[0])[n];
}
I'm a little confused here. The [] operator for primitive types is just syntactic sugar for *. Thus (&array[0])[n] is equivalent to (&*(array + 0))[n], which I think can be cleaned up as array[n], making everything the same as the first one. Okay, but I can still do this.
template<typename T>
T &Node<T>::operator[](int n)
{
    return *(reinterpret_cast<T *>(reinterpret_cast<char *>(this) + offsetof(Node<T>, array)) + n);
}
I hope I'm now free from the possible undefined behaviours. Perhaps inline assembly will show my intent better. But do I really have to go this far? Can someone clarify things to me?
By the way T is always a POD type. The whole Node is also POD.

First of all, an implementation is free to reorder class members in all but trivial cases. Your case is not trivial because it has access specifiers. Unless you make your class POD, or standard-layout as C++11 calls it, you are not guaranteed that your array is actually laid out last.
Then of course flexible members do not exist in C++.
All is not lost however. Allocate a chunk of memory large enough to house both your class and your array, then placement-new your class at the beginning, and interpret the portion that comes after the object (plus any padding needed to ensure proper alignment) as the array.
If you have this, then the array can be accessed with
reinterpret_cast<T*>(
    reinterpret_cast<char*>(this) + sizeof(*this) + padding)
where padding is chosen such that alignof(T) divides sizeof(*this) + padding.
For inspiration, look at std::make_shared. It also packs two objects into one allocated block of memory.
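As a minimal sketch of this layout (the names NodeHdr, make_node, and node_array are illustrative, not from the question; the element type is fixed to int for brevity):

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Illustrative header type standing in for the Node<T> members.
struct NodeHdr {
    NodeHdr* parent;
    int size;
};

// Offset of the array: sizeof(NodeHdr) rounded up to alignof(int).
static std::size_t array_offset() {
    return (sizeof(NodeHdr) + alignof(int) - 1) / alignof(int) * alignof(int);
}

// One allocation houses the header and n ints; the header is
// placement-new'd at the front of the block.
NodeHdr* make_node(int n) {
    void* block = std::malloc(array_offset() + n * sizeof(int));
    return new (block) NodeHdr{nullptr, n};
}

// The array is the suitably aligned region just past the header.
int* node_array(NodeHdr* node) {
    return reinterpret_cast<int*>(
        reinterpret_cast<char*>(node) + array_offset());
}
```

On common platforms sizeof(NodeHdr) is already a multiple of alignof(int), so the padding usually works out to zero; the rounding handles the general case.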

The main problem with "out of bounds" array access is that no object lives there. It's not the out of bounds index itself which causes the problem.
Now in your case there presumably is raw memory at the intended location. That means you can in fact create a POD object there via assignment. Any subsequent read access will find the object there.
The root cause is that C didn't really have array bounds. a[n] is just *(a+n), by definition. So the first two proposed forms are already identical.
I'd be slightly more worried about any padding behind T array[1], which you'd be accessing as part of array[1].

You also wondered if there was an alternative approach. Given your recent comment about "no reallocation", I'd store the array data as a pointer to heap-allocated storage, but:
Trees have predictable access patterns, from root to child. Therefore, I'd give Node an operator new and make sure that child nodes are allocated directly after their parent. This gives you locality of reference when walking the tree. Secondly, I'd have another allocator for the array data, and make it return contiguous memory for the parent's array and that of its first child (followed by its first grandchild, of course).
The result is that the node and its array have no locality of reference between them, but instead you get locality of reference both for the tree graph and the associated array data.
It's quite possible that the array data allocator can be a trivial pool allocator for the tree. Just allocate 256 KB chunks at a time, and parcel them out a few ints at a time. The whole state you need to track is how much of that 256 KB you've already allocated. This is much faster than anything std::vector<T> can achieve, because the vector cannot know that the memory lives for as long as the tree does.
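A sketch of such a trivial pool allocator, under the assumptions stated above (fixed 256 KB chunks, all memory released together when the pool dies; ChunkPool is an illustrative name, and requests larger than one chunk are not handled):

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Bump allocator: hands out pieces of 256 KB chunks. The only state
// tracked is how much of the current chunk has been used. Each request
// is rounded up to alignof(std::max_align_t) to keep results aligned.
class ChunkPool {
    static const std::size_t kChunk = 256 * 1024;
    std::vector<char*> chunks_;
    std::size_t used_ = kChunk;   // forces a fresh chunk on first use
public:
    void* allocate(std::size_t n) {
        const std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) / a * a;
        if (used_ + n > kChunk) {             // assumes n <= kChunk
            chunks_.push_back(static_cast<char*>(std::malloc(kChunk)));
            used_ = 0;
        }
        void* p = chunks_.back() + used_;
        used_ += n;
        return p;
    }
    ~ChunkPool() {
        for (char* c : chunks_) std::free(c); // everything freed at once
    }
};
```

There is no per-allocation free at all, which is exactly why it beats a general-purpose allocator when every allocation's lifetime is tied to the tree's.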

Related

Template non-type parameters and allocating array memory

There is an example class in a book I am reading used throughout to explain concepts:
class Gameboard{
public:
    Gameboard(int inWidth, int inHeight);
    Gameboard(const Gameboard& src);
    Gameboard& operator=(const Gameboard& rhs);
private:
    GamePiece** mCells;
    size_t width;
    size_t height;
};
they then introduce templates and introduce an updated class:
template<typename T>
class Grid{
public:
    Grid(int inWidth, int inHeight);
    Grid(const Grid<T>& src);
    Grid<T>& operator=(const Grid<T>& rhs);
private:
    T** mCells;
    size_t width;
    size_t height;
};
finally they introduce non-type template parameters and say you can now do this:
template<typename T, size_t WIDTH, size_t HEIGHT>
class Grid{
public:
    Grid();
    Grid(const Grid& src);
    Grid& operator=(const Grid& rhs);
private:
    T mCells[WIDTH][HEIGHT];
};
From the book:
In the Grid template class, you could use non-type template parameters to specify the height and width of the grid instead of specifying them in the constructor. The principal advantage to specifying non-type parameters in the template list instead of the constructor is that the values are known before the code is compiled. Recall that the compiler generates code for templatized methods by substituting in the template parameters before compiling. Thus you can use a normal two-dimensional array in your implementation instead of dynamically allocating it.
I don't get all the excitement with this approach regarding the dynamic memory allocation. Firstly, does this mean the multidimensional array would be on the stack (because they seem to suggest it wouldn't be dynamically allocated)? I don't understand why you wouldn't want to dynamically allocate the memory on the heap?
Secondly, is there some C++ rule (which I am forgetting) which prohibits declaring a multidimensional array on the stack, hence the excitement with this approach?
I am trying to understand what is the advantage of using non-type template parameters in their example.
Firstly does this mean the multidimensional array would be on the stack (because they seem to suggest it wouldnt be dynamically allocated)?
As you can see, your array is directly a member, it's not indirected by a pointer.
T mCells[WIDTH][HEIGHT];
But it wouldn't be correct to say that it's on the stack or the heap since in fact the array is a part of your object and where it is depends on where and how your object is allocated. If the object is allocated on the heap, its array subobject will be too, if the whole object is on the stack, so will the array be.
I don't understand why you wouldn't want to dynamically allocate the memory on the heap?
You could. But having a new'd array is both slower and more error-prone (i.e. it should be deleted etc).
Secondly, is there some C++ rule (which I am forgetting) which prohibits declaring a multidimensional array on the stack, hence the excitement with this approach?
No. But the size of any array must be a compile-time constant. Non-type template parameters are exactly that - compile time constants. Had they been simply function parameters of an int type, the compiler would not know their value and it would be illegal to create an array (multidimensional or otherwise) on the stack.
"Firstly does this mean the multidimensional array would be on the stack"
If Gameboard is located on the stack (ie it's a local variable) then yes, the array is also on the stack.
"I don't understand why you wouldn't want to dynamically allocate the memory on the heap?"
For speed. In this case, as Gameboard will probably stick around for a long time, it's not necessary. Actually, std::vector<std::vector<GamePiece>> mCells would be better than manual arrays.
"is there some C++ rule (which I am forgetting) which prohibits declaring a multidimensional array on the stack"
No, it's permissible. But like normal arrays, the dimensions must be known at compile-time.
"I am trying to understand what is the advantage of using non-type template parameters in their example."
It is a contrived example.
Consider you want to create an arbitrary-precision integer class. The higher the precision, the more space is required. This class may be created and destroyed frequently, just like a regular integer variable. If the data is allocated dynamically, this is slower. By specifying the required precision with a template parameter, a regular array of the required size can be created as a member of the class. Whenever the class is placed on the stack, so will the array, and this will be faster.
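A sketch of the idea the answers above describe: because the dimensions are compile-time constants, the cells are a plain member array, so constructing the object involves no allocator call at all (the `at` accessor is added here for illustration; the book's example only declares the member):

```cpp
#include <cstddef>

// The dimensions are non-type template parameters, so the storage is a
// plain member array: no new/delete, and the data lives wherever the
// Grid object itself lives (stack, heap, or static storage).
template <typename T, std::size_t WIDTH, std::size_t HEIGHT>
class Grid {
public:
    T&       at(std::size_t x, std::size_t y)       { return cells_[x][y]; }
    const T& at(std::size_t x, std::size_t y) const { return cells_[x][y]; }
private:
    T cells_[WIDTH][HEIGHT] = {};  // zero-initialized member array
};
```

Declaring `Grid<int, 8, 8> board;` as a local variable places all 64 ints directly on the stack, with no heap traffic and no pointer indirection when indexing.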
Without referring to the particular samples and sources you give:
I don't get all the excitement with this approach regarding the dynamic memory allocation
Because no dynamic allocation is needed for such a grid at all, even if you're creating instances on the stack.
I don't understand why you wouldn't want to dynamically allocate the memory on the heap?
I'm often working on small embedded systems, where I sometimes don't even have the possibility of dynamic memory management (or don't want to bear the overhead). For such systems, where you know very well in advance which sizes you can or want to have, it's a pretty good configuration abstraction for target-specific platform implementations.
Also besides the above reasons, dynamic allocation introduces a performance hit at runtime that is unwanted for performance critical applications (e.g. game framework rendering engines).
In short and general:
If you have anything that can be certainly configured at compile time, prefer this over configuring at run time.

need a std::vector with O(1) erase

I was surprised to find out that vector::erase moves elements on erase. I thought it would swap the last element with the "to-be-deleted" element and reduce the size by one. My first reaction was: "let's extend std::vector and override erase()". But I found in many threads, like "Is there any real risk to deriving from the C++ STL containers?", that it can cause memory leaks. But I am not adding any new data member to vector, so there is no additional memory to be freed. Is there still a risk?
Some suggest that we should prefer composition over inheritance. I can't make sense of this advice in this context. Why should I waste my time on the "mechanical" task of wrapping every function of the otherwise wonderful std::vector class? Inheritance indeed makes the most sense for this task - or am I missing something?
Why not just write a standalone function that does what you want:
template<typename T>
void fast_erase(std::vector<T>& v, size_t i)
{
    v[i] = std::move(v.back());
    v.pop_back();
}
All credit to Seth Carnegie though. I originally used "std::swap".
Delicate issue. The first guideline you're breaking is: "Inheritance is not for code reuse". The second is: "Don't inherit from standard library containers".
But: If you can guarantee, that nobody will ever use your unordered_vector<T> as a vector<T> you're good. However, if somebody does, the results may be undefined and/or horrible, regardless of how many members you have (it may seem to work perfectly but nevertheless be undefined behaviour!).
You could use private inheritance, but that would not free you from writing wrappers or pulling member functions in with lots of using statements, which would almost be as much code as composition (a bit less, though).
Edit: What I mean with using statements is this:
class Base {
public:
    void dosmth();
};
class Derived : private Base {
public:
    using Base::dosmth;
};
class Composed {
private:
    Base base;
public:
    void dosmth() { return base.dosmth(); }
};
You could do this with all member functions of std::vector. As you can see Derived is significantly less code than Composed.
The risk of inheritance is in the following example:
std::vector<something> *v = new better_vector<something>();
delete v;
That would cause problems because you deleted a pointer to a base class with no virtual destructor.
However if you always delete a pointer to your class like:
better_vector<something> *v = new better_vector<something>();
delete v;
Or don't allocate it on the heap and there is no danger. (Note that the parent class destructor is invoked automatically after yours runs; you don't call it yourself.)
I thought it would swap the last element with the "to-be-deleted" element and reduce the size by one.
vector::erase maintains the order of elements, which swapping the last element into the erased slot and reducing the size by one does not. Since vector implements an array, there is no O(1) way to both maintain the order of elements and erase at the same time (unless you remove the last element).
If maintaining the order of elements is not important, then your solution is fine; otherwise, you had better use other containers (for example std::list, which implements a doubly-linked list).

is there any way to avoid the copy from and to between the valarray and array?

I have a lot of data in a list, say several kilobytes in each element, and I would like to extract the elements one by one for some numeric processing. These data are originally stored as float[]. Since the processing involves a lot of indexing and global calculation, I think valarray might be easy to program with. But if I use valarray, I may have to copy from the array to the valarray first, and then copy back to the array. Is there any way to avoid this? Any way to let me work directly on the arrays? Or do you have better ways to solve similar problems?
The valarray type does not provide any way to use an existing array for its data store; it always makes a copy for itself. Instead of storing your data in an ordinary array, store the values directly in the valarray from the start. Call v.resize to set the size, and either assign values into it with the [] operator, or use &v[0] to get a pointer to the first value and use it as you would an iterator or buffer pointer — elements of a valarray are stored contiguously in memory.
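For example, working in the valarray from the start and handing &v[0] to any code that expects a raw float* (fill_buffer here is a hypothetical stand-in for the asker's existing routines):

```cpp
#include <cstddef>
#include <valarray>

// Hypothetical legacy routine that fills a raw float buffer in place.
void fill_buffer(float* p, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) p[i] = static_cast<float>(i);
}

// Store the data in the valarray from the start: no copy in, no copy out.
float demo() {
    std::valarray<float> v(1024);   // 1024 zero-initialized floats
    fill_buffer(&v[0], v.size());   // elements are stored contiguously
    return v.sum();                 // valarray-style global calculation
}
```

Because valarray elements are guaranteed contiguous, &v[0] behaves like the original float[] for the lifetime of the valarray, as long as it is not resized in between.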
Warning: ugly hack.
On my system (MS Visual Studio) the valarray class is defined like this:
template<class _Ty>
class valarray
{
    // ...
private:
    _Ty *_Myptr;    // current storage reserved for array
    size_t _Mysize; // current length of sequence
    size_t _Myres;  // length of array
};
So I can build my own class that has the same layout (with a good level of confidence):
struct my_valarray_hack
{
    void *_Myptr;
    size_t num_of_elements;
    size_t size_of_buffer;
};
Then create an empty valarray and overwrite its internal variables so it points to your data.
void do_stuff(float my_data[], size_t size)
{
    valarray<float> my_valarray;
    my_valarray_hack hack = {my_data, size, size};
    my_valarray_hack cleanup;
    assert(sizeof(my_valarray) == sizeof(hack));
    // Save the contents of the object that we are fiddling with
    memcpy(&cleanup, &my_valarray, sizeof(cleanup));
    // Overwrite the object so it points to our array
    memcpy(&my_valarray, &hack, sizeof(hack));
    // Do calculations
    // ...
    // Do cleanup (otherwise, it will crash)
    memcpy(&my_valarray, &cleanup, sizeof(cleanup));
    // Destructor is silently invoked here
}
This is not a recommended way of doing things; you should consider it only if you have no other way to implement what you want (maybe not even then). Possible reasons why it could fail:
Layout of valarray may be different in another mode of compilation (examples of modes: debug/release; different platforms; different versions of Standard Library)
If your calculations resize the valarray in any manner, it will try to reallocate your buffer and crash
If the implementation of valarray assumes its buffer has e.g. 16-byte alignment, it may crash, do wrong calculations or just work slowly (depending on your platform)
(I am sure there are some more reasons for it not to work)
Anyway, it's described as "undefined behavior" by the Standard, so strictly speaking anything may happen if you use this solution.

Why is a variable length array sometimes declared as a length-one array and not as a pointer?

I see this in code sometimes:
struct S
{
    int count;   // length of array in data
    int data[1];
};
Where the storage for S is allocated bigger than sizeof(S) so that data can have more space for its array. It is then used like:
S *s;
// allocation
s->data[3] = 1337;
My question is, why is data not a pointer? Why the length-1 array?
If you declare data as a pointer, you'll have to allocate a separate memory block for the data array, i.e. you'll have to make two allocations instead of one. While there won't be much difference in the actual functionality, it still might have some negative performance impact. It might increase memory fragmentation. It might result in struct memory being allocated "far away" from the data array memory, resulting in the poor cache behavior of the data structure. If you use your own memory management routines, like pooled allocators, you'll have to set up two allocators: one for the struct and one for the array.
By using the above technique (known as "struct hack") you allocate memory for the entire struct (including data array) in one block, with one call to malloc (or to your own allocator). This is what it is used for. Among other things it ensures that struct memory is located as close to the array memory as possible (i.e. it is just one continuous block), so the cache behavior of the data structure is optimal.
Raymond Chen wrote an excellent article on precisely why variable length structures chose this pattern over many others (including pointers).
http://blogs.msdn.com/b/oldnewthing/archive/2004/08/26/220873.aspx
He doesn't directly comment on why a pointer was chosen over an array but Steve Dispensa provides some insight in the comments section.
From Steve
typedef struct _TOKEN_GROUPS {
DWORD GroupCount;
SID_AND_ATTRIBUTES *Groups;
} TOKEN_GROUPS, *PTOKEN_GROUPS;
This would still force Groups to be pointer-aligned, but it's much less convenient when you think of argument marshalling.
In driver development, developers are sometimes faced with sending arguments from user-mode to kernel-mode via a METHOD_BUFFERED IOCTL. Structures with embedded pointers like this one represent anything from a security flaw waiting to happen to simply a PITA.
It's done to make it easier to manage the fact that the array is sequential in memory (within the struct). Otherwise, after allocating more than sizeof(S), you would have to point data at the extra memory yourself.
Because it lets you have code do this:
struct S
{
    int count;   // length of array in data
    int data[1];
};

struct S *foo;
foo = malloc(sizeof(struct S) + (len - 1) * sizeof(int));
memcpy(foo->data, buf, len * sizeof(int));
Which only requires one call to malloc and one call to free.
This is common enough that the C99 standard allows you to not even specify a length for the array. It's called a flexible array member.
From ISO/IEC 9899:1999, Section 6.7.2.1, paragraph 16: "As a special case, the last element of a structure with more than one named member may have an incomplete array type; this is called a flexible array member."
struct S
{
    int count;   // length of array in data
    int data[];
};
And gcc has allowed 0 length array members as the last members of structs as an extension for a while.
Because of different copy semantics. If it is a pointer inside, then the contents have to explicitly copied. If it is a C-style array inside, then the copy is automatic.
Incidentally, I don't think there's any guarantee that using a length-one array as something longer is going to work. A compiler would be free to generate effective-address code that relies upon the subscript being no larger than the specified bound. For example, if an array bound is specified as one, a compiler could generate code that always accesses the first element; if it's two, on some platforms an optimizing compiler might turn a[i] into ((i & 1) ? a[1] : a[0]). Note that while I'm unaware of any compilers that actually do that transform, I am aware of platforms where it would be more efficient than computing an array subscript.
I think a standards-compliant approach would be to declare the array as [MAX_SIZE] and allocate sizeof(struct S)-(MAX_SIZE-len)*sizeof(int) bytes.
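A sketch of that arithmetic (MAX_SIZE and make_s are illustrative names; the array is declared at full size, but only enough bytes for len elements are actually allocated):

```cpp
#include <cstddef>
#include <cstdlib>

enum { MAX_SIZE = 1024 };

struct S {
    int count;
    int data[MAX_SIZE];   // declared at full size, never fully allocated
};

// Allocate only enough bytes for len elements: the unused tail of
// data is trimmed off the end of the block.
struct S* make_s(int len) {
    std::size_t bytes = sizeof(struct S) - (MAX_SIZE - len) * sizeof(int);
    struct S* s = static_cast<struct S*>(std::malloc(bytes));
    s->count = len;
    return s;
}
```

The subscripts used at runtime then stay within the declared bound, which is the property the preceding paragraph is after; the trade-off is that len can never exceed MAX_SIZE.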

What is the Performance, Safety, and Alignment of a Data member hidden in an embedded char array in a C++ Class?

I have seen a codebase recently that I fear is violating alignment constraints. I've scrubbed it to produce a minimal example, given below. Briefly, the players are:
Pool. This is a class which allocates memory efficiently, for some definition of 'efficient'. Pool is guaranteed to return a chunk of memory that is aligned for the requested size.
Obj_list. This class stores homogeneous collections of objects. Once the number of objects exceeds a certain threshold, it changes its internal representation from a list to a tree. The size of Obj_list is one pointer (8 bytes on a 64-bit platform). Its populated store will of course exceed that.
Aggregate. This class represents a very common object in the system. Its history goes back to the early 32-bit workstation era, and it was 'optimized' (in that same 32-bit era) to use as little space as possible as a result. Aggregates can be empty, or manage an arbitrary number of objects.
In this example, Aggregate items are always allocated from Pools, so they are always aligned. The only occurrences of Obj_list in this example are the 'hidden' members in Aggregate objects, and therefore they are always allocated using placement new. Here are the support classes:
class Pool
{
public:
    Pool();
    virtual ~Pool();
    void *allocate(size_t size);
    static Pool *default_pool(); // returns a global pool
};
class Obj_list
{
public:
    inline void *operator new(size_t s, void *p) { return p; }

    Obj_list(const Args *args);
    // when constructed, Obj_list will allocate representation_p, which
    // can take up much more space.
    ~Obj_list();
private:
    Obj_list_store *representation_p;
};
And here is Aggregate. Note that member declaration member_list_store_d:
// Aggregate is derived from Lesser, which is twelve bytes in size
class Aggregate : public Lesser
{
public:
    inline void *operator new(size_t s) {
        return Pool::default_pool()->allocate(s);
    }
    inline void *operator new(size_t s, Pool *h) {
        return h->allocate(s);
    }
public:
    Aggregate(const Args *args = NULL);
    virtual ~Aggregate() {};

    inline const Obj_list *member_list_store_p() const;
protected:
    char member_list_store_d[sizeof(Obj_list)];
};
It is that data member that I'm most concerned about. Here is the pseudocode for initialization and access:
Aggregate::Aggregate(const Args *args)
{
    if (args) {
        new (static_cast<void *>(member_list_store_d)) Obj_list(args);
    }
    else {
        zero_out(member_list_store_d);
    }
}

inline const Obj_list *Aggregate::member_list_store_p() const
{
    return initialized(member_list_store_d) ? (Obj_list *) &member_list_store_d : 0;
}
You may be tempted to suggest that we replace the char array with a pointer to the Obj_list type, initialized to NULL or an instance of the class. This gives the proper semantics, but just shifts the memory cost around. If memory were still at a premium (and it might be, this is an EDA database representation), replacing the char array with a pointer to an Obj_list would cost one more pointer in the case when Aggregate objects do have members.
Besides that, I don't really want to get distracted from the main question here, which is alignment. I think the above construct is problematic, but can't really find more in the standard than some vague discussion of the alignment behavior of the 'system/library' new.
So, does the above construct do anything more than cause an occasional pipe stall?
Edit: I realize that there are ways to replace the approach using the embedded char array. So did the original architects. They discarded them because memory was at a premium. Now, if I have a reason to touch that code, I'll probably change it.
However, my question, about the alignment issues inherent in this approach, is what I hope people will address. Thanks!
Ok - had a chance to read it properly. You have an alignment problem, and invoke undefined behaviour when you access the char array as an Obj_list. Most likely your platform will do one of three things: let you get away with it, let you get away with it at a runtime penalty or occasionally crash with a bus error.
Your portable options to fix this are:
allocate the storage with malloc or a global allocation function, but you think this is too expensive.
as Arkadiy says, make your buffer an Obj_list member:
Obj_list list;
but you now don't want to pay the cost of construction. You could mitigate this by providing an inline do-nothing constructor to be used only to create this instance - as posted the default constructor would do. If you follow this route, strongly consider invoking the dtor
list.~Obj_list();
before doing a placement new into this storage.
Otherwise, I think you are left with non portable options: either rely on your platform's tolerance of misaligned accesses, or else use any nonportable options your compiler gives you.
Disclaimer: It's entirely possible I'm missing a trick with unions or some such. It's an unusual problem.
The alignment will be picked by the compiler according to its defaults; this will probably end up as four bytes under GCC/MSVC.
This should only be a problem if there is code (SIMD/DMA) that requires a specific alignment. In this case you should be able to use compiler directives to ensure that member_list_store_d is aligned, or increase the size by (alignment-1) and use an appropriate offset.
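Since C++11, alignas is the portable form of such a compiler directive. A minimal sketch (this Obj_list is a one-member stand-in for the question's class, and the accessor name is illustrative):

```cpp
#include <new>

struct Obj_list {                 // stand-in for the question's class
    void* representation_p;
};

class Aggregate {
public:
    Aggregate() {
        // Placement-new into the buffer; well-defined because the
        // buffer is aligned for Obj_list.
        new (member_list_store_d) Obj_list{nullptr};
    }
    Obj_list* list() {
        return reinterpret_cast<Obj_list*>(member_list_store_d);
    }
private:
    // alignas guarantees the char buffer has Obj_list's alignment.
    alignas(Obj_list) char member_list_store_d[sizeof(Obj_list)];
};
```

(Strictly, C++17 would want the pointer returned from list() passed through std::launder; the point here is only that alignas removes the misalignment hazard the answer describes.)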
Can you simply have an instance of Obj_list inside Aggregate? IOW, something along the lines of
class Aggregate : public Lesser
{
    // ...
protected:
    Obj_list list;
};
I must be missing something, but I can't figure why this is bad.
As to your question - it's perfectly compiler-dependent. Most compilers, though, will align every member at word boundary by default, even if the member's type does not need to be aligned that way for correct access.
If you want to ensure alignment of your structures, just do a
// MSVC
#pragma pack(push,1)
// structure definitions
#pragma pack(pop)

// *nix
struct YourStruct
{
    // ...
} __attribute__((packed));
To ensure 1 byte alignment of your char array in Aggregate
Allocate the char array member_list_store_d with malloc or global operator new[], either of which will give storage aligned for any type.
Edit: Just read the OP again - you don't want to pay for another pointer. Will read again in the morning.