Implementing concurrent_vector according to intel blog - c++

I am trying to implement a thread-safe lockless container, analogous to std::vector, according to this https://software.intel.com/en-us/blogs/2008/07/24/tbbconcurrent_vector-secrets-of-memory-organization
From what I understood, to prevent re-allocations and invalidating all iterators on all threads, instead of a single contiguous array, they add new contiguous blocks.
Each block they add has a size that is the next power of 2, so they can use log2(index) to find the segment that holds the item at [index].
From what I gather, they have a static array of pointers to segments, so they can quickly access them, however they don't know how many segments the user wants, so they made a small initial one and if the amount of segments exceeds the current count, they allocate a huge one and switch to using that one.
The problem is, adding a new segment can't be done in a lockless thread safe manner or at least I haven't figured out how. I can atomically increment the current size, but only that.
And also switching from the small to the large array of segment pointers involves a big allocation and memory copies, so I can't understand how they are doing it.
They have some code posted online, but the source for all the important functions is not available; they live in the Threading Building Blocks DLL. Here is some code that demonstrates the issue:
template<typename T>
class concurrent_vector
{
private:
int size = 0;
int lastSegmentIndex = 0;
union
{
T* segmentsSmall[3];
T** segmentsLarge;
};
void switch_to_large()
{
//Bunch of allocations, creates a T* segmentsLarge[32] basically and reassigns all old entries into it
}
public:
concurrent_vector()
{
//The initial array is contiguous just for the sake of cache optimization
T* initialContiguousBlock = new T[2 + 4 + 8]; //2^1 + 2^2 + 2^3
segmentsSmall[0] = initialContiguousBlock;
segmentsSmall[1] = initialContiguousBlock + 2;
segmentsSmall[2] = initialContiguousBlock + 2 + 4;
}
void push_back(T& item)
{
if(size > 2 + 4 + 8)
{
switch_to_large(); //This is the problem part, there is no possible way to make this thread-safe without a mutex lock. I don't understand how Intel does it. It includes a bunch of allocations and memory copies.
}
InterlockedIncrement(&size); //Ok, so size is atomically increased
//afterwards adds the item to the appropriate slot in the appropriate segment
}
};

I would not try to make segmentsLarge and segmentsSmall a union. Yes, this wastes one more pointer, but then that pointer (let's just call it segments) can initially point to segmentsSmall.
In return, the other methods can always go through the same pointer, which makes them simpler.
Switching from small to large can then be accomplished by a single compare-exchange on that pointer.
I am not sure how this could be done safely with a union.
The idea would look something like this (note that I used C++11, which the Intel library predates, so they likely did it with their atomic intrinsics).
This probably misses quite a few details which I am sure the Intel people have thought more about, so you will likely have to check this against the implementations of all other methods.
#include <algorithm> // std::copy
#include <atomic>
#include <array>
#include <cstddef>
#include <climits>
template<typename T>
class concurrent_vector
{
private:
std::atomic<size_t> size;
std::atomic<T**> segments;
std::array<T*, 3> segmentsSmall;
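unsigned lastSegmentIndex = 0;
// one slot per bit of size_t is enough for every segment the index type can address
static constexpr unsigned numSegments = sizeof(size_t) * CHAR_BIT;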
void switch_to_large()
{
T** segmentsOld = segments;
if( segmentsOld == segmentsSmall.data()) {
// not yet switched
T** segmentsLarge = new T*[sizeof(size_t) * CHAR_BIT];
// note that we leave the original segment allocations alone and just copy the pointers
std::copy(segmentsSmall.begin(), segmentsSmall.end(), segmentsLarge);
for(unsigned i = segmentsSmall.size(); i < numSegments; ++i) {
segmentsLarge[i] = nullptr;
}
// now both the old and the new segments array are valid
if( segments.compare_exchange_strong(segmentsOld, segmentsLarge)) {
// success!
return;
} else {
// already switched, just clean up
delete[] segmentsLarge;
}
}
}
public:
concurrent_vector() : size(0), segments(segmentsSmall.data())
{
//The initial array is contiguous just for the sake of cache optimization
T* initialContiguousBlock = new T[2 + 4 + 8]; //2^1 + 2^2 + 2^3
segmentsSmall[0] = initialContiguousBlock;
segmentsSmall[1] = initialContiguousBlock + 2;
segmentsSmall[2] = initialContiguousBlock + 2 + 4;
}
void push_back(T& item)
{
if(size >= 2 + 4 + 8) { // the initial block holds 14 items, so the 15th already needs the large table
switch_to_large();
}
// here we may have to allocate more segments atomically
++size;
//afterwards adds the item to the appropriate slot in the appropriate segment
}
};
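To make the "find the segment with a log" part and the "allocate more segments atomically" comment a bit more concrete, here is a rough sketch. It is not Intel's implementation. It assumes the segment sizes from the code above (2, 4, 8, ...), and it sidesteps the small-to-large switch entirely by reserving a fixed table of slots, which is a simplification of my own. The names locate, segment_table, ensure and MaxSegments are made up for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical helper: maps a flat index to (segment, offset) when segment k
// holds 2^(k+1) items, i.e. sizes 2, 4, 8, ... as in the constructor above.
inline std::pair<std::size_t, std::size_t> locate(std::size_t index) {
    const std::size_t shifted = index + 2;  // segment k covers shifted values [2^(k+1), 2^(k+2))
    assert(shifted > index);                // guard against wrap-around
    std::size_t bits = 0;                   // becomes floor(log2(shifted))
    std::size_t v = shifted;
    while ((v >>= 1) != 0) ++bits;
    return {bits - 1, shifted - (std::size_t(1) << bits)};
}

// Hypothetical fixed-size segment table: each slot is filled at most once,
// and the thread that wins the compare-exchange publishes its allocation.
template <typename T, std::size_t MaxSegments = 16>
class segment_table {
    std::atomic<T*> slots_[MaxSegments];
public:
    segment_table() { for (auto& s : slots_) s.store(nullptr); }
    ~segment_table() { for (auto& s : slots_) delete[] s.load(); }
    segment_table(const segment_table&) = delete;
    segment_table& operator=(const segment_table&) = delete;

    // Returns segment k, allocating it if no thread has done so yet.
    T* ensure(std::size_t k) {
        assert(k < MaxSegments);
        T* current = slots_[k].load();
        if (current != nullptr) return current;
        T* fresh = new T[std::size_t(2) << k]();  // 2^(k+1) value-initialized items
        if (slots_[k].compare_exchange_strong(current, fresh)) return fresh;
        delete[] fresh;   // another thread won the race, use its segment
        return current;   // compare_exchange_strong wrote the winner's pointer here
    }

    T& at(std::size_t index) {
        const auto where = locate(index);
        return ensure(where.first)[where.second];
    }
};
```

With this layout index 13 is the last slot of the third segment and index 14 is the first slot of the fourth. A losing thread pays for one wasted allocation, but nobody ever blocks, which is the same trade-off switch_to_large makes above.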

Related

Problems that may arise when initializing arrays on stack inside a function scope with an N size_t parameter?

Say for example I have a function that takes some argument and a size_t length to initialize an array on stack inside a function.
Considering the following:
Strictly the length can only be on the range of 1 to 30 (using a fixed max buffer length of 30 is not allowed).
The array only stays inside the function and is only used to compute a result.
int foo(/*some argument, ..., ... */ size_t length) {
uint64_t array[length];
int some_result = 0;
// some code that uses the array to compute something ...
return some_result;
}
Normally I would use std::vector, new or the *alloc functions for this, but I'm trying to optimize: this function is called repeatedly throughout the lifetime of the program, which makes the heap allocations a large overhead.
Initially I came up with a fixed-size array on the stack as the solution, but I cannot do this, for reasons that I cannot share since it would be rude.
Anyway, I wonder if I can get away with this approach without running into any problems in the future?
In the rare cases where I've done some image processing with large fixed sized temp buffers or just wanted to avoid the runtime for redundant alloc/free calls, I've made my own heap.
It doesn't make a lot of sense for small allocations, where you could just use the stack, but you indicated your instructor said not to do this. So you could try something like this:
template<typename T>
struct ArrayHeap {
unordered_map<size_t, list<shared_ptr<T[]>>> available;
unordered_map<T*, pair<size_t, shared_ptr<T[]>>> inuse; // keyed by T*, not uint64_t*, so the template works for any T
T* Allocate(size_t length) {
auto &l = available[length];
shared_ptr<T[]> ptr;
if (l.size() == 0) {
ptr.reset(new T[length]);
} else {
ptr = l.front();
l.pop_front();
}
inuse[ptr.get()] = {length, ptr};
return ptr.get();
}
void Deallocate(T* allocation) {
auto itor = inuse.find(allocation);
if (itor == inuse.end()) {
// assert
} else {
auto &p = itor->second;
size_t length = p.first;
shared_ptr<T[]> ptr = p.second;
inuse.erase(allocation);
// optional - you can choose not to push the pointer back onto the available list
// if you have some criteria by which you want to reduce memory usage
available[length].push_back(ptr);
}
}
};
With the code above you can Allocate a buffer of a specific length. The first call for a given length incurs the overhead of new, but once that buffer has been returned to the heap, the next allocation of the same length is fast because the buffer is reused.
Then your function can be implemented like this:
ArrayHeap<uint64_t> global_heap;
int foo(/*some argument, ..., ... */ size_t length) {
uint64_t* array = global_heap.Allocate(length);
int some_result = 0;
// some code that uses the array to compute something ...
global_heap.Deallocate(array);
return some_result;
}
Personally I would use a fixed-size array on the stack, but if there are reasons that prohibit it, check whether any of them also rule out alloca():
man 3 alloca
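For comparison, this is roughly what the fixed-size stack array would look like. It is only a sketch: the function sum_of_squares and the constant kMaxLength are made up for illustration, and the bound of 30 is taken from the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxLength = 30;  // upper bound stated in the question

// Hypothetical stand-in for foo(): fills a scratch buffer and sums it.
inline std::uint64_t sum_of_squares(std::size_t length) {
    assert(length >= 1 && length <= kMaxLength);
    std::array<std::uint64_t, kMaxLength> scratch;  // 30 * 8 bytes on the stack, no heap call
    for (std::size_t i = 0; i < length; ++i) {
        scratch[i] = static_cast<std::uint64_t>(i) * i;
    }
    std::uint64_t total = 0;
    for (std::size_t i = 0; i < length; ++i) {
        total += scratch[i];  // only the first `length` slots are ever read
    }
    return total;
}
```

The buffer is always sized for the worst case, so the only cost of a short call is some unused stack space.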

Memory layout : 2D N*M data as pointer to N*M buffer or as array of N pointers to arrays

I'm hesitating on how to organize the memory layout of my 2D data.
Basically, what I want is an N*M 2D double array, where N ~ M are in the thousands (and are derived from user-supplied data)
The way I see it, I have 2 choices :
double *data = new double[N*M];
or
double **data = new double*[N];
for (size_t i = 0; i < N; ++i)
data[i] = new double[M];
The first choice is what I'm leaning to.
The main advantages I see are shorter new/delete syntax, a contiguous memory layout that gives adjacent memory accesses at runtime if I arrange my access pattern correctly, and possibly better performance for vectorized code (auto-vectorized or using vector libraries such as vDSP or vecLib).
On the other hand, it seems to me that allocating one big chunk of contiguous memory could fail or take more time compared to allocating a bunch of smaller ones. The second method also has the advantage of the shorter syntax data[i][j] compared to data[i*M+j].
What would be the most common / better way to do this, mainly from a performance standpoint? (Even though the improvements are going to be small, I'm curious to see which performs better.)
Between the first two choices, for reasonable values of M and N, I would almost certainly go with choice 1. You skip a pointer dereference, and you get nice caching if you access data in the right order.
In terms of your concerns about size, we can do some back-of-the-envelope calculations.
Since M and N are in the thousands, suppose each is 10000 as an upper bound. Then your total memory consumed is
10000 * 10000 * sizeof(double) = 8 * 10^8
This is roughly 800 MB, which while large, is quite reasonable given the size of memory in modern day machines.
If N and M are constants, it is better to just statically declare the memory you need as a two dimensional array. Or, you could use std::array.
std::array<std::array<double, M>, N> data;
If only M is a constant, you could use a std::vector of std::array instead.
std::vector<std::array<double, M>> data(N);
If M is not constant, you need to perform some dynamic allocation. But, std::vector can be used to manage that memory for you, so you can create a simple wrapper around it. The wrapper below returns a row intermediate object to allow the second [] operator to actually compute the offset into the vector.
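template <typename T>
class matrix {
const size_t N;
const size_t M;
std::vector<T> v_;
// proxy for one row of a mutable matrix
struct row {
matrix &m_;
const size_t r_;
row (matrix &m, size_t r) : m_(m), r_(r) {}
T & operator [] (size_t c) { return m_.v_[r_ * m_.M + c]; }
};
// proxy for one row of a const matrix (a row cannot bind to a const matrix)
struct const_row {
const matrix &m_;
const size_t r_;
const_row (const matrix &m, size_t r) : m_(m), r_(r) {}
const T & operator [] (size_t c) const { return m_.v_[r_ * m_.M + c]; }
};
public:
matrix (size_t n, size_t m) : N(n), M(m), v_(N*M) {}
row operator [] (size_t r) { return row(*this, r); }
// returned by value: returning a reference to the temporary proxy would dangle
const_row operator [] (size_t r) const { return const_row(*this, r); }
};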
matrix<double> data(10,20);
data[1][2] = .5;
std::cout << data[1][2] << '\n';
In addressing your particular concern about performance: your rationale for wanting a single memory access is correct. You should avoid doing new and delete yourself, however (which this wrapper takes care of), and if the data is more naturally interpreted as multi-dimensional, then showing that in the code makes it easier to read as well.
Multiple allocations, as in your second technique, are inferior because they take more time, but they may succeed more often if memory is fragmented (the free memory consists of smaller holes, and there is no free chunk large enough to satisfy the single allocation request). Multiple allocations have another downside: more memory is needed for the pointers to each row.
My suggestion provides the single-allocation technique without needing to call new and delete explicitly, as the memory is managed by vector. At the same time, it allows the data to be addressed with the two-dimensional syntax [x][y]. So it provides the benefits of a single allocation together with the syntax of the multi-allocation approach, provided you have enough memory to fulfill the allocation request.
Consider using something like the following:
// array of pointers to doubles to point the beginning of rows
double ** data = new double*[N];
// allocate so many doubles to the first row, that it is long enough to feed them all
data[0] = new double[N * M];
// distribute pointers to individual rows as well
for (size_t i = 1; i < N; i++)
data[i] = data[0] + i * M;
I'm not sure whether this is common practice; I just came up with it. Some downsides still apply to this approach, but I think it eliminates most of them, and you can still access the individual doubles as data[i][j].
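The same layout can also be had without any manual delete[] by letting two vectors own the memory. This is just a sketch of that variation; the name RowView2D and its members are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical wrapper: one contiguous block of n*m doubles plus a table
// of row pointers into it, so data[i][j] works and cleanup is automatic.
struct RowView2D {
    std::vector<double> storage;  // all n*m values, row after row
    std::vector<double*> rows;    // rows[i] points at the start of row i

    RowView2D(std::size_t n, std::size_t m) : storage(n * m), rows(n) {
        for (std::size_t i = 0; i < n; ++i) {
            rows[i] = storage.data() + i * m;
        }
    }

    // Copying would leave the copy's row pointers aimed at the original block.
    RowView2D(const RowView2D&) = delete;
    RowView2D& operator=(const RowView2D&) = delete;

    double* operator[](std::size_t i) { return rows[i]; }
};
```

Compared with the wrapper-class approach, this still spends N extra pointers and one extra indirection per access, which is the remaining downside mentioned above.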

Implementing incremental array in C++

I want to implement an array that can increment as new values are added. Just like in Java. I don't have any idea of how to do this. Can anyone give me a way ?
This is done for learning purposes, thus I cannot use std::vector.
Here's a starting point: you only need three variables, nelems, capacity and a pointer to the actual array. So, your class would start off as
class dyn_array
{
T *data;
size_t nelems, capacity;
};
where T is the type of data you want to store; for extra credit, make this a template class. Now implement the algorithms discussed in your textbook or on the Wikipedia page on dynamic arrays.
Note that the new/delete allocation mechanism does not support growing an array like C's realloc does, so you'll actually be moving data's contents around when growing the capacity.
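A bare-bones version of that idea might look like the sketch below. It assumes T is default-constructible and copy-assignable, doubles the capacity when full (one common growth policy, not the only one), and leaves out everything a real container needs (copying, erase, iterators).

```cpp
#include <algorithm>
#include <cstddef>

template <typename T>
class dyn_array {
    T* data_ = nullptr;
    std::size_t nelems_ = 0;
    std::size_t capacity_ = 0;
public:
    dyn_array() = default;
    dyn_array(const dyn_array&) = delete;
    dyn_array& operator=(const dyn_array&) = delete;
    ~dyn_array() { delete[] data_; }

    // Taken by value so that push_back(a[0]) stays valid across a regrow.
    void push_back(T value) {
        if (nelems_ == capacity_) grow();
        data_[nelems_++] = value;
    }
    std::size_t size() const { return nelems_; }
    std::size_t capacity() const { return capacity_; }
    T& operator[](std::size_t i) { return data_[i]; }

private:
    // new[] cannot extend a block in place, so allocate, copy, then release.
    void grow() {
        const std::size_t bigger = capacity_ ? capacity_ * 2 : 1;
        T* fresh = new T[bigger];
        try {
            std::copy(data_, data_ + nelems_, fresh);
        } catch (...) {
            delete[] fresh;  // keep the old buffer intact if a copy throws
            throw;
        }
        delete[] data_;
        data_ = fresh;
        capacity_ = bigger;
    }
};
```

Because the capacity doubles, pushing n elements costs O(n) copies in total even though an individual push_back occasionally copies everything.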
I would like to take the opportunity to interest you in an interesting but somewhat difficult topic: exceptions.
If you start allocating memory yourself and subsequently playing with raw pointers, you will find yourself in the difficult position of avoiding memory leaks.
Even if you entrust the book-keeping of the memory to the right class (say std::unique_ptr<char[]>), you still have to ensure that operations that change the object leave it in a consistent state should they fail.
For example, here is a simple class with an incorrect resize method (which is at the heart of most code):
template <typename T>
class DynamicArray {
public:
// Constructor
DynamicArray(): size(0), capacity(0), buffer(0) {}
// Destructor
~DynamicArray() {
if (buffer == 0) { return; }
for(size_t i = 0; i != size; ++i) {
T* t = buffer + i;
t->~T();
}
free(buffer); // using delete[] would require all objects to be built
}
private:
size_t size;
size_t capacity;
T* buffer;
};
Okay, so that's the easy part (although already a bit tricky).
Now, how do you push a new element at the end ?
template <typename T>
void DynamicArray<T>::resize(size_t n) {
// The *easy* case
if (n <= size) {
for (; n < size; ++n) {
(buffer + n)->~T();
}
size = n;
return;
}
// The *hard* case
// new size
size_t const oldsize = size;
size = n;
// new capacity
if (capacity == 0) { capacity = 1; }
while (capacity < n) { capacity *= 2; }
// new buffer (copied)
T* newbuffer = (T*)malloc(capacity*sizeof(T)); // declared outside the try block so the handler can see it
try {
// copy
for (size_t i = 0; i != oldsize; ++i) {
new (newbuffer + i) T(*(buffer + i));
}
free(buffer);
buffer = newbuffer;
} catch(...) {
free(newbuffer);
throw;
}
}
Feels right no ?
I mean, we even take care of a possible exception raised by T's copy constructor! yeah!
Do note the subtle issue we have though: if an exception is thrown, we have changed the size and capacity members but still have the old buffer.
The fix is obvious, of course: we should first change the buffer, and then the size and capacity. Of course...
But it is "difficult" to get it right.
I would recommend using an alternative approach: create an immutable array class (the capacity should be immutable, not the rest), and implement an exception-less swap method.
Then, you'll be able to implement the "transaction-like" semantics much more easily.
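Here is a rough sketch of what I mean, under a few assumptions of mine: the capacity is fixed at construction, swap only exchanges three members (so it cannot throw), and growing is done by filling a brand new array and swapping it in at the very end. The helper name append_unchecked is made up.

```cpp
#include <cstddef>
#include <new>
#include <utility>

template <typename T>
class DynamicArray {
public:
    DynamicArray() = default;
    // The capacity never changes after construction.
    explicit DynamicArray(std::size_t capacity)
        : capacity_(capacity),
          buffer_(capacity ? static_cast<T*>(::operator new(capacity * sizeof(T)))
                           : nullptr) {}
    DynamicArray(const DynamicArray&) = delete;
    DynamicArray& operator=(const DynamicArray&) = delete;
    ~DynamicArray() {
        for (std::size_t i = 0; i != size_; ++i) buffer_[i].~T();
        ::operator delete(buffer_);
    }

    void swap(DynamicArray& other) noexcept {
        std::swap(size_, other.size_);
        std::swap(capacity_, other.capacity_);
        std::swap(buffer_, other.buffer_);
    }

    void push_back(const T& value) {
        if (size_ < capacity_) { append_unchecked(value); return; }
        // Build the replacement on the side. If any copy throws, `bigger`
        // cleans up after itself and *this has not been touched.
        DynamicArray bigger(capacity_ ? capacity_ * 2 : 1);
        for (std::size_t i = 0; i != size_; ++i) bigger.append_unchecked(buffer_[i]);
        bigger.append_unchecked(value);
        swap(bigger);  // commit: the old buffer is released when `bigger` dies
    }

    std::size_t size() const { return size_; }
    const T& operator[](std::size_t i) const { return buffer_[i]; }

private:
    // Precondition: size_ < capacity_. size_ only grows once construction succeeded.
    void append_unchecked(const T& value) {
        new (buffer_ + size_) T(value);
        ++size_;
    }

    std::size_t size_ = 0;
    std::size_t capacity_ = 0;
    T* buffer_ = nullptr;
};
```

Every step that can throw happens on the temporary, and the only step that modifies the visible object is the swap, which is where the "transaction" commits.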
An array that grows dynamically as elements are added is called a dynamic array or growable array, and here is a complete implementation of a dynamic array.
In C and C++, array notation is basically just shorthand for pointer arithmetic.
So in this example:
int fib [] = { 1, 1, 2, 3, 5, 8, 13};
This:
int position5 = fib[5];
Is the same thing as saying this:
int position5 = *(fib + 5); // i.e. the int stored 5 * sizeof(int) bytes past the start of fib
So in expressions like this, an array behaves like a pointer to its first element.
So if you want to auto allocate you will need to write some wrapper functions to call malloc() or new, ( C and C++ respectively).
Although you might find vectors are what you are looking for...
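If you want to convince yourself of the equivalence, here is a small sketch that reads an element both ways: once with typed pointer arithmetic and once by stepping over raw bytes. The function names are made up for illustration.

```cpp
#include <cstddef>

// Typed pointer arithmetic: base + i already advances by i * sizeof(int) bytes.
inline int via_pointer(const int* base, std::size_t i) {
    return *(base + i);
}

// The same address computed by hand in bytes, then read back as an int.
inline int via_bytes(const int* base, std::size_t i) {
    const char* bytes = reinterpret_cast<const char*>(base);
    return *reinterpret_cast<const int*>(bytes + i * sizeof(int));
}
```

Both functions return the same value as base[i]; the subscript operator is just the first form written more conveniently.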

delete [] performance issues

I wrote a program, which computes the flow shop scheduling problem.
I need help with optimizing the slowest parts of my program:
First, there is the 2D array allocation:
this->_perm = new Chromosome*[f];
//... for (...)
this->_perm[i] = new Chromosome[fM1];
It works just fine, but a problem occurs, when I try to delete array:
delete [] _perm[i];
The line above takes extremely long to execute. Chromosome is an array of about 300k elements: allocating it takes less than a second, but deleting it takes far more than a minute.
I would appreciate any suggestions of improving delete part.
On a general note, you should never manually manage memory in C++. This will lead to leaks, double-deletions and all kinds of nasty inconveniences. Use proper resource-handling classes for this. For example, std::vector is what you should use for managing a dynamically allocated array.
To get back to your problem at hand, you first need to know what delete [] _perm[i] does: It calls the destructor for every Chromosome object in that array and then frees the memory. Now you do this in a loop, which means this will call all Chromosome destructors and perform f deallocations. As was already mentioned in a comment to your question, it is very likely that the Chromosome destructor is the actual culprit. Try to investigate that.
You can, however, change your memory handling to improve the speed of allocation and deallocation. As Nawaz has shown, you could allocate one big chunk of memory and use that. I'd use a std::vector for a buffer:
void f(std::size_t row, std::size_t col)
{
int sizeMemory = sizeof(Chromosome) * row * col;
std::vector<unsigned char> buffer(sizeMemory); //allocation of memory at once!
std::vector<Chromosome*> chromosomes(row);
// use algorithm as shown by Nawaz
std::size_t j = 0 ;
for(std::size_t i = 0 ; i < row ; i++ )
{
//...
}
make_baby(chromosomes); //use chromosomes
in_place_destruct(chromosomes.begin(), chromosomes.end());
// automatic freeing of memory holding pointers in chromosomes
// automatic freeing of buffer memory
}
template< typename InpIt >
void in_place_destruct(InpIt begin, InpIt end)
{
typedef typename std::iterator_traits<InpIt>::value_type value_type; // to call dtor (typename is required for the dependent type)
while(begin != end)
(begin++)->~value_type(); // call dtor
}
However, despite handling all memory through std::vector this still is not fully exception-safe, as it needs to call the Chromosome destructors explicitly. (If make_baby() throws an exception, the function f() will be aborted early. While the destructors of the vectors will delete their content, one only contains pointers, and the other treats its content as raw memory. No guard is watching over the actual objects created in that raw memory.)
The best solution I can see is to use a one-dimensional array wrapped in a class that allows two-dimensional access to the elements in that array. (Memory is one-dimensional, after all, on current hardware, so the system is already doing this.) Here's a sketch of that:
class chromosome_matrix {
public:
chromosome_matrix(std::size_t row, std::size_t col)
: row_(row), col_(col), data_(row*col)
{
// data_ contains row*col constructed Chromosome objects
}
// not needed, compiler generated dtor will do the right thing
//~chromosome_matrix()
// these rely on pointer arithmetic to access a row
Chromosome* operator[](std::size_t row) {return &data_[row*col_];}
const Chromosome* operator[](std::size_t row) const {return &data_[row*col_];}
private:
std::size_t row_;
std::size_t col_;
std::vector<Chromosome> data_;
};
void f(std::size_t row, std::size_t col)
{
chromosome_matrix cm(row, col);
Chromosome* row0 = cm[0]; // get a whole row
Chromosome& chromosome1 = row0[0]; // get one object
Chromosome& chromosome2 = cm[1][2]; // access object directly
// make baby
}
Check your destructors.
If you were allocating a built-in type (e.g. an int), then allocating 300,000 of them would be more expensive than the corresponding delete. But that's a relative term; 300k allocated in a single block is pretty fast.
As you're allocating 300k Chromosomes, the allocator has to allocate 300k * sizeof the Chromosome object, and as you say it's fast. I can't see it doing much besides just that (i.e. the constructor calls are optimised into nothingness).
However, when you come to delete, it not only frees all that memory but also calls the destructor for each object. If the delete is slow, I would guess that the destructor for each object takes a small but noticeable time, which adds up when you have 300k of them.
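A quick way to see the per-element destructor calls is to count them. The type Tracked below is a made-up stand-in for Chromosome, not your actual class.

```cpp
#include <cstddef>

// Hypothetical stand-in for Chromosome that counts how often it is destroyed.
struct Tracked {
    static std::size_t destroyed;
    ~Tracked() { ++destroyed; }
};
std::size_t Tracked::destroyed = 0;

// Allocates n objects in one block, deletes them, and reports how many
// destructor calls that single delete[] triggered.
inline std::size_t count_destructor_calls(std::size_t n) {
    Tracked::destroyed = 0;
    Tracked* items = new Tracked[n];
    delete[] items;  // one deallocation, but n destructor calls
    return Tracked::destroyed;
}
```

If each of those calls does real work (freeing its own members, for example), that work is multiplied by the element count, so timing the destructor is the first thing to measure.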
I would suggest you to use placement new. The allocation and deallocation can be done just in one statement each!
int sizeMemory = sizeof(Chromosome) * row * col;
char* buffer = new char[sizeMemory]; //allocation of memory at once!
vector<Chromosome*> chromosomes;
chromosomes.reserve(row);
int j = 0 ;
for(int i = 0 ; i < row ; i++ )
{
//only construction of object. No allocation!
Chromosome *pChromosome = new (&buffer[j]) Chromosome[col];
chromosomes.push_back(pChromosome);
j = j+ sizeof(Chromosome) * col;
}
for(int i = 0 ; i < row ; i++ )
{
for(int j = 0 ; j < col ; j++ )
{
//only destruction of object. No deallocation!
chromosomes[i][j].~Chromosome();
}
}
delete [] buffer; //actual deallocation of memory at once!
std::vector can help.
Special memory allocators too.

Nested STL vector using way too much memory

I have an STL vector My_Partition_Vector of Partition objects, defined as
struct Partition // the event log data structure
{
int key;
std::vector<std::vector<char> > partitions;
float modularity;
};
The actual nested structure of Partition.partitions varies from object to object, but the total number of chars stored in Partition.partitions is always 16.
I assumed therefore that the total size of the object should be more or less 24 bytes (16 + 4 + 4). However, for every 100,000 items I add to My_Partition_Vector, memory consumption (found using ps -aux) increases by around 20 MB, indicating around 209 bytes for each Partition object.
This is a nearly 9-fold increase!? Where is all this extra memory usage coming from? Some kind of padding in the STL vector, or the struct? How can I resolve this (and stop it reaching into swap)?
For one thing, std::vector models a dynamic array, so if you know that you'll always have 16 chars in partitions, using std::vector is overkill. Use a good old C style array/matrix, boost::array or boost::multi_array.
To reduce the number of re-allocations needed when inserting or adding elements (a consequence of its contiguous memory layout), std::vector is allowed to preallocate memory for a certain number of elements up front (its capacity() member function will tell you how many).
While I think he may be overstating the situation just a tad, I'm in general agreement with DeadMG's conclusion that what you're doing is asking for trouble.
Although I'm generally the one looking at (whatever mess somebody has made) and saying "don't do that, just use a vector", this case might well be an exception. You're creating a huge number of objects that should be tiny. Unfortunately, a vector typically looks something like this:
template <class T>
class vector {
T *data;
size_t allocated;
size_t valid;
public:
// ...
};
On a typical 32-bit machine, that's twelve bytes already. Since you're using a vector<vector<char> >, you're going to have 12 bytes for the outer vector, plus twelve more for each vector it holds. Then, when you actually store any data in your vectors, each of those needs to allocate a block of memory from the free store. Depending on how your free store is implemented, you'll typically have a minimum block size -- frequently 32 or even 64 bytes. Worse, the heap typically has some overhead of its own, so it'll add some more memory onto each block, for its own book-keeping (e.g., it might use a linked list of blocks, adding another pointer worth of data to each allocation).
Just for grins, let's assume you average four vectors of four bytes apiece, and that your heap manager has a 32-byte minimum block size and one extra pointer (or int) for its bookkeeping (giving a real minimum of 36 bytes per block). Multiplying that out, I get 204 bytes apiece -- close enough to your 209 to believe that's reasonably close to what you're dealing with.
The question at that point is how to deal with the problem. One possibility is to try to work behind the scenes. All the containers in the standard library use allocators to get their memory. While the default allocator gets memory directly from the free store, you can substitute a different one if you choose. If you do some looking around, you can find any number of alternative allocators, many/most of which are to help with exactly the situation you're in -- reducing wasted memory when allocating lots of small objects. A couple to look at would be the Boost Pool Allocator and the Loki small object allocator.
Another possibility (that can be combined with the first) would be to quit using a vector<vector<char> > at all, and replace it with something like:
char partitions[16];
struct parts {
int part0 : 4;
int part1 : 4;
int part2 : 4;
int part3 : 4;
int part4 : 4;
int part5 : 4;
int part6 : 4;
int part7 : 4;
};
For the moment, I'm assuming a maximum of 8 partitions -- if it could be 16, you can add more to parts. This should probably reduce memory usage quite a bit more, but (as-is) will affect your other code. You could also wrap this up into a small class of its own that provides 2D-style addressing to minimize impact on the rest of your code.
If you store a near constant amount of objects, then I suggest to use a 2-dimensional array.
The most likely reason for the memory consumption is debug data. STL implementations usually store A LOT of debug data. Never profile an application with debug flags on.
...This is a bit of a side conversation, but boost::multi_array was suggested as an alternative to the OP's use of nested vectors. My finding was that multi_array was using a similar amount of memory when applied to the OP's operating parameters.
I derived this code from the example at Boost.MultiArray. On my machine, this showed multi_array using about 10x more memory than ideally required assuming that the 16 bytes are arranged in a simple rectangular geometry.
To evaluate the memory usage, I checked the system monitor while the program was running and I compiled with
( export CXXFLAGS="-Wall -DNDEBUG -O3" ; make main && ./main )
Here's the code...
#include <iostream>
#include <vector>
#include "boost/multi_array.hpp"
#include <tr1/array>
#include <cassert>
#define USE_CUSTOM_ARRAY 0 // compare memory usage of my custom array vs. boost::multi_array
using std::cerr;
using std::vector;
#if USE_CUSTOM_ARRAY // #if rather than #ifdef, so that a value of 0 really disables the custom array
template< typename T, int YSIZE, int XSIZE >
class array_2D
{
std::tr1::array<T,YSIZE*XSIZE> data; // element type follows T instead of being hard-coded to char
public:
T & operator () ( int y, int x ) { return data[y*XSIZE+x]; } // preferred accessor (avoid pointers)
T * operator [] ( int index ) { return &data[index*XSIZE]; } // alternative accessor (mimics boost::multi_array syntax)
};
#endif
int main ()
{
int COUNT = 1024*1024;
#if USE_CUSTOM_ARRAY
vector< array_2D<char,4,4> > A( COUNT );
typedef int index;
#else
typedef boost::multi_array<char,2> array_type;
typedef array_type::index index;
vector<array_type> A( COUNT, array_type(boost::extents[4][4]) );
#endif
// Assign values to the elements
int values = 0;
for ( int n=0; n<COUNT; n++ )
for(index i = 0; i != 4; ++i)
for(index j = 0; j != 4; ++j)
A[n][i][j] = values++;
// Verify values
int verify = 0;
for ( int n=0; n<COUNT; n++ )
for(index i = 0; i != 4; ++i)
for(index j = 0; j != 4; ++j)
{
assert( A[n][i][j] == (char)((verify++)&0xFF) );
#if USE_CUSTOM_ARRAY
assert( A[n][i][j] == A[n](i,j) ); // testing accessors
#endif
}
cerr <<"spinning...\n";
while ( 1 ) {} // wait here (so you can check memory usage in the system monitor)
return 0;
}
On my system, sizeof(vector) is 24. This probably corresponds to 3 8-byte members: capacity, size, and pointer. Additionally, you need to consider the actual allocations which would be between 1 and 16 bytes (plus allocation overhead) for the inner vector and between 24 and 384 bytes for the outer vector ( sizeof(vector) * partitions.capacity() ).
I wrote a program to sum this up...
for ( int Y=1; Y<=16; Y++ )
{
const int X = 16/Y;
if ( X*Y != 16 ) continue; // ignore imperfect geometries
Partition a;
a.partitions = vector< vector<char> >( Y, vector<char>(X) );
int sum = sizeof(a); // main structure
sum += sizeof(vector<char>) * a.partitions.capacity(); // outer vector
for ( int i=0; i<(int)a.partitions.size(); i++ )
sum += sizeof(char) * a.partitions[i].capacity(); // inner vector
cerr <<"X="<<X<<", Y="<<Y<<", size = "<<sum<<"\n";
}
The results show how much memory (not including allocation overhead) is needed for each simple geometry...
X=16, Y=1, size = 80
X=8, Y=2, size = 104
X=4, Y=4, size = 152
X=2, Y=8, size = 248
X=1, Y=16, size = 440
Look at how "sum" is calculated to see what all of the components are.
The results posted are based on my 64-bit architecture. If you have a 32-bit architecture the sizes would be almost half as much -- but still a lot more than what you had expected.
In conclusion, std::vector<> is not very space efficient for doing a whole bunch of very small allocations. If your application is required to be efficient, then you should use a different container.
My approach to solving this would probably be to allocate the 16 chars with
std::tr1::array<char,16>
and wrap that with a custom class that maps 2D coordinates onto the array allocation.
Below is a very crude way of doing this, just as an example to get you started. You would have to change this to meet your specific needs -- especially the ability to specify the geometry dynamically.
template< typename T, int YSIZE, int XSIZE >
class array_2D
{
std::tr1::array<T,YSIZE*XSIZE> data; // element type follows T instead of being hard-coded to char
public:
T & operator () ( int y, int x ) { return data[y*XSIZE+x]; } // preferred accessor (avoid pointers)
T * operator [] ( int index ) { return &data[index*XSIZE]; } // alternative accessor (mimics boost::multi_array syntax)
};
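To show what this buys you for the struct from the question, here is a sketch with the 16 chars stored inline. It uses std::array (the C++11 successor of std::tr1::array) and assumes a fixed 4x4 geometry, which is my simplification; the names FlatPartition, cells and at are made up.

```cpp
#include <array>
#include <cstddef>

// Hypothetical flat replacement for Partition: no heap allocations at all.
struct FlatPartition {
    int key;
    std::array<char, 16> cells;  // assumed 4x4 grid, stored row by row
    float modularity;

    // 2D-style accessor mapping (y, x) onto the flat storage.
    char& at(std::size_t y, std::size_t x) { return cells[y * 4 + x]; }
};
```

On typical platforms sizeof(FlatPartition) comes out at the 24 bytes the question originally expected, because the data lives inside the object rather than behind vector headers and separate heap blocks.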
16 bytes is a complete and total waste. You're storing a hell of a lot of data about very small objects. A vector of vector is the wrong solution to use. You should log sizeof(vector) - it's not insignificant, as it performs a substantial function. On my compiler, sizeof(vector) is 20. So each Partition is 4 + 4 + 16 + 20 + 20*number of inner partitions + memory overheads like the vectors not being the perfect size.
You're only storing 16 bytes of data, and wasting ridiculous amounts of memory allocating them in the most segregated, highest overhead way you could possibly think of. The vector doesn't use a lot of memory - you have a terrible design.